JP2007010995A

JP2007010995A - Speaker recognition method

Info

Publication number: JP2007010995A
Application number: JP2005191892A
Authority: JP
Inventors: Takehiko Kawahara; 毅彦川▲原▼
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2005-06-30
Filing date: 2005-06-30
Publication date: 2007-01-18
Anticipated expiration: 2025-06-30
Also published as: JP4254753B2

Abstract

PROBLEM TO BE SOLVED: To reduce the risk of recognition errors by registering the original features of a registrant's voice.

SOLUTION: The registrant inputs his or her identifier and then pronounces a prescribed word a plurality of times. The registrant's voice is input to a voice input part 106 and converted to voice data. A CPU 102 extracts a feature quantity of the voice from the voice data for each pronunciation and stores it in a storage part 105. Among the feature quantities stored in the storage part 105, any feature quantity whose distance from the other feature quantities stored together with it is equal to or longer than a prescribed value is erased. The CPU 102 then obtains the average of the feature quantities remaining in the storage part 105 and stores it in the storage part 105 as the feature quantity of the speaker's voice, in association with the input identifier.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a technique for recognizing an individual by voice.

Among the techniques that recognize an individual from personal characteristics are techniques that recognize an individual by voice (see, for example, Patent Document 1 and Non-Patent Document 1). To recognize an individual by voice, feature quantities such as the cepstrum are first extracted from the individual's voice and registered. At recognition time, the similarity between the registered feature quantity and the feature quantity of the voice uttered by the speaker is obtained, and whether the speaker is the registered person is determined by checking whether this similarity is above or below a certain threshold.
Patent Document 1: JP-A-9-127973
Non-Patent Document 1: Tomoko Matsui, "Speaker Recognition by HMM", IEICE Technical Report, The Institute of Electronics, Information and Communication Engineers, January 1996, pp. 17-24

When a person speaks, the same word can sound different each time it is uttered. Consequently, when a feature quantity is extracted from a speaker's voice and registered, a feature quantity that differs from the speaker's average feature quantity may be registered. If that happens, the registered individual is more likely not to be recognized as the registered person even when he or she speaks, and individuals can no longer be recognized accurately.

The present invention has been made against this background, and its object is to provide a technique that registers the original features of a registrant's voice and reduces the risk of recognition errors.

To solve the above problem, the present invention provides a speaker recognition method that recognizes the speaker of an input voice as a registrant when the distance between the feature quantity of the input voice and the feature quantity stored, in association with an input identifier, in a storage unit that stores voice feature quantities is equal to or less than a predetermined threshold, the method comprising: an identifier input step in which an identifier uniquely identifying a speaker is input; a voice input step in which the speaker's voice is input a plurality of times; a multiple feature quantity storage step of obtaining a feature quantity for each of the voices input in the voice input step and storing the obtained feature quantities in the storage unit; an erasing step of erasing from the storage unit, among the feature quantities stored by the multiple feature quantity storage step, any feature quantity whose distance from the other feature quantities stored together with it is equal to or greater than a predetermined value; an average feature quantity calculation step, performed after the erasing step, of obtaining the average of the feature quantities stored in the storage unit by the multiple feature quantity storage step; and a feature quantity storage step of storing the average obtained in the average feature quantity calculation step in the storage unit as the feature quantity of the speaker's voice, in association with the identifier input in the identifier input step.

In the present invention, a threshold calculation step may be provided after the erasing step, in which, for each feature quantity stored in the storage unit by the multiple feature quantity storage step, the distance to the average of the other feature quantities stored together with it is obtained, and the maximum of the obtained distances is taken as the threshold. In that case, the feature quantity storage step stores in the storage unit, in association with one another, the identifier input in the identifier input step, the average obtained in the average feature quantity calculation step, and the threshold obtained in the threshold calculation step.

According to the present invention, the original features of the registrant's voice are stored and the speaker is recognized using the stored feature quantity, so the risk of recognition errors is reduced.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[Configuration of the embodiment]
FIG. 1 is a block diagram showing the hardware configuration of the main part of the voice verification apparatus according to the present embodiment. As shown in FIG. 1, each unit of the voice verification apparatus is connected to a bus 101 and exchanges data with the other units via the bus 101.

The voice input unit 106 includes a microphone (not shown) and generates voice data representing the voice input to the microphone. The display unit 108 includes a display device such as a liquid crystal display and, under the control of the CPU 102, displays characters, graphics, and the like. The information input unit 107 includes input devices such as a keyboard and a mouse (neither shown) and outputs to the CPU 102 a signal corresponding to the operation performed, such as a key press or a mouse operation.

The storage unit 105 includes, for example, a hard disk device (not shown) as a device that stores data persistently. The hard disk device provides a buffer area A1 and a registration area A2 as areas for storing various data. The buffer area A1 is used as temporary storage for data the CPU 102 uses during processing. The registration area A2 stores the data the CPU 102 uses when recognizing a speaker. FIG. 2 illustrates the format of the registration table TB1 stored in the registration area A2. The registration table TB1 has an "identifier" field, a "threshold" field, and a "feature quantity" field. The "feature quantity" field stores data representing a voice feature quantity, the "identifier" field stores an identifier that uniquely identifies an individual, and the "threshold" field stores the threshold used to determine whether a speaker is a registered individual.
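The layout of the registration table TB1 could be modelled roughly as follows. This is a minimal sketch, not the storage format of the embodiment: the class name `Registration`, the dictionary keyed by identifier, the `lookup` helper, and the choice of a list of floats for the feature quantity are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Registration:
    """One row of registration table TB1 (the three fields named in the text)."""
    identifier: str        # uniquely identifies an individual
    threshold: float       # threshold used at verification
    feature: List[float]   # registered voice feature quantity


# Hypothetical in-memory stand-in for registration area A2.
RegistrationTable = Dict[str, Registration]


def lookup(table: RegistrationTable, identifier: str) -> Optional[Registration]:
    """Return the row stored for `identifier`, or None if it is not registered."""
    return table.get(identifier)
```

Keying the table by identifier keeps at most one row per individual, which matches the "uniquely identifies" wording of the identifier field.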

The ROM 103 stores a control program, and the CPU 102 controls each unit according to the control program stored in the ROM 103. FIG. 3 is a functional block diagram showing the functional configuration of the processing performed by the CPU 102. The units shown in FIG. 3 are realized when the CPU 102 executes the control program.

Each functional block shown in FIG. 3 is described below. An identifier that uniquely identifies the speaker is input to the information input unit 107. The input identifier is sent to the information acquisition unit 50 when determining whether the speaker is a registered individual (hereinafter, "at verification"), and to the information creation unit 40 when registering the feature quantity of the speaker's voice (hereinafter, "at registration").

The voice input to the voice input unit 106 is converted into voice data and sent to the utterance section extraction unit 10. On receiving the voice data, the utterance section extraction unit 10 extracts only the speaker's voice as the utterance section, generates voice data from which silent portions and non-voice sounds have been removed, and sends it to the feature quantity extraction unit 20. As described later, at registration the user pronounces the same word a plurality of times; the utterance section is extracted for each pronunciation, and the resulting voice data is sent to the feature quantity extraction unit 20.
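The source does not say how the utterance section is detected. One common and very simple approach is an energy gate, sketched below purely as an assumption: the function name `extract_utterance`, the frame length, and the energy floor are hypothetical, and this sketch only drops quiet frames; it does not attempt to reject loud non-voice sounds.

```python
from typing import List


def extract_utterance(samples: List[float], frame_len: int = 160,
                      energy_floor: float = 1e-3) -> List[float]:
    """Keep only the frames whose mean energy reaches `energy_floor`."""
    kept: List[float] = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        # Mean squared amplitude of the frame (the last frame may be shorter).
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= energy_floor:
            kept.extend(frame)
    return kept
```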

The feature quantity extraction unit 20 extracts a voice feature quantity from the voice represented by the received voice data and generates feature quantity data V representing it. The cepstrum is a well-known feature quantity, but the feature quantity is not limited to the cepstrum and may be of another type. The feature quantity data V generated by the feature quantity extraction unit 20 is stored in the buffer area A1 of the storage unit 105 for each pronunciation at registration, and is sent to the feature quantity comparison unit 60 at verification.
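For readers unfamiliar with the cepstrum, the sketch below computes a real cepstrum (the inverse transform of the log magnitude spectrum) for a single frame using only the standard library. It is an illustration under assumptions, not the extraction procedure of the embodiment: the helper names `dft` and `real_cepstrum`, the naive O(n²) transform, and the small `eps` added before the logarithm are all choices made here, and the source does not state how per-frame values would be combined into one feature quantity data V per utterance.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive discrete Fourier transform (O(n^2)); fine for short frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(s / n if inverse else s)
    return out


def real_cepstrum(frame: Sequence[float], n_coeffs: int, eps: float = 1e-10) -> List[float]:
    """First `n_coeffs` real-cepstrum coefficients of one frame."""
    spectrum = dft(frame)
    log_mag = [math.log(abs(c) + eps) for c in spectrum]   # eps avoids log(0)
    cepstrum = dft(log_mag, inverse=True)
    return [c.real for c in cepstrum[:n_coeffs]]
```

A unit impulse has a flat magnitude spectrum, so its cepstrum is essentially zero; this makes a convenient sanity check for the sketch.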

So that the speaker's average feature quantity can be obtained, the utterance selection unit 30 erases, from the feature quantity data V stored in the buffer area A1, any feature quantity data whose distance from the other feature quantity data is large.
The information creation unit 40 obtains average feature quantity data VA by averaging the feature quantity data V stored in the buffer area A1, and also obtains the threshold t used at verification. It then stores the identifier sent from the information input unit 107, the average feature quantity data VA, and the threshold t in the registration table TB1 in association with one another.

When an identifier is sent from the information input unit 107, the information acquisition unit 50 reads the threshold t and the average feature quantity data VA stored in the registration table TB1 in association with that identifier, and sends them to the feature quantity comparison unit 60.
The feature quantity comparison unit 60 obtains the distance between the feature quantity represented by the feature quantity data V sent from the feature quantity extraction unit 20 and the feature quantity represented by the average feature quantity data VA sent from the information acquisition unit 50, and determines whether the speaker is a registered individual by checking whether this distance is larger or smaller than the threshold t sent from the information acquisition unit 50. The feature quantity comparison unit 60 then sends result data representing the comparison result to the display unit 108, which displays whether the speaker is a registered individual based on that result.
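The comparison performed by the feature quantity comparison unit 60 reduces to one distance computation and one threshold test. The sketch below assumes Euclidean distance, which the source does not specify, and the function names `distance` and `is_registrant` are hypothetical. The "equal to or less than" boundary follows the wording used for step S61 in the operation description.

```python
import math
from typing import Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors (metric is an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_registrant(v: Sequence[float], va: Sequence[float], t: float) -> bool:
    """True when the input feature V lies within threshold t of the stored average VA."""
    return distance(v, va) <= t
```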

[Operation of the embodiment]
Next, the operation of the present embodiment is described: first the operation at registration, then the operation at verification.

[Operation at registration]
First, a person who wishes to register a voice feature quantity (hereinafter, the "registrant") operates the information input unit 107 and clicks the "registration button" on the menu screen (see FIG. 8) displayed on the display unit 108. When the registration button is clicked, a screen prompting for an identifier (see FIG. 9) is displayed on the display unit 108 (FIG. 4: step S10). When an identifier uniquely identifying the registrant is then input and the Enter button on the screen is clicked (step S11; YES), the counter n, which counts how many times the registrant has pronounced the predetermined word, is initialized (n = 0) (step S12). A screen asking the registrant to pronounce a predetermined word (for example, the registrant's name) is then displayed (see FIG. 10) (step S13), and the input identifier is stored in the RAM 104.

After the screen illustrated in FIG. 10 is displayed, the registrant pronounces the predetermined word. When the registrant's voice is input to the voice input unit 106, the voice input unit 106 outputs voice data for that voice. When the voice data is output (step S14; YES), the CPU 102 extracts only the voice portion as the utterance section and generates voice data from which silent portions and non-voice sounds have been removed (step S15). The CPU 102 then extracts the feature quantity of the voice represented by the voice data generated in step S15 and generates feature quantity data V representing it (step S16). Next, the CPU 102 adds 1 to the counter n (step S17) and stores the generated feature quantity data V in the buffer area A1 of the storage unit 105 (step S18). As illustrated in FIG. 11, the feature quantity data V is stored in the buffer area A1 in the array element V[n], where n is the value of the counter n.

After storing the feature quantity data V in the buffer area A1, the CPU 102 determines whether the registrant has pronounced the predetermined word the predetermined number of times (N times) (step S19). Specifically, the CPU 102 determines whether the value of the counter n equals the predetermined value N. If the counter n is less than N (step S19; NO), the CPU 102 updates the "number of pronunciations remaining until registration is complete" shown on the screen of FIG. 10 to N minus the value of the counter n and again requests that the predetermined word be pronounced. If the counter n has reached N (step S19; YES), the CPU 102 proceeds to the next processing.
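Steps S12 to S19 amount to a counted collection loop. The following is a rough sketch of that loop only; the function name `collect_features`, the `extract` callable standing in for steps S15 and S16, and the list of "remaining" counts standing in for the screen update are assumptions for illustration.

```python
from typing import Callable, Iterable, List, Tuple

Feature = List[float]


def collect_features(utterances: Iterable[List[float]],
                     extract: Callable[[List[float]], Feature],
                     n_required: int) -> Tuple[List[Feature], List[int]]:
    """Collect one feature vector per utterance until n_required are buffered."""
    buffer: List[Feature] = []          # stands in for buffer area A1
    remaining_shown: List[int] = []     # values the FIG. 10 screen would display
    n = 0                               # step S12
    for voice_data in utterances:       # step S14: voice data arrives
        feature = extract(voice_data)   # steps S15-S16 (hypothetical callable)
        n += 1                          # step S17
        buffer.append(feature)          # step S18: stored as V[n]
        if n == n_required:             # step S19; YES
            break
        remaining_shown.append(n_required - n)   # step S19; NO: update the count
    return buffer, remaining_shown
```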

Next, for each feature quantity represented by the stored feature quantity data V, the CPU 102 obtains the distances to the feature quantities represented by each of the other feature quantity data and computes the average of those distances. First, the CPU 102 initializes the counter i (i = 1) (FIG. 5: step S20) and initializes the distance D[i] (D[i] = 0) (step S21). Next, the CPU 102 initializes the counter j (j = 1) (step S22), obtains the distance between the feature quantity represented by feature quantity data V[i] (i being the value of the counter i) and the feature quantity represented by feature quantity data V[j] (j being the value of the counter j), and adds the obtained distance to the value stored in the distance D[i] (step S23).
When step S23 is complete, the CPU 102 increments the counter j by 1 (step S24) and determines whether the counter j has reached the predetermined value N (step S25). If not (step S25; NO), the CPU 102 repeats steps S23 and S24 until the counter j reaches N. If the counter j has reached N (step S25; YES), the CPU 102 divides the distance D[i] by the predetermined value N, which is the number of feature quantity data stored in the buffer area A1, to obtain the average distance between the feature quantity represented by V[i] and the feature quantities represented by the other feature quantity data, and stores the result in the distance D[i] (step S26).

When step S26 is complete, the CPU 102 increments the counter i by 1 (step S27) and determines whether the counter i has reached the predetermined value N (step S28). If not (step S28; NO), the CPU 102 repeats steps S21 to S27 until the counter i reaches N.
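The nested loops of steps S20 to S28 can be sketched as below. Two things are assumptions here rather than statements from the source: the distance is taken to be Euclidean, and the loops are written to visit all N stored entries. As in the text, each sum is divided by N, so the zero distance from an entry to itself is included in the average.

```python
import math
from typing import List, Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance (the source does not name the metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_distances(features: List[List[float]]) -> List[float]:
    """D[i]: average distance from V[i] to the stored feature vectors."""
    n = len(features)
    result = []
    for v_i in features:
        total = 0.0                       # step S21
        for v_j in features:              # steps S22-S25
            total += distance(v_i, v_j)   # step S23
        result.append(total / n)          # step S26: divide by N
    return result
```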

If, on the other hand, the counter i has reached the predetermined value N (step S28; YES), the CPU 102 first initializes the counter i (i = 1) (FIG. 6: step S29). The CPU 102 then compares the average distance D[i] obtained in steps S21 to S28 with a distance limit threshold T (step S30). The limit threshold T is a predetermined fixed value.
If the average distance D[i] is less than the limit threshold T (step S30; NO), the CPU 102 proceeds to step S33. If the average distance D[i] is equal to or greater than the limit threshold T (step S30; YES), the CPU 102 erases the feature quantity data V[i] from the buffer area A1 (step S31).

Next, the CPU 102 subtracts 1 from the counter n, whose value reached N through the processing of steps S13 to S19, and sets the result as the new value of the counter n (step S32). The CPU 102 then increments the counter i by 1 (step S33) and determines whether the counter i has reached the predetermined value N (step S34). If not (step S34; NO), the CPU 102 repeats steps S30 to S33 until the counter i reaches N.

If, on the other hand, the counter i has reached the predetermined value N (step S34; YES), the CPU 102 determines whether the value of the counter n equals the predetermined value N (step S35). If the counter n does not equal N (that is, feature quantity data V stored in the buffer area A1 was erased during steps S30 to S34), the CPU 102 rearranges the feature quantity data remaining in the buffer area A1 as illustrated in FIG. 12 (step S36) and then returns to step S13.
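Taken together, steps S13 to S36 form a loop that keeps asking for new utterances until N mutually consistent ones are buffered. A minimal sketch of that loop follows. The names `collect_consistent` and `next_feature` are hypothetical, Euclidean distance is assumed, and the `max_rounds` guard is an addition of this sketch: the flow described above simply repeats until no entry is erased.

```python
import math
from typing import Callable, List, Sequence

Feature = List[float]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance (the source does not name the metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_distances(features: List[Feature]) -> List[float]:
    """Average distance from each entry to the stored entries, divided by N."""
    n = len(features)
    return [sum(distance(v, w) for w in features) / n for v in features]


def collect_consistent(next_feature: Callable[[], Feature], n_required: int,
                       limit: float, max_rounds: int = 10) -> List[Feature]:
    """Refill the buffer until none of the N entries is erased as an outlier."""
    buffer: List[Feature] = []
    for _ in range(max_rounds):
        while len(buffer) < n_required:          # steps S13-S19: record again
            buffer.append(next_feature())
        dists = mean_distances(buffer)           # steps S20-S28
        # Steps S29-S34: erase entries whose average distance is >= limit T.
        kept = [v for v, d in zip(buffer, dists) if d < limit]
        if len(kept) == n_required:              # step S35: nothing was erased
            return kept
        buffer = kept                            # step S36: compact, back to S13
    raise RuntimeError("no consistent set of utterances obtained")
```

Building `kept` as a new list plays the role of the rearrangement in step S36: surviving entries end up contiguous, and newly recorded entries are appended after them.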

If, on the other hand, the value of the counter n equals the predetermined value N (step S35; YES), the CPU 102 initializes the threshold t (t = 0) (FIG. 7: step S37) and initializes the counter i (i = 1) (step S38). Next, the CPU 102 obtains data Va by averaging the feature quantity data other than V[i] (step S39). It then obtains the distance D[i] between the feature quantity represented by V[i] and the feature quantity represented by the data Va (step S40) and determines whether the obtained distance D[i] is greater than the threshold t (step S41).
If the threshold t is less than the distance D[i] (step S41; YES), the CPU 102 sets t = D[i] (step S42). If the threshold t is greater than or equal to the distance D[i] (step S41; NO), the CPU 102 proceeds to step S43.

Next, the CPU 102 increments the counter i by 1 (step S43) and determines whether the counter i has reached the predetermined value N (step S44). If not (step S44; NO), the CPU 102 repeats steps S39 to S43 until the counter i reaches N. If the counter i has reached N (step S44; YES), the CPU 102 obtains average feature quantity data VA, the average of the feature quantity data V stored in the buffer area A1 (step S45). It then stores the identifier input from the information input unit 107, the average feature quantity data VA obtained in step S45, and the threshold t obtained in steps S38 to S44 in the registration table TB1 in association with one another (step S46). When storing the threshold t and the average feature quantity data VA, if the identifier input from the information input unit 107 is already in the registration table TB1, the CPU 102 updates the threshold t and average feature quantity data VA stored for that identifier to the newly obtained values; if the identifier is not yet stored in the registration table TB1, the CPU 102 adds the input identifier, the threshold t, and the average feature quantity data VA to the table as a new entry.
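Steps S37 to S46 compute a per-registrant threshold as the largest leave-one-out distance, then store it with the overall average. The sketch below illustrates this under assumptions: Euclidean distance, a dictionary standing in for the registration table TB1, and the hypothetical function names `mean_vector`, `threshold_and_average`, and `register`. It needs at least two feature vectors, since each one is compared with the average of the others.

```python
import math
from typing import Dict, List, Sequence, Tuple

Feature = List[float]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance (the source does not name the metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors: List[Feature]) -> Feature:
    """Element-wise average of equally sized feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def threshold_and_average(features: List[Feature]) -> Tuple[float, Feature]:
    """Return (t, VA) for the buffered feature vectors."""
    t = 0.0                                        # step S37
    for i, v in enumerate(features):               # steps S38, S43, S44
        others = features[:i] + features[i + 1:]
        va_i = mean_vector(others)                 # step S39: average without V[i]
        d_i = distance(v, va_i)                    # step S40
        if t < d_i:                                # step S41
            t = d_i                                # step S42
    va = mean_vector(features)                     # step S45
    return t, va


def register(table: Dict[str, Tuple[float, Feature]], identifier: str,
             features: List[Feature]) -> None:
    """Step S46: overwrite an existing entry or add a new one."""
    table[identifier] = threshold_and_average(features)
```

Because the threshold is the maximum distance observed among the registrant's own utterances, a registrant whose voice varies more from utterance to utterance receives a looser threshold.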

As described above, according to the present embodiment, the average feature quantity of the registrant's voice is stored as the feature quantity of the registrant's voice. In addition, the threshold used to determine whether a speaker is a registered individual is obtained for each registrant from the feature quantities of that registrant's voice.

[Operation at verification]
Next, the operation at verification is described. First, the speaker operates the information input unit 107 and clicks the "speaker determination button" on the menu screen (see FIG. 8) displayed on the display unit 108. When the speaker determination button is clicked, a screen prompting for an identifier (see FIG. 9) is displayed on the display unit 108 (FIG. 13: step S50). When the speaker then inputs an identifier and clicks the Enter button on the screen (step S51), the input identifier is stored in the RAM 104.

The CPU 102 then searches the registration table TB1 for the identifier stored in the RAM 104 (step S52). If no matching identifier is found (step S53; NO), the CPU 102 displays a message on the display unit 108 stating that the identifier is not registered (step S54) and ends the processing. If a matching identifier is found (step S53; YES), the CPU 102 reads the threshold t and the average feature quantity data VA stored in the registration table TB1 in association with that identifier (step S55). It then displays a screen requesting that a predetermined word (for example, the registrant's name) be pronounced (step S56) and waits for voice input (step S57).

When the speaker then pronounces the predetermined word and the speaker's voice is input to the voice input unit 106 (step S57; YES), the voice input unit 106 outputs voice data for that voice. The CPU 102 extracts only the voice portion of this voice data as the utterance section and generates voice data from which silent portions and non-voice sounds have been removed (step S58). The CPU 102 then extracts the feature quantity of the voice represented by the voice data generated in step S58 and generates feature quantity data V representing it (step S59).

Next, the CPU 102 obtains the distance between the feature quantity represented by the feature quantity data V and the feature quantity represented by the average feature quantity data VA (step S60). If the obtained distance is equal to or less than the threshold t read from the registration table TB1 (that is, V is close to VA) (step S61; YES), the CPU 102 determines that the speaker is the registrant and displays the result on the display unit 108 (step S62). If the obtained distance is greater than the threshold t read from the registration table TB1 (that is, V is far from VA) (step S61; NO), the CPU 102 determines that the speaker is not the registrant and displays the result on the display unit 108 (step S63).
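The decision flow of steps S52 to S63 can be sketched as a single function. As before, this is an illustration under assumptions: the table is a dictionary mapping an identifier to a (t, VA) pair, the distance is Euclidean, the returned strings stand in for whatever the display unit 108 would show, and the name `verify` is hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance (the source does not name the metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def verify(table: Dict[str, Tuple[float, List[float]]], identifier: str,
           feature: Sequence[float]) -> str:
    """Return the message to display for one verification attempt."""
    entry = table.get(identifier)                # step S52: search TB1
    if entry is None:                            # step S53; NO
        return "identifier not registered"       # step S54
    t, va = entry                                # step S55: read t and VA
    d = distance(feature, va)                    # step S60
    if d <= t:                                   # step S61; YES
        return "speaker is the registrant"       # step S62
    return "speaker is not the registrant"       # step S63
```

In the described flow the lookup happens before the speaker is asked to pronounce anything, so an unregistered identifier is rejected without recording; the sketch takes the already extracted feature as an argument only for brevity.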

As described above, according to this embodiment, speaker recognition is based on the stored average feature amount of the registrant's voice and uses a threshold specific to that registrant. This reduces the likelihood that a registered individual who speaks is judged not to be the registered person.

[Modifications]
An embodiment of the present invention has been described above, but the present invention is not limited to that embodiment and can be implemented in various other forms. For example, the embodiment described above may be modified as follows.

A microphone may be connected to a personal computer, and the program described above may be executed on the personal computer so that the personal computer performs speaker recognition. Besides a personal computer, the program may also be executed on a PDA (Personal Digital Assistant) equipped with a microphone, a mobile phone, or a similar device to perform speaker recognition.

In the embodiment described above, the threshold t is calculated from the voice feature amounts alone, but a predetermined constant may be added to the threshold t to suit the environment in which the speech collation apparatus is used or the performance of the microphone.
In the embodiment described above, the speech collation apparatus may also calculate only the average feature amount data VA without calculating the threshold t. In this case, the threshold t may be input from the information input unit 107.
In the embodiment described above, the process of step S35 may also be omitted, with the process of step S37 executed after step S34.
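The first two modifications above both concern where the threshold t comes from. A minimal sketch of that choice is shown below. The function name `resolve_threshold` and its parameters are hypothetical, and applying the constant to a manually entered threshold as well is an assumption made here for simplicity, not something the text states.

```python
def resolve_threshold(computed_t=None, manual_t=None, offset=0.0):
    """Choose the threshold t used at collation time.

    computed_t: threshold calculated from the voice feature amounts at registration
    manual_t:   threshold entered via the information input unit 107, if any
    offset:     predetermined constant for the usage environment or microphone
    """
    # A manually entered threshold takes precedence in this sketch.
    base = manual_t if manual_t is not None else computed_t
    if base is None:
        raise ValueError("no threshold available")
    return base + offset


print(resolve_threshold(computed_t=1.5, offset=0.25))  # 1.75
print(resolve_threshold(manual_t=2.0))                 # 2.0
```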

A block diagram showing the hardware configuration of the speech collation apparatus according to the embodiment of the present invention.
A diagram illustrating the format of the registration table TB1.
A functional block diagram showing the functional configuration of the processing performed by the CPU 102.
A flowchart showing the flow of processing performed by the CPU 102 at registration.
A flowchart showing the flow of processing performed by the CPU 102 at registration.
A flowchart showing the flow of processing performed by the CPU 102 at registration.
A flowchart showing the flow of processing performed by the CPU 102 at registration.
A diagram illustrating a screen displayed on the display unit 108.
A diagram illustrating a screen displayed on the display unit 108.
A diagram illustrating a screen displayed on the display unit 108.
A diagram illustrating the data stored in the buffer area.
A diagram explaining the rearrangement of the data stored in the buffer area.
A flowchart showing the flow of processing performed by the CPU 102 at collation.

Explanation of symbols

10 ... utterance section extraction unit, 20 ... feature amount extraction unit, 30 ... utterance selection unit, 40 ... information creation unit, 50 ... information acquisition unit, 60 ... feature amount comparison unit, 101 ... bus, 102 ... CPU, 103 ... ROM, 104 ... RAM, 105 ... storage unit, 106 ... voice input unit, 107 ... information input unit, 108 ... display unit.

Claims

A speaker recognition method that recognizes the speaker of an input voice as a registrant when the distance between the feature quantity of the input voice and the feature quantity stored, in association with an input identifier, in a storage unit that stores voice feature quantities is equal to or less than a predetermined threshold, the method comprising:
an identifier input step in which an identifier uniquely identifying the speaker is input;
a voice input step in which the speaker's voice is input a plurality of times;
a feature quantity storage step of obtaining, for each voice input in the voice input step, the feature quantity of that voice and storing the obtained feature quantity in the storage unit;
an erasing step of erasing from the storage unit, among the feature quantities stored in the storage unit by the plurality of executions of the feature quantity storage step, any feature quantity whose distance from the other feature quantities stored together with it is larger than a predetermined value;
an average feature quantity calculation step of obtaining, after the erasing step, the average value of the feature quantities stored in the storage unit by the plurality of executions of the feature quantity storage step; and
a feature amount storage step of storing the average value obtained in the average feature quantity calculation step in the storage unit, as the feature quantity of the speaker's voice, in association with the identifier input in the identifier input step.
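The registration steps recited above can be sketched as follows. This is a hedged illustration rather than the claimed method itself: the claim does not fix the distance measure or how the "distance from the other feature quantities" is aggregated, so this sketch assumes Euclidean distance to the mean of the other stored vectors. All function names and the `table` dictionary are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Element-wise average of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def erase_outliers(features, max_distance):
    """Erasing step: drop vectors farther than max_distance from the mean of the others."""
    kept = []
    for i, f in enumerate(features):
        others = features[:i] + features[i + 1:]
        if not others or distance(f, mean_vector(others)) <= max_distance:
            kept.append(f)
    return kept


def register(identifier, features, max_distance, table):
    """Erase outliers, average the remainder, and store it under the identifier."""
    kept = erase_outliers(features, max_distance)
    if not kept:
        raise ValueError("all utterances were erased; ask the speaker to repeat")
    table[identifier] = {"VA": mean_vector(kept)}
    return table[identifier]


# Four utterances of the same words; the last one is an outlier.
utterances = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [5.0, 5.0]]
table = {}
entry = register("user01", utterances, max_distance=2.0, table=table)
print(entry["VA"])  # approximately [1.0, 1.0]
```

Each vector is compared against the mean of the others in the original set, so a single pass decides which vectors are erased. An iterative variant that recomputes after each erasure is equally plausible under the claim wording.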

The speaker recognition method according to claim 1, further comprising a threshold calculation step of calculating, after the erasing step and for each feature quantity stored in the storage unit by the plurality of executions of the feature quantity storage step, the distance between that feature quantity and the average value of the other feature quantities stored together with it, and of using the maximum of the calculated distances as the threshold,
wherein the feature amount storage step stores the average value obtained in the average feature quantity calculation step and the threshold obtained in the threshold calculation step in association with the identifier input in the identifier input step.
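The threshold calculation recited above amounts to a leave-one-out comparison: each remaining feature quantity is compared with the average of the others, and the largest distance becomes the threshold. A minimal sketch follows, again assuming Euclidean distance. The names `distance`, `mean_vector`, and `leave_one_out_threshold` are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Element-wise average of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def leave_one_out_threshold(features):
    """Return the maximum distance between each vector and the mean of the others."""
    if len(features) < 2:
        raise ValueError("at least two feature vectors are required")
    distances = []
    for i, f in enumerate(features):
        others = features[:i] + features[i + 1:]
        distances.append(distance(f, mean_vector(others)))
    return max(distances)


# Feature vectors that remain after the erasing step (hypothetical values).
remaining = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]]
print(leave_one_out_threshold(remaining))  # 1.5
```

Taking the maximum means every utterance retained at registration would itself be accepted against the average of the others, which is one way to read why the threshold is specific to each registrant.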