JP4255897B2

JP4255897B2 - Speaker recognition device

Info

Publication number: JP4255897B2
Application number: JP2004236429A
Authority: JP
Inventors: 直樹関根; 友成柿野; 智則伊久美; 圭祐吉崎
Original assignee: Toshiba TEC Corp
Current assignee: Toshiba TEC Corp
Priority date: 2004-08-16
Filing date: 2004-08-16
Publication date: 2009-04-15
Anticipated expiration: 2024-08-16
Also published as: JP2006053459A

Description

The present invention relates to a speaker recognition device.

Conventionally, text-dependent speaker recognition devices have been proposed that hold, as a standard pattern, the voice pattern of a specific speaker (user) uttering a password or the like. Such speaker recognition devices are used in ATMs (automated teller machines) and similar equipment.

In general, at recognition time a speaker recognition device calculates the distance between the input voice and the genuine user's standard pattern after time-axis alignment, and compares that value with a fixed threshold to determine whether the speaker is the genuine user. If an impostor utters the same password as the one registered in advance by the specific speaker, the device may verify the impostor as the genuine user. It is therefore important that the password does not become known to others.

In a conventional speaker recognition device, however, the user utters the password toward a microphone for voice input. People around the user can therefore easily hear the password, so information such as the password is easily leaked to others.

As a way to solve this problem, Patent Document 1 proposes preventing eavesdropping during voice registration by emitting, from a loudspeaker, an interfering sound that keeps the user's voice from being heard in the surroundings. In this case the input sound picked up by the microphone contains the interfering sound as well as the voice. Using this input sound as-is for speaker recognition lowers recognition accuracy, so Patent Document 1 uses an adaptive filter to remove the interfering sound from the input sound in an attempt to improve speaker recognition accuracy.

Patent Document 1: JP-A-9-127974

With the method of Patent Document 1, however, it is difficult to remove only the interfering sound completely from a waveform in which the interfering sound is superimposed on the user's voice, and speaker recognition accuracy is insufficient. In addition, the computational load of removing the interfering sound is large, which is a serious problem in practical operation.

An object of the present invention is to prevent eavesdropping during voice registration by means of an interfering sound that keeps the user's voice from being heard in the surroundings, while reducing computational load and cost and achieving highly accurate speaker recognition.

The speaker recognition device of the present invention comprises: interfering sound generation means for generating an interfering sound; an interfering sound output unit that outputs the interfering sound generated by the interfering sound generation means to the external space; a voice input unit for inputting the user's voice; interfering sound changing means for changing the volume of the interfering sound that the interfering sound output unit is about to output, according to the ratio between the volume of the user's voice contained in the input sound received from the voice input unit and the volume of the interfering sound output to the external space; an operation unit that accepts operations by the user; a storage unit that stores voice information of the user; state switching means for switching, in response to the user's operation on the operation unit, between a registration state for registering the user's voice and a collation state for collating the user's voice; feature quantity calculation means for calculating a voice feature quantity based on the input sound; voice registration means for registering, in the registration state selected by the state switching means, the user's voice in the storage unit as the voice information using the voice feature quantity calculated by the feature quantity calculation means; and voice collation means for collating, in the collation state selected by the state switching means, the user's voice using the voice information stored in the storage unit and the voice feature quantity calculated by the feature quantity calculation means.

With this configuration, the feature quantity calculation means calculates a voice feature quantity based on the input sound received from the voice input unit in the presence of the interfering sound. When the user's voice is registered, the voice registration means registers that voice feature quantity in the storage unit as voice information. When the user's voice is collated, collation is performed by comparing that voice feature quantity with the voice information registered in the storage unit. Consequently, no adaptive filter or similar component for removing the interfering sound is needed, and highly accurate speaker recognition becomes possible in the presence of the interfering sound.

According to the present invention, eavesdropping during voice registration can be prevented by an interfering sound that keeps the user's voice from being heard in the surroundings, computational load and cost can be reduced, and highly accurate speaker recognition can be achieved.

A first embodiment of the present invention will be described with reference to FIGS. 1 to 6.

FIG. 1 is a block diagram showing the schematic configuration of a speaker recognition device 100 according to the present embodiment. The speaker recognition device 100 of the present embodiment is an example in which speaker recognition is performed by having the user utter a specific password.

As shown in FIG. 1, the speaker recognition device 100 comprises: an interfering sound generation unit 1 that generates an interfering sound that keeps the user's voice from being heard in the surroundings; an interfering sound output unit 2 that outputs the generated interfering sound; a voice input unit 3 for inputting the user's voice; a feature quantity calculation unit 4 that calculates a voice feature quantity based on the input sound received by the voice input unit 3; an operation unit 5 that accepts operations by the user; a state switching unit 6 that switches, in response to the user's operation on the operation unit 5, between a registration state for registering the user's voice and a collation state for collating the user's voice; a voice registration unit 7 that registers the user's voice in the registration state using the voice feature quantity calculated by the feature quantity calculation unit 4; a standard pattern DB (database) 8 that stores the voice information from the voice registration unit 7 as a standard pattern; a voice collation unit 9 that collates the user's voice in the collation state using the standard pattern stored in the standard pattern DB 8 and the voice feature quantity calculated by the feature quantity calculation unit 4; and an interfering sound changing unit 10 that changes the interfering sound based on the input sound received by the voice input unit 3.

The interfering sound output from the interfering sound output unit 2 loops back into the voice input unit 3. The input sound received by the voice input unit 3 is therefore the mixture of the interfering sound output by the interfering sound output unit 2 and the user's voice.

The interfering sound generation unit 1 generates an interfering sound such as music, a beep, synthesized speech, or radio audio as a digital signal and sends it to the interfering sound output unit 2. The interfering sound drowns out the user's voice and so keeps it from being heard in the surroundings. The interfering sound generation unit 1 functions as the interfering sound generation means.

The interfering sound output unit 2 includes a D/A converter that converts the generated digital signal into an analog signal, an amplifier that amplifies the converted analog signal, and a loudspeaker that outputs the amplified analog signal as sound (none of which are shown). The interfering sound output unit 2 thus converts the digital interfering-sound signal generated by the interfering sound generation unit 1 into an analog signal, amplifies it, and outputs it to the outside as sound.

The voice input unit 3 includes a microphone for picking up sound such as the user's voice as an analog signal, an amplifier that amplifies that analog signal, and an A/D converter that converts the amplified analog signal into a digital signal (none of which are shown). The voice input unit 3 mainly receives the user's voice, but the interfering sound output from the interfering sound output unit 2 is also received, mixed with the user's voice. The voice input unit 3 therefore amplifies the input sound (the mixture of the interfering sound and the user's voice), converts it from an analog signal to a digital signal, and sends it to the feature quantity calculation unit 4 and the interfering sound changing unit 10.

The feature quantity calculation unit 4 obtains a voice feature quantity by performing linear prediction analysis on the input sound sent from the voice input unit 3. Linear prediction analysis is a method of obtaining the spectral envelope from the input sound and is a widely known voice feature extraction technique that reflects the vocal tract characteristics of the speech production mechanism (see Kiyohiro Shikano and four others, "Speech Recognition System", Ohmsha, 1st edition (May 2001), pp. 1-13). The feature quantity calculation unit 4 functions as the feature quantity calculation means.
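The linear prediction analysis mentioned above can be illustrated with a minimal sketch. The patent does not state which estimation method, prediction order, or framing is used; the autocorrelation method with the Levinson-Durbin recursion below is one textbook formulation, and the function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc_coefficients(frame, order):
    """Estimate LPC coefficients with the Levinson-Durbin recursion.

    Returns (a, err) with a[0] == 1.0, where the predictor is
    x[n] ~= -sum(a[k] * x[n - k] for k in 1..order)
    and err is the final prediction error energy.
    """
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    if err == 0.0:
        return a, 0.0  # silent frame: nothing to predict
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err
```

In a device of this kind the coefficients (or a spectral-envelope representation derived from them) would be computed per short frame, giving a sequence of feature vectors for one utterance.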

The operation unit 5 is an operation panel operated by the user and consists of a numeric keypad, a plurality of selection buttons, and the like (none of which are shown). For example, the user switches between the registration state and the collation state by pressing a selection button.

The state switching unit 6 switches the speaker recognition device 100 to the registration state or the collation state in response to the user's operation on the operation unit 5. In the registration state, the state switching unit 6 passes the voice feature quantity calculated by the feature quantity calculation unit 4 to the voice registration unit 7; in the collation state, it passes that voice feature quantity to the voice collation unit 9. The state switching unit 6 functions as the state switching means.

In the registration state, the voice registration unit 7 registers the voice feature quantity (voice pattern) sent from the state switching unit 6 in the standard pattern DB 8 as a standard pattern, which is the voice information. This voice feature quantity is obtained from the input sound (containing both the voice and the interfering sound) captured when the user (the specific speaker) utters the password in the presence of the interfering sound. The voice registration unit 7 functions as the voice registration means.

The standard pattern DB 8 is a storage unit that stores standard patterns, which are the voice information. An HDD (hard disk drive), a memory, or the like is used as the standard pattern DB 8.

In the collation state, the voice collation unit 9 calculates the distance, after time-axis alignment, between the voice feature quantity (voice pattern) sent from the state switching unit 6 and the standard pattern stored in the standard pattern DB 8, and performs voice collation by comparing that value with a fixed threshold. The voice feature quantity sent from the state switching unit 6 is obtained from the input sound (containing both the voice and the interfering sound) captured when the user (the specific speaker) utters the password in the presence of the interfering sound. The voice collation unit 9 functions as the voice collation means.
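The collation step above (distance after time-axis alignment, then a threshold comparison) can be sketched as follows. The patent does not name the alignment algorithm; dynamic time warping (DTW) is assumed here as a common choice, and the Euclidean frame distance, the length normalization, and the function names are likewise assumptions made for illustration.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_distance(seq_a, seq_b):
    """Length-normalized DTW distance between two feature sequences."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Allow a match, an insertion, or a deletion along the time axis.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def verify(features, standard_pattern, threshold):
    """Accept the speaker when the aligned distance is within the threshold."""
    return dtw_distance(features, standard_pattern) <= threshold
```

A time-stretched repetition of the enrolled pattern yields a distance near zero, whereas a dissimilar pattern exceeds the threshold; the threshold value itself would have to be tuned on real data.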

Based on the input sound sent from the voice input unit 3, the interfering sound changing unit 10 adjusts the volume of the interfering sound at the interfering sound output unit 2 according to the ratio between the user's speaking volume and the volume of the interfering sound. The interfering sound changing unit 10 functions as the interfering sound changing means.
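One way to picture this ratio-based volume adjustment is the sketch below. The patent does not say how the two volumes are measured or which ratio is targeted. Here the interfering-sound level is assumed to be known, the voice level is estimated by power subtraction under the assumption that the two signals are uncorrelated, and `target_ratio`, `min_gain`, and `max_gain` are hypothetical parameters.

```python
import math


def rms(samples):
    """Root-mean-square level of a block of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_voice_rms(input_rms, interference_rms):
    """Estimate the voice level from the mixture by power subtraction."""
    return math.sqrt(max(input_rms ** 2 - interference_rms ** 2, 0.0))


def next_gain(current_gain, voice_rms, interference_rms,
              target_ratio=1.0, min_gain=0.0, max_gain=4.0):
    """Scale the output gain so that interference/voice approaches target_ratio."""
    if interference_rms <= 0.0:
        return current_gain  # nothing measured yet, keep the current setting
    desired_interference = target_ratio * voice_rms
    gain = current_gain * desired_interference / interference_rms
    return min(max(gain, min_gain), max_gain)
```

With this rule a quiet speaker leads to a quieter interfering sound, which matches the stated aim of not raising the interfering sound more than necessary.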

Next, voice registration in the voice registration unit 7 will be described with reference to FIGS. 2 to 6. FIG. 2 is a schematic diagram showing the waveform of the voice during registration in a quiet environment, FIG. 3 is a schematic diagram showing the waveform of the voice in the presence of the interfering sound, and FIG. 4 is a schematic diagram showing the waveform of the voice in the presence of the interfering sound after adaptive filtering. FIG. 5 is a schematic diagram showing the waveform of the voice during registration in the presence of the interfering sound, and FIG. 6 is a schematic diagram showing the waveform of the voice during collation in the presence of the interfering sound.

The voice during registration in a quiet environment has the waveform x(t) shown in FIG. 2. With the interfering sound denoted y(t), the voice in the presence of the interfering sound has the waveform x(t) + y(t) shown in FIG. 3. With the interfering sound remaining after adaptive filtering denoted y'(t), the voice in the presence of the interfering sound after adaptive filtering has the waveform x(t) + y'(t) shown in FIG. 4.

In the conventional technique, voice collation compares the voice x(t) shown in FIG. 2 with the voice x(t) + y'(t) shown in FIG. 4. Because the two differ by y'(t), speaker recognition accuracy is lowered. This is because the voice recorded in a quiet environment is used as the registered voice.

In the present embodiment, therefore, the voice recorded in the presence of the interfering sound is used as the registered voice. The voice during registration in the presence of the interfering sound has the waveform x(t) + y(t) shown in FIG. 5, and the voice during collation in the presence of the interfering sound has the waveform x(t) + y(t) shown in FIG. 6. Given the stationarity of y(t), the difference between the waveform of FIG. 5 and the waveform of FIG. 6 is very small, and the two waveforms are nearly identical. Speaker recognition accuracy is improved as a result.
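The argument above can be made concrete with a toy numerical sketch. All sample values are hypothetical, the comparison is made directly on waveforms rather than on extracted features, and the interfering sound is idealized as identical at registration and collation, so the numbers only illustrate the direction of the effect.

```python
def mean_abs_difference(a, b):
    """Average absolute sample difference between two equal-length signals."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


# Hypothetical voice x(t) and stationary interfering sound y(t).
x = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
y = [0.2, -0.2, 0.2, -0.2, 0.2, -0.2, 0.2, -0.2]
# Hypothetical residue y'(t) left behind by an imperfect adaptive filter.
y_residual = [0.5 * v for v in y]

# Conventional scheme: register x(t) in quiet, collate against x(t) + y'(t).
clean_registration = x
filtered_probe = [a + b for a, b in zip(x, y_residual)]
conventional_mismatch = mean_abs_difference(clean_registration, filtered_probe)

# Scheme of this embodiment: register and collate x(t) + y(t) alike.
noisy_registration = [a + b for a, b in zip(x, y)]
noisy_probe = [a + b for a, b in zip(x, y)]
proposed_mismatch = mean_abs_difference(noisy_registration, noisy_probe)
```

In this idealized case the conventional mismatch equals the average magnitude of y'(t), while the mismatch under matched conditions vanishes.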

In this configuration, the speaker recognition device 100 generates an interfering sound with the interfering sound generation unit 1 and outputs it to the outside through the interfering sound output unit 2. While the interfering sound is being emitted, the user utters the password toward the microphone of the voice input unit 3. The user's voice is mixed with the interfering sound output from the interfering sound output unit 2 and enters the voice input unit 3 as the input sound.

The speaker recognition device 100 obtains a voice feature quantity with the feature quantity calculation unit 4 based on the input sound (containing both the voice and the interfering sound) received from the voice input unit 3. If the registration state has been selected by the state switching unit 6, the voice registration unit 7 registers the voice feature quantity in the standard pattern DB 8 as a standard pattern. If the collation state has been selected by the state switching unit 6, the voice collation unit 9 performs voice collation by comparing the voice feature quantity with the standard pattern stored in the standard pattern DB 8. The user switches the speaker recognition device 100 between the registration state and the collation state by operating the operation unit 5.
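The registration and collation flow described above can be summarized in a short sketch. The class and method names are hypothetical, a dictionary stands in for the standard pattern DB 8, the `user_id` key is an assumption made for illustration (the first embodiment does not describe how patterns are keyed), and the distance function is supplied by the caller.

```python
from enum import Enum


class Mode(Enum):
    REGISTRATION = "registration"
    COLLATION = "collation"


class SpeakerRecognitionDevice:
    """Routes feature quantities to registration or collation by current mode."""

    def __init__(self, distance, threshold):
        self.mode = Mode.COLLATION
        self.standard_patterns = {}  # stands in for the standard pattern DB
        self.distance = distance     # caller-supplied distance function
        self.threshold = threshold

    def switch(self, mode):
        """Corresponds to the user pressing a selection button."""
        self.mode = mode

    def process(self, user_id, features):
        """Register the features, or collate them with the stored pattern."""
        if self.mode is Mode.REGISTRATION:
            self.standard_patterns[user_id] = features
            return True
        pattern = self.standard_patterns.get(user_id)
        if pattern is None:
            return False  # nothing registered for this user
        return self.distance(features, pattern) <= self.threshold
```

Both branches receive features computed from the same kind of input (voice plus interfering sound), which is the point of the embodiment: no filtering stage sits between the microphone and the feature calculation.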

As described above, in the present embodiment the feature quantity calculation unit 4 calculates a voice feature quantity based on the input sound received from the voice input unit 3 in the presence of the interfering sound. When the user's voice is registered, the voice registration unit 7 registers that voice feature quantity in the standard pattern DB 8 as a standard pattern. When the user's voice is collated, collation is performed by comparing that voice feature quantity with the standard pattern registered in the standard pattern DB 8. Consequently, no adaptive filter or similar component for removing the interfering sound is required, and highly accurate speaker recognition becomes possible in the presence of the interfering sound. Eavesdropping during voice registration is thus prevented by an interfering sound that keeps the user's voice from being heard in the surroundings, computational load and cost are reduced, and highly accurate speaker recognition is achieved.

Furthermore, because the present embodiment includes the interfering sound changing unit 10, which is the interfering sound changing means that changes the interfering sound based on the input sound, the interfering sound is adjusted, for example in volume according to the volume of the user's voice. The interfering sound therefore need not be made louder than necessary, and discomfort to people nearby can be avoided.

A second embodiment of the present invention will be described with reference to FIG. 7.

FIG. 7 is a block diagram showing the schematic configuration of a speaker recognition device 101 according to the present embodiment. The present embodiment is an example of a speaker recognition device 101 that performs speaker recognition by having the user utter a specific password. Parts identical to those of the first embodiment are denoted by the same reference numerals, and their description is omitted.

As shown in FIG. 7, the speaker recognition device 101 comprises: an interfering sound generation unit 1 that generates an interfering sound that keeps the user's voice from being heard in the surroundings; an interfering sound output unit 2 that outputs the generated interfering sound; a voice input unit 3 for inputting the user's voice; a feature quantity calculation unit 4 that calculates a voice feature quantity based on the input sound received by the voice input unit 3; an operation unit 5 that accepts operations by the user; a state switching unit 6 that switches, in response to the user's operation on the operation unit 5, between a registration state for registering the user's voice and a collation state for collating the user's voice; a voice registration unit 7 that registers the user's voice in the registration state using the voice feature quantity calculated by the feature quantity calculation unit 4; a standard pattern DB (database) 8 that stores the voice information from the voice registration unit 7 as a standard pattern; a voice collation unit 9 that collates the user's voice in the collation state using the standard pattern stored in the standard pattern DB 8 and the voice feature quantity calculated by the feature quantity calculation unit 4; and an interfering sound changing unit 10 that changes the interfering sound based on the input sound received by the voice input unit 3.

The feature quantity calculation unit 4 includes noise removal means that estimates and removes noise other than the interfering sound from the input sound sent from the voice input unit 3. It removes the estimated noise from the input sound and then performs linear prediction analysis on the resulting input sound to obtain the voice feature quantity. The feature quantity calculation unit 4 functions as the feature quantity calculation means. In the present embodiment the spectral subtraction method is used as the noise estimation means, but the invention is not limited to this. In this method, the spectrum of the input sound sent from the voice input unit 3 is averaged over time for each frequency and successively subtracted (see Boll, S.F., "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", IEEE Trans. ASSP-27, pp. 113-120, 1979). Linear prediction analysis is a method of obtaining the spectral envelope from the input sound and is a widely known voice feature extraction technique that reflects the vocal tract characteristics of the speech production mechanism (see Kiyohiro Shikano and four others, "Speech Recognition System", Ohmsha, 1st edition (May 2001), pp. 1-13).
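The core of the spectral subtraction step can be sketched on precomputed magnitude spectra, one list of bin magnitudes per frame. Framing, windowing, the Fourier transform, and the way the device keeps the interfering sound out of the noise estimate are not described in the source and are omitted here. The function names and the `floor` parameter are assumptions.

```python
def average_spectrum(frames):
    """Time-average magnitude spectra bin by bin to form a noise estimate."""
    if not frames:
        raise ValueError("at least one frame is required")
    count = len(frames)
    bins = len(frames[0])
    return [sum(frame[k] for frame in frames) / count for k in range(bins)]


def spectral_subtract(frame, noise_estimate, floor=0.0):
    """Subtract the noise estimate from one frame, clamping at the floor."""
    return [max(magnitude - noise, floor)
            for magnitude, noise in zip(frame, noise_estimate)]
```

Clamping at a floor avoids negative magnitudes when the estimate exceeds the observed value in a bin; the cleaned spectra would then feed the linear prediction analysis.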

The operation unit 5 is an operation panel operated by the user and consists of a numeric keypad, selection buttons, and the like (none of which are shown). For example, the user switches between the registration state and the collation state by pressing a selection button. In addition, by operating the numeric keypad or the like, the user enters declaration information such as a password or an ID number to declare his or her own identity.

The state switching unit 6 switches the speaker recognition device 101 to the registration state or the collation state in response to the user's operation on the operation unit 5. In the registration state, the state switching unit 6 passes the voice feature quantity calculated by the feature quantity calculation unit 4 to the voice registration unit 7; in the collation state, it passes that voice feature quantity to the voice collation unit 9. The state switching unit 6 functions as the state switching means.

Here, the voice collation unit 9 performs speaker recognition based on the declaration information entered through the operation unit 5 and sends the recognition result to the interfering sound generation unit 1. According to this recognition result, the interfering sound generation unit 1 selects an interfering sound such as music, a beep, synthesized speech, or radio audio, and generates it as a digital signal. For example, the interfering sound generation unit 1 includes a storage unit (not shown) that stores a file or the like in which the interfering sound to be generated is set in advance for each user (registrant). According to the recognition result that the voice collation unit 9 derives from the declaration information, the interfering sound generation unit 1 selects the interfering sound to generate from the file and generates it as a digital signal.
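The per-user selection described above amounts to a lookup keyed by the recognized registrant. In the minimal sketch below a dictionary stands in for the settings file; the ID strings, sound labels, default value, and function name are all hypothetical.

```python
DEFAULT_INTERFERENCE = "beep"  # hypothetical fallback for unknown users


def select_interference(declared_id, settings, default=DEFAULT_INTERFERENCE):
    """Return the interfering sound configured for the declared user."""
    return settings.get(declared_id, default)


# Hypothetical per-registrant settings, as might be loaded from a file.
settings = {"1001": "music", "1002": "radio"}
chosen = select_interference("1001", settings)
```

A fallback for undeclared or unknown IDs is an added assumption; the source does not say what is played in that case.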

In this configuration, when the user enters declaration information such as a password or an ID number through the operation unit 5, the speaker recognition device 101 performs speaker recognition with the voice collation unit 9 based on that declaration information, generates an interfering sound with the interfering sound generation unit 1 based on the recognition result, and outputs the generated interfering sound to the outside through the interfering sound output unit 2. While the interfering sound is being emitted, the user utters the password toward the microphone of the voice input unit 3. The user's voice is mixed with the interfering sound output from the interfering sound output unit 2 and enters the voice input unit 3 as the input sound.

The speaker recognition device 101 obtains a voice feature quantity with the feature quantity calculation unit 4 based on the input sound (containing both the voice and the interfering sound) received from the voice input unit 3. If the registration state has been selected by the state switching unit 6, the voice registration unit 7 registers the voice feature quantity in the standard pattern DB 8 as a standard pattern. If the collation state has been selected by the state switching unit 6, the voice collation unit 9 performs voice collation by comparing the voice feature quantity with the standard pattern. The user switches the speaker recognition device 101 between the registration state and the collation state by operating the operation unit 5.

In particular, the feature quantity calculation unit 4, which serves as the feature quantity calculation means, estimates and removes noise other than the disturbing sound from the input sound and calculates the voice feature quantity from the resulting input sound, so that more accurate speaker recognition can be achieved.
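
One simple way to picture the noise estimation and removal step is sketched below. The passage does not state which estimation method the feature quantity calculation unit 4 uses, so this sketch substitutes a plain frame-energy subtraction: it assumes that the leading frames contain only ambient noise (no voice and no disturbing sound) and subtracts their average power from every frame. The function names and the frame layout are hypothetical.

```python
def frame_powers(samples, frame_len):
    """Mean power of each non-overlapping frame of the input sound."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def subtract_noise(powers, noise_frames):
    """Estimate the noise power from the leading frames and subtract it,
    flooring each result at zero."""
    if noise_frames <= 0 or noise_frames > len(powers):
        raise ValueError("noise_frames must be between 1 and len(powers)")
    noise = sum(powers[:noise_frames]) / noise_frames
    return [max(p - noise, 0.0) for p in powers]
```

The disturbing sound itself is deliberately left in the signal here, since only noise other than the disturbing sound is removed.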

Furthermore, the operation unit 5 accepts an operation for inputting report information by which the user declares his or her identity; the voice collation unit 9, which serves as the speaker recognition means, performs speaker recognition based on the report information input through the operation unit 5; and the disturbing sound generation unit 1, which serves as the disturbing sound generation means, changes the disturbing sound according to the result of the speaker recognition by the voice collation unit 9. The disturbing sound to be generated can thus be changed for each user. As a result, the disturbing sound can be changed and output according to, for example, the user's preference.
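
The per-user selection of the disturbing sound can be pictured with the following sketch. The lookup tables, the ID values, the sound names, and the fallback sound are all hypothetical; a plain dictionary lookup stands in for the speaker recognition performed from the report information.

```python
# Hypothetical lookup tables; the names and values are illustrative only.
REGISTERED_IDS = {"0001": "user_a", "0002": "user_b"}    # declared ID number -> speaker
PRESET_SOUNDS = {"user_a": "music", "user_b": "babble"}  # speaker -> preset disturbing sound
DEFAULT_SOUND = "white_noise"                             # assumed fallback


def identify_speaker(declared_id):
    """Return the speaker specified by the report information, or None if unknown."""
    return REGISTERED_IDS.get(declared_id)


def select_disturbing_sound(declared_id):
    """Choose the disturbing sound to generate for the declared user."""
    speaker = identify_speaker(declared_id)
    return PRESET_SOUNDS.get(speaker, DEFAULT_SOUND)
```

For example, `select_disturbing_sound("0001")` returns the preset registered for `user_a`, while an unregistered ID falls back to the assumed default.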

The present invention is not limited to the specific hardware configuration shown in the embodiments described above and can also be realized by software. That is, the functions of the units of the speaker recognition devices 100 and 101 can be implemented in software. In this case, the speaker recognition devices 100 and 101 include a CPU (not shown) that centrally controls each unit. A ROM that stores the BIOS and various programs, a RAM that rewritably stores various data (neither shown), and the like are connected to the CPU via a bus. The CPU executes processing that realizes the various functions based on the programs stored in the ROM.

A block diagram showing the schematic configuration of the speaker recognition device according to the first embodiment of the present invention.
A schematic diagram showing the voice waveform at the time of voice registration in a quiet environment.
A schematic diagram showing the voice waveform under the disturbing sound.
A schematic diagram showing the voice waveform under the disturbing sound after adaptive filter processing.
A schematic diagram showing the voice waveform at the time of voice registration under the disturbing sound.
A schematic diagram showing the voice waveform at the time of voice collation under the disturbing sound.
A block diagram showing the schematic configuration of the speaker recognition device according to the second embodiment of the present invention.

Explanation of symbols

1 Disturbing sound generation means (disturbing sound generation unit)
2 Disturbing sound output unit
3 Voice input unit
4 Feature quantity calculation means (feature quantity calculation unit)
5 Operation unit
6 State switching means (state switching unit)
7 Voice registration means (voice registration unit)
8 Storage unit (standard pattern DB)
9 Voice collation means (voice collation unit)
10 Disturbing sound changing means (disturbing sound changing unit)
100 Speaker recognition device
101 Speaker recognition device

Claims

A speaker recognition device comprising:
disturbing sound generation means for generating a disturbing sound;
a disturbing sound output unit for outputting the disturbing sound generated by the disturbing sound generation means to an external space;
a voice input unit for inputting a user's voice;
disturbing sound changing means for changing the volume of the disturbing sound output by the disturbing sound output unit according to the ratio between the output volume at which the disturbing sound is output to the external space and the volume of the user's voice included in the input sound input from the voice input unit;
an operation unit for receiving an operation by the user;
a storage unit for storing voice information of the user;
state switching means for switching, in response to an operation of the operation unit by the user, between a registration state for registering the user's voice and a collation state for collating the user's voice;
feature quantity calculation means for calculating a voice feature quantity based on the input sound;
voice registration means for registering, in the registration state selected by the state switching means, the user's voice in the storage unit as the voice information using the voice feature quantity calculated by the feature quantity calculation means; and
voice collation means for collating, in the collation state selected by the state switching means, the user's voice using the voice information stored in the storage unit and the voice feature quantity calculated by the feature quantity calculation means.

The speaker recognition device according to claim 1, wherein the feature quantity calculation means estimates and removes noise other than the disturbing sound from the input sound and calculates the voice feature quantity based on the resulting input sound.

The speaker recognition device according to claim 1 or 2, wherein the operation unit accepts an operation for inputting report information by which the user declares his or her identity, the device further comprising means for causing the disturbing sound generation means to generate a preset disturbing sound corresponding to the speaker specified by the input report information.