JP6480124B2

JP6480124B2 - Biological detection device, biological detection method, and program

Info

Publication number: JP6480124B2
Application number: JP2014166271A
Authority: JP
Inventors: 山岸　順一; 越前　功; 小野　順貴; 松井　知子; 塩田　さやか
Original assignee: Inter University Research Institute Corp Research Organization of Information and Systems
Current assignee: Inter University Research Institute Corp Research Organization of Information and Systems
Priority date: 2014-08-19
Filing date: 2014-08-19
Publication date: 2019-03-06
Anticipated expiration: 2034-08-19
Also published as: JP2016042162A

Description

The present invention relates to a living body detection apparatus, a living body detection method, and a program, and in particular to a living body detection technique that uses voice.

In recent years, biometric authentication based on human physical characteristics and behavior has become recognized as one form of personal authentication. In biometric authentication, physical features of an individual, such as voice, fingerprint, retina, and veins, are used to authenticate that individual. Among these, speaker verification, a personal authentication technique based on voice characteristics, is expected to spread further for several reasons: the authentication system can be built from general-purpose equipment such as a microphone, the speaker needs no practice, and people are becoming less reluctant to talk to machines as personal assistant applications become widespread.

However, various techniques have also been found that can impersonate a speaker and defeat a speaker verification system by mechanically generating speech that closely resembles the speaker's voice. For example, Non-Patent Document 1 shows that speech synthesized from text data can in some cases deceive a speaker verification system. Non-Patent Document 2 shows that voice conversion, which transforms one voice so that it resembles the voice of a specific speaker, can also deceive a speaker verification system in some cases.

On the other hand, various techniques for detecting such spoofed speech have also been proposed. Non-Patent Documents 3 and 4 disclose methods that detect spoofed speech based on the fact that speech parameters change differently in synthesized or voice-converted speech than in natural speech.

Non-Patent Document 1: Lindberg, J. et al., "Vulnerability in speaker verification - a study of technical impostor techniques", Proc. European Conference on Speech Communication and Technology (Eurospeech), 1999.
Non-Patent Document 2: Kinnunen, T., Wu, Z. Z., Lee, K. A., Sedlak, F., Chng, E. S., Li, H., 2012. Vulnerability of speaker verification systems against voice conversion spoofing attacks: The case of telephone speech, in: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
Non-Patent Document 3: Satoh, T. et al., "A robust speaker verification system against imposture using an HMM-based speech synthesis system", Proc. European Conference on Speech Communication and Technology (Eurospeech), 2001.
Non-Patent Document 4: Wu, Z., Chng, E. S., Li, H., Detecting converted speech and natural speech for anti-spoofing attack in speaker recognition, in: Proc. Interspeech 2012.

The spoofing detection techniques disclosed in Non-Patent Documents 3 and 4 both operate at the transmission point, that is, at the stage where features are extracted from the speech waveform (FIG. 6). These techniques detect spoofing by discriminating spoofed speech from natural speech based on features of the speech waveform. However, speech synthesis and voice conversion become more accurate over time, and their difference from natural speech is shrinking. Spoofing detection at the transmission point therefore has the problem that it must be improved continually to keep pace with those advances.

Furthermore, spoofing detection at the transmission point does not detect whether the speaker who is supposed to be present is actually a living person; that is, it is not a living body detection technique. It therefore cannot be a fundamental solution to spoofing.

The present invention has been made to solve these problems, and its object is to provide a living body detection apparatus, a living body detection method, and a program capable of detecting a living body by voice.

Other problems and novel features will become apparent from the description in this specification and the accompanying drawings.

A living body detection apparatus according to the present invention includes a voice acquisition unit that acquires the voice of a speaker, a pop noise detection unit that detects pop noise in the voice, and a determination unit that determines, based on the pop noise detection result, whether the speaker is a living body.

A living body detection method according to the present invention includes a voice acquisition step of acquiring the voice of a speaker, a pop noise detection step of detecting pop noise in the voice, and a determination step of determining, based on the pop noise detection result, whether the speaker is a living body.

A program according to the present invention is a program for causing a computer to execute the above method.

According to the present invention, it is possible to provide a living body detection apparatus, a living body detection method, and a program capable of detecting a living body by voice.

FIG. 1 is a diagram showing the configuration of the living body detection apparatus 100 according to an embodiment of the present invention.
FIG. 2 is a diagram showing the operation of the living body detection apparatus 100 according to an embodiment of the present invention.
FIG. 3 is a diagram showing the operation of the living body detection apparatus 100 according to an embodiment of the present invention.
FIG. 4 is a diagram showing experimental results according to Embodiment 3 of the present invention.
FIG. 5 is a diagram showing the operation of the living body detection apparatus 100 according to an embodiment of the present invention.
FIG. 6 is a diagram showing the configuration of a general speaker recognition system.
FIG. 7 is a diagram showing an example of the voice acquisition unit 110 according to an embodiment of the present invention.
FIG. 8 is a diagram showing an example of the voice acquisition unit 110 according to an embodiment of the present invention.
FIG. 9 is a diagram showing an example of the voice acquisition unit 110 according to an embodiment of the present invention.
FIG. 10 is a diagram showing processing of the pop noise detection unit 130 in an embodiment of the present invention.
FIG. 11 is a diagram showing processing of the pop noise detection unit 130 in an embodiment of the present invention.
FIG. 12 is a diagram showing processing of the pop noise detection unit 130 in an embodiment of the present invention.
FIG. 13 is a diagram showing processing of the pop noise detection unit 130 in an embodiment of the present invention.
FIG. 14 is a diagram showing processing of the pop noise detection unit 130 in an embodiment of the present invention.
FIG. 15 is a diagram showing processing of the determination unit 150 in an embodiment of the present invention.
FIG. 16 is a diagram showing an example of a microphone used in an embodiment of the present invention.
FIG. 17 is a diagram showing experimental results according to Embodiment 5 of the present invention.

First, to facilitate understanding of the present invention, the features of the living body detection technique according to the present invention will be described in comparison with conventional spoofing detection techniques.

FIG. 6 is a diagram showing an outline of a general speaker recognition system. Unlike conventional spoofing detection at the transmission point, the technique according to the present invention performs living body detection at the microphone point, that is, at the stage where the voice is acquired by a microphone.

To perform living body detection by voice, it is necessary to focus on a feature that a living person can produce but that audio reproduced by a device cannot. The inventors therefore focused on how breath strikes the microphone when a person speaks. When speaking, a person exhales breath together with the sound, and a microphone can pick up the breath as well as the voice. It is known that when a microphone picks up a large amount of breath, a distinctive noise called pop noise occurs. Pop noise is noise as far as the speech is concerned, but from another point of view it is also information showing that the speech is being uttered by a living person, because pop noise cannot in principle be reproduced by a loudspeaker.

In view of these findings, the main feature of the present invention is that pop noise is detected in the voice acquired at the microphone point and the detection result is used to detect whether the speaker is a living body. Specific embodiments to which the present invention is applied are described in detail below with reference to the drawings.

Next, the basic configuration of the living body detection apparatus 100 according to an embodiment of the present invention will be described with reference to FIG. 1.

The living body detection apparatus 100 includes a voice acquisition unit 110, a pop noise detection unit 130, and a determination unit 150. It is typically an information processing apparatus that has a storage device storing a control program and the like, a control device that executes various processes based on the control program, and an input/output device that exchanges information with external devices. The voice acquisition unit 110, the pop noise detection unit 130, and the determination unit 150 are implemented as logical processing means using this hardware and the control program.

The voice acquisition unit 110 has a function of acquiring the voice uttered by a speaker. It typically includes a microphone that converts sound into an analog signal and a conversion unit that converts the analog signal into a digital signal and outputs it to the pop noise detection unit 130.

The voice acquisition unit 110 may consist of a single microphone or may include a plurality of microphones. With a single microphone, the microphone records both the speaker's voice and breath. With a plurality of microphones, one microphone (referred to as the first microphone) records both the speaker's voice and breath, while the other microphone (referred to as the second microphone) is configured to record, as far as possible, only the speaker's voice, with pickup of the breath suppressed, for example by attenuating it.

Configuration examples using a plurality of microphones include providing a microphone cover or pop filter only on the second microphone, and installing the first microphone and the second microphone at different locations (a microphone array).

The pop noise detection unit 130 detects, in the audio signal (digital signal) output by the voice acquisition unit 110, the pop noise component that arises when the microphone picks up the speaker's breath. The pop noise detection unit 130 may include a separation module 131, a feature extraction module 132, a phoneme alignment module 133, and an identification module 134.

The separation module 131 separates, from the audio signal output by the voice acquisition unit 110, the signal suspected of being a pop noise component. In other words, the separation module 131 is a module that performs audio signal processing. When the voice acquisition unit 110 consists of a single microphone, the separation module 131 separates the suspected pop noise signal from a single audio signal in which the voice component and the pop noise component are mixed. Specific separation methods include, for example, a low-pass filter (smoothing) and a method in which the audio signal is played through a loudspeaker and re-recorded after propagating through the air. In implementing the present invention, any of these methods, or another equivalent method, may be adopted as needed.
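The low-pass (smoothing) option for a single microphone can be sketched as follows. This is only an illustration under stated assumptions, not the implementation described in the patent: the moving-average filter, the 5 ms window, the synthetic test signal, and the name `extract_low_band` are all hypothetical choices made for brevity.

```python
import numpy as np

def extract_low_band(signal, sample_rate, window_sec=0.005):
    """Return the smoothed (low-frequency) part of a mono signal.

    A moving average acts as a crude low-pass filter. The window length
    is an assumed value, not one given in the source text.
    """
    signal = np.asarray(signal, dtype=float)
    width = max(1, int(round(window_sec * sample_rate)))
    kernel = np.ones(width) / width
    # mode="same" keeps the output aligned with the input samples
    return np.convolve(signal, kernel, mode="same")

# Synthetic input: a 1 kHz tone standing in for the voice, plus a
# 20 Hz burst between 0.4 s and 0.6 s standing in for breath.
sr = 16000
t = np.arange(sr) / sr
voice = 0.5 * np.sin(2 * np.pi * 1000 * t)
burst = np.where((t > 0.4) & (t < 0.6), np.sin(2 * np.pi * 20 * t), 0.0)
low = extract_low_band(voice + burst, sr)
```

In this toy signal the smoothed output is close to zero where only the tone is present and follows the burst where it occurs, which is the kind of low-band residue a later stage could examine.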

When the voice acquisition unit 110 consists of a plurality of microphones, the separation module 131 separates the signal suspected of being a pop noise component by using the output signal of the first microphone (referred to as the first sound), which contains both the voice component and the pop noise component, and the output signal of the second microphone (referred to as the second sound), which contains the voice component but in which the pop noise component is suppressed. Specific separation methods include, for example, a linear filter and spectral subtraction. In implementing the present invention, any of these methods, or another equivalent method, may be adopted as needed.
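As one possible reading of the spectral subtraction option, the sketch below subtracts the frame-wise magnitude spectrum of the second sound from that of the first sound. The frame length, hop size, Hann window, and function names are assumptions for illustration, and the second microphone is idealised as removing the breath completely.

```python
import numpy as np

def frame_magnitudes(signal, frame_len=512, hop=256):
    """Magnitude spectrum of each Hann-windowed frame (frames x bins)."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def spectral_subtraction(first_sound, second_sound, frame_len=512, hop=256):
    """Keep the spectral magnitude present in first_sound but not in second_sound."""
    mag_first = frame_magnitudes(first_sound, frame_len, hop)
    mag_second = frame_magnitudes(second_sound, frame_len, hop)
    # Negative differences carry no pop-noise evidence, so floor them at zero.
    return np.maximum(mag_first - mag_second, 0.0)

# Synthetic two-microphone recording of the same utterance.
sr = 16000
t = np.arange(sr) / sr
voice = 0.5 * np.sin(2 * np.pi * 1000 * t)
breath = np.where((t >= 0.4) & (t < 0.6), np.sin(2 * np.pi * 30 * t), 0.0)
first_sound = voice + breath   # uncovered microphone: voice and breath
second_sound = voice           # covered microphone (idealised)
residual = spectral_subtraction(first_sound, second_sound)
frame_energy = residual.sum(axis=1)  # per-frame size of the difference
```

Frames where `frame_energy` is large are candidates for pop noise. Real recordings would also need time alignment and gain matching between the two microphones, which this sketch leaves out.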

The feature extraction module 132 extracts features from the signal, separated by the separation module 131, that is suspected of being a pop noise component. The audio signal is time-series information, which is difficult for the downstream identification module 134 to handle. The feature extraction module 132 therefore converts the time-series information into frequency-based information. Specific feature extraction methods include, for example, MFCC (Mel-Frequency Cepstrum Coefficients), MFCC extended to the low-frequency range, and MFCC extended to the low-frequency range combined with dimensionality reduction. In implementing the present invention, any of these methods, or another equivalent method, may be adopted as needed. Use of the feature extraction module 132 is optional; the minimum object of the present invention can be achieved without it.

The phoneme alignment module 133 divides the feature-converted audio signal into phoneme units (segments). This makes it possible to verify the correspondence between phonemes and the locations where pop noise occurs, so the influence of pop-noise-like components caused by wind and the like can be excluded and pop noise can be recognized more accurately. Specific methods include, for example, using the phoneme alignment function provided by an external automatic speech recognizer (ASR: Automatic Speech Recognition), and identifying the segments of the input speech by acoustic comparison (DTW: Dynamic Time Warping) between the input speech and a reference utterance whose segment boundaries are known in advance. In implementing the present invention, any of these methods, or another equivalent method, may be adopted as needed. When the downstream identification module 134 uses a model that can itself recognize phoneme boundaries, such as an HMM (Hidden Markov Model), no separate component corresponding to the phoneme alignment module 133 is needed. Use of the phoneme alignment module 133 is optional; the minimum object of the present invention can be achieved without it.
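The DTW-based option can be illustrated with a small sketch that aligns an input feature sequence to a reference whose segment boundary is known and then carries the boundary over. The Euclidean frame distance, the three-way step pattern, and the names `dtw_path` and `map_boundary` are assumptions; the source does not specify them.

```python
import numpy as np

def dtw_path(reference, query):
    """Align two feature sequences; return a list of (ref_idx, query_idx) pairs."""
    ref = np.asarray(reference, dtype=float)
    qry = np.asarray(query, dtype=float)
    if ref.ndim == 1:
        ref = ref[:, None]
    if qry.ndim == 1:
        qry = qry[:, None]
    n, m = len(ref), len(qry)
    # Pairwise Euclidean distance between every reference and query frame
    dist = np.linalg.norm(ref[:, None, :] - qry[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end along the cheapest predecessors
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1, j - 1], i - 1, j - 1),
                 (cost[i - 1, j], i - 1, j),
                 (cost[i, j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])
    return path[::-1]

def map_boundary(path, ref_boundary):
    """First query frame aligned to the reference frame at ref_boundary."""
    return min(q for r, q in path if r == ref_boundary)

# Reference with a known segment boundary at frame 3; the query is the
# same pattern spoken more slowly.
reference = [0, 0, 0, 1, 1, 1]
query = [0, 0, 0, 0, 0, 1, 1, 1, 1]
path = dtw_path(reference, query)
query_boundary = map_boundary(path, 3)
```

With real speech the sequences would be multi-dimensional feature frames (for example cepstral vectors) rather than scalars, but the alignment logic is the same.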

The identification module 134 receives the features extracted by the feature extraction module 132, or the features segmented by the phoneme alignment module 133, and identifies whether they represent pop noise. Possible identification methods include, for example, statistical models such as a GMM (Gaussian mixture model), machine learning models such as an SVM (support vector machine), HMMs, and linear discriminant models. When phoneme alignment is performed, the identification module 134 identifies the presence or absence of pop noise segment by segment. When it is not, the identification module 134 identifies the presence or absence of a pop noise component in unsegmented units, for example sentence by sentence. In either case, by learning in advance the features of voice components that contain no pop noise, the identification module 134 can detect a pop noise component, which is in effect an outlier, when one is input. In implementing the present invention, any of these methods, or another equivalent method, may be adopted as needed.

The separation module 131, feature extraction module 132, phoneme alignment module 133, and identification module 134 described above can be used in any combination.

Based on the pop noise detection result from the pop noise detection unit 130, the determination unit 150 determines whether the speaker of the voice acquired by the voice acquisition unit 110 is a living person.

<Embodiment 1>
The first embodiment is a configuration example of the present invention in which the voice acquisition unit 110 includes a plurality of microphones and one of them is fitted with a microphone cover, a pop filter, or the like. The phoneme alignment module 133 is not used in this embodiment.

The voice acquisition unit 110 includes a plurality of microphones. One microphone (the first microphone) is intended to pick up the speaker's voice and breath as faithfully as possible. It is typically a microphone without a microphone cover or pop filter. Alternatively, the first microphone may be fitted with a perforated windscreen so that it still captures the pop noise that accompanies speech while reducing the influence of the surroundings, such as wind (FIG. 16).

The other microphone (the second microphone) is intended to suppress pickup of the speaker's breath and to pick up, as far as possible, only the speaker's voice. It is typically a microphone fitted with a microphone cover or pop filter. Microphone covers and pop filters are generally made of sponge, mesh, or the like; covering the diaphragm with them reduces the pop noise that occurs when breath vibrates the diaphragm directly.

FIGS. 7 to 9 show configuration examples of the voice acquisition unit 110. FIG. 7 is a configuration example using a stereo microphone, FIG. 8 is a configuration example using a plurality of condenser microphones, and FIG. 9 is a configuration example using a plurality of headset microphones. Each of these includes one or more first microphones without a microphone cover and one or more second microphones with a microphone cover.

The voice acquisition unit 110 may also include three or more microphones, as in an array microphone. In that case, some of the microphones in the array are left without microphone covers and treated as first microphones, and the remaining microphones are fitted with microphone covers and treated as second microphones.

Next, the operation of the living body detection apparatus 100 according to the first embodiment will be described with reference to the flowcharts of FIGS. 2 and 3.

S101: Voice acquisition
The voice acquisition unit 110 acquires the speaker's voice. In this embodiment, the voice acquisition unit 110 has a first microphone and a second microphone, and both microphones simultaneously acquire the same speech uttered by the speaker.

The conversion unit of the voice acquisition unit 110 converts the sound acquired by the first microphone and the sound acquired by the second microphone into separate digital signals. Here, the audio data originating from the first microphone is referred to as the first sound, and the audio data originating from the second microphone is referred to as the second sound. The conversion unit outputs the first sound and the second sound to the pop noise detection unit 130.

S104: Pop noise detection
First, the separation module 131 of the pop noise detection unit 130 compares the first sound with the second sound, separates the frequency components contained only in the first sound, and outputs them as a difference signal. This extracts the audio signal that may be pop noise. The difference signal can be extracted using known methods such as a linear filter, spectral subtraction, independent component analysis, independent vector analysis, and blind source separation. A method that detects low-band fluctuations using a low-pass filter may also be used.

FIG. 10 shows the first sound and the second sound superimposed. FIG. 11 shows the difference signal between the first sound and the second sound. The amplitude component (unit: dB) that stands out near 1.0 (unit: seconds) on the horizontal (time) axis is regarded as an outlier relative to the audio signal corresponding to the speaker's voice, and may be pop noise.

Next, the feature extraction module 132 of the pop noise detection unit 130 extracts features from the difference signal. In other words, it derives features, which are frequency-based information, from the difference signal, which is time-series information. The features can be extracted by a known method such as MFCC (Mel-Frequency Cepstrum Coefficients). Alternatively, because MFCC is designed to discriminate sound at 70 Hz and above, a method that extends the discriminable range to frequencies below 70 Hz may be adopted. This is because pop noise generally has a lower frequency than ordinary speech. Furthermore, because extending the frequency range in this way increases the number of feature dimensions and generally lowers discrimination performance, dimensionality reduction of the features may be used in combination. This allows MFCC to cover the low-frequency range while maintaining discrimination performance.
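The difference between a conventional MFCC front end and one extended below 70 Hz can be sketched by moving the lower edge of the mel filterbank. The code below is a simplified, NumPy-only illustration: the 70 Hz edge comes from the text, while the filter count, FFT size, mel formula, and function names are assumptions, and the dimensionality reduction step is not shown.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate, f_min, f_max):
    """Triangular filters spaced evenly on the mel scale between f_min and f_max."""
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max),
                                     n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, len(freqs)))
    for k in range(n_filters):
        lo, mid, hi = edges_hz[k], edges_hz[k + 1], edges_hz[k + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc_frame(frame, sample_rate, n_filters=24, n_ceps=13, f_min=0.0, f_max=None):
    """Cepstral coefficients of one frame; f_min sets the lowest analysed frequency."""
    frame = np.asarray(frame, dtype=float)
    n_fft = len(frame)
    f_max = sample_rate / 2 if f_max is None else f_max
    power = np.abs(np.fft.rfft(frame * np.hanning(n_fft))) ** 2
    energies = mel_filterbank(n_filters, n_fft, sample_rate, f_min, f_max) @ power
    log_energies = np.log(energies + 1e-10)
    # DCT-II written out explicitly to avoid extra dependencies
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ log_energies

sr = 16000
t = np.arange(2048) / sr
frame = np.sin(2 * np.pi * 30 * t)           # 30 Hz component standing in for pop noise
standard = mfcc_frame(frame, sr, f_min=70.0)  # filters start at 70 Hz
extended = mfcc_frame(frame, sr, f_min=0.0)   # filters extended below 70 Hz
bank_std = mel_filterbank(24, 2048, sr, 70.0, sr / 2)
bank_ext = mel_filterbank(24, 2048, sr, 0.0, sr / 2)
```

With `f_min=70.0` no filter has any weight below 70 Hz, whereas with `f_min=0.0` the lowest filters cover that band directly, which is where the text says pop noise tends to lie.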

The identification module 134 of the pop noise detection unit 130 inputs the extracted features of the difference signal to a classifier. Here, the identification module 134 may divide the features of the difference signal into arbitrary segments before inputting them to the classifier. The segments referred to here need not be phoneme-level segments such as those handled by the phoneme alignment module 133.

For example, the pop noise detection unit 130 may have two classifiers trained in advance, one on sentences that contain pop noise and one on sentences that do not. The identification module 134 inputs the features of the difference signal to these classifiers and obtains a likelihood from each as output. If there is a significant difference between the likelihood output by the classifier trained on sentences containing pop noise and the likelihood output by the classifier trained on sentences without pop noise, the difference signal can be judged to contain pop noise. An appropriate threshold can be set in advance for how large the difference between the two likelihoods must be for pop noise to be judged present.
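The two-classifier comparison can be sketched as a log-likelihood difference tested against a threshold. To keep the example self-contained, each classifier is replaced here by a single diagonal-covariance Gaussian, which is a simplification and not the GMM, SVM, or HMM named elsewhere in the text; the class name, threshold value, and synthetic training data are likewise assumptions.

```python
import numpy as np

class DiagonalGaussian:
    """Single diagonal-covariance Gaussian, used as a stand-in for a trained model."""

    def fit(self, features):
        x = np.asarray(features, dtype=float)
        self.mean = x.mean(axis=0)
        self.var = x.var(axis=0) + 1e-6  # small floor avoids division by zero
        return self

    def log_likelihood(self, features):
        """Average per-frame log-likelihood of a feature sequence."""
        x = np.asarray(features, dtype=float)
        per_dim = -0.5 * (np.log(2 * np.pi * self.var)
                          + (x - self.mean) ** 2 / self.var)
        return float(per_dim.sum(axis=1).mean())

def contains_pop_noise(features, pop_model, clean_model, threshold=0.0):
    """Judge pop noise present when the likelihood difference exceeds the threshold."""
    diff = pop_model.log_likelihood(features) - clean_model.log_likelihood(features)
    return diff > threshold

# Synthetic 4-dimensional features; real features would come from the
# difference signal of recorded utterances.
rng = np.random.default_rng(0)
pop_model = DiagonalGaussian().fit(rng.normal(3.0, 1.0, size=(200, 4)))
clean_model = DiagonalGaussian().fit(rng.normal(0.0, 1.0, size=(200, 4)))

utterance_with_pop = rng.normal(3.0, 1.0, size=(50, 4))
utterance_without_pop = rng.normal(0.0, 1.0, size=(50, 4))
detected_a = contains_pop_noise(utterance_with_pop, pop_model, clean_model)
detected_b = contains_pop_noise(utterance_without_pop, pop_model, clean_model)
```

In practice the threshold would be tuned on held-out data rather than fixed at zero.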

識別器としては、例えば２クラスのパターン識別器であるＳＶＭのほか、ＧＭＭ、ＨＭＭ等、公知の構成を適宜採用できる。なお、識別器による判定を行う場合には、予め識別器にポップノイズを含む文章及びポップノイズを含まない文章夫々のモデルを学習させる工程が必要となる。この工程については後述する。 As the discriminator, for example, a well-known configuration such as GMM, HMM, etc., as well as SVM, which is a two-class pattern discriminator, can be adopted as appropriate. Note that, when the determination by the discriminator is performed, a step is required in which the discriminator learns in advance a model for each sentence including pop noise and each sentence not including pop noise. This process will be described later.

また、ポップノイズ成分の検出は音声区間検出（ＶＡＤ）と枠組みが近いため、話者照合の分野だけでなくＶＡＤの分野で使われる任意の公知の手法を利用しても良い。 Further, since the detection of the pop noise component is similar in framework to the voice interval detection (VAD), any known method used in the field of VAD as well as the field of speaker verification may be used.

ポップノイズ検出部１３０の識別モジュール１３４は、差分信号をポップノイズと識別した場合、ポップノイズを検出した旨判断部１５０に出力する。判断部１５０は、ポップノイズの検出結果に基づいて、話者が生きた人間であるか否かを判断する。典型的には、ポップノイズが検出された場合には話者は生体であると判断し、Ｓ１０５に遷移する。一方、ポップノイズが検出されなかった場合には話者は生体でないものと判断し、Ｓ１０６に遷移する。 When the identification module 134 of the pop noise detection unit 130 identifies the differential signal as pop noise, the identification module 134 outputs to the determination unit 150 that pop noise has been detected. The determination unit 150 determines whether the speaker is a living person based on the detection result of the pop noise. Typically, when pop noise is detected, it is determined that the speaker is a living body, and the process proceeds to S105. On the other hand, if pop noise is not detected, it is determined that the speaker is not a living body, and the process proceeds to S106.

Ｓ１０５：話者照合
話者が生体であると判断された場合、生体検知装置１００は任意の手法を用いた話者照合フェーズに移行することができる。話者照合については種々の手法が既知であるため、ここでは詳細な説明を省略する。なお、好ましくは、話者照合に際してはポップノイズ成分が比較的少ない第２の音声を用いることができる。 S105: Speaker verification When it is determined that the speaker is a living body, the living body detection apparatus 100 can shift to a speaker verification phase using an arbitrary method. Since various methods are known for speaker verification, detailed description thereof is omitted here. Preferably, the second voice having a relatively small pop noise component can be used for speaker verification.

Ｓ１０６：詐称音声として棄却
話者が生体でないと判断された場合、音声取得部１１０が取得した音声は人が発したものではなく、例えば合成や声質変換による音声である蓋然性が高い。よって、生体検知装置１００はこれを詐称音声と判断し、話者照合を行うことなく棄却する。すなわち、エラー処理や終了処理等を行う。 S106: Rejected as a spoofed voice When it is determined that the speaker is not a living body, the voice acquired by the voice acquisition unit 110 is not generated by a person, but has a high probability of being a voice generated by synthesis or voice quality conversion, for example. Therefore, the living body detection apparatus 100 determines that this is a spoofed voice and rejects it without performing speaker verification. That is, error processing, end processing, and the like are performed.

ここで、図３を用いて、識別器によるポップノイズ判定を行う場合に必要な、事前学習工程について説明する。ここでは、学習器及び識別器としてＳＶＭを用いる場合を例として説明する。 Here, the pre-learning process required when performing the pop noise determination by the discriminator will be described with reference to FIG. Here, the case where SVM is used as a learning device and a discriminator will be described as an example.

Ｓ２０１：音声取得
Ｓ１０１と同様に、音声取得部１１０が、話者の音声を取得する。 S201: Voice acquisition As in S101, the voice acquisition unit 110 acquires the voice of the speaker.

Ｓ２０２：ポップノイズ／非ポップノイズモデルの学習
まず、Ｓ１０４と同様に、ポップノイズ検出部１３０の分離モジュール１３１が、第１の音声と第２の音声とを比較し、第１の音声のみに含まれる周波数を分離し、差分信号として出力する。次いで、特徴量化モジュール１３２が、差分信号を特徴量化する。そして、識別モジュール１３４は、差分信号の特徴量と、それがポップノイズである旨を示す教師信号と、を共に一方の学習器に入力する。また、特徴量化モジュール１３２は、ポップノイズ成分を含まない第２の音声を特徴量化する。そして、識別モジュール１３４は、第２の音声の特徴量と、それが非ポップノイズである旨を示す教師信号と、を共に他方の学習器に入力する。すなわち、本実施の形態では、ポップノイズ／非ポップノイズそれぞれの尤度を判定する学習器を１つずつ、合計２つ生成する。 S202: Learning of pop noise / non-pop noise models First, as in S104, the separation module 131 of the pop noise detection unit 130 compares the first sound with the second sound, separates the frequencies contained only in the first sound, and outputs them as a difference signal. Next, the feature quantity module 132 converts the difference signal into a feature quantity. The identification module 134 then inputs the feature quantity of the difference signal, together with a teacher signal indicating that it is pop noise, to one learner. The feature quantity module 132 also converts the second sound, which contains no pop noise component, into a feature quantity. The identification module 134 then inputs the feature quantity of the second sound, together with a teacher signal indicating that it is non-pop noise, to the other learner. In other words, in the present embodiment two learners are generated in total, one for judging the likelihood of pop noise and one for judging the likelihood of non-pop noise.

学習器としてＳＶＭ又はＧＭＭを用いる場合は、差分信号の特徴量のうちポップノイズ成分にあたるセグメントを事前に切り出しておき、切り出されたセグメントを学習器に入力することが好ましい。一方、学習器としてＨＭＭを用いる場合は、モデル自体が音素境界を自動的に認識する機能を有するため、上述のような切り出し処理は特段不要である。 When SVM or GMM is used as a learning device, it is preferable that a segment corresponding to the pop noise component is cut out in advance from the feature quantity of the difference signal, and the extracted segment is input to the learning device. On the other hand, when the HMM is used as a learning device, the model itself has a function of automatically recognizing a phoneme boundary, and thus the above-described clipping process is not particularly necessary.

Ｓ２０１乃至Ｓ２０２に係る処理を複数回繰り返すことにより、学習器内に、ポップノイズ、非ポップノイズ音声それぞれのモデルが形成される。これにより、差分信号の特徴量の入力に応じ、該当するモデルを出力する識別器が形成される。 By repeating the processes according to S201 to S202 a plurality of times, models of pop noise and non-pop noise speech are formed in the learning device. As a result, a discriminator that outputs a corresponding model in response to the input of the feature amount of the difference signal is formed.

本実施の形態によれば、音声取得部１１０が、Ｍｉｃｒｏｐｈｏｎｅｐｏｉｎｔにおいて、生体検知に不可欠な情報を取得する。これにより、Ｔｒａｎｓｍｉｓｓｉｏｎｐｏｉｎｔにおけるなりすまし検知では不可能であった生体検知を実現することができる。 According to the present embodiment, the voice acquisition unit 110 acquires, at the microphone point, information indispensable for living body detection. This realizes living body detection that was not possible with spoofing detection at the transmission point.

また、本実施の形態によれば、判断部１５０は、ポップノイズ検出部１３０によるポップノイズ検出結果に基づいて、生体検知を行う。ポップノイズはスピーカでは原理的に再現不能な現象であるので、これにより、話者の詐称に頑健な生体認証を実現することができる。 Further, according to the present embodiment, the determination unit 150 performs living body detection based on the pop noise detection result by the pop noise detection unit 130. Since pop noise is a phenomenon that cannot be reproduced in principle by a speaker, it is possible to realize biometric authentication that is robust against speaker misrepresentation.

また、本実施の形態によれば、音声取得部１１０は、複数のマイクロフォンを用いることで、ポップノイズを含む音声信号及び含まない音声信号を出力する。これにより、ポップノイズ検出部１３０は、公知の分離技術を適用して効率的にポップノイズ成分を分離することができるようになった。 Further, according to the present embodiment, the voice acquisition unit 110 outputs a voice signal including pop noise and a voice signal not including the pop noise by using a plurality of microphones. As a result, the pop noise detection unit 130 can efficiently separate the pop noise component by applying a known separation technique.

＜実施の形態２＞ <Embodiment 2>
実施の形態２は、音声取得部１１０が複数のマイクロフォンを備え、それらのマイクロフォンをそれぞれ異なる空間に配置した構成例である。その余の構成については、実施の形態１と同様である。 The second embodiment is a configuration example in which the voice acquisition unit 110 includes a plurality of microphones, and these microphones are arranged in different spaces. The rest of the configuration is the same as in the first embodiment.

本実施の形態では、音声取得部１１０が有する複数のマイクロフォンのうち、一方のマイクロフォン（第１のマイクロフォン）は、話者の声とともにポップノイズを拾うことを目的とするため、話者の息が直接かかりやすい位置に配置される。例えば、第１のマイクロフォンは話者に正対する位置に配置される。他方のマイクロフォン（第２のマイクロフォン）は、ポップノイズを拾うことを抑制し、可能な限り話者の声だけを拾うことを目的とするため、話者の息が直接かかりにくい位置に配置される。例えば、第２のマイクロフォンは話者の側方や、第１のマイクロフォンよりも離れた位置に配置される。なお、本実施の形態においては、第２のマイクロフォンには必ずしもマイクカバーやポップフィルタを備えることを要しない。 In the present embodiment, one of the plurality of microphones included in the voice acquisition unit 110 (the first microphone) is intended to pick up pop noise along with the speaker's voice, so it is placed where the speaker's breath readily strikes it directly. For example, the first microphone is arranged at a position directly facing the speaker. The other microphone (the second microphone) is intended to suppress the pickup of pop noise and to pick up only the speaker's voice as far as possible, so it is placed where the speaker's breath is unlikely to strike it directly. For example, the second microphone is arranged to the side of the speaker or at a position farther away than the first microphone. In the present embodiment, the second microphone does not necessarily need a microphone cover or a pop filter.

例えば、音声取得部１１０としてアレイマイクを用いる場合は、話者の近くに位置するマイクロフォンを第１のマイクロフォン、第１のマイクロフォンよりも話者から遠くに位置するマイクロフォンを第２のマイクロフォンとして扱うことができる。 For example, when an array microphone is used as the voice acquisition unit 110, a microphone located near the speaker can be treated as the first microphone, and a microphone located farther from the speaker than the first microphone can be treated as the second microphone.

本実施の形態によれば、音声取得部１１０は、マイクカバーやポップフィルタを用いることなく、実施の形態１と同様のポップノイズ検出処理を実現することができる。 According to the present embodiment, the audio acquisition unit 110 can realize the same pop noise detection process as that of the first embodiment without using a microphone cover or a pop filter.

＜実施の形態３＞ <Embodiment 3>
実施の形態３は、音声取得部１１０が単一のマイクロフォンを備え、分離モジュール１３１としてローパスフィルタを採用した構成例である。その余の構成については、実施の形態１と同様である。 The third embodiment is a configuration example in which the voice acquisition unit 110 includes a single microphone and a low-pass filter is employed as the separation module 131. The rest of the configuration is the same as in the first embodiment.

実施の形態３における音声取得部１１０は、１本のマイクロフォンにより構成される。このマイクロフォンは、話者の声やポップノイズを可能な限りそのまま拾うことを目的とする。したがって、話者の息がかかりやすい位置に配置された、マイクカバーやポップフィルタを備えていないマイクロフォンであることが好ましい。あるいは、第１のマイクロフォンは、風など周囲の影響を軽減しつつ、発話に伴うポップノイズのみ取得できるよう、穴あきウィンドスクリーンを備えるものであっても良い。 The voice acquisition unit 110 in the third embodiment is configured by a single microphone. The purpose of this microphone is to pick up as much of the speaker's voice and pop noise as possible. Therefore, it is preferable that the microphone is not provided with a microphone cover or a pop filter, which is disposed at a position where the speaker can easily breathe. Alternatively, the first microphone may be provided with a perforated windscreen so that only the pop noise accompanying the speech can be acquired while reducing the influence of the surroundings such as wind.

また、実施の形態３における分離モジュール１３１は、ローカットフィルタ及びローパスフィルタを備える。 In addition, the separation module 131 in the third embodiment includes a low cut filter and a low pass filter.

次いで、実施の形態３の特徴的な動作について説明する。 Next, a characteristic operation of the third embodiment will be described.

Ｓ１０１：音声取得
音声取得部１１０が、話者の音声を取得する。本実施の形態では、音声取得部１１０は１本のマイクロフォンである。音声取得部１１０の変換部は、マイクロフォンが取得した音声をデジタル信号に変換してポップノイズ検出部１３０に出力する。 S101: Voice acquisition The voice acquisition unit 110 acquires the voice of a speaker. In the present embodiment, voice acquisition unit 110 is a single microphone. The conversion unit of the sound acquisition unit 110 converts the sound acquired by the microphone into a digital signal and outputs the digital signal to the pop noise detection unit 130.

Ｓ１０４：ポップノイズ検出
ポップノイズ検出部１３０の分離モジュール１３１は、ローカットフィルタを利用して、音声取得部１１０が出力する音声信号から音声成分のみを抽出する。ポップノイズ検出部１３０は、抽出された音声成分を、実施の形態１における第２の音声と同等のものとして利用する。また、ポップノイズ検出部１３０は、ローパスフィルタを利用して、音声取得部１１０が出力する音声信号からノイズ成分のみを抽出する。ポップノイズ検出部１３０は、抽出されたノイズ成分を、実施の形態１における差分信号と同等のものとして利用できる。 S104: Pop Noise Detection The separation module 131 of the pop noise detection unit 130 uses the low cut filter to extract only the audio component from the audio signal output from the audio acquisition unit 110. Pop noise detection unit 130 uses the extracted audio component as equivalent to the second audio in the first embodiment. Further, the pop noise detection unit 130 extracts only a noise component from the audio signal output from the audio acquisition unit 110 using a low-pass filter. The pop noise detection unit 130 can use the extracted noise component as an equivalent of the difference signal in the first embodiment.

ローパスフィルタは、音声波形を平滑化するが、音声信号に含まれる異常値は残す性質がある。そのため、ローパスフィルタにより、ポップノイズを含む音声信号から、ポップノイズ成分を顕出させることができる。 The low-pass filter smoothes the speech waveform, but has a property of leaving an abnormal value included in the speech signal. Therefore, the pop noise component can be revealed from the audio signal including the pop noise by the low-pass filter.
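The single-microphone separation in S104 can be sketched as a pair of complementary filters. This is an illustration only: it uses an idealized FFT brick-wall filter rather than whatever filter design the inventors used, and the 70 Hz cutoff is an assumption borrowed from the MFCC remark in the first embodiment. The function names are hypothetical.

```python
import numpy as np

def fft_filter(x, sr, cutoff_hz, keep="low"):
    """Idealized filter: zero every FFT bin on one side of cutoff_hz."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    if keep == "low":
        spectrum[freqs > cutoff_hz] = 0.0   # low-pass: keep the low band
    else:
        spectrum[freqs <= cutoff_hz] = 0.0  # low-cut: remove the low band
    return np.fft.irfft(spectrum, n=len(x))

def split_single_mic(x, sr, cutoff_hz=70.0):
    """Return (voice, noise) from one microphone signal.

    voice: low-cut output, used like the "second sound" of Embodiment 1.
    noise: low-pass output, used like the "difference signal".
    """
    noise = fft_filter(x, sr, cutoff_hz, keep="low")
    voice = fft_filter(x, sr, cutoff_hz, keep="high")
    return voice, noise
```

Because the two filters are complementary, `voice + noise` reconstructs the input; a practical system would more likely use a causal filter with a gradual roll-off.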

そして、ポップノイズ検出部１３０は、実施の形態１と同様に識別器を用いて、抽出したノイズ成分がポップノイズであるか否かを判定する。すなわち、特徴量化モジュール１３２が、実施の形態１における差分信号の代わりにローパスフィルタにより抽出されたノイズ成分を特徴量化し、好ましくは幾つかのセグメントに分割する。そして、識別モジュール１３４が、２つの識別器にノイズ成分の特徴量を入力し、出力される尤度の差に基づいてポップノイズの存在の有無を判定する。 Then, as in the first embodiment, the pop noise detection unit 130 uses the discriminators to determine whether the extracted noise component is pop noise. That is, the feature quantity module 132 converts the noise component extracted by the low-pass filter, in place of the difference signal of the first embodiment, into a feature quantity, and preferably divides it into several segments. The identification module 134 then inputs the feature quantity of the noise component to the two discriminators and determines whether pop noise is present based on the difference between the output likelihoods.

Ｓ１０５乃至Ｓ１０６： S105 to S106:
実施の形態１と同様に動作する。 The operation is the same as in the first embodiment.

本実施の形態の事前学習工程における動作は以下のとおりである。 The operation in the pre-learning process of the present embodiment is as follows.
Ｓ２０１：音声取得 S201: Voice acquisition
Ｓ１０１と同様に、音声取得部１１０が、話者の音声を取得する。 As in S101, the voice acquisition unit 110 acquires the voice of the speaker.

Ｓ２０２：ポップノイズ／非ポップノイズモデルの学習
まず、Ｓ１０４と同様に、分離モジュール１３１が、ローパスフィルタを使用して音声信号からノイズ成分を抽出する。次いで、特徴量化モジュール１３２がノイズ信号を特徴量化する。そして、識別モジュール１３４が、ノイズ信号の特徴量と、それがポップノイズである旨を示す教師信号とを共に一方の学習器に入力する。また、分離モジュール１３１が、ローカットフィルタを使用して音声信号から音声成分を抽出する。特徴量化モジュール１３２は、音声成分を特徴量化する。そして、識別モジュール１３４は、音声成分の特徴量と、それが非ポップノイズである旨を示す教師信号と、を共に他方の学習器に入力する。その余の動作については、実施の形態１と同様である。 S202: Learning of Pop Noise / Non-Pop Noise Model First, as in S104, the separation module 131 extracts a noise component from an audio signal using a low-pass filter. Next, the feature quantity conversion module 132 converts the noise signal into a feature quantity. Then, the identification module 134 inputs both the feature quantity of the noise signal and the teacher signal indicating that it is pop noise to one learning device. Further, the separation module 131 extracts a sound component from the sound signal using a low cut filter. The feature amount module 132 converts the sound component into a feature amount. Then, the identification module 134 inputs both the feature amount of the speech component and the teacher signal indicating that it is non-pop noise to the other learning device. The other operations are the same as those in the first embodiment.

（実験結果） (Experimental results)
図４に、実施の形態３の構成を用いた実証実験結果を示す。発明者は、Ｆ００１乃至Ｆ０１０の１０人の話者を対象として、ポップノイズの検出を試行した。その結果、すべての話者において、ポップノイズを含む文章を学習した識別器が出力する尤度（「ポップノイズあり音声」）が、ポップノイズを含まない文章を学習した識別器が出力する尤度（「ポップノイズなし音声」）を上回った。すなわち、話者の入力音声がポップノイズを含むものであることを、高精度で検出することが可能であることがわかった。 FIG. 4 shows the results of a demonstration experiment using the configuration of the third embodiment. The inventor attempted pop noise detection for ten speakers, F001 to F010. As a result, for every speaker, the likelihood output by the discriminator trained on sentences containing pop noise ("voice with pop noise") exceeded the likelihood output by the discriminator trained on sentences not containing pop noise ("voice without pop noise"). In other words, it was found that the presence of pop noise in a speaker's input voice can be detected with high accuracy.

本実施の形態によれば、音声取得部１１０は単一のマイクロフォンで構成され、ポップノイズ検出部１３０がローカットフィルタ及びローパスフィルタを用いて音声成分及びノイズ成分を抽出する。これにより、複数のマイクロフォンを使用する場合に比べ簡素な構成で生体検知を実現できる。 According to the present embodiment, the voice acquisition unit 110 is configured with a single microphone, and the pop noise detection unit 130 extracts a voice component and a noise component using a low cut filter and a low pass filter. Thereby, a living body detection is realizable with a simple structure compared with the case where a plurality of microphones are used.

＜実施の形態４＞ <Embodiment 4>
実施の形態４は、音声取得部１１０が話者の音声を直接取得するための単一のマイクロフォンを備え、分離モジュール１３１としてスピーカ及びマイクロフォンを採用した構成例である。その余の構成については、実施の形態１と同様である。 The fourth embodiment is a configuration example in which the voice acquisition unit 110 includes a single microphone for directly acquiring the speaker's voice, and a loudspeaker and a microphone are employed as the separation module 131. The rest of the configuration is the same as in the first embodiment.

実施の形態４における音声取得部１１０は、話者の音声を直接取得するための１本のマイクロフォン（第１のマイクロフォン）により構成される。第１のマイクロフォンは、話者の声やポップノイズを可能な限りそのまま拾うことを目的とする。したがって、話者の息がかかりやすい位置に配置された、マイクカバーやポップフィルタを備えていないマイクロフォンであることが好ましい。あるいは、第１のマイクロフォンは、風など周囲の影響を軽減しつつ、発話に伴うポップノイズのみ取得できるよう、穴あきウィンドスクリーンを備えるものであっても良い。 The voice acquisition unit 110 according to the fourth embodiment includes a single microphone (first microphone) for directly acquiring a speaker's voice. The first microphone aims to pick up the voice of the speaker and pop noise as much as possible. Therefore, it is preferable that the microphone is not provided with a microphone cover or a pop filter, which is disposed at a position where the speaker can easily breathe. Alternatively, the first microphone may be provided with a perforated windscreen so that only the pop noise accompanying the speech can be acquired while reducing the influence of the surroundings such as wind.

また、ポップノイズ検出部１３０の分離モジュール１３１は、第１のマイクロフォンによる収録音声を出力するスピーカと、スピーカが出力し空気伝播した音声を収録する第２のマイクロフォンとを備える。なお、第２のマイクロフォンによる音声収録は生体検知装置１００の内部で実施すれば良いため、スピーカ及び第２のマイクロフォンは話者に対して露出している必要はない。 In addition, the separation module 131 of the pop noise detection unit 130 includes a speaker that outputs recorded sound from the first microphone, and a second microphone that records sound that is output from the speaker and propagated in the air. Note that since the sound recording by the second microphone may be performed inside the living body detection apparatus 100, the speaker and the second microphone do not need to be exposed to the speaker.

次いで、実施の形態４の特徴的な動作について説明する。 Next, a characteristic operation of the fourth embodiment will be described.

Ｓ１０１：音声取得
音声取得部１１０の第１のマイクロフォンが、話者の音声を取得する。本実施の形態では、第１のマイクロフォンのみが話者の音声を直接収録する。第１のマイクロフォンが取得する音声は、話者が発生させるポップノイズを含むものである。 S101: Voice acquisition The first microphone of the voice acquisition unit 110 acquires the voice of the speaker. In the present embodiment, only the first microphone directly records the voice of the speaker. The voice acquired by the first microphone includes pop noise generated by a speaker.

次に、分離モジュール１３１のスピーカが、第１のマイクロフォンが収録した話者の音声を再生する。そして、第２のマイクロフォンが、スピーカから再生された音声を収録する。スピーカから出力される音声は、原理的にポップノイズを発生させないので、第２のマイクロフォンが取得する音声はポップノイズ成分が含まれないものとなる。 Next, the speaker of the separation module 131 reproduces the voice of the speaker recorded by the first microphone. Then, the second microphone records the sound reproduced from the speaker. Since the sound output from the speaker does not generate pop noise in principle, the sound acquired by the second microphone does not include the pop noise component.

音声取得部１１０及び分離モジュール１３１は、第１のマイクロフォン及び第２のマイクロフォンが取得した音声をそれぞれ別のデジタル信号に変換する。ここで、第１のマイクロフォン由来の音声データを第１の音声、第２のマイクロフォン由来の音声データを第２の音声と称する。変換部は、第１の音声及び第２の音声をポップノイズ検出部１３０に対して出力する。 The sound acquisition unit 110 and the separation module 131 convert the sound acquired by the first microphone and the second microphone into different digital signals. Here, the sound data derived from the first microphone is referred to as first sound, and the sound data derived from the second microphone is referred to as second sound. The conversion unit outputs the first sound and the second sound to the pop noise detection unit 130.

Ｓ１０４乃至Ｓ１０６、及びＳ２０１乃至Ｓ２０２にかかる動作は実施例１と同様であるため、詳細な説明を省略する。 Since the operations in S104 to S106 and S201 to S202 are the same as those in the first embodiment, detailed description thereof is omitted.

本実施の形態によれば、音声取得部１１０は、話者に対して露出する１本のマイクロフォンと、装置内部に設けられるスピーカ及び第２のマイクロフォンで構成される。これにより、話者に対して露出する複数のマイクロフォンを使用する場合に比べ、簡素な外観で生体検知装置１００を構成できる。 According to the present embodiment, the voice acquisition unit 110 includes one microphone exposed to the speaker, a speaker provided in the apparatus, and a second microphone. Thereby, compared with the case where a plurality of microphones exposed to the speaker are used, the living body detection device 100 can be configured with a simple appearance.

＜実施の形態５＞ <Embodiment 5>
実施の形態１乃至４では、生体検知装置１００が、音声信号にポップノイズが含まれているか否かを判断することにより、生体検知を行う例について説明した。しかしながら、実施の形態１乃至４の手法では、例えば風など話者の息以外の要因によりポップノイズ類似の音が入力された場合、生体検知装置１００はこれを生体が発声したポップノイズと誤認してしまうことがある。そこで実施の形態５では、実施の形態１乃至４と比較して、特に風などの影響に対して頑健な生体検知手法を提示する。 In the first to fourth embodiments, examples were described in which the living body detection apparatus 100 performs living body detection by determining whether pop noise is included in the audio signal. However, with the methods of the first to fourth embodiments, when a sound resembling pop noise is input due to a factor other than the speaker's breath, such as wind, the living body detection apparatus 100 may mistake it for pop noise produced by a living body. The fifth embodiment therefore presents a living body detection method that is more robust than the first to fourth embodiments, particularly against the influence of wind and the like.

通常ポップノイズは、例えば破裂音など一部の特定の子音で主に発生することが知られている。そこで本実施の形態では、ポップノイズが適切な場所で発生しているか否かを検査することにより、頑健さを強化する。 Usually, it is known that pop noise mainly occurs in some specific consonants such as plosives. Therefore, in this embodiment, robustness is enhanced by inspecting whether or not pop noise is generated at an appropriate place.

本実施の形態における生体検知装置１００は、実施の形態１乃至４に係る生体検知装置１００の構成要素に加え、ポップノイズ検出部１３０内に音素アライメントモジュール１３３を有する点に特徴を有する。その余の構成は、特段の言及がない限り実施の形態１乃至４と同様である。 The living body detection apparatus 100 according to the present embodiment is characterized in that a phoneme alignment module 133 is included in the pop noise detection unit 130 in addition to the constituent elements of the living body detection apparatus 100 according to the first to fourth embodiments. The rest of the configuration is the same as in Embodiments 1 to 4 unless otherwise specified.

音声取得部１１０は、音声信号を音素アライメントモジュール１３３に対して出力する。ポップノイズ検出部１３０の分離モジュール１３１は、音声信号からポップノイズ成分を分離する。そして特徴量化モジュール１３２が、ポップノイズ成分を特徴量化する。 The voice acquisition unit 110 outputs a voice signal to the phoneme alignment module 133. The separation module 131 of the pop noise detection unit 130 separates the pop noise component from the audio signal. Then, the feature amount module 132 converts the pop noise component into a feature amount.

音素アライメントモジュール１３３は、音声取得部１１０から音声信号を入力し、音声信号中の音素を識別する処理を行う。そして、音声信号を各音素の時間長で分割したセグメントを定義し、ポップノイズ成分の特徴量をセグメントに分割する。 The phoneme alignment module 133 receives a voice signal from the voice acquisition unit 110 and performs a process of identifying a phoneme in the voice signal. Then, a segment obtained by dividing the audio signal by the time length of each phoneme is defined, and the feature amount of the pop noise component is divided into segments.

識別モジュール１３４は、セグメント単位でポップノイズの検出を実行する。すなわち、セグメント化された特徴量を識別器に投入し、セグメント毎にポップノイズの存在の有無を識別する。 The identification module 134 detects pop noise on a segment basis. That is, the segmented feature amount is input to the discriminator, and the presence / absence of pop noise is identified for each segment.

判断部１５０は、ポップノイズ検出部１３０による各セグメントのポップノイズの検出結果と、音素との対応関係が正しいか否かを検証する。 The determination unit 150 verifies whether or not the correspondence between the pop noise detection result of each segment by the pop noise detection unit 130 and the phoneme is correct.

次いで、図５のフローチャートを用いて、生体検知装置１００の動作について説明する。 Next, the operation of the living body detection apparatus 100 will be described using the flowchart of FIG.

Ｓ３０１：音声取得
Ｓ１０１同様、音声取得部１１０が、話者の音声を取得する。 S301: Voice acquisition As in S101, the voice acquisition unit 110 acquires the voice of the speaker.

Ｓ３０２：音素識別
音素アライメントモジュール１３３は、音声取得部１１０から入力した音声信号から音素を抽出する。好ましくは、ポップノイズの比較的少ない第２の音声を利用することができる。 S302: Phoneme Identification The phoneme alignment module 133 extracts phonemes from the voice signal input from the voice acquisition unit 110. Preferably, the second sound with relatively little pop noise can be used.

ここで音素とは、言語学上の価値を有する音声の最小単位をいう。例えば、個々の母音および子音が音素に相当する。音声信号からの音素の認識は、例えばＨＭＭ等の公知の手法を利用して行うことができる。 Here, the phoneme is a minimum unit of speech having linguistic value. For example, individual vowels and consonants correspond to phonemes. Recognition of phonemes from a speech signal can be performed using a known method such as HMM.

音素アライメントモジュール１３３は、音声信号に対し、認識した各音素の時間長に対応する複数のセグメントを定義する（図１２）。セグメントは、典型的には、各音素の始点（時刻）および時間長によって定義できる。また、音素アライメントモジュール１３３は、各セグメントに対し、音素名をラベルとして付与する。例えば音素アライメントモジュール１３３は、音声信号に含まれる音素夫々について、音素名、音素の始点、及び音素の時間長を対応付けたレコードを作成し、図示しない記憶領域に保持させることでこれを実現できる（図１３）。 The phoneme alignment module 133 defines a plurality of segments corresponding to the recognized time length of each phoneme for the speech signal (FIG. 12). A segment can typically be defined by the start point (time) and time length of each phoneme. Moreover, the phoneme alignment module 133 assigns a phoneme name as a label to each segment. For example, the phoneme alignment module 133 can realize this by creating a record in which a phoneme name, a phoneme start point, and a phoneme time length are associated with each other, and holding it in a storage area (not shown). (FIG. 13).

Ｓ３０３：ポップノイズ検出
ポップノイズ検出部１３０の分離モジュール１３１は、音声信号からポップノイズ成分を分離抽出する。つづいて、特徴量化モジュール１３２がポップノイズ成分を特徴量化する。そして、音素アライメントモジュール１３３が特徴量化されたポップノイズ成分をセグメントに分割する。識別モジュール１３４が、各セグメントごとにポップノイズの検出処理を行う。 S303: Pop noise detection The separation module 131 of the pop noise detection unit 130 separates and extracts the pop noise component from the audio signal. Next, the feature quantity module 132 converts the pop noise component into a feature quantity. The phoneme alignment module 133 then divides the resulting feature quantity of the pop noise component into segments. The identification module 134 performs pop noise detection for each segment.

例えば、音声信号に対して図１３のようなセグメントが定義されている場合を考える。まず、始点が００：００：００：００、時間長が００：００：００：１２であるセグメント（ＩＤ＝１）が定義されているので、音素アライメントモジュール１３３は、ポップノイズ成分の特徴量のうち、このセグメントに相当する時間帯、すなわち００：００：００：００から００：００：００：１２までの領域の特徴量を切り出す。そして、識別モジュール１３４は、切り出した特徴量を識別器に入力し、ポップノイズの検出結果を得る。 For example, consider a case where segments as shown in FIG. 13 are defined for an audio signal. First, since a segment (ID = 1) with a start point of 00:00:00:00 and a time length of 00:00:00:12 is defined, the phoneme alignment module 133 cuts out, from the feature quantity of the pop noise component, the portion corresponding to this segment, that is, the region from 00:00:00:00 to 00:00:00:12. The identification module 134 then inputs the extracted feature quantity to the discriminator and obtains a pop noise detection result.

つづいて、音素アライメントモジュール１３３は、ポップノイズ成分の特徴量のうち、次のセグメント（ＩＤ＝２）に相当する時間幅、すなわち００：００：００：１２から００：００：００：２０までの領域の特徴量を切り出す。そして、同様に識別器を用いてポップノイズの検出を行う。同様に、ポップノイズ検出部１３０はすべてのセグメントについてポップノイズの検出を行う。ポップノイズ検出部１３０は、検出を試行した各セグメントについて、検出結果を記憶する（図１４）。 Next, the phoneme alignment module 133 cuts out, from the feature quantity of the pop noise component, the portion corresponding to the next segment (ID = 2), that is, the region from 00:00:00:12 to 00:00:00:20. Pop noise detection is then performed with the discriminator in the same way. The pop noise detection unit 130 likewise performs pop noise detection for all segments and stores the detection result for each segment on which detection was attempted (FIG. 14).
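The segment records and the per-segment slicing of S302 and S303 can be sketched as below. The record fields follow the description (phoneme name, start point, time length), but the unit is an assumption: the last field of the timestamps is treated here as a frame index. The phoneme label for segment ID 2 in the usage example is a hypothetical placeholder, since the source names only "t" (ID 1) and "p" (ID 5).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PhonemeSegment:
    seg_id: int
    phoneme: str   # label assigned by the phoneme alignment step
    start: int     # start frame index (assumed unit)
    length: int    # duration in frames (assumed unit)

def slice_features(features, segments):
    """Cut the frame-level feature matrix into one chunk per segment."""
    return {s.seg_id: features[s.start:s.start + s.length] for s in segments}

def detect_per_segment(features, segments, detector):
    """Run a pop-noise detector on each segment's feature chunk.

    `detector` is any callable that maps a feature chunk to True/False.
    Returns a list of (seg_id, phoneme, detected) records.
    """
    chunks = slice_features(features, segments)
    return [(s.seg_id, s.phoneme, bool(detector(chunks[s.seg_id])))
            for s in segments]
```

For example, with `segments = [PhonemeSegment(1, "t", 0, 12), PhonemeSegment(2, "a", 12, 8)]`, the first chunk covers frames 0 to 11 and the second frames 12 to 19, mirroring the two time ranges in the text.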

Ｓ３０４：ポップノイズと音素の関係の妥当性検証
判断部１５０は、Ｓ３０３で得られたポップノイズの検出結果の妥当性を検証する。上述のように、ポップノイズは、破裂音（例えば“ｐ”）など一部の特定の子音で主に発生することが知られている。本実施の形態では、生体検知装置１００は、このような、音素と、ポップノイズ発生可能性の有無と、を対応付けたパターン表を予め保持しているものとする（図１５）。 S304: Validity verification of relationship between pop noise and phoneme The determination unit 150 verifies the validity of the detection result of the pop noise obtained in S303. As described above, it is known that pop noise mainly occurs in some specific consonants such as plosives (eg, “p”). In the present embodiment, it is assumed that living body detection apparatus 100 holds in advance a pattern table in which such phonemes are associated with the possibility of occurrence of pop noise (FIG. 15).

判断部１５０は、Ｓ３０３で得られたポップノイズの検出結果における音素とポップノイズ検出結果との対応関係（図１４）と、予め与えられた音素とポップノイズ発生可能性との対応関係（図１５）を比較し、両者が整合しているか否かを検証する。例えば、図１４では、セグメントＩＤ＝１の音素“ｔ”について、ポップノイズが“非検出”である。一方、図１５では、音素“ｔ”について、ポップノイズの発生可能性は“無”と定義されている。よって、判断部１５０は、セグメントＩＤ＝１の判定結果は妥当と判断する。 The determination unit 150 determines the correspondence between the phoneme and the pop noise detection result (FIG. 14) in the pop noise detection result obtained in S303, and the correspondence between the phoneme given in advance and the possibility of pop noise generation (FIG. 15). ) And verify whether they are consistent. For example, in FIG. 14, the pop noise is “non-detection” for the phoneme “t” with the segment ID = 1. On the other hand, in FIG. 15, for the phoneme “t”, the possibility of occurrence of pop noise is defined as “none”. Therefore, the determination unit 150 determines that the determination result of the segment ID = 1 is appropriate.

また、セグメントＩＤ＝５の音素“ｐ”については、ポップノイズが“検出”されている。一方、図１５でも、音素“ｐ”について、ポップノイズの発生可能性は“有”と定義されている。よって、判断部１５０は、セグメントＩＤ＝５の判定結果も妥当と判断する。同様にして、判断部１５０は、Ｓ３０３で得られたポップノイズの検出結果の各々について、妥当性を検証していく。 Also, pop noise is “detected” for the phoneme “p” of the segment ID = 5. On the other hand, in FIG. 15, the possibility of pop noise is defined as “present” for the phoneme “p”. Therefore, the determination unit 150 determines that the determination result of the segment ID = 5 is also valid. Similarly, the determination unit 150 verifies the validity of each of the pop noise detection results obtained in S303.

判断部１５０は、すべてのセグメントについて妥当性が確認できたならば、話者は生体であると判断し、Ｓ３０５に遷移する。一方、すべてのセグメントで妥当性が確認できなかった場合には話者は生体でないものと判断し、Ｓ３０６に遷移する。 If validity is confirmed for all segments, the determination unit 150 determines that the speaker is a living body and proceeds to S305. Otherwise, that is, if validity could not be confirmed for every segment, it determines that the speaker is not a living body and proceeds to S306.

なお、ここで判断部１５０は、生体検知の判断基準として、上記以外の任意の基準を適宜採用できる。例えば、妥当性が検証されたセグメントの割合が所定の閾値を超えた場合に、話者は生体であると判断するようにしても良い。 Here, the determination unit 150 can appropriately adopt any criterion other than the above as a determination criterion for living body detection. For example, the speaker may be determined to be a living body when the proportion of segments whose validity has been verified exceeds a predetermined threshold.
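The validity check of S304, including the ratio-based alternative just mentioned, can be sketched as follows. The pattern table entries for "t" and "p" come from the description of FIG. 15; everything else is an assumption, in particular the function name, the treatment of unlisted phonemes as "no pop noise possible", and the strict rule that a segment is valid only when detection and possibility agree.

```python
def is_live_speaker(detections, pop_possible, min_ratio=1.0):
    """Decide liveness from per-segment pop-noise detections.

    detections:   list of (phoneme, detected) pairs, one per segment.
    pop_possible: dict mapping phoneme -> whether pop noise may occur.
    min_ratio:    fraction of segments that must be consistent;
                  1.0 reproduces the "all segments valid" rule.
    """
    if not detections:
        return False  # nothing to verify, so do not accept
    valid = sum(1 for phoneme, detected in detections
                if pop_possible.get(phoneme, False) == detected)
    return valid / len(detections) >= min_ratio

# Pattern table in the spirit of FIG. 15 (only "t" and "p" are from the text).
PATTERN_TABLE = {"t": False, "p": True}
```

With `min_ratio=1.0` a single inconsistent segment rejects the utterance; lowering it trades robustness against wind-like noise for tolerance of occasional detector errors.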

Ｓ３０５：話者照合
話者が生体であると判断された場合、生体検知装置１００は任意の手法を用いた話者照合フェーズに移行することができる。 S305: Speaker verification When it is determined that the speaker is a living body, the living body detection apparatus 100 can shift to a speaker verification phase using an arbitrary method.

Ｓ３０６：詐称音声として棄却
話者が生体でないと判断された場合、音声取得部１１０が取得した音声は人が発したものではなく、例えば合成や声質変換による音声である蓋然性が高い。よって、生体検知装置１００はこれを詐称音声と判断し、話者照合を行うことなく棄却する。すなわち、エラー処理や終了処理等を行う。 S306: Rejected as a spoofed voice When it is determined that the speaker is not a living body, the voice acquired by the voice acquisition unit 110 is not generated by a person, but has a high probability of being a voice generated by synthesis or voice quality conversion, for example. Therefore, the living body detection apparatus 100 determines that this is a spoofed voice and rejects it without performing speaker verification. That is, error processing, end processing, and the like are performed.

本実施の形態によれば、音素アライメントモジュール１３３が音声信号を音素レベルに区分し、ポップノイズ検出部１３０が音素レベルでポップノイズの発生を検出する。そして、判断部１５０が、ポップノイズが適切な位置（音素）で発生しているか否かを検証する。これにより、人の発声によらない、例えば風などに由来するポップノイズによる影響を排除し得る、より頑健な生体検知を実現できる。 According to the present embodiment, the phoneme alignment module 133 classifies the speech signal into phoneme levels, and the pop noise detection unit 130 detects the occurrence of pop noise at the phoneme level. Then, the determination unit 150 verifies whether the pop noise is generated at an appropriate position (phoneme). Thereby, it is possible to realize more robust living body detection that can eliminate the influence of pop noise derived from, for example, wind, which does not depend on human speech.

(Experimental results)
The inventors conducted a demonstration experiment of living body detection using the configuration of the fifth embodiment. In the experiment, a speaker's live voice and voice played back through a loudspeaker were each input to the voice acquisition unit 110. A first microphone without a pop filter and a second microphone with a pop filter were used as the voice acquisition unit 110. Condenser microphones, stereo microphones, and headset microphones were used as the first and second microphones. The feature quantization module 132 was not used; the difference signal between the first voice and the second voice was input to the discriminator as it was. The discriminator had been trained in advance on difference signals observed when pop noise occurs. The identification module 134 determined that pop noise was detected when the likelihood output by the discriminator was equal to or greater than a predetermined threshold.
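The detection rule used in the experiment can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and `mean_abs` is a simple stand-in score, not the trained discriminator or its likelihood. The sketch shows only the shape of the computation, namely taking the difference of the two microphone signals, scoring it, and comparing the score with a threshold.

```python
from typing import Callable, List, Sequence


def difference_signal(first: Sequence[float], second: Sequence[float]) -> List[float]:
    # Sample-wise difference between the first voice (no pop filter) and the
    # second voice (pop filter); both are assumed to be time-aligned.
    if len(first) != len(second):
        raise ValueError("signals must have the same length")
    return [a - b for a, b in zip(first, second)]


def mean_abs(signal: Sequence[float]) -> float:
    # Stand-in score (assumption): mean absolute amplitude of the signal.
    return sum(abs(x) for x in signal) / len(signal) if signal else 0.0


def pop_noise_detected(
    first: Sequence[float],
    second: Sequence[float],
    score: Callable[[Sequence[float]], float] = mean_abs,
    threshold: float = 0.5,  # hypothetical value
) -> bool:
    # Pop noise is reported when the score is at or above the threshold.
    return score(difference_signal(first, second)) >= threshold
```

In practice the `score` argument would be replaced by the output of a trained model. The threshold then trades off missed pop noise against false detections.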

FIG. 17 shows the results of this experiment. With both the condenser microphone and the stereo microphone, the speaker's live voice was detected as a living body with high probability (100%), and voice played back through a loudspeaker was determined not to be a living body with high probability (100%). Even with the headset microphone, a generally good living body detection rate of 86.9% was observed.

<Other embodiments>
The present invention is not limited to the above embodiments and can be modified as appropriate without departing from its spirit.

For example, the fifth embodiment showed an example (FIG. 15) in which the presence or absence of a possibility of pop noise is associated in advance with a single phoneme. However, the presence or absence of a possibility of pop noise may instead be associated with a sequence of consecutive phonemes.
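A table keyed by phoneme sequences could be looked up as sketched below. The table entries and the greedy longest-match strategy are assumptions for illustration only; the specification does not prescribe a lookup method, and the phoneme values shown are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

PhonemeKey = Tuple[str, ...]

# Hypothetical table: keys are single phonemes or runs of consecutive
# phonemes; values say whether pop noise may occur for that key.
POP_TABLE: Dict[PhonemeKey, bool] = {
    ("p",): True,
    ("a",): False,
    ("p", "a"): True,
    ("s", "a"): False,
}


def expected_pop(
    phonemes: Sequence[str], table: Dict[PhonemeKey, bool]
) -> List[Tuple[PhonemeKey, Optional[bool]]]:
    # Greedy longest-match lookup: at each position, prefer the longest
    # phoneme sequence present in the table. Unknown phonemes map to None.
    max_len = max((len(k) for k in table), default=1)
    result: List[Tuple[PhonemeKey, Optional[bool]]] = []
    i = 0
    while i < len(phonemes):
        for n in range(min(max_len, len(phonemes) - i), 0, -1):
            key = tuple(phonemes[i:i + n])
            if key in table:
                result.append((key, table[key]))
                i += n
                break
        else:
            result.append(((phonemes[i],), None))
            i += 1
    return result
```

Keying on sequences lets the table express context-dependent cases that a per-phoneme table cannot, at the cost of a slightly more involved lookup.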

In the above embodiments, the present invention has been described mainly as a hardware configuration, but it is not limited to this. Any of the processing can also be realized by causing a CPU (Central Processing Unit) to execute a computer program. In this case, the computer program can be stored on various types of non-transitory computer readable media and supplied to a computer. Non-transitory computer readable media include various types of tangible storage media. Examples of non-transitory computer readable media include magnetic recording media (for example, flexible disks, magnetic tapes, and hard disk drives), magneto-optical recording media (for example, magneto-optical disks), CD-ROM (Read Only Memory), CD-R, CD-R/W, and semiconductor memories (for example, mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, and RAM (Random Access Memory)). The program may also be supplied to a computer via various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can supply the program to a computer via a wired communication path, such as an electric wire or an optical fiber, or via a wireless communication path.

DESCRIPTION OF SYMBOLS
100 Living body detection apparatus
110 Voice acquisition unit
130 Pop noise detection unit
131 Separation module
132 Feature quantization module
133 Phoneme alignment module
134 Identification module
150 Determination unit

Claims

A living body detection device comprising:
a voice acquisition unit that acquires the voice of a speaker;
a pop noise detection unit that detects pop noise from the voice; and
a determination unit that determines whether or not the speaker is a living body based on a detection result of the pop noise.

The living body detection device according to claim 1, wherein the voice acquisition unit includes a first microphone that acquires the voice as a first voice, and a second microphone that acquires, as a second voice, the voice in which the pop noise is reduced compared to the first voice, and
the pop noise detection unit detects the pop noise using a difference between the first voice and the second voice.

The living body detection device according to claim 2, wherein the second microphone includes a cover for reducing the pop noise.

The living body detection device according to claim 2, wherein the first microphone and the second microphone are arranged at different positions in a certain space.

The living body detection device according to claim 1, wherein the voice acquisition unit includes a first microphone that acquires the voice as a first voice, a loudspeaker that outputs the first voice, and a second microphone that acquires, as a second voice, the first voice output from the loudspeaker, and
the pop noise detection unit detects the pop noise using a difference between the first voice and the second voice.

The living body detection device according to claim 1, wherein the voice acquisition unit includes a microphone that acquires the voice, and
the pop noise detection unit detects the pop noise from the voice using a low-pass filter.

The living body detection device according to any one of claims 1 to 6, further comprising a phoneme identifying unit that identifies a plurality of phonemes from the voice,
wherein the pop noise detection unit detects the pop noise for each of a plurality of segments obtained by dividing the voice by the time length of each phoneme.

The living body detection device according to claim 7, wherein the determination unit further determines, for each of the segments, the validity of the correspondence between the phoneme and the detection result of the pop noise, and determines whether or not the speaker is a living body based on the result of the validity determination.

The living body detection device according to any one of claims 1 to 8, wherein the pop noise detection unit includes a discriminator that has previously learned models of pop noise and non-pop noise.

A living body detection method comprising:
a voice acquisition step of acquiring the voice of a speaker;
a pop noise detection step of detecting pop noise from the voice; and
a determination step of determining whether or not the speaker is a living body based on a detection result of the pop noise.

A program for causing a computer to execute the method according to claim 10.