JP7073910B2

JP7073910B2 - Voice-based authentication device, voice-based authentication method, and program

Info

Publication number: JP7073910B2
Application number: JP2018100010A
Authority: JP
Inventors: 貢三浦
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2018-05-24
Filing date: 2018-05-24
Publication date: 2022-05-24
Anticipated expiration: 2038-05-24
Also published as: JP2019204368A

Description

The present invention relates to a voice-type authentication device and a voice-type authentication method that perform authentication processing using a user's utterances, and further relates to a program for realizing them.

In recent years, biometric authentication, which authenticates people by their physical characteristics, has attracted attention. Physical characteristics used for biometric authentication include the face, iris, fingerprints, veins, and voice. Of these, voice-based biometric authentication is the easiest for the user, who only has to speak a password.

However, voice-based biometric authentication makes security hard to ensure, because voice data is difficult to mask and spoofing with recorded data is easy. To address this, Patent Document 1 proposes a voice-type authentication device that ensures security.

Specifically, the voice-type authentication device disclosed in Patent Document 1 first creates an authentication character string by directly concatenating all of the variable tags with the password, and asks the user to speak this string. Next, the device converts the voice data of the utterance into feature values and detects the positions of the variable tags. It then identifies the password from the detected tag positions, compares the identified password with the registered password, and determines whether to permit authentication. Number strings, words, and the like are used as variable tags.

In this way, the voice-type authentication device of Patent Document 1 masks the password by combining it with variable tags. Moreover, because the variable tags are changed periodically, authentication by spoofing with recorded data can be rejected.

Patent Document 1: Japanese Unexamined Patent Publication No. 2004-295586

However, in the voice-type authentication device of Patent Document 1, the content of the variable tags and the positions at which they are inserted are presented to the user in advance, and the user must read aloud the authentication string combining the variable tags and the password exactly as presented. The device therefore loses the convenience of voice authentication and places a heavy burden on the user.

An example object of the present invention is to solve the above problem and to provide a voice-type authentication device, a voice-type authentication method, and a program that can reduce the burden on the user while ensuring security in voice authentication.

To achieve the above object, a voice-type authentication device according to one aspect of the present invention is a device for performing authentication processing using a user's voice, and includes:
a mask sound output unit that reproduces sound data of a mask sound, which masks the user's voice, so that the mask sound overlaps the voice uttered by the user during authentication;
a voice analysis unit that acquires voice data in which the voice uttered by the user during authentication and the mask sound overlap, and extracts, from the acquired voice data and by using the sound data of the mask sound, voice data of the voice uttered by the user; and
an authentication processing unit that executes authentication processing using the extracted voice data of the voice uttered by the user.

Further, to achieve the above object, a voice-type authentication method according to one aspect of the present invention is a method for performing authentication processing using a user's voice, and includes:
(a) a step of reproducing sound data of a mask sound, which masks the user's voice, so that the mask sound overlaps the voice uttered by the user during authentication;
(b) a step of acquiring voice data in which the voice uttered by the user during authentication and the mask sound overlap, and extracting, from the acquired voice data and by using the sound data of the mask sound, voice data of the voice uttered by the user; and
(c) a step of executing authentication processing using the extracted voice data of the voice uttered by the user.

Furthermore, to achieve the above object, a program according to one aspect of the present invention is a program for causing a computer to perform authentication processing using a user's voice, the program causing the computer to execute:
(a) a step of reproducing sound data of a mask sound, which masks the user's voice, so that the mask sound overlaps the voice uttered by the user during authentication;
(b) a step of acquiring voice data in which the voice uttered by the user during authentication and the mask sound overlap, and extracting, from the acquired voice data and by using the sound data of the mask sound, voice data of the voice uttered by the user; and
(c) a step of executing authentication processing using the extracted voice data of the voice uttered by the user.

As described above, according to the present invention, the burden on the user can be reduced while security in voice authentication is ensured.

FIG. 1 is a block diagram showing a schematic configuration of a voice-type authentication device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a specific configuration of the voice recognition processing device according to the embodiment of the present invention.
FIG. 3 is a flow chart showing the operation of the voice-type authentication device according to the embodiment of the present invention.
FIG. 4 is a flow chart showing step A4 of FIG. 3 in more detail.
FIG. 5 is a block diagram showing an example of a computer that realizes the voice-type authentication device according to the embodiment of the present invention.

(Embodiment)
Hereinafter, the voice-type authentication device, the voice-type authentication method, and the program according to an embodiment of the present invention will be described with reference to FIGS. 1 to 5.

[Device configuration]
First, the schematic configuration of the voice-type authentication device according to the present embodiment will be described with reference to FIG. 1. FIG. 1 is a block diagram showing a schematic configuration of a voice-type authentication device according to an embodiment of the present invention.

The voice-type authentication device 100 according to the present embodiment, shown in FIG. 1, performs authentication processing using a user's voice. As shown in FIG. 1, the voice-type authentication device 100 includes a mask sound output unit 10, a voice analysis unit 20, and an authentication processing unit 30.

The mask sound output unit 10 reproduces the sound data of the mask sound so that the mask sound, which masks the user's voice, overlaps the voice uttered by the user during authentication. The voice analysis unit 20 acquires voice data in which the voice uttered by the user during authentication and the mask sound overlap. The voice analysis unit 20 then extracts, from the acquired voice data and by using the sound data of the mask sound, the voice data of the voice uttered by the user. The authentication processing unit 30 executes authentication processing using the extracted voice data of the user's voice.

As described above, in the present embodiment, when the user speaks for authentication, the mask sound is reproduced so that it overlaps the utterance. The voice data of the user's utterance is then extracted from the voice data in which the two overlap, and authentication is performed. The user therefore only has to speak a password or the like, and the burden on the user is far smaller than in the conventional approach.

Further, since the mask sound is superimposed on the user's utterance, even if the utterance is recorded, the user's utterance and the mask sound are played back together when the recording is replayed. Consequently, when someone attempts to authenticate with recorded voice data, the input contains multiple overlapping mask sounds (the recorded one and the one being reproduced live), so spoofing by recording can be identified easily. The present embodiment therefore also ensures security in voice authentication.

Next, the configuration of the voice-type authentication device 100 in the present embodiment will be described more specifically with reference to FIG. 2. FIG. 2 is a block diagram showing a specific configuration of the voice recognition processing device according to the embodiment of the present invention.

As shown in FIG. 2, in the present embodiment, a voice input device 40 and a voice output device 50 are connected to the voice-type authentication device 100. The voice input device 40 is a microphone; it converts external sound into voice data and inputs the resulting voice data to the voice-type authentication device 100. The voice output device 50 is a speaker; when the mask sound output unit 10 reproduces the sound data of the mask sound, the voice output device 50 outputs the reproduced mask sound to the outside.

In the present embodiment, the mask sound output unit 10 reproduces the sound data of the mask sound through the voice output device 50 at the moment when the user 60 speaks authentication data (an account, a password, etc.) toward the voice input device 40. That is, the mask sound output unit 10 performs reproduction so that the reproduced mask sound overlaps the voice uttered by the user for authentication. As a result, both the voice uttered by the user for authentication and the mask sound enter the voice input device 40. The voice input device 40 then inputs to the voice analysis unit 20 voice data in which the voice of the authentication data uttered by the user and the mask sound overlap.

Further, when the mask sound output unit 10 reproduces the sound data, it inputs that sound data to the voice analysis unit 20 together with information such as the time of reproduction. The mask sound may be created by the mask sound output unit 10, or may be created and registered in advance.

The mask sound output unit 10 can also change the parameters of the mask sound (volume, sound quality, content, etc.) to suit the voice of the user 60 being masked. This is because, as a general tendency, the masking effect increases as the mask sound becomes louder and as its sound quality comes closer to that of the voice uttered by the user 60 during authentication (for example, the closer it is to a human voice).

Each mask sound parameter may be set in advance as a default value by an administrator or the like, or may be changed later according to the external acoustic environment and the importance of the information to be kept secret. The manner of setting the mask sound parameters is not particularly limited.
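As a rough illustration of adapting the mask volume to the user's voice, the sketch below rescales a mask so that its RMS level is a fixed multiple of the voice's RMS level. The RMS measure and the `mask_to_voice_ratio` parameter are assumptions made for this example, standing in for the administrator default described above; the text does not specify how volume is matched.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def scale_mask_to_voice(mask, voice_sample, mask_to_voice_ratio=2.0):
    """Rescale the mask so its RMS is mask_to_voice_ratio times the voice RMS.

    mask_to_voice_ratio is a hypothetical tuning parameter: a default that an
    administrator might raise for noisy rooms or more sensitive data.
    """
    mask_level = rms(mask)
    if mask_level == 0.0:
        raise ValueError("mask is silent")
    gain = mask_to_voice_ratio * rms(voice_sample) / mask_level
    return [gain * s for s in mask]
```

A ratio above 1 keeps the mask louder than the speaker, in line with the tendency that louder masks hide speech better.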

The mask sound output by the mask sound output unit 10 may be music or a human voice. The mask sound may even be a sound whose frequency lies outside the human audible range. Further, to strengthen protection against spoofing, the mask sound output unit 10 can change the mask sound for each authentication by the user. For example, when music is used as the mask sound, the mask sound output unit 10 changes the piece of music for each authentication.
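Changing the mask for every attempt could be done in several ways. The sketch below shows two options under stated assumptions: choosing a pre-registered track at random while never repeating the previous one, or synthesizing white noise from a fresh random seed. Neither the track registry nor the noise option comes from the text; the function names are hypothetical.

```python
import random
import secrets


def pick_track(n_tracks, previous_index=None):
    """Choose a registered mask track at random, never repeating the previous one."""
    candidates = [i for i in range(n_tracks) if i != previous_index]
    if not candidates:
        raise ValueError("need at least two tracks to avoid repeating the mask")
    return candidates[secrets.randbelow(len(candidates))]


def noise_mask(n_samples, seed=None):
    """Generate a white-noise mask from a fresh (or supplied) seed.

    Returning the seed lets the analysis side regenerate exactly the same
    mask when it later compares the mask with the captured audio.
    """
    if seed is None:
        seed = secrets.randbits(64)  # unpredictable per-session seed
    rng = random.Random(seed)
    return seed, [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
```

Keeping the seed on the device side means that the mask sound output unit 10 and the voice analysis unit 20 can share the same mask without storing the whole waveform.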

In the present embodiment, when voice data is output from the voice input device 40, the voice analysis unit 20 acquires the voice data, extracts authentication information from it, and determines whether the user's voice is a spoofed voice. For this purpose, as shown in FIG. 2, the voice analysis unit 20 includes a mask sound extraction unit 21, a mask sound comparison unit 22, a user voice restoration unit 23, and a duplicate voice determination unit 24.

When voice data is input from the voice input device 40, the mask sound extraction unit 21 acquires it and sends it to the mask sound comparison unit 22. As described later, the mask sound comparison unit 22 then estimates the mask sound portion of the voice data input from the voice input device 40 and sends data specifying the estimated mask sound portion (hereinafter, "mask sound identification data") to the mask sound extraction unit 21. The mask sound extraction unit 21 then sends the received mask sound identification data, together with the voice data input from the voice input device 40, to the user voice restoration unit 23.

As mentioned above, the mask sound comparison unit 22 first compares the sound data generated by the mask sound output unit 10 with the voice data input from the voice input device 40. From the result of the comparison, it estimates the mask sound portion (component) of the input voice data and creates mask sound identification data specifying the estimated portion.

Specifically, since the voice data input from the voice input device 40 is a composite wave, the mask sound comparison unit 22 decomposes it into multiple waves using, for example, a Fourier transform, and estimates the mask sound portion from the result of the decomposition.
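One way to locate a known mask inside the composite wave is sketched below. It deliberately swaps in time-domain cross-correlation for the Fourier decomposition named in the text: it searches for the delay at which the mask best lines up with the capture and then fits the mask's gain by least squares. The returned `(delay, gain)` pair is a hypothetical stand-in for the mask sound identification data, and the sketch assumes the mask arrives with positive polarity.

```python
def estimate_mask_component(captured, mask, max_delay):
    """Estimate where (delay, in samples) and how loud (gain) a known mask
    appears in the captured mixture of voice and mask.

    Uses plain cross-correlation over delays 0..max_delay; a real system
    might use FFT-based correlation or a spectral decomposition instead.
    """
    energy = sum(m * m for m in mask)
    if energy == 0.0:
        raise ValueError("mask is silent")
    best_delay, best_corr = None, float("-inf")
    for delay in range(max_delay + 1):
        if delay + len(mask) > len(captured):
            break  # the mask would run past the end of the capture
        corr = sum(captured[delay + i] * m for i, m in enumerate(mask))
        if corr > best_corr:
            best_delay, best_corr = delay, corr
    if best_delay is None:
        raise ValueError("capture is shorter than the mask")
    gain = best_corr / energy  # least-squares gain of the mask at best_delay
    return best_delay, gain
```

For example, with `mask = [1.0, 2.0, -1.0]` played at half volume two samples into a capture of `[0.0, 0.0, 0.5, 1.0, -0.5, 0.0, 0.0]`, the function recovers a delay of 2 and a gain of 0.5.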

As described above, the mask sound comparison unit 22 sends the mask sound identification data specifying the estimated mask sound portion to the mask sound extraction unit 21. It also sends the mask sound identification data and the sound data generated by the mask sound output unit 10 to the duplicate voice determination unit 24, described below.

Based on the voice data (mask sound identification data) and the sound data generated by the mask sound output unit 10, the duplicate voice determination unit 24 determines whether the voice uttered by the user during authentication is a spoofed voice, that is, a duplicated (recorded) voice. The duplicate voice determination unit 24 inputs the result of the determination to the authentication processing unit 30.

Specifically, the duplicate voice determination unit 24 determines that the voice is spoofed unless all of the following conditions (1) to (4) are satisfied, that is, when any one of them is not met.

Condition (1) is that a "mask sound" is present in the voice data input by the voice input device 40; that is, sound data identical to the sound data of the mask sound can be extracted from the voice data input from the voice input device 40. If no mask sound is present, the input may be a recording.

Specifically, the duplicate voice determination unit 24 determines that condition (1) is satisfied when the mask sound identification data matches the sound data generated by the mask sound output unit 10.

Condition (2) is that there is exactly one "mask sound"; that is, only one piece of sound data can be extracted from the voice data. If the input contains a composite of waves, whether shifted in phase or in phase with one another, it may be the user's voice recorded together with an earlier mask sound and then replayed.

Specifically, the duplicate voice determination unit 24 determines that condition (2) is satisfied when only one piece of sound data can be identified from the mask sound identification data.

Condition (3) is that the volume level of the "mask sound" is within the expected value; that is, the volume level of the sound data extracted from the voice data is within a predetermined range. A mask sound level that is too high or too low suggests that the input may be a recording.

Specifically, the duplicate voice determination unit 24 determines that condition (3) is satisfied when the difference between the level in the mask sound identification data and the volume level of the mask sound reproduced by the voice output device 50 is within a set range.

Condition (4) is that the playback start time of the "mask sound" is within the expected range; that is, the time at which the sound underlying the sound data extracted from the voice data was reproduced falls within a predetermined time window. If the mask sound was input before the voice output device 50 started playback, the input may be a recording. Likewise, if a long time has elapsed between the playback start time and the time at which the voice was input to the voice input device 40, the input may be a recording.

Specifically, the duplicate voice determination unit 24 identifies the reproduction time of the original mask sound from the mask sound identification data, and determines that condition (4) is satisfied when the difference between the identified time and the voice input start time at the voice input device 40 is within a set range.
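The four conditions combine into a single "replay unless everything checks out" decision. The sketch below encodes that logic over a hypothetical `MaskObservation` record summarizing the mask sound identification data. The tolerance values are invented placeholders for the "set ranges" in the text, and condition (4) is simplified to the lag between playback start and the detected mask onset.

```python
from dataclasses import dataclass


@dataclass
class MaskObservation:
    """Hypothetical summary of the mask sound identification data for one capture."""
    matches_generated: bool  # (1) an extracted component equals the generated mask
    mask_count: int          # (2) number of mask-like components found in the capture
    level_db: float          # (3) level of the extracted mask component
    onset_s: float           # (4) time at which the extracted mask starts, in seconds


def is_spoofed(obs, played_level_db, playback_start_s,
               level_tolerance_db=6.0, max_lag_s=0.5):
    """Treat the capture as a replay unless all four conditions hold.

    level_tolerance_db and max_lag_s are hypothetical settings.
    """
    cond1 = obs.matches_generated
    cond2 = obs.mask_count == 1
    cond3 = abs(obs.level_db - played_level_db) <= level_tolerance_db
    lag = obs.onset_s - playback_start_s
    cond4 = 0.0 <= lag <= max_lag_s  # not before playback, and not long after it
    return not (cond1 and cond2 and cond3 and cond4)
```

A replayed recording that carries an old mask on top of the live one shows up as `mask_count == 2` and fails condition (2), even if its level and timing look plausible.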

The user voice restoration unit 23 uses the mask sound identification data to remove the wavelength components of the mask sound from the voice data input to the voice input device 40, thereby extracting the voice data of the user's voice and restoring the user's voice. The user voice restoration unit 23 then inputs the voice data, with the mask sound's wavelength components removed, to the authentication processing unit 30.

Specifically, the user voice restoration unit 23 creates voice data whose phase is the inverse of that of the mask sound identification data and combines it with the voice data input from the voice input device 40, thereby removing the wavelength components of the mask sound. Other conventional techniques may also be used to remove the wavelength components of the mask sound.
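The inverse-phase removal can be shown in a few lines: build a sign-inverted copy of the mask, aligned and scaled to match its appearance in the capture, and add it to the capture. The `delay` and `gain` inputs are assumed to come from an earlier estimation step; they are an illustrative representation of the mask sound identification data, not one defined in the text. Real acoustic paths also filter and reverberate the mask, so in practice cancellation would be only approximate.

```python
def cancel_mask(captured, mask, delay, gain):
    """Remove a known mask by adding its inverse-phase, aligned copy.

    captured: samples of the user's voice plus the mask
    mask:     samples of the mask as generated
    delay:    sample offset at which the mask starts in the capture
    gain:     amplitude of the mask as it appears in the capture
    """
    anti_phase = [0.0] * len(captured)
    for i, m in enumerate(mask):
        j = delay + i
        if j < len(captured):
            anti_phase[j] = -gain * m  # sign inversion = 180-degree phase shift
    return [c + a for c, a in zip(captured, anti_phase)]
```

If the estimated delay or gain is wrong, residual mask energy remains in the output, which would also tend to make the later voice recognition step fail.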

In the present embodiment, as shown in FIG. 2, the authentication processing unit 30 includes a voice recognition unit 31, an authentication data matching unit 32, an authentication determination unit 33, and an authentication data storage unit 34.

When voice data is input from the user voice restoration unit 23, the voice recognition unit 31 performs voice recognition on it and converts the voice data into text data. The voice recognition unit 31 sends the resulting text data to the authentication data matching unit 32.

Voice recognition by the voice recognition unit 31 is usually difficult in the presence of noise. Therefore, if the duplicate voice determination unit 24 wrongly determines that a duplicated voice is not a duplicate (spoof), the input is recorded data whose recorded mask sound has not been removed (only the mask sound reproduced live is cancelled), so the voice recognition unit 31 is likely to fail to recognize it. In that case, authentication fails and authentication by spoofing is still prevented.

The authentication data matching unit 32 queries the authentication data storage unit 34, collates the input text data with the authentication data registered in advance (a password, etc.), and sends the collation result to the authentication determination unit 33. The authentication data storage unit 34 stores the data to be checked in authentication, for example an ID and a password in encrypted form.
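A minimal sketch of the collation step follows, assuming that the stored credential is a salted PBKDF2 hash rather than the encrypted form the text mentions (hashing is swapped in here because the recognized text only needs to be compared, never decrypted). The whitespace and case normalization and the iteration count are also assumptions; `hashlib.pbkdf2_hmac` and `hmac.compare_digest` are standard-library functions.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # hypothetical PBKDF2 work factor


def normalize(text):
    """Lower-case and collapse whitespace so recognizer formatting does not matter."""
    return " ".join(text.lower().split())


def enroll(passphrase):
    """Store a salted hash instead of the spoken passphrase itself."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", normalize(passphrase).encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def collate(recognized_text, salt, stored_digest):
    """Return True if the recognized text matches the enrolled passphrase."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", normalize(recognized_text).encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(candidate, stored_digest)  # constant-time compare
```

The constant-time comparison avoids leaking, through response timing, how many leading bytes of a guess were correct.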

Based on the collation result from the authentication data matching unit 32, the authentication determination unit 33 decides whether to permit authentication. The authentication determination unit 33 also notifies the user 60 of the authentication result.

[Device operation]
Next, the operation of the voice-type authentication device 100 in the present embodiment will be described with reference to FIG. 3. FIG. 3 is a flow chart showing the operation of the voice-type authentication device according to the embodiment of the present invention. In the following description, FIGS. 1 and 2 are referred to as appropriate. In the present embodiment, the voice-type authentication method is carried out by operating the voice-type authentication device 100. The following description of the operation of the voice-type authentication device 100 therefore also serves as the description of the voice-type authentication method in the present embodiment.

First, as shown in FIG. 3, the mask sound output unit 10 reproduces the sound data of the mask sound so that the mask sound, which masks the user's voice, overlaps the voice uttered by the user during authentication (step A1).

Specifically, in step A1, the mask sound output unit 10 reproduces the sound data through the voice output device 50 at the moment when the user 60 speaks the authentication data toward the voice input device 40. When step A1 is executed, the voice input device 40 inputs to the voice analysis unit 20 voice data in which the voice of the authentication data uttered by the user and the mask sound overlap.

Next, in the voice analysis unit 20, when voice data is output from the voice input device 40, the mask sound extraction unit 21 acquires the voice data and sends it to the mask sound comparison unit 22 (step A2).

Next, the mask sound comparison unit 22 compares the voice data acquired in step A2 with the sound data generated by the mask sound output unit 10, estimates the mask sound portion of the voice data, and creates mask sound identification data specifying the estimated portion (step A3).

Also in step A3, the mask sound comparison unit 22 sends the mask sound identification data to the mask sound extraction unit 21, and sends the mask sound identification data and the sound data generated by the mask sound output unit 10 to the duplicate voice determination unit 24, described below. In this case, the mask sound extraction unit 21 sends the voice data acquired in step A2 and the mask sound identification data to the user voice restoration unit 23.

Next, based on the mask sound identification data and the sound data generated by the mask sound output unit 10, the duplicate voice determination unit 24 determines whether the voice uttered by the user during authentication is a duplicated voice (step A4). Step A4 is described in detail later with reference to FIG. 4.

If the determination in step A4 finds that the voice is duplicated, the authentication determination unit 33 determines that authentication has failed (step A8).

If, on the other hand, the determination in step A4 finds that the voice is not duplicated, the user voice restoration unit 23 uses the mask sound identification data to remove the wavelength components of the mask sound from the voice data acquired in step A2, extracts the voice data of the user's voice, and restores the user's voice (step A5).

Next, in the authentication processing unit 30, the voice recognition unit 31 performs speech recognition on the voice data of the user's voice extracted in step A5 and converts it into text data (step A6). The voice recognition unit 31 then sends the resulting text data to the authentication data matching unit 32.

Next, the authentication data matching unit 32 queries the authentication data storage unit 34. It collates the text data obtained in step A6 against the pre-registered authentication data, such as a password (step A7). Specifically, the unit determines whether that text data matches the authentication data and notifies the authentication determination unit 33 of the result.
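The patent does not say how the authentication data used in step A7 is stored. This sketch makes three assumptions:

- The data is a salted PBKDF2 hash (Python's `hashlib.pbkdf2_hmac`) rather than a plaintext password.
- The comparison is constant-time, via `hmac.compare_digest`.
- A normalization step runs first, so that recognizer output with full-width characters or extra spaces does not cause spurious mismatches.

The function names and the normalization rule are hypothetical.

```python
import hashlib
import hmac
import os
import unicodedata


def normalize(text: str) -> str:
    # NFKC folds full-width/half-width variants; whitespace is dropped because
    # recognizers may segment spoken words differently (an assumption).
    return "".join(unicodedata.normalize("NFKC", text).lower().split())


def enroll(passphrase: str, iterations: int = 200_000) -> tuple[bytes, bytes, int]:
    """Create a stored record: (salt, digest, iteration count)."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", normalize(passphrase).encode(), salt, iterations)
    return salt, digest, iterations


def matches(recognized_text: str, record: tuple[bytes, bytes, int]) -> bool:
    """Step A7 sketch: does the recognized text match the registered data?"""
    salt, digest, iterations = record
    candidate = hashlib.pbkdf2_hmac("sha256", normalize(recognized_text).encode(), salt, iterations)
    return hmac.compare_digest(candidate, digest)
```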

Next, based on the collation result from the authentication data matching unit 32, the authentication determination unit 33 decides whether to permit or deny authentication (step A8). After step A8, the authentication determination unit 33 notifies the user 60 of the authentication result (step A9).

Step A4 of FIG. 3 is now described in more detail with reference to FIG. 4. FIG. 4 is a flowchart showing step A4 of FIG. 3 more specifically.

As shown in FIG. 4, the duplicate voice determination unit 24 first determines whether a "mask sound" is present in the voice data input through the voice input device 40 (step A41).

Specifically, the duplicate voice determination unit 24 determines that a "mask sound" is present when the mask sound identification data created in step A3 matches the sound data generated by the mask sound output unit 10.

If step A41 finds no "mask sound", the duplicate voice determination unit 24 proceeds to step A46.

If, on the other hand, step A41 finds that a "mask sound" is present, the duplicate voice determination unit 24 determines whether there is exactly one "mask sound" (step A42).

Specifically, the duplicate voice determination unit 24 determines that there is exactly one "mask sound" when only one piece of sound data can be identified from the mask sound identification data created in step A3.

If step A42 finds that there is not exactly one "mask sound", the duplicate voice determination unit 24 proceeds to step A46.
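Step A42 needs some measure of how many copies of the mask the capture contains. This sketch assumes a per-lag correlation score like the one used to locate the mask in step A3. It counts the separate runs of lags whose score reaches a threshold, merging runs that are closer together than a minimum separation. `count_mask_occurrences`, `threshold` and `min_separation` are hypothetical names and tuning values.

One scenario this check plausibly catches arises when the same prerecorded mask is reused, as Appendix 6 permits. Suppose an attacker replays a recording that already contains the mask while the live mask is sounding. The capture would then match the mask twice, at different offsets.

```python
from typing import Sequence


def count_mask_occurrences(scores: Sequence[float], threshold: float, min_separation: int) -> int:
    """Count distinct mask matches in a sequence of per-lag correlation scores.

    Above-threshold lags separated by fewer than `min_separation` samples
    are merged into one occurrence (a peak's shoulders also score highly).
    """
    count = 0
    last_hit = None
    for lag, score in enumerate(scores):
        if score >= threshold:
            if last_hit is None or lag - last_hit >= min_separation:
                count += 1  # start of a new, separate match
            last_hit = lag
    return count
```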

If, on the other hand, step A42 finds exactly one "mask sound", the duplicate voice determination unit 24 determines whether the volume level of the "mask sound" is within the expected value (step A43).

Specifically, the duplicate voice determination unit 24 compares the level indicated by the mask sound identification data created in step A3 with the volume level of the mask sound reproduced by the voice output device 50. If the difference is within a set range, it determines that the volume level of the "mask sound" is within the expected value.

If step A43 finds that the volume level of the "mask sound" is not within the expected value, the duplicate voice determination unit 24 proceeds to step A46.
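Step A43 compares the level of the captured mask with the level at which the voice output device 50 played it. This sketch makes two assumptions, and neither the unit nor the function names come from the patent:

- Levels are expressed as RMS in decibels relative to full scale (dBFS), for samples in [-1.0, 1.0].
- The expected captured level and its tolerance have been calibrated for the particular loudspeaker-to-microphone setup.

```python
import math
from typing import Sequence


def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level in dB relative to full scale; -inf for silence or no samples."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")


def level_within_expected(captured_mask: Sequence[float], expected_dbfs: float,
                          tolerance_db: float) -> bool:
    """Step A43 sketch: is the captured mask level close to the expected level?"""
    # abs(-inf - x) is inf, so silence always fails the check.
    return abs(rms_dbfs(captured_mask) - expected_dbfs) <= tolerance_db
```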

If, on the other hand, step A43 finds that the volume level of the "mask sound" is within the expected value, the duplicate voice determination unit 24 determines whether the reproduction start time of the "mask sound" is within the expected range (step A44).

Specifically, the duplicate voice determination unit 24 identifies, from the mask sound identification data created in step A3, the time at which the original mask sound was reproduced. If the difference between that time and the time at which voice input started at the voice input device 40 is within a set range, it determines that the reproduction start time of the "mask sound" is within the expected range.

If step A44 finds that the reproduction start time of the "mask sound" is not within the expected range, the duplicate voice determination unit 24 proceeds to step A46. If, on the other hand, the reproduction start time is within the expected range, the duplicate voice determination unit 24 proceeds to step A45.

In step A45, the duplicate voice determination unit 24 determines that the voice input to the voice input device 40 is not a duplicated voice. In step A46, it determines that the voice input to the voice input device 40 is a duplicated voice.
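The flow of FIG. 4 (steps A41 to A46) is a cascade in which any failed check leads straight to the "duplicated" verdict. The sketch below assumes that step A3 has already been summarized into a small record, `MaskObservation`, holding a match count, a level and a start time. That record, the parameter names and the units (dBFS, seconds) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MaskObservation:
    """Hypothetical summary of what step A3 found in the captured audio."""
    occurrences: int               # number of distinct mask matches
    level_dbfs: Optional[float]    # level of the matched mask (None if absent)
    start_time_s: Optional[float]  # when the matched mask began (None if absent)


def is_duplicated_voice(obs: MaskObservation,
                        expected_level_dbfs: float, level_tolerance_db: float,
                        input_start_time_s: float, start_tolerance_s: float) -> bool:
    """Return True (step A46) as soon as any check fails, else False (step A45)."""
    if obs.occurrences == 0:                      # A41: no mask present
        return True
    if obs.occurrences != 1:                      # A42: not exactly one mask
        return True
    if (obs.level_dbfs is None
            or abs(obs.level_dbfs - expected_level_dbfs) > level_tolerance_db):
        return True                               # A43: level out of range
    if (obs.start_time_s is None
            or abs(obs.start_time_s - input_start_time_s) > start_tolerance_s):
        return True                               # A44: start time out of range
    return False                                  # A45: not duplicated
```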

By executing steps A1 to A9 above, authentication by the user's utterance is performed with the mask sound superimposed on that utterance. At the same time, the device determines whether the attempt is spoofing.
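One way to make the checks in steps A41 to A44 hard to defeat with an old recording is to play a different, unpredictable mask for every attempt. A mask captured in an earlier session would then no longer match. The patent does not specify how the mask sound is generated. This sketch assumes white noise drawn from `random.SystemRandom`, which reads from the operating system's random source. The function name and amplitude are hypothetical. The returned samples are what the mask sound output unit 10 would both play and pass on for the step A3 comparison.

```python
import random

_rng = random.SystemRandom()  # backed by os.urandom, not the seeded Mersenne Twister


def session_mask(length: int, amplitude: float = 0.2) -> list[float]:
    """Generate fresh noise samples in [-amplitude, amplitude] for one attempt."""
    return [_rng.uniform(-amplitude, amplitude) for _ in range(length)]
```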

[Effects of the Embodiment]
As described above, the voice-type authentication device 100 of this embodiment has three functions: "a function of masking the voice", "a function of restoring the authentication information from the voice data of the masked voice", and "a function of checking for spoofing with duplicated voice by using the mask sound data". With these functions, the user only has to speak a password or the like, so the burden on the user is much smaller than with the conventional technique. At the same time, security in voice authentication is ensured.

[Modification]
In the example above, the mask sound is reproduced by the voice output device 50, but this embodiment is not limited to that arrangement. For example, if music played as background music (BGM) is used as the mask sound, the mask sound may be reproduced continuously by an external sound generator such as a CD player. In that case, however, the mask sound output unit 10 works in conjunction with the sound generator and inputs the sound data of the mask sound being reproduced to the voice analysis unit 20.
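In the BGM variation the music plays continuously. The mask sound output unit 10 therefore has to give the voice analysis unit 20 the part of the track that was sounding during the capture. This sketch assumes three things:

- The track loops.
- The player and the analysis side share a clock.
- There is no playback drift.

`bgm_reference` and its parameters are hypothetical. In practice, the computed offset would likely serve only as the starting point for a correlation search like the one in step A3.

```python
from typing import Sequence


def bgm_reference(track: Sequence[float], sample_rate: int, track_started_at_s: float,
                  capture_started_at_s: float, capture_len: int) -> list[float]:
    """Return the samples of a looping BGM track that were playing during a capture."""
    if not track:
        raise ValueError("empty track")
    # Samples elapsed since the track started, on the assumed shared clock.
    offset = round((capture_started_at_s - track_started_at_s) * sample_rate)
    n = len(track)
    # Python's % is non-negative for positive n, so wrap-around is handled.
    return [track[(offset + i) % n] for i in range(capture_len)]
```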

[Program]
The program of this embodiment need only be a program that causes a computer to execute steps A1 to A9 shown in FIG. 3. Installing this program on a computer and executing it realizes the voice-type authentication device 100 and the voice-type authentication method of this embodiment. In this case, the processor of the computer functions as the mask sound output unit 10, the voice analysis unit 20 and the authentication processing unit 30, and performs their processing.
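The program's structure can be pictured as three units wired in the order of FIG. 3. This sketch assumes, as the references to steps A2 and A3 suggest, that step A1 plays the mask and step A2 captures the mixed audio. All field names are hypothetical, and each callable stands in for a unit described earlier. Because the units are injected as callables, the same wiring would also fit a system in which each unit runs on a different computer. Running `play_mask` and `capture` one after the other is a simplification, since in practice playback and recording overlap.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Audio = Sequence[float]


@dataclass
class VoiceAuthPipeline:
    """Hypothetical wiring of units 10, 20 and 30 following steps A1 to A9."""
    play_mask: Callable[[], Audio]                  # mask sound output unit 10 (A1)
    capture: Callable[[], Audio]                    # voice input device 40 (A2)
    is_duplicate: Callable[[Audio, Audio], bool]    # voice analysis unit 20 (A3-A4)
    restore_voice: Callable[[Audio, Audio], Audio]  # voice analysis unit 20 (A5)
    recognize: Callable[[Audio], str]               # authentication processing unit 30 (A6)
    matches_registered: Callable[[str], bool]       # authentication processing unit 30 (A7)

    def run(self) -> bool:
        """Return the A8 decision; notifying the user (A9) is left to the caller."""
        mask = self.play_mask()
        mixture = self.capture()
        if self.is_duplicate(mixture, mask):
            return False  # A8: authentication fails
        voice = self.restore_voice(mixture, mask)
        text = self.recognize(voice)
        return self.matches_registered(text)  # A8: permit or deny
```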

The program of this embodiment may also be executed by a computer system built from a plurality of computers. In that case, for example, each computer may function as one of the mask sound output unit 10, the voice analysis unit 20 and the authentication processing unit 30.

[Physical Configuration]
A computer that realizes the voice-type authentication device by executing the program of this embodiment is now described with reference to FIG. 5. FIG. 5 is a block diagram showing an example of a computer that realizes the voice-type authentication device according to the embodiment of the present invention.

As shown in FIG. 5, the computer 110 includes the following components:

- a CPU (Central Processing Unit) 111;
- a main memory 112;
- a storage device 113;
- an input interface 114;
- a display controller 115;
- a data reader/writer 116;
- a communication interface 117.

These components are connected via a bus 121 so that they can exchange data with one another. The computer 110 may include a GPU (Graphics Processing Unit) or an FPGA (Field-Programmable Gate Array) in addition to, or instead of, the CPU 111.

The CPU 111 loads the program (code) of this embodiment from the storage device 113 into the main memory 112 and executes it in a predetermined order to perform various operations. The main memory 112 is typically a volatile storage device such as DRAM (Dynamic Random Access Memory). The program of this embodiment is provided stored on a computer-readable recording medium 120. It may also be distributed over the Internet, which the computer reaches through the communication interface 117.

Specific examples of the storage device 113 include a hard disk drive and semiconductor storage devices such as flash memory. The input interface 114 mediates data transfer between the CPU 111 and input devices 118 such as a keyboard and mouse. The display controller 115 is connected to a display device 119 and controls what is displayed on it.

The data reader/writer 116 mediates data transfer between the CPU 111 and the recording medium 120. It reads the program from the recording medium 120 and writes the processing results of the computer 110 to the recording medium 120. The communication interface 117 mediates data transfer between the CPU 111 and other computers.

Specific examples of the recording medium 120 include:

- general-purpose semiconductor storage devices such as CF (Compact Flash (registered trademark)) and SD (Secure Digital);
- magnetic recording media such as flexible disks;
- optical recording media such as CD-ROM (Compact Disk Read Only Memory).

The voice-type authentication device 100 of this embodiment can also be realized with hardware corresponding to each unit, instead of a computer on which the program is installed. Furthermore, part of the voice-type authentication device 100 may be realized by a program and the rest by hardware.

Part or all of the embodiment described above can be expressed as (Appendix 1) to (Appendix 19) below, but the embodiment is not limited to that description.

(Appendix 1)
A device for performing authentication processing using a user's voice, comprising:
a mask sound output unit that reproduces sound data of a mask sound that masks the user's voice, so that the mask sound overlaps the voice uttered by the user during authentication;
a voice analysis unit that acquires voice data in which the voice uttered by the user during authentication overlaps the mask sound, and that extracts voice data of the user's voice from the acquired voice data by using the sound data of the mask sound; and
an authentication processing unit that executes authentication processing using the extracted voice data of the user's voice.
A voice-type authentication device characterized by the above.

(Appendix 2)
The voice-type authentication device according to Appendix 1, wherein
the voice analysis unit determines, based on the acquired voice data and the sound data of the mask sound, whether the voice uttered by the user during authentication is a duplicated voice.
A voice-type authentication device characterized by the above.

(Appendix 3)
The voice-type authentication device according to Appendix 2, wherein
the voice analysis unit applies the following conditions:
sound data identical to the sound data of the mask sound can be extracted from the acquired voice data;
exactly one piece of sound data is extracted;
the volume level of the extracted sound data is within a predetermined range; and
the time at which the sound from which the extracted sound data originates was reproduced falls within a predetermined time window;
and determines that the voice is a duplicated voice unless all of the conditions are satisfied.
A voice-type authentication device characterized by the above.

(Appendix 4)
The voice-type authentication device according to any one of Appendices 1 to 3, wherein
the wavelength of the mask sound is set outside the human audible range.
A voice-type authentication device characterized by the above.

(Appendix 5)
The voice-type authentication device according to any one of Appendices 1 to 4, wherein
the mask sound output unit generates the sound data of the mask sound and reproduces the generated sound data.
A voice-type authentication device characterized by the above.

(Appendix 6)
The voice-type authentication device according to any one of Appendices 1 to 4, wherein
the mask sound output unit acquires pre-created sound data of the mask sound and reproduces the acquired sound data.
A voice-type authentication device characterized by the above.

(Appendix 7)
The voice-type authentication device according to any one of Appendices 1 to 6, wherein
the mask sound output unit, the voice analysis unit and the authentication processing unit are realized by hardware.
A voice-type authentication device characterized by the above.

(Appendix 8)
A method for performing authentication processing using a user's voice, comprising:
(a) a step of reproducing sound data of a mask sound that masks the user's voice, so that the mask sound overlaps the voice uttered by the user during authentication;
(b) a step of acquiring voice data in which the voice uttered by the user during authentication overlaps the mask sound, and extracting voice data of the user's voice from the acquired voice data by using the sound data of the mask sound; and
(c) a step of executing authentication processing using the extracted voice data of the user's voice.
A voice-type authentication method characterized by the above.

(Appendix 9)
The voice-type authentication method according to Appendix 8, further comprising:
(d) a step of determining, based on the voice data acquired in step (b) and the sound data of the mask sound, whether the voice uttered by the user during authentication is a duplicated voice.
A voice-type authentication method characterized by the above.

(Appendix 10)
The voice-type authentication method according to Appendix 9, wherein,
in step (d), the following conditions are applied:
sound data identical to the sound data of the mask sound can be extracted from the acquired voice data;
exactly one piece of sound data is extracted;
the volume level of the extracted sound data is within a predetermined range; and
the time at which the sound from which the extracted sound data originates was reproduced falls within a predetermined time window;
and the voice is determined to be a duplicated voice unless all of the conditions are satisfied.
A voice-type authentication method characterized by the above.

(Appendix 11)
The voice-type authentication method according to any one of Appendices 8 to 10, wherein
the wavelength of the mask sound is set outside the human audible range.
A voice-type authentication method characterized by the above.

(Appendix 12)
The voice-type authentication method according to any one of Appendices 8 to 11, wherein,
in step (a), the sound data of the mask sound is generated and the generated sound data is reproduced.
A voice-type authentication method characterized by the above.

(Appendix 13)
The voice-type authentication method according to any one of Appendices 8 to 11, wherein,
in step (a), pre-created sound data of the mask sound is acquired and the acquired sound data is reproduced.
A voice-type authentication method characterized by the above.

(Appendix 14)
A program for causing a computer to perform authentication processing using a user's voice, the program causing the computer to execute:
(a) a step of reproducing sound data of a mask sound that masks the user's voice, so that the mask sound overlaps the voice uttered by the user during authentication;
(b) a step of acquiring voice data in which the voice uttered by the user during authentication overlaps the mask sound, and extracting voice data of the user's voice from the acquired voice data by using the sound data of the mask sound; and
(c) a step of executing authentication processing using the extracted voice data of the user's voice.
A program characterized by the above.

(Appendix 15)
The program according to Appendix 14, further causing the computer to execute:
(d) a step of determining, based on the voice data acquired in step (b) and the sound data of the mask sound, whether the voice uttered by the user during authentication is a duplicated voice.
A program characterized by the above.

(Appendix 16)
The program according to Appendix 15, wherein,
in step (d), the following conditions are applied:
sound data identical to the sound data of the mask sound can be extracted from the acquired voice data;
exactly one piece of sound data is extracted;
the volume level of the extracted sound data is within a predetermined range; and
the time at which the sound from which the extracted sound data originates was reproduced falls within a predetermined time window;
and the voice is determined to be a duplicated voice unless all of the conditions are satisfied.
A program characterized by the above.

(Appendix 17)
The program according to any one of Appendices 14 to 16, wherein
the wavelength of the mask sound is set outside the human audible range.
A program characterized by the above.

(Appendix 18)
The program according to any one of Appendices 14 to 17, wherein,
in step (a), the sound data of the mask sound is generated and the generated sound data is reproduced.
A program characterized by the above.

(Appendix 19)
The program according to any one of Appendices 14 to 17, wherein,
in step (a), pre-created sound data of the mask sound is acquired and the acquired sound data is reproduced.
A program characterized by the above.
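Appendices 4, 11 and 17 set the mask sound outside the human audible range. Human hearing is commonly taken to extend up to roughly 20 kHz. That corresponds to a wavelength of about 17 mm in air, taking the speed of sound as 343 m/s. This sketch assumes a pure tone at a hypothetical 21 kHz, sampled at 48 kHz, which gives a Nyquist limit of 24 kHz. Whether a given loudspeaker and microphone actually reproduce and capture that band is an assumption to verify. The function name and amplitude are hypothetical.

```python
import math


def ultrasonic_mask(freq_hz: float, sample_rate: int, duration_s: float,
                    amplitude: float = 0.3) -> list[float]:
    """Sine tone intended to sit above the (roughly 20 kHz) upper limit of hearing."""
    if not 0 < freq_hz < sample_rate / 2:
        raise ValueError("frequency must be below the Nyquist limit (sample_rate / 2)")
    n = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]
```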

As described above, the present invention reduces the burden on the user while ensuring security in voice authentication. The present invention is useful for systems in which authentication data that must be kept secret, such as IDs and passwords, is input by voice.

10 Mask sound output unit
20 Voice analysis unit
21 Mask sound extraction unit
22 Mask sound comparison unit
23 User voice restoration unit
24 Duplicate voice determination unit
30 Authentication processing unit
31 Voice recognition unit
32 Authentication data matching unit
33 Authentication determination unit
34 Authentication data storage unit
40 Voice input device
50 Voice output device
100 Voice-type authentication device
110 Computer
111 CPU
112 Main memory
113 Storage device
114 Input interface
115 Display controller
116 Data reader/writer
117 Communication interface
118 Input device
119 Display device
120 Recording medium
121 Bus

Claims

A device for performing authentication processing using a user's voice, comprising:
a mask sound output unit that reproduces sound data of a mask sound that masks the user's voice, so that the mask sound overlaps the voice uttered by the user during authentication;
a voice analysis unit that (i) acquires voice data in which the voice uttered by the user during authentication overlaps the mask sound, (ii) determines, based on the acquired voice data and the sound data of the mask sound, whether the voice uttered by the user during authentication is a duplicated voice, and (iii) when the determination finds that the voice is not a duplicated voice, extracts voice data of the user's voice from the acquired voice data by using the sound data of the mask sound; and
an authentication processing unit that executes authentication processing using the extracted voice data of the user's voice,
wherein the voice analysis unit applies the following conditions:
sound data identical to the sound data of the mask sound can be extracted from the acquired voice data;
exactly one piece of sound data is extracted;
the volume level of the extracted sound data is within a predetermined range; and
the time at which the sound from which the extracted sound data originates was reproduced falls within a predetermined time window;
and determines that the voice is a duplicated voice unless all of the conditions are satisfied.
A voice-type authentication device characterized by the above.

The voice-type authentication device according to claim 1, wherein
the wavelength of the mask sound is set outside the human audible range.
A voice-type authentication device characterized by the above.

The voice-type authentication device according to claim 1 or 2, wherein
the mask sound output unit generates the sound data of the mask sound and reproduces the generated sound data.
A voice-type authentication device characterized by the above.

The voice-type authentication device according to claim 1 or 2, wherein
the mask sound output unit acquires pre-created sound data of the mask sound and reproduces the acquired sound data.
A voice-type authentication device characterized by the above.

The voice-type authentication device according to any one of claims 1 to 4, wherein
the mask sound output unit, the voice analysis unit and the authentication processing unit are realized by hardware.
A voice-type authentication device characterized by the above.

A method for performing authentication processing using a user's voice, comprising:
(a) a step of reproducing sound data of a mask sound that masks the user's voice, so that the mask sound overlaps the voice uttered by the user during authentication;
(b) a step of (i) acquiring voice data in which the voice uttered by the user during authentication overlaps the mask sound, (ii) determining, based on the acquired voice data and the sound data of the mask sound, whether the voice uttered by the user during authentication is a duplicated voice, and (iii) when the determination finds that the voice is not a duplicated voice, extracting voice data of the user's voice from the acquired voice data by using the sound data of the mask sound; and
(c) a step of executing authentication processing using the extracted voice data of the user's voice,
wherein, in step (b), the following conditions are applied:
sound data identical to the sound data of the mask sound can be extracted from the acquired voice data;
exactly one piece of sound data is extracted;
the volume level of the extracted sound data is within a predetermined range; and
the time at which the sound from which the extracted sound data originates was reproduced falls within a predetermined time window;
and the voice is determined to be a duplicated voice unless all of the conditions are satisfied.
A voice-type authentication method characterized by the above.

A program for causing a computer to perform authentication processing using a user's voice, the program causing the computer to execute:
(a) a step of reproducing sound data of a mask sound that masks the user's voice, so that the mask sound overlaps the voice uttered by the user during authentication;
(b) a step of (i) acquiring voice data in which the voice uttered by the user during authentication overlaps the mask sound, (ii) determining, based on the acquired voice data and the sound data of the mask sound, whether the voice uttered by the user during authentication is a duplicated voice, and (iii) when the determination finds that the voice is not a duplicated voice, extracting voice data of the user's voice from the acquired voice data by using the sound data of the mask sound; and
(c) a step of executing authentication processing using the extracted voice data of the user's voice,
wherein, in step (b), the following conditions are applied:
sound data identical to the sound data of the mask sound can be extracted from the acquired voice data;
exactly one piece of sound data is extracted;
the volume level of the extracted sound data is within a predetermined range; and
the time at which the sound from which the extracted sound data originates was reproduced falls within a predetermined time window;
and the voice is determined to be a duplicated voice unless all of the conditions are satisfied.
A program characterized by the above.