WO2020226019A1 - Voice authentication system - Google Patents
- Publication number
- WO2020226019A1 (PCT/JP2020/015735)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- authentication system
- recognition circuit
- people
- voice recognition
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
- G10L17/04—Speaker identification or verification techniques; Training, enrolment or model building
- G10L17/26—Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
Definitions
- The present invention relates to a voice authentication system, and more specifically to an authentication system that responds only when two or more specific registered persons speak in unison.
- As conventional authentication technology, there are various devices provided with a lock mechanism so that only a specific person can operate them. These lock mechanisms generally use a key or a card, or require entry of a personal identification number.
- Biometric authentication technology is attracting attention as a way to fundamentally solve such problems, and some of it has already been put into practical use. Biometric authentication uses biometric information, which comprises physical characteristics based on the human body and behavioral characteristics based on individual habits.
- Speaker recognition is a technique for estimating the speaker from a speech signal.
- As a device using speaker recognition, for example, a voice recognition device is described in Patent Document 1 below.
- When a word is spoken to the device, the voice recognition circuit checks whether the sound quality and the word match the voice registered in advance by the user; if they match, the control circuit activates a solenoid and releases the lock. If they do not match, the lock is not released.
- In conventional voice authentication systems, authentication relies on the utterance of a single specific person, so the lock can be released by covertly recording that person's voice. Moreover, recent progress in speech synthesis has improved techniques for artificially creating voices, so a more reliable security system has been desired.
- The present invention has been made in view of the above problems, and its object is to provide a voice authentication system that responds only when two or more specific registered persons speak in unison.
- The inventors focused on the d-vector, a feature extracted with a Deep Neural Network (DNN) that can directly represent the speaker space; methods using the d-vector have been proposed and reported to improve speaker verification performance. Believing that the d-vector can capture the characteristics of two-speaker simultaneous utterances better than MFCC, the inventors used the d-vector as the feature quantity, conducted speaker identification experiments on two-speaker simultaneous utterances with an HMM, attempted to improve recognition performance, and thereby completed the present invention.
- The voice authentication system according to one aspect of the present invention has a voice recognition circuit that registers and collates voices, a microphone that collects voices, a lock mechanism, and a control circuit that controls the lock mechanism based on the result of collating the voice collected by the microphone against the registered voices.
- the voice registered in the voice recognition circuit includes a composite voice of two or more people.
- The composite voice is a recording of substantially simultaneous utterances by at least two speakers.
- The voice recognition circuit has a speaker-number determination model that determines whether a voice is the substantially simultaneous utterance of at least two people, and a two-speaker utterance model that determines whether it is the substantially simultaneous utterance of the two target speakers.
- the voice recognition circuit has a synthetic voice discrimination model for determining whether or not synthetic voice is included.
- The voice recognition circuit has a tense-voice discrimination model that determines whether either of the two voices is tense.
- FIG. 1 is an example of a block diagram showing a voice authentication system configuration proposed by the inventor.
- Reference numeral 2 denotes a voice recognition circuit, to which the microphone 1 is connected.
- The voice recognition circuit 2 is composed of a voice recognition LSI or the like, and can register and collate the voice input from the microphone 1. It outputs a signal indicating whether registration succeeded at the time of voice registration, and whether the input voice matches the registered voice at the time of collation.
- Reference numeral 3 denotes a control circuit, which controls the voice registration and collation processing of the voice recognition circuit 2 and operates the lock mechanism 4 in response to an output signal from the voice recognition circuit 2.
- a registration button (not shown) and a collation button (not shown) are connected to the control circuit 3, and a user operates these to start a voice registration process or a collation process.
- Reference numeral 5 denotes a battery that serves as a power source for each of the above circuits.
- In this embodiment, the voice registered in the voice recognition circuit 2 includes a composite voice of two or more people: voice data (preferably several recordings) of two people simultaneously uttering the same phrase in unison. As adversarial voice data, it is also desirable to register in advance the utterance of only one of the two, an utterance in which one of the two is replaced by a different person, the utterance of an entirely different person, and two-speaker voice models variously superimposed on a computer.
- FIG. 2 is a flowchart showing an example of the unlocking control processing procedure of the voice recognition circuit 2 proposed by the inventors.
- As shown in FIG. 2, this method includes (1) a procedure (S1) for determining whether synthetic voice is included, (2) a procedure (S2) for determining whether the voice is the simultaneous utterance of two people, (3) a procedure (S3) for determining whether it is the simultaneous utterance of the two target speakers and whether it is the simultaneous utterance of a specific or arbitrary phrase, and (4) a procedure (S4) for determining whether either of the two voices is tense.
- Specifically, this method can be realized by recording a program implementing it on a recording medium, such as the hard disk of an information processing apparatus, and executing that program.
- That is, this method can be realized by recording, on the recording medium of an information processing apparatus, a program that causes the voice recognition circuit 2 to execute (1) the procedure (S1) for determining whether synthetic voice is included, (2) the procedure (S2) for determining whether the voice is the simultaneous utterance of two people, (3) the procedure (S3) for determining whether it is the simultaneous utterance of the two target speakers and of a specific or arbitrary phrase, and (4) the procedure (S4) for determining whether either of the two voices is tense, and then executing this program.
- In the following, the present method is described using, as an example, the case where the above program is recorded on the recording medium of an information processing apparatus and executed.
- This method has (1) a procedure (S1) for determining whether synthetic voice is included. This procedure is useful for rejecting covertly recorded voices and artificially synthesized voices. When this procedure is executed, if synthetic or recorded voice is included, that is, if anything other than a live human voice is included, authentication is rejected and the lock mechanism 4 is not unlocked.
- this method has (2) a procedure (S2) for determining whether the voices are simultaneously uttered by two people.
- This procedure is useful for a security system that authenticates a plurality of people (for example, two people) at the same time, and, for a more reliable security system, for confirming that the two people are present at the same time. When this procedure is executed, the utterance of one person alone, or of three or more people, is rejected and the lock mechanism 4 is not unlocked.
- This method has (3) a procedure (S3) for determining whether the voice is the simultaneous utterance of the two target speakers and whether it is the simultaneous utterance of a specific or arbitrary phrase.
- This procedure is also useful for a security system that authenticates a plurality of people (for example, two people) at the same time, and, for a more reliable security system, for confirming that the voice is the simultaneous utterance of the two registered people. When this procedure is executed, a voice is rejected and the lock mechanism 4 is not unlocked if only one of the two speakers matches a registered person, if neither matches, or if the two registered people are not uttering at the same time. It is further determined whether the phrase is correct; if the phrase is incorrect, the voice is rejected and the lock mechanism 4 is not unlocked.
- The number of speakers is not limited to two. That is, if the number of people sharing the information or assets is three, it may be determined whether the voice is the simultaneous utterance of three people; if four, whether it is the simultaneous utterance of four people. As the number of speakers increases, a more reliable security system can be constructed.
- This method has (4) a procedure (S4) for determining whether either of the two voices is tense. By refusing to unlock when even one person is in a tense state, this procedure prevents unlocking under threat. When this procedure is executed, if a tense voice is included, authentication is rejected and the lock mechanism 4 is not unlocked.
- The above shows a configuration having the procedures (S1), (S2), (S3), and (S4). However, if the aim is only to identify when the two registered specific persons utter in unison, only the procedures (S2) and (S3) may be used; if it is not necessary to identify the number of speakers, only the procedure (S3) may be used. The procedures (S1) and (S4) may then be added as appropriate, according to the use of the voice authentication system, to obtain a more reliable security system.
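The four checks described above can be sketched as a simple gate that unlocks only when every stage passes. The function names and signatures below are illustrative stand-ins for the trained discrimination models the patent describes, not an actual implementation.

```python
# Hypothetical sketch of the four-stage unlock decision (S1-S4).
# Each model is passed in as a callable; their internals are not
# specified here and would be trained models in a real system.

def authorize(voice,
              is_synthetic,        # S1: synthetic-voice discrimination model
              count_speakers,      # S2: speaker-number determination model
              matches_registered,  # S3: two-speaker utterance model
              is_tense):           # S4: tense-voice discrimination model
    """Return True (unlock) only if every check passes."""
    if is_synthetic(voice):                 # S1: reject recordings / TTS
        return False
    if count_speakers(voice) != 2:          # S2: must be exactly two speakers
        return False
    if not matches_registered(voice):       # S3: must be the registered pair
        return False
    if is_tense(voice):                     # S4: reject voices under duress
        return False
    return True
```

As the text notes, a system that only needs S2 and S3 would simply omit the other two checks.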
- the usage of the above voice authentication system is as follows.
- (I) The user first operates the registration button (not shown), and two or more people simultaneously utter the same specific or arbitrary phrase into the microphone 1 to register the voice in the voice recognition circuit 2.
- (II) To unlock the lock mechanism, the collation button (not shown) is operated, and the two or more persons registered in item (I) simultaneously utter the phrase registered in item (I) into the microphone 1; the lock mechanism is then unlocked.
- (III) If you want to change the registered voice, repeat item (I).
- a voice recognition circuit for registering and collating voices
- a microphone for collecting voices
- a lock mechanism and a control circuit that controls it based on the result of collating the voice collected by the microphone
- Since the voices registered in the voice recognition circuit include a composite voice of two or more people, it is possible to provide a voice authentication system that responds only when two or more specific persons speak in unison. A more reliable security system can therefore be configured.
- An authentication system can be configured on the premise that two or more people are present at the same time and are cooperative.
- A GMM-HMM with MFCC features was used for text-dependent speaker identification of two-speaker simultaneous utterances; however, the recognition accuracy was insufficient. This is thought to be because MFCC does not fully capture the characteristics of two-speaker simultaneous utterance.
- Therefore, a d-vector was extracted from the final intermediate (hidden) layer of a DNN that identifies speakers of two-speaker simultaneous utterances, and used as the feature quantity.
- An HMM, a speaker model robust to the utterance content, was used.
- A speaker identification DNN for two-speaker simultaneous utterance was constructed, and this DNN was used for d-vector extraction.
- Voice data was created by superimposing, on a computer, voices with the same utterance content from two different speakers in this data set. A voice created in this way is called a superposed voice.
- superimposition was performed like M001 and M002, M003 and M004, ..., M335 and M336.
- MIX001 to MIX118 were assigned as speaker numbers.
- the utterance number is the same as the above-mentioned data set.
- The contents of the superimposed voice data used in this database are shown in Table 2.
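The superposition step described above can be sketched as a sample-by-sample mix of two time-aligned waveforms, with speakers paired off as in the experiment. Plain Python lists stand in for audio arrays here; the scaling by 1/2 is an illustrative anti-clipping choice, not taken from the source.

```python
# A minimal sketch of creating a "superposed voice" by adding two
# speakers' waveforms. Real data would be audio arrays loaded from files.

def superimpose(wave_a, wave_b):
    """Mix two equal-length waveforms and rescale to avoid clipping."""
    if len(wave_a) != len(wave_b):
        raise ValueError("utterances must be time-aligned to the same length")
    return [(a + b) / 2.0 for a, b in zip(wave_a, wave_b)]

def make_pairs(speaker_ids):
    """Pair speakers as in the experiment: (M001, M002), (M003, M004), ..."""
    return list(zip(speaker_ids[0::2], speaker_ids[1::2]))
```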
- A 40-bin logarithmic MFB (mel filter bank) was extracted for each frame of an utterance, and a 280-dimensional vector combining the current frame with the 3 frames before and after it was used as the DNN input.
- The logarithmic MFB extraction conditions are shown in Table 4.
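The input construction above (40-bin log-MFB per frame, with 3 frames of left and right context stacked on, giving 7 × 40 = 280 dimensions) can be sketched as follows; the function name and list-based representation are illustrative.

```python
# Hypothetical sketch of context-frame stacking for the DNN input.
# frames: list of per-frame feature vectors (each a list of 40 values).

def stack_context(frames, context=3):
    """Concatenate each frame with its `context` left and right neighbours."""
    stacked = []
    for t in range(context, len(frames) - context):
        window = frames[t - context : t + context + 1]   # 2*context+1 frames
        stacked.append([x for frame in window for x in frame])
    return stacked
```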
- The DNN had three hidden layers of 700, 400, and 100 units.
- The final layer corresponds to the identification classes.
- As the activation functions, ReLU was used in the hidden layers and softmax in the final layer.
- The mini-batch size during training was 100, and the number of epochs was 100.
- a speaker identification DNN was constructed for simultaneous two-party utterances.
- the error rate of each DNN was 0.78% for DNN1, 0.29% for DNN2, and 1.50% for DNN3.
- Each d-vector is extracted from the third (final) hidden layer of the DNN described above.
- the d-vectors extracted from DNN1 to DNN3 will be referred to as d-vector1, d-vector2 and d-vector3, respectively.
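The architecture just described (280-dimensional log-MFB input, hidden layers of 700-400-100 units with ReLU, softmax output, d-vector taken from the last hidden layer) can be sketched as a plain-NumPy forward pass. The random weights and the 336-class output size are placeholders for illustration; a real system would train the weights and size the output to its speaker classes.

```python
import numpy as np

# Hedged sketch of the speaker-identification DNN and d-vector extraction.
rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

layer_sizes = [280, 700, 400, 100]           # input + three hidden layers
weights = [rng.standard_normal((m, n)) * 0.01
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
out_weight = rng.standard_normal((100, 336)) * 0.01  # 336 classes (assumed)

def forward(x):
    """Return (class posteriors, last hidden activation used as d-vector)."""
    h = x
    for w in weights:
        h = relu(h @ w)
    d_vector = h                              # 100-dim final hidden layer
    return softmax(d_vector @ out_weight), d_vector

probs, d_vec = forward(rng.standard_normal(280))
```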
- A superimposed voice was created from single utterances of the same content by two different speakers. In this embodiment, this is called a superimposed utterance and is treated as a pseudo simultaneous utterance. This makes it possible to perform speaker identification experiments on two-speaker simultaneous utterances with a large amount of data.
- The contents of the created superimposed utterances are shown in Table 6.
- d-vectors were extracted using the DNN constructed under the conditions in Table 3 and used as features.
- A GMM-HMM was trained on each speaker's features and used as the speaker model. When training the HMM, flooring was applied so that each variance was at least 0.5.
- For the conventional method, 39-dimensional MFCC features were used. The speaker model is an HMM under the same conditions as the proposed method, except that no variance flooring is applied for MFCC.
- The construction conditions for the GMM-HMM are shown in Table 8.
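The variance "flooring" mentioned above can be sketched as a one-line clamp: any Gaussian variance below the floor (0.5 in the experiment) is raised to the floor to keep the model numerically stable. The function name is illustrative.

```python
# Hypothetical sketch of HMM variance flooring with a floor of 0.5.

def floor_variances(variances, floor=0.5):
    """Raise any variance below `floor` up to `floor`."""
    return [max(v, floor) for v in variances]
```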
- Cosine similarity was used for score calculation, and the speaker with the highest score was used as the identification result.
- the database was divided into five, four were learned, and one was evaluated by five-fold cross-validation.
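The scoring step above (cosine similarity against each enrolled speaker's reference vector, taking the highest-scoring speaker as the result) can be sketched as follows; the dictionary-based enrollment and function names are assumptions for illustration.

```python
import math

# Minimal sketch of cosine-similarity speaker identification.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def identify(test_vec, enrolled):
    """enrolled: dict of speaker id -> reference d-vector.
    Return the speaker with the highest cosine score."""
    return max(enrolled, key=lambda spk: cosine(test_vec, enrolled[spk]))
```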
- Error analysis was performed for each d-vector.
- Errors were classified by comparing the number of speakers in the input voice with that in the identification result, according to whether the misrecognition involved the same or a different number of speakers.
- The contents of the error classification are shown in Table 12.
- d-vector3 produced the fewest misrecognitions involving a different number of speakers in the single + superimposed DB, at 6 instances.
- d-vector3 is the d-vector extracted from the DNN trained on a data set containing both single utterances and superimposed utterances. This suggests that using both single and superimposed utterances as DNN training data makes the extracted d-vector a feature well suited to classifying the number of speakers.
- The false recognition rate in the single + simultaneous DB was extremely low, confirming that a voice authentication system that responds only when two or more specific registered persons utter in unison can be realized.
- the present invention can be industrially used as a voice authentication system that responds only when two or more registered specific persons utter in unison.
Abstract
The present invention provides a voice authentication system that responds only when two or more specific registered persons speak in unison. The present invention provides a voice authentication system including a voice recognition circuit for voice registration and verification, a microphone for collecting voices, a locking mechanism, and a control circuit for controlling the locking mechanism on the basis of a verification result obtained by voice verification by the voice recognition circuit for voices collected by the microphone, the voice authentication system being characterized in that the voices registered in the voice recognition circuit include a composite voice of two or more persons. The voice authentication system is further characterized in that the composite voice is obtained by recording substantially simultaneous utterances from at least two or more speakers.
Description
The present invention relates to a voice authentication system, and more specifically to an authentication system that responds only when two or more specific registered persons speak in unison.
As conventional authentication technology, there are various devices provided with a lock mechanism so that only a specific person can operate them. These lock mechanisms generally use a key or a card, or require entry of a personal identification number.
However, with a conventional device provided with a lock mechanism, the user must insert a key or card, or enter a personal identification number, each time the lock mechanism is to be operated, which is troublesome. In addition, if someone other than the user obtains the key or card, or learns the personal identification number, the lock mechanism can easily be operated by that person.
Biometric authentication technology is attracting attention as a way to fundamentally solve such problems, and some of it has already been put into practical use. Biometric authentication uses biometric information, which comprises physical characteristics based on the human body and behavioral characteristics based on individual habits. Speaker recognition (a technique for estimating the speaker from a speech signal) can use both physical characteristics, which depend on the structure of the vocal organs, and behavioral characteristics, which depend on speaking habits, and is therefore considered difficult for others to imitate.
As a device using this speaker recognition, for example, a voice recognition device is described in Patent Document 1 below. When a word is spoken to the device, the voice recognition circuit checks whether the sound quality and the word match the voice registered in advance by the user; if they match, the control circuit activates a solenoid and releases the lock. If they do not match, the lock is not released.
As another authentication need, there is growing demand for authenticating multiple people (two or more) at the same time. For example, in a smartphone album application, a function is desired that allows two people to view their private photos and videos only when both are physically present at the same time, and not when either is alone. As an example of a device activated by a specific voice, the "flying stone" that appears in "Laputa: Castle in the Sky" (a registered trademark of Studio Ghibli Inc.) is sold as a toy, but the version currently on sale responds to anyone who says "Balse". If it could instead be configured to respond only when two specific registered people speak in unison, it would come closer to the world of "Laputa: Castle in the Sky".
In conventional voice authentication systems, authentication relies on the utterance of a single specific person, so the lock can be released by covertly recording that person's voice; moreover, recent progress in speech synthesis has improved techniques for artificially creating voices. A more reliable security system has therefore been desired.
In addition, there has been the problem that no authentication system exists which presupposes that two or more people are present at the same time and that those people are cooperative.
The present invention has been made in view of the above problems, and its object is to provide a voice authentication system that responds only when two or more specific registered persons speak in unison.
As a result of diligent study to solve the above problems, the present inventors focused on the d-vector, a feature extracted with a Deep Neural Network (DNN) that can directly represent the speaker space; methods using it have been proposed and reported to improve speaker verification performance. Believing that the d-vector can capture the characteristics of two-speaker simultaneous utterances better than MFCC, the inventors used the d-vector as the feature quantity and conducted speaker identification experiments on two-speaker simultaneous utterances with an HMM, attempting to improve recognition performance, and thereby completed the present invention.
The voice authentication system according to one aspect of the present invention is a system having a voice recognition circuit that registers and collates voices, a microphone that collects voices, a lock mechanism, and a control circuit that controls the lock mechanism based on the collation result obtained by collating, in the voice recognition circuit, the voice collected by the microphone, wherein the voice registered in the voice recognition circuit includes a composite voice of two or more people.
Furthermore, the composite voice is a recording of substantially simultaneous utterances by at least two speakers.
Furthermore, the voice recognition circuit has a speaker-number determination model that determines whether a voice is the substantially simultaneous utterance of at least two people, and a two-speaker utterance model that determines whether it is the substantially simultaneous utterance of the two target speakers.
Furthermore, the voice recognition circuit has a synthetic voice discrimination model that determines whether synthetic voice is included.
Furthermore, the voice recognition circuit has a tense-voice discrimination model that determines whether either of the two voices is tense.
According to the present invention, there is the advantage that a voice authentication system can be provided which responds only when two or more specific registered persons speak in unison.
Embodiments of the present invention are described below. The scope of the present invention is not bound by these descriptions, and the invention can be modified and implemented as appropriate, beyond the following examples, without departing from its gist.
FIG. 1 is an example of a block diagram showing the voice authentication system configuration proposed by the inventor. Reference numeral 2 denotes a voice recognition circuit, to which the microphone 1 is connected. The voice recognition circuit 2 is composed of a voice recognition LSI or the like and can register and collate the voice input from the microphone 1; it outputs a signal indicating whether registration succeeded at the time of voice registration, and whether the input voice matches the registered voice at the time of collation. Reference numeral 3 denotes a control circuit, which controls the voice registration and collation processing of the voice recognition circuit 2 and operates the lock mechanism 4 in response to the output signal from the voice recognition circuit 2. A registration button (not shown) and a collation button (not shown) are connected to the control circuit 3; a user operates these to start the voice registration or collation process. Reference numeral 5 denotes a battery that powers each of the above circuits.
In this embodiment, the voice registered in the voice recognition circuit 2 includes a composite voice of two or more people: voice data (preferably several recordings) of two people simultaneously uttering the same phrase in unison. As adversarial voice data, it is also desirable to register in advance the utterance of only one of the two, an utterance in which one of the two is replaced by a different person, the utterance of an entirely different person, and two-speaker voice models variously superimposed on a computer.
FIG. 2 is a flowchart showing an example of the unlocking control procedure of the voice recognition circuit 2 proposed by the inventors. As shown in FIG. 2, the method includes (1) a step (S1) of determining whether synthetic speech is included, (2) a step (S2) of determining whether the input is the simultaneous utterance of two people, (3) a step (S3) of determining whether the input is the simultaneous utterance of the two target speakers and of the specific or arbitrary phrase, and (4) a step (S4) of determining whether either of the two voices is tense.
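The four steps (S1-S4) above form a serial gate: the input is rejected, and the lock stays closed, as soon as any one check fails. A minimal sketch of that control flow follows; the `check_*` lambdas and the `voice` dictionary fields are hypothetical stand-ins, since the patent does not specify how each discrimination model is implemented.

```python
# Hypothetical sketch of the serial unlock pipeline of Fig. 2 (S1-S4).
# Each lambda stands in for one discrimination model; the attribute
# names on the "voice" dict are illustrative assumptions.

def unlock_decision(voice, checks):
    """Return (True, None) to unlock only if every check accepts the input;
    otherwise return (False, name_of_failed_step)."""
    for name, check in checks:
        if not check(voice):
            return False, name       # rejected at this step: stay locked
    return True, None                # all gates passed: unlock

checks = [
    ("S1: no synthetic speech",  lambda v: not v["synthetic"]),
    ("S2: exactly two speakers", lambda v: v["num_speakers"] == 2),
    ("S3: both target speakers, correct phrase",
                                 lambda v: v["speakers"] == {"A", "B"} and v["phrase_ok"]),
    ("S4: neither voice tense",  lambda v: not v["tense"]),
]

sample = {"synthetic": False, "num_speakers": 2,
          "speakers": {"A", "B"}, "phrase_ok": True, "tense": False}
print(unlock_decision(sample, checks))   # (True, None) -> unlock
```

Because the gates are ordered, cheap checks (speaker count) can run before expensive ones (target-speaker verification), which matches the order of the flowchart.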
Specifically, the method can be realized by recording a program that implements it on a recording medium, such as a hard disk, of an information processing device and executing that program.
That is, the method can be realized by recording, on the recording medium of the information processing device, a program that causes the voice recognition circuit 2 to execute (1) the step (S1) of determining whether synthetic speech is included, (2) the step (S2) of determining whether the input is the simultaneous utterance of two people, (3) the step (S3) of determining whether the input is the simultaneous utterance of the two target speakers and of the specific or arbitrary phrase, and (4) the step (S4) of determining whether either of the two voices is tense, and then executing this program.
Hereinafter, the present embodiment is described as an example in which the above program is recorded on the recording medium of the information processing device and the method is realized by executing it.
(Synthetic speech discrimination model)
First, the method includes (1) the step (S1) of determining whether synthetic speech is included. This step is useful for rejecting voice recorded without permission and voice produced by artificial speech synthesis. When this step is executed, the input is rejected and the lock mechanism 4 remains locked if synthetic voice or recorded voice is included, that is, if anything other than live human utterance is included.
(Speaker number discrimination model)
The method also includes (2) the step (S2) of determining whether the input is the simultaneous utterance of two people. This step is useful, both as a security system that authenticates a plurality of people (for example, two) at once and as a more reliable security system, for confirming that two people are present at the same time. When this step is executed, the input is rejected and the lock mechanism 4 remains locked if it is the utterance of one person or of three or more people.
(Two-speaker utterance model)
The method also includes (3) the step (S3) of determining whether the input is the simultaneous utterance of the two target speakers and of the specific or arbitrary phrase. This step, too, is useful, both as a security system that authenticates a plurality of people (for example, two) at once and as a more reliable security system, for confirming that the input is the simultaneous utterance of the two registered people. When this step is executed, the input is rejected and the lock mechanism 4 remains locked if, given two uttered voices, one of the two differs from a registered person, both differ from the registered people, or the two registered people are not uttering simultaneously. It is also determined whether the phrase is correct; if the phrase is incorrect, the input is rejected and the lock mechanism 4 remains locked.
In steps (S2, S3), the number of speakers is not limited to two. That is, if three people share the information or assets, the system may determine whether the input is the simultaneous utterance of three people; if four, the simultaneous utterance of four. Moreover, the more speakers there are, the more reliable the security system that can be constructed.
(Tense-voice discrimination model)
The method also includes (4) the step (S4) of determining whether either of the two voices is tense. By refusing to unlock when even one speaker is in a tense state, this step makes unlocking under coercion impossible. When this step is executed, the input is rejected and the lock mechanism 4 remains locked if a tense voice is included.
With the above configuration, authentication (unlocking) occurs in use only when the two people utter the specific phrase substantially simultaneously. The utterance of one of the two alone, the utterance of one of the two in unison with a stranger, and the utterance of the specific phrase by one or two strangers are all rejected, as is an incorrect phrase. Furthermore, even when the two utter the specific phrase simultaneously, the input is rejected if even one speaker is judged to be highly tense.
The embodiment above has been shown with steps (S1), (S2), (S3), and (S4); however, if the system need only recognize when the two specific registered people speak in unison, steps (S2) and (S3) alone suffice, and if the number of speakers need not be determined, step (S3) alone may be used. Steps (S1) and (S4) may then be added as appropriate, according to the application of the voice authentication system, to obtain a more reliable security system.
The voice authentication system above is used as follows. (I) The user first operates the registration button (not shown), and two or more people simultaneously utter the same specific or arbitrary phrase into the microphone 1 to register the voice in the voice recognition circuit 2. (II) To release the lock mechanism, the verification button (not shown) is operated; if the two or more registered people simultaneously utter into the microphone 1 the phrase registered in step (I), the lock mechanism is unlocked. (III) To change the registered voice, step (I) is repeated.
In the present embodiment configured as above, a voice authentication system comprising a voice recognition circuit that registers and verifies voice, a microphone that captures voice, a lock mechanism, and a control circuit that controls the lock mechanism based on the verification result obtained by verifying the captured voice with the voice recognition circuit is characterized in that the voice registered in the voice recognition circuit includes a composite voice of two or more people. This provides a voice authentication system that responds only when the two or more specific registered people speak in unison, and therefore has the effect of enabling a more reliable security system. It also enables an authentication system for handling shared information and assets that presupposes both that two or more people are present at the same time and that they are cooperating.
Hereinafter, the present invention is described in more detail with reference to Examples, but the present invention is not limited thereto.
Conventionally, text-dependent speaker identification of simultaneous two-speaker utterances has been performed with a GMM-HMM using MFCC features. The recognition accuracy, however, was insufficient, presumably because MFCCs do not fully capture the characteristics of simultaneous two-speaker speech.
In the proposed method, therefore, a d-vector is extracted from the final hidden layer of a DNN that performs speaker identification on simultaneous two-speaker utterances and is used as the feature. In addition, since this Example performs text-dependent speaker identification, an HMM is used as the speaker model because it is robust with respect to utterance content.
In this Example, a speaker identification DNN targeting simultaneous two-speaker utterances was built and used for d-vector extraction. The construction of this DNN is described here.
For the DNN training data, speech recorded with an air-conduction microphone from the "large-scale speaker bone-conduction speech database" created by the National Research Institute of Police Science was used. In the experiment, speaker numbers M001 to M336 were assigned to the speakers, and utterance numbers A01 to A50 to the utterance contents. The contents of this speech data set are shown in <Table 1>.
In addition, voice data was created by superimposing, on a computer, the voices of two different speakers in this data set uttering the same content. In this Example, speech created by superposition on a computer in this way is called superimposed speech. In this experiment, speakers were paired as M001 with M002, M003 with M004, ..., M335 with M336, and speaker numbers MIX001 to MIX118 were assigned; the utterance numbers are the same as in the data set above. The contents of the superimposed voice data used are shown in <Table 2>.
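Superimposed speech of this kind can be produced by sample-wise addition of two time-aligned waveforms of the same phrase. The sketch below illustrates the idea; the 0.5 gain applied to avoid clipping is our assumption, not something stated in the source.

```python
# Sketch of producing "superimposed speech" on a computer: sample-wise
# addition of two equal-length waveforms of the same phrase.
# The 0.5 gain (to keep the sum in range) is an assumption.

def superimpose(wave_a, wave_b, gain=0.5):
    if len(wave_a) != len(wave_b):
        raise ValueError("waveforms must be time-aligned and equal length")
    return [gain * (a + b) for a, b in zip(wave_a, wave_b)]

m001 = [0.2, -0.1, 0.4, 0.0]       # toy waveform for speaker M001
m002 = [0.1,  0.3, -0.2, 0.2]      # toy waveform for speaker M002
mix001 = superimpose(m001, m002)   # toy "MIX001" superimposed speech
print(mix001)                      # values close to [0.15, 0.1, 0.1, 0.1]
```

In the experiment this pairing is applied across the whole corpus (M001+M002, M003+M004, ...), so each mixed "speaker" inherits the utterance numbers of its sources.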
Using these two data sets, the speaker identification DNN for simultaneous two-speaker utterances was trained and evaluated. The DNN was trained on 5 sentences from each speaker and evaluated on the remaining 45. Three DNNs were built; the DNNs and the data sets used for their training and evaluation are shown in <Table 3>.
From each utterance, a 40-bin log mel filter-bank (log-MFB) vector was extracted per frame, and the 3 preceding and 3 following frames were concatenated with it to form a 280-dimensional log-MFB input to the DNN. The log-MFB extraction conditions are shown in <Table 4>.
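The splicing step above can be sketched as follows: each 40-dimensional frame is concatenated with a ±3-frame context window, giving 40 × 7 = 280 dimensions. The edge-padding scheme (repeating the first/last frame) is our assumption; the source does not say how utterance boundaries are handled.

```python
# Sketch of frame splicing: concatenate each 40-dim log-MFB frame with
# its 3 preceding and 3 following frames -> 40 * 7 = 280 dims per input.
# Edge frames are clamped (repeat first/last frame), an assumed scheme.

def splice(frames, context=3):
    n = len(frames)
    out = []
    for t in range(n):
        spliced = []
        for k in range(t - context, t + context + 1):
            k = min(max(k, 0), n - 1)      # clamp at utterance edges
            spliced.extend(frames[k])
        out.append(spliced)
    return out

utterance = [[float(t)] * 40 for t in range(10)]   # 10 frames x 40 bins
inputs = splice(utterance)
print(len(inputs), len(inputs[0]))   # 10 280
```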
The DNN had three hidden layers of 700, 400, and 100 units, and the final layer outputs the identification classes. ReLU was used as the activation function in the hidden layers, and the softmax function in the final layer. Training used a mini-batch size of 100 and 100 epochs.
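The forward pass of such a network (ReLU hidden layers, softmax output) can be sketched in a few lines. The layer widths below are shrunk from 280-700-400-100 to toy sizes so the example runs instantly, and the weights are random dummies rather than trained values; the last hidden layer's activations play the role of the d-vector described later.

```python
# Toy forward pass through an MLP of the described shape (ReLU hidden
# layers, softmax output). Sizes are reduced stand-ins for
# 280 -> 700 -> 400 -> 100 -> classes; weights are random, not trained.
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    m = max(v)                      # subtract max for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def affine(v, w, b):
    return [sum(vi * wij for vi, wij in zip(v, row)) + bj
            for row, bj in zip(w, b)]

def forward(x, layers):
    """Return (class posteriors, list of hidden-layer activations)."""
    hidden = []
    for i, (w, b) in enumerate(layers):
        x = affine(x, w, b)
        if i < len(layers) - 1:
            x = relu(x)
            hidden.append(x)        # last entry acts as the d-vector
    return softmax(x), hidden

random.seed(0)
def rand_layer(n_in, n_out):
    return ([[random.uniform(-0.5, 0.5) for _ in range(n_in)]
             for _ in range(n_out)],
            [0.0] * n_out)

sizes = [8, 7, 4, 3, 2]             # toy stand-in for the real widths
layers = [rand_layer(a, b) for a, b in zip(sizes, sizes[1:])]
post, hidden = forward([0.1] * 8, layers)
d_vector = hidden[-1]               # activations of the last hidden layer
print(len(d_vector), round(sum(post), 6))   # 3 1.0
```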
Under these conditions, speaker identification DNNs targeting simultaneous two-speaker utterances were built. The error rates were 0.78% for DNN1, 0.29% for DNN2, and 1.50% for DNN3. A d-vector is extracted from the third hidden layer of each DNN; hereinafter the d-vectors extracted from DNN1 to DNN3 are called d-vector1, d-vector2, and d-vector3, respectively.
Next, the experiment evaluating the speaker identification performance of the proposed method on simultaneous two-speaker utterances is described. The proposed method was compared against two baselines: the conventional GMM-HMM using MFCCs, and an i-vector system.
The database used in this Example contained newly recorded speech data in addition to the speech data used in previous research. A recording by a single speaker is called a single utterance, and a recording of two speakers speaking simultaneously is called a simultaneous utterance. The contents of the speech data used are shown in <Table 5>.
Superimposed speech was also created from single utterances of the same content by two different speakers. In this Example this is called a superimposed utterance and is treated as a pseudo simultaneous utterance; it makes it possible to run two-speaker identification experiments on a large amount of data. The contents of the superimposed utterances are shown in <Table 6>.
From these speech data sets, two databases were created: a "single + simultaneous DB" composed of single and simultaneous utterances, and a "single + superimposed DB" composed of single and superimposed utterances. The contents of these databases are shown in <Table 7>.
In the proposed method, d-vectors were extracted with the DNNs of <Table 3> and used as features, and a GMM-HMM was trained on each speaker's features as the speaker model. When training the HMM, flooring was applied so that every variance was at least 0.5. The conventional method used 39-dimensional MFCCs as features with an HMM speaker model under the same conditions as the proposed method, except that no variance flooring was applied for the MFCCs. The GMM-HMM conditions are shown in <Table 8>.
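The variance-flooring step mentioned above is a one-line guard: any estimated Gaussian variance below the floor (0.5 here) is raised to the floor, so no component becomes degenerately narrow. A minimal sketch:

```python
# Sketch of the variance flooring used when training the GMM-HMM:
# variances below 0.5 are raised to 0.5.

FLOOR = 0.5

def floor_variances(variances, floor=FLOOR):
    return [max(v, floor) for v in variances]

estimated = [0.02, 0.7, 0.49, 1.3]          # toy per-dimension variances
print(floor_variances(estimated))           # [0.5, 0.7, 0.5, 1.3]
```

Flooring matters here because d-vectors from a well-trained DNN can have near-zero variance in some dimensions, which would otherwise dominate the Gaussian likelihoods.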
For the i-vector speaker identification experiment, the UBM used to extract i-vectors was trained on the "large-scale bone-conduction speech database" used for DNN training. The contents of the data set used are shown in <Table 9>.
Cosine similarity was used for scoring, and the speaker with the highest score was taken as the identification result. In this experiment, the database was split into five parts and evaluated by five-fold cross-validation, training on four parts and evaluating on one.
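The scoring rule above can be sketched directly: compute the cosine similarity between the test vector and each enrolled speaker's vector, then identify the speaker with the maximum score. The enrolled vectors below are illustrative dummies.

```python
# Sketch of the identification rule: cosine similarity against each
# enrolled speaker vector, argmax as the identification result.
# Enrolled vectors are toy values, not real i-vectors/d-vectors.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def identify(test_vec, enrolled):
    return max(enrolled, key=lambda spk: cosine(test_vec, enrolled[spk]))

enrolled = {"M001":   [1.0, 0.0, 0.2],
            "M002":   [0.1, 1.0, 0.0],
            "MIX001": [0.7, 0.7, 0.1]}
print(identify([0.9, 0.1, 0.15], enrolled))   # M001
```

Cosine scoring is length-invariant, so only the direction of the speaker vector matters, which is why it is the standard back-end for i-vector comparison.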
The misidentification rates [%] of each method on the single + simultaneous DB and the single + superimposed DB are shown in <Table 10> and <Table 11>.
On both databases, every d-vector achieves a lower misidentification rate than the conventional i-vector and MFCC methods. Among the d-vectors, d-vector2 performs best on the single + simultaneous DB, while on the single + superimposed DB, d-vector2 performs best for the utterance content "Bals" and d-vector3 for the utterance content "open sesame".
An error analysis was performed for each d-vector. Errors were classified by the number of speakers in the input speech and in the identification result: whether the input was misidentified as speech with the same number of speakers or with a different number of speakers. The error classification is shown in <Table 12>.
The results of classifying the errors of d-vector1, d-vector2, and d-vector3 in this way are shown in <Table 13> to <Table 15>.
On the single + simultaneous DB, the error counts are small and show almost no difference. On the single + superimposed DB, all misidentifications within the same speaker count are 2-to-2 errors for every d-vector. Since the number of 2-to-2 errors is proportional to the overall misidentification rate, reducing them is expected to improve the overall rate.
It can also be seen that d-vector3 has the fewest misidentifications to a different speaker count on the single + superimposed DB, at six. d-vector3 is the d-vector extracted from the DNN trained on a data set containing both single and superimposed speech. This suggests that training the DNN on both single and superimposed utterances yields d-vectors well suited to classifying the number of speakers.
Speaker identification experiments on simultaneous two-speaker utterances were conducted with the various d-vectors extracted from different training data sets. The results confirm that speaker identification performance improves over the conventional MFCC-based method.
As shown above, and in particular in <Table 10>, the misidentification rate on the single + simultaneous DB is extremely low, confirming that a voice authentication system that responds only when the two or more specific registered people speak in unison can be realized.
The present invention is industrially applicable as a voice authentication system that responds only when two or more specific registered people speak in unison.
1 Microphone
2 Voice recognition circuit
3 Control circuit
4 Lock mechanism
5 Power supply
Claims (5)
- A voice authentication system comprising: a voice recognition circuit that registers and verifies voice; a microphone that captures the voice; a lock mechanism; and a control circuit that controls the lock mechanism based on the verification result obtained by verifying the voice captured by the microphone with the voice recognition circuit, wherein the voice registered in the voice recognition circuit includes a composite voice of two or more people.
- The voice authentication system according to claim 1, wherein the composite voice is a recording of substantially simultaneous utterances by at least two speakers.
- The voice authentication system according to claim 1 or 2, wherein the voice recognition circuit has a speaker number discrimination model that determines whether an input is the substantially simultaneous utterance of at least two people, and a two-speaker utterance model that determines whether it is the substantially simultaneous utterance of the two target speakers.
- The voice authentication system according to claim 3, wherein the voice recognition circuit has a synthetic speech discrimination model that determines whether synthetic speech is included.
- The voice authentication system according to claim 3 or 4, wherein the voice recognition circuit has a tense-voice discrimination model that determines whether either of the two voices is tense.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2019088771A JP6833216B2 (en) | 2019-05-09 | 2019-05-09 | Voice authentication system |
JP2019-088771 | 2019-05-09 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020226019A1 true WO2020226019A1 (en) | 2020-11-12 |
Family
ID=73044670
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2020/015735 WO2020226019A1 (en) | 2019-05-09 | 2020-04-07 | Voice authentication system |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP6833216B2 (en) |
WO (1) | WO2020226019A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023037429A1 (en) * | 2021-09-08 | 2023-03-16 | 日本電気株式会社 | Authentication device, authentication method, and recording medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002073830A (en) * | 2000-08-25 | 2002-03-12 | Fujitsu Ltd | Commerce information distribution system |
JP2010237364A (en) * | 2009-03-31 | 2010-10-21 | Oki Electric Ind Co Ltd | Device, method and program for discrimination of synthesized speech |
JP2017151759A (en) * | 2016-02-25 | 2017-08-31 | Necフィールディング株式会社 | Authentication device, authentication method and program |
CN109360315A (en) * | 2018-10-25 | 2019-02-19 | 赵琦伟 | A kind of security protection system |
-
2019
- 2019-05-09 JP JP2019088771A patent/JP6833216B2/en active Active
-
2020
- 2020-04-07 WO PCT/JP2020/015735 patent/WO2020226019A1/en active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002073830A (en) * | 2000-08-25 | 2002-03-12 | Fujitsu Ltd | Commerce information distribution system |
JP2010237364A (en) * | 2009-03-31 | 2010-10-21 | Oki Electric Ind Co Ltd | Device, method and program for discrimination of synthesized speech |
JP2017151759A (en) * | 2016-02-25 | 2017-08-31 | Necフィールディング株式会社 | Authentication device, authentication method and program |
CN109360315A (en) * | 2018-10-25 | 2019-02-19 | 赵琦伟 | A kind of security protection system |
Non-Patent Citations (1)
Title |
---|
ZHANG JILIANG, TAN XIAO, WANG XIANGQI, YAN AIBIN, QIN ZHENG: "T2FA: Transparent Two- Factor Authentication", IEEE ACCESS (VOLUME: 6, 15 June 2018 (2018-06-15), pages 32677 - 32686, XP055760248 * |
Also Published As
Publication number | Publication date |
---|---|
JP2020184032A (en) | 2020-11-12 |
JP6833216B2 (en) | 2021-02-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chen et al. | Robust deep feature for spoofing detection—The SJTU system for ASVspoof 2015 challenge | |
Liu et al. | Deep feature for text-dependent speaker verification | |
ES2883326T3 (en) | End-to-end speaker recognition using a deep neural network | |
Larcher et al. | The RSR2015: Database for text-dependent speaker verification using multiple pass-phrases | |
US8209174B2 (en) | Speaker verification system | |
US9489950B2 (en) | Method and system for dual scoring for text-dependent speaker verification | |
Bredin et al. | Audio-visual speech synchrony measure for talking-face identity verification | |
Naika | An overview of automatic speaker verification system | |
Ding et al. | A method to integrate GMM, SVM and DTW for speaker recognition | |
CN107481736A (en) | A kind of vocal print identification authentication system and its certification and optimization method and system | |
Chakroun et al. | Robust text-independent speaker recognition with short utterances using Gaussian mixture models | |
WO2020226019A1 (en) | Voice authentication system | |
Folorunso et al. | A review of voice-base person identification: state-of-the-art | |
Chakroun et al. | Improving text-independent speaker recognition with GMM | |
Liu et al. | Speaker-utterance dual attention for speaker and utterance verification | |
Revathi et al. | Person authentication using speech as a biometric against play back attacks | |
JP6682007B2 (en) | Electronic device, electronic device control method, and electronic device control program | |
Larcher et al. | Imposture classification for text-dependent speaker verification | |
Martsyshyn et al. | Technology of speaker recognition of multimodal interfaces automated systems under stress | |
Ly-Van et al. | Signature with text-dependent and text-independent speech for robust identity verification | |
Ertaş | Fundamentals of speaker recognition | |
Chen et al. | Forensic identification for electronic disguised voice based on supervector and statistical analysis | |
Akingbade et al. | Voice-based door access control system using the mel frequency cepstrum coefficients and gaussian mixture model | |
Chetty | Biometric liveness detection based on cross modal fusion | |
Gupta et al. | Text dependent voice based biometric authentication system using spectrum analysis and image acquisition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20801444 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20801444 Country of ref document: EP Kind code of ref document: A1 |