TWI832552B - Speaker identification system based on meta-learning applied to real-time short sentences in an open set environment - Google Patents


Info

Publication number: TWI832552B
Authority: TW (Taiwan)
Prior art keywords: speaker, registered, tester, vector, identification system
Application number: TW111143218A
Other languages: Chinese (zh)
Other versions: TW202420291A (en)
Inventors: 許正欣, 呂政軒
Original Assignee: 國立雲林科技大學
Application filed by 國立雲林科技大學
Priority to TW111143218A priority Critical patent/TWI832552B/en
Application granted
Publication of TWI832552B publication Critical patent/TWI832552B/en
Publication of TW202420291A publication Critical patent/TW202420291A/en

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention trains a deep learning speaker model (ResNest speaker model) by meta-learning: the model is formed through multiple training episodes and the backpropagated corrections of two loss functions, where each training episode has a support set of long utterances and a query set of short utterances. With this single deep learning speaker model, input speech is converted into speaker embedding vectors, so that our speaker identification system can recognize different registered speakers merely by comparing speaker embedding vectors, while effectively blocking spoofing attacks and impostors. Accordingly, the invention is lightweight, responds in real time, works with short utterances and open sets, and can be implemented on low-cost embedded hardware.

Description

Speaker identification system based on meta-learning applied to real-time short sentences in an open-set environment

The present invention relates to speech technology, and in particular to a speaker identification system that recognizes registered speakers while blocking spoofing attacks and impostors.

Voice-enabled products are becoming increasingly common, so speaker identification technology is essential for them: it must authenticate the speaker's identity and prevent impersonation and theft in order to meet security and confidentiality requirements.

In speaker identification, registered speakers are called target speakers, unregistered speakers are called impostors, and attacks using artificially produced voices (attempts to pass authentication) are called spoofing attacks, which include recording replay and speech synthesis. Replay refers to playing back pre-recorded utterances of a registered speaker, while speech synthesis refers to fake speech generated by artificial means.

Speaker identification in an open-set environment must correctly identify the target speakers while preventing spoofing attacks and impostor intrusions. For example, Taiwan Patent No. TWI749709B, "A Speaker Identification Method", comprises three identification stages. The first stage detects whether a text-dependent test utterance is a spoofing attack. The second stage detects impostor intrusion on a text-independent test utterance. The third stage uses a speaker identification system to determine which registered speaker the text-independent test utterance belongs to; if it belongs to none, the speaker is judged an impostor. The first two stages use different features, each with its own binary classifier; the third stage uses multiple classifiers, combining ensemble learning with a unanimity rule and a conditional retry mechanism to decide whether the text-independent test utterance comes from a target speaker or an impostor.

However, this conventional classifier-based identification method requires long training utterances to build its models, and whenever the set of registered speakers changes the system must be retrained. Because multiple classifiers are used, the model is hard to miniaturize, cannot respond in real time, requires substantial storage, and has high computational complexity, making it difficult to implement on low-cost embedded hardware.

The main purpose of the present invention is to disclose a speaker identification system that is lightweight, responds in real time, and works with short utterances in an open-set environment.

To achieve this, the present invention is a speaker identification system based on meta-learning, applied to real-time short utterances in an open-set environment. The speaker identification system includes a speaker embedding generator composed of a Mel-filter bank (MFB), which extracts acoustic features, and a deep learning speaker model (ResNest speaker model). The deep learning speaker model converts the acoustic feature vectors output by the Mel-filter bank into speaker embedding vectors. The model is trained by meta-learning over multiple training episodes, each consisting of a support set of long utterances and a query set of short utterances. At the same time, an objective function composed of two different loss functions is designed so that the model can learn in the global and local embedding spaces simultaneously, and the gradients of this objective function are backpropagated to update the model.

The speaker identification system completes enrollment by converting the input speech of at least one registered speaker into at least one prototype vector, one per registered speaker, through the speaker embedding generator. After enrollment, multiple testers may attempt to log in, and the following steps are performed for each tester:

A spoofing attack identification step determines whether the tester's input speech is a spoofing attack. Unlike conventional methods, the input speech (test utterance) need not be text-dependent. The system converts the tester's input speech into a tester embedding vector through the speaker embedding generator, then computes the cosine similarity between this tester embedding vector and the prototype vector of each registered speaker in the system. If all cosine similarity values exceed a threshold, the tester's input speech is judged not to be a spoofing attack; otherwise the tester's login is rejected; and

An impostor and registered-speaker identification step determines whether the tester's input speech comes from an impostor or from one of the registered speakers. The system first randomly cuts three consecutive sound segments out of the tester's input speech and, through the speaker embedding generator, produces a speaker embedding vector for each segment. It then computes the similarity between each segment's speaker embedding vector and each registered speaker's prototype vector, producing a score for each pair. When the highest score obtained by each of the three segment embeddings corresponds to the same registered speaker and all three exceed a set threshold, the tester is identified as that registered speaker; otherwise the tester is judged an impostor and the login is rejected.

Accordingly, by comparing against the at least one registered speaker, the invention can detect whether the tester's input speech is a spoofing attack and directly identify which registered speaker it is, effectively blocking spoofing attacks and rejecting impostor logins. The invention is lightweight, responds in real time, works with short utterances and open sets, and can be implemented on low-cost embedded hardware.

The detailed description and technical content of the present invention are explained below with reference to the figures:

As shown in Figure 1 and Figure 2, the present invention is a speaker identification system based on meta-learning, applied to real-time short utterances in an open-set environment. It includes a speaker embedding vector generator 10 comprising a deep learning speaker model 101 and a Mel filter 102; the deep learning speaker model 101 converts the acoustic feature vectors extracted from the input speech 11 by the Mel filter 102 into speaker embedding vectors 12. The model 101 has an objective function with the following property: when different input speech 11 from the same speaker is converted by the model 101 into speaker embedding vectors 12, those vectors cluster together in a high-dimensional space 20, and the clusters of different speakers lie far apart from one another.

The deep learning speaker model 101 is trained by meta-learning over multiple training episodes, with an objective function containing two loss functions whose gradients are backpropagated to update the model 101 and reduce the objective value. The two loss functions may be the angular prototypical loss and the softmax loss. In each training episode there is a support set of long utterances and a query set of short utterances: when the support set passes through the model 101 it produces a prototype vector 21, and likewise the query set produces a speaker embedding vector 22, both located in the high-dimensional space 20. During each episode, backpropagating the objective's gradients lets the model 101 refine itself not only in the local embedding space but also draw more information from the global embedding space, which aids learning.
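The episode objective described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: embeddings are pre-computed, the trainable scale and offset that the angular prototypical loss normally carries are omitted, and the auxiliary classification head `W` (the global-space softmax term) and the name `episode_losses` are illustrative, not the patent's actual implementation.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def episode_losses(support, query, W):
    """One training episode's two loss terms, on pre-computed embeddings.

    support: dict speaker_id -> (n, d) embeddings of long utterances
    query:   list of (speaker_id, (d,) embedding) from short utterances
    W:       (num_training_speakers, d) weights of an auxiliary softmax
             classification head (the global embedding space)
    """
    speakers = sorted(support)
    # Support set -> one prototype vector per speaker (local space)
    protos = {s: support[s].mean(axis=0) for s in speakers}

    proto_loss = cls_loss = 0.0
    for spk, e in query:
        # Angular prototypical term: cross-entropy over the cosine
        # similarities between the query embedding and each prototype
        sims = np.array([cosine(e, protos[s]) for s in speakers])
        p = np.exp(sims) / np.exp(sims).sum()
        proto_loss -= np.log(p[speakers.index(spk)])
        # Softmax term: plain classification over all training speakers
        logits = W @ e
        q = np.exp(logits - logits.max())
        cls_loss -= np.log(q[spk] / q.sum())
    n = len(query)
    return proto_loss / n, cls_loss / n  # sum these, then backpropagate
```

In an actual training loop the two terms would be summed and their gradients backpropagated through the speaker model, as the description states.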

As also shown in Figure 3, the speaker identification system converts the input speech 31 of at least one registered speaker 30 into at least one prototype vector 32, one per registered speaker 30, through the speaker embedding vector generator 10, thereby completing enrollment. The prototype vector 32 can be regarded as one of the prototype vectors 21 in the high-dimensional space 20 (see Figure 2).

To avoid the error of a single computation, in practice the system can cut the input speech 31 of the registered speaker 30 into multiple pieces, convert each into a speaker embedding vector, and take their average as that registered speaker 30's prototype vector 32. When the input speech 31 of the registered speaker 30 is too short, the system can replicate it to extend its length until it satisfies the length required for cutting.
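The enrollment procedure just described (cut, embed, average, with replication padding for short recordings) can be sketched as below; `embed` stands in for the Mel-filter bank plus deep learning speaker model and is an assumption for illustration.

```python
import numpy as np

def enroll_prototype(speech, embed, seg_len, n_segs):
    """Average several segment embeddings into one prototype vector.

    speech:  1-D array of audio samples from the enrolling speaker
    embed:   function mapping a segment of audio to an embedding vector
             (placeholder for the Mel-filter bank + speaker model)
    seg_len: samples per segment; n_segs: number of segments averaged
    """
    # Replicate the recording if it is too short to cut n_segs segments
    while len(speech) < seg_len * n_segs:
        speech = np.concatenate([speech, speech])
    segments = [speech[i * seg_len:(i + 1) * seg_len] for i in range(n_segs)]
    embeddings = np.stack([embed(s) for s in segments])
    return embeddings.mean(axis=0)  # prototype vector for this speaker
```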

After enrollment, the speaker identification system allows multiple testers 40 to attempt login, and performs a spoofing attack identification step and an impostor and registered-speaker identification step for each tester 40.

As also shown in Figure 4, in the spoofing attack identification step, to determine whether the input speech 41 of the tester 40 is a spoofing attack 46, the system first converts the input speech 41 into acoustic feature vectors and produces a tester embedding vector 42. A calculation step 43 then checks whether the cosine similarity between the tester embedding vector 42 and the prototype vector 32 of every registered speaker 30 exceeds a threshold, and a judgment step 44 decides: if all similarities exceed the threshold, the input speech 41 of the tester 40 is judged to be genuine live speech 45; if any cosine similarity does not exceed the threshold, it is judged a spoofing attack 46 and the tester 40 is denied login.

The cosine similarity between the tester embedding vector 42 and the prototype vector 32 of each registered speaker 30 can be expressed as

cos(e, c_k) = (e · c_k) / (‖e‖ ‖c_k‖)

where e is the tester embedding vector 42 and c_k is the prototype vector 32 of the k-th registered speaker 30. A spoofing attack 46 reproduces speech by electronic means, including recording replay and speech synthesis; it is not genuine live speech 45. Hence when the input speech 41 of a spoofing attack 46 is converted into the tester embedding vector 42, its cosine similarity with the prototype vectors 32 of the registered speakers will not be high. It therefore suffices to check that the cosine similarity between the tester embedding vector 42 and the prototype vector 32 of every registered speaker 30 exceeds the threshold to judge the speech genuine 45; otherwise it is a spoofing attack 46.
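The spoofing check above reduces to a few lines. This is a hedged sketch, with `is_spoofing_attack` an illustrative name rather than anything the patent defines:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_spoofing_attack(tester_emb, prototypes, threshold):
    """Return True when the login should be rejected as a spoofing attack.

    The utterance passes only if its embedding is similar enough to
    EVERY enrolled prototype; electronically reproduced speech tends to
    fall below the threshold for at least one of them.
    """
    return any(cosine(tester_emb, c) <= threshold for c in prototypes)
```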

As also shown in Figure 5, the impostor and registered-speaker identification step determines whether the input speech 41 of the tester 40 comes from an impostor 47 or from one of the registered speakers 30. In implementation, the input speech 41 is the same utterance used in the previous step. In this step, three consecutive sound segments 411 are randomly cut out of the tester 40's input speech 41. The speaker embedding vector generator 10 converts the three sound segments 411 into acoustic feature vectors to produce three segment speaker embedding vectors 412; a calculation step 43A then computes the similarity between each segment's speaker embedding vector 412 and the prototype vector 32 of each registered speaker 30, producing a score for each pair.

The similarity measure here differs from the plain cosine similarity above. Specifically, in the impostor and registered-speaker identification step, similarity is the cosine similarity scaled by the length of the input speech vector. The similarity between the speaker embedding vector 412 of the i-th segment and the prototype vector 32 of the k-th registered speaker 30 can be expressed as

s_{i,k} = ‖e_i‖ · cos(e_i, c_k)

where e_i is the speaker embedding vector 412 of the i-th segment and c_k is the prototype vector 32 of the k-th registered speaker 30. The score corresponding to the similarity s_{i,k} is obtained by converting the similarities into a probability distribution with a Softmax operation:

P_{i,k} = exp(s_{i,k}) / Σ_j exp(s_{i,j})

Through the above computation, the similarity values obtained by this measure effectively widen the differences in the score distribution between a speaker embedding vector and the prototype vectors of different registered speakers, thereby increasing discriminability.

A judgment step 44A then distinguishes impostors from registered speakers: when the highest score obtained by each of the three segment speaker embedding vectors 412 corresponds to the same registered speaker and all three exceed a set threshold, the unanimity rule is said to be satisfied, and the tester 40 is identified as that particular registered speaker 30; otherwise the tester 40 is judged an impostor 47 and is denied login.
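The scaled-cosine scoring and the unanimity rule above can be sketched as follows; note that ‖e‖·cos(e, c) simplifies to (e · c)/‖c‖. Function names are illustrative assumptions.

```python
import numpy as np

def segment_scores(seg_embs, prototypes):
    """Softmax scores of each segment embedding against every prototype.

    seg_embs:   (3, d) embeddings of the three consecutive segments
    prototypes: (K, d) prototype vectors of the K enrolled speakers
    Similarity is the cosine scaled by the embedding's length,
    which reduces to s = (e . c) / ||c||.
    """
    scores = []
    for e in seg_embs:
        s = prototypes @ e / np.linalg.norm(prototypes, axis=1)
        p = np.exp(s - s.max())          # stable softmax
        scores.append(p / p.sum())
    return np.stack(scores)              # (3, K), each row sums to 1

def unanimity_decision(scores, set_threshold):
    """Return the accepted speaker index, or None for an impostor.

    Accept only when every segment's best score points at the same
    enrolled speaker and all three best scores exceed the threshold.
    """
    best = scores.argmax(axis=1)
    if len(set(best.tolist())) == 1 and (scores.max(axis=1) > set_threshold).all():
        return int(best[0])
    return None
```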

In the impostor and registered-speaker identification step, if among the three segment speaker embedding vectors 412 only two segments score above the set threshold against a registered speaker's prototype vector 32, so the unanimity rule is not satisfied, and the remaining segment's scores against the prototype vectors 32 of the other registered speakers are all below a specified threshold, we call this situation a majority decision, and the registered speaker 30 matched by those two segment embedding vectors 412 is called the majority-decision registered speaker 30. Under a majority decision, the probability that the tester 40 is not the majority-decision registered speaker 30 is low. In practice, the values of the set threshold and the specified threshold can be tuned according to usage requirements and test results to optimize identification.

To raise the identification rate for registered speakers 30, when the impostor and registered-speaker identification step fails the unanimity rule but still satisfies the majority rule, and the registered speaker 30 with the highest cosine similarity in the spoofing attack identification step is the same as the registered speaker 30 matched by the two segment embedding vectors (the majority-decision registered speaker 30), a retry 48 may be granted as a precaution, i.e. one more chance to repeat the impostor and registered-speaker identification step.

To raise the blocking rate for impostors 47, when the registered speaker 30 corresponding to the highest cosine similarity in the spoofing attack identification step differs from the registered speaker 30 determined in the impostor and registered-speaker identification step, the tester 40 is denied login.
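Combining the unanimity rule, the majority-decision retry, and the cross-check against the spoofing step's best-matching speaker, one possible reading of the decision flow is the sketch below. The patent gives no pseudocode, so the control flow, the single-retry mechanism, and all names here are assumptions.

```python
import numpy as np

def decide(spoof_best, scores, set_thr, spec_thr, retry_fn):
    """One pass of the login decision, with a single retry on majority.

    spoof_best: index of the registered speaker with the highest cosine
                similarity in the spoofing attack identification step
    scores:     (3, K) per-segment score matrix
    retry_fn:   callable returning a fresh (3, K) score matrix, standing
                in for re-running the segment step on new audio
    Returns the accepted speaker index, or None (login denied).
    """
    best = scores.argmax(axis=1)
    top = scores.max(axis=1)

    # Unanimity: all three segments agree and clear the set threshold
    if (best == best[0]).all() and (top > set_thr).all():
        speaker = int(best[0])
        # Cross-check with the spoofing step to block impostors
        return speaker if speaker == spoof_best else None

    # Majority: two segments back the same speaker above the set
    # threshold, and the remaining segment scores below the specified
    # threshold against every OTHER registered speaker
    for spk in set(best.tolist()):
        backers = [i for i in range(3) if best[i] == spk and top[i] > set_thr]
        if len(backers) == 2:
            rest = ({0, 1, 2} - set(backers)).pop()
            others = np.delete(scores[rest], spk)
            if (others < spec_thr).all() and spk == spoof_best:
                # One retry only: the nested retry_fn yields scores that
                # cannot pass, so the recursion terminates after one pass
                return decide(spoof_best, retry_fn(), set_thr, spec_thr,
                              lambda: np.zeros_like(scores))
    return None
```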

As shown in Figure 6, we next use UMAP projection to visualize the distributions of the speaker embedding vectors 50A, 50B, 50C, 50D. Four speakers were randomly selected for comparison, each with 60 utterances of 3 seconds each. In the left panel, showing the conventional i-vector speaker embeddings, the embedding vectors 50A, 50B, 50C, 50D of the same speaker are scattered, and those of different speakers are not well separated. By contrast, the right panel shows the speaker embedding vectors produced by our deep learning speaker model: the vectors 50A, 50B, 50C, 50D of the same speaker are concentrated in one place, and the vectors of different speakers lie sufficiently far apart. In other words, in the speaker embedding space learned by our model, the speech of different speakers can be clearly distinguished.

To show more clearly the difference in performance between the present invention and the prior art, we compare the registered-speaker pass rate, the impostor blocking rate, and the spoofing attack detection rate.

In the registered-speaker pass rate experiment, we recruited five registered speakers. Each recorded 20 minutes of speech into the system microphone. In addition, to evaluate the identification rate at different test utterance lengths, we further split each registered speaker's 20 minutes of speech into three lengths of 1, 2, and 3 seconds, with 400 utterances at each length.

In the impostor blocking rate experiment, we recruited 20 unregistered speakers as impostors. Each recorded a 5-minute test audio file into the system microphone. To compare performance across test speech lengths, we further split each impostor's 5 minutes of speech into three lengths of 1, 2, and 3 seconds, with 100 utterances at each length.

In the spoofing attack blocking rate experiment, we recorded 5 minutes of speech from each registered speaker as the spoofing attack test corpus, split into three lengths: 1 second, 2 seconds, and 3 seconds. First, each speaker's 5-minute corpus was divided into 100 audio files of 3 seconds each. Then, from each 3-second file we randomly selected continuous audio of 2-second and 1-second lengths, so the numbers of 1-second and 2-second utterances are also 100.

The prior art used for comparison is the technique disclosed in Taiwan Patent No. TWI749709B, "A Speaker Identification Method", which adopts an i-vector speaker identification model. The experimental results are compared below:

It can be seen that even on short utterances (1 second) the present invention still achieves considerably better results than the prior art.

Registered-speaker pass rate      1 second   2 seconds   3 seconds
  Present invention                96.85%     98.13%      98.46%
  Prior art                        75.56%     77.94%      80.88%

Impostor blocking rate            1 second   2 seconds   3 seconds
  Present invention                99.92%     99.85%      99.76%
  Prior art                        80.13%     86.95%      91.62%

Spoofing attack detection rate    1 second   2 seconds   3 seconds
  Present invention                99.87%     99.66%      99.31%
  Prior art                        94.86%     95.91%      98.44%

In summary, the present invention has the following features:

1. Compared with traditional classifier-based identification methods, the present invention lets registered speakers complete system enrollment with only a few enrolled utterances, and thereby achieves reliable performance on short test utterances. Moreover, the model needs no retraining when an existing registered speaker is removed or a new one is added; in other words, no change to the target speaker set requires retraining the model.

2. A single lightweight speaker model can operate in an open-set environment, simultaneously handling registered-speaker identification, spoofing attack detection, and impostor blocking. This lightweight design not only lends itself to implementation on low-cost embedded hardware but, thanks to its low computational complexity, also provides real-time response.

10: speaker embedding vector generator
101: deep learning speaker model
102: Mel filter
11: input speech
12: speaker embedding vector
20: high-dimensional space
21: prototype vector
22: speaker embedding vector
30: registered speaker
31: input speech
32: prototype vector
40: tester
41: input speech
411: sound segment
412: speaker embedding vector
42: tester embedding vector
43, 43A: calculation step
44, 44A: judgment step
45: genuine live speech
46: spoofing attack
47: impostor
48: retry
50A, 50B, 50C, 50D: speaker embedding vector

Figure 1 is a schematic diagram of the operation of the speaker embedding vector generator of the present invention.
Figure 2 is a schematic diagram of speaker embedding vectors and prototype vectors in the high-dimensional embedding space of the present invention.
Figure 3 is a schematic diagram of registered-speaker enrollment according to the present invention.
Figure 4 is a schematic diagram of the spoofing attack identification step of the present invention.
Figure 5 is a schematic diagram of the impostor and registered-speaker identification step of the present invention.
Figure 6 compares the speaker embedding vector distributions of the conventional method and the present invention.


Claims (7)

1. A speaker identification system based on meta-learning, applied to real-time short utterances in an open-set environment, comprising a speaker embedding vector generator having a deep learning speaker model and a Mel filterbank. The deep learning speaker model converts the acoustic feature vectors output by the Mel filterbank into speaker embedding vectors. The deep learning speaker model is trained in a meta-learning manner with a plurality of training episodes, each training episode having a support set of long utterances and a query set of short utterances; an objective function composed of two different loss functions is designed, and the gradient of this objective function is back-propagated to update the deep learning speaker model. The speaker identification system completes enrollment by converting the input speech of at least one registered speaker, through the speaker embedding vector generator, into at least one prototype vector corresponding to the at least one registered speaker. After enrollment, the speaker identification system can be tested by a plurality of testers, performing the following steps for each tester:
a spoofing attack identification step: to determine whether the tester's input speech is a spoofing attack, the speaker identification system converts the tester's input speech into a tester embedding vector through the speaker embedding vector generator, then checks whether the cosine similarity between the tester embedding vector and the prototype vector of each registered speaker exceeds a threshold; if all similarities exceed the threshold, the tester's input speech is judged not to be a spoofing attack, otherwise the tester is denied login; and
an impostor and registered speaker identification step: to determine whether the tester's input speech comes from an impostor or from one of the registered speakers, the speaker identification system randomly cuts the tester's input speech into three consecutive segments and produces the three corresponding speaker embedding vectors through the speaker embedding vector generator; the similarity between each segment's embedding vector and each registered speaker's prototype vector is then computed and converted into a score; when the highest scores of all three segments correspond to the same registered speaker and all exceed a set threshold, the tester is judged to be that particular registered speaker, otherwise the tester is judged to be an impostor and denied login.
2. The speaker identification system of claim 1, wherein the speaker identification system cuts the registered speaker's input speech into a plurality of segments, converts each segment into a speaker embedding vector, and averages these vectors to obtain that registered speaker's prototype vector.
3. The speaker identification system of claim 2, wherein the speaker identification system repeats the registered speaker's input speech to increase its length when the speech is shorter than the length required for cutting.
4. The speaker identification system of claim 1, wherein in the impostor and registered speaker identification step, if the scores of two of the three segments' embedding vectors against a registered speaker's prototype vector exceed the set threshold, and the scores of the remaining segment's embedding vector against the prototype vectors of the other registered speakers are all below a specified threshold, the tester is given one chance to repeat the impostor and registered speaker identification step, the specified threshold being no greater than the set threshold.
5. The speaker identification system of claim 1, wherein in the impostor and registered speaker identification step, if only two of the three segments' embedding vectors score above the set threshold against a specific registered speaker's prototype vector, and the registered speaker with the highest cosine similarity in the spoofing attack identification step is the same registered speaker as the one matched by those two segments, the tester is given one chance to repeat the impostor and registered speaker identification step.
6. The speaker identification system of claim 1, wherein if the registered speaker with the highest cosine similarity in the spoofing attack identification step differs from the registered speaker determined in the impostor and registered speaker identification step, the tester is denied login.
7. The speaker identification system of claim 1, wherein in the impostor and registered speaker identification step, similarity is measured by cosine similarity scaled by the length of the input speech vector, and the score is obtained by converting the similarities into a probability distribution with a Softmax operation.
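The enrollment and decision logic recited in claims 1, 2, and 7 can be sketched as follows. This is a minimal illustration, not the patented implementation: the embedding step (Mel filterbank plus ResNest speaker model) is omitted, embeddings are plain NumPy vectors, and the thresholds and similarity scale are assumed placeholder hyperparameters rather than values from the patent.

```python
# Hedged sketch of the enrollment, spoofing-gate, and three-segment voting
# logic in claims 1, 2, and 7. All numeric thresholds and the similarity
# scale below are illustrative assumptions.
import numpy as np

def cosine(a, b):
    # Plain cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def enroll(segment_embeddings):
    """Claim 2: average the embeddings of several cuts of the registered
    speaker's speech into a single prototype vector."""
    return np.mean(segment_embeddings, axis=0)

def is_spoof(test_emb, prototypes, threshold=0.5):
    """Claim 1's spoofing gate: the input passes only when its cosine
    similarity to every registered prototype exceeds the threshold."""
    return not all(cosine(test_emb, p) > threshold for p in prototypes)

def softmax(x):
    # Converts similarities into a probability distribution (claim 7).
    e = np.exp(x - np.max(x))
    return e / e.sum()

def identify(segment_embs, prototypes, set_threshold=0.6, scale=1.0):
    """Claim 1's three-segment vote with claim 7's scoring: each segment is
    scored against every prototype by scaled cosine similarity, softmax
    turns the similarities into probabilities, and the tester is accepted
    only when all three segments pick the same registered speaker with a
    score above the set threshold. Returns the speaker index, or -1 for an
    impostor."""
    winners, scores = [], []
    for e in segment_embs:
        sims = np.array([scale * cosine(e, p) for p in prototypes])
        probs = softmax(sims)
        winners.append(int(np.argmax(probs)))
        scores.append(float(np.max(probs)))
    if len(set(winners)) == 1 and all(s > set_threshold for s in scores):
        return winners[0]
    return -1
```

Applying softmax over the scaled similarities (claim 7) normalizes the per-segment scores into a probability distribution over registered speakers, so the set threshold can be compared against a value in a fixed range rather than a raw similarity.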
TW111143218A 2022-11-11 2022-11-11 Speaker identification system based on meta-learning applied to real-time short sentences in an open set environment TWI832552B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW111143218A TWI832552B (en) 2022-11-11 2022-11-11 Speaker identification system based on meta-learning applied to real-time short sentences in an open set environment

Publications (2)

Publication Number Publication Date
TWI832552B true TWI832552B (en) 2024-02-11
TW202420291A TW202420291A (en) 2024-05-16

Family

ID=90824881

Family Applications (1)

Application Number Title Priority Date Filing Date
TW111143218A TWI832552B (en) 2022-11-11 2022-11-11 Speaker identification system based on meta-learning applied to real-time short sentences in an open set environment

Country Status (1)

Country Link
TW (1) TWI832552B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101145341A (en) * 2006-09-04 2008-03-19 Fortemedia Inc. Method, system and apparatus for improved voice recognition
TW201117194A (en) * 2009-11-12 2011-05-16 National Cheng Kung University Fixed-point arithmetic design of embedded text-independent speaker recognition system
TW201349222A (en) * 2012-05-18 2013-12-01 Asustek Comp Inc Method and system for speech recognition
TWI749709B (en) * 2020-08-14 2021-12-11 國立雲林科技大學 A method of speaker identification
TWI778234B (en) * 2019-02-22 2022-09-21 Chunghwa Telecom Co., Ltd. Speaker verification system

Also Published As

Publication number Publication date
TW202420291A (en) 2024-05-16

Similar Documents

Publication Publication Date Title
Chen et al. Robust deep feature for spoofing detection—The SJTU system for ASVspoof 2015 challenge
Snyder et al. Deep neural network embeddings for text-independent speaker verification.
Yamamoto et al. Speaker Augmentation and Bandwidth Extension for Deep Speaker Embedding.
CN108986824B (en) Playback voice detection method
Cheng et al. Replay detection using CQT-based modified group delay feature and ResNeWt network in ASVspoof 2019
US20080270132A1 (en) Method and system to improve speaker verification accuracy by detecting repeat imposters
KR101888058B1 (en) The method and apparatus for identifying speaker based on spoken word
Xie et al. Learning A Self-Supervised Domain-Invariant Feature Representation for Generalized Audio Deepfake Detection
CN109410956A (en) A kind of object identifying method of audio data, device, equipment and storage medium
CN114495950A (en) Voice deception detection method based on deep residual shrinkage network
Yan et al. Audio deepfake detection system with neural stitching for add 2022
CN111048097A (en) Twin network voiceprint recognition method based on 3D convolution
Dimaunahan et al. MFCC and VQ voice recognition based ATM security for the visually disabled
Altalahin et al. Unmasking the truth: A deep learning approach to detecting deepfake audio through mfcc features
Wang et al. Synthetic voice detection and audio splicing detection using se-res2net-conformer architecture
CN109920447B (en) Recording fraud detection method based on adaptive filter amplitude phase characteristic extraction
TWI832552B (en) Speaker identification system based on meta-learning applied to real-time short sentences in an open set environment
Marras et al. Dictionary attacks on speaker verification
Deng et al. Detection of synthetic speech based on spectrum defects
US20220108702A1 (en) Speaker recognition method
Duraibi et al. Voice Feature Learning using Convolutional Neural Networks Designed to Avoid Replay Attacks
TWI749709B (en) A method of speaker identification
Hu et al. Fusion of two classifiers for speaker identification: removing and not removing silence
Zhang et al. Towards generating adversarial examples on combined systems of automatic speaker verification and spoofing countermeasure
Feng et al. SHNU anti-spoofing systems for asvspoof 2019 challenge