TWI749709B - A method of speaker identification - Google Patents

A method of speaker identification

Info

Publication number
TWI749709B
Authority
TW
Taiwan
Prior art keywords
identification
text
test sentence
stage
speaker
Prior art date
Application number
TW109127757A
Other languages
Chinese (zh)
Other versions
TW202207209A (en)
Inventor
許正欣
許良瑋
Original Assignee
國立雲林科技大學
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 國立雲林科技大學
Priority to TW109127757A
Application granted
Publication of TWI749709B
Publication of TW202207209A

Abstract

The present invention comprises three identification stages. The first stage detects whether a text-dependent test utterance is a replay spoofing attack. The second stage detects whether a text-independent test utterance is a synthetic-speech spoofing attack. The third stage uses a speaker identification system to determine which enrolled speaker the text-independent test utterance belongs to; if it matches no enrolled speaker, the utterance is judged to come from an imposter. The first two stages use different features, each with its own binary classifier; the third stage uses multiple classifiers combined through ensemble learning and a unanimity rule, together with a conditional retry mechanism, to decide whether the text-independent test utterance comes from a target speaker or an imposter. In this way, the blocking rate for target speakers can be reduced effectively without sacrificing the blocking rate for imposters.

Description

A method of speaker identification

The present invention relates to speech technology, and more particularly to a speaker identification method.

Voice-based products are proliferating, so speaker identification technology is essential for such products: it must authenticate the speaker's identity and prevent impersonation in order to meet security and confidentiality requirements.

In particular, as speech synthesis and voice conversion technologies mature, voice spoofing has become an important open problem for speaker identification. In speaker identification, a test utterance that matches an enrolled speaker is called a target; one that matches no enrolled speaker is called an imposter; and an attack mounted with artificially faked speech (an attempt to pass authentication) is called a spoofing attack, which includes replay, speech synthesis, and voice conversion. Replay refers to playing back a pre-recorded utterance of an enrolled speaker, while speech synthesis and voice conversion refer to fake speech generated by artificial techniques.

A speaker identification technique must correctly identify target speakers while blocking spoofing attacks and intrusions by imposters. To evaluate such techniques, the LA (Logical Access) corpus provided by ASVspoof-2019 can be used as the training/testing corpus. ASVspoof-2019 is an automatic speaker verification spoofing-and-countermeasures challenge initiated by the University of Edinburgh (UK), Inria (the French national institute for research in computer science and automation), NEC (Japan), and other organizations; the challenge comprises two tracks, LA and PA. The LA corpus contains 107 real speakers, 46 male and 61 female, and the recordings have no obvious channel or background noise.

In practical applications, first, it is unlikely that a single feature-extraction technique can simultaneously distinguish replayed recordings, synthetic speech, and live human speech. Moreover, in speaker identification applications, every non-enrolled speaker is an imposter, so imposters far outnumber target speakers, and malicious imposters may even attempt to break into the system repeatedly. Given these two circumstances, choosing appropriate feature engineering, and being able both to admit every target speaker and to block every imposter, are the keys to a successful speaker identification technique.

Known speaker identification techniques include the PhD thesis "Speaker Verification using I-vector Features"; the paper "Front-End Factor Analysis for Speaker Verification" by Najim Dehak et al., IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, May 2011; and the paper "Joint Factor Analysis versus Eigenchannels in Speaker Recognition" by Patrick Kenny et al., IEEE Transactions on Audio, Speech, and Language Processing, 15(4):1435-1447. None of these can effectively solve the problem of spoofing attacks and repeated intrusion attempts by imposters.

In addition, the conventional approach to detecting replay attacks is to develop key acoustic features, but no feature robust enough to block replay comprehensively exists yet. The conventional feature for detecting synthetic speech is the constant-Q cepstral coefficient (CQCC), yet this feature still does not provide complete enough information to fully discriminate synthesized utterances. For distinguishing target speakers from imposters, the conventional feature-extraction technique is the i-vector.

Spoofing attacks based on replayed recordings and artificially synthesized utterances are both conventional binary-classification detection problems. Conventional techniques for distinguishing target speakers from imposters, by contrast, mostly adopt a threshold-based decision scheme, using classifiers such as deep neural networks (DNN), linear support vector machines (Linear-SVM), kernel support vector machines (Kernel-SVM), and random forests. The classifier produces a score for the test utterance: if the score exceeds the threshold, the utterance passes (it is judged to come from a target speaker); if the score falls below the threshold, authentication fails (the utterance is judged to come from an imposter). This threshold-based scheme faces a tradeoff between raising the blocking rate for imposters and lowering the blocking rate for target speakers: no single threshold can satisfy both objectives at once.
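The tradeoff can be made concrete with a minimal Python sketch (illustrative only, not part of the disclosure): given hypothetical score arrays for target and imposter trials, both blocking rates move together as the threshold moves.

```python
import numpy as np

def blocking_rates(target_scores, imposter_scores, theta):
    """Fraction of target trials and imposter trials rejected at threshold theta.
    Raising theta blocks more imposters but inevitably also blocks more targets."""
    target_block = float(np.mean(np.asarray(target_scores) < theta))
    imposter_block = float(np.mean(np.asarray(imposter_scores) < theta))
    return target_block, imposter_block
```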

Consequently, conventional systems apply score normalization to pull apart the score distributions of target speakers and imposters, in the hope of finding a suitable threshold more easily. In practical applications, however, normalization still cannot simultaneously raise the acceptance rate of target speakers and the rejection rate of imposters, because the two score distributions partially overlap and no normalization method can separate them.
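One widely used normalization variant is the imposter-centric z-norm; the sketch below assumes a precomputed cohort of imposter scores and is illustrative only — the patent does not prescribe a specific normalization.

```python
import numpy as np

def z_norm(trial_score, imposter_cohort_scores):
    """Z-norm: center and scale a trial score by an imposter cohort's statistics.
    This reshapes the score scale but cannot separate target and imposter
    distributions that already overlap."""
    mu = float(np.mean(imposter_cohort_scores))
    sigma = float(np.std(imposter_cohort_scores))
    return (trial_score - mu) / sigma
```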

The main purpose of the present invention is to disclose a speaker identification method that can detect replay and speech-synthesis spoofing attacks while, for speaker identification, effectively reducing the blocking rate for target speakers without sacrificing the blocking rate for imposters.

To achieve the above objective, the present invention provides a speaker identification method that identifies a text-dependent test utterance and a text-independent test utterance spoken by a user. The text-dependent test utterance is a prescribed sentence, whereas the text-independent test utterance is an arbitrary sentence spoken freely.

The present invention performs first, second, and third identification stages in sequence, which respectively determine whether the text-dependent test utterance is a replay spoofing attack, whether the text-independent test utterance is a synthetic-speech spoofing attack, and, by means of a speaker identification system, whether the text-independent test utterance comes from a target speaker or an imposter. The steps are as follows.

In the first identification stage, to determine whether the text-dependent test utterance is a replay spoofing attack, the present invention builds a text-dependent speech recognition model from Mel-frequency spectral coefficient (MFSC) features. The model may optionally be combined with speech recognition or template matching to form a binary classifier, which judges whether the text-dependent test utterance is a replay spoofing attack. If it is not, the utterance passes the first stage and proceeds to the second stage; otherwise it is rejected.

The second identification stage determines whether the text-independent test utterance is a synthetic-speech spoofing attack. To this end, the present invention first constructs a dimension-reduced mixed feature for the text-independent test utterance: a mixed feature vector built from the utterance's constant-Q cepstral coefficients (CQCC) and its spectrogram. The utterance first passes through step P1 (pre-processing and framing), then in parallel through step P2-1 (CQCC feature extraction) and step P2-2 (short-time Fourier transform, STFT), where step P2-1 yields the constant-Q cepstral coefficients and step P2-2 yields the spectrogram. Finally step P3 (mixed-feature synthesis) produces the mixed feature vector, which is passed to another binary classifier to judge whether the utterance is synthetic speech. If it is judged not to be synthetic, the utterance passes the second stage, is regarded as live human speech, and proceeds to the third stage; otherwise it is rejected.

The third identification stage is speaker identification proper, which determines whether the text-independent test utterance comes from a target speaker or an imposter. It is carried out by a text-independent speaker identification system having a plurality of enrolled speakers and a plurality of classifiers, which performs steps a through c.

First, step a: extract the i-vector features of the text-independent test utterance and reduce their dimensionality; based on the dimension-reduced i-vector feature vector, each of the classifiers independently judges whether the utterance belongs to one of the enrolled speakers or to an imposter, each producing its own judgment.

Next, step b: a decision is made from the multiple judgments. When all judgments name the same enrolled speaker, the text-independent test utterance is judged to come from a target speaker and the third-stage speaker identification ends. When fewer than half of the judgments name the same enrolled speaker, the utterance is judged to come from an imposter and the third-stage identification ends. Otherwise, when at least half (but not all) of the judgments name the same enrolled speaker, the method proceeds to step c, granting a retry whose count is a positive integer bounded by an upper limit.

In step c, if the user's retries have not exceeded the upper limit, the user is asked to speak another text-independent test utterance and step a is repeated; once the retries exceed the upper limit, the third-stage identification ends and the utterance is judged to come from an imposter.

As the above shows, for replay spoofing attacks we build the text-dependent speech recognition model from MFSC features, optionally combined with speech recognition or template matching, to form the binary classifier that judges whether the text-dependent test utterance is a replay attack. For synthetic-speech spoofing attacks, the method constructs the dimension-reduced mixed feature so that the other binary classifier can judge whether the text-independent test utterance is synthetic speech. For distinguishing target speakers from imposters, multiple classifiers judge the same text-independent test utterance, and the utterance is accepted as coming from a target speaker only when all of them report the same enrolled speaker; otherwise a bounded number of retries is granted. Granting a bounded number of retries therefore loses none of the imposter blocking rate while effectively preventing target speakers from being wrongly blocked and mistaken for imposters, thus lowering the blocking rate for target speakers.

S1: First-stage identification

S2: Second-stage identification

S3: Third-stage identification

S1-1: Prompt that speaker identification is starting

S1-2: User's utterance input

S1-3: Detect whether the input is a replayed recording

S2-1: Prompt for a test utterance

S2-2: User's utterance input

S2-3: Detect whether the input is synthetic speech

a: Extract i-vector features and reduce their dimensionality

b: Make the decision

c: Grant a retry with a bounded number of attempts

c1: Check whether the retry limit has been exceeded

c2: Prompt for a test utterance

P1: Pre-processing and framing

P2-1: Extract CQCC features

P2-2: Short-time Fourier transform (STFT)

P3: Synthesize the mixed feature

10: Input speech

Fig. 1 is a system flow diagram of the speaker identification method of the present invention.

Fig. 2 is a schematic diagram of mixed feature vector generation.

Fig. 3 is a schematic diagram of how the present invention identifies input speech.

The detailed description and technical content of the present invention are now explained with reference to the drawings. Referring to Fig. 1, in a preferred embodiment the speaker identification method of the present invention identifies a text-dependent test utterance and a text-independent test utterance spoken by a user. The method comprises a first-stage identification S1, a second-stage identification S2, and a third-stage identification S3.

The first-stage identification S1 comprises steps S1-1 through S1-3. Step S1-1 prompts that speaker identification is starting: a voice message notifies the user that identification may begin. What is performed at this point is text-dependent utterance identification, but the system does not reveal whether the current test is text-dependent or text-independent; moreover, an imposter does not know the content of the text-dependent test utterance.

Step S1-2 is the user's utterance input: the system waits for the user to speak and produce the text-dependent test utterance.

Step S1-3 detects whether the input is a replayed recording, i.e., whether the text-dependent test utterance is a replay spoofing attack. A text-dependent speech recognition model is built from Mel-frequency spectral coefficient (MFSC) features and combined with speech recognition or template matching to form a binary classifier, which judges whether the text-dependent test utterance is a replay spoofing attack. If it is not, the utterance passes the first-stage identification S1 and proceeds to the second-stage identification S2; otherwise the utterance is rejected and the identification process ends.
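A minimal sketch of MFSC extraction follows, assuming 16 kHz audio and the librosa library; the frame and filterbank sizes are illustrative choices, since the patent does not specify them.

```python
import numpy as np
import librosa

def extract_mfsc(wav_path, sr=16000, n_mels=40):
    """MFSC: log mel filterbank energies, i.e. a mel spectrogram without
    the final DCT step that would turn it into MFCCs."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, hop_length=160, n_mels=n_mels)
    return np.log(mel + 1e-10)  # shape: (n_mels, n_frames)
```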

The second-stage identification S2 comprises steps S2-1 through S2-3. Step S2-1 prompts for a test utterance: another voice message notifies the user that they may start speaking to produce the text-independent test utterance. This input is not restricted to any prescribed text; it may be any freely spoken sentence (text-independent).

Step S2-2 is the user's utterance input: the system waits for the user to speak and produce the text-independent test utterance.

Step S2-3 detects whether the input is synthetic speech, i.e., whether the text-independent test utterance is a synthetic-speech spoofing attack. A dimension-reduced mixed feature is constructed for the utterance and passed to another binary classifier, which judges whether the utterance is synthetic speech. The mixed feature is a mixed feature vector built from the utterance's constant-Q cepstral coefficients (CQCC) and its spectrogram. If the utterance is judged not to be synthetic, it passes the second-stage identification S2 and proceeds to the third-stage identification S3; otherwise it is rejected and the identification process ends.

Referring also to Fig. 2, a schematic diagram of mixed feature vector generation: the text-independent test utterance first passes through step P1 (pre-processing and framing), then in parallel through step P2-1 (CQCC feature extraction) and step P2-2 (short-time Fourier transform, STFT), and finally through step P3 (mixed-feature synthesis), which produces the mixed feature vector. The CQCC is a conventional speech feature obtained by combining the constant-Q transform with the traditional cepstral procedure.
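The following sketch mirrors the P1 → P2-1/P2-2 → P3 pipeline of Fig. 2 under stated assumptions: the canonical CQCC's uniform-resampling step is omitted for brevity, time-averaging stands in for the unspecified dimensionality reduction, and all sizes are illustrative.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def hybrid_feature(y, sr=16000, n_cqcc=20):
    """Mixed CQCC + spectrogram feature vector (illustrative sketch).
    P1 (framing) is handled internally by librosa's transforms."""
    # P2-1: constant-Q transform -> log power spectrum -> DCT -> CQCC.
    cq = np.abs(librosa.cqt(y=y, sr=sr, hop_length=256))
    cqcc = dct(np.log(cq ** 2 + 1e-10), axis=0, norm='ortho')[:n_cqcc]
    # P2-2: short-time Fourier transform -> magnitude spectrogram.
    spec = np.abs(librosa.stft(y=y, n_fft=512, hop_length=256))
    # P3: reduce each stream over time and concatenate into one vector.
    return np.concatenate([cqcc.mean(axis=1), spec.mean(axis=1)])
```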

The third-stage identification S3 is speaker identification proper, comprising steps a through c. To judge whether the text-independent test utterance comes from a target speaker or an imposter, the third-stage identification S3 is carried out by a text-independent speaker identification system having a plurality of enrolled speakers and a plurality of classifiers, which performs steps a through c.

Step a extracts i-vector features and reduces their dimensionality: the i-vector features of the text-independent test utterance are extracted and, based on the dimension-reduced i-vector feature vector, each classifier independently judges whether the utterance belongs to one of the enrolled speakers or to an imposter, each producing its own judgment. The i-vector is a conventional dimensionality-reduction technique for speech features, implemented mainly via the Joint Factor Analysis (JFA) method; it reduces speech features of varying lengths to a fixed length and is widely used in speech recognition. The number of classifiers used by the method of the present invention is preferably odd; one embodiment uses five, namely a one-model deep neural network (One-model DNN), a multi-model deep neural network (Multi-model DNN), a linear support vector machine (Linear-SVM), a kernel support vector machine (Kernel-SVM), and a random forest.
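A sketch of step a follows, under several labeled assumptions: i-vectors are assumed precomputed (UBM and total-variability training are outside this sketch), LDA is used as the otherwise unspecified dimensionality reducer, sklearn MLPs stand in for the patent's two DNNs, and the training labels are assumed to include a background "imposter" class so that the classifiers can vote for an imposter.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

def build_ensemble(train_ivecs, train_labels):
    """Fit the reducer and the five classifiers on enrolled speakers'
    i-vectors; all hyperparameters here are illustrative only."""
    lda = LinearDiscriminantAnalysis()  # reduces to at most n_classes - 1 dims
    x = lda.fit_transform(train_ivecs, train_labels)
    clfs = [
        MLPClassifier(hidden_layer_sizes=(256,), max_iter=500),      # stand-in for the one-model DNN
        MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=500),  # stand-in for the multi-model DNN
        LinearSVC(),                              # Linear-SVM
        SVC(kernel='rbf'),                        # Kernel-SVM
        RandomForestClassifier(n_estimators=100),
    ]
    for clf in clfs:
        clf.fit(x, train_labels)
    return lda, clfs

def ensemble_votes(lda, clfs, test_ivec):
    """Step a: each classifier labels the reduced test i-vector independently."""
    x = lda.transform(np.asarray(test_ivec).reshape(1, -1))
    return [clf.predict(x)[0] for clf in clfs]
```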

Step b makes the decision according to the multiple judgments. The present invention uses ensemble learning with a unanimity rule to judge whether the text-independent test utterance comes from a target speaker or an imposter. In more detail, ensemble learning integrates different classifiers — in this embodiment, the five classifiers listed above. When all judgments name the same enrolled speaker, the utterance is judged to come from a target speaker and the third-stage identification ends. When fewer than half of the judgments name the same enrolled speaker, the utterance is judged to come from an imposter and the third-stage identification ends. When the number of judgments naming the same enrolled speaker is not all of them but not fewer than half, the method proceeds to step c.
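A sketch of the unanimity-rule decision of step b, assuming a vote list like the one produced by the `ensemble_votes` sketch above and a reserved "imposter" label (the label name is an assumption):

```python
from collections import Counter

def unanimity_decision(votes, imposter_label='imposter'):
    """Step b: 'target' only on unanimous agreement about one enrolled
    speaker; 'imposter' when no speaker collects at least half the votes;
    otherwise 'retry' (a majority, but not unanimity)."""
    speaker_votes = [v for v in votes if v != imposter_label]
    if not speaker_votes:
        return 'imposter', None
    label, count = Counter(speaker_votes).most_common(1)[0]
    if count == len(votes):       # all classifiers name the same enrolled speaker
        return 'target', label
    if count < len(votes) / 2:    # fewer than half agree on any one speaker
        return 'imposter', None
    return 'retry', label
```

With five classifiers this yields: five matching votes → target; three or four → retry; two or fewer → imposter, which matches the rule stated above.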

Step c grants a retry whose count is a positive integer bounded by an upper limit. Step c mainly comprises a step c1 and a step c2. Step c1 checks whether the retry limit has been exceeded: if the number of retries has not exceeded the upper limit, step c2 follows, prompting for a test utterance; step c2 lets the user speak another text-independent test utterance, after which step a is repeated. In step c1, once the user's retries exceed the upper limit, the third-stage identification ends and the text-independent test utterance is judged to come from an imposter. In addition, for security reasons, in one embodiment at most two retries are granted, and whenever a text-independent test utterance is judged to come from an imposter, the speaker identification system sends a warning message to an administrator of the system and locks itself; unless the user can release the lock, the system can only be restarted after a preset time.
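The conditional retry of step c can be wrapped around the two earlier sketches; `prompt_utterance` is a hypothetical callback that prompts the user (step c2) and returns the i-vector of a fresh text-independent utterance, and the two-retry cap follows the embodiment above.

```python
def identify_with_retry(lda, clfs, prompt_utterance, max_retries=2):
    """Step c: retry only on a majority-but-not-unanimous vote, up to
    max_retries extra attempts; an exhausted budget means 'imposter'."""
    for _ in range(max_retries + 1):
        votes = ensemble_votes(lda, clfs, prompt_utterance())
        verdict, label = unanimity_decision(votes)
        if verdict != 'retry':      # step b reached a final decision
            return verdict, label
    return 'imposter', None         # step c1: retry limit exceeded
```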

Referring again to Fig. 3, the user speaks an input speech 10 comprising the text-dependent test utterance and the text-independent test utterance. The input speech 10 may come from a target speaker, a spoofing attack (replay or synthetic speech), or an imposter, denoted L3, L1, and L2 respectively as the ideal outcomes of the method of the present invention. Most replay and synthetic-speech spoofing attacks are recognized and rejected in the first-stage identification S1 and the second-stage identification S2 respectively (L1 in Fig. 3). Most imposters are recognized and rejected in the third-stage identification S3 (L2 in Fig. 3). Most target speakers pass the first-stage identification S1, the second-stage identification S2, and the third-stage identification S3 and are identified (L3 in Fig. 3).

In one simulation test of the present invention, there were 20 target speakers, 20 spoofing attacks, and 67 imposters. Each target speaker had 25 test utterances, each spoofing attack had 450, and each imposter had 65. The simulation compared no retry mechanism (no retries granted), an unconditional retry mechanism (unlimited retries), and the conditional retry mechanism (retries bounded by an upper limit); the results, averaged over 10,000 runs, are shown in the table below.

[Table (image 109127757-A0305-02-0011-1 in the original publication): blocking rates for target speakers, spoofing attacks, and imposters under the no-retry, unlimited-retry, and conditional-retry mechanisms.]

Compared with no retry mechanism, if retries are unlimited (retrying until the result converges), the imposter blocking rate drops, meaning imposters are more easily misjudged as target speakers. Compared with no retry mechanism, under the conditional retry mechanism the blocking rates for imposters and spoofing attacks are unchanged, while the blocking rate for target speakers drops markedly, so fewer target speakers are misjudged as imposters. The conditional retry mechanism of the present invention therefore effectively lowers the target-speaker blocking rate without sacrificing the imposter or spoofing-attack blocking rates.

In summary, the present invention has the following features:

1. A conditional retry mechanism, combined with the unanimity-rule strategy, effectively lowers the blocking rate for target speakers without sacrificing the blocking rate for imposters.

2. Performing identification with both a text-dependent test utterance and a text-independent test utterance effectively counters repeated spoofing-attack intrusion attempts.


Claims (5)

1. A speaker identification method that identifies a text-dependent test utterance and a text-independent test utterance spoken by a user, comprising the steps of: a first-stage identification, which determines whether the text-dependent test utterance is a replay spoofing attack, wherein a text-dependent speech recognition model is built from Mel-frequency spectral coefficient (MFSC) features and combined with speech recognition or template matching to form a binary classifier, the binary classifier judging whether the text-dependent test utterance is a replay spoofing attack; if the text-dependent test utterance is not a replay spoofing attack it passes the first-stage identification, and otherwise it is rejected; a second-stage identification, which determines whether the text-independent test utterance is a synthetic-speech spoofing attack, wherein a dimension-reduced mixed feature is first constructed for the text-independent test utterance, the mixed feature being a mixed feature vector built from the constant-Q cepstral coefficients (CQCC) and the spectrogram of the text-independent test utterance: the utterance first passes through step P1, pre-processing and framing, then in parallel through step P2-1, CQCC feature extraction, and step P2-2, short-time Fourier transform (STFT), wherein step P2-1 yields the constant-Q cepstral coefficients and step P2-2 yields the spectrogram, and finally through step P3, mixed-feature synthesis, which produces the mixed feature vector; the mixed feature vector is passed to another binary classifier to judge whether the utterance is synthetic speech; if the utterance is judged not to be synthetic speech it passes the second-stage identification, and otherwise it is rejected; and a third-stage identification, which determines whether the text-independent test utterance comes from a target speaker or an imposter, wherein the third-stage identification is carried out by a text-independent speaker identification system having a plurality of enrolled speakers and a plurality of classifiers performing the following steps: step a: extracting the i-vector features of the text-independent test utterance and, based on the dimension-reduced i-vector feature vector, letting each of the plurality of classifiers individually judge whether the utterance belongs to one of the enrolled speakers or to an imposter, each classifier producing its own judgment; step b: making a decision according to the plurality of judgments: when all of the judgments name the same enrolled speaker, the text-independent test utterance is judged to come from a target speaker and the third-stage identification ends; when fewer than half of the judgments name the same enrolled speaker, the utterance is judged to come from an imposter and the third-stage identification ends; and when the number of judgments naming the same enrolled speaker is not all of them but not fewer than half, proceeding to the next step; and step c: granting a retry whose count is a positive integer bounded by an upper limit: when the number of retries has not exceeded the upper limit, letting the user speak another text-independent test utterance and repeating step a; and when the user's retries exceed the upper limit, ending the third-stage identification and judging the text-independent test utterance to come from an imposter.

2. The speaker identification method of claim 1, wherein whenever a text-independent test utterance is judged to come from an imposter, the speaker identification system sends a warning message to an administrator of the speaker identification system and locks the speaker identification system; and, unless the user can release the lock, the speaker identification system can only be restarted after a preset time.

3. The speaker identification method of claim 1, wherein no more than two retries are granted.

4. The speaker identification method of claim 1, wherein the number of the plurality of classifiers is odd.

5. The speaker identification method of claim 4, wherein the number of the plurality of classifiers is five, namely a one-model deep neural network (One-model DNN), a multi-model deep neural network (Multi-model DNN), a linear support vector machine (Linear-SVM), a kernel support vector machine (Kernel-SVM), and a random forest (Random Forest).
TW109127757A 2020-08-14 2020-08-14 A method of speaker identification TWI749709B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW109127757A TWI749709B (en) 2020-08-14 2020-08-14 A method of speaker identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW109127757A TWI749709B (en) 2020-08-14 2020-08-14 A method of speaker identification

Publications (2)

Publication Number Publication Date
TWI749709B true TWI749709B (en) 2021-12-11
TW202207209A TW202207209A (en) 2022-02-16

Family

ID=80681013

Family Applications (1)

Application Number Title Priority Date Filing Date
TW109127757A TWI749709B (en) 2020-08-14 2020-08-14 A method of speaker identification

Country Status (1)

Country Link
TW (1) TWI749709B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW201001396A (en) * 2008-06-26 2010-01-01 Univ Nat Taiwan Science Tech Method for synthesizing speech
US20100106503A1 (en) * 2008-10-24 2010-04-29 Nuance Communications, Inc. Speaker verification methods and apparatus
US20130325481A1 (en) * 2012-06-05 2013-12-05 Apple Inc. Voice instructions during navigation
CN110798506A (en) * 2019-09-27 2020-02-14 华为技术有限公司 Method, device and equipment for executing command

Also Published As

Publication number Publication date
TW202207209A (en) 2022-02-16

Similar Documents

Publication Publication Date Title
Li et al. A study on replay attack and anti-spoofing for automatic speaker verification
Villalba et al. Speaker verification performance degradation against spoofing and tampering attacks
Chen et al. ResNet and Model Fusion for Automatic Spoofing Detection.
Chen et al. Robust deep feature for spoofing detection—The SJTU system for ASVspoof 2015 challenge
EP2364495B1 (en) Method for verifying the identity of a speaker and related computer readable medium and computer
Wu et al. A study on spoofing attack in state-of-the-art speaker verification: the telephone speech case
JP3080388B2 (en) Unknown person identity verification method
US20070038460A1 (en) Method and system to improve speaker verification accuracy by detecting repeat imposters
Sriskandaraja et al. Front-end for antispoofing countermeasures in speaker verification: Scattering spectral decomposition
CN105933272A (en) Voiceprint recognition method capable of preventing recording attack, server, terminal, and system
WO2010047816A1 (en) Speaker verification methods and apparatus
Chen et al. Towards understanding and mitigating audio adversarial examples for speaker recognition
US20220108702A1 (en) Speaker recognition method
TWI749709B (en) A method of speaker identification
Xie et al. A comparison of features for replay attack detection
Alarifi et al. Arabic text-dependent speaker verification for mobile devices using artificial neural networks
Jayanna et al. Fuzzy vector quantization for speaker recognition under limited data conditions
Mubeen et al. Detection of impostor and tampered segments in audio by using an intelligent system
Zhang et al. A two-stage scoring method combining world and cohort models for speaker verification
Phyu et al. Text Independent Speaker Identification for Myanmar Speech
Sarkar et al. Improving speaker verification performance in presence of spoofing attacks using out-of-domain spoofed data
Chadha et al. Text-independent speaker recognition for low SNR environments with encryption
Feng et al. SHNU anti-spoofing systems for asvspoof 2019 challenge
Yang et al. User verification based on customized sentence reading
Inthavisas et al. Attacks on speech biometric authentication