TWI386918B - Sound recognition method - Google Patents

Sound recognition method

Info

Publication number
TWI386918B
TWI386918B TW99121320A
Authority
TW
Taiwan
Prior art keywords
audio
error rate
classification error
identified
detection window
Prior art date
Application number
TW99121320A
Other languages
Chinese (zh)
Other versions
TW201201197A (en)
Original Assignee
Tung Fang Inst Of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tung Fang Inst Of Technology filed Critical Tung Fang Inst Of Technology
Priority to TW99121320A priority Critical patent/TWI386918B/en
Publication of TW201201197A publication Critical patent/TW201201197A/en
Application granted granted Critical
Publication of TWI386918B publication Critical patent/TWI386918B/en

Links

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Description

Method of sound recognition

The present invention is a method of sound recognition, and more particularly a method of determining which sound source produced a given audio signal.


Because the sounds produced by different creatures and people are distinctive, a sound can be recognised to determine which creature or person produced it, and this distinctiveness allows sound recognition to be applied to anti-theft systems, biometric identification systems, and the like.
The most commonly used sound-recognition technique identifies a sound by the distances between the parameters of two audio signals; the typical example is the metric-based recognition method. From the plural sample audio signals in an audio database, one sample is taken and compared for similarity against an audio signal to be identified. (Throughout this specification, "audio" means a digital sound signal: since a sound source is analogue, it is converted into a digital signal by an analogue-to-digital converter, and unless otherwise stated, audio hereafter refers to the digital signal.) By computing the distance between each audio parameter of the signal to be identified and each audio parameter of the sample, a similarity value is obtained and used to judge whether the two signals come from the same sound source.
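The conventional metric-based comparison described here can be sketched as follows. This is a minimal illustration rather than the patent's own code: the function name and the choice of mean nearest-neighbour Euclidean distance as the similarity measure are assumptions.

```python
import numpy as np

def metric_based_similarity(query_params: np.ndarray, sample_params: np.ndarray) -> float:
    """Mean nearest-neighbour distance between two sets of audio
    parameter vectors (one vector per row).  Smaller means more similar."""
    # Pairwise Euclidean distances: every query vector against every sample vector.
    diffs = query_params[:, None, :] - sample_params[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # For each query vector keep only its closest sample vector, then average.
    return float(dists.min(axis=1).mean())
```

As the passage that follows notes, this style of comparison visits every parameter pair, which is what makes it slow for large parameter sets.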
Because the metric-based method computes the distance between every audio parameter of the signal to be identified and every audio parameter of the sample, and the number of parameters is large, these distance computations are time-consuming. Moreover, a creature, person, or object does not produce exactly the same sound every time: the energy of the sound at the moment it is produced, or noise in the environment, causes the parameters to vary irregularly, so the metric-based method suffers a high recognition error rate. It also relies on the similarity value alone to judge whether the two signals are the same, with no further verification step, which raises its error rate further.
Because the conventional technique suffers from the above problems and its recognition accuracy is insufficient, the inventor, drawing on expertise in acoustics and an enthusiasm for sound recognition, began to study how to overcome these shortcomings.


In view of the problems of the prior art, the inventor, after repeated design, experiment, and reflection, arrived at a method of sound recognition that improves upon those shortcomings.
The invention is a method of sound recognition that determines which sound an audio signal to be identified represents, comprising the following steps:
(A) take an audio database, the database comprising plural sample audio signals;
(B) take any sample from the plural sample audio signals, and use a classifier to classify the audio signal to be identified against that sample, obtaining a similarity value;
(C) set a similarity threshold; when the similarity value is higher than the threshold, judge the audio signal to be identified and the sample to be the same audio; when the similarity value is lower than the threshold, return to step (B).
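Steps (A) to (C) amount to a loop over the database with a pluggable similarity test. A minimal sketch, with all names illustrative and the similarity function left abstract:

```python
from typing import Callable, Iterable, Optional, Tuple

def identify(query: object,
             database: Iterable[Tuple[str, object]],   # step (A): labelled samples
             similarity: Callable[[object, object], float],
             threshold: float) -> Optional[str]:
    """Return the first label whose sample clears the similarity threshold."""
    for label, sample in database:                     # step (B): take a sample
        if similarity(query, sample) > threshold:      # step (C): threshold test
            return label                               # same sound source
    return None                                        # no sample matched
```

In the patent's preferred embodiment the `similarity` slot is filled by the SVM classification-error-rate test described below.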
The invention compares the audio signal to be identified for similarity against each sample in the audio database; the samples include speech from different people, calls of various animals, sounds of colliding objects, and so on, from which the signal's sound is judged. Because similar or identical audio signals are harder to separate into classes, classifying the signal to be identified against any sample in the database with the classifier serves as the basis for the similarity value.
The classifier may be a support vector machine, a nearest-neighbour classifier, a GMM, K-means, or the like. In the preferred embodiment the classifier is a support vector machine, which classifies the audio signal to be identified against the sample and yields a classification line separating the parameters of the two signals; each parameter is then checked against this line. Unless otherwise stated, the number of sample audio parameters is denoted A and the number of parameters of the signal to be identified is denoted B. When, during checking, a parameter of the signal to be identified is judged to be a sample parameter, it counts toward the first classification error rate, computed as (number of to-be-identified parameters judged to be sample parameters) / B; when a sample parameter is judged to belong to the signal to be identified, it counts toward the second classification error rate, computed as (number of sample parameters judged to be to-be-identified parameters) / A. Because the parameters of two recordings of the same sound are very alike, the classifier cannot accurately find a classification line that separates them completely, so both the first and the second classification error rate take high values.
In the preferred embodiment of step (C), the similarity threshold is a classification-error-rate threshold. When the first and the second classification error rate are both higher than this threshold, the audio signal to be identified and the sample are judged to be the same audio, thereby establishing what the signal is. If only the first error rate exceeds the threshold, or only the second, or neither, the signal to be identified and the sample are judged to be different sounds, and the signal continues to be compared against further samples until its sound is determined.
Because the invention decides whether two signals are the same sound by finding the classification line between their parameters, rather than, as in the conventional technique, taking the minimum of the distances between individual parameters, it judges similarity more reliably and in less time; and because the classification line, once found, is used to re-check every parameter, the accuracy of the judgment is improved further.


The following description, with the aid of the drawings, sets out the construction, features, and embodiments of the invention so that it may be understood more fully.
The invention is a method of sound recognition that determines which sound a given sound source represents, comprising the following steps:
Referring to the first figure, in step (A) an audio database is taken. The database comprises plural sample audio signals covering sound sources produced by people, animals, machines, or other objects; an analogue-to-digital converter converts these sound sources into audio signals, which are stored as samples and serve as the reference for judging what the sound source to be identified is.
Referring to the first figure, in step (B) an audio signal to be identified is taken; it may be a sound source produced by a person, animal, machine, or other object, converted into an audio signal by an analogue-to-digital converter. Any sample is taken from the audio database, and the signal to be identified is compared for similarity with that sample to judge whether the two are the same sound. The comparison uses a classifier to classify the signal to be identified against the sample: because two similar or identical sounds are difficult for the classifier to separate, classifying them yields a similarity value from which sameness is judged. The classifier may be any machine capable of classifying two or more data sets, such as a nearest-neighbour classifier, a support vector machine, a GMM, or K-means.
Referring to the first figure, in step (C) a similarity threshold is set; when the similarity value is higher than the threshold, the signal to be identified and the sample are judged to be the same audio, thereby achieving sound recognition.
A support vector machine operates by taking two different data sets as input and finding the minimum margin between their parameters to obtain a classification line that distinguishes and classifies the parameters of the two sets. In the first embodiment of the invention, therefore, the classifier of step (B) is a support vector machine. Referring to figure 3-A, the signal to be identified comprises plural to-be-identified audio parameters (2) and the sample comprises plural sample audio parameters (3); the support vector machine finds the classification line (1) dividing the parameters (2) and (3) into two classes, and the line (1) is then used to check each to-be-identified parameter (2) and each sample parameter (3). If a parameter that was originally a to-be-identified parameter (2) is judged after checking to be a sample parameter (3) -- in figure 3-B, where to-be-identified parameters (2) are drawn as circles on the right of the classification line (1) and sample parameters (3) as squares on its left, this is a circle moving in the direction of the dashed arrow to the left side of the line (1) -- the misjudged parameters are counted and the first classification error rate is obtained, computed as (number of to-be-identified parameters judged to be sample parameters) / B. Conversely, if a parameter that was originally a sample parameter (3) is judged after checking to be a to-be-identified parameter (2) -- in figure 3-B, a square on the left of the line (1) moving in the direction of the dashed arrow to the right -- the misjudged parameters are counted and the second classification error rate is obtained, computed as (number of sample parameters judged to be to-be-identified parameters) / A.
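Under the assumption that a standard linear SVM stands in for the patent's support vector machine (the patent names no implementation; scikit-learn's `LinearSVC` is used here purely for illustration), the two training-misclassification rates can be computed as follows. The function name and feature layout are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def classification_error_rates(query_params: np.ndarray,
                               sample_params: np.ndarray) -> tuple:
    """Fit a linear SVM to separate the two parameter sets, then re-check
    the training data against the resulting classification line.

    Returns (first, second):
      first  = fraction of the B query vectors judged to be sample vectors
      second = fraction of the A sample vectors judged to be query vectors
    """
    X = np.vstack([query_params, sample_params])
    y = np.array([0] * len(query_params) + [1] * len(sample_params))
    line = LinearSVC(random_state=0).fit(X, y)     # the "classification line"
    pred = line.predict(X)
    first = float(np.mean(pred[: len(query_params)] == 1))
    second = float(np.mean(pred[len(query_params):] == 0))
    return first, second
```

On clearly different sounds the parameter clouds separate and both rates stay near zero; when the two recordings come from the same source the clouds overlap, no clean line exists, and both rates rise -- which is exactly the signal the threshold test of step (C) looks for.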
Referring to the second figure, in the first embodiment the similarity value consists of the first and the second classification error rate. When the support vector machine classifies the signal to be identified against the sample and the two are in fact the same sound, it is difficult to find a classification line (1) that separates them, so the checking step produces high classification error rates; the error rates therefore serve as the similarity judgment. Accordingly, the similarity threshold of step (C) is a classification-error-rate threshold (0): when the first and the second error rate are both higher than the threshold (0), the signal to be identified and the sample are judged to be the same sound; if only the first error rate exceeds the threshold (0), or only the second, or neither, they are judged to be different sounds. Requiring both error rates to exceed the threshold (0) before declaring a match gives the invention a high recognition accuracy.
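The both-rates-above-threshold rule of this paragraph reduces to a small predicate; a sketch with illustrative names:

```python
def same_sound(first_error: float, second_error: float, threshold: float) -> bool:
    """Judge two signals to come from the same sound source only when BOTH
    classification error rates exceed the threshold; one high rate alone,
    or neither above the threshold, means different sounds."""
    return first_error > threshold and second_error > threshold
```

The double requirement is the verification step the metric-based method lacks: a single accidental overlap in one direction is not enough to declare a match.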
Referring to the second figure, the second embodiment adds a preliminary step before step (A). A conversation audio signal (6) is taken -- a recording containing plural different speakers, such as a radio broadcast, a meeting recording, or an ordinary conversation -- and converted into the conversation audio (6) by an analogue-to-digital converter. The overlapping and silent segments of the different speakers are first extracted from the conversation audio (6) so that the audio boundaries (7) between speakers can be found; the boundary (7) may be found with, for example, a nearest-neighbour classifier or a support vector machine, and the preferred embodiment of the invention uses the support vector machine.
Referring to the second figure, a first detection window (4) and a second detection window (5) are set, each observing the conversation audio (6) over the same unit of time and yielding a first and a second audio parameter respectively. The support vector machine classifies the first audio parameter against the second to judge whether the audio (6) observed by the two windows belongs to the same speaker -- the principle of this judgment has been explained above and is not repeated -- yielding a first classification error rate curve (8) and a second classification error rate curve (9). When the audio (6) observed by the first window (4) and that observed by the second window (5) are judged to belong to different speakers, an audio boundary (7) is set, passing through the edge of the second window (5) adjacent to the first window (4), dividing the conversation audio (6) into per-speaker signals to be identified. The first window (4) detects from the start time of the conversation audio (6), and the second window (5) starts adjacent to the first window (4); by detecting the conversation audio (6) with the two windows, at least one boundary (7) is obtained, dividing the audio (6) into the signals to be identified of plural different speakers, after which steps (A) to (C) determine which speaker each signal belongs to.
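The two-window boundary search of this embodiment (and of claim 3) can be sketched as follows, with the error-rate computation left abstract. The frame representation, the one-window slide after a boundary, and the re-anchoring of the first window are assumptions; note that per claim 3 a boundary is declared when both error rates fall BELOW the threshold, since different speakers separate easily.

```python
from typing import Callable, List, Sequence, Tuple

def detect_boundaries(frames: Sequence,
                      window: int,
                      error_rates: Callable[[Sequence, Sequence], Tuple[float, float]],
                      threshold: float) -> List[int]:
    """Slide the second window along the frames; place a boundary where the
    two windows' contents separate easily (both error rates below threshold)."""
    boundaries = []
    first = 0                    # first detection window stays put...
    second = window              # ...the second starts right after it
    while second + window <= len(frames):
        e1, e2 = error_rates(frames[first:first + window],
                             frames[second:second + window])
        if e1 < threshold and e2 < threshold:   # easily separated -> speaker change
            boundaries.append(second)           # boundary at the second window's near edge
            first = second                      # re-anchor after the boundary (assumption)
            second = first + window
        else:
            second += 1                         # slide the second window one unit
    return boundaries
```

With `classification_error_rates` from the first embodiment plugged in as `error_rates`, this yields the boundary list that splits the conversation into per-speaker segments.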
The third embodiment follows on from the second. Referring to the second figure, any one of the plural signals to be identified is taken and, using the support vector machine, compared for similarity against the other signals to be identified; those whose similarity value exceeds the similarity threshold are marked as the voice of the same speaker, and every block of the same speaker's audio in the conversation audio (6) is so labelled. Steps (A) to (C) then need only identify the signals of distinct speakers, reducing the time required to determine whose voice each signal is.
The fourth embodiment adds a step (D), which provides a speech recognizer that converts the content of the conversation audio (6) into a text record. This allows the invention to be applied to automatic meeting minutes: the meeting is captured with a recorder, the method of the invention separates what each person said, and the speech recognizer converts each person's speech into a text file, so the meeting need not be transcribed by hand and the time spent on meetings is reduced.
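Assembling the step (D) transcript from diarised, speaker-labelled segments is straightforward once any speech-to-text function is available. The recogniser below is a stub passed in by the caller, since the patent does not specify one; all names are illustrative.

```python
from typing import Callable, List, Tuple

def meeting_minutes(segments: List[Tuple[str, object]],
                    recognize: Callable[[object], str]) -> str:
    """Build a speaker-labelled transcript.  `segments` is a list of
    (speaker_label, audio) pairs produced by the boundary-detection and
    grouping steps; `recognize` is any speech-to-text function."""
    lines = []
    for speaker, audio in segments:
        lines.append(f"{speaker}: {recognize(audio)}")
    return "\n".join(lines)
```

In a real deployment `recognize` would wrap an actual speech recogniser; the assembly logic itself does not depend on which one.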
In summary, the invention is industrially applicable, was neither published nor publicly used before this application, is not known to the public, and is non-obvious; it therefore satisfies the requirements for patentability, and a patent application is filed in accordance with the law.
The foregoing is merely a preferred industrial embodiment of the invention; all equivalent variations made within the scope of the appended claims fall within the scope claimed in this case.

(0) Classification error rate threshold
(1) Classification line
(2) To-be-identified audio parameters
(3) Sample audio parameters
(4) First detection window
(5) Second detection window
(6) Conversation audio
(7) Audio boundary
(8) First classification error rate curve
(9) Second classification error rate curve

The first figure is a flow chart of the steps of the invention.
The second figure is a schematic diagram of an embodiment of the invention.
Figure 3-A is a schematic diagram of the operation of the support vector machine of the invention (1).
Figure 3-B is a schematic diagram of the operation of the support vector machine of the invention (2).


Claims (5)

1. A method of sound recognition, comprising:
(A) taking an audio database, the audio database comprising plural sample audio signals;
(B) taking any sample from the plural sample audio signals, and classifying an audio signal to be identified against that sample with a classifier to obtain a similarity value;
(C) setting a similarity threshold; when the similarity value is higher than the similarity threshold, judging the audio signal to be identified and the sample to be the same audio; when the similarity value is lower than the similarity threshold, returning to step (B).
2. The method of sound recognition of claim 1, wherein the classifier used in step (B) is a support vector machine; when the support vector machine classifies the audio signal to be identified against the sample, two similarity values are obtained, namely a first classification error rate and a second classification error rate, and the similarity threshold is a classification-error-rate threshold; when the first classification error rate and the second classification error rate are both higher than the classification-error-rate threshold, the audio signal to be identified and the sample are judged to be the same sound.
3. The method of sound recognition of claim 1, wherein a preliminary step is provided before step (A): a conversation audio signal is taken, the conversation audio being a conversation of plural speakers containing at least one audio boundary; a first detection window and a second detection window are set, each detecting the conversation audio over the same unit of time to obtain a first signal parameter and a second signal parameter respectively, the first detection window detecting from the start time of the conversation audio and the second detection window starting adjacent to the first on the time axis; a support vector machine classifies the first signal parameter against the second, yielding a first classification error rate and a second classification error rate; a classification-error-rate threshold is set; when the first classification error rate and the second classification error rate are both lower than the threshold, an audio boundary line is set, passing perpendicularly through the detection start point of the second window, the time axis, and the conversation audio; when the first and the second classification error rate are not both lower than the threshold, the first window stays in place and the second window slides one unit of time before detecting the conversation audio again, until both error rates are lower than the threshold; by repeating these steps, the conversation audio is divided into plural audio signals to be identified.
4. The method of sound recognition of claim 3, wherein the preliminary step takes one audio signal to be identified from the plural signals and, using the classifier, compares it for similarity against the other signals to be identified, marking as the same sound those whose similarity value is higher than the similarity threshold.
5. The method of sound recognition of claim 3, further comprising a step (D) providing a speech recognizer that converts each speaker's audio into a text file.
TW99121320A 2010-06-29 2010-06-29 Sound recognition method TWI386918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW99121320A TWI386918B (en) 2010-06-29 2010-06-29 Sound recognition method

Publications (2)

Publication Number Publication Date
TW201201197A TW201201197A (en) 2012-01-01
TWI386918B true TWI386918B (en) 2013-02-21

Family

ID=46755730

Family Applications (1)

Application Number Title Priority Date Filing Date
TW99121320A TWI386918B (en) 2010-06-29 2010-06-29 Sound recognition method

Country Status (1)

Country Link
TW (1) TWI386918B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103730032B (en) * 2012-10-12 2016-12-28 李志刚 Multi-medium data control method and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006032028A2 (en) * 2004-09-13 2006-03-23 Reactivity, Inc. Metric-based monitoring and control of a limited resource

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Po-Chuan Lin, J. F. Wang, and Li- Chang Wen, "Design and Portable Device Implementation of Feature-Based Partial Matching Algorithms for Personal Spoken Sentence Retrieval," IET (Institution of Engineering and Technology), Vol. 1, Issue: 3, P.139-149, 2007.09. *
Po-Chuan Lin, Jia-Ching Wang, Jhing-Fa Wang, and Hao-Ching Sung, "Unsupervised Speaker Change Detection Using SVM Training Misclassification Rate," IEEE Transactions on Computers, Vol.56, Issue 9,pp:1212-1244, 2007.09 *
王駿發, 林博川, et al., "A Novel Speaker Change Detection Algorithm Based on Support Vector Machines," 17th Conference on Computational Linguistics and Speech Processing (ROCLING XVII), Sept. 2005. *

Also Published As

Publication number Publication date
TW201201197A (en) 2012-01-01

Similar Documents

Publication Publication Date Title
US11289072B2 (en) Object recognition method, computer device, and computer-readable storage medium
CN112074901B (en) Speech recognition login
Chung et al. Who said that?: Audio-visual speaker diarisation of real-world meetings
CN105161093A (en) Method and system for determining the number of speakers
JP4964204B2 (en) Multiple signal section estimation device, multiple signal section estimation method, program thereof, and recording medium
CN108039181B (en) Method and device for analyzing emotion information of sound signal
WO2022033109A1 (en) Voice detection method and apparatus, and electronic device
Socoró et al. Development of an Anomalous Noise Event Detection Algorithm for dynamic road traffic noise mapping
JP5050698B2 (en) Voice processing apparatus and program
KR102314824B1 (en) Acoustic event detection method based on deep learning
TWI386918B (en) Sound recognition method
CN113345466B (en) Main speaker voice detection method, device and equipment based on multi-microphone scene
US20150039314A1 (en) Speech recognition method and apparatus based on sound mapping
JP5105097B2 (en) Speech classification apparatus, speech classification method and program
JP5342629B2 (en) Male and female voice identification method, male and female voice identification device, and program
CN113963719A (en) Deep learning-based sound classification method and apparatus, storage medium, and computer
Jeon et al. Acoustic surveillance of hazardous situations using nonnegative matrix factorization and hidden Markov model
JP6480124B2 (en) Biological detection device, biological detection method, and program
Rouniyar et al. Channel response based multi-feature audio splicing forgery detection and localization
CN112992175B (en) Voice distinguishing method and voice recording device thereof
KR101547261B1 (en) Speaker identification method
Bratoszewski et al. Comparison of acoustic and visual voice activity detection for noisy speech recognition
TWI386917B Method of finding and grouping speech of the same speaker
TWI386915B Method of finding speech boundaries between different speakers
JPWO2020246041A5 (en) Voice processing device, voice processing method, and voice processing program

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees