TWI460613B - A speech recognition method to input Chinese characters using any language
- Publication number
- TWI460613B (application TW100113325A)
- Authority
- TW
- Taiwan
- Prior art keywords
- sound
- matrix
- linear predictive
- tone
- cepstrum
Description
The present invention can input 14100 Chinese characters by speech in any language and any accent. The 409 known monosyllables are first arranged into 82 sentences, and the user needs only 20 to 30 minutes to read the 82 sentences once, in the first tone, in the chosen language. More specifically, the invention works with any language, such as Mandarin, Taiwanese, Hakka, or Cantonese: after spending 20 to 30 minutes reading the 82 sentences (which together contain the 409 monosyllables), the user can input 14100 different Chinese characters. Recognition takes less than one second.
First, the 14108 characters are divided into 409 groups according to their monosyllabic sound (ignoring the five tones), and each group is represented by its monosyllable pronounced in the first tone. The characters in each group are divided into two classes: commonly used and rarely used. The features of a monosyllable are represented by a matrix of fixed size E×P = 12×12. One thousand different sounds are each pronounced clearly once, and noise is removed so that only voiced waveform remains. Then E = 12 equal-length elastic frames, without filters and without overlap, convert the whole voiced waveform into an E×P = 12×12 matrix of linear predictive coding cepstra (LPCC). This gives one thousand different E×P LPCC matrices, which represent one thousand different databases.
The 409 known monosyllables are first arranged into 82 sentences. The user reads a sentence in the first tone in any language; the sentence is then segmented into its known monosyllables and noise is removed, leaving only voiced waveform. E equal-length elastic frames then convert the whole voiced waveform of a known monosyllable into an E×P LPCC matrix. Using distance, the monosyllable is assigned to the L databases, among the 1000, whose E×P LPCC matrices are nearest to it. The 409 monosyllables, which represent the 14100 characters, are thus all placed in the 1000 databases, and each monosyllable may be placed in L databases.
To recognize an unknown monosyllable spoken in the first tone in the same language, it is first converted into an E×P LPCC matrix. The distances between this matrix and the E×P matrices of the one thousand databases are used to find the F nearest databases. The unknown monosyllable is then searched for, again by distance, among the known monosyllables in those F nearest databases. All characters in the group represented by the known monosyllable that is identified as the unknown monosyllable are then displayed.
Implemented in Visual Basic, the method identifies the desired character in less than one second. The method is simple: it needs no samples and no training, recognizes by mathematical computation, and can be used immediately by anyone, including speakers with nonstandard or incorrect pronunciation. It has been tested with Mandarin and Taiwanese pronunciation and found to be fast and accurate. If recognition is inaccurate, the invention can correct the monosyllable features on the spot to reach a recognition rate of 100%.
A continuous sound wave is commonly characterized by the following features: energy, zero crossings, extreme count, formants, linear predictive coding cepstrum (LPCC), and Mel-frequency cepstrum (MFCC). Of these, LPCC and MFCC are the most effective and the most widely used. LPCC is the most reliable, stable, and accurate speech feature for representing a continuous sound. It models the waveform with a linear regression model, computes the regression coefficients by least-squares estimation, and converts the estimates into cepstra, which form the LPCC. MFCC instead converts the waveform to the frequency domain with the Fourier transform and then models the auditory system according to the Mel frequency scale. In the paper by S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 28, No. 4, 1980, which used dynamic time warping (DTW), MFCC features gave a higher recognition rate than LPCC features. However, in many speech recognition experiments using Bayesian classification (including the inventor's earlier inventions), LPCC features gave a higher recognition rate than MFCC features and took less time.
As for speech recognition, many methods have been used, including dynamic time warping, vector quantization, and the hidden Markov model (HMM). When the same pronunciation varies in timing, dynamic time warping pulls matching features to the same time position while comparing. The recognition rate is good, but pulling matching features to the same position is difficult and the warping takes too long to be practical. Vector quantization is both inaccurate and slow when a large number of monosyllables must be recognized. The more recent HMM approach recognizes well, but it is complicated: too many unknown parameters must be estimated, and both estimation and recognition are time-consuming. Bayesian classification was used in T. F. Li, "Speech recognition of Mandarin monosyllables," Pattern Recognition, Vol. 36, 2003; in Li, Tze Fen, "Apparatus and Method for Normalizing and Categorizing Linear Prediction Code Vectors using Bayesian Categorization Technique," U.S. Patent No. 5,704,004, Dec. 30, 1997; in Li, Tze Fen, Republic of China Patent No. I 297487 (June 1, 2008), "Speech recognition method"; and in Li, Tze Fen, Republic of China Patent No. I 310543 (June 1, 2009), "A method for recognizing similar Mandarin monosyllables by a continuous quadratic Bayesian classification." In these works, using the same database, sequences of LPCC vectors of varying length were compressed by various methods into feature models of the same size, and the recognition results were better than those of the HMM method in Y. K. Chen, C. Y. Liu, G. H. Chiang, and M. T. Lin, "The recognition of Mandarin monosyllables based on the discrete hidden Markov model," Proceedings of Telecommunication Symposium, Taiwan, 1990. However, the compression process is complicated and time-consuming, it is hard to compress matching features of the same monosyllable to the same time position, and similar monosyllables are hard to distinguish.
Until now there has been no method for inputting Chinese characters by speech. Recently, methods have appeared that input Chinese by sentences (segmented into words or phrases), but the range of characters they can input is small, they are of limited use and inaccurate, and training takes a long time, possibly tens of times longer. The speech recognition method of the present invention addresses these shortcomings. On theoretical grounds, a sound wave carries a speech feature that varies nonlinearly with time, and a feature extraction method follows naturally from this: every monosyllable of any language is represented by a matrix of the same size E×P = 12×12, which raises the monosyllable recognition rate. If recognition fails, the invention provides three techniques for correcting monosyllable features so that recognition becomes correct. The 14100 Chinese characters can then be input by speech in any language.
(1) An object of the present invention is to input 14100 Chinese characters by speech in any language.
(2) An object of the present invention is a high recognition rate for the pronunciation of the 14100 Chinese characters, ideally 100%.
(3) An object of the present invention is a short recognition time and fast input, ideally faster than any existing method; any user needs only 20 to 30 minutes of practice before being able to input the 14100 Chinese characters.
(4) To achieve the above three objects, the invention applies a method for normalizing a monosyllable waveform and extracting its features. It uses a small number, E = 12, of elastic frames of equal length, without overlap and without filters, whose length adjusts freely to the duration of the monosyllable so that they cover its entire waveform. Every monosyllable to be recognized is thereby converted into an LPCC matrix of the same size E×P = 12×12. The sequence of dynamic characteristics within a monosyllable, which vary nonlinearly with time, is converted into an E×P LPCC matrix of fixed size, and the feature models of the same monosyllable have the same features at the same time positions. Comparison can therefore be done immediately, giving real-time recognition and input on a computer.
(5) The invention uses one thousand different databases, so it can recognize a large number of monosyllables quickly and with much higher accuracy. All known monosyllables are distributed among the one thousand databases, each to the databases whose sounds are nearest. To recognize an unknown monosyllable, the F databases nearest to it are found first, and the unknown monosyllable is then searched for among the known monosyllables in those F databases. The F nearest databases contain few known monosyllables, so recognition is easy. The invention uses no samples and no statistical computation for recognition; it recognizes by mathematical computation, using the distances between the E×P LPCC matrices of monosyllables.
(6) The method can recognize monosyllables that are spoken too quickly or too slowly. When speech is too fast, the waveform of a monosyllable is short; the E = 12 equal-length elastic frames shrink, and the same number E of equal-length frames still covers the short waveform and yields E LPCC vectors. When speech is too slow, the waveform is longer and the E = 12 equal-length elastic frames stretch. The same number E of LPCC vectors still represents the long monosyllable effectively.
(7) If recognition fails, the invention provides three techniques for correcting monosyllable features so that recognition becomes correct.
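The elastic framing described in objects (4) and (6) can be illustrated with a short Python sketch. The function name `elastic_frames` is hypothetical, and dropping the few trailing samples that do not divide evenly among the frames is an assumption made here; the text does not say how a remainder is handled.

```python
def elastic_frames(samples, num_frames=12):
    """Split voiced samples into num_frames contiguous, non-overlapping
    frames of equal length. The frame length stretches or shrinks with
    the utterance, so a fast and a slow utterance both give num_frames frames.
    """
    total = len(samples)
    if total < num_frames:
        raise ValueError("need at least one sample per frame")
    # Assumption: up to num_frames - 1 trailing samples are discarded
    # so that every frame has exactly the same length.
    frame_len = total // num_frames
    return [samples[i * frame_len:(i + 1) * frame_len]
            for i in range(num_frames)]


# A long and a short utterance both yield 12 frames, only the length differs.
slow = elastic_frames(list(range(100)))   # 12 frames of 8 samples
fast = elastic_frames(list(range(24)))    # 12 frames of 2 samples
```

In practice each frame must contain clearly more than P samples for the least-squares fit of P coefficients to be meaningful, so very short utterances would need a higher sampling rate or a smaller P.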
The procedure of the invention is explained with Figure 1 and Figure 2. Figure 1 shows how M = 1000 databases are built, each containing similar known monosyllables. Figure 2 shows the procedure by which a user recognizes an unknown monosyllable and inputs Chinese.
In Figure 1, there are first M = 1000 different sounds (1). The waveform of each sound is converted into digitized signal points (10), and noise and silence are removed (20). The voiced waveform is first normalized and then its features are extracted: all signal points of a sound are divided into E = 12 equal time segments, and each segment forms a frame. A sound thus has E equal-length frames (30), without filters and without overlap, whose common length adjusts freely to the total number of signal points so that they cover all of them. The frames are therefore called equal-length elastic frames: their length stretches and shrinks freely, but all E frames have the same length. This differs from Hamming windows, which apply a filter, overlap by half, have a fixed length, and cannot adjust to the length of the waveform. Because the waveform of a sound varies nonlinearly with time, it carries a dynamic speech feature that also varies nonlinearly with time. Because the frames do not overlap, the invention needs only a small number (E = 12) of equal-length elastic frames to cover the whole waveform. Because a signal point can be estimated from the preceding signal points, a regression model that varies linearly with time is used to approximate the nonlinearly varying waveform closely, and the unknown regression coefficients are estimated by the least-squares method. In each equal-length elastic frame, P = 12 linear predictive coding cepstra are computed by the least-squares method (40). A sound is represented by an E×P LPCC matrix, and the E×P LPCC matrix of one sound represents one database, giving one thousand databases in total (50). The 409 known monosyllables are first arranged into 82 sentences; the user reads each sentence in the first tone in any language, and the sentence is segmented into known monosyllables with silence and noise removed. E equal-length elastic frames convert each known monosyllable into an E×P LPCC matrix (60). Using distance, the known monosyllable is assigned to the L nearest databases (70). All 409 known monosyllables are distributed among the M = 1000 different databases. There are then M = 1000 databases, each containing similar known monosyllables; the 409 monosyllables that represent the 14100 characters are all placed in the 1000 databases, and each monosyllable may be placed in L databases (80).
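Steps (70) and (80) of Figure 1 can be sketched in Python as follows. The text speaks only of a "distance or weighted distance" between E×P matrices, so the sum of squared element differences used here is an assumption, and the names `matrix_distance`, `nearest_indices`, and `assign_to_databases` are hypothetical.

```python
def matrix_distance(x, y):
    """Assumed distance: sum of squared differences between two
    feature matrices given as lists of rows of equal shape."""
    return sum((xv - yv) ** 2
               for x_row, y_row in zip(x, y)
               for xv, yv in zip(x_row, y_row))


def nearest_indices(query, references, count):
    """Indices of the `count` reference matrices closest to `query`."""
    order = sorted(range(len(references)),
                   key=lambda i: matrix_distance(query, references[i]))
    return order[:count]


def assign_to_databases(known, references, num_nearest):
    """Place every known monosyllable in its num_nearest closest databases.

    known      -- dict mapping a monosyllable label to its feature matrix
    references -- one feature matrix per database (M of them)
    Returns a list of M buckets holding (label, matrix) pairs.
    """
    buckets = [[] for _ in references]
    for label, matrix in known.items():
        for idx in nearest_indices(matrix, references, num_nearest):
            buckets[idx].append((label, matrix))
    return buckets


# Toy example with 1x2 matrices instead of 12x12, M = 3 and L = 2.
references = [[[0.0, 0.0]], [[1.0, 1.0]], [[5.0, 5.0]]]
known = {"ba": [[0.9, 1.0]], "ma": [[4.0, 4.0]]}
buckets = assign_to_databases(known, references, num_nearest=2)
```

Placing each monosyllable in L databases rather than one gives the later search some tolerance: a slightly different pronunciation may land nearest to a neighbouring database and still find the right candidate there.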
Figure 2 shows the procedure for recognizing an unknown monosyllable spoken in the first tone in the same language and for inputting text. The user first pronounces the unknown monosyllable clearly in the first tone in the same language (2). Its waveform is digitized into signal points (10), and silence and noise are removed (20). E equal-length elastic frames normalize the waveform and extract features: all voiced signal points of the unknown monosyllable are divided into E equal time segments, each forming an elastic frame (30). There are E equal-length elastic frames in total, without filters and without overlap, stretching or shrinking freely to cover all signal points. Within each frame, because a signal point can be estimated from the preceding signal points, the unknown regression coefficients are estimated by the least-squares method. The P = 12 least-squares estimates produced in each frame are called the linear predictive coding (LPC) vector, which is then converted into the more stable LPCC vector; an unknown monosyllable is represented by one E×P LPCC matrix (41). The invention uses the distance, or weighted distance, between the E×P LPCC matrix of the unknown monosyllable and the E×P LPCC matrices of the M = 1000 databases (80) to find the F nearest databases, that is, the F databases with the smallest distances to the LPCC matrix of the unknown monosyllable (84). Then, again by distance or weighted distance, the unknown monosyllable is searched for among the known monosyllables in the F nearest databases, and all characters in the group represented by the known monosyllable identified as the unknown monosyllable are displayed (90).
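The two-stage search of steps (84) and (90) can be sketched in Python as follows. As before, the squared-difference distance is an assumption (the text only says "distance or weighted distance"), the names `matrix_distance` and `recognize` are hypothetical, and returning a ranked list of candidates is one possible reading of how the result is presented.

```python
def matrix_distance(x, y):
    """Assumed distance: sum of squared element differences."""
    return sum((xv - yv) ** 2
               for x_row, y_row in zip(x, y)
               for xv, yv in zip(x_row, y_row))


def recognize(unknown, references, buckets, num_databases):
    """Rank candidate monosyllables for an unknown feature matrix.

    Stage 1: pick the num_databases (F) databases whose reference
             matrices are closest to the unknown matrix.
    Stage 2: compare the unknown matrix only with the known
             monosyllables stored in those databases.
    Returns labels sorted from best to worst match.
    """
    nearest = sorted(range(len(references)),
                     key=lambda i: matrix_distance(unknown, references[i]))
    best = {}
    for idx in nearest[:num_databases]:
        for label, matrix in buckets[idx]:
            d = matrix_distance(unknown, matrix)
            if label not in best or d < best[label]:
                best[label] = d
    return sorted(best, key=best.get)


# Toy data: 3 databases, 1x2 matrices instead of 12x12.
references = [[[0.0, 0.0]], [[1.0, 1.0]], [[5.0, 5.0]]]
ba, ma = [[0.9, 1.0]], [[4.0, 4.0]]
buckets = [[("ba", ba)], [("ba", ba), ("ma", ma)], [("ma", ma)]]
ranked = recognize([[1.0, 1.1]], references, buckets, num_databases=2)
```

The top-ranked label would then be looked up in the table of 409 character groups so that all characters sharing that monosyllable can be offered to the user.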
The invention is described in detail below:
(1) After a sound (monosyllable) is pronounced clearly (1), its waveform is converted into a sequence of digitized signal points (sampled points) (10). Signal points that carry no speech are then deleted: all silence and noise before and after the sound (monosyllable) are removed (20). The remaining signal points are all the signal points of the sound (monosyllable). The waveform is first normalized and then its features are extracted: all signal points are divided into E = 12 equal time segments, each forming a frame. A sound (monosyllable) has E equal-length elastic frames, without filters, without overlap, and freely stretching or shrinking to cover all signal points (30). Within each equal-length elastic frame, the signal points vary nonlinearly with time and are hard to describe with a mathematical model. J. Makhoul, "Linear Prediction: A tutorial review," Proceedings of the IEEE, Vol. 63, No. 4, 1975, and Li, Tze Fen, "Apparatus and Method for Normalizing and Categorizing Linear Prediction Code Vectors using Bayesian Categorization Technique," U.S. Patent No. 5,704,004, Dec. 30, 1997, show that a signal point is linearly related to the preceding signal points, so a regression model that varies linearly with time can be used to estimate the nonlinearly varying signal points. The signal point S(n) can be estimated from the preceding signal points, and its estimate S'(n) is given by the following regression model, equation (1):
In equation (1), a_k, k = 1, ..., P, are the estimates of the unknown regression coefficients, and P is the number of preceding signal points used. The least-squares estimates are obtained with Durbin's recursive formula, as given in L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall PTR, Englewood Cliffs, New Jersey, 1993, and in Li, Tze Fen, "Apparatus and Method for Normalizing and Categorizing Linear Prediction Code Vectors using Bayesian Categorization Technique," U.S. Patent No. 5,704,004, Dec. 30, 1997. This set of estimates is called the linear predictive coding (LPC) vector. The method for obtaining the LPC vector of the signal points in a frame is as follows. Let E_1 denote the sum of squared differences between the signal points S(n) and their estimates S'(n), equation (2):
The regression coefficients are chosen to minimize the sum of squares E_1. Taking the partial derivative of equation (2) with respect to each unknown regression coefficient a_i, i = 1, ..., P, and setting it to zero gives P normal equations, equation (3):
Expanding equation (2) and substituting equation (3) gives the minimum total squared error E_P, equation (4):
Equations (3) and (4) are rewritten as equations (5) and (6):
In equations (5) and (6), with N denoting the number of signal points in the frame, equation (7):
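The formulas for equations (1) to (7) are not reproduced in this text. A reconstruction is sketched below under two assumptions that are not confirmed by the source: that the equations follow the standard autocorrelation formulation of linear prediction found in Rabiner and Juang, and that the samples of a frame are indexed n = 0, ..., N-1 and taken as zero outside the frame, so that sums over n run over all samples.

```latex
\begin{align}
S'(n) &= \sum_{k=1}^{P} a_k\, S(n-k) \tag{1}\\
E_1 &= \sum_{n} \Bigl[\, S(n) - \sum_{k=1}^{P} a_k\, S(n-k) \Bigr]^2 \tag{2}\\
\sum_{k=1}^{P} a_k \sum_{n} S(n-k)\, S(n-i) &= \sum_{n} S(n)\, S(n-i),
  \qquad 1 \le i \le P \tag{3}\\
E_P &= \sum_{n} S^2(n) - \sum_{k=1}^{P} a_k \sum_{n} S(n)\, S(n-k) \tag{4}\\
\sum_{k=1}^{P} a_k\, R(|i-k|) &= R(i), \qquad 1 \le i \le P \tag{5}\\
E_P &= R(0) - \sum_{k=1}^{P} a_k\, R(k) \tag{6}\\
R(i) &= \sum_{n=0}^{N-1-i} S(n)\, S(n+i), \qquad 0 \le i \le P \tag{7}
\end{align}
```

Equation (4) follows from expanding the square in (2) and using (3) to collapse the double sum over coefficients into a single sum, which is why only one cross term survives. Equations (5) and (6) are (3) and (4) with the inner sums replaced by the autocorrelation R defined in (7).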
Durbin's recursion computes the LPC vector quickly as follows:
E_0 = R(0)    (8)
Applying formulas (8) to (12) recursively gives the least-squares estimates of the regression coefficients a_j, j = 1, ..., P (the LPC vector), as follows:
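Formulas (9) to (13) are not reproduced in this text. The following Python sketch shows the Levinson-Durbin recursion in its standard textbook form, which is assumed here to be what formulas (8) to (13) describe; the function names `autocorrelation` and `durbin`, and the mapping of comment numbers to formulas, are illustrative.

```python
def autocorrelation(frame, max_lag):
    """R(i) = sum over n of S(n) * S(n + i), for i = 0 .. max_lag."""
    n = len(frame)
    return [sum(frame[t] * frame[t + i] for t in range(n - i))
            for i in range(max_lag + 1)]


def durbin(r, order):
    """Levinson-Durbin recursion.

    r     -- autocorrelations R(0) .. R(order)
    order -- number of coefficients P
    Returns (a, err): the LPC vector [a_1, ..., a_P] and the final
    prediction error E_P.
    """
    a = [0.0] * (order + 1)      # a[j] holds a_j at the current order
    err = r[0]                   # (8)  E_0 = R(0)
    for i in range(1, order + 1):
        if err <= 0.0:
            raise ValueError("prediction error vanished; degenerate frame")
        # (9) reflection coefficient k_i
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k                 # (10) a_i at order i equals k_i
        for j in range(1, i):    # (11) update the lower-order coefficients
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k       # (12) E_i = (1 - k_i^2) * E_(i-1)
    return a[1:], err            # (13) a_j taken at the final order P


# Autocorrelations of an ideal first-order process with coefficient 0.5.
lpc, err = durbin([1.0, 0.5, 0.25], order=2)
```

For real speech one would call `durbin(autocorrelation(frame, 12), 12)` on each elastic frame; the guard on `err` covers silent or constant frames, for which the recursion is undefined.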
The LPC vector is then converted into the more stable linear predictive coding cepstrum (LPCC) vector a'_j, j = 1, ..., P, by the following formulas:
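The conversion formulas themselves are not reproduced in this text. The Python sketch below uses the standard LPC-to-cepstrum recursion from Rabiner and Juang, a'_i = a_i + sum over j from 1 to i-1 of (j/i) a_(i-j) a'_j for 1 <= i <= P, which is assumed here to match the missing formulas; the function name `lpc_to_lpcc` is hypothetical, and only the first P cepstral values are computed.

```python
def lpc_to_lpcc(a):
    """Convert an LPC vector [a_1, ..., a_P] into the LPCC vector
    [a'_1, ..., a'_P] with the standard recursion
        a'_i = a_i + sum_{j=1}^{i-1} (j / i) * a_{i-j} * a'_j.
    Lists are 0-based, so a[i - 1] stores a_i.
    """
    order = len(a)
    c = [0.0] * order
    for i in range(1, order + 1):
        acc = a[i - 1]
        for j in range(1, i):
            acc += (j / i) * a[i - j - 1] * c[j - 1]
        c[i - 1] = acc
    return c


# For a single coefficient a_1 = 0.5 the cepstrum is 0.5**n / n.
lpcc = lpc_to_lpcc([0.5, 0.0, 0.0])
```

Applying `lpc_to_lpcc` to the LPC vector of each of the E frames and stacking the results row by row gives the E×P feature matrix used throughout the method.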
Each elastic frame yields one LPCC vector (a'_1, ..., a'_P) (40). In the speech recognition method of the invention P = 12 is used, because the later LPCC values are almost zero. A sound (monosyllable) is characterized by E LPCC vectors, that is, a matrix of E×P = 12×12 LPCC values represents one sound. The E×P LPCC matrix of one sound represents one database, giving one thousand databases in total (50).
(2) The 409 known monosyllables are first arranged into 82 sentences. The user reads each sentence in the first tone in any language; the sentence is segmented into known monosyllables, silence and noise are removed, and formulas (8) to (15) convert each known monosyllable into an E×P LPCC matrix (60). The distance, or weighted distance, between the E×P LPCC matrix of the known monosyllable and the E×P LPCC matrices of all M = 1000 different sounds is used to find the L nearest databases, and the known monosyllable is assigned to those L databases (70). There are then M = 1000 databases, each containing similar known monosyllables; the 409 monosyllables that represent the 14100 characters are all placed in the 1000 databases, and each monosyllable may be placed in L databases (80).
To recognize an unknown monosyllable spoken in the first tone in the same language, the user first pronounces the unknown monosyllable clearly (2). Its waveform is digitized into signal points (10), and silence and noise are removed (20): all silence and noise before and after the unknown monosyllable are deleted. E equal-length elastic frames normalize the waveform and extract features: all voiced signal points of the unknown monosyllable are divided into E equal time segments, each forming an elastic frame. There are E equal-length elastic frames in total, without filters and without overlap, stretching or shrinking freely to cover all signal points (30). Within each frame, because a signal point can be estimated from the preceding signal points, the unknown regression coefficients are estimated by the least-squares method. In each frame, formulas (8) to (15) produce an LPCC vector, and an unknown monosyllable is represented by one E×P LPCC matrix (41). The invention likewise uses the distance, or weighted distance, between the E×P LPCC matrix of the unknown monosyllable and the E×P LPCC matrices of all M = 1000 different sounds (80) to find the F nearest databases (84). Then, using the distance, or weighted distance, between the E×P LPCC matrix of the unknown monosyllable and the E×P LPCC matrices of the known monosyllables in the F nearest databases, the monosyllable the user intended is found, and all characters in the group represented by the known monosyllable identified as the unknown monosyllable are displayed (90).
(3) To verify that the invention can input Chinese characters quickly and accurately, 82 sentences were composed from the 409 monosyllables, each sentence containing 5 characters with different monosyllables. The inventor read the 82 sentences in the first tone; after the sentences were segmented into monosyllables, each monosyllable was assigned to the L = 5 nearest databases. Each database contains two groups, one storing commonly used characters and the other storing rarely used characters. Using Visual Basic, the total training time was 26 minutes.
(4) When the 409 monosyllables were tested, 90% ranked first or second. After correction, all appeared in the top four; because there are so many similar sounds, six groups are pronounced the same and even the inventor cannot tell them apart. Recognition and input take less than 1 second.
(5) Figure 3 shows the Chinese part of this specification being input by speech with this software (Visual Basic). Figure 4 shows the rarely used characters from Figure 3.
(1)... Start with M=1000 different sounds
(10)... Digitize the sound wave
(20)... Remove noise and silent intervals
(30)... Normalize the entire voiced waveform with E equal-length elastic frames
(40)... Within each equal-length elastic frame, compute P linear predictive coding cepstra (LPCC) by the least-squares method
(50)... The ExP LPCC matrix of one sound represents one database, giving one thousand databases in total
(60)... Divide the 409 known monosyllables into 82 sentences; read one sentence in the first tone in any language, split the sentence into its known monosyllables, remove silence and noise, and convert each monosyllable into an ExP linear predictive coding cepstrum (LPCC) matrix
(70)... Using distance, assign the ExP LPCC matrix of each known monosyllable to the L closest databases
(80)... There are M=1000 databases, each containing similar known monosyllables; the 409 monosyllables that represent the 14100 characters are all stored in the 1000 databases, and each monosyllable may be placed in L databases
(2)... Clearly pronounce the unknown monosyllable to be recognized, in the first tone and in the same language
(41)... Within each equal-length elastic frame, compute P linear predictive coding cepstra by the least-squares method; the unknown monosyllable is represented by an ExP LPCC matrix
(84)... Using distance, find among the M=1000 databases the F databases closest to the unknown monosyllable
(90)... Among the similar known monosyllables in the F closest databases, use distance to identify the unknown monosyllable; every character in the group represented by the matched known monosyllable is then displayed
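Steps (30) to (50) above can be sketched in code. This is a minimal illustration under stated assumptions, not the patent's implementation: it assumes the signal is already digitized with noise and silence removed, and it assumes the least-squares LPC fit is solved by the autocorrelation method with the Levinson-Durbin recursion, which the text does not specify. The function names `lpc`, `lpc_to_cepstrum`, and `lpcc_matrix` are hypothetical.

```python
def lpc(frame, order):
    """Least-squares LPC coefficients a_1..a_order for s[n] ~ sum a_k * s[n-k].

    Assumption: autocorrelation method solved with Levinson-Durbin.
    """
    n = len(frame)
    # Autocorrelation lags r[0..order] of the frame.
    r = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(order + 1)]
    if r[0] == 0.0:
        return [0.0] * order  # silent frame: nothing to predict
    a = [0.0] * (order + 1)   # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err         # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
        if err <= 0.0:        # frame is perfectly predictable; stop early
            break
    return a[1:]


def lpc_to_cepstrum(a):
    """Convert LPC coefficients to the same number of cepstral coefficients.

    Uses the standard recursion c_n = a_n + sum_{k<n} (k/n) * c_k * a_{n-k}.
    """
    p = len(a)
    c = [0.0] * p
    for n in range(1, p + 1):
        acc = a[n - 1]
        for k in range(1, n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c


def lpcc_matrix(signal, E=12, P=12):
    """Split the voiced signal into E equal, non-overlapping frames and
    return an E x P matrix (list of rows) of LPC cepstra."""
    frame_len = len(signal) // E  # "elastic": frame length scales with the signal
    if frame_len <= P:
        raise ValueError("signal too short for E frames of order P")
    rows = []
    for e in range(E):
        frame = signal[e * frame_len:(e + 1) * frame_len]
        rows.append(lpc_to_cepstrum(lpc(frame, P)))
    return rows
```

Because the frame length stretches with the utterance, every sound yields a matrix of the same ExP size regardless of its duration, which is what lets matrices be compared entry by entry in the later distance steps.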
Figures 1 and 2 illustrate how the invention is carried out. Figure 1 shows the construction of M=1000 different databases, each containing similar known monosyllables. Figure 2 shows the flow for recognizing an unknown monosyllable and inputting Chinese characters. Figure 3 shows the Chinese portion of this specification being entered with the Visual Basic software. Figure 4 shows the less common characters that arise in Figure 3.
(2)... Clearly pronounce the unknown monosyllable to be recognized, in the first tone and in the same language
(10)... Digitize the sound wave
(20)... Remove noise and silent intervals
(30)... Normalize the entire voiced waveform with E equal-length elastic frames
(41)... Within each equal-length elastic frame, compute P linear predictive coding cepstra by the least-squares method; the unknown monosyllable is represented by an ExP LPCC matrix
(80)... There are M=1000 databases, each containing similar known monosyllables; the 409 monosyllables that represent the 14100 characters are all stored in the 1000 databases, and each monosyllable may be placed in L databases
(84)... Using distance, find among the M=1000 databases the F databases closest to the unknown monosyllable
(90)... Among the similar known monosyllables in the F closest databases, use distance to identify the unknown monosyllable; every character in the group represented by the matched known monosyllable is then displayed
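The database assignment in step (70) and the two-stage search in steps (84) and (90) can be sketched as follows. This is an illustration only: the text says "distance" without defining it here, so the sum of absolute differences between matrix entries is an assumption, and the names `distance`, `closest`, `build_databases`, and `recognize` are hypothetical. `centers` stands for the M matrices that represent the databases, and each matrix is a list of rows.

```python
def distance(a, b):
    """Assumed distance: sum of absolute entry-wise differences
    between two matrices of the same shape."""
    return sum(abs(x - y) for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def closest(centers, target, count):
    """Indices of the `count` database matrices nearest to `target`."""
    order = sorted(range(len(centers)),
                   key=lambda i: distance(target, centers[i]))
    return order[:count]


def build_databases(centers, known, L):
    """Step (70): place each known monosyllable in its L closest databases.

    `known` maps a monosyllable label to its matrix.
    Returns one list of (label, matrix) pairs per database.
    """
    databases = [[] for _ in centers]
    for label, matrix in known.items():
        for i in closest(centers, matrix, L):
            databases[i].append((label, matrix))
    return databases


def recognize(centers, databases, unknown, F):
    """Steps (84) and (90): search only the F closest databases and
    rank the known monosyllables found there by distance to `unknown`."""
    best = {}
    for i in closest(centers, unknown, F):
        for label, matrix in databases[i]:
            d = distance(unknown, matrix)
            if label not in best or d < best[label]:
                best[label] = d
    return sorted(best, key=best.get)


# Toy usage with 1x2 matrices standing in for the 12x12 LPCC matrices.
centers = [[[0.0, 0.0]], [[10.0, 10.0]], [[20.0, 20.0]]]
known = {"ba": [[1.0, 1.0]], "ma": [[11.0, 9.0]], "ta": [[19.0, 21.0]]}
databases = build_databases(centers, known, L=1)
ranked = recognize(centers, databases, [[10.5, 9.5]], F=2)
```

Restricting the search to F databases means the unknown sound is compared against a few candidate monosyllables rather than all 409; the first entry of `ranked` is the best match, and the group of characters that share that monosyllable would then be offered to the user.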
Claims (2)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW100113325A TWI460613B (en) | 2011-04-18 | 2011-04-18 | A speech recognition method to input chinese characters using any language |
Publications (2)
Publication Number | Publication Date |
---|---|
TW201243650A (en) | 2012-11-01 |
TWI460613B (en) | 2014-11-11 |
Family
ID=48093897
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW100113325A TWI460613B (en) | 2011-04-18 | 2011-04-18 | A speech recognition method to input chinese characters using any language |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI460613B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006132596A1 (en) * | 2005-06-07 | 2006-12-14 | Matsushita Electric Industrial Co., Ltd. | Method and apparatus for audio clip classification |
US20080277564A1 (en) * | 2005-11-14 | 2008-11-13 | Koninklijke Philips Electronics, N.V. | System and Method of Controlling the Power of a Radiation Source |
TW201106340A (en) * | 2009-08-03 | 2011-02-16 | Tze-Fen Li | A speech recognition method for all languages without using samples |
TW201112226A (en) * | 2009-09-17 | 2011-04-01 | Tze-Fen Li | A method for speech recognition on all languages and for inputing words using speech recognition |
2011-04-18: TW application TW100113325A filed (patent TWI460613B); status: not active, IP right cessation
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MM4A | Annulment or lapse of patent due to non-payment of fees |