TW201216253A - A speech recognition method on sentences in all languages - Google Patents

A speech recognition method on sentences in all languages

Info

Publication number
TW201216253A
TW201216253A (application TW99134580A)
Authority
TW
Taiwan
Prior art keywords
sentence
sound
matrix
cepstrum
sentences
Prior art date
Application number
TW99134580A
Other languages
Chinese (zh)
Other versions
TWI460718B (en)
Inventor
Tze-Fen Li
Lee Tai-Jan Li
Shih-Tzung Li
Shih-Hon Li
Li-Chuan Liao
Original Assignee
Tze-Fen Li
Lee Tai-Jan Li
Shih-Tzung Li
Shih-Hon Li
Li-Chuan Liao
Priority date
Filing date
Publication date
Application filed by Tze-Fen Li, Lee Tai-Jan Li, Shih-Tzung Li, Shih-Hon Li, Li-Chuan Liao filed Critical Tze-Fen Li
Priority to TW099134580A priority Critical patent/TWI460718B/en
Publication of TW201216253A publication Critical patent/TW201216253A/en
Application granted granted Critical
Publication of TWI460718B publication Critical patent/TWI460718B/en

Landscapes

  • Machine Translation (AREA)

Abstract

The invention can recognize sentences in all languages. A sentence can be a syllable, a word, a name or a full sentence. The most important feature of this invention is that every sentence is represented by an E*P = 12*12 matrix of linear predictive coding cepstra (LPCC), produced by E=12 equal-sized elastic frames without filters and without overlap. Prior speech recognition methods must compute and compare a series of feature matrices, one per word; the invention computes and compares only one E*P matrix of LPCC per sentence. 1000 different voices are transformed into 1000 different matrices of LPCC to represent 1000 different databases. The E*P matrices of known sentences, after deletion of noise and of the time intervals between words and between syllables, are placed into their closest databases. To classify an unknown sentence, the distance is used to find its F closest databases among the 1000 databases, and then, among the known sentences in the F closest databases, the known sentence matching the unknown one. The invention needs no samples and can find a sentence in one second using Visual Basic. Any person, without training, can immediately and freely communicate with a computer in any language. The invention can recognize up to 7200 English words, 500 sentences (in any language) and 500 Chinese words, and can input 4400 Chinese words.

Description

VI. Description of the Invention

[Technical Field of the Invention]

The present invention can recognize sentences in any language. It uses E=12 equal-length elastic frames (windows), without filters and without overlap, to convert the sound wave of a sentence — composed of one to many words, long or short — into an ExP = 12x12 matrix of linear predictive coding cepstra (LPCC). All known sentences to be recognized are first classified by similarity into one thousand different databases. To recognize an unknown sentence, it is first converted into an ExP LPCC matrix; the distance from this matrix to the thousand database matrices finds the closest databases, and the known sentences in those databases are then searched by distance for the sentence to be recognized.

After the user speaks, Visual Basic identifies the desired sentence in less than one second. The method is simple, needs no samples, and can be used immediately by anyone, including speakers with non-standard or incorrect pronunciation. Where prior methods must compute and compare the feature values of every word in a sentence, the present invention computes and compares only one ExP matrix per sentence. It is fast and accurate; Mandarin, Taiwanese, English, Japanese and German pronunciation have all been tested, and a large vocabulary can be recognized.

[Prior Art]

Conventional methods first segment an unknown sentence into single sounds or single words. Segmentation is a difficult technique, especially for English, where one word has several syllables and is hard to cut accurately; a one-syllable error causes the sentence to be misrecognized, so the speaker must talk slowly, carefully and clearly, with long pauses between words. Each segmented word of the unknown sentence is then compared with the known words in a word database, where a single word error again misrecognizes the sentence. Finally, the known words found in the word database are concatenated, in the order of the words of the unknown sentence, into a known sentence, and the most probable known sentence is selected from a sentence database. Such methods are hard to make accurate, are time-consuming, and do not allow normal free conversation with a computer. They also require time-consuming sample collection with statistical estimation and recognition, which is inherently inexact, since statistics can only estimate.

The pronunciation of a sentence is represented by its sound wave, a system that varies nonlinearly with time; the wave of a sentence carries dynamic characteristics that also vary nonlinearly and continuously with time. When the same sentence is pronounced twice, the same series of dynamic characteristics appears in the same time order, nonlinearly stretched or compressed, so the same characteristics fall at different time positions. Arranging the same dynamic characteristics of the same sentence at the same time positions is very difficult.

A computerized speech recognition system must first extract from the sound wave the language information — the dynamic characteristics — and filter out noise unrelated to language, such as the speaker's timbre, pitch, emotion and physiology. It must then arrange the same features of the same sentence at the same time positions. The resulting series of features is represented by equal-length feature vectors and is called the feature model of the sentence. Producing feature models of uniform size is too complicated in current speech recognition systems, and because the same features of the same sentence are hard to arrange at the same time positions — especially in English — comparison and recognition are difficult.

Common features of a continuous sound wave include energy, zero crossings, extreme count, formants, the linear predictive coding cepstrum (LPCC) and the Mel-frequency cepstrum (MFCC); of these, LPCC and MFCC are the most effective and the most widely used. The LPCC is the most reliable, stable and accurate language feature for a continuous sound: the wave is represented by a linear regression model, the regression coefficients are computed by least-squares estimation, and the estimates are converted into a cepstrum, yielding the LPCC. The MFCC transforms the wave into frequencies by the Fourier transform and models the auditory system according to the Mel-frequency scale.

According to the paper by S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 28, No. 4, 1980, with dynamic time warping (DTW) the MFCC feature gives a higher recognition rate than the LPCC feature. In many speech recognition experiments (including the inventor's previous inventions) using Bayesian classification, however, the LPCC feature gives a higher recognition rate than the MFCC feature, and saves time.

As for recognition methods, many exist: dynamic time warping, vector quantization and the hidden Markov model (HMM). If the same pronunciation varies in timing, DTW pulls the same features to the same time positions while comparing; the recognition rate is good, but pulling the same features to the same positions is difficult, the warping takes too long, and the method cannot be applied. Vector quantization is inaccurate and time-consuming when it must recognize a large number of single sounds. The recent HMM method recognizes well, but it is complicated: too many unknown parameters must be estimated, and computing the estimates and recognizing cost time.

In T. F. Li, "Speech recognition of mandarin monosyllables", Pattern Recognition, Vol. 36, 2003; Li, Tze Fen, "Apparatus and Method for Normalizing and Categorizing Linear Prediction Code Vectors using Bayesian Categorization Technique", U.S.A. Patent No. 5,704,004, Dec. 30, 1997; the speech recognition method of R.O.C. Patent No. I 297487 (2008); and the method of R.O.C. Patent No. I 310543 (2009) for recognizing similar Mandarin monosyllables by a consecutive quadratic Bayesian classification, series of LPCC vectors of different lengths are compressed by various methods into feature models of the same size and classified by the Bayesian method. The recognition results are better than the HMM method of Y. K. Chen, C. Y. Liu, G. H. Chiang and M. T. Lin, "The recognition of mandarin monosyllables based on the discrete hidden Markov model", Proceedings of Telecommunication Symposium, Taiwan, 1990. The compression process, however, is complicated and time-consuming; it is hard to compress the same features of the same sound to the same time positions, and similar single sounds remain hard to distinguish.

Addressing the above shortcomings, the speech recognition method of the present invention starts from the principle that a sound wave carries a speech feature varying nonlinearly with time, and naturally derives a feature extraction scheme that represents a sentence of any language by one equal-sized ExP = 12x12 matrix.

[Summary of the Invention]

(1) The most important object of the invention is to recognize any sentence of any language quickly and accurately. Where prior methods compute and compare the feature values of all the words of a sentence, the invention computes and compares only one ExP matrix per sentence, achieving the goal of free conversation with a computer.
(2) To achieve the object of (1), the invention applies a method of normalizing the sound wave of a sentence and extracting its features. A small number E=12 of equal-length elastic frames, without filters and without overlap, freely adjust to the length of the sentence wave so as to cover the whole sentence, and convert every sentence — from one word to many words, long or short — into one ExP = 12x12 LPCC matrix. The series of dynamic characteristics inside a sentence, varying nonlinearly with time, is converted into an equal-sized ExP LPCC matrix, and the feature models of the same sentence carry the same features at the same time positions, so they can be compared at once for real-time recognition by computer.

(3) The invention uses one thousand different databases, so it can recognize a large number of sentences, fast and with greatly improved accuracy. All known sentences are distributed among the thousand databases, each into the database whose voice is closest. When recognizing an unknown sentence, the F databases closest to the unknown sentence's voice are found first, and the unknown sentence is then sought among the known sentences of those F databases. The F closest databases contain few known sentences, so recognition is easy, accurate and fast. Where prior methods compute and compare the feature-value matrices of all the words in a sentence, the invention computes and compares only one ExP matrix per sentence.

(4) The invention uses no samples and no statistical estimation and recognition; it recognizes by mathematical computation of the distances between the LPCC ExP matrices of sentences.

(5) The invention can recognize sentences spoken too fast or too slow. When speech is too fast, the sentence wave is short, and the length of the equal elastic frames shrinks, still covering the short wave with the same number E of equal-length frames and producing E LPCC vectors. When speech is too slow, the sentence wave is long, and the frame length stretches; the same number E of LPCC vectors still represents the long sentence effectively.

(6) The invention provides a correction technique: pronounce the misrecognized sentence clearly once more.

[Embodiments]

The first and second figures illustrate the execution of the invention: the first shows the building of the databases, each containing similar known sentences; the second shows the procedure by which a user recognizes an unknown sentence.

First there are M=1000 different voices (1). Each voice wave is digitized into signal points (10), cleaned of noise and silent periods (20), and normalized by dividing all voiced signal points into E equal time segments, each segment forming one frame. A voice thus has E equal-length frames (30), without filters and without overlap; the E equal frame lengths adjust freely to the length of the wave so as to cover all signal points. The frame is therefore called an equal-length elastic frame: its length stretches freely, but the E elastic frames have the same length, unlike a Hamming window, which has a filter, half overlap and a fixed length that cannot adjust to the wavelength. Since a signal point can be estimated from the preceding signal points, a regression model linear in time closely estimates the nonlinearly varying wave, with the coefficients computed by least squares; each elastic frame yields P=12 LPCC values (40), so the ExP LPCC matrix of each voice represents one database, one thousand databases in all (50). Each known sentence to be recognized is pronounced clearly once, cleaned of silence and noise before and after the sentence and between words and between syllables, converted by the E equal elastic frames into an LPCC ExP matrix (60), and placed by distance into the closest database (70); there are M=1000 databases, each containing similar known sentences (80).

The second figure shows the recognition of an unknown sentence: the user pronounces it clearly (2), the wave is digitized (10), silence and noise are removed (20), the E equal elastic frames extract an ExP LPCC matrix (41), the F closest databases are found by distance (84), and the unknown sentence is found among their known sentences (90).

The invention is described in detail below:

(1) After a voice (sentence) is pronounced clearly (1), its sound wave is converted into a series of digitized signal points (sampled points) (10). Sampled points carrying no speech — the silence and noise before and after the voice (sentence), between words and between syllables — are deleted (20); the remaining points represent all voiced signal points of the voice (sentence). The wave is normalized and its features extracted: all signal points are divided into E equal time segments, each segment forming one frame, so one voice (sentence) has E equal-length elastic frames, without filters and without overlap, stretching freely to cover all signal points (30). In each equal-length elastic frame, the signal points vary nonlinearly with time and are hard to represent by a mathematical model.
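The specification gives no code for the silence-deletion and elastic-framing steps; the following is a minimal Python sketch. The energy threshold, the 80-sample detection chunk, and the function names are illustrative assumptions — the patent says only that non-speech sampled points are deleted and that the E=12 frames stretch to the sentence length, not how silence is detected.

```python
import numpy as np

def remove_silence(signal, chunk=80, threshold=0.01):
    """Drop low-energy stretches (silence/noise) before, after, and
    between words, keeping only voiced sample points.
    The chunk size and energy threshold are illustrative assumptions."""
    keep = []
    for start in range(0, len(signal) - chunk + 1, chunk):
        piece = signal[start:start + chunk]
        if np.mean(piece ** 2) > threshold:   # simple energy test
            keep.append(piece)
    return np.concatenate(keep) if keep else np.array([])

def elastic_frames(signal, E=12):
    """Split the voiced signal into E equal-length, non-overlapping
    'elastic' frames; the frame length adapts to the sentence length,
    so fast (short) and slow (long) utterances both yield E frames."""
    n = len(signal) // E            # frame length stretches/shrinks
    return [signal[i * n:(i + 1) * n] for i in range(E)]
```

With this sketch, a short and a long utterance of the same sentence both produce exactly E=12 frames, which is what lets every sentence map to a fixed 12x12 feature matrix.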
Because J. Markhoul, "Linear Prediction: A tutorial review", Proceedings of the IEEE, Vol. 63, No. 4, 1975, and Li, Tze Fen, "Apparatus and Method for Normalizing and Categorizing Linear Prediction Code Vectors using Bayesian Categorization Technique", U.S.A. Patent No. 5,704,004, Dec. 30, 1997, show that a signal point has a linear relationship with the preceding signal points, a regression model that is linear in time can closely estimate the nonlinearly varying signal points. The signal point S(n) can be estimated from the P preceding signal points; its estimate S'(n) is given by the regression model

    S'(n) = sum_{k=1..P} a_k S(n-k),  n > 0                          (1)

where a_k, k = 1, ..., P, are the unknown regression coefficients to be estimated and P is the number of preceding signal points. The least-squares estimates are computed with Durbin's recursive formulas, given in L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall PTR, Englewood Cliffs, New Jersey, 1993, and in U.S.A. Patent No. 5,704,004; the resulting set of estimates is called the linear predictive coding (LPC) vector. The LPC vector of the signal points in a frame is computed as follows. Let E_1 denote the sum of squared differences between the signal points S(n) and their estimates S'(n):

    E_1 = sum_n [ S(n) - sum_{k=1..P} a_k S(n-k) ]^2                 (2)

The regression coefficients are chosen to minimize E_1. Taking the partial derivative of (2) with respect to each unknown coefficient a_i, i = 1, ..., P, and setting it to zero gives the P normal equations

    sum_{k=1..P} a_k sum_n S(n-k) S(n-i) = sum_n S(n) S(n-i),  1 <= i <= P    (3)

Expanding (2) and substituting (3) gives the minimum total squared error

    E_P = sum_n S^2(n) - sum_{k=1..P} a_k sum_n S(n) S(n-k)          (4)

Equations (3) and (4) are converted into

    sum_{k=1..P} a_k R(i-k) = R(i),  1 <= i <= P                     (5)

    E_P = R(0) - sum_{k=1..P} a_k R(k)                               (6)

where, in (5) and (6), with N denoting the number of signal points in the frame,

    R(i) = sum_{n=0..N-i} S(n) S(n+i),  i >= 0                       (7)

Durbin's recursion computes the LPC vector quickly:

    E_0 = R(0)                                                       (8)

    k_i = [ R(i) - sum_{j=1..i-1} a_j^(i-1) R(i-j) ] / E_{i-1}       (9)

    a_i^(i) = k_i                                                    (10)

    a_j^(i) = a_j^(i-1) - k_i a_{i-j}^(i-1),  1 <= j <= i-1          (11)

    E_i = (1 - k_i^2) E_{i-1}                                        (12)

Computing (8)-(12) recursively for i = 1, ..., P gives the least-squares estimates of the regression coefficients (the LPC vector):

    a_j = a_j^(P),  1 <= j <= P                                      (13)

The LPC vector is then converted into the more stable linear predictive coding cepstrum (LPCC) vector (a'_1, ..., a'_P):

    a'_i = a_i + sum_{j=1..i-1} (j/i) a'_j a_{i-j},  1 <= i <= P     (14)

    a'_i = sum_{j=i-P..i-1} (j/i) a'_j a_{i-j},  P < i               (15)
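The autocorrelation and Durbin steps described above, formulas (7)-(14), can be sketched in Python as follows. This is an illustrative implementation of the standard Levinson-Durbin procedure the description cites, not the patent's own code; only the first P cepstral terms of formula (14) are kept, matching P=12, and formula (15) for terms beyond P is omitted.

```python
import numpy as np

def lpcc(frame, P=12):
    """LPC coefficients of one elastic frame via Durbin's recursion
    (formulas (7)-(13)), then the LPC-to-cepstrum conversion (14)."""
    N = len(frame)
    # autocorrelation R(i), formula (7)
    R = np.array([np.dot(frame[:N - i], frame[i:]) for i in range(P + 1)])
    a = np.zeros(P + 1)       # a[1..i] hold the order-i coefficients
    E = R[0]                  # formula (8)
    for i in range(1, P + 1):
        # formula (9): reflection coefficient k_i
        k = (R[i] - np.dot(a[1:i], R[i - 1:0:-1])) / E
        a_new = a.copy()
        a_new[i] = k                         # formula (10)
        for j in range(1, i):
            a_new[j] = a[j] - k * a[i - j]   # formula (11)
        a, E = a_new, (1 - k * k) * E        # formula (12)
    # formula (14): cepstral recursion, first P terms only
    c = np.zeros(P + 1)
    for i in range(1, P + 1):
        c[i] = a[i] + sum((j / i) * c[j] * a[i - j] for j in range(1, i))
    return c[1:]
```

Applied to each of the E=12 elastic frames, this yields the E rows of the 12x12 LPCC matrix that represents one sentence.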

Each elastic frame thus produces one linear predictive coding cepstrum (LPCC) vector (a'_1, ..., a'_P) (40). According to the speech recognition method of the invention, P=12 is used, because the later LPCC values are almost zero. A voice (sentence) is represented by E LPCC vectors, i.e. by one matrix of ExP = 12x12 LPCC values (50).

(2) After each known sentence to be recognized is pronounced, all silence and noise before and after the sentence and between words and between syllables are deleted, and formulas (8)-(15) convert the known sentence into an LPCC ExP matrix (60). The distance, or weighted distance, between the known sentence's LPCC ExP matrix and the LPCC ExP matrices of all M=1000 different voices finds the closest database, into which the known sentence's LPCC ExP matrix is placed (70). There are M=1000 databases, each containing similar known sentences (80).

(3) To recognize an unknown sentence, the user first pronounces the unknown sentence clearly (2). The wave of the unknown sentence is digitized into signal points (10), and silence and noise are removed (20): before and after the unknown sentence, between words and between syllables, all silence and noise are deleted. E equal-length elastic frames normalize the wave and extract the features: all voiced signal points of the unknown sentence are divided into E equal time segments, each segment forming an elastic frame (30); the E equal-length elastic frames, without filters and without overlap, stretch freely to cover all signal points. In each frame, since a signal point can be estimated from the preceding signal points, the least-squares method computes the estimates of the unknown regression coefficients; the P=12 estimates of each frame form a linear predictive coding (LPC) vector, which formula (14) converts into the more stable LPCC vector, so that the unknown sentence is represented by one ExP LPCC matrix (41).
The invention then computes the distance or weighted distance between the unknown sentence's LPCC ExP matrix and the LPCC ExP matrices of the M=1000 databases to find the F closest databases, i.e. the F databases whose distances to the unknown sentence's matrix are smallest (84). From the distances or weighted distances between the unknown sentence's LPCC ExP matrix and the LPCC ExP matrices of the known sentences in those F closest databases, the invention finds the unknown sentence the user wants (90).

(4) To verify that the invention quickly and accurately recognizes any sentence of any language and allows free conversation with a computer, the inventors used 1000 English word voices to represent 1000 different databases and pronounced 928 sentences (80 English sentences, 284 Chinese sentences, 3 Taiwanese sentences, 2 Japanese sentences, 160 English words, 398 Chinese words, 1 German word). After testing, the sentences and English words all ranked first; where prior methods must compute and compare the feature values of all the words of a sentence, the invention computes and compares only one ExP matrix. The Chinese words ranked in the top two — there are too many homophones — with recognition in under one second. The inventors also pronounced 7200 English words, which ranked in the top five with recognition in under two seconds, and 4400 Chinese words, which ranked in the top twenty with recognition in under two seconds. The 4400 Chinese words serve as Chinese speech-input software, with which this specification was input.

(5) Figures 6 and 7 show fragments of this specification input with the software (Visual Basic); Figures 3 to 5 show the invention recognizing Chinese and English sentences.

[Brief Description of the Drawings]

The first and second figures illustrate the execution procedure of the invention.
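The two-stage distance search — assigning known sentences to databases per step (2), then finding the F closest databases and the nearest known sentence per step (3) — can be sketched in Python. This is a minimal illustration assuming a plain squared Euclidean distance between 12x12 matrices; the weighted distance the patent also allows is not shown, and all names are assumptions.

```python
import numpy as np

def build_databases(voice_matrices, known_sentences):
    """Assign each known sentence's 12x12 LPCC matrix to the closest
    of the M database (voice) matrices."""
    dbs = [[] for _ in voice_matrices]
    for mat in known_sentences:
        d = [np.sum((mat - v) ** 2) for v in voice_matrices]
        dbs[int(np.argmin(d))].append(mat)
    return dbs

def classify(unknown, voice_matrices, dbs, F=3):
    """Find the F databases closest to the unknown sentence's matrix,
    then the nearest known sentence inside those F databases."""
    d = np.array([np.sum((unknown - v) ** 2) for v in voice_matrices])
    candidates = [m for i in np.argsort(d)[:F] for m in dbs[i]]
    if not candidates:
        return None
    dists = [np.sum((unknown - m) ** 2) for m in candidates]
    return candidates[int(np.argmin(dists))]
```

Because only the known sentences in the F closest databases are compared, the final search touches a small candidate set rather than all known sentences, which is the source of the speed claim.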
The first figure shows the establishment of M=1〇〇0 [S] 201216253 different databases, each containing similar known sentences. The second figure shows the process of identifying unknown sentences. The first is to use the yis (10) software to input the segment of the present invention and to identify Chinese and English sentences. [Main component symbol description] (1) First M=1000 different sounds Lu (1〇) sound wave digitization (20) Remove noise and mute Time Segment (30) E equal length elastic frames normalized all sound waves (40) every fresh long _, Saki small flat wire glaring linear prediction coding cepstrum (50) - a linear prediction of the sound encoding cepstrum Εχ ρ matrix Represents a database with a total of one thousand databases. • (60) Clearly pronounces known sentences - times, removes mutes and noises, converts them into linear predictive coding cepstrum (Lpcc) ΕχΡ matrix (70) with distance The known sentence linear predictive coding cepstrum (LpCc) matrix is divided into the closest database (80) with Μ=1000 databases, each containing similar known sentences (2) Forbearance of unknown sentences is clearly pronounced (41) Within each equal-length elastic frame, ρ linear predictive coding cepstrums are calculated by the least square method, and an unknown sentence is coded by linear predictive cepstrum ε 201216253 χΡ matrix representation (84) distance Find the unknown sentences in the F closest database with the closest known database (90) in the database of 1000=1000 databases, and find the unknown sentences to be identified by distance.
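The normalization and matching steps above can be sketched in a few lines. This is an illustrative reconstruction, not the patented implementation: NumPy, the function names, and the squared-Euclidean distance are assumptions, and the conversion of each frame into P cepstra is omitted here.

```python
import numpy as np

def equal_frames(signal, E=12):
    """Divide the voiced sampled points into E equal, non-overlapping
    'elastic' frames that together cover the whole waveform
    (no filter, no fixed-length Hamming window)."""
    edges = np.linspace(0, len(signal), E + 1, dtype=int)
    return [signal[edges[i]:edges[i + 1]] for i in range(E)]

def closest_databases(unknown, databases, F=3):
    """Rank the databases by the distance between E*P LPCC matrices.
    `unknown` is an (E, P) matrix; `databases` maps a label to the
    representative (E, P) matrix.  Returns the F closest labels."""
    dists = {label: float(np.sum((unknown - mat) ** 2))
             for label, mat in databases.items()}
    return sorted(dists, key=dists.get)[:F]
```

Because every utterance, long or short, is reduced to the same fixed-size E×P matrix, recognition needs only one matrix comparison per database rather than a frame-by-frame alignment of a variable-length feature sequence.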

Claims (1)

VII. Scope of the patent application:
1. A method for recognizing sentences in all languages, the steps comprising:
(1) a sentence may be a syllable, word, name or sentence in any language; start with M=1000 different voices;
(2) a pre-processor deletes, before and after the sentence (voice), between two words and between two syllables, all silence and noise sampled points carrying no speech;
(3) a method of normalizing a voice or sentence waveform and extracting features: E equal elastic frames, without filter and without overlap, normalize the waveform of a voice or sentence and convert it into an equal-sized linear predict coding cepstrum (LPCC) E×P matrix;
(4) the LPCC E×P matrices of the M=1000 different voices represent M=1000 different databases;
(5) the user pronounces a known sentence clearly once; delete, before and after the sentence and between two words and two syllables, all silence and noise sampled points carrying no speech; E equal elastic frames normalize the voiced waveform of the known sentence and convert it into an equal-sized LPCC E×P matrix;
(6) use the distance or weighted distance between the known sentence's LPCC E×P matrix and the LPCC E×P matrices of all M=1000 different voices to find the closest database, and assign the known sentence's LPCC E×P matrix to that database; in the same way, the LPCC E×P matrix of every known sentence of any language to be recognized is assigned, by distance or weighted distance, to the database whose representative LPCC E×P matrix is nearest, so that similar known sentences are placed in the same database;
(7) to recognize an unknown sentence, after the user pronounces the desired sentence, the invention likewise uses the distance or weighted distance between the unknown sentence's LPCC E×P matrix and the LPCC E×P matrices of all M=1000 different voices to find the F closest databases, and then uses the distance or weighted distance between the unknown sentence's LPCC E×P matrix and the LPCC E×P matrices of the similar known sentences in the F closest databases to find the unknown sentence the user wants;
(8) if recognition is unsuccessful, the user pronounces the sentence once more; E equal elastic frames convert the sentence into an LPCC E×P matrix, the distance assigns that matrix to the closest database, and recognizing the sentence again will succeed.
2. The method for recognizing sentences in all languages according to claim 1, wherein step (3) comprises using E equal elastic frames, of equal length, without filter and without overlap, to normalize the waveform of a voice or sentence and extract a feature matrix of uniform size, by the following steps:
(a) delete, before and after the sentence (voice) and between two words and two syllables, all silence and noise sampled points carrying no speech, and then divide all voiced sampled points of the sentence (voice) equally: in order to closely estimate the nonlinearly varying waveform with a linearly varying regression model, divide the full voiced waveform into E=12 equal segments, each segment forming an elastic frame, so that a sentence (voice) has E equal-length elastic frames, without filter, non-overlapping, freely stretching to cover the full waveform, not fixed-length Hamming windows;
(b) within each equal-length elastic frame, estimate the waveform, which varies nonlinearly with time, by a regression model that varies linearly with time;
(c) use Durbin's recursion
$R(i) = \sum_{n=1}^{N-i} S(n)S(n+i),\quad i \ge 0$
$E_0 = R(0)$
$k_i = \Big[R(i) - \sum_{j=1}^{i-1} a_j^{(i-1)} R(i-j)\Big] \big/ E_{i-1}$
$a_i^{(i)} = k_i$
$a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)},\quad 1 \le j \le i-1$
$E_i = (1 - k_i^2)\, E_{i-1}$
$a_j = a_j^{(P)},\quad 1 \le j \le P$
to compute the least-squares estimates of the P=12 regression coefficients, called the linear predict coding (LPC) vector, and then use
$\hat{a}_i = a_i + \sum_{j=1}^{i-1} \tfrac{j}{i}\, \hat{a}_j\, a_{i-j},\quad 1 \le i \le P$
$\hat{a}_i = \sum_{j=i-P}^{i-1} \tfrac{j}{i}\, \hat{a}_j\, a_{i-j},\quad P < i$
to convert the linear predict coding (LPC) vector into a stable linear predict coding cepstrum (LPCC) vector $\hat{a}_i$, $1 \le i \le P$;
(d) represent a sentence (or a voice) by E linear predict coding cepstrum (LPCC) vectors (a linear predict coding cepstrum (LPCC) E×P matrix).
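Steps (c) and (d) above, Durbin's recursion followed by the LPC-to-cepstrum conversion, can be sketched as follows. This is a hedged reading of the recited formulas, not the patentee's code: NumPy and the function names are assumptions, and the cepstrum step uses the standard LPC-cepstrum recursion.

```python
import numpy as np

def durbin(r, p):
    """Durbin's recursion: solve for the p LPC (regression) coefficients
    a_1..a_p from the autocorrelation sequence r[0..p]."""
    a = np.zeros(p + 1)        # a[0] unused; a[j] holds a_j
    e = r[0]                   # E_0 = R(0)
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / e    # k_i
        prev = a.copy()
        a[i] = k                                          # a_i^(i) = k_i
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]              # a_j^(i)
        e *= (1.0 - k * k)                                # E_i
    return a[1:]               # a_1 .. a_p

def lpc_to_lpcc(a, P):
    """Convert LPC coefficients a_1..a_p into P cepstral coefficients
    with the recursion  c_i = a_i + sum_{j<i} (j/i) c_j a_{i-j}."""
    p = len(a)
    c = np.zeros(P)
    for i in range(1, P + 1):
        acc = a[i - 1] if i <= p else 0.0
        for j in range(max(1, i - p), i):
            acc += (j / i) * c[j - 1] * a[i - j - 1]
        c[i - 1] = acc
    return c
```

Applying `lpc_to_lpcc` to each of the E=12 frames and stacking the results yields the 12×12 LPCC matrix that represents a sentence; the cepstra are obtained directly from the least-squares coefficients, so no Fourier transform is needed.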
TW099134580A 2010-10-11 2010-10-11 A speech recognition method on sentences in all languages TWI460718B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW099134580A TWI460718B (en) 2010-10-11 2010-10-11 A speech recognition method on sentences in all languages

Publications (2)

Publication Number Publication Date
TW201216253A true TW201216253A (en) 2012-04-16
TWI460718B TWI460718B (en) 2014-11-11

Family

ID=46787165


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100827097B1 (en) * 2004-04-22 2008-05-02 삼성전자주식회사 Method for determining variable length of frame for preprocessing of a speech signal and method and apparatus for preprocessing a speech signal using the same
TWI370441B (en) * 2008-02-19 2012-08-11 Tze Fen Li A speech recognition method for both english and chinese

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI610294B (en) * 2016-12-13 2018-01-01 財團法人工業技術研究院 Speech recognition system and method thereof, vocabulary establishing method and computer program product
US10224023B2 (en) 2016-12-13 2019-03-05 Industrial Technology Research Institute Speech recognition system and method thereof, vocabulary establishing method and computer program product

Also Published As

Publication number Publication date
TWI460718B (en) 2014-11-11


Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees