TWI299854B

TWI299854B - Lexicon database implementation method for audio recognition system and search/match method thereof

Info

Publication number: TWI299854B
Application number: TW95137548A
Authority: TW
Inventors: Chung Po Liao
Original assignee: Inventec Besta Co Ltd
Priority date: 2006-10-12
Filing date: 2006-10-12
Publication date: 2008-08-11
Also published as: TW200818117A

Description

Representative drawing: Fig. 1.
Reference symbols of the representative drawing: S11~S16: process steps.

[Technical Field]

The present invention provides a lexicon database building method for a speech recognition system and a search/match method thereof, and in particular a lexicon database building method that supports polyphonic characters (破音字, characters with more than one reading) together with a more efficient search/match method.

[Prior Art]

Conventional speech recognition systems have no handling for polyphonic characters. As a result, the user must speak one particular reading of such a character for recognition to succeed. For example, the character 「行」 in the personal name 陳力行 must be pronounced "hang" (ㄏㄤ); if the user says "xing" (ㄒㄧㄥˊ), the name is not recognized. Likewise, the character 「樂」 in 樂團 ("band") must be pronounced "le" (ㄌㄜˋ); if the user says "yue" (ㄩㄝˋ), recognition also fails. This way of speaking differs greatly from how users ordinarily pronounce these words.

In addition, a speech recognition system usually uses the Viterbi algorithm to compute the probability of the acoustic model corresponding to each character of a word. This computation is where the system spends the largest share of its processing, so repeatedly computing identical characters adds unnecessary load and slows the system down. This prompted the inventor to consider how to avoid recomputing identical characters and thereby reduce the overall amount of computation.

Based on years of research and practical experience, and after extensive study and design, the inventor proposes a lexicon database building method for a speech recognition system and a search/match method thereof as a way of realizing the above aim.

[Summary of the Invention]

In view of the above, the object of the invention is to provide a lexicon database building method for a speech recognition system and a search/match method thereof, in particular a building method that supports polyphonic characters and a more efficient search/match method. The lexicon database building method comprises the following steps:

(a) providing polyphonic-character data;
(b) inputting a word;
(c) comparing the word against the polyphonic-character data to determine whether the word contains at least one polyphonic character; if so, building a plurality of acoustic models corresponding respectively to the plurality of readings of the polyphonic character contained in the word; if not, building a single corresponding acoustic model for the word; and
(d) storing the word and its corresponding acoustic model or models in the lexicon database.

As described above, the building method and the search/match method of the invention produce a lexicon database that handles polyphonic characters, making the speech recognition system more user-friendly and closer to the pronunciation habits of ordinary users. In addition, the search/match method of the invention avoids repeated computation by the system.

To give the examiners a better understanding of the technical features and effects of the invention, preferred embodiments and the related drawings are described in detail below.

[Embodiments]

The lexicon database building method and the search/match method according to preferred embodiments of the invention are described below with reference to the drawings, in which identical elements carry identical reference symbols.

The speech recognition system of the invention performs recognition with hidden Markov models (HMMs). An HMM describes pronunciation with a probabilistic model, treating the production of a short segment of speech as a sequence of state transitions in a Markov model. The speech features used in recognition are Mel-frequency cepstral coefficients (MFCCs). These not only account for the human ear's sensitivity to different frequencies but also separate the vocal-tract model from the excitation signal, so that recognition is unaffected by the speaker's loudness or by the five tones of Mandarin (first, second, third and fourth tones and the neutral tone).

Given these properties, the polyphonic characters handled by the recognition system are selected from 245 […] Chinese polyphonic characters. Because the features used in recognition are MFCCs, characters whose readings differ only in tone are not included among the polyphonic characters to be handled. For example, a character with two readings that differ only in tone is discarded. What remains is the polyphonic-character data, which includes roughly the following characters: 行, 仔, 樂, 和, 重, …, 沒, 校, 從, 都, 落, 朝, 傳, 單, 彷, …, 強, 調, 參, 黏, 省, 塞, 差, 蓋, 傍, 般, …, 暴, 熟, 模, 給, 薄, 告, 嚇, 藏, 還, 翟, 識, 騎, 繫, 覺, 露, 屬, 攪, and so on.

Refer to Fig. 1, a flowchart of the steps of the lexicon database building method for a speech recognition system according to the invention. The steps are as follows:

Step S11: provide polyphonic-character data;
Step S12: input a word;
next, compare the word against the polyphonic-character data to determine whether the word contains at least one polyphonic character; if so, build a plurality of acoustic models corresponding respectively to the plurality of readings of the polyphonic character contained in the word; if not, build a single corresponding acoustic model for the word; and
finally, store the word and the acoustic models in the lexicon database.

The polyphonic-character data comprises a plurality of polyphonic characters and their readings, and the acoustic model is a hidden Markov model.

Refer to Fig. 2, a flowchart of the steps of a preferred embodiment of the lexicon database building method, which builds a lexicon database of singer names. The steps are as follows:

Step S21: read in a singer name;
Step S22: compare the name against the polyphonic-character data to determine whether it contains at least one polyphonic character; if so, go to step S23; if not, go to step S24;
Step S23: add a set of names in which the polyphonic character is replaced by its alternative readings;
Step S24: convert each character of the name into its hidden Markov model representation;
Step S25: check whether the last singer name has been read; and
Step S26: end initialization and enter the recognition flow.

A lexicon database built in this way handles polyphonic characters, so the user can speak with ordinary, customary pronunciation and still obtain the correct recognition result.

Furthermore, in speech recognition each Chinese character can be decomposed into an initial and a final: the initial appears at the front of the syllable and the final at the end. Each Chinese character can therefore be represented by two acoustic models, one for the initial and one for the final, and the decision is made from the probability values of these acoustic models. If the words in the lexicon database are sorted so that words with the same leading characters are adjacent, and the probability values of the previous word's identically pronounced characters are recorded, then only the probability values of the characters in the current word that are pronounced differently from those of the previous word need to be computed. The probability values of the identically pronounced characters are not recomputed, which reduces the computation required for search and matching.

Refer to Fig. 3, a flowchart of the steps of the lexicon database search/match method for a speech recognition system according to the invention. The steps are as follows:

Step S31: provide a lexicon database containing a plurality of words, the words being sorted so that words with the same leading characters are adjacent, and the words corresponding one-to-one to a plurality of acoustic models;
Step S32: input a speech signal;
Step S33: extract a feature parameter of the speech signal;
Step S34: compare the feature parameter one by one with the acoustic models of the words, each acoustic model producing a probability value for the feature parameter, wherein each word inherits the probability values produced by the identically pronounced characters of the preceding adjacent word; and
Step S35: recognize the speech signal from the probability values of the words.

The acoustic model is a hidden Markov model, and the feature parameter is a Mel-frequency cepstral coefficient (MFCC).
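The lexicon-building steps (a) to (d), and the singer-name variant in steps S21 to S26, can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the patent's implementation: the tables `POLYPHONES` and `DEFAULT_READING`, the toneless romanized readings, and the function names are all hypothetical, and a tuple of syllables stands in for the sequence of hidden Markov models that a real system would build.

```python
from itertools import product

# Hypothetical polyphonic-character data: character -> toneless readings.
# Readings that differ only in tone are merged, since MFCC features ignore tone.
POLYPHONES = {
    "行": ["hang", "xing"],
    "樂": ["le", "yue"],
}

# Hypothetical single readings for the non-polyphonic characters used below.
DEFAULT_READING = {"陳": "chen", "力": "li", "宏": "hong", "團": "tuan"}


def pronunciations(word):
    """Return one syllable tuple per combination of readings in `word`."""
    options = [POLYPHONES.get(ch) or [DEFAULT_READING[ch]] for ch in word]
    return [tuple(combo) for combo in product(*options)]


def build_lexicon(words):
    """Map each word to its pronunciation variants (stand-ins for HMMs).

    A word without polyphonic characters gets exactly one variant;
    a word with polyphonic characters gets one variant per reading.
    """
    return {word: pronunciations(word) for word in words}


lexicon = build_lexicon(["陳力行", "陳力宏", "樂團"])
for word, variants in lexicon.items():
    print(word, variants)
```

With this structure, a name such as 陳力行 is stored under both readings of 行, so either spoken form can match the same entry.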

The probability values are computed with the Viterbi algorithm.

Taking the singer-name lexicon as an example, suppose there are 692 singer names containing 2233 characters in total. When the Viterbi algorithm computes the probabilities, each phonetic segment undergoes […] searches, many of which are repeated computation. The invention therefore sorts the singer names so that singers with the same surname are adjacent, and records the probabilities of the previous name's identically pronounced characters. When the next name is processed, only the probabilities of the differently pronounced characters need to be computed.

Refer to Fig. 4, a flowchart of the steps of a preferred embodiment of the lexicon database search/match method. The steps are as follows:

Step S41: input the Mel-frequency cepstral coefficients of the speech;
Step S42: read in a singer-name model;
Step S43: determine whether the pronunciation of the current singer name repeats that of the previous singer name; if so, go to step S44; if not, go to step S45;
Step S44: for the identically pronounced characters, substitute the probabilities recorded for the previous name, then continue to the next step starting from the differently pronounced characters;
Step S45: compute the probabilities with the Viterbi algorithm;
Step S46: store the probability of each character of the current singer name;
Step S47: check whether the probabilities of all singer names have been computed; if so, go to step S48; if not, repeat from step S42; and
Step S48: list the five singer names with the highest probabilities.
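The reuse of scores between adjacent names in steps S42 to S48 can be sketched as follows. This is an illustrative simplification under stated assumptions: `score_syllable` is a hypothetical per-position scoring callback standing in for the Viterbi computation on MFCC features, log-probabilities are simply summed, and the lexicon is sorted by syllable sequence so that shared leading pronunciations become adjacent. None of these names come from the source.

```python
def common_prefix_len(a, b):
    """Number of leading syllables shared by sequences a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def rank_names(entries, score_syllable, top_n=5):
    """Rank (name, syllables) entries, reusing scores of shared prefixes.

    score_syllable(pos, syllable) -> log-probability (hypothetical stand-in
    for scoring one character's acoustic model with the Viterbi algorithm).
    """
    entries = sorted(entries, key=lambda e: e[1])  # shared prefixes adjacent
    prev_syls, prev_scores = (), []
    results = []
    for name, syls in entries:                     # S42: read next name model
        k = common_prefix_len(prev_syls, syls)     # S43: overlap with previous
        scores = prev_scores[:k]                   # S44: reuse stored scores
        for pos in range(k, len(syls)):            # S45: score the remainder
            scores.append(score_syllable(pos, syls[pos]))
        prev_syls, prev_scores = syls, scores      # S46: store per-character scores
        results.append((sum(scores), name))
    results.sort(reverse=True)                     # S48: highest totals first
    return [name for _, name in results[:top_n]]


# Toy scorer that records every call so the saving can be observed.
calls = []

def toy_score(pos, syl):
    calls.append((pos, syl))
    return -float(len(syl))  # arbitrary value, for demonstration only

entries = [
    ("陳力行", ("chen", "li", "xing")),
    ("陳力宏", ("chen", "li", "hong")),
    ("王菲", ("wang", "fei")),
]
top = rank_names(entries, toy_score)
print(top, len(calls))
```

In this toy run the eight syllables of the three names need only six scoring calls, because "chen" and "li" are scored once and reused for the second name.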

Take the singer name 陳力行 as an example. It is adjacent to the singer name 陳力宏, and the first two characters of the two names are pronounced identically. During the Viterbi computation, the input speech is therefore first scored against the acoustic models representing 陳力行, and the results are stored. When the input speech is then scored against 陳力宏, only the probability value of the acoustic model for 宏 needs to be computed to obtain the complete probability of 陳力宏.

The above is illustrative only and not restrictive. Any equivalent modification or change that does not depart from the spirit and scope of the invention shall be included in the appended claims.

[Brief Description of the Drawings]

Fig. 1 is a flowchart of the steps of the lexicon database building method for a speech recognition system according to the invention;
Fig. 2 is a flowchart of the steps of a preferred embodiment of the lexicon database building method for a speech recognition system according to the invention;
Fig. 3 is a flowchart of the steps of the lexicon database search/match method for a speech recognition system according to the invention; and
Fig. 4 is a flowchart of the steps of a preferred embodiment of the lexicon database search/match method for a speech recognition system according to the invention.

[Description of Main Reference Symbols]

S11~S16: process steps;
S21~S26: process steps;
S31~S35: process steps; and
S41~S48: process steps.


Claims

1. A lexicon database building method for a speech recognition system, comprising at least: providing polyphonic-character data; inputting a word; comparing the word against the polyphonic-character data to determine whether the word contains at least one polyphonic character and, if so, building a plurality of acoustic models corresponding respectively to the plurality of readings of the polyphonic character contained in the word and, if not, building a single corresponding acoustic model for the word; and storing the word and the acoustic models in the lexicon database.

2. The lexicon database building method for a speech recognition system as claimed in claim 1, wherein the polyphonic-character data comprises a plurality of polyphonic characters and their readings.

3. The lexicon database building method for a speech recognition system as claimed in claim […], wherein the acoustic model is a hidden Markov model.

4. A lexicon database search/match method for a speech recognition system, comprising at least: providing a lexicon database containing a plurality of words, the words being sorted so that words with the same leading characters are adjacent, and the words corresponding one-to-one to a plurality of acoustic models; inputting a speech signal; extracting a feature parameter of the speech signal; comparing the feature parameter one by one with the acoustic models of the words, each acoustic model producing a probability value for the feature parameter, wherein each word inherits the probability values produced by the identically pronounced characters of the preceding adjacent word; and recognizing the speech signal from the probability values of the words.

5. The lexicon database search/match method for a speech recognition system as claimed in claim […], wherein the acoustic model is a hidden Markov model.

6. The lexicon database search/match method for a speech recognition system as claimed in claim 4, wherein the feature parameter is a Mel-frequency cepstral coefficient (MFCC).

7. The lexicon database search/match method for a speech recognition system as claimed in claim 4, wherein the probability values are computed using a Viterbi algorithm.
