201135716

VI. Description of the Invention:

[Technical Field of the Invention]

The present disclosure relates to signal processing techniques, and more particularly to audio feature processing methods.

[Prior Art]

Speech recognition provides a natural and convenient human-machine interface: it can be used for data entry, or to operate equipment so that the equipment becomes more user-friendly and easier to use. Speech recognition technology can also be applied in computer-assisted language learning systems, giving users immediate feedback and improving learning efficiency.

However, frequent recognition errors cause great inconvenience in use. Much effort has been devoted in the related fields to reducing the error rate of speech recognition, yet no fully satisfactory solution has been developed. How to raise the recognition rate more effectively therefore remains one of the important current research topics and an urgent goal for improvement in the related fields.

SUMMARY OF THE INVENTION

Accordingly, one aspect of the present disclosure provides an audio feature processing method and an audio feature processing apparatus for improving the recognition rate.

According to an embodiment of the present disclosure, the audio feature processing method comprises the following steps:

(a) extracting a plurality of static feature vectors from a plurality of sound frames, respectively;

(b) selecting at least one set of static features from the static feature vectors to compute, for at least one time point, at least one dynamic feature vector of left temporal variation and at least one dynamic feature vector of right temporal variation; and

(c) using the static feature vectors, the at least one dynamic feature vector of left temporal variation and the at least one dynamic feature vector of right temporal variation to search, among a plurality of models pre-stored in a database, for the model or model sequence that best matches these feature vectors.

Each static feature vector may be a feature parameter such as Mel-frequency cepstral coefficients, linear predictive coefficients, formants, pitch or a similar feature. In step (b), at least one dynamic feature vector of central temporal variation for the at least one time point may also be computed from the static feature vectors; step (c) may then use the static feature vectors together with the central, left and right dynamic feature vectors to search the pre-stored models for the best-matching model or model sequence.

In step (c), the static feature vectors and the central, left and right dynamic feature vectors may be combined to produce a combined feature vector sequence, and the model or model sequence that best matches the combined feature vector sequence is searched for among the models pre-stored in the database.

Step (b) may comprise the following sub-steps:

(α1) selecting, from the static feature vectors, the static feature vector corresponding to the at least one time point and several static feature vectors before and after that time point as at least one first set of static feature vectors, and taking the rate of change of the first set with respect to time to produce at least one dynamic feature vector of central temporal variation, wherein in the rate-of-change computation the weights of the static feature vectors before the time point are symmetric to the weights of those after it;

(α2) selecting at least one second set of static feature vectors and taking its rate of change with respect to time to produce at least one dynamic feature vector of left temporal variation, wherein in the second set the total weight of the static feature vectors before the time point is greater than the total weight of those after it; and

(α3) selecting at least one third set of static feature vectors and taking its rate of change with respect to time to produce at least one dynamic feature vector of right temporal variation, wherein in the third set the total weight of the static feature vectors before the time point is smaller than the total weight of those after it.

Alternatively, step (b) may comprise the following sub-steps:

(β1) selecting the static feature vector corresponding to the at least one time point and several static feature vectors before and after it as at least one first set of static feature vectors, and taking its rate of change with respect to time to produce at least one dynamic feature vector of central temporal variation, wherein the number of static feature vectors in the first set before the time point equals the number after it;

(β2) selecting at least one second set of static feature vectors and taking its rate of change with respect to time to produce at least one dynamic feature vector of left temporal variation, wherein the number of static feature vectors in the second set before the time point is greater than the number after it; and

(β3) selecting at least one third set of static feature vectors and taking its rate of change with respect to time to produce at least one dynamic feature vector of right temporal variation, wherein the number of static feature vectors in the third set before the time point is smaller than the number after it.

In sub-step (β2), the second set of static feature vectors may include the static feature vector corresponding to the at least one time point, or may lie entirely on one side of it; the same holds for the third set in sub-step (β3). Further, in step (c), the models pre-stored in the database may be, for example, high-order hidden Markov models or exemplar-based models.

According to another embodiment of the present disclosure, an audio feature processing apparatus comprises a capture unit, a computing unit and a comparison unit. The capture unit extracts a plurality of static feature vectors from a plurality of sound frames, respectively. The computing unit selects at least one set of static features from the static feature vectors to compute, for at least one time point, at least one dynamic feature vector of left temporal variation and at least one dynamic feature vector of right temporal variation. The comparison unit uses the static feature vectors and the left and right dynamic feature vectors to search, among a plurality of models pre-stored in a database, for the model or model sequence that best matches these feature vectors.

Each static feature vector extracted by the capture unit may be Mel-frequency cepstral coefficients, linear predictive coefficients, formants, pitch or a similar feature. The computing unit may further compute, from the static feature vectors, at least one dynamic feature vector of central temporal variation for the at least one time point, and the comparison unit may then use the static feature vectors together with the central, left and right dynamic feature vectors to search the database for the best-matching model or model sequence. Moreover, the comparison unit may combine the static feature vectors with the central, left and right dynamic feature vectors to produce a combined feature vector sequence, and search the database for the model or model sequence that best matches the combined sequence.

The computing unit may comprise a first computing unit, a second computing unit and a third computing unit. The first computing unit selects, from the static feature vectors, the static feature vector corresponding to the at least one time point and several static feature vectors before and after it as at least one first set, and takes its rate of change with respect to time to produce at least one dynamic feature vector of central temporal variation, wherein the weights of the static feature vectors before the time point are symmetric to the weights of those after it. The second computing unit selects at least one second set of static feature vectors and takes its rate of change with respect to time to produce at least one dynamic feature vector of left temporal variation, wherein the total weight of the static feature vectors before the time point is greater than the total weight of those after it. The third computing unit selects at least one third set of static feature vectors and takes its rate of change with respect to time to produce at least one dynamic feature vector of right temporal variation, wherein the total weight of the static feature vectors before the time point is smaller than the total weight of those after it.

Alternatively, the first computing unit produces the central dynamic feature vector from a first set in which the number of static feature vectors before the at least one time point equals the number after it; the second computing unit produces the left dynamic feature vector from a second set in which the number before the time point is greater than the number after it; and the third computing unit likewise produces at least one dynamic feature vector of right temporal variation from at least one third set of static feature vectors.
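Although the claims above are prose, steps (a) through (c) are concrete enough to sketch in code. The following is a purely illustrative sketch, not the patented implementation: the window offsets, the unit weights and the nearest-mean `best_model` stand-in (in place of the hidden-Markov-model scoring described later in the specification) are all assumptions.

```python
def delta(frames, t, offsets):
    # Unit-weight regression slope of each feature dimension over the
    # frames t+k for k in `offsets` (window sizes here are illustrative).
    den = sum(k * k for k in offsets)
    return [sum(k * (frames[t + k][d] - frames[t][d]) for k in offsets) / den
            for d in range(len(frames[0]))]

def features(frames):
    # Steps (b)-(c): append central, left and right time-derivative
    # vectors to each interior static feature vector.
    windows = ([-1, 1], [-2, -1], [1, 2])  # central / left / right
    out = []
    for t in range(2, len(frames) - 2):
        v = list(frames[t])
        for w in windows:
            v += delta(frames, t, w)
        out.append(v)
    return out

def best_model(feature_seq, models):
    # Step (c) sketch: choose the stored model whose mean vector is
    # closest overall; a crude stand-in for HMM likelihood scoring.
    def cost(mean):
        return sum(sum((a - b) ** 2 for a, b in zip(v, mean))
                   for v in feature_seq)
    return min(models, key=lambda name: cost(models[name]))
```

For a linearly rising one-dimensional signal, all three derivative components equal the common slope, and a "rising" model mean is preferred over a "flat" one.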
In this alternative, the number of static feature vectors in the third set before the at least one time point is smaller than the number after the at least one time point.

Moreover, the second set of static feature vectors selected by the second computing unit may include the static feature vector corresponding to the at least one time point, or may lie entirely on one side of it; the same holds for the third set selected by the third computing unit. The models pre-stored in the database may be, for example, high-order hidden Markov models or exemplar-based models.

Compared with the prior art, the technical solution of the present disclosure has evident advantages and beneficial effects. The above technical solution represents a considerable technical advance and has broad industrial applicability, with at least the following advantages:

1. Static feature vectors are processed together with dynamic feature vectors of central temporal variation and dynamic feature vectors of left and right temporal variation to form a representation that expresses speech characteristics more precisely, so that a better recognition rate is obtained; and

2. The present invention has a wide range of applications and is not limited to speech recognition; it also covers the use of other signal processing systems, for example music identification, speaker recognition, formant tracking, pitch tracking, tone recognition and statistical speech synthesis.

The above description is explained in detail in the following embodiments, which provide a further explanation of the technical solution of the present disclosure.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In order to make the description of the present disclosure more complete, reference is made to the accompanying drawings and the various embodiments described below, in which the same reference numerals denote the same or similar elements. On the other hand, well-known elements and steps are not described in the embodiments, so as to avoid limiting the invention unnecessarily.

FIG. 1 is a block diagram of a speech recognition system 100 in accordance with an embodiment of the present disclosure. The speech recognition system 100 comprises a microphone 110, an analog-to-digital converter 120, a frame segmentation module 130, an endpoint detection module 140, a feature extraction subsystem 150, a pattern matching subsystem 160 and a database 170.

In use, the microphone 110 converts sound waves into an analog signal, and the analog-to-digital converter 120 converts the analog signal into digital speech. The frame segmentation module 130 divides the digital speech into short segments of signal, each of which is called a sound frame. The endpoint detection module 140 locates the start and end points of the speech, and the feature extraction subsystem 150 converts each speech frame into a feature vector representing its characteristics. The database 170 pre-stores speech patterns, and the pattern matching subsystem 160 searches the database 170 for the word model sequence closest to the input feature vector sequence, which is taken as the recognition result.

To describe the above "feature vectors" more concretely, reference is made to FIG. 2, which is a block diagram of an audio feature processing apparatus 200 in accordance with an embodiment of the present disclosure. The audio feature processing apparatus 200 is applicable to the feature extraction subsystem 150 and the pattern matching subsystem 160 described above, and can also be applied broadly in related technical settings, for example speaker recognition systems and statistical speech synthesis systems.

As shown in FIG. 2, the audio feature processing apparatus 200 comprises a capture unit 210, a computing unit 220 and a comparison unit 260, and the computing unit 220 comprises a first computing unit 221, a second computing unit 222 and a third computing unit 223. The capture unit 210 extracts a static feature vector from each of a plurality of sound frames. A static feature vector usually contains energy and spectral features, for example Mel-frequency cepstral coefficients (MFCC); alternatively, the static features may be linear predictive coefficients (LPC), formants, pitch or similar parameters.

In addition, the rate of change of the static features with respect to time can also express important characteristics of speech; these are called dynamic features. The computing unit 220 selects at least one set of static features from the static feature vectors to compute, for at least one time point, at least one dynamic feature vector of central temporal variation, at least one dynamic feature vector of left temporal variation and at least one dynamic feature vector of right temporal variation.

In the first embodiment, the first computing unit 221 selects, from the static feature vectors, the static feature vector corresponding to the at least one time point and several static feature vectors before and after it as at least one first set, and takes the rate of change of this set with respect to time to produce at least one dynamic feature vector of central temporal variation, wherein the number of static feature vectors in the first set before the time point equals the number after it.

Alternatively, in the second embodiment, the first computing unit 221 produces the central dynamic feature vector from a first set in which the weights of the static feature vectors before the at least one time point are symmetric to the weights of those after it.

A dynamic feature may be the first-order temporal rate of change of a static feature vector (corresponding to the first derivative) and/or the second-order rate of change (corresponding to the second derivative). A common computation takes the data of several sound frames before and after a given frame and finds the best linear (or quadratic) fit. Let $c_x[t]$ denote a static feature of the $t$-th frame of the speech signal, let $b_x[t]$ denote the dynamic feature obtained as its first-order temporal rate of change, and let $a_x[t]$ denote the dynamic feature obtained as its second-order temporal rate of change.

For the computation of the dynamic features, reference is made to FIG. 3. Let $c_x[t]$ be the instantaneous feature parameter of the $t$-th frame. If a straight line $c_x[t] + bk$ is used to approximate the data $c_x[t+k]$ near this frame, the weighted sum of squared approximation errors over the $N$ frames before and after satisfies the following relationship:
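The FIG. 1 pipeline stages 130 (frame segmentation) and 140 (endpoint detection) can be illustrated with a minimal sketch. The 8 kHz frame sizing, the log-energy threshold criterion and all function names below are assumptions made for illustration, not details taken from the patent:

```python
import math

def split_frames(samples, frame_len=200, hop=80):
    # Frame segmentation (module 130): 25 ms frames every 10 ms would be
    # frame_len=200, hop=80 at an assumed 8 kHz sampling rate.
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_energy(frame):
    # One component of a static feature vector; a full system would add
    # e.g. 12 MFCCs per frame.
    return math.log(sum(s * s for s in frame) + 1e-10)

def endpoints(frames, threshold):
    # Endpoint detection (module 140): first/last frame whose log-energy
    # exceeds a threshold; a deliberately minimal criterion.
    voiced = [i for i, f in enumerate(frames) if log_energy(f) > threshold]
    return (voiced[0], voiced[-1]) if voiced else None
```

A silent-loud-silent toy signal yields the expected interior start and end frames.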
$$E = \sum_{k=-N}^{N} w[k]\,\bigl(c_x[t+k] - (c_x[t] + bk)\bigr)^2$$

To find the slope $b$ that minimizes the weighted squared error, the derivative of $E$ with respect to $b$ is set to zero:

$$\frac{dE}{db} = -2\sum_{k=-N}^{N} w[k]\,\bigl(c_x[t+k] - c_x[t] - bk\bigr)\,k = 0$$

which gives

$$b = \frac{\sum_{k=-N}^{N} w[k]\,k\,\bigl(c_x[t+k] - c_x[t]\bigr)}{\sum_{k=-N}^{N} w[k]\,k^2}.$$

In the above formula, when $t+k$ runs past the start or end of the signal, $c_x[t+k]$ may take the value of the frame at the start or end; another practice is to include in the computation only the terms for which $t+k$ does not run past the start or end. If $w[k]$ is a symmetric function, the temporal rate-of-change formula can be simplified as follows:

$$b = \frac{\sum_{k=1}^{N} w[k]\,k\,\bigl(c_x[t+k] - c_x[t-k]\bigr)}{2\sum_{k=1}^{N} w[k]\,k^2}.$$

This is also another commonly used dynamic feature computation. Combining the above derivation, the dynamic feature vector $b_x[t]$ can be computed by this formula. The second-order temporal rate of change $a_x[t]$ can be computed by applying the same method, with $b_x[t]$ taking over the role of $c_x[t]$. Another way to compute the second-order rate $a_x[t]$ is to obtain it directly from the closest fit of the static features to a quadratic curve.
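The algebraic simplification for symmetric weights can be checked numerically. The sketch below assumes unit weights ($w[k] = 1$); both function names and the test data are illustrative:

```python
def ls_slope(c, t, N):
    # Full least-squares slope from the derivation, with w[k] = 1:
    # b = sum_k k*(c[t+k]-c[t]) / sum_k k**2, k = -N..N.
    num = sum(k * (c[t + k] - c[t]) for k in range(-N, N + 1))
    den = sum(k * k for k in range(-N, N + 1))
    return num / den

def delta_simplified(c, t, N):
    # Common simplified form for symmetric weights:
    # b = sum_{k=1..N} k*(c[t+k]-c[t-k]) / (2 * sum_{k=1..N} k**2).
    num = sum(k * (c[t + k] - c[t - k]) for k in range(1, N + 1))
    den = 2 * sum(k * k for k in range(1, N + 1))
    return num / den
```

Both forms agree (up to floating-point rounding) on arbitrary data, as the derivation predicts.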
For the temporal rate of change of the static feature $c_x$, in the present embodiment the first computing unit 221 uses the data in a symmetric interval on both sides of time $t$ to produce the dynamic feature vector of central temporal variation. Taking the first-order rate of change as an example, the dynamic feature vector of central temporal variation is the one computed with the symmetric window shown in FIG. 3.

It is common to combine the static features $c_x[t]$ of the speech, the first-order dynamic features $b_x[t]$ and the second-order dynamic features $a_x[t]$ into the feature vector of time $t$; the feature vectors of all time points of an utterance form a feature vector sequence, which is compared with the pre-stored word models to find the most probable word model sequence as the recognition result. Although the dynamic features of central temporal variation can reflect the time-varying characteristics of speech, and their combination with the static features can raise the recognition rate of the system, there are two main kinds of change in speech quality from one phoneme to the next. One kind is gradual change; for example, the quality of a diphthong changes gradually from that of the first vowel to that of the second vowel. The other kind is an abrupt, step-like change; for example, from a fricative to a vowel, the quality is similar everywhere within each phoneme but switches to a different quality once the phoneme boundary is crossed. In this second case, both the left and the right temporal variations at the boundary are small, while the central temporal variation is large. Where the speech quality changes gradually, the aforementioned dynamic feature of central temporal variation can still represent the speech characteristics effectively; in the step-like case, however, the central dynamic feature alone is not an adequate representative. Therefore, the second computing unit 222 computes dynamic feature vectors of left temporal variation and the third computing unit 223 computes dynamic feature vectors of right temporal variation, and the comparison unit 260 then searches the database 170, based on all of the above feature vectors, for the model or model sequence that best matches them; this expresses the speech quality and the relative position of the frame more adequately and benefits speech recognition.

In the first embodiment, the second computing unit 222 selects at least one second set of static feature vectors and takes its rate of change with respect to time to produce at least one dynamic feature vector of left temporal variation, wherein the number of static feature vectors in the second set before the at least one time point is greater than the number after it. The third computing unit 223 selects at least one third set of static feature vectors and takes its rate of change with respect to time to produce at least one dynamic feature vector of right temporal variation, wherein the number of static feature vectors in the third set before the time point is smaller than the number after it.

The first, second and third sets of static feature vectors of the first embodiment may be identical, or they may be completely different. For example, the second set may include the static feature vector corresponding to the at least one time point, or may lie entirely on one side of it; the third set may likewise include the static feature vector of the time point or lie entirely on its other side.

In the second embodiment, the second computing unit 222 selects at least one second set of static feature vectors and takes its rate of change with respect to time to produce at least one dynamic feature vector of left temporal variation, wherein the total weight of the static feature vectors in the second set before the at least one time point is greater than the total weight of those after it. The third computing unit 223 selects at least one third set of static feature vectors and takes its rate of change with respect to time to produce at least one dynamic feature vector of right temporal variation, wherein the total weight of the static feature vectors in the third set before the time point is smaller than the total weight of those after it.
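The step-transition argument above can be made concrete: at an abrupt change, a window that stays on one side of the boundary yields a near-zero slope, while the symmetric window does not. The window sizes and the toy signal below are illustrative assumptions:

```python
def slope(c, t, offsets):
    # b = sum_k k*(c[t+k]-c[t]) / sum_k k*k over the chosen offsets k
    # (unit weights; the 4-offset windows are illustrative).
    den = sum(k * k for k in offsets)
    return sum(k * (c[t + k] - c[t]) for k in offsets) / den

CENTRAL = [-2, -1, 1, 2]   # symmetric window around t
LEFT = [-4, -3, -2, -1]    # frames before t only
RIGHT = [1, 2, 3, 4]       # frames after t only

# A step-like phoneme transition: constant, then an abrupt jump.
c = [0.0] * 6 + [1.0] * 6
```

Just after the boundary the right slope is zero, just before it the left slope is zero, yet the central slope straddling the jump is clearly positive.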
In practice, the first, second and third sets of static feature vectors of the second embodiment may be the same set of static features; this does not limit the present invention, and those skilled in the art may flexibly adjust how the first, second and third sets of the second embodiment are selected as required.

The detailed computation of the dynamic feature of left temporal variation in the first embodiment described above is explained as follows. Referring to FIG. 4, to obtain the left temporal variation at time point $t$, the second computing unit 222 may take a segment of the signal lying mainly to the left of time point $t$, for example using:

$$b_L[t] = \frac{\sum_{k} w[k]\,k\,\bigl(c[t+k] - c[t]\bigr)}{\sum_{k} w[k]\,k^2},\qquad -N' \le k \le 0.$$

Alternatively, referring to FIG. 5, another way to compute the left temporal variation at time point $t$ is:

$$b_L[t] = \frac{\sum_{k=1}^{N} w[k]\,k\,\bigl(c[t-N'+k] - c[t-N'-k]\bigr)}{2\sum_{k=1}^{N} w[k]\,k^2},$$

which can be regarded as the temporal rate of change centered at a point to the left of time $t$.

Using a principle similar to the above, the third computing unit 223 may take a segment of the signal lying mainly to the right of time point $t$ to compute the right temporal variation; left (or right) temporal variations of second and higher order can be computed by similar methods. Then, as shown in FIG. 2, the comparison unit 260 uses the sequences of static feature vectors and of the central, left and right dynamic feature vectors to search the plurality of pre-stored models for the model or model sequence that best matches these feature vectors. For each time point, the comparison unit 260 may combine the corresponding static feature vector with the central, left and right dynamic feature vectors into a combined feature vector; the combined vectors of all time points form the combined feature vector sequence used in the search.

The capture unit 210, the computing unit 220 and the comparison unit 260 may be implemented in hardware, software or firmware. For example, if execution speed is the primary consideration, the units are mainly implemented in hardware; if design flexibility is the primary consideration, the units are mainly implemented in software; alternatively, the units may employ software, hardware and firmware working together.
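The FIG. 5 variant is, by the text's own description, simply the central formula re-centered at a point to the left of $t$, which a two-line sketch makes explicit (the values of $N$, the shift and the test data are illustrative assumptions):

```python
def central(c, t, N):
    # Symmetric-window slope: sum_{k=1..N} k*(c[t+k]-c[t-k]) / (2*sum k^2),
    # with unit weights.
    num = sum(k * (c[t + k] - c[t - k]) for k in range(1, N + 1))
    return num / (2 * sum(k * k for k in range(1, N + 1)))

def left_shifted(c, t, N, shift):
    # FIG. 5 style left delta: the central formula evaluated at a point
    # `shift` frames to the left of t.
    return central(c, t - shift, N)
```

The shifted computation matches the central delta of the earlier point exactly, while differing in general from the central delta of $t$ itself.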
It should be understood that the above example does not have the so-called excellent The inferiority is not intended to limit the invention: Those skilled in the art will be able to flexibly select the specific embodiment of the unit as needed at the time. 〃 Furthermore, those of ordinary skill in the art will appreciate that Each unit executes according to it The function is named 'only to make the technology of the case more obvious and understandable, not to limit the aspect of the unit. Integrate each unit into the same unit or split into multiple units' or either The function is changed to another unit for execution, and still belongs to the embodiment of the present disclosure. In order to make a more detailed description of the above-mentioned features of the combination and combination, please refer to Fig. 6. Fig. 6 is in accordance with A schematic diagram of feature extraction and combination of an embodiment of the present disclosure. As shown in FIG. 6, a cepstral coefficient (MFCC) and an energy logarithm of a 12-order Mel frequency scale are used as static features' and a central time variability The left time variability and the right time variability combine to form a 52-dimensional feature vector. The length of the sound box is 25ms, and the frame sampling rate is 1 〇ms—one frame. Central time variability, left time The variability and the right time variability are calculated using the data of 5 frames. The weights of different positions are set to 1, and the time variability of the left side of each time point is 5 words on the left side including the at least one time point. Box information Calculated, the right time variability is calculated using the data of the five right boxes containing the at least one time point. The speech recognition model uses the high-order hidden Markov model (hidden Markov m〇del). We use the TIDIGIT database for speech recognition. 
experiments, and compared the recognition rate with commonly used feature combinations. The comparison features consist of the static features (12th-order MFCC and the logarithm of the energy) plus the first-order and second-order central time variabilities of the static features, for a total of 39 dimensions. In the hidden Markov models, each digit contains 16 states; adjacent digits are separated by a one-state model, and a three-state model is placed before and after each utterance. The output probability distribution of each state is a Gaussian mixture model, and each mixture component uses a covariance matrix in diagonal form. The experimental results for various numbers of mixture components are shown in FIG. 7. It can be seen from FIG. 7 that the recognition rate of the feature combination of the present embodiment is better than that of the commonly used feature combination, and the recognition error rate is reduced by up to 26% in the experiments.

Another aspect of the present disclosure provides an audio feature processing method, which can be performed by the audio feature processing apparatus described above. The related embodiments have been specifically disclosed above and are not repeated here.
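The 52-dimensional feature combination of the embodiment above (13 static dimensions, each extended with central, left and right time variabilities over 5-frame windows) can be sketched as follows. This is a minimal illustration under our own assumptions: edge frames are handled by clamping indices, and the centered, k-weighted form of the left/right variability with N = 2 is used (the embodiment's all-ones weighting would replace each factor k with 1); none of these details is fixed by the disclosure.

```python
import numpy as np

def combine_features(static, N=2):
    # static: (T, 13) array of per-frame static features
    # (e.g. 12th-order MFCCs plus log energy).
    # Returns a (T, 52) array: static features concatenated with
    # central, left and right time-variability dynamic features.
    T, D = static.shape
    clamp = lambda i: min(max(i, 0), T - 1)  # edge handling (assumption)
    central = np.zeros_like(static)
    left = np.zeros_like(static)
    right = np.zeros_like(static)
    den = 2 * sum(k * k for k in range(1, N + 1))
    for t in range(T):
        for k in range(1, N + 1):
            central[t] += k * (static[clamp(t + k)] - static[clamp(t - k)])
            left[t] += k * (static[clamp(t - N + k)] - static[clamp(t - N - k)])
            right[t] += k * (static[clamp(t + N + k)] - static[clamp(t + N - k)])
    return np.hstack([static, central / den, left / den, right / den])

feats = combine_features(np.random.randn(100, 13))
print(feats.shape)  # (100, 52)
```

With a 25 ms frame length and a 10 ms frame shift, each row of `static` corresponds to one frame, so each 5-frame window spans roughly 65 ms of signal.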
Alternatively, the audio feature processing method described above may be implemented as a computer program and stored in a computer-readable recording medium, so that a computer, after reading the recording medium, executes the audio feature processing method.

Although the present disclosure has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone skilled in the art may make various changes and modifications without departing from the spirit and scope of the present disclosure; therefore, the scope of protection shall be defined by the appended claims.

[Brief Description of the Drawings] To make the above and other objects, features, advantages and embodiments of the present disclosure more comprehensible, the accompanying drawings are described as follows:

FIG. 1 is a block diagram of a speech recognition system according to an embodiment of the present disclosure;
FIG. 2 is a block diagram of an audio feature processing apparatus according to an embodiment of the present disclosure;
FIG. 3 is a diagram of the computation of the dynamic feature of the central time variability according to an embodiment of the present disclosure;
FIG. 4 is a diagram of the computation of the dynamic feature of the left time variability according to an embodiment of the present disclosure;
FIG. 5 is a diagram of the computation of the dynamic feature of the left time variability according to another embodiment of the present disclosure;
FIG. 6 is a schematic diagram of feature extraction and combination according to an embodiment of the present disclosure; and
FIG. 7 is a comparison table of the word recognition rates of two feature combinations.
[Description of Main Reference Numerals]
100: speech recognition system
110: microphone
120: analog-to-digital conversion
130: frame segmentation module
140: endpoint detection module
150: feature extraction subsystem
160: pattern matching subsystem
170: database
200: audio feature processing apparatus
210: capture unit
221: first calculation unit
222: second calculation unit
223: third calculation unit
260: comparison unit