TW201135716A - Method and apparatus for processing audio feature - Google Patents

Method and apparatus for processing audio feature

Info

Publication number
TW201135716A
TW201135716A TW99111654A
Authority
TW
Taiwan
Prior art keywords
static
time
feature
dynamic
vector
Prior art date
Application number
TW99111654A
Other languages
Chinese (zh)
Other versions
TWI409802B (en)
Inventor
Lee-Min Lee
Original Assignee
Univ Da Yeh
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Da Yeh filed Critical Univ Da Yeh
Priority to TW99111654A priority Critical patent/TWI409802B/en
Publication of TW201135716A publication Critical patent/TW201135716A/en
Application granted granted Critical
Publication of TWI409802B publication Critical patent/TWI409802B/en


Abstract

A method and an apparatus for processing audio features are disclosed herein. In the method, a static feature vector sequence, a dynamic feature vector sequence of central time rate of change, another dynamic feature vector sequence of left-side time rate of change, and yet another dynamic feature vector sequence of right-side time rate of change are used to search a database for the model, or string of models, that best matches these feature vector sequences. The voice recognition rate is thereby improved.

Description

201135716

VI. Description of the Invention

[Technical Field of the Invention]

The present disclosure relates to signal processing techniques, and more particularly to audio feature processing methods.

[Prior Art]

Voice recognition provides a natural and convenient human-machine interface. It can be used for data input, or to operate equipment so that devices become more user-friendly and easier to use. Voice recognition technology can also be applied in computer-assisted language learning systems, giving users immediate feedback and improving learning efficiency.

However, frequent recognition errors cause great inconvenience in use. Considerable effort has been devoted in the related fields to reducing the error probability of speech recognition, but no generally applicable solution has yet been developed. How to raise the speech recognition rate more effectively is therefore one of the important current research and development topics, and an urgent goal for improvement in the related fields.

[Summary of the Invention]

Accordingly, one aspect of the present disclosure provides an audio feature processing method and an audio feature processing apparatus for improving the sound recognition rate.

According to an embodiment of the present disclosure, the audio feature processing method comprises the following steps:

(a) extracting a plurality of static feature vectors from a plurality of sound frames;

(b) selecting at least one group of static features from the static feature vectors to compute, for at least one time point, at least one dynamic feature vector of central time rate of change, at least one dynamic feature vector of left-side time rate of change, and at least one dynamic feature vector of right-side time rate of change; and

(c) using the static feature vectors, the at least one dynamic feature vector of central time rate of change, the at least one dynamic feature vector of left-side time rate of change, and the at least one dynamic feature vector of right-side time rate of change to search a plurality of models pre-stored in a database for the model, or string of models, that best matches these feature vectors.

Each static feature vector may comprise Mel-frequency cepstral coefficients (MFCC) or similar feature parameters, such as linear predictive coefficients, formants, or pitch.

In step (c), the static feature vectors and the dynamic feature vectors of central, left-side, and right-side time rate of change may be combined into a combined feature vector sequence, and the model or string of models that best matches the combined feature vector sequence is then searched for among the models pre-stored in the database.

Step (b) may comprise the following sub-steps:

(α1) selecting, from the static feature vectors, the static feature vector corresponding to the at least one time point together with several static feature vectors before and after that time point as at least one first group of static feature vectors, and taking the time rate of change of the first group to produce at least one dynamic feature vector of central time rate of change, wherein in the rate-of-change computation the weights of the static feature vectors before the time point are symmetric to the weights of those after it;

(α2) selecting at least one second group of static feature vectors and taking its time rate of change to produce at least one dynamic feature vector of left-side time rate of change, wherein in the second group the total weight of the static feature vectors before the time point is greater than the total weight of those after it; and

(α3) selecting at least one third group of static feature vectors and taking its time rate of change to produce at least one dynamic feature vector of right-side time rate of change, wherein in the third group the total weight of the static feature vectors before the time point is smaller than the total weight of those after it.

Alternatively, step (b) may comprise the following sub-steps:

(β1) selecting, from the static feature vectors, the static feature vector corresponding to the at least one time point together with several static feature vectors before and after that time point as at least one first group of static feature vectors, and taking its time rate of change to produce at least one dynamic feature vector of central time rate of change, wherein the number of static feature vectors in the first group before the time point equals the number after it;

(β2) selecting at least one second group of static feature vectors and taking its time rate of change to produce at least one dynamic feature vector of left-side time rate of change, wherein the number of static feature vectors in the second group before the time point is larger than the number after it; and

(β3) selecting at least one third group of static feature vectors and taking its time rate of change to produce at least one dynamic feature vector of right-side time rate of change, wherein the number of static feature vectors in the third group before the time point is smaller than the number after it.

In sub-step (β2), the second group of static feature vectors may include the static feature vector corresponding to the time point, or may lie entirely on one side of it; the same applies to the third group in sub-step (β3).

In step (c), the models pre-stored in the database may be high-order hidden Markov models, exemplar-based models, or the like.

According to another embodiment of the present disclosure, an audio feature processing apparatus comprises an extraction unit, a calculation unit, and a comparison unit. The extraction unit extracts a plurality of static feature vectors from a plurality of sound frames. The calculation unit selects at least one group of static features to compute, for at least one time point, at least one dynamic feature vector of central time rate of change, at least one dynamic feature vector of left-side time rate of change, and at least one dynamic feature vector of right-side time rate of change. The comparison unit uses the static feature vectors and these dynamic feature vectors to search the models pre-stored in the database for the model, or string of models, that best matches the feature vectors; the static and dynamic feature vectors may first be combined into a combined feature vector sequence, and the best match to that sequence searched for.

Each static feature vector extracted by the extraction unit may likewise comprise Mel-frequency cepstral coefficients or similar feature parameters, such as linear predictive coefficients, formants, or pitch.

The calculation unit may comprise a first calculation unit, a second calculation unit, and a third calculation unit, which respectively produce the dynamic feature vectors of central, left-side, and right-side time rate of change, under either the weight conditions of sub-steps (α1)–(α3) or the frame-count conditions of sub-steps (β1)–(β3).
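The weight and frame-count conditions in the sub-steps above amount to fitting regression slopes over two-sided and one-sided frame windows. A minimal Python sketch under that reading (the function names and window sizes are illustrative, not taken from the patent):

```python
def rate_of_change(c, t, ks):
    # Slope of a line fitted through c[t] to the frames c[t+k], k in ks,
    # with unit weights; frames past either end are clamped to the boundary.
    T = len(c)
    num = sum(k * (c[min(max(t + k, 0), T - 1)] - c[t]) for k in ks)
    den = sum(k * k for k in ks)
    return num / den

def central_delta(c, t, n=2):
    # symmetric window: equal frame counts before and after t
    return rate_of_change(c, t, range(-n, n + 1))

def left_delta(c, t, n=4):
    # window lies mainly to the left of t (here: ends at t)
    return rate_of_change(c, t, range(-n, 1))

def right_delta(c, t, n=4):
    # window lies mainly to the right of t (here: starts at t)
    return rate_of_change(c, t, range(0, n + 1))
```

On a linearly rising feature all three slopes agree; at a step change the left-side slope stays near zero while the right-side slope does not, which is the distinction the disclosure exploits at phoneme boundaries.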

Furthermore, the second and third groups of static feature vectors selected by the second and third calculation units may include the static feature vector corresponding to the time point, or may lie entirely on one side of it.

Compared with the prior art, the technical solution of the present disclosure has clear advantages and beneficial effects. It achieves considerable technical progress and has broad industrial value, with at least the following merits:

1. Processing the static feature vectors together with the dynamic feature vectors of central time rate of change and of left-side and right-side time rate of change forms a representation that expresses the speech characteristics more precisely, so that a better sound recognition rate is obtained; and

2. The invention has a wide range of applications. It is not limited to speech recognition, and also covers other signal processing systems, such as music identification, speaker recognition, formant tracking, pitch tracking, tone recognition, and statistical speech synthesis.

The above description is set forth in detail in the following embodiments, which provide a further explanation of the technical solution of the present disclosure.

201135716

[Embodiments]

In order to make the description of the present disclosure more detailed and complete, reference may be made to the accompanying drawings and the various embodiments described below, in which the same reference numbers denote the same or similar elements. Well-known elements and steps are not described in the embodiments, to avoid unnecessarily limiting the invention.

Fig. 1 is a block diagram of a speech recognition system 100 in accordance with an embodiment of the present disclosure. The speech recognition system 100 comprises a microphone 110, an analog-to-digital converter 120, a frame segmentation module 130, an endpoint detection module 140, a feature extraction subsystem 150, a pattern matching subsystem 160, and a database 170.

In use, the microphone 110 converts sound waves into an analog signal, and the analog-to-digital converter 120 converts the analog signal into digital speech. The frame segmentation module 130 divides the digital speech into short segments of signal, each called a sound frame. The endpoint detection module 140 finds the start and end points of the speech, and the feature extraction subsystem 150 converts each speech frame into a feature vector representing its characteristics. The database 170 pre-stores speech patterns, and the pattern matching subsystem 160 searches the database 170 for the string of word models closest to the input feature vector sequence, which is taken as the recognition result.

To describe the above "feature vectors" more concretely, refer to Fig. 2, which is a block diagram of an audio feature processing apparatus 200 in accordance with an embodiment of the present disclosure. The audio feature processing apparatus 200 is applicable to the feature extraction subsystem 150 and the pattern matching subsystem 160 described above, and can also be used broadly in related applications, such as speaker recognition systems and statistical speech synthesis systems.

As shown in Fig. 2, the audio feature processing apparatus 200 comprises an extraction unit 210, a calculation unit 220, and a comparison unit 260, the calculation unit 220 comprising a first calculation unit 221, a second calculation unit 222, and a third calculation unit 223. The extraction unit 210 extracts static feature vectors from the sound frames. A static feature vector usually contains energy and spectral features, such as Mel-frequency cepstral coefficients (MFCC); alternatively, the static features may be linear predictive coefficients (LPC), formants, pitch, or similar parameters.

In addition, the time rate of change of the static features, called the dynamic features, expresses important characteristics of speech. The calculation unit 220 selects at least one group of static features from the static feature vectors to compute, for at least one time point, dynamic feature vectors of central, left-side, and right-side time rate of change.

In a first embodiment, the first calculation unit 221 selects the static feature vector corresponding to the time point and several static feature vectors before and after it as a first group, and takes its time rate of change to produce the dynamic feature vector of central time rate of change, wherein the number of static feature vectors in the first group before the time point equals the number after it. Alternatively, in a second embodiment, the weights of the static feature vectors in the first group before the time point are symmetric to the weights of those after it.

A dynamic feature may be the first-order time rate of change of the static features (corresponding to a first-order derivative) and/or the second-order rate of change (corresponding to a second-order derivative). A common computation fits a straight line to the data of several frames before and after the current frame. Let c[t] denote a static feature of the t-th frame of the audio signal, let Δc[t] denote the dynamic feature obtained from its first-order time rate of change, and Δ²c[t] the one obtained from its second-order rate of change. Referring to Fig. 3, if the straight line c[t] + b·k is used to approximate the data c[t+k] near frame t, the weighted sum of squared approximation errors over the N frames before and after satisfies

    E = Σ_{k=-N}^{N} w[k] · ( c[t+k] − (c[t] + b·k) )²

Setting the derivative of the weighted squared error with respect to b to zero,

    dE/db = −2 Σ_{k=-N}^{N} w[k] · ( c[t+k] − (c[t] + b·k) ) · k = 0,

gives

    b = Σ_{k=-N}^{N} w[k] · k · ( c[t+k] − c[t] ) / Σ_{k=-N}^{N} w[k] · k².

In the formulas above, when t + k runs past the start or end point of the signal, c[t+k] may be replaced by the data at the start or end point; another practice is to include in the computation only the terms whose t + k does not run past the start or end. If w[k] is a symmetric function, the rate-of-change formula simplifies to

    b = Σ_{k=-N}^{N} w[k] · k · c[t+k] / Σ_{k=-N}^{N} w[k] · k²,

and with all weights equal to 1 this becomes

    Δc[t] = Σ_{k=1}^{N} k · ( c[t+k] − c[t−k] ) / ( 2 · Σ_{k=1}^{N} k² ).

The simple difference ( c[t+N] − c[t−N] ) / (2N) is another commonly used dynamic feature computation. Combining the above derivation, the dynamic feature vector Δc[t] can be computed by the formulas above, and the second-order rate of change Δ²c[t] can be computed by applying the same method with Δc[t] in the role of c[t]. Another way of computing Δ²c[t] is to obtain it directly from the closest match of the static features to a quadratic curve.
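The symmetric-weight simplification above can be checked numerically: with unit weights, the general least-squares slope and the reduced formula Σ_k k·c[t+k] / (2·Σ_{k=1..N} k²) give the same value. A small self-contained check (variable names are illustrative):

```python
import random

def regression_slope(c, t, w, n):
    # b minimising E = sum_k w[k+n] * (c[t+k] - (c[t] + b*k))**2
    ks = range(-n, n + 1)
    num = sum(w[k + n] * k * (c[t + k] - c[t]) for k in ks)
    den = sum(w[k + n] * k * k for k in ks)
    return num / den

def simplified_delta(c, t, n):
    # Valid when the weights are symmetric (here all 1): the c[t] term in
    # the numerator vanishes against sum_k k, leaving the familiar formula.
    num = sum(k * c[t + k] for k in range(-n, n + 1))
    den = 2 * sum(k * k for k in range(1, n + 1))
    return num / den

random.seed(0)
c = [random.random() for _ in range(20)]
w = [1.0] * 7  # unit (symmetric) weights for n = 3
assert abs(regression_slope(c, 10, w, 3) - simplified_delta(c, 10, 3)) < 1e-9
```

The equality holds because Σ_k w[k]·k vanishes for symmetric w, so subtracting c[t] changes nothing in the numerator.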
In the calculation of the time variability of the c dynamic feature, in the present embodiment, ΐ/ 十早早 70 221 uses the data of the time/left and right symmetry intervals to generate a dynamic eigenvector of the central time variability. If the rate is a variability, the dynamic eigenvector of the buddy time variability is the dynamic eigenvector of the central time variability shown in the figure. 1 often uses the static feature of speech, ew, the time variability of motion, and the second-order time variability dynamic feature, synthesis time (10), and the feature vector of all time of a speech forms a sequence of feature vectors for pre-existence. The word model is compared to find the most likely word model string to identify the result. Although the dynamic characteristics of the central time variability can reflect the characteristics of the b曰曰 variable and the combination of static features can improve the system identification rate, there are two major changes in the characteristics of the phoneme from the phoneme to the next. The trait of 'the genre is gradually changing', for example, the diphthong will gradually change from the trait of the first vowel to the second vowel. Another kind of change-stepped step change 'for example, from fricative to vowel, the traits inside the phoneme are similar', and the traits across the phoneme are replaced by another L, in which case the left and right time at the boundary The variability is very small, and the time variability is very large. In the case where the phonetic traits gradually change, in the foregoing, the dynamic characteristics of the time variability can effectively represent the speech characteristics'. However, in the case of the step, the dynamic characteristics of the central time variability cannot be represented by the = knife. 
The second calculating unit 222 can calculate the dynamic feature vector of the left time change=the third calculating unit 223 can calculate the dynamic feature vector of the right time variability, and then the comparison unit 260 is based on the plurality of feature vectors from the database no. Searching for the best matching of the feature vectors 201135716 to express the speech traits and the relative position of the sound box, is conducive to speech recognition. The static calculation =: 3: the second calculation unit 222 is configured to generate at least one left time variability dynamic characteristic from the above-mentioned sufficiency rate by using: at least one second static eigenvector Vector, where the static number of vectors in the vector before the at least-time point. The second ▲10f selects the static feature vector after at least one time point to ^i7^223 for taking the second set of static feature vectors from the above-mentioned static feature vectors, and taking the time rate of change to generate a dynamic feature vector of the right time variability, wherein the number of static feature vectors after the at least one time point of the static feature to i before at least one time point. The first, second, and third sets of static first extraordinary kings of the first embodiment are the same, or may be completely different. For example, the second = state feature vector may include the at least one corresponding to the time point: the static portion corresponding to the t portion at the at least time point = the L static feature vector may include the at least one time point geese j The static feature vector is all on the other side of the at least corresponding static feature vector. 
In the second embodiment, the second calculating unit 222 is configured to take the change i from the second, and the second set of static feature vectors is generated to the time_to a left time variability Dynamic feature vector, 1 = less than the total weight of the feature vector in the second set of static feature vectors at the at least one time point: the total weight of the feature vector is greater than the total weight of the feature vector. The third calculating unit 223 is configured to select at least a third set of static feature vector versus time variability from the static g 15 m 201135716 levation 3: to generate at least one dynamic eigenvector of the right time variability, The total weight of the static binary sign 1 before the at least one time point in the at least one, group static feature vector is less than the total weight of the static feature after the at least one time point. 1

實作上,第二實施例之第―、第二、第三組靜態特徵 向量可為同一組靜態特徵,然此不限制本發明,熟習此項 技藝者當視當時需要彈性調整第二實施例之第一、第二、 第三組靜態特徵向量之選擇方式。 ^於上述第一實施例之左方時間變率之動態特徵詳細計 算方法說明如下。參考第4圖,設要獲得時間點t的左方 時間變率’則第二計算單元222可取-段主要位於時間點 ^左邊的訊號來計算,例如使用下式: Σ«·(Φ+λ]-€Μ) Σ轉2 k;N' 其中 。 或者,參考第5圖,時間點t左方時間變率的另種作法 ί 咖]⑷(c[卜〜+A:]-c[i-\]) Σ尋2 可為: 其中。’這可視為以時間f左邊某一點為中心的時間 變率。 使用與前述方法類似的原則,第三計算單元223可取 201135716 變率 例=位於時間點,右邊的訊號來計算右方時間 至於二階以上之左(或右)時間 法來計算。然後,如第2圖所示,比2了使_似的方 靜態特徵向量序列、中央時間變圭i疋260用以利用 左方時間變率之*;態=向量序、 最匹配該些特徵向量相之數個模型中搜尋出 靜:::元260可對於每-時間點所對應之 ㈣特徵向量、中央時間變率之動態特徵 :率之動態特徵向量及右方時間變率二 ^ 之特徵向量序列“二 變率之動態特徵與其他特=二 _ m之齡單元2ig、計算單元咖與比對μ 來說’若以執行速度為首要考量,則該體 體為主;若以設計彈性為首=㈡ 早兀基本上可選用軟體為主;或者,料單元可同時採= 17 201135716 軟體、硬體及軔體協同作業。應暸解到,以上所舉的這此 例子並沒有所謂孰優孰劣之分,亦並非用以限制本發明: 熟習此項技藝者當視當時需要,彈性選擇該等單元的具體 實施方式。 〃 再者,所屬技術領域中具有通常知識者當可明白,上 述各單元依其執行之功能予以命名’僅係為了讓本案之技 術更加明顯易懂,並非用以限定該等單元的態樣。將各單 元予以整合成同一單元或分拆成多個單元’或者將任一單 • 元之功能更換到另一單元中執行,皆仍屬於本揭示内容之 實施方式。 為了對上述之特徵榻取與組合的方式作更且體的閣 述’請參照第6圖。第6圖係依照本揭示内容一實施例之 特徵擷取與組合的示意圖。如第6圖所示,採用12階之梅 爾頻率刻度之倒頻譜係數(MFCC)及能量對數作為靜態特 徵’並與中央時間變率、左方時間變率、右方時間變率合 組成一個52維之特徵向量。音框長度為25ms,音框取樣 % 率為每1 〇ms —個音框。中央時間變率、左方時間變率、右 方時間變率均採用5個音框的資料來計算,不同位置的加 權均設為1,每個時間點之左方時間變率採用包含該至少 一時間點的左邊5個音框之資料來計算,右方時間變率採 用包含該至少一時間點的右邊5個音框之資料來計算。語 音辨認模型採用高階隱藏式馬可夫模型(hidden Markov m〇del)。我們以TIDIGIT資料庫進行語音辨認實驗,並與 常用之特徵組合做辨認率之比較。比較對象之特徵包含12 階MFCC與能量對數構成的靜態特徵,靜態特徵之一階與 201135716 二階中央時間變率,總共為39維之特徵向量。隱藏式馬玎 夫模型中每個數字音含16個狀態,數字之間為一個狀態的 間隔音,而每一句語音的前後各有一段3個狀態的靜音。 每個狀態之機率分布採用高斯機率混和模型,而每個混和 成分採用對角線形式之共變異數矩陣,在各種混和數之實 驗結果如第7圖所示,由第7圖中可看出使用本實施例的 特徵組合之辨認率優於常用之特徵組合,且其辨認錯誤降 低率在實驗中最高可降低26%的錯誤個數。 另一方面’本揭示内容之另一技術態樣係提供一種音 頻特徵處理方法,該音頻特徵處理方法可經由上述之音頻 特徵處理裝置來執行,其相關的實施例已具體揭露如上', 對此不再重複贅述之。In practice, the second, third, and third sets of static feature vectors of the second embodiment may be the same set of static features. However, the present invention is not limited thereto, and those skilled in the art should flexibly adjust the second embodiment at that time. The first, second, and third sets of static feature vectors are selected. 
The detailed calculation method of the dynamic characteristics of the left time variability in the above-described first embodiment is explained as follows. Referring to FIG. 4, it is assumed that the left time variability of time point t is to be obtained, and then the second calculation unit 222 can take a signal that is mainly located at the left side of the time point ^, for example, using the following formula: Σ«·(Φ+λ ]-€Μ) Σ 2 k; N' where. Or, refer to Figure 5, another way to change the time variability of the time to the left of time t ί 咖 ] (4) (c [Bu ~ + A:] - c [i- \]) Σ 2 can be: where. This can be seen as the time variability centered at a point on the left side of time f. Using a principle similar to the foregoing method, the third calculating unit 223 may take the 201135716 variability example = at the time point, the right signal to calculate the right time as the second (or right) time method of the second order or more. Then, as shown in Fig. 2, the ratio of the square static eigenvectors and the central time is used to make use of the left time variability*; the state = vector order, which best matches the features. In the several models of the vector phase, the static::: element 260 can be used for each time point corresponding to the (four) feature vector, the dynamic characteristics of the central time variability: the dynamic eigenvector of the rate and the right time variability The eigenvector sequence "dynamic characteristics of the two variability and other special = two _ m age unit 2ig, computing unit coffee and comparison μ" If the execution speed is the primary consideration, then the body is dominant; Elasticity is the first = (2) Early is basically the choice of software; or, the unit can be used simultaneously = 17 201135716 Software, hardware and carcass work together. 
It should be understood that the examples given above imply no ranking of merit, nor are they intended to limit the invention; those skilled in the art may flexibly select the specific implementation of these units according to the needs at hand. Furthermore, as those of ordinary skill in the art will understand, the units above are named after the functions they perform merely to make the technique of this disclosure easier to follow, not to limit their form. Integrating the units into a single unit, splitting them into multiple units, or moving the function of any unit into another unit for execution all remain within the embodiments of the present disclosure.

For a more concrete description of the feature extraction and combination described above, please refer to Fig. 6, which is a schematic diagram of feature extraction and combination according to an embodiment of the present disclosure. As shown in Fig. 6, 12th-order Mel-frequency cepstral coefficients (MFCC) and the energy logarithm serve as the static features and are combined with the central, left, and right time variabilities to form a 52-dimensional feature vector. The frame length is 25 ms, and frames are taken at a rate of one frame every 10 ms. The central, left, and right time variabilities are each computed from the data of 5 frames, with the weights at all positions set to 1; the left time variability at each time point is computed from the data of the 5 frames on the left including that time point, and the right time variability from the data of the 5 frames on the right including that time point. The speech recognition model is a high-order hidden Markov model. We carried out speech recognition experiments on the TIDIGIT database and compared the recognition rate with that of a commonly used feature combination. The features of the comparison system consist of the static features formed by the 12th-order MFCC and the energy logarithm together with the first- and second-order central time variabilities of the static features, a 39-dimensional feature vector in total. In the hidden Markov models, each digit sound contains 16 states, a one-state pause sound is placed between digits, and a 3-state silence precedes and follows each utterance. The probability distribution of each state is a Gaussian mixture model, and each mixture component uses a diagonal covariance matrix. The experimental results for various mixture numbers are shown in Fig. 7, from which it can be seen that the recognition rate of the feature combination of this embodiment is better than that of the commonly used feature combination, with up to a 26% reduction in the number of recognition errors in the experiments.

On the other hand, another technical aspect of the present disclosure provides an audio feature processing method, which can be performed by the audio feature processing apparatus described above; its related embodiments have been specifically disclosed above and are not repeated here.
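The 52-dimensional feature assembly of Fig. 6 can be sketched as follows. This is a minimal illustration assuming 13 static dimensions (12 MFCCs plus the log energy) per 10 ms frame, with edge frames clamped; the function names are illustrative, not the patent's.

```python
import numpy as np

def deltas(c, offsets):
    """Per-frame rate of change of the static features c (T x D) over
    the window given by `offsets`, clamping frames at utterance edges."""
    T = len(c)
    out = np.zeros_like(c)
    den = float(sum(k * k for k in offsets))
    for t in range(T):
        for k in offsets:
            idx = min(max(t + k, 0), T - 1)
            out[t] += k * (c[idx] - c[t])
        out[t] /= den
    return out

def combine_features(static):
    """Concatenate static features with their central, left, and right
    time variabilities: 13 static dims -> 4 * 13 = 52 dims per frame."""
    central = deltas(static, range(-2, 3))   # symmetric 5-frame window
    left    = deltas(static, range(-4, 1))   # frame t and 4 frames left
    right   = deltas(static, range(0, 5))    # frame t and 4 frames right
    return np.hstack([static, central, left, right])

static = np.random.randn(100, 13)  # e.g. 12 MFCCs + log energy per frame
features = combine_features(static)
print(features.shape)  # (100, 52)
```

The combined sequence is what the comparison unit would match against the models in the database; the 39-dimensional baseline of the experiment differs only in replacing the left/right variabilities with a second-order central variability.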

Alternatively, the audio feature processing method described above may be implemented as a computer program and stored in a computer-readable recording medium, so that after a computer reads the recording medium, the computer performs the audio feature processing method.

Although the present disclosure has been disclosed above by way of embodiments, they are not intended to limit the invention. Anyone skilled in the art may make various changes and modifications without departing from the spirit and scope of the present disclosure; the scope of protection shall therefore be defined by the appended claims.

【Brief Description of the Drawings】
To make the above and other objects, features, advantages, and embodiments of the present disclosure more comprehensible, the accompanying drawings are described as follows:
Fig. 1 is a block diagram of a speech recognition system according to an embodiment of the present disclosure;
Fig. 2 is a block diagram of an audio feature processing apparatus according to an embodiment of the present disclosure;
Fig. 3 is a diagram of computing the central-time-variability dynamic feature according to an embodiment of the present disclosure;
Fig. 4 is a diagram of computing the left-time-variability dynamic feature according to an embodiment of the present disclosure;
Fig. 5 is a diagram of computing the left-time-variability dynamic feature according to another embodiment of the present disclosure;
Fig. 6 is a schematic diagram of feature extraction and combination according to an embodiment of the present disclosure; and
Fig. 7 is a comparison table of the word recognition rates of two feature combinations.

【Description of Main Reference Numerals】
100: speech recognition system
110: microphone
120: analog-to-digital converter
130: frame segmentation module
140: endpoint detection module
150: feature extraction subsystem
160: pattern matching subsystem
170: database
200: audio feature processing apparatus
210: capture unit
221: first calculation unit
222: second calculation unit
223: third calculation unit
260: comparison unit

Claims (16)

VII. Claims

1. An audio feature processing method, comprising:
(a) extracting a plurality of static feature vectors from a plurality of frames, respectively;
(b) selecting at least one set of static features from the static feature vectors to compute at least one left-time-variability dynamic feature vector and at least one right-time-variability dynamic feature vector corresponding to at least one time point; and
(c) using the at least one left-time-variability dynamic feature vector and the at least one right-time-variability dynamic feature vector to search a plurality of models prestored in a database for a model or a model string that best matches the feature vectors.

2. The audio feature processing method of claim 1, wherein each of the static feature vectors is Mel-frequency-scale cepstral coefficients with an energy logarithm, linear prediction coefficients, a formant, or a pitch.

3. The audio feature processing method of claim 1, wherein step (b) further comprises:
selecting the at least one set of static features from the static feature vectors to compute at least one central-time-variability dynamic feature vector at the at least one time point;
and step (c) comprises:
using the static feature vectors, the at least one central-time-variability dynamic feature vector, the at least one left-time-variability dynamic feature vector, and the at least one right-time-variability dynamic feature vector to search the plurality of models prestored in the database for the model or the model string that best matches the feature vectors.

4. The audio feature processing method of claim 3, wherein step (c) further comprises:
combining the static feature vectors, the at least one central-time-variability dynamic feature vector, the at least one left-time-variability dynamic feature vector, and the at least one right-time-variability dynamic feature vector to produce a combined feature vector sequence, and searching the plurality of models prestored in the database for the model or the model string that best matches the combined feature vector sequence.

5. The audio feature processing method of claim 3, wherein step (b) comprises:
selecting, from the static feature vectors, the static feature vector corresponding to the at least one time point and a number of static feature vectors before and after the at least one time point as at least one first set of static feature vectors, and taking the rate of change of the at least one first set of static feature vectors over time to produce the at least one central-time-variability dynamic feature vector, wherein in the at least one first set of static feature vectors the weights of the static feature vectors before the at least one time point are symmetric to the weights of the static feature vectors after the at least one time point;
selecting at least one second set of static feature vectors from the static feature vectors and taking its rate of change over time to produce the at least one left-time-variability dynamic feature vector, wherein in the at least one second set of static feature vectors the total weight of the static feature vectors before the at least one time point is greater than the total weight of the static feature vectors after the at least one time point; and
selecting at least one third set of static feature vectors from the static feature vectors and taking its rate of change over time to produce the at least one right-time-variability dynamic feature vector, wherein in the at least one third set of static feature vectors the total weight of the static feature vectors before the at least one time point is less than the total weight of the static feature vectors after the at least one time point.

6. The audio feature processing method of claim 3, wherein step (b) comprises:
selecting, from the static feature vectors, the static feature vector corresponding to the at least one time point and a number of static feature vectors before and after the at least one time point as at least one first set of static feature vectors, and taking the rate of change of the at least one first set of static feature vectors over time to produce the at least one central-time-variability dynamic feature vector, wherein in the at least one first set of static feature vectors the number of static feature vectors before the at least one time point equals the number of static feature vectors after the at least one time point;
selecting at least one second set of static feature vectors from the static feature vectors and taking its rate of change over time to produce the at least one left-time-variability dynamic feature vector, wherein in the at least one second set of static feature vectors the number of static feature vectors before the at least one time point is greater than the number of static feature vectors after the at least one time point; and
selecting at least one third set of static feature vectors from the static feature vectors and taking its rate of change over time to produce the at least one right-time-variability dynamic feature vector, wherein in the at least one third set of static feature vectors the number of static feature vectors before the at least one time point is less than the number of static feature vectors after the at least one time point.

7. The audio feature processing method of claim 6, wherein the second set of static feature vectors comprises the static feature vector corresponding to the at least one time point, or lies entirely on one side of the static feature vector corresponding to the at least one time point; and the third set of static feature vectors comprises the static feature vector corresponding to the at least one time point, or lies entirely on the other side of the static feature vector corresponding to the at least one time point.

8. The audio feature processing method of claim 1, wherein step (c) comprises:
searching for the model or the model string that best matches the feature vectors using an algorithm based on a high-order hidden Markov model.

9. An audio feature processing apparatus, comprising:
a capture unit for extracting a plurality of static feature vectors from a plurality of frames, respectively;
a calculation unit for selecting at least one set of static features from the static feature vectors to compute at least one left-time-variability dynamic feature vector and at least one right-time-variability dynamic feature vector corresponding to at least one time point; and
a comparison unit for using the at least one left-time-variability dynamic feature vector and the at least one right-time-variability dynamic feature vector to search a plurality of models prestored in a database for a model or a model string that best matches the feature vectors.

10. The audio feature processing apparatus of claim 9, wherein each of the static feature vectors is Mel-frequency-scale cepstral coefficients with an energy logarithm, linear prediction coefficients, a formant, or a pitch.

11. The audio feature processing apparatus of claim 9, wherein the calculation unit further computes, from the static feature vectors, at least one central-time-variability dynamic feature vector corresponding to the at least one time point, and the comparison unit uses the static feature vectors, the at least one central-time-variability dynamic feature vector, the at least one left-time-variability dynamic feature vector, and the at least one right-time-variability dynamic feature vector to search the plurality of models prestored in the database for the model or the model string that best matches the feature vectors.

12. The audio feature processing apparatus of claim 11, wherein the comparison unit combines the static feature vectors, the at least one central-time-variability dynamic feature vector, the at least one left-time-variability dynamic feature vector, and the at least one right-time-variability dynamic feature vector to produce a combined feature vector sequence, and searches for the model or the model string that best matches the combined feature vector sequence.

13. The audio feature processing apparatus of claim 11, wherein the calculation unit comprises:
a first calculation unit for selecting, from the static feature vectors, the static feature vector corresponding to the at least one time point and a number of static feature vectors before and after the at least one time point as at least one first set of static feature vectors, and taking its rate of change over time to produce the at least one central-time-variability dynamic feature vector, wherein in the at least one first set of static feature vectors the weights of the static feature vectors before the at least one time point are symmetric to the weights of the static feature vectors after the at least one time point;
a second calculation unit for selecting at least one second set of static feature vectors from the static feature vectors and taking its rate of change over time to produce the at least one left-time-variability dynamic feature vector, wherein in the at least one second set of static feature vectors the total weight of the static feature vectors before the at least one time point is greater than the total weight of the static feature vectors after the at least one time point; and
a third calculation unit for selecting at least one third set of static feature vectors from the static feature vectors and taking its rate of change over time to produce the at least one right-time-variability dynamic feature vector, wherein in the at least one third set of static feature vectors the total weight of the static feature vectors before the at least one time point is less than the total weight of the static feature vectors after the at least one time point.

14. The audio feature processing apparatus of claim 11, wherein the calculation unit comprises:
a first calculation unit for selecting, from the static feature vectors, the static feature vector corresponding to the at least one time point and a number of static feature vectors before and after the at least one time point as at least one first set of static feature vectors, and taking its rate of change over time to produce the at least one central-time-variability dynamic feature vector, wherein in the at least one first set of static feature vectors the number of static feature vectors before the at least one time point equals the number of static feature vectors after the at least one time point;
a second calculation unit for selecting at least one second set of static feature vectors from the static feature vectors and taking its rate of change over time to produce the at least one left-time-variability dynamic feature vector, wherein in the at least one second set of static feature vectors the number of static feature vectors before the at least one time point is greater than the number of static feature vectors after the at least one time point; and
a third calculation unit for selecting at least one third set of static feature vectors from the static feature vectors and taking its rate of change over time to produce the at least one right-time-variability dynamic feature vector, wherein in the at least one third set of static feature vectors the number of static feature vectors before the at least one time point is less than the number of static feature vectors after the at least one time point.

15. The audio feature processing apparatus of claim 14, wherein the second set of static feature vectors comprises the static feature vector corresponding to the at least one time point, or lies entirely on one side of the static feature vector corresponding to the at least one time point; and the third set of static feature vectors comprises the static feature vector corresponding to the at least one time point, or lies entirely on the other side of the static feature vector corresponding to the at least one time point.

16. The audio feature processing apparatus of claim 9, wherein the comparison unit searches for the model or the model string that best matches the feature vectors using an algorithm based on a high-order hidden Markov model.
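As a rough illustration of the model matching named in step (c) and claim 8, the sketch below scores a combined feature-vector sequence against left-to-right HMMs whose states use single diagonal-covariance Gaussians, and picks the best-matching model. This is only a schematic stand-in for the high-order hidden-Markov-model search of the claims: a real recognizer would use Gaussian mixtures per state and search over model strings, and every function name here is illustrative.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian (one component,
    standing in for the Gaussian mixtures used in the experiments)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var,
                         axis=-1)

def viterbi_score(feats, means, variances, log_trans):
    """Best-path log score of a feature sequence feats (T x D) against
    a left-to-right HMM with S states; log_trans is S x S."""
    T, S = len(feats), len(means)
    # Per-frame, per-state observation log likelihoods (T x S).
    obs = np.stack([log_gauss_diag(feats, means[s], variances[s])
                    for s in range(S)], axis=1)
    score = np.full(S, -np.inf)
    score[0] = obs[0, 0]                    # must start in the first state
    for t in range(1, T):
        stay = score + log_trans[np.arange(S), np.arange(S)]
        move = np.full(S, -np.inf)
        move[1:] = score[:-1] + log_trans[np.arange(S - 1), np.arange(1, S)]
        score = np.maximum(stay, move) + obs[t]
    return score[-1]                        # must end in the last state

def best_model(feats, models):
    """Return the key of the model best matching feats; each model is a
    (means, variances, log_trans) tuple."""
    return max(models, key=lambda name: viterbi_score(feats, *models[name]))
```

In a digit recognizer like the one described in the experiments, `models` would hold one 16-state HMM per digit, and the search would additionally allow concatenations of models (model strings) with pause and silence models in between.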
TW99111654A 2010-04-14 2010-04-14 Method and apparatus for processing audio feature TWI409802B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW99111654A TWI409802B (en) 2010-04-14 2010-04-14 Method and apparatus for processing audio feature


Publications (2)

Publication Number Publication Date
TW201135716A true TW201135716A (en) 2011-10-16
TWI409802B TWI409802B (en) 2013-09-21

Family

ID=46752014


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI584269B (en) * 2012-07-11 2017-05-21 Univ Nat Central Unsupervised language conversion detection method
TWI475558B (en) * 2012-11-08 2015-03-01 Ind Tech Res Inst Method and apparatus for utterance verification
US8972264B2 2012-11-08 2015-03-03 Industrial Technology Research Institute Method and apparatus for utterance verification





Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees