TW480473B - Method and system for detection of phonetic features - Google Patents

Method and system for detection of phonetic features

Info

Publication number
TW480473B
TW480473B
Authority
TW
Taiwan
Prior art keywords
scope
patent application
parameters
item
voice data
Prior art date
Application number
TW089122842A
Other languages
Chinese (zh)
Inventor
Jonathan B Allen
Mazin G Rahim
Lawrence K Saul
Original Assignee
AT&T Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AT&T Corp
Application granted granted Critical
Publication of TW480473B

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025 Phonemes, fenemes or fenones being the recognition units
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band

Abstract

In various embodiments, techniques for detecting phonetic features in a stream of speech data are provided by first dividing the stream of speech data into a plurality of critical bands, segmenting the critical bands into streams of consecutive windows, and determining various parameters for each window in each critical band. The various parameters can then be combined using various operators in a multi-layered network, including a first layer that processes the parameters by applying a sigmoid operator to a weighted parameter sum. The processed sums can then be further combined using a hierarchy of conjunctive and disjunctive operators to produce a stream of detected features.

Description

480473 A7 B7
V. Description of the Invention

Cross-Reference to Related Applications:
This non-provisional patent application claims the benefit of U.S. Provisional Patent Application 60/161,995, "Learning from Examples in Critical Bands of Speech," filed October 28, 1999 (internal docket 1999-0516, 105174), naming Lawrence K. SAUL, Mazin G. RAHIM, and Jonathan B. ALLEN as applicants. The above provisional patent application is hereby incorporated by reference in its entirety.

Background of the Invention:

1. Field of the Invention:
The present invention relates to speech recognition systems for detecting phonetic features.

2. Description of Related Art:
The goal of Automated Speech Recognition (ASR) is to recognize human speech with the accuracy of human hearing, including speech degraded by background noise or distorted by the various filters inherent in communication devices such as telephones. Unfortunately, ASR systems rarely perform with the accuracy of the human ear. By modeling an ASR system on the human auditory system, however, the accuracy of ASR could in theory be raised to that of human hearing. Unfortunately, although many aspects of the human auditory system are well documented, traditional ASR systems have been unable to exploit models of it.

Consequently, these traditional ASR systems cannot match the robust ability of humans to recognize speech under adverse listening conditions. There is therefore a need for speech recognition methods and systems that provide accurate detection of phonetic features.

Summary of the Invention:
In various embodiments, methods and systems are provided for detecting phonetic features based on critical bands of speech. In various embodiments, techniques for detecting phonetic features in a stream of speech data include the following steps: first dividing the stream of speech data into a plurality of critical bands; segmenting the critical bands into streams of consecutive windows; and determining various parameters for each window in each critical band. These parameters can then be combined using various operators in a multi-layered network.

A first layer of the multi-layered network can process the various parameters by weighting them, forming a sum for each critical band, and processing each sum with a sigmoid operator. The processed sums can then be further combined using a hierarchy of conjunctive and disjunctive operators to produce a stream of detected features.

In other embodiments, a training technique is adapted to the multi-layered network by iteratively detecting features, comparing the detected features against a stream of predetermined feature labels, and updating various internal weights using methods such as an expectation-maximization technique and a maximum-likelihood estimation technique.
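The layered combination just summarized (a sigmoid over a weighted parameter sum per critical band, followed by conjunctive AND-like and disjunctive OR-like operators) can be sketched as follows. The parameter values, weights, and grouping below are toy placeholders, not the patent's trained network; the disjunction is shown as a noisy-OR, one common probabilistic reading of a disjunctive operator.

```python
import math

def sigmoid(z):
    # Logistic function sigma(z) = 1 / (1 + e^(-z)).
    return 1.0 / (1.0 + math.exp(-z))

def first_layer(params, weights):
    # Weighted sum of one band's window parameters, squashed to a
    # probability-like cue in (0, 1).
    return sigmoid(sum(w * p for w, p in zip(weights, params)))

def conjunction(probs):
    # Conjunctive (AND-like) combination: product of cue probabilities.
    out = 1.0
    for p in probs:
        out *= p
    return out

def disjunction(probs):
    # Disjunctive (OR-like) combination: complement of the product of
    # complements (noisy-OR).
    out = 1.0
    for p in probs:
        out *= (1.0 - p)
    return 1.0 - out

# Toy example: three bands, two parameters per window, hypothetical weights.
band_params = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
weights = [[1.5, -0.5], [1.0, 0.5], [2.0, -1.0]]
cues = [first_layer(p, w) for p, w in zip(band_params, weights)]
score = disjunction([conjunction(cues)])
print(0.0 < score < 1.0)
```

Because every stage maps into (0, 1), the final score can itself be read as a detection probability for the feature.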
Brief Description of the Drawings:
The present invention will now be described in detail with reference to the following figures, in which like reference numerals denote like elements:
Fig. 1 is a block diagram of an exemplary feature detection system;
Fig. 2 is a block diagram of the feature recognizer of Fig. 1;
Fig. 3 is a block diagram of an exemplary front end of the feature recognizer of Fig. 2;
Fig. 4 is a block diagram of an exemplary back end of the feature recognizer of Fig. 2;
Fig. 5 is a block diagram of a portion of the back end of Fig. 4 having various training circuits capable of learning;
Fig. 6 is a block diagram of an exemplary first-layer combiner according to the present invention; and
Fig. 7 is a flowchart outlining an exemplary method for detecting phonetic features and training the detector.

Reference Numerals:
100 speech recognition system
120 feature recognizer
110 data source
112 input link
130 data storage
122 output link
210 front end
220 back end
212 link
310 filter unit
312 non-linear device
314 low-pass filter / down-sampling device
316 window/parameter measurement device
318 threshold comparison device
410-1, 410-2, ... 410-n first-layer combiners
420-1, 420-2, ... 420-m second-layer combiners
430 third-layer combiner
510 training circuit
610-1, 610-2, ... 610-j multipliers
620 summing node
630 sigmoid function node
612-1, 612-2, ... 612-j, 422, 512 links

Detailed Description of Preferred Embodiments:
Speech recognition systems provide a powerful tool for automating transactions such as buying and selling over telephone lines, automatic dictation, and various other transactions that must obtain information from a speaker. Unfortunately, in the presence of background noise, or when human speech has been filtered by various electronic systems such as telephones and recording mechanisms, speech recognition systems can produce unacceptable error rates. By applying models of the human auditory system to automated speech recognition (ASR) systems, however, a machine can in theory detect speech with the accuracy of human hearing.

One relevant auditory model is based on the hypothesis that different parts of the spectrum can be analyzed independently in the early stages of speech recognition. That is, by dividing a wide spectrum into several narrower frequency ranges called "critical bands," and by appropriately processing and combining the resulting information, individual phonetic features can be accurately detected.
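To make the critical-band hypothesis concrete, the sketch below lays out a bank of third-octave bands over the 225-3625 Hz range used by the exemplary filter unit described later, and bins spectral components into them. The uniform third-octave spacing and the frequency binning are assumptions for illustration; a real front end would use true bandpass filters, and the patent's exemplary unit uses twenty-four bands of varying width.

```python
import math

def third_octave_bands(f_lo=225.0, f_hi=3625.0):
    # Successive band edges spaced a third of an octave apart
    # (each edge is the previous one times 2**(1/3)).
    step = 2.0 ** (1.0 / 3.0)
    edges = [f_lo]
    while edges[-1] * step < f_hi:
        edges.append(edges[-1] * step)
    edges.append(f_hi)
    return list(zip(edges[:-1], edges[1:]))

def band_energies(spectrum, bands):
    # spectrum: list of (frequency_hz, magnitude) pairs, e.g. from a DFT.
    # Crude stand-in for bandpass filtering: sum squared magnitude per band.
    energies = [0.0] * len(bands)
    for f, mag in spectrum:
        for i, (lo, hi) in enumerate(bands):
            if lo <= f < hi:
                energies[i] += mag * mag
                break
    return energies

bands = third_octave_bands()
spectrum = [(250.0, 1.0), (500.0, 0.5), (1000.0, 0.25), (3000.0, 0.1)]
energies = band_energies(spectrum, bands)
print(len(bands), sum(e > 0 for e in energies))
```

Each band's energy stream is then processed independently, which is the independence assumption at the heart of the critical-band hypothesis.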
Unfortunately, previous attempts to derive an accurate ASR model based on critical bands have proven unsuccessful. However, by first forming and processing the critical bands of speech, and then appropriately combining the processed bands using hidden-variable models and various learning techniques, an ASR system can be developed that reliably detects phonetic features (also known as distinctive features) even in environments with considerable background noise.

One preferred goal of a feature detection system is to distinguish two classes of phonetic features known as sonorants and obstruents. A sonorant ([+sonorant]) can be one of a set of audible features recognizable as vowels, nasals, and approximants, and is characterized by periodic vibration of the vocal cords. That is, most of a sonorant's energy is found in a particular narrow frequency range and its harmonics. Obstruents ([-sonorant]), on the other hand, can include phonetic features such as stops, fricatives, and affricates, and are characterized by speech produced with an obstructed airflow. Table 1 below illustrates the distinction between sonorants ([+sonorant]) and obstruents
([-sonorant]):

Table 1
                  [+voiced]    [-voiced]
  stops           b (bee)      p (pea)
  [-sonorant]     d (day)      t (tea)
                  g (gay)      k (key)
  fricatives      z (zone)
                  v (van)
                  dh (then)
                  zh (azure)
  affricates      jh (joke)    ch (choke)
  nasals          m (mom)
  [+sonorant]     n (noon)
                  ng (sing)
  approximants    l (lay)
                  r (ray)
                  w (way)
                  y (yacht)

Although phonetic features such as [+sonorant] can be described as periodic, the particular frequency range in which that periodicity appears is set by the speaker's pitch. By dividing the spectrum into many critical bands, however, some of those bands will not only detect [+sonorant] sound energy; the critical bands containing [+sonorant] energy will also exhibit an improved signal-to-noise ratio (SNR), expressed as the ratio between the processed [+sonorant] energy and broadband noise. Thus, by emulating the multi-band processing of the human auditory system, phonetic features such as [+sonorant] and [-sonorant] can be distinguished even in noisy and filtered environments.

Fig. 1 is an exemplary block diagram of a speech recognition system (100). The system (100) includes a feature recognizer (120) connected to a data source (110) via an input link (112) and to a data storage (130) via an output link (122). The exemplary feature recognizer (120) can receive speech data from the data source (110) and detect the phonetic features in that speech data.
The feature recognizer (120) can divide the spectrum of the speech data into critical bands, process each critical band to produce cues, and advantageously combine those cues to detect phonetic features, producing a stream of detected features that is passed over the output link (122) to the data storage (130).

The data source (110) can provide the feature recognizer (120) with actual speech or with speech data in any format that can represent actual speech, including binary data, ASCII data, Fourier data, wavelet data, data contained in word-processing files, WAV files, MPEG files, or any other file format containing compressed or uncompressed speech data. Furthermore, the data source (110) can be any of a number of different types of data sources, such as a person, a computer, a storage device, or any combination of known or later-developed hardware and software that can generate, relay,

or retrieve a message or any other information that can represent actual speech. Similarly, the data storage (130) can be any device that can receive phonetic feature data, such as a digital computer, a communication network element, or any combination of hardware and software that can receive, relay, store, sense, or perceive data or information representing phonetic features.

The links (112) and (122) can be any known or later-developed device or system for connecting the data source (110) or the data storage (130) to the feature recognizer (120). Such devices include a direct serial/parallel cable connection, a connection over a wide-area or local-area network, a connection over an intranet, a connection over the Internet, or any other connection over any distributed processing network or system. Alternatively, the input link (112) or the output link (122) can be any other software device for linking software systems. In general, the links (112) and (122) can be any known or later-developed connection system, computer program, or structure suitable for connecting the data source (110) or the data storage (130) to the feature recognizer (120).

Fig. 2 shows an exemplary feature recognizer (120) according to the present invention.
In an exemplary embodiment, the window/parameter measurement device (316) can first determine the six parameters described herein for a given window and then use the preceding and following windows to determine first and second derivatives of those parameters. Once the parameters are determined, the window/parameter measurement device (316) provides them to the threshold comparison device (318).

The threshold comparison device (318) normalizes the values of the parameters. That is, the parameters can be scaled against a number of predetermined thresholds derived from bands of white noise. Once the parameters are normalized, the normalized channel parameters can be output in like manner over the links (212-1, 212-2, ... 212-n). As described herein, a speech data stream is divided into twenty-four channels with six parameters produced per channel, so the exemplary threshold comparison device (318) produces a total of 144 parameters for each sixteen-millisecond window.

Fig. 4 is a block diagram of an exemplary back end (220) according to the present invention. The exemplary back end (220) includes a first number of first-layer combiners (410-1, 410-2, ... 410-n), a second number of second-layer combiners (420-1, 420-2, ... 420-m), and a third-layer combiner (430). The first-layer combiners (410-1, 410-2, ... 410-n) receive, over the links (212-1, 212-2, ... 212-n) respectively, the parameter streams associated with the critical speech bands. The exemplary parameters in these streams can be the sets of six measurements relating to the signal-to-noise ratio and autocovariance statistics of the speech data in each consecutive non-overlapping window. As noted, however, the number, type, and nature of these parameters can vary as desired or as design requires without departing from the spirit and scope of the invention. As the first-layer combiners (410-1, 410-2, ... 410-n) receive the parameters for each window, they can perform a first combining operation according to equation (1):

    Pr[X_i = 1 | M_i] = sigma(theta_j . M_i)    (1)

where M_i is the set of measurements (i.e., a parameter vector) associated with the speech data of the i-th window, theta_j = {theta_1, theta_2, ... theta_J} is the set of weights associated with M_i, and sigma(z) = [1 + e^(-z)]^(-1) is the logistic function, also known as the sigmoid function. As described herein, the weights theta_j can be estimated by various training techniques; however, the weights theta_j can alternatively be derived by any method that provides weights usable for detecting/distinguishing phonetic features, without departing from the spirit and scope of the invention.

For each window of speech data, each first-layer combiner (410-1, 410-2, ... 410-n) can multiply each parameter in M_i by a respective weight, add the respective products, and process the sum of products with a sigmoid function. After each set of weights is processed, the output of each first-layer combiner (410-1, 410-2, ... 410-n) can be provided to the second-layer combiners (420-1, 420-2, ... 420-m). As shown in Fig. 4, each second-layer combiner (420-1, 420-2, ... 420-m) can receive outputs from three first-layer combiners (410-1, 410-2, ... 410-n). It should be appreciated, however, that in various embodiments each second-layer combiner (420-1, 420-2, ... 420-m) can receive any number of first-layer combiner outputs, and that each first-layer combiner (410-1, 410-2, ... 410-n) can provide its output to more than one second-layer combiner (420-1, 420-2, ... 420-m), without departing from the spirit and scope of the invention.

Once each second-layer combiner (420-1, 420-2, ... 420-m) has received its respective first-layer combiner data, it can perform a second combining operation on the received data according to equation (2):

    Pr[Y = 1 | M_i] = PROD_n Pr[X_n = 1 | M_i]    (2)

where the product runs over the first-layer combiners feeding the given second-layer combiner, so that the second layer acts as a conjunctive operator on the first-layer probabilities.
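Equations (1) and (2) of the patent can be transcribed directly, on the assumption that each second-layer combiner simply multiplies the sigmoid outputs of the first-layer combiners routed to it. The parameter vector and weight vectors below are hypothetical values chosen only to exercise the operators.

```python
import math

def sigmoid(z):
    # Logistic function sigma(z) = 1 / (1 + e^(-z)) from equation (1).
    return 1.0 / (1.0 + math.exp(-z))

def first_layer_prob(theta, m):
    # Equation (1): Pr[X = 1 | M_i] = sigma(theta_j . M_i), where M_i is
    # the parameter vector of window i and theta_j the unit's weights.
    return sigmoid(sum(t * x for t, x in zip(theta, m)))

def second_layer_prob(thetas, m):
    # Equation (2): Pr[Y = 1 | M_i] is the product, over the first-layer
    # units feeding this combiner, of Pr[X = 1 | M_i] (a conjunction).
    prob = 1.0
    for theta in thetas:
        prob *= first_layer_prob(theta, m)
    return prob

# Toy example: one window's parameter vector and three hypothetical
# first-layer weight vectors feeding one second-layer combiner.
m = [0.4, -0.2, 0.1]
thetas = [[1.0, 0.5, -0.5], [0.2, 0.1, 0.3], [-0.4, 0.8, 1.2]]
p = second_layer_prob(thetas, m)
print(0.0 < p < 1.0)
```

With all-zero weights each first-layer probability is exactly 0.5, so a second-layer combiner fed by two such units outputs 0.25, which is a quick sanity check on the conjunction.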
The feature recognizer (120) includes a front end (210) coupled to a back end (220) via a link (212).

In a first mode of operation, the front end (210) receives over the link (112) a training data stream: a speech data stream accompanied by a stream of feature labels indicating whether the speech data of a particular window contains a particular phonetic feature. For example, a first segment of a speech data stream can contain the phoneme /h/ ("hay") and carry a [+sonorant] label, a second segment containing the phoneme /d/ ("ladder") can carry a [-sonorant] label, and a third segment can contain random noise leading to neither a [+sonorant] nor a [-sonorant] label. Although the exemplary feature recognizer (120) distinguishes sonorants [+sonorant] from obstruents [-sonorant], it should be appreciated that the features distinguished and/or detected by the feature recognizer (120) can vary without departing from the spirit and scope of the invention. For example, in other embodiments the feature recognizer (120) can detect/distinguish other phonetic features such as voicing and nasality. The other phonetic features can include at least any of the binary phonetic features described or referred to in Miller, G. and Nicely, P., "An analysis of perceptual confusions among some English consonants," Acoust. Soc. Am. J. 27(2):338-352 (1955), which is hereby incorporated by reference in its entirety.
Once the front end (210) has received training data, it can perform a first set of operations on the speech data to produce a processed speech data stream. The front end (210) can then pass the processed speech data stream and the accompanying feature label stream over the link (212) to the back end (220). The back end (220) can use the processed speech data stream and the respective feature labels to adjust its internal weights (not shown) so that the feature recognizer (120) effectively learns to distinguish the features. The exemplary back end (220) learns to distinguish features using an expectation-maximization (EM) technique and an iterative maximum-likelihood estimation (MLE) technique. Information on expectation-maximization and maximum-likelihood estimation can be found at least in Dempster, A., Laird, N., and Rubin, D., "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society B 39:1-38 (1977), which is hereby incorporated by reference in its entirety. Although the exemplary back end (220) learns from training examples using a combination of EM and MLE techniques, it should be appreciated that any combination of known or later-developed techniques suitable for training a device or system to distinguish phonetic features can be used without departing from the spirit and scope of the invention.
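The EM/MLE training of the full latent-variable network is beyond a short sketch, but the flavor of a maximum-likelihood weight update for a single sigmoid unit can be shown with plain gradient ascent on the log-likelihood. This is a simplified stand-in for the patent's EM-based procedure (not the procedure itself), and the windows, labels, and learning rate are toy values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mle_step(theta, windows, labels, lr=0.1):
    # One gradient step up the log-likelihood of a single sigmoid unit:
    # a simplified stand-in for the EM / maximum-likelihood training the
    # patent applies to the whole multi-layered network.
    grad = [0.0] * len(theta)
    for m, y in zip(windows, labels):
        p = sigmoid(sum(t * x for t, x in zip(theta, m)))
        for k, x in enumerate(m):
            grad[k] += (y - p) * x   # d(log-likelihood)/d(theta_k)
    return [t + lr * g for t, g in zip(theta, grad)]

def log_likelihood(theta, windows, labels):
    total = 0.0
    for m, y in zip(windows, labels):
        p = sigmoid(sum(t * x for t, x in zip(theta, m)))
        total += math.log(p if y == 1 else 1.0 - p)
    return total

# Toy data: windows whose first parameter separates the two feature labels.
windows = [[1.0, 0.2], [0.9, -0.1], [-0.8, 0.1], [-1.1, 0.3]]
labels = [1, 1, 0, 0]
theta = [0.0, 0.0]
before = log_likelihood(theta, windows, labels)
for _ in range(50):
    theta = mle_step(theta, windows, labels)
after = log_likelihood(theta, windows, labels)
print(after > before)
```

The iterative detect-compare-update loop the patent describes would wrap a step like this around the whole network, comparing detected features against the predetermined feature labels.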
During training, the back end (220) can receive the processed speech data and the respective feature labels and train its internal weights (not shown) until the back end (220) can effectively distinguish the relevant phonetic features. Once the feature recognizer (120) has been trained, it can operate according to a second mode of operation.

In the second mode of operation, the front end (210) can receive a speech data stream, process it in the same manner as in the first mode described above, and provide a processed speech data stream to the back end (220). The back end (220) can accordingly receive the processed speech data stream, advantageously combine the various cues provided by the processed stream using the trained internal weights to detect/distinguish the phonetic features, and provide a stream of detected features on the link (122).

Fig. 3 shows an exemplary front end (210) according to the present invention. The front end (210) includes a filter unit (310), a non-linear device (312) such as a rectifying/squaring device, a low-pass filter (LPF)/down-sampling device (314), a window/parameter measurement device (316), and a threshold comparison device (318). In operation, the filter unit (310) can first receive a speech data stream over the link (112).
The filter unit (310) can then use a number of band-pass filters (not shown) built into the filter unit (310) to divide the voice data stream into a series of narrow frequency bands (that is, critical bands). The exemplary filter unit (310) can divide the voice data into twenty-four separate frequency bands, with center frequencies between 225 Hz and 3625 Hz and bandwidths ranging from half an octave to a third of an octave. However, it should be understood that, without departing from the spirit and scope of the present invention, the filter unit (310) can divide the speech data into any number of critical bands with various center frequencies and bandwidths. Once the filter unit (310) has divided the speech data stream into its critical bands, the narrow-band speech data is provided to the non-linear device (312). The non-linear device (312) receives the narrow-band voice data, rectifies it (that is, removes its negative components), and squares the rectified voice data streams, then provides the rectified/squared voice data streams to the LPF/down-sampling device (314). The LPF/down-sampling device (314) receives these rectified/squared voice data streams, removes their high-frequency components to smooth the voice data, down-samples the smoothed voice data streams, and provides the resulting streams to the sound window/parameter measuring device (316).
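The rectify/square/smooth/down-sample chain just described can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the band-pass filter bank is assumed to have already produced one narrow-band signal (a bare 100 Hz tone stands in for a critical-band output here), and the moving-average kernel length, decimation factor, and 8 kHz rate are arbitrary illustrative choices.

```python
import numpy as np

def envelope_channel(band_signal, decimate_by=8, smooth_len=32):
    """One channel of the front end after the filter bank: half-wave
    rectify, square, smooth with a crude moving-average low-pass filter,
    and down-sample."""
    rectified = np.maximum(band_signal, 0.0)     # drop negative components
    squared = rectified ** 2                     # squaring nonlinearity
    kernel = np.ones(smooth_len) / smooth_len    # moving-average low-pass
    smoothed = np.convolve(squared, kernel, mode="same")
    return smoothed[::decimate_by]               # down-sample

# a bare 100 Hz tone at 8 kHz stands in for one critical-band output
t = np.arange(8000) / 8000.0
env = envelope_channel(np.sin(2.0 * np.pi * 100.0 * t))
```

The result is a slowly varying, non-negative envelope per channel, which is what the windowing stage downstream operates on.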
The sound window/parameter measuring device (316) receives the down-sampled voice data streams and divides each stream into a sequence of consecutive, non-overlapping sixteen-millisecond (16 ms) sound windows. Although the illustrated sound window/parameter measuring device (316) divides the speech into consecutive non-overlapping sixteen-millisecond sound windows, it should be understood that, in various embodiments, the size of these sound windows can be changed as needed or as design requirements dictate without departing from the spirit and scope of the present invention. Likewise, in other embodiments, the sound windows can be overlapping or non-overlapping as needed or as design requirements dictate. For the voice data of each sound window, the sound window/parameter measuring device (316) can determine a number of statistical parameters associated with that window. The illustrated sound window/parameter measuring device (316) determines six statistical parameters per sound window in each critical band: the first two parameters are running estimates of the signal-to-noise ratio of the particular critical band, and the last four parameters are autocovariance statistics.
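The windowing and parameter measurement can be sketched as follows. The text does not spell out the exact SNR and autocovariance formulas, so the six statistics below (two crude SNR-style values against an assumed noise floor, plus normalized autocovariances at lags 1 through 4) are stand-ins chosen only to show the shape of the computation: one six-element vector per 16 ms window per channel.

```python
import numpy as np

def window_parameters(channel, fs=1000, win_ms=16, noise_floor=1e-3):
    """Split one channel into non-overlapping 16 ms windows and compute
    six illustrative statistics per window: two SNR-style estimates and
    four autocovariance values (lags 1..4).  The exact statistics used
    by the patent are not specified here; these are placeholders."""
    win = int(fs * win_ms / 1000)                 # samples per window
    n_windows = len(channel) // win
    params = np.empty((n_windows, 6))
    for w in range(n_windows):
        x = channel[w * win:(w + 1) * win]
        power = np.mean(x ** 2)
        params[w, 0] = power / noise_floor        # crude SNR estimate
        params[w, 1] = np.max(np.abs(x)) / np.sqrt(noise_floor)
        xc = x - x.mean()
        var = np.mean(xc * xc) + 1e-12
        for lag in range(1, 5):                   # normalized autocovariance
            params[w, 1 + lag] = np.mean(xc[:-lag] * xc[lag:]) / var
    return params

p = window_parameters(np.random.default_rng(0).standard_normal(160))
```

With a 1 kHz channel rate and 160 samples of input, this yields ten windows of six parameters each.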
It should be understood, however, that in various embodiments the sound window/parameter measuring device (316) can determine various other quantities and/or a different number of parameters without departing from the spirit and scope of the present invention. For example, the sound window/parameter measuring device (316) may first determine the six parameters above for a particular sound window, and then use the preceding and following sound windows to determine first and second derivatives of these parameters. Once the parameters have been determined, the sound window/parameter measuring device (316) can provide them to the threshold comparison device (318). The threshold comparison device (318) can clip and normalize the various parameters and push the normalized parameters of each channel onto the respective links (212-1, 212-2, and so on) in like manner. Accordingly, with twenty-four channels each producing six parameters, the exemplary front end (210) can generate one hundred forty-four (144) parameters for approximately every sixteen-millisecond sound window.
FIG. 4 is a block diagram of an exemplary back end (220) according to the present invention. The back end (220) contains a first number of first-layer combiners (410-1, 410-2, ... 410-n), a second number of second-layer combiners (420-1, 420-2, ... 420-m), and a third-layer combiner (430). The first-layer combiners (410-1, 410-2, ... 410-n) receive the parameter streams associated with the respective critical voice frequency bands via the links (212-1, 212-2, ... 212-n). As exemplified, each parameter stream can be a set of six measurements related to the signal-to-noise ratio and autocovariance statistics of the speech data of each successive non-overlapping sound window. However, as previously stated, the number, type, and nature of these parameters can be changed as needed or according to design requirements without departing from the spirit and scope of the present invention. When the first-layer combiners (410-1, 410-2, ... 410-n) receive the parameters of each sound window, they can execute a first combination operation according to equation (1):

Pr[X_ij = 1 | M_i] = σ(θ_j · M_i)    (1)

where M_i is the set of measurements (that is, the parameter vector) associated with the speech data of the i-th sound window, θ_j is a set of weights {θ_1, θ_2, ... θ_n} associated with M_i, and σ(z) = [1 + e^(−z)]^(−1) is the logistic function, also known as the sigmoid function. As described above, various training techniques can be used to estimate these weights.
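Equation (1) amounts to a bank of logistic regressions, one per first-layer test. A minimal sketch follows; the weight values and the parameter vector are made up for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def first_layer(theta, M_i):
    """Equation (1): Pr[X_ij = 1 | M_i] = sigmoid(theta_j . M_i), one
    logistic regression per first-layer test j over the band's
    parameter vector M_i."""
    return sigmoid(theta @ M_i)

# two tests over a three-element parameter vector (illustrative values)
theta = np.array([[0.5, -0.2, 0.1],
                  [1.0,  0.3, -0.4]])
M_i = np.array([1.0, 2.0, 0.5])
probs = first_layer(theta, M_i)
```

Each output is a probability in (0, 1); larger weighted sums map to outputs closer to one.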
However, without departing from the spirit and scope of the present invention, these weights θ_ij could alternatively be derived by any method for producing weights capable of detecting and distinguishing the various speech features. For the speech data of each sound window, each first-layer combiner (410-1, 410-2, ... 410-n) can multiply each parameter of M_i by a respective weight θ_ij, add the respective products, and process the resulting sum with a sigmoid function. After this processing, the output of each first-layer combiner (410-1, 410-2, ... 410-n) is provided to the second-layer combiners (420-1, 420-2, ... 420-m). As shown in FIG. 4, each second-layer combiner (420-1, 420-2, ... 420-m) can receive the outputs of three first-layer combiners (410-1, 410-2, ... 410-n). However, it should be understood that, in various embodiments, each second-layer combiner (420-1, 420-2, ... 420-m) can receive any number of first-layer combiner outputs. In addition, it should be understood that, without departing from the spirit and scope of the present invention, each first-layer combiner (410-1, 410-2, ... 410-n) can provide its output to more than one second-layer combiner (420-1, 420-2, ... 420-m). Once each second-layer combiner (420-1, 420-2, ... 420-m) has received its respective first-layer data, it can perform a second combination operation on the data it receives according to equation (2):

Pr[Y_i = 1 | M_i] = Π_j Pr[X_ij = 1 | M_i]    (2)

where Pr[X_ij = 1 | M_i] is the conditional probability distribution of a first-layer combiner output given M_i, and X_ij denotes the result of the j-th first-layer test in a particular
critical frequency band. Since the output of each first-layer combiner can vary from zero to one, equation (2) shows that the output of each second-layer combiner (420-1, 420-2, ... 420-m) can also vary from zero to one, and that the effect of equation (2) is to perform a conjunction. That is, equation (2) effectively performs a logical "ANDing" operation. For example, if a second-layer combiner receives the outputs of three first-layer combiners and one of those first-layer combiners has an output of zero, then the output of the second-layer combiner will be zero regardless of the outputs of the other first-layer combiners. Once each second-layer combiner (420-1, 420-2, ... 420-m) has performed its combination operation, the output of each second-layer combiner (420-1, 420-2, ... 420-m) is provided to the third-layer combiner (430). The third-layer combiner (430) receives the output of each second-layer combiner (420-1, 420-2, ... 420-m) and performs a third combination operation on these second-layer outputs according to equation (3):

Pr[Z = 1 | M] = 1 − Π_i (1 − Pr[Y_i = 1 | M_i])    (3)

where M = {M_1, M_2, ...} denotes the entire set of parameter measurements, Z is a binary random variable, and Pr[Z = 1 | M] is the conditional probability distribution of Z given M. The effect of the third-layer combiner (430) is to perform a disjunction of the second-layer outputs Y_i. That is, the third-layer combiner (430) effectively performs a logical "OR" operation. For example, if the output of any second-layer combiner is one, the output of the third-layer combiner (430) will be one regardless of the outputs of the other second-layer combiners. As described above, the back end (220) can thus determine Pr[Z = 1 | M] — the probability, based on the periodicity and SNR measurements in the critical frequency bands, that the speech data of a sound window contains a particular phonetic feature such as [+sonorant].
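Taken together, equations (1)-(3) describe a logistic first layer, a product ("noisy-AND") within each band, and a noisy-OR across bands. A hedged sketch of the full forward pass, with made-up weights and measurements:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def detect_feature(thetas, M):
    """Forward pass through the three-layer model of equations (1)-(3):
    a logistic regression per first-layer test (eq. 1), a product over
    the tests of each band (eq. 2), and a noisy-OR across bands (eq. 3)."""
    band_probs = []
    for theta_i, M_i in zip(thetas, M):
        p_ij = sigmoid(theta_i @ M_i)            # equation (1)
        band_probs.append(np.prod(p_ij))         # equation (2)
    return 1.0 - np.prod(1.0 - np.array(band_probs))  # equation (3)

rng = np.random.default_rng(1)
thetas = [rng.standard_normal((2, 3)) for _ in range(4)]  # 4 bands, 2 tests each
M = [rng.standard_normal(3) for _ in range(4)]            # 4 parameter vectors
z = detect_feature(thetas, M)
```

The AND/OR behavior falls out of the arithmetic: one zero test kills its band's product, while one band near one drives the final noisy-OR toward one.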
Such an inference may involve the bottom-up transmission of information through the layered network shown in FIG. 4. However, other inferences involving a combination of bottom-up and top-down reasoning can also be made. For example, posterior probabilities such as Pr[X_ij | Y_i, M_i] (the conditional probability of a particular first-layer output X_ij given the parameter measurements M_i and a particular second-layer output Y_i) and Pr[Y_i | Z, M] (the conditional probability of a particular second-layer output Y_i given the parameter measurements M and the third-layer output Z) can be used for learning from examples, that is, for training. Some of these posterior probabilities follow directly from the logical "AND" and "OR" structure of equations (2) and (3) above. For example, from the "AND" structure we can infer that Pr[X_ij = 1 | Y_i = 1, M] = 1; that is, when the output of a particular second-layer combiner is known to be one (Y_i = 1), every first-layer combiner providing data to that second layer must also have fired. Accordingly, if Y_i = 1, the inference X_ij = 1 can be made for all first-layer combiners that provide data to that second-layer combiner.
Similarly, from the logical "OR" structure of equation (3), we can infer that Pr[Y_i = 1 | Z = 0, M] = 0; that is, when the output of the third-layer combiner is known to be zero (Z = 0), the output of any particular second-layer combiner must also be zero. Accordingly, given Z = 0, the inference Y_i = 0 can be made for all second-layer combiners. Other posterior probabilities can be computed from Bayes' rule. To simplify the resulting formulas, let p_ij = Pr[X_ij = 1 | M_i] denote the conditional probability computed by a first-layer combiner of the back end (220). Then, for a critical band in a [-sonorant] sound window, the following inference can be made according to equation (4):

Pr[X_ij = 1 | Y_i = 0, M] = p_ij [ (1 − Π_{k≠j} p_ik) / (1 − Π_k p_ik) ]    (4)

where the term in square brackets is necessarily less than one. Equation (4) therefore expresses that when an output of a second-layer combiner is known to be negative, the probability that any particular test in its first-layer combiners was positive should be lowered. Likewise, for a [+sonorant] sound window of speech, an inference can be made according to equation (5):

Pr[Y_i = 1 | Z = 1, M] = Pr[Y_i = 1 | M_i] / Pr[Z = 1 | M]    (5)

where the denominator of this equation is necessarily less than one. Equation (5) expresses that when a [+sonorant] feature has been detected in one or more critical bands, the probability that any particular second-layer combiner — and hence its first-layer tests — detected a [+sonorant] feature should be raised. One advantage of the model described above is that the probabilistic graphical model gives precise quantitative form to the intuition embodied in equation (4).
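The down-weighting behavior of equation (4) can be checked numerically. The sketch below computes, for arbitrary prior values, the posterior for each first-layer test of a band known to be negative; every posterior comes out at or below its prior.

```python
import numpy as np

def posterior_given_band_negative(p):
    """Posterior Pr[X_j = 1 | Y = 0] for each test of one band, per the
    reconstruction of equation (4): the prior p_j is scaled by the
    factor (1 - prod_{k != j} p_k) / (1 - prod_k p_k), which is at most
    one."""
    p = np.asarray(p, dtype=float)
    total = np.prod(p)
    rest = total / p                    # product over k != j, for each j
    return p * (1.0 - rest) / (1.0 - total)

prior = np.array([0.9, 0.6, 0.3])
post = posterior_given_band_negative(prior)
```

Because Y = 0 only rules out the all-tests-fired outcome, the formula matches brute-force conditioning over the remaining outcomes.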
It should be understood that, in various embodiments, the exemplified back end (220) can be modified or extended without departing from the spirit and scope of the present invention. As described above, the measurement vector M_i can contain sets of six measurements related to SNR and autocovariance statistics. However, also as described above, these sets of measurements can be extended, for example by including the first and second derivatives of the SNR and autocovariance statistics. A second extension can be obtained by providing the logistic regressions beneath each second-layer combiner with parameter measurements from consecutive sound windows (not only from the same sound window). Finally, although the exemplified back end (220) has been described with reference to the detection of [+sonorant], the back end (220) can alternatively be used to detect other speech features, such as voiced sounds, nasals, or any other single-bit speech feature, without departing from the spirit and scope of the present invention. FIG. 5 is a block diagram of a portion of the back end (220) of FIG. 4 used together with a set of training circuits (510) that enable the back end (220) to learn to distinguish the various speech features. In operation, the exemplified back end (220) can receive the processed speech data of a number of sound windows and combine the processed speech data according to equations (1)-(3) described above.
As the processed speech data of each sound window is combined, the exemplified training circuits (510) can receive data from the third-layer combiner output, which can comprise a stream of predicted speech features. The training circuits (510) can further receive from the data source (110) the processed speech data stream, along with a respective stream of speech labels indicating whether the speech data of a particular sound window actually contains the relevant speech feature. Using the processed speech data streams, the predicted speech features, and the actual speech features (labels), the training circuits (510) can iteratively train the weights in the first-layer combiners (410-1, 410-2, ... 410-n). The training circuits (510) can estimate these weights using an EM technique in conjunction with an MLE technique. The exemplified EM technique comprises two alternating steps, an E-step and an M-step. The exemplified E-step can comprise computing the posterior probabilities Pr[X_ij | Z, M] conditioned on the labels provided by the data source (110). The M-step can comprise updating the parameters θ_j of each logistic regression using the posterior probabilities as targets. The exemplified training data can be derived from wide-band speech data, and in various embodiments the training data may be contaminated by various noise sources, filtering, or other distortions. The exemplified training data can also include a stream of phonetically aligned [+/-sonorant] labels. For the speech data of each sound window, a set of acoustic measurements M^t can be associated with a target label z^t ∈ {0, 1} indicating whether the window is [+sonorant].
The first-layer parameters can then be chosen to maximize a log-likelihood (LL) according to equation (6):

LL = Σ_t log Pr[Z^t = z^t | M^t]    (6)

so that the back end's output predictions match the labels from the data source. The EM procedure comprises two alternating steps, an E-step and an M-step. The E-step in this model computes the posterior probabilities of the hidden variables conditioned on the labels provided by the phonetic alignment. This computation differs between the speech of [-sonorant] and [+sonorant] sound windows. For a [-sonorant] sound window, the computation can be performed according to equation (7):

Pr[X_ij = 1 | Z = 0, M] = p_ij [ (1 − Π_{k≠j} p_ik) / (1 − Π_k p_ik) ]    (7)

while for a [+sonorant] sound window the computation can be performed according to equation (8), obtained by marginalizing over the hidden band variable Y_i (and using Pr[X_ij = 1 | Y_i = 1, M] = 1):

Pr[X_ij = 1 | Z = 1, M] = Pr[Y_i = 1 | Z = 1, M] + Pr[X_ij = 1 | Y_i = 0, M] · Pr[Y_i = 0 | Z = 1, M]    (8)

The posterior probabilities of equations (7) and (8) can be derived by applying Bayes' rule, marginalizing out the hidden variables Y_i, and repeatedly using equations (4) and (5). The M-step of the EM procedure can then update each logistic regression, providing updated parameter estimates θ_ij. Let q_ij^t = Pr[X_ij^t = 1 | Z^t = z^t, M^t] denote the posterior probability computed from equations (7) and (8) under the existing estimates θ_ij, and let p_ij^t(θ′) = Pr[X_ij^t = 1 | M^t] denote the probability computed by equation (1) under updated estimates θ′. The M-step then consists of replacing θ by the θ′ that maximizes L, where L is given by equation (9):

L = Σ_t Σ_ij [ q_ij^t log p_ij^t(θ′) + (1 − q_ij^t) log(1 − p_ij^t(θ′)) ]    (9)

Because each term of equation (9) defines a concave function of the updated weights, the maximization of equation (9) can be carried out by Newton's method or by a gradient-ascent technique. Once the new estimates θ′ have been determined, the training circuits (510) can provide them to the respective first-layer combiners (410-1 to 410-n). The first-layer combiners (410-1 to 410-n) then contain the new estimates and can process the speech data of the next sound window in a similar manner, until the entire training data stream has been processed, after which the back end (220) can exhibit the appropriate performance or otherwise be used as required. FIG. 6 is a block diagram of an exemplary first-layer combiner (410-1) according to the present invention.
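The E-step/M-step alternation can be sketched for a single band and a single training window. This is a simplification, not the patent's procedure: the z = 1 posterior is simply clamped to one (a shortcut that drops the cross-band noisy-OR correction of equation (8)), and the M-step runs plain gradient ascent on the weighted logistic objective of equation (9).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def e_step(theta, M, label):
    """E-step targets for one band's tests.  For a z = 0 window every
    band variable Y is zero, so the equation (4)/(7) down-weighting
    applies.  For z = 1 this sketch clamps the targets to one."""
    p = sigmoid(theta @ M)
    if label == 1:
        return np.ones_like(p)
    total = np.prod(p)
    rest = total / p                      # product over the other tests
    return p * (1.0 - rest) / (1.0 - total)

def m_step(theta, M, q, lr=0.5, iters=200):
    """M-step of equation (9) for one window: gradient ascent on
    sum_j [ q_j log p_j + (1 - q_j) log(1 - p_j) ], whose gradient for
    test j is (q_j - p_j) * M."""
    for _ in range(iters):
        p = sigmoid(theta @ M)
        theta = theta + lr * np.outer(q - p, M)
    return theta

M = np.array([1.0, -0.5, 0.25])           # one band's parameter vector
theta = np.zeros((2, 3))                  # two tests, three parameters each
q = e_step(theta, M, label=0)             # posterior targets for a z=0 window
theta = m_step(theta, M, q)
```

Because the per-window objective is concave in the weights, the gradient ascent drives each test's probability to its posterior target, which is the fixed point the full EM procedure alternates toward.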
The exemplary first-layer combiner (410-1) includes a number of multipliers (610-1, 610-2, ... 610-j), a summing node (620), and a sigmoid-function node (630). In operation, the parameters can be provided to the multipliers (610-1, 610-2, ... 610-j) via the links (212-i-1, 212-i-2, ... 212-i-j). The multipliers (610-1, 610-2, ... 610-j) can receive the parameters, multiply each parameter by a respective weight θ_1 through θ_j, and output their respective products to the summing node (620) via the links (612-1, 612-2, ... 612-j). The summing node (620) can accordingly receive the products from the multipliers (610-1, 610-2, ... 610-j), add the products together, and provide the sum of the products to the sigmoid-function node (630) via the link (422). The sigmoid-function node (630) can process the sum using a sigmoid transfer function or another similar function. Once the sigmoid-function node (630) has processed the sum, the processed sum can be provided to a second-layer combiner (not shown) via the link (412-i). As described above, the first-layer weights can change, in particular during a training operation. Accordingly, in each iteration of the training procedure, the multipliers (610-1, 610-2, ... 610-j) can receive new weight estimates. Once a particular weight has been received, each multiplier (610-1, 610-2, ... 610-j) can retain that weight indefinitely until it is further modified.
FIG. 7 is a flowchart outlining an exemplary method of processing the critical speech bands according to the present invention. The procedure begins at step (710), where the speech data of a first sound window is received. Then, in step (720), a number of front-end filtering operations are performed. As described above, a set of front-end operations can include the following steps: dividing the received speech data into a number of critical speech bands; rectifying and squaring each critical speech band; filtering and down-sampling each rectified/squared critical band; and clipping, measuring, and normalizing the parameters of each critical band in each sound window so as to produce a stream of parameter vectors M. However, as described above, these front-end filtering operations can be changed as needed or according to other requirements without departing from the spirit and scope of the present invention. The operation continues to step (730). In step (730), a first-layer combination operation is performed according to equation (1) described above. Although the exemplified first-layer combination operation typically involves passing a sum of weighted parameters through a sigmoid operator, it should be understood that, in various embodiments, the specific form of the first-layer combination operation can be changed without departing from the spirit and scope of the present invention. The operations continue to step (740). In step (740), the output of step (730) is used to perform a number of second-layer combination operations.
As described above, these second-layer combination operations are conjunctive in nature, and they can be performed according to equation (2) described above. Then, in step (750), a number of third-layer combination operations can be performed on the conjunctive outputs of step (740). These third-layer combination operations can be disjunctive in nature and can take the form of equation (3) described above. Although equations (2) and (3) above can be used to perform the exemplary second- and third-layer operations, it should be understood that the exact form of steps (740) and (750) can be changed, and that this form can be any combination of procedures usable to detect/distinguish various phonetic features such as sonorant, obstruent, voicing, and nasal features, without departing from the spirit and scope of the present invention. Control of the procedure continues to step (760). In step (760), the feature estimate produced in step (750) is provided to an external device such as a computer or the like. Then, in step (770), it is determined whether a training operation is being performed. If a training operation is being performed, control of the procedure continues to step (780); otherwise, control jumps to step (800). In step (780), voice training data is received that contains a label indicating whether the voice data of the current sound window contains a particular phonetic feature.
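The second- and third-layer combinations act as probabilistic AND/OR gates: a conjunction multiplies the first-layer probabilities (the noisy-AND form of equation (2)), and a disjunction takes one minus the product of the complements (the noisy-OR form of equation (3)). A small illustrative sketch:

```python
import numpy as np

def conjunction(probs):
    """Second layer, noisy-AND: Pr[Y=1|M] = prod_j Pr[Xj=1|Mj]."""
    return float(np.prod(probs))

def disjunction(probs):
    """Third layer, noisy-OR: Pr[Z=1|M] = 1 - prod_i (1 - Pr[Yi=1|Mi])."""
    return 1.0 - float(np.prod([1.0 - p for p in probs]))
```

A conjunction is high only when every input probability is high, while a disjunction is high when any input is; stacking the two lets the detector fire when any one of several required-evidence patterns is present in the sound window.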
Then, in step (790), a set of weights associated with the first-layer combination operation of step (730) is updated. As mentioned earlier, the exemplary technique can use an expectation-maximization (EM) technique and a maximum-likelihood estimation (MLE) technique to update these weights. However, it should be understood that the specific technique used to update the weights can be changed without departing from the spirit and scope of the present invention, and that such techniques can be any combination of learning techniques usable to train the weights so that a particular device accurately detects/distinguishes each phonetic feature. Operation continues to step (800). In step (800), it is decided whether to stop the procedure. If the procedure is to be stopped, control continues to step (810), at which point the procedure stops; otherwise, control jumps back to step (710), at which time additional voice data is received, so that steps (710-790) can be repeated. The operation can then iteratively execute steps (710-790) until the first-layer weights have been adequately trained, until the available voice data is exhausted, or until other requirements are met. It should be understood that the various systems and methods of the present invention are best implemented on a digital signal processor (DSP) or other integrated circuit. However, these systems and methods can also be implemented using one or more general-purpose computers, special-purpose computers, programmed microprocessors or microcontrollers with peripheral integrated circuit components, hardware electronic or logic circuits such as application-specific integrated circuits (ASICs), discrete-component circuits, or programmable logic devices such as PLDs, PLAs, FPGAs, or PALs.
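The patent updates the first-layer weights with EM and MLE; as a simplified stand-in, the sketch below fits a single first-layer sigmoid unit by gradient ascent on the log-likelihood (one common route to a maximum-likelihood estimate), looping over labeled sound windows much like steps (710)-(790). The learning rate, epoch count, and single-unit scope are assumptions, not the patent's procedure.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_first_layer(data, labels, lr=0.5, epochs=200):
    """Gradient-ascent MLE for one sigmoid unit's weights; the full
    model would also propagate credit through the AND/OR layers."""
    w = [0.0] * len(data[0])
    for _ in range(epochs):                       # iterate steps 710-790
        for x, y in zip(data, labels):            # one labeled sound window
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j in range(len(w)):
                w[j] += lr * (y - p) * x[j]       # d log-likelihood / d w_j
    return w
```

After training on separably labeled windows, the unit's output moves toward 1 for windows labeled as containing the feature and toward 0 otherwise.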
Generally speaking, any device on which a finite state machine capable of implementing the components shown in Figures 1-6 and/or the flowchart shown in Figure 7 can be realized may be used to implement the functions of the feature recognizer (12). Although the invention has been described with reference to certain specific embodiments, many alternatives, modifications, and variations will be apparent to those skilled in the art. Therefore, the preferred embodiments of the invention described herein are intended to be illustrative, and not to limit the invention. Many changes may be made without departing from the spirit and scope of the invention.

Claims (1)

1. A method for processing features in a stream of voice data, comprising the steps of: dividing the stream of voice data into a plurality of critical frequency bands; determining a plurality of parameters of at least one of the plurality of critical frequency bands; and combining the parameters according to at least one of a sum of products, a sigmoid operator, and a Bayesian technique.

2. The method of claim 1, wherein the parameters are combined according to at least one of a sum of products and a sigmoid operator.

3. The method of claim 2, wherein the parameters are combined according to a sum of products and a sigmoid operator.

4. The method of claim 3, wherein the parameters are combined according to a sum of products, a sigmoid operator, and a Bayesian technique.

5. The method of claim 4, wherein the Bayesian technique is conjunctive.

6. The method of claim 5, wherein the Bayesian technique is disjunctive.

7. The method of claim 2, wherein the step of combining the parameters comprises the steps of: multiplying a first set of the plurality of parameters by a set of weights to produce a first set of weighted parameters; forming a first sum of the first set of weighted parameters; and processing the first sum with a sigmoid operator to form a first processed sum.

8. The method of claim 7, wherein the step of combining the parameters comprises the step of: combining the first processed sum with other processed sums using a conjunction operator to produce a first conjunction result.

9. The method of claim 8, wherein the conjunction operator comprises: Pr[Yi = 1 | Mi] = Π_j Pr[Xij = 1 | Mi], where Mi is a parameter vector, Xij represents a processed sum, and Yi represents the conjunction result.

10. The method of claim 8, wherein the step of combining the parameters further comprises the step of: combining the first conjunction result with at least one other conjunction result using a disjunction operator to produce a first disjunction result.

11. The method of claim 8, wherein the disjunction operator comprises: Pr[Z = 1 | M] = 1 - Π_i (1 - Pr[Yi = 1 | Mi]), where Mi is a parameter vector, {M1, M2, ...} represents a set of parameter vectors, Yi represents a conjunction result, and Z is a binary random variable.

12. The method of claim 1, further comprising the step of: updating at least one weight of the set of weights.

13. The method of claim 12, wherein the updating step is based on at least one of an expectation-maximization (EM) technique and a maximum-likelihood estimation (MLE) technique.

14. The method of claim 1, wherein at least one of the plurality of parameters comprises at least one of a signal-to-noise ratio estimate and an autocovariance statistic of a particular frequency band.

15. The method of claim 1, further comprising the step of: performing a nonlinear operation on at least one of the plurality of frequency bands.

16. The method of claim 15, wherein the nonlinear operation is a squaring operation.

17. The method of claim 15, wherein the nonlinear operation is a rectification operation.

18. A device for processing phonetic features, comprising: a front end that receives a stream of voice data, divides the stream of voice data into a plurality of voice data bands, segments each of the plurality of voice data bands into a stream of sound windows, and determines a plurality of parameters for each sound window; and a back end that combines the parameters according to at least one of a sum of products, a sigmoid operator, and a Bayesian technique in order to determine at least one phonetic feature.

19. The device of claim 18, wherein the back end comprises a first-layer combiner that weights a first plurality of parameters with a first set of weights, adds the weighted parameters, and processes the weighted parameters to produce a first result.

20. The device of claim 19, wherein processing the weighted parameters comprises transforming the weighted parameters according to a sigmoid operator.

21. The device of claim 20, wherein the back end further comprises a second-layer combiner that combines the first result with at least one second result using a conjunction operator to produce a first conjunction result.

22. The device of claim 21, wherein the back end further comprises a third-layer combiner that combines the first conjunction result with at least one second conjunction result using a disjunction operator to produce a first disjunction result.

23. The device of claim 22, further comprising a training device that updates the first set of weights based on at least the first disjunction result.

24. The device of claim 23, wherein the training device further updates the first set of weights according to at least one of an expectation-maximization (EM) technique and a maximum-likelihood estimation (MLE) technique.

25. The device of claim 20, wherein the front end comprises a nonlinear device.

26. The device of claim 25, wherein the nonlinear device substantially squares at least one voice data band.

27. The device of claim 25, wherein the nonlinear device substantially rectifies at least one voice data band.
TW089122842A 1999-10-28 2000-10-30 Method and system for detection of phonetic features TW480473B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US16195599P 1999-10-28 1999-10-28

Publications (1)

Publication Number Publication Date
TW480473B true TW480473B (en) 2002-03-21

Family

ID=22583531

Family Applications (1)

Application Number Title Priority Date Filing Date
TW089122842A TW480473B (en) 1999-10-28 2000-10-30 Method and system for detection of phonetic features

Country Status (4)

Country Link
EP (1) EP1232495A2 (en)
CA (1) CA2387091A1 (en)
TW (1) TW480473B (en)
WO (1) WO2001031628A2 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100714721B1 (en) * 2005-02-04 2007-05-04 삼성전자주식회사 Method and apparatus for detecting voice region
US8639510B1 (en) 2007-12-24 2014-01-28 Kai Yu Acoustic scoring unit implemented on a single FPGA or ASIC
US8352265B1 (en) 2007-12-24 2013-01-08 Edward Lin Hardware implemented backend search engine for a high-rate speech recognition system
US8463610B1 (en) 2008-01-18 2013-06-11 Patrick J. Bourke Hardware-implemented scalable modular engine for low-power speech recognition

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02195400A (en) * 1989-01-24 1990-08-01 Canon Inc Speech recognition device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8588427B2 (en) 2007-09-26 2013-11-19 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Apparatus and method for extracting an ambient signal in an apparatus and method for obtaining weighting coefficients for extracting an ambient signal and computer program
TWI426502B (en) * 2007-09-26 2014-02-11 Fraunhofer Ges Forschung Apparatus and method for extracting an ambient signal in an apparatus and method for obtaining weighting coefficients for extracting an ambient signal and computer program

Also Published As

Publication number Publication date
WO2001031628A2 (en) 2001-05-03
CA2387091A1 (en) 2001-05-03
WO2001031628A3 (en) 2001-12-06
EP1232495A2 (en) 2002-08-21

Similar Documents

Publication Publication Date Title
Ghahremani et al. A pitch extraction algorithm tuned for automatic speech recognition
Su et al. Performance analysis of multiple aggregated acoustic features for environment sound classification
EP1083541B1 (en) A method and apparatus for speech detection
EP3839942A1 (en) Quality inspection method, apparatus, device and computer storage medium for insurance recording
JP2009511954A (en) Neural network discriminator for separating audio sources from mono audio signals
EP2031582B1 (en) Discrimination of speaker gender of a voice input
Krijnders et al. Sound event recognition through expectancy-based evaluation ofsignal-driven hypotheses
US20110218803A1 (en) Method and system for assessing intelligibility of speech represented by a speech signal
Xie et al. Copy-move detection of digital audio based on multi-feature decision
CN108831506A (en) Digital audio based on GMM-BIC distorts point detecting method and system
Hoang et al. Blind phone segmentation based on spectral change detection using Legendre polynomial approximation
JP2015516597A (en) Method and apparatus for detecting pitch cycle accuracy
CN113628627A (en) Electric power industry customer service quality inspection system based on structured voice analysis
Azarloo et al. Automatic musical instrument recognition using K-NN and MLP neural networks
Deb et al. Detection of common cold from speech signals using deep neural network
Rahman et al. Dynamic time warping assisted svm classifier for bangla speech recognition
TW480473B (en) Method and system for detection of phonetic features
Yarra et al. A mode-shape classification technique for robust speech rate estimation and syllable nuclei detection
KR100974871B1 (en) Feature vector selection method and apparatus, and audio genre classification method and apparatus using the same
WO2012105386A1 (en) Sound segment detection device, sound segment detection method, and sound segment detection program
Dubey et al. Detection and assessment of hypernasality in repaired cleft palate speech using vocal tract and residual features
Sunija et al. Comparative study of different classifiers for Malayalam dialect recognition system
CN109871540A (en) A kind of calculation method and relevant device of text similarity
Lobdell et al. Intelligibility predictors and neural representation of speech
JP5611232B2 (en) Method for pattern discovery and pattern recognition

Legal Events

Date Code Title Description
GD4A Issue of patent certificate for granted invention patent
MM4A Annulment or lapse of patent due to non-payment of fees