TW200304119A - Voice activity detection (VAD) devices and methods for use with noise suppression systems - Google Patents

Voice activity detection (VAD) devices and methods for use with noise suppression systems Download PDF

Info

Publication number
TW200304119A
TW200304119A TW92104696A TW92104696A TW200304119A TW 200304119 A TW200304119 A TW 200304119A TW 92104696 A TW92104696 A TW 92104696A TW 92104696 A TW92104696 A TW 92104696A TW 200304119 A TW200304119 A TW 200304119A
Authority
TW
Taiwan
Prior art keywords
noise
signal
vad
microphone
sound
Prior art date
Application number
TW92104696A
Other languages
Chinese (zh)
Inventor
Gregory C Burnett
Nicolas J Petit
Alexander M Asseily
Andrew E Einaudi
Original Assignee
Aliphcom Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aliphcom Inc filed Critical Aliphcom Inc
Publication of TW200304119A publication Critical patent/TW200304119A/en

Links

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

Voice Activity Detection devices, systems and methods are described for use with signal processing systems to denoise acoustic signals. Components of a signal processing system and/or VAD system receive acoustic signals and voice activity signals. Control signals are automatically generated from data of the voice activity signals. Components of the signal processing system and/or VAD system use the control signals to automatically select a denoising method appropriate to data of frequency subbands of the acoustic signals. The selected denoising method is applied to the acoustic signals to generate denoised acoustic signals.

Description

200304119 玖、發明說明 ^麗g^Tg明所屬之技術領域、先前技術、內容、實施方式及圖式簡單說明) 本申請案係以下列美國專利申請案主張優先權,包括:2002年3月5日提出 申請,申請案號爲60/362,162的「利用尋徑雜訊抑制之尋徑式語音活動偵測 (PATHFINDER-BASED VOICE ACTIVITY DETECTION (PVAD) USED WITH PATHFINDER NOISE)」;2002年3月5日提出申請,申請案號爲60/362,170的「利 用尋徑雜訊抑制之加速計爲主語音活動偵測(ACCELEROMETER-BASED VOICE ACTIVITY DETECTION (PVAD) WITH PATHFINDER NOISE) ; 2002 年 3 月 5 日提 出申請,申請案號爲60/361,981的「陣列式語音活動偵測與尋徑雜訊抑制 (ARRAY-BASED VOICE ACTIVITY DETECTION (AVAD) AND PATHFINDER NOISE) ; 2002年3月5曰提出申請,申請案號爲60/362,161的「利用外部語音活 動偵測裝置之尋徑雜訊抑制(PATHFINDER NOISE USING AN EXTERNAL VOICE ACTIVITY DETECTION (VAD) DEVICE)」;2002 年 3 月 5 曰提出申請,申請案號 爲 60/362,103 的「加速計爲主之語音活動偵測(Accelerometer-Based Voice Activity Detection)」,以及2002年3月27曰提出申請,申請案號爲60/368,343的「雙麥克 風式以頻率爲主的語音活動偵測」,上述申請案目前均在審核階段。 此外,本申請案與下列美國專利申請案有關,包括2001年7月12日提出申 請,申請案號爲09/905,361的^消除電子訊號的雜訊的方法與裝置(Method and Apparalxis for Removing Noise from Electronic Sign汹」;2002 年 5 月 30 曰提出申請, 申請案號爲10/159,770的「應用聲音與非聲音感測器偵測有無人發聲(DETECTING VOICED AND UNVOICED SPEECH USING BOTH ACOUSTIC AND NONACOUSTIC SENSORS)」,以及2002年11月21日提出申請,申請案號爲10/301,237的「消除 電子訊號的雜訊之方法與裝置(Method and Apparatus for Removing Noise from □續次頁 (發明說明頁不敷使用時,請註記並使用續頁) 200304119 發明說明續頁 lectronic Signals)」。 - 【發明所屬之技術領域】 本發明所揭示的實施例,係與有聲音雜訊的情況下,用以偵測與處理想要的 訊號之系統與方法有關。 C先前技術] 多年來業界發展出許多雜訊抑制演算法和技術,目前大多數用於聲音通訊系 統的雜訊抑制系統,係都以1970年代開發之單一麥克風頻譜頻譜減除技術(spectral subtraction technique)爲主,舉例來說,像是 S. F· Boll 於 IEEE Trans, on ASSP,ρρ· 113-120,1979所發表之「利用頻譜減除技術抑制聲音雜訊(Suppression of Acoustic noise in Speech using Spectral Subtraction)」,這些技術經過多年的改良,但是基本的 操作原則維持不變。比如說美國專利案號5,687,243,發明人爲McLaughlin等,還 有美國專利案號4,811,404,發明人爲Vilmui·等皆爲範例。一般來說,這些技術都 是採用單一麥克風的語音活動偵測器(Voice Activity Detector,VAD)來決定背景雜 訊的特定,其中語音(voice)—詞係包含人類發聲的聲音,或是混雜人聲與無人聲 的聲音。 VAD已應用在數位蜂巢式系統(digital cellular system),如美國專利案號 6,453,291,發明人爲Ashley,其中描述VAD是如何配置在數位蜂巢式系統的前端。 此外,有些碼分部多重存取(Code Division Multiple Access,CDMA)系統應用VAD 將使用的有效射頻頻帶(radio spectrum)降至最低,讓系統容量可以增加。同樣的, 在全球行動通訊系統(GSM)中也可以加入VAD,以降低頻道間干擾(co-channel interference),減少用戶端或付費者裝置的電池功率消耗。 這類典型的單麥克風VAD系統以單一麥克風接收聲音資訊,並利用典型的 訊號處理技術加以分析,因此其功能大幅受到分析結果的限制,特別是,這類單 ]續次頁(發明說明頁不敷使用時,請註記並使用續頁) 200304119 發明說明_頁 麥克風VAD系統已經被指出,在處理低訊噪比(SNR)的訊號’還有背景雜訊變化 快速的環境中效能不彰,因此,這類單麥克風VAD應用在雜訊抑制系統上也會 有類似的限制。 【內容】 以下將描述多種用於適應性雜訊抑制系統的語音活動偵測(Voice Activity Detection, VAD)裝置與方法。另外,以下也將利用這些用做雜訊抑制系統元件的 VAD裝置與方法,特別是根據加州舊金山艾莉芙公司(網址在 http://www.aliph.com)所提供的尋徑雜訊抑制系統(Pathfinder Noise Suppression System)所得的實驗結果加以說明,但是本發明的實施例並不在此限。在以下的敘 述中,若有參考尋徑雜訊抑制系統的情形,要注意的是,可以估計雜訊波型並將 其自訊號中減去,同時使用或可以使用VAD資訊以提供可靠運作的雜訊抑制系 統,也同樣包含在參考範例中。在此是爲了方便提供一個實作範例,用以敘述可 以處理包含欲取出的人聲訊號和雜訊的系統,因此使用尋徑系統作爲參考。 在此敘述之配合雜訊抑制系統所使用的VAD裝置和方法,其中VAD訊號是 獨立於雜訊抑制系統來處理,所以接收與處理VAD資訊的動作也獨立於雜訊抑 制的相關過程,但是本發明的實施例並不在此限。此一獨立性是透過實體的’也 就是用不同的硬體來接收和處理與VAD相關的訊號與進行雜訊抑制;還有透過 處理方式,也就是採用同樣的硬體來接收訊號到雜訊抑制系統,不過是利用獨立 的技術(軟體、演算法、程式常式);以及透過不同硬體與軟體的組合來達成。 在以下的描述中,聲音(acoustic)—般是定義爲透過空間傳送的聲波,利用空 氣以外的媒介所傳遞的聲波將會另行指出。人聲(speech)或語音(voice)—般是指包 含有人發聲(voiced)、無人發聲(unvoiced),以及/或者兩者的組合的人類聲音,在 有需要的時候會加以區別。「雜訊抑制(noise suppression)」一詞通常指在電子訊號 □續次頁(發明說明頁不敷使用時,請註記並使用續頁) 8 發明說明_胃 200304119 中用來降低或消除雜訊的任何方法。 此外,VAD —詞通常定義爲語音在數位或類比範疇出現的向量、或陣列訊 號、資料或某種方式所呈現的資訊。一個通用的VAD資訊是以一位元的數位訊 號代表,取樣速率和對應的聲音訊號相同’其中零數値代表在對應的取樣時間 內。沒有語音產生,而數値爲一(unity)代表在對應的取樣時間有語音產生。雖然 以下所描述的實施例通常是在數位領域’但是同樣的描述也可用於類比的範疇。 在此描述的VAD裝置/方法一般包含振動與動作感測器、聲音感測器、還 有手動VAD裝置,但並不在此限。在一實施例中,加速計放在人體皮膚上,用 | 來偵測與人聲有關的皮膚表面振動,所記錄的振動則用來計算VAD訊號,此VAD . 訊號則可用於或提供適應性雜訊抑制演算法使用,用來抑制和聲音訊號一起(在 幾個微秒ms內)錄進來的環境聲音雜訊,而其中聲音訊號包含聲音與雜訊。 另一個在此揭示的VAD裝置/方法的實施例包含一個修改過附有薄膜 (membrane)的聲音麥克風,讓麥克風不能夠有效地偵測空氣中的聲音振動,不過 這層薄膜可以偵測到與其接觸的物體,比如說人體皮膚,所產生的振動(可達到 良好的機械阻抗匹配)。也就是說,聲音麥克風的被修改成不能偵測空氣中的聲 音振動(也就是不能達到良好的實體阻抗匹配),但是只能偵測到與其接觸的物 · 體的振動,這麼做使得麥克風和加速計(accelerometer)—樣,可以偵測到人體發聲 時所產生的振動,同時又不會偵測到空氣中的環境聲音雜訊,所偵測到的振動經 過處理後會產生VAD訊號以用於雜訊抑制系統,以下將詳細說明。 , 另一實施例所揭不的VAD採用電磁振動感測器(electromagnetic vibration sensor),比方像是用來偵測皮膚振動的射頻振動計(RFvibrometer咸雷射振動計。 另外,射頻振動計可偵測身體內組織像是臉頰或氣管內面的動作,與產生聲音有 關的表皮與內部組織的振動都可以用來產生VAD訊號,以用於雜訊抑制系統, 以下將詳細說明。 ]續次頁(發明說明頁不敷使用時,請註記並使用續頁) 200304119 發明說明_頁 在另一個實施例中所揭示的VAD包含一個聲帶振動測量器(electr〇glottograPh, EGG),可直接偵測聲帶(vocai fold)的活動,EGG是一種使用交流電的方法,可用 來量測聲帶區域,當EGG偵測到足量的聲帶活動,便假設有發聲的動作正在進 行,接著會產生對應的VAD訊號用來代表有人發聲,而VAD訊號可用於雜訊抑 制系統,以下將詳細說明。同樣地,還有一個額外的VAD實施例使用視訊系統 來偵測一個人的發聲器官的動作,顯示的確有聲音產生。 另一種如下所述之VAD裝置/方法,使用一或多個聲音麥克風來接收訊號, 以對應的訊號處理技術,在大多數環境雜訊情況下精確與可靠地產生VAD訊號, 這類實施例包含簡單的陣列,和放置在一起的全指向性和單一指向性聲音麥克風 的組合。在這種VAD的實施例中最簡單的配置就是使用單一麥克風,放置在靠 近講話者嘴巴的地方,以便在相當高的訊噪比下收錄訊號。這個麥克風可以是梯 度麥克風,或者是近距離收音的麥克風。其他的配置方法包括使用單一指向性與 全指向性麥克風,達成各種方向和配置的方式。這類麥克風所接收到的訊號,還 有相關的訊號處理工作,可用來計算VAD訊號,以用於下面所敘述的雜訊抑制 系統,同樣地,以下也將描述由人爲啓動的VAD系統,比如說簡單的無線收發 裝置(walkie-talkie),或者是一名系統觀察者啓動的系統。 如前所述,此處所揭示的VAD裝置與方法係用於雜訊抑制系統,比如說像 是由加州舊金山的艾莉芙公司(Aliph)所推出的尋徑雜訊抑制系統(Pathfinder Noise Suppression System) ^ i^T*#iS^M(Pathfinder system) ° VAD m 置可以在尋徑雜訊抑制系統中找到,熟悉此技藝者應可了解這類VAD裝置與方 法可以用在各種習知的雜訊抑制系統與方法。 尋徑系統係以數位訊號處理(DSP)爲主的聲音雜訊抑制與回音消除 (echo-cancellation)系統。尋徑系統可連接至語音處理系統的前端,利用VAD資訊 與接收到的聲音資訊,估計雜訊波型,並將它自包含語音與雜訊的訊號中減除, D續次頁(發明說明頁不敷使用時,請註記並使用續頁) 119 發明說明續頁 以降低或消除聲音訊號中的雜訊,在以下的詳細說明和相關的申請案中會針對尋 趣系統做進一步的描述。 【鸾施方式】 圖1爲一實施例中訊號處理系統100的方塊圖示,其中包括尋徑雜訊抑制系 铳101和VAD系統102。訊號處理系統100包含2隻麥克風MIC 1 110和MIC 2 112,用以接收來自至少一個人聲訊號源120和至少一個雜訊源122的訊號或資 訊。從人聲訊號源120到MIC 1的路徑s⑻,以及從雜訊源122到MIC 2的路徑 n(n)被視爲單一(unity)。此外,ft⑵代表由雜訊源122到MIC 1的路徑,而H<z) 代袠由人聲訊號源120到MIC 2的路徑。爲了和包含尋徑系統101的訊號處理系 統100相對照,圖2爲訊號處理系統200的方塊圖示,其中包含了傳統的適應性 雜訊消除系統(classical adaptive noise cancellation system)2〇2。 訊號處理系統100的組件,比如說雜訊抑制系統101透過無線、有線、以及 /¾者無線與有線的連結組合接到麥克風MIC 1與MIC 2。同樣地,VAD系統1〇2 和雜訊抑制系統101 —樣,透過無線連結、有線連結、以及/或者無線和有線連 結的組合接到訊號處理系統100的組件。舉例來說,VAD系統102的組件VAD 裝置和麥克風可以利用藍牙(Bluetooth)無線規格和訊號處理系統的其他組件通 訊,但此並非用以限制。 請參考圖1,由VAD系統102所產生的VAD訊號104,不論系統接收到的訊 號具有何種雜訊類型、振幅以及/或者方向,都會控制消除訊號雜訊的動作。當 VAD訊號104顯示沒有人說話時,尋徑系統101會使用MIC 1和MIC 2訊號來計 算在接收到的訊號中預先指定的副頻帶(pre-specified subband)下,轉移函數圧⑵ 的一個模型的係數。當VAD訊號104顯示有人說話時,尋徑系統101就停止更新 H《z),並且開始計算在接收到的訊號中預先指定的副頻帶(pre-specified subband) D續次頁(發明說明頁不敷使用時,請註記並使用續頁) 發明說明續頁 則可以繼續更新該副 200304119 下,轉移函數H2(z)的係數。如果畐!1頻帶的訊噪比(SNR)低, 頻帶的ft係數。在實施例中,尋徑系統皿採用最小均方根値(LMS)技術,有關 此技術可參考B. Widrow與S. Steams的「適應性訊號處理(Adaptive Signal Processing)」一書,由 Prentice-Hall 公司出版,ISBN 書號爲 0-13-004029-0,不過本 發明並不限於此。轉移函數可以根據時域(time domain)、頻域(frequency domain)或 者是時域/頻域的組合中計算而得,尋徑系統101接著會利用轉移函數乩⑵與 ft(z)的組合,將想要得到的聲音訊號中的雜訊移除,因此會產生至少一個消除雜 訊後的聲音訊號流。 尋徑系統可用許多不同的方式完成,但是在這些實施例中均需仰賴精確與可 靠的VAD裝置以及/或者方法。VAD裝置/方法必須精確的原因,是因爲當沒 有人聲時,尋徑系統就需要更新其濾波器(filter)係數。如果在更新係數的過程中 有人聲能量出現,接下來的人聲若具有相似的頻譜特性會被抑制,這不是好的現 象。VAD裝置/方法需要強大的功能,能夠在不同的環境條件下達到高精確度’ 顯然在有些狀況下沒有任何VAD裝置/方法可以提供令人滿意的結果’但是在 正常情況下,VAD裝置/方法應該要能提供最大的雜訊抑制效果,同時對人聲又 不會有太多的負面作用。 在雜訊抑制系統中使用VAD裝置/方法時,VAD訊號是獨立於雜訊抑制系 統來處理,所以接收和處理VAD訊號的過程是獨立於雜訊抑制的處理過程’不 過本發明的實施例並不在此限。此一獨立性是透過實體的,也就是用不同的硬體 來接收和處理與VAD相關的訊號與進行雜訊抑制;透過處理方式,也就是採用 同樣的硬體來接收訊號到雜訊抑制系統,不過利用獨立的技術(軟體、演算法、 程式常式);以及透過不同硬體與軟體的組合來達成’如以下所述。 圖1A係一實施例中的VAD系統102A的方塊圖,其中包含用以接收與處理 與VAD有關的訊號的硬體。VAD系統102A包含VAD裝置130連接到對應的VAD 續次頁(發明說明頁不敷使用時,請註記並使用續頁) 200304119 ^--- 發明說明續頁 演算法140,用以提供資料。請注意在其他實施例中,熟悉此技藝者可以運用各 種已知方式,讓雜訊抑制系統整合一些或所有的VAD演算法和雜訊抑制處理功 能。 圖1B係另一實施例中的VAD系統102B的方塊圖,其利用相關的雜訊抑 制系統101來接收VAD資訊164。VAD系統102B包含接收來自MIC 1與MIC 2 的資料164的VAD演算法150,或者訊號處理系統100中的其他組件。在其他實 施例中,熟悉此技藝者可以運用各種已知方式,讓雜訊抑制系統整合一些或所有 的VAD演算法和雜訊抑制處理功能。 以振動/動作爲牛的VAD裝置/方法 以振動/動作爲主的VAD裝置包含實體硬體裝置,用來接收與處理和VAD 以及雜訊抑制有關的訊號。當一談話者或使用者發出人聲,所產生的振動會穿過 談話者的人體組織傳遞,因此可以利用各種方式來偵測皮膚上或底下的振動。這 類振動可作爲良好的VAD資訊來源,因爲它們和有人發聲和無人發聲的聲音有 相當密切的關係(不過無人發聲時其振動相當微弱,也很難偵測得到),而且一 般受到環境聲音雜訊的影響不大(環境聲音雜訊對一些裝置/方法的影響不大, 或沒有影響),舉例來說,如以下所述的電磁振動感測器。這些組織振動或動作 可以利用幾種VAD裝置,包括加速計爲主的裝置、皮膚表面麥克風(SSM)裝置、 以及包含射頻(RF)振動計與雷射振動計2種的電磁振動計(EM vibrometer)、直接聲 帶動作量測裝置以及視訊偵測裝置等加以偵測。 以加速計爲主的VAD裝置/方法 加速計可偵測與人聲有關的皮膚振動,比如說,在圖1與圖1A中,VAD系 統102A 5令一實施例中可包含以加速計爲主的VAD裝置130,用以提供皮膚振動 [□續次頁(發明說明頁不敷使用時’請註記並使用續頁) 13 200304119 發明說明續頁’ 的資料給相關的演算法140。實施例中所用的演算法使用能量計算技術並比較臨 界値(threshold),以下將會敘述,但不在此限。要注意的是熟悉此技藝者可以採用 更爲複雜的能量計算方法。 圖3係在一實施例中的流程圖300,利用加速計爲主的VAD來決定有無人聲 發音的方法。一般來說,能量是藉由定義標準區間大小(standard window size),然 後計算在一段時間內,振幅的平方的總和,如下所示:200304119 发明, Description of the invention ^ 丽 g ^ Tg 明 Technical field, prior art, content, implementation and drawings are briefly explained) This application claims priority from the following US patent applications, including: March 5, 2002 Filed an application with application number 60 / 362,162 for "PATHFINDER-BASED VOICE ACTIVITY DETECTION (PVAD) USED WITH PATHFINDER NOISE" using path finding noise suppression "; March 5, 2002 Filed an application with application number 60 / 362,170 "ACCELEROMETER-BASED VOICE ACTIVITY DETECTION (PVAD) WITH PATHFINDER NOISE) using path finding noise suppression; filed on March 5, 2002 , Application No. 60 / 361,981 "ARRAY-BASED VOICE ACTIVITY DETECTION (AVAD) AND PATHFINDER NOISE); Application was made on March 5, 2002, and the application number is 60 / 362,161 "PATHFINDER NOISE USING AN EXTERNAL VOICE ACTIVITY DETECTION (VAD) DEVICE)"; March 5, 2002 Issued application, application number 60 / 362,103 "Accelerometer-Based Voice Activity Detection" and application filed on March 27, 2002, application number 60 / 368,343 " "Dual microphone type frequency-based voice activity detection." The above applications are currently under review. In addition, this application is related to the following U.S. patent applications, including an application filed on July 12, 2001, Application No. 09 / 905,361: Method and Apparalxis for Removing Noise from Electronic Sign "; Application was filed on May 30, 2002, with application number 10 / 159,770," DETECTING VOICED AND UNVOICED SPEECH USING BOTH ACOUSTIC AND NONACOUSTIC SENSORS " , And an application filed on November 21, 2002, with Application No. 10 / 301,237, “Method and Apparatus for Removing Noise from (Method and Apparatus for Removing Noise from □ Continued on the next page) Please note and use the continuation page) 200304119 Description of the Invention Continuation page of Lectronic Signals) "-[Technical Field to which the Invention belongs] The embodiments disclosed in the present invention are used for detecting and processing in the presence of sound noise The system and method of the desired signal are related. C Previous technology] Over the years, the industry has developed many noise suppression algorithms and technologies. The noise suppression systems used in sound communication systems are based on the single microphone spectral subtraction technique developed in the 1970s. For example, S. F. Boll in IEEE Trans, on ASSP, ρρ 113-120, 1979, "Suppression of Acoustic noise in Speech using Spectral Subtraction" published by these technologies. These technologies have been improved for many years, but the basic operating principles remain unchanged. For example, U.S. Patent No. 5,687,243, the inventor is McLaughlin, etc., and U.S. Patent No. 4,811,404, the inventor is Vilmui, etc. are examples. Generally, these technologies are voice activity detection using a single microphone Voice Activity Detector (VAD) is used to determine the specificity of background noise. Voice—words contain human voices, or mixed human voices and unvoiced voices. VAD has been applied to digital honeycomb systems ( digital cellular system), such as US Patent No. 6,453,291, and the inventor is Ashley, which describes how VAD is configured in the digital Nested front end system. In addition, some Code Division Multiple Access (CDMA) systems use VAD to minimize the effective radio frequency spectrum (radio spectrum) used, so that the system capacity can be increased. Similarly, VAD can be added to the Global System for Mobile Communications (GSM) to reduce co-channel interference and reduce the battery power consumption of the client or payer device. This type of typical single-microphone VAD system uses a single microphone to receive sound information and analyze it using typical signal processing techniques. Therefore, its functions are greatly limited by the analysis results. In particular, this type of mono] (Please note and use the continuation sheet when using it.) 200304119 Description of the invention _ Page Microphone VAD system has been pointed out, in low-noise ratio (SNR) signals' and background noise changes quickly in the environment, performance is not good, so This kind of single-microphone VAD application will have similar restrictions on noise suppression systems. [Content] The following will describe a variety of voice activity detection (VAD) devices and methods for adaptive noise suppression systems. In addition, the following VAD devices and methods used as noise suppression system components will also be used, especially according to the path-finding noise suppression provided by the California San Francisco Elif Corporation (website at http://www.aliph.com) The experimental results obtained by the system (Pathfinder Noise Suppression System) will be described, but the embodiment of the present invention is not limited thereto. In the following description, if there is a case of a path finding noise suppression system, it should be noted that the noise waveform can be estimated and subtracted from the signal, and VAD information can be used at the same time to provide reliable operation. The noise suppression system is also included in the reference example. Here is to provide a practical example to describe the system that can process the vocal signals and noises to be extracted, so the path finding system is used as a reference. The VAD device and method used in conjunction with the noise suppression system described here, in which the VAD signal is processed independently of the noise suppression system, so the actions of receiving and processing VAD information are also independent of the related process of noise suppression. The embodiments of the invention are not limited thereto. This independence is through physical 'that is, different hardware is used to receive and process VAD-related signals and perform noise suppression; and through processing methods, that is, the same hardware is used to receive signals to noise. Suppression systems are only achieved by using independent technologies (software, algorithms, routines); and through a combination of different hardware and software. In the following description, acoustic (acoustic) is generally defined as sound waves transmitted through space. Sound waves transmitted by media other than air will be pointed out separately. Voice or voice—generally, human voices that include human voices, unvoiced voices, and / or a combination of the two are distinguished when needed. The term "noise suppression" usually refers to the electronic signal □ continuation page (when the description page of the invention is insufficient, please note and use the continuation page) 8 Description of the invention _ stomach 200304119 is used to reduce or eliminate noise Any method. In addition, VAD — words are usually defined as vectors, or array signals, data, or information presented in some way, in the digital or analogue category of speech. A general-purpose VAD information is represented by a one-bit digital signal, and the sampling rate is the same as the corresponding audio signal ', where the zero number 値 represents within the corresponding sampling time. No speech is generated, and the unity of unity indicates that speech was generated at the corresponding sampling time. Although the embodiments described below are generally in the digital domain ', the same description can be applied to the analogy category. The VAD devices / methods described herein generally include, but are not limited to, vibration and motion sensors, sound sensors, and manual VAD devices. In one embodiment, the accelerometer is placed on the human skin, and | is used to detect the vibration of the skin surface related to human voice. The recorded vibration is used to calculate the VAD signal. This VAD. Signal can be used to provide adaptive noise. The signal suppression algorithm is used to suppress the ambient sound noise recorded with the sound signal (within a few microseconds ms), and the sound signal contains sound and noise. Another embodiment of the VAD device / method disclosed herein includes a modified microphone with a membrane (membrane), which prevents the microphone from effectively detecting sound vibrations in the air, but this layer of membrane can detect Vibration generated by contacting objects, such as human skin (good mechanical impedance matching). In other words, the sound microphone is modified so that it cannot detect sound vibrations in the air (that is, it cannot achieve good physical impedance matching), but can only detect vibrations of objects and bodies in contact with it. Doing so makes the microphone and Accelerometer-like, it can detect the vibration generated by the human body, but it will not detect the ambient sound noise in the air. After the detected vibration is processed, it will generate a VAD signal for use. The noise suppression system will be described in detail below. The VAD disclosed in another embodiment uses an electromagnetic vibration sensor, such as a radio frequency vibrometer (RFvibrometer) for detecting skin vibration. In addition, the radio frequency vibrometer can detect Measure the internal tissues of the body such as the movement of the cheeks or the inner surface of the trachea. Both the epidermis and the internal tissue vibrations related to sound generation can be used to generate VAD signals for use in noise suppression systems, which will be described in detail below.] Continued (When the description page of the invention is insufficient, please note and use the continuation page) 200304119 Invention description_ The VAD disclosed in another embodiment includes a vocal cord vibration measuring device (electrglottograPh, EGG), which can directly detect the vocal cord (vocai fold) activity, EGG is a method using alternating current, which can be used to measure the vocal fold area. When EGG detects a sufficient amount of vocal fold activity, it is assumed that a vocal action is in progress, and a corresponding VAD signal will be generated. To represent someone, and the VAD signal can be used in noise suppression systems, which will be described in detail below. Similarly, there is an additional VAD embodiment to use The signal system detects the movement of a person's vocal organs, and it does show that sound is generated. Another VAD device / method described below uses one or more sound microphones to receive signals, and uses corresponding signal processing technologies. In most cases, VAD signals are accurately and reliably generated in the presence of environmental noise. Such embodiments include a simple array and a combination of omnidirectional and single-directional sound microphones placed together. The simplest of these VAD embodiments The configuration is to use a single microphone, placed near the speaker's mouth, in order to record the signal at a relatively high signal-to-noise ratio. This microphone can be a gradient microphone, or a microphone for close-range radio reception. Other configuration methods include the use of a single pointing Omnidirectional and omnidirectional microphones to achieve various directions and configurations. The signals received by such microphones and related signal processing tasks can be used to calculate VAD signals for use in the noise suppression system described below. Similarly, the following will describe the VAD system initiated by humans, such as simple Transceiver (walkie-talkie), or a system started by a system observer. As mentioned earlier, the VAD devices and methods disclosed here are used in noise suppression systems, such as Airy by San Francisco, California Pathfinder Noise Suppression System by Aliph ^ i ^ T * # iS ^ M (Pathfinder system) ° VAD m settings can be found in the path finding noise suppression system, familiar with this The skilled artisan should understand that such VAD devices and methods can be used in various conventional noise suppression systems and methods. The path finding system is a digital signal processing (DSP) -based sound noise suppression and echo-cancellation system. The path-finding system can be connected to the front end of the speech processing system, use VAD information and received sound information to estimate the noise waveform, and subtract it from the signal containing speech and noise. When the page is inadequate, please note and use the continuation page.) 119 Description of the Invention Continuation pages are used to reduce or eliminate noise in the sound signal. The following detailed description and related applications will further describe the fun-seeking system. [Methods] FIG. 1 is a block diagram of a signal processing system 100 according to an embodiment, which includes a path finding noise suppression system 铳 101 and a VAD system 102. The signal processing system 100 includes two microphones, MIC 1 110 and MIC 2 112, for receiving signals or information from at least one human voice signal source 120 and at least one noise source 122. The path s⑻ from the human voice signal source 120 to the MIC 1 and the path n (n) from the noise source 122 to the MIC 2 are considered to be unity. In addition, ft⑵ represents a path from the noise source 122 to the MIC 1, and H < z) substitutes a path from the human voice signal source 120 to the MIC 2. For comparison with the signal processing system 100 including the path finding system 101, FIG. 2 is a block diagram of the signal processing system 200, which includes a conventional adaptive noise cancellation system 202. The components of the signal processing system 100, such as the noise suppression system 101, are connected to the microphones MIC 1 and MIC 2 through a combination of wireless, wired, and wireless and wired connections. Similarly, the VAD system 102 and the noise suppression system 101 are connected to the components of the signal processing system 100 through a wireless link, a wired link, and / or a combination of wireless and wired links. For example, the components of the VAD system 102, the VAD device and the microphone, can communicate using the Bluetooth wireless specification and other components of the signal processing system, but this is not a limitation. Please refer to FIG. 1. The VAD signal 104 generated by the VAD system 102 controls the action of eliminating signal noise regardless of the noise type, amplitude, and / or direction of the signal received by the system. When the VAD signal 104 shows that no one is speaking, the path finding system 101 uses the MIC 1 and MIC 2 signals to calculate a model of the transfer function 下 in the pre-specified subband specified in the received signal. Coefficient. When the VAD signal 104 shows that someone is speaking, the path finding system 101 stops updating H «z and starts to calculate the pre-specified subband D pre-specified in the received signal. (Please note and use the continuation page when using it.) Description of the Invention Continuation page can continue to update the coefficient of the transfer function H2 (z) under this pair of 200304119. If 畐! The signal-to-noise ratio (SNR) of the 1 band is low, and the ft coefficient of the band is low. In the embodiment, the path-finding system uses the Least Mean Square Root (LMS) technology. For this technology, please refer to the book "Adaptive Signal Processing" by B. Widrow and S. Steams. Published by Hall Corporation, ISBN book number 0-13-004029-0, but the present invention is not limited thereto. The transfer function can be calculated based on the time domain, frequency domain, or a time / frequency domain combination. The path-finding system 101 then uses the combination of the transfer function 乩 ⑵ and ft (z). The noise in the desired sound signal is removed, so at least one noise signal stream is generated after noise removal. The path finding system can be accomplished in many different ways, but in these embodiments all rely on accurate and reliable VAD devices and / or methods. The reason why the VAD device / method must be accurate is because the path-finding system needs to update its filter coefficients when there is no human voice. If someone's vocal energy appears during the process of updating the coefficients, the subsequent vocals will be suppressed if they have similar spectral characteristics, which is not a good thing. VAD devices / methods need powerful functions to be able to achieve high accuracy under different environmental conditions. 'Obviously there are no VAD devices / methods that can provide satisfactory results in some situations' but under normal circumstances, VAD devices / methods It should be able to provide the largest noise suppression effect, while not having too much negative effects on human voice. When a VAD device / method is used in a noise suppression system, the VAD signal is processed independently of the noise suppression system, so the process of receiving and processing the VAD signal is independent of the noise suppression processing process. However, the embodiment of the present invention does not Not in this limit. This independence is through physical, that is, different hardware is used to receive and process VAD-related signals and perform noise suppression; through processing, that is, the same hardware is used to receive signals to the noise suppression system , But using independent technologies (software, algorithms, routines); and using a combination of different hardware and software to achieve 'as described below. Fig. 1A is a block diagram of a VAD system 102A according to an embodiment, which includes hardware for receiving and processing VAD-related signals. The VAD system 102A includes a VAD device 130 connected to the corresponding VAD continuation page (when the description page of the invention is insufficient, please note and use the continuation page) 200304119 ^ --- description page of the invention Algorithm 140 for providing information. Please note that in other embodiments, those skilled in the art can use various known methods to allow the noise suppression system to integrate some or all of the VAD algorithms and noise suppression processing functions. FIG. 1B is a block diagram of a VAD system 102B in another embodiment, which uses a related noise suppression system 101 to receive VAD information 164. The VAD system 102B includes a VAD algorithm 150 that receives data 164 from MIC 1 and MIC 2 or other components in the signal processing system 100. In other embodiments, those skilled in the art can use various known methods to let the noise suppression system integrate some or all of the VAD algorithms and noise suppression processing functions. VAD devices / methods that use vibration / action as the cattle VAD devices that use vibration / action as the main component include physical hardware devices to receive signals related to processing and VAD and noise suppression. When a talker or user makes a human voice, the generated vibration is transmitted through the talker's body tissue, so various ways can be used to detect the vibration on or under the skin. This type of vibration can be a good source of VAD information, because they have a close relationship with the sounds of people uttering and no sound (however, the vibration is very weak and difficult to detect when there is no sound), and it is generally affected by environmental sounds. The effect of environmental noise is not large (some or no effect on some devices / methods), for example, an electromagnetic vibration sensor as described below. These tissue vibrations or movements can use several VAD devices, including accelerometer-based devices, skin surface microphone (SSM) devices, and electromagnetic vibrometers (EM vibrometers) including radio frequency (RF) vibrometers and laser vibrometers. ), Direct vocal cord motion measurement device and video detection device. Accelerometer-based VAD device / method The accelerometer can detect skin vibrations related to human voice. For example, in Figure 1 and Figure 1A, the VAD system 102A 5 allows an embodiment to include The VAD device 130 is used to provide the skin vibration [□ Continued pages ('Notes and use of continuation pages when the Instruction Sheet is not enough) 13 200304119 Inventive Instructions Continuation Page' to the relevant algorithm 140. The algorithm used in the embodiment uses energy calculation technology and compares thresholds, which will be described below, but not limited to this. It should be noted that those skilled in the art can use more complex energy calculation methods. FIG. 3 is a flowchart 300 of an embodiment, a method for determining whether or not a human voice is produced using an accelerometer-based VAD. Generally, the energy is defined by the standard window size, and then the sum of the squared amplitudes over a period of time is calculated as follows:

Energy = J]x12, i 其中i是數位取樣的標號,並且從範圍的起點開始到終點結束。 0 參考圖3,一開始在302區塊接收加速計資料。在區塊304進行與VAD相關 · 的處理’包括過濾來自加速計的資料,以預先消除鬼影(aliasing),並將資料數位 化與過濾。在區塊306分割資料爲長度20微秒(ms)的區間,而資料以每δ微秒一 個步級(step)呈現。在區塊308內進一步處理區間內已分割的資料,移除被雜訊干 擾的頻譜資訊或其他不要的資訊。在區塊310內,利用上述計算振幅的平方的總 和,算出每個區間內的能量,計算出來的能量値可以除以區間長度以便正規化, 不過這牽涉到額外的計算,如果每個區間的長度不變,也就沒必要處理。 經過計算或正規化的能量値,在區塊312與臨界値(threshold value)比較。在區春 塊314,當加速計資料的能量等於或高於臨界値,將加速計資料所對應的聲音標 示爲有人發聲。同樣地,在區塊316,當加速計資料的能量等於或低於臨界値, 將加速計資料所對應的聲音標示爲無人發聲。在其他實施例中,雜訊抑制系統可 . 以利用多重臨界値來標示人聲訊號的相對強度與準確度,但不在此限。另外也可 _ 以採用多重副頻帶的處理方式來提高精確度。 圖4所示爲在一實施例中的輸出圖表,係有雜訊的聲音訊號(現場錄音)402 加上對應的加速計爲主的VAD訊號404、對應的加速計輸出訊號412、還有尋徑 系統利用VAD訊號404來消除聲音訊號422的雜訊。在此一範例中,在500至 []續次頁(發明說明頁不敷使用時’請註記並使用續頁) 14 200304119 - 發明說明,續頁 2500 Hz之間的加速計資料被帶通(bandpass)濾掉,以便移除低於500Hz會耦合至 加速器的聲音雜訊。聲音訊號402則是利用艾莉芙公司(Aliph)的麥克風組和標準 的加速計,在一間6英尺長,天花板高度8英尺高的房間內錄音,環境中有嘈雜 的噪音。尋徑系統是採用即時處理的方式,延遲約l〇ms,由原始聲音訊號402和 消除雜訊後的聲音訊號422之間的差別可以看出,雜訊抑制大約在25至30 dB之 間,人聲訊號的失真不大,因此,以加速計提供VAD資訊可以有效地消除雜訊。 皮膚表面(SSM)爲牛的VAD裝置/方法 · 回到圖1與圖1A,在一實施例中VAD系統102A包含SSM VAD裝置130, . 用以提供資料給相關的演算法140。SSM是利用傳統的麥克風改良,可避免經由 · 空氣傳導的聲音資訊跑進麥克風的偵測元件中,麥克風上有一層矽膠(silicone gel) 或其他可以改變麥克風阻抗的附著物質,避免空氣傳導的聲音訊息被明顯的偵測 到。因此麥克風不會受到空氣傳導的能量影響,不過仍可以偵測到經由空氣以外 媒介所傳來的聲波,只要一直保持與媒介接觸就可偵測得到。爲了要有效偵測到 人體皮膚上的聲音能量,矽膠(gd)的阻抗會匹配皮膚的機械阻抗特性。 講話時,SSM放在臉頰或頸部上,因此SSM可以輕易偵測到發聲所產生的 · 振動,而且SSM不會很輕易就偵測到由空氣傳來的聲音資料。由組織傳來的聲音 訊號被SSM偵測到以後,就用來產生VAD訊號,拿來處理與消除聲音訊號的雜 訊,如上所述並可參考加速計產生的VAD訊號與圖;3 〇 ‘ 圖5所示爲在一實施例中,有雜訊的聲音訊號(現場錄音)502加上對應的 · SSM爲主的VAD訊號504,對應的SSM輸出訊號512,還有尋徑系統利用VAD 訊號504來消除聲音訊號522的雜訊的圖表。在此一範例中,在500至2500 Hz 之間的加速計資料被帶通濾掉,以便移除低於500Hz會耦合至加速器的聲音雜 訊。聲音訊號502則是利用Aliph的麥克風,裝置在一間6英尺長,天花板高度8 ]續次頁(發明說明頁不敷使用時,請註記並使用續頁) 15 200304119 - 發明說明_頁 英尺的房間內錄音,環境中有嘈雜的噪音。尋徑系統是採用即時處理的方式,延 遲約10ms,由原始聲音訊號502和消除雜訊後的聲音訊號522之間的差別可以看 出,雜訊抑制大約在25至30 dB之間,人聲訊號的失真不大,因此,以SSM提 供VAD資訊可以有效地消除雜訊。 雷磁(EM)振動計爲主的VAD裝置/方法Devices/Methods 回到圖1與圖1A,在一實施例中VAD系統102A包含電磁振動計(EM vibrometer)的VAD裝置130,用以提供資料給相關的演算法140 〇電磁振動計同樣 · 可偵測組織振動,但是可以在不接觸組織的情況下,相隔一個距離偵測組織振 . 動,此外,電磁振動計可以偵測人體內部組織的振動。電磁振動計不受聲音雜訊 _ 干擾,因此適用於高度吵雜的環境中。在本實施例中,尋徑系統接收來自電磁振 動計的資訊,電磁振動計包括RF振動計(RF vibrometer)與雷射振動計(Laser vibrometer),但不在此限,以下將逐一描述。 RF振動計的操作範圍在電磁頻譜的射頻至微波之間,可以量測發聲時人體 內部組織的相對活動,內部的人體組織包括氣管、臉頰、下顎,以及/或者鼻子 /鼻腔,但不在此限。RF振動計可利用低功率無線電波偵測動作,而這些裝置· 所提供的資料吻合經調校過的目標,由於RF振動計不會偵測到聲音雜訊,因此 在實施例中的尋徑系統採用這些裝置所提供的訊號,參考上述加速計爲主的VAD 與圖3的能量/臨界値方法,以建構VAD。 . 由加州舊金山的Aliph公司所推出的聲帶電磁動作感測(GEMS)無線電振動 ^ 計(radiovibrometer),即爲一種RF振動計,其他的RF振動計可參考相關申請案以 及加州大學戴維斯分校的Gregory C. Burnett於1999年1月所發表之博士論文:「聲 帶電磁微功率感測器的生理學基礎及其定義人類發聲道的激發作用的用途探討 (The Physiological Basis of Glottal Electromagnetic Micropower Sensors (GEMS) and 〇續次頁(發明說明頁不敷使用時,請註記並使用續頁) 200304119 -—Π 發明說明編頁Energy = J] x12, i where i is the label of the digital sample and starts at the beginning of the range and ends at the end. 0 Referring to FIG. 3, the accelerometer data is initially received in block 302. The VAD-related processing at block 304 includes filtering the data from the accelerometer to eliminate aliasing in advance, and digitizing and filtering the data. In block 306, the data is divided into intervals of 20 microseconds (ms), and the data is presented in steps of δ microseconds. The divided data in the interval is further processed in block 308 to remove spectrum information or other unwanted information that is disturbed by noise. In block 310, the sum of the squares of the calculated amplitudes is used to calculate the energy in each interval. The calculated energy 値 can be divided by the interval length for normalization, but this involves additional calculations. The length does not change, so there is no need to deal with it. The calculated or normalized energy chirp is compared with a threshold value at block 312. In Block 314, when the energy of the accelerometer data is equal to or higher than the critical threshold, the sound corresponding to the accelerometer data is marked as someone speaking. Similarly, in block 316, when the energy of the accelerometer data is equal to or lower than the critical threshold, the sound corresponding to the accelerometer data is marked as unvoiced. In other embodiments, the noise suppression system may use multiple critical thresholds to indicate the relative strength and accuracy of the human voice signal, but not limited to this. In addition, you can also use multiple sub-band processing methods to improve accuracy. Figure 4 shows an output chart in an embodiment, which is a noise sound signal (live recording) 402 plus a corresponding accelerometer-based VAD signal 404, a corresponding accelerometer output signal 412, and The path system uses the VAD signal 404 to eliminate noise from the audio signal 422. In this example, when the 500 to [] continuation page is inadequate (please note and use the continuation page when the invention description page is insufficient) 14 200304119-invention description, the accelerometer data between 2500 Hz on the continuation page is bandpassed ( bandpass) to remove sound noise below 500Hz that would couple to the accelerator. The audio signal 402 was recorded in a 6-foot-long, 8-foot ceiling room using Aliph's microphone set and a standard accelerometer. There was noisy noise in the environment. The path finding system uses a real-time processing method with a delay of about 10ms. It can be seen from the difference between the original sound signal 402 and the noise-removed sound signal 422 that the noise suppression is about 25 to 30 dB. The distortion of the human voice signal is not large, therefore, providing VAD information with an accelerometer can effectively eliminate noise. The VAD device / method of the skin surface (SSM) is bovine. Back to FIG. 1 and FIG. 1A, in one embodiment, the VAD system 102A includes an SSM VAD device 130, which is used to provide information to the relevant algorithm 140. SSM is a modification of the traditional microphone, which can prevent the sound information transmitted through the air from running into the microphone's detection element. The microphone has a layer of silicone gel or other attached substances that can change the impedance of the microphone to avoid air-borne sound. The message was clearly detected. Therefore, the microphone will not be affected by the energy transmitted by the air, but it can still detect the sound waves transmitted through the medium other than air, as long as it is always in contact with the medium. In order to effectively detect the sound energy on the human skin, the impedance of the silicone (gd) will match the mechanical impedance characteristics of the skin. When speaking, the SSM is placed on the cheek or neck, so the SSM can easily detect the vibration caused by the sound, and the SSM cannot easily detect the sound data from the air. After the sound signal transmitted by the organization is detected by the SSM, it is used to generate a VAD signal, which is used to process and eliminate the noise of the sound signal. As mentioned above, the VAD signal and map generated by the accelerometer can be referred to; 3 〇 ' Figure 5 shows an example of a noisy sound signal (live recording) 502 plus a corresponding SSM-based VAD signal 504, a corresponding SSM output signal 512, and a path finding system using a VAD signal 504 to eliminate the noise of the sound signal 522 graph. In this example, the accelerometer data between 500 and 2500 Hz is bandpass filtered to remove acoustic noise below 500 Hz that would couple to the accelerator. The audio signal 502 uses Aliph's microphone, which is installed in a room 6 feet long and the ceiling height is 8] Continued pages (when the description page of the invention is insufficient, please note and use the continued page) 15 200304119-Description of the invention _ page feet Recording in the room, there is noisy noise in the environment. The path-finding system uses a real-time processing method with a delay of about 10ms. It can be seen from the difference between the original sound signal 502 and the noise-removed sound signal 522 that the noise suppression is about 25 to 30 dB, and the human voice signal The distortion is small, so VAD information provided by SSM can effectively eliminate noise. VAD devices / methods based on thunder (EM) vibrometers Back to FIG. 1 and FIG. 1A, in one embodiment the VAD system 102A includes a VAD device 130 of an electromagnetic vibrometer (EM vibrometer) for providing information The relevant algorithm is 140. The electromagnetic vibration meter can also detect tissue vibration, but can detect tissue vibration at a distance without touching the tissue. In addition, the electromagnetic vibration meter can detect the internal tissue of the human body. vibration. The electromagnetic vibrometer is immune to sound noise _ and is therefore suitable for use in highly noisy environments. In this embodiment, the path finding system receives information from an electromagnetic vibrometer. The electromagnetic vibrometer includes an RF vibrometer and a laser vibrometer, but it is not limited to this, which will be described one by one below. The operating range of the RF vibrometer is between the radio frequency and the microwave of the electromagnetic spectrum. It can measure the relative movement of the internal tissues of the human body during vocalization. The internal human tissues include trachea, cheeks, jaw, and / or nose / nasal cavity, but not limited to this. . The RF vibrometer can use low-power radio waves to detect actions, and the data provided by these devices match the adjusted target. Since the RF vibrometer does not detect sound noise, the path finding in the embodiment is The system uses the signals provided by these devices to construct the VAD by referring to the accelerometer-based VAD and the energy / critical threshold method shown in Figure 3. The vocal cord electromagnetic motion sensing (GEMS) radio vibrometer (radiovibrometer) introduced by Aliph Corporation of San Francisco, California, is an RF vibrometer. For other RF vibrometers, please refer to relevant applications and the University of California, Davis Gregory C. Burnett's doctoral dissertation published in January 1999: `` The Physiological Basis of Glottal Electromagnetic Micropower Sensors (The Physiological Basis of Glottal Electromagnetic Microsensors) GEMS) and 〇 Continued pages (If the description page of the invention is insufficient, please note and use the continued page) 200304119-Π Page of the description of the invention

Their Use in Defining an Excitation Function for the Human Vocal Tract)」。 雷射振動計的操作範圍在靠近可見光或可見光的範圍內,因此只能偵測表面 的振動,和上述的加速計或SSM相似。雷射振動計和RF振動計一樣不會偵測到 聲音雜訊,因此在實施例中的尋徑系統採用這些裝置所提供的訊號,參考上述力口 速計爲主的VAD和圖3的能量/臨界値方法,用來建構VAD。 圖6所示爲在一實施例中的圖表,有雜訊的聲音訊號(現場錄音)602加上 對應的GEMS爲主的VAD訊號604,對應的GEMS輸出訊號612,還有尋徑系統 利用VAD訊號604來消除聲音訊號622的雜訊。由加州舊金山的Aliph公司所提 _ 供,裝置在氣管上的GEMS無線電振動計發出VAD訊號604,而聲音訊號602則 _ 是利用Aliph的麥克風,裝置在一間6英尺長,天花板高度8英尺的房間內錄音, 環境中有嘈雜的噪音。尋徑系統是採用即時處理的方式,延遲約10ms,由原始聲 音訊號602和消除雜訊後的聲音訊號622之間的差別可以看出,雜訊抑制大約在 20至25分貝(dB)之間,人聲訊號的失真不大,因此,採用以GEMS爲主的VAD 資訊可以有效地消除雜訊。顯然,即使GEMS沒有偵測到無人發聲(unvoiced speech) 的聲音,VAD訊號和消除雜訊的動作仍相當有效,由於無人發聲的聲音能量都相 當低,因此並不會明顯地影響到乩⑵的收歛,以及消除雜訊後的人聲品質。 · 直接聲帶動作量測的VAD裝置/方法 參照圖1與圖1A,在一實施例中VAD系統102A包含直接聲帶動作量測 (direct glottal motion measurement)的VAD裝置130,用以提供資料給相關的演算法 140。在實施例中,尋徑系統的直接聲帶動作量測的VAD裝置包含聲帶振動測i 器(EGG),以及任何可直接量測聲帶活動或位置的裝置。EGG利用放在甲狀腺軟 骨(thyroid cartilage)側面的二個或更多個電極,傳回聲帶接觸區域的對應訊號’從 一或更多個電流所傳送的少量交流電流,會經過頸部組織(包括聲帶)傳遞至頸 □續次頁(發明說明頁不敷使用時,請註記並使用續頁) 17 200304119 發明說明續頁 部另一側的其他電極上,如果聲帶皺摺彼此碰觸,則由一邊的電極流至另一邊的 電極的電流量會增加,如果聲帶皺摺間不彼此碰觸,那電流量就會減少,如電磁 振動計和SSM的情況相同,EGG的訊號不受聲音雜訊干擾,因此在本實施例中, 尋徑系統接收來自EGG的資訊,參考上述加速計爲主的VAD和圖3的能量/臨 界値方法,用來建構VAD〇 圖7所示爲在一實施例中的輸出圖,已錄的英語發聲男性的聲音資料702加 上數位加入的雜訊,還有對應的EGG爲主的VAD訊號704,以及對應的高通濾 波EGG輸出訊號712。若比較聲音資料702和EGG輸出訊號712可知,EGG在籲 偵測有人發聲的聲音時相當精確,不過當聲帶皺摺彼此不碰觸,比如說無人發聲 . 或非常輕柔的說話聲時,EGG就不能偵測到。不過在實驗中,無法偵測到無人發 聲或非常輕柔的說話聲(兩者的能量都很低)的問題,並不會明顯地影響到一般 環境下系統消除人聲雜訊的能力,有關EGG的更多資訊可參照D.G. Childers與 A. K. Krishnamurthy 於 1985 年在 CRC Crit Rev Biomedical Engineering 期刊上發表 之「有關聲帶電學的重要分析(A Critical Review of Electroglottography)」一文,第 12 期,131 至 161 頁。 • 視訊偵測爲牛的VAD裝置/方法 參照圖1與圖1A,在一實施例中VAD系統102A包含視訊偵測的VAD裝置 130,用以提供資料給相關的演算法140 〇在實施例中,視訊攝影機和處理系統偵 測包括下顎、嘴唇、牙齒與舌頭等發聲器官的動作。目前在開發中的視訊與電腦 系統支援二維的電腦影像,因此可以提供視訊偵測爲主的VAD系統,有關此類 型系統的建構工具可以上網至 http://wwwjnteLcom/research/mrl/research/opencv/ 查詢。 在實施例中,尋徑系統可以利用視訊系統的組件來偵測發聲器官的動作,以 ^續次頁(發明說明頁不敷使用時,請註記並使用續頁) 18 200304119 發明說明I賣Μ 產生VAD資訊。圖8所示爲在一實施例中,利用視訊爲主的VAD來決定有人發 聲的方法的流程圖800。在區塊802,視訊系統的組件會找出使用者的臉和發聲器 官,在區塊804,計算出發聲器官的動作。在區塊806,視訊系統的組件以及/或 者尋徑系統會決定所計算出來的發聲器官動作是否快於臨界速度而且在振動(發 聲器官來回移動,可分辨得出來不是簡單的偶然動作)。如果動作比臨界速度慢 而且/或者不是在振動,如果動作比臨界速度慢而且/或者不是在振動,就回到 區塊802的動作。 如果在區塊806,動作快於臨界速度而且在振動,那麼在區塊808,視訊系統 φ 的組件以及/或者尋徑系統會決定發聲器官動作是否大於臨界値。如果動作大於 . 臨界値,在區塊810,視訊系統的組件以及/或者尋徑系統決定目標正在發聲, 然後在區塊812,將資訊傳遞至尋徑系統。這個以視訊爲主的VAD裝置/方法不 受聲音雜訊的影響,而且可以在離使用者或談話者一段距離的地方使用,因此相 當適合用於監測的場合。 聲咅資訊爲主的VAD奘置/方法 參照以上對圖1與圖1B的說明,在雜訊抑制系統中使用VAD時,其中VAD · 訊號是獨立於雜訊抑制系統來處理,所以接收與處理VAD資訊的動作也獨立於 雜訊抑制的相關處理過程,但是本發明的實施例並不在此限。以聲音資訊爲主的 VAD裝置的獨立性是透過處理方式,也就是可採用同樣的硬體來接收訊號到雜訊 . 抑制系統,不過利用獨立的技術(軟體、演算法或程式常式)來處理接收到的訊 號,不過在某些情況下,聲音麥克風是用來建構VAD功能,而非用來抑制雜訊。 在實施例中,聲音資訊爲主的VAD裝置/方法需依賴一或多個傳統的聲音 麥克風來偵測所要接收的人聲,因此比較容易受到環境聲音雜訊的干擾,也不能 在所有的雜訊環境中可靠的運作,不過,以聲音資訊爲主的VAD具有簡單、便 ]續次頁(發明說明頁不敷使用時,請註記並使用續頁) 19 200304119 - 發明說明續頁 宜的優點,而且同樣的麥克風可以用作VAD和接收聲音訊號的用途,因此,在 成本比消除雜訊的效能要重要的場合,這類VAD解決方案會比較受歡迎,在實 施例中,聲音資訊爲主的VAD裝置/方法,包含單麥克風VAD、尋徑VAD、立 體聲VAD (SVAD)、陣列VAD (AVAD),以及其他單麥克風的傳統VAD裝置/方 法,但不在此限,以下將詳細描述。 單麥克風爲主的VAD裝置/方法 這也許是偵測使用者是否有說話的最簡單的方法,請參照圖1與圖1B,在 φ 一實施例中VAD系統102B包含VAD演算法150,用以接收來自對應的訊號處理 . 系統100的麥克風所傳來的資料164,麥克風(通常是「近距離(close-talk)」或是 梯度麥克風)至於離使用者嘴巴相當近的地方,有時候直接碰觸到嘴唇。梯度麥 克風對於距離麥克風幾公分以上的聲音就變得很不敏感(而所接收的頻率範圍通 常小於ΙΚΗζ),因此可以用來記錄具有相當高SNR的訊號。當然了,單麥克風所 會g達到的效能和使用者嘴巴到麥克風的距離、環境雜訊的惡劣程度,以及使用者 是否願意把東西放在靠嘴唇那麼近的地方,都會有影響。由於來自近距離單麥克 風所記錄的資料或訊號有至少一部分的頻譜具有相當高的SNR,因此在實施例 · 中,尋徑系統接收來自來自單麥克風的訊號,參考上述加速計爲主的VAD和圖3 的能量/臨界値方法,用來建構VAD 〇 圖9所示爲在一實施例中的圖表,有雜訊的聲音訊號902 (現場錄音)加上 -對應的單一(梯度)麥克風爲主的VAD訊號904,對應的梯度麥克風(gradient * microphone)輸出訊號912,還有尋徑系統利用VAD訊號904來消除聲音訊號922 的雜訊。由加州舊金山的Aliph公司所提供,裝置在氣管上的GEMS無線電振動 計發出VAD訊號604,而聲音訊號902則是利用Aliph的麥克風組和加速計,裝 置在一間6英尺長,天花板高度8英尺的房間內錄音,環境中有嘈雜的噪音。尋 ]續次頁(發明說明頁不敷使用時,請註記並使用續頁) 20 200304119 發明說明續頁 徑系統是採用即時處理的方式,延遲約10ms ’由原始聲音訊號902和消除雜訊後 的聲音訊號922之間的差別可以看出,雜訊抑制大約在25至30 dB之間,人聲訊 號的失真不大。由這些結果顯示,採用以單麥克風爲主的VAD資訊可以有效地 消除雜訊。 尋徑系統爲主的VAD(PVAD)裝置/方法 再回到圖1與圖1B,在一實施例中PVAD系統102B包含PVAD演算法150, 用以接收來自對應的訊號處理系統1〇〇的麥克風陣列所傳來的資料164。麥克風 φ 陣列包含二支麥克風,但此並非用以限制。在此一實施例中,PVAD於時域(time . domain)內操作,而二支麥克風之間相距幾公分,其中至少有一支麥克風是單指向 性麥克風。 圖10所示爲在一實施例中,陣列中的單一心型曲線單一指向性麥克風 (c滅oid unidkectional microphone)1002,力口上相關的空間響應曲線1010的 單指向性麥克風1002同時也被稱爲雜訊麥克風(noise microphone)1002,或MIC 2, 它的方向是讓使用者嘴巴在或者靠近雜訊麥克風1002的頻率響應1010的無效區 (null)1012內,不過本系統並不限於使用心型曲線指向性麥克風。 圖11所示爲在一實施例中的PVAD系統的麥克風陣列11〇〇,麥克風陣列11〇〇 包含二個心型曲線單一指向性麥克風MIC丨1〇〇2與MIC 2 11〇2,其空間響應曲線 分別爲1010與1110。在麥克風陣列11〇〇中用作人聲麥克風MIC 1的麥克風型式 並無限制’不過當人聲麥克風MIC 1爲單指向性的麥克風,而且它的方向是讓使 用者嘴巴在或者靠近頻率響應曲線1〇1〇的最大値時,會有最佳的效果。這樣可以 確保在有人說話時,二支麥克風的信號有很大的差異。 在一個實施例中,麥克風組態包含Μία與MIC 2,然後將這些麥克風放置 在靠近人耳處’這個組態是將人聲麥克風MIC 1朝向使用者的嘴巴,而雜訊麥克 [I續次頁(發明說明頁不敷使用時,請註記並使用續頁) 21 200304119 發明說明續頁 風MIC 2則是朝向遠離使用者頭部的方向,讓每個麥克風的空間響應曲線彼此間 有大約90度的角度差,這樣可以讓雜訊麥克風MIC 2能夠充分地捕捉到頭的前 端的雜訊,而同時又不會捕捉到太多的信號。 另外二個實施例,是調整麥克風1102與1002的指向位置’讓每個麥克風的 空間響應曲線的最大値分別相距75度與135度。這些PVAD系統的配置方式是 把麥克風盡量放在一起,以簡化Hl(z)的計算,然後把麥克風擺成讓人聲麥克風 MIC 1能夠偵測到大多數的人聲’而雜訊麥克風MIC 2可以偵測到大多數的雜訊 (也就是讓H2(z)相對地很小),而每個麥克風的空間響應曲線的最大値之間的角籲 度差最大可達約180度,但是不應小於約45度。 . PVAD系統採用尋徑方法,艮P計算人聲麥克風和雜訊麥克風之間的差別路徑 (在尋徑方法中稱爲Hi,以下將說明),用以計算VAD。尋徑系統並不使用此一 資訊進行雜訊抑制,而是使用乩的增益(gain)來決定何時該消除雜訊。比較人聲 麥克風的訊號能量比率和雜訊麥克風的訊號能量比率’可根據以下的算式求得 VAD增益(以下稱爲增益)Their Use in Defining an Excitation Function for the Human Vocal Tract). " The operating range of the laser vibrometer is close to visible light or visible light, so it can only detect surface vibrations, similar to the accelerometer or SSM described above. The laser vibrometer and the RF vibrometer will not detect sound noise, so the path finding system in the embodiment uses the signals provided by these devices, refer to the VAD based on the above-mentioned elocimeter and the energy of FIG. 3 / Critical Martingale method for constructing VAD. Figure 6 shows a diagram in an embodiment, with a noisy sound signal (live recording) 602 plus a corresponding GEMS VAD signal 604, a corresponding GEMS output signal 612, and a path finding system using VAD Signal 604 to eliminate noise from sound signal 622. Provided by Aliph Company of San Francisco, California, the GEMS radio vibrometer mounted on the trachea emits a VAD signal 604, and the sound signal 602_ is a 6-foot-long, 8-foot ceiling height installation using Alih's microphone Recording in the room, there is noisy noise in the environment. The path finding system uses a real-time processing method with a delay of about 10ms. As can be seen from the difference between the original sound signal 602 and the noise-removed sound signal 622, the noise suppression is about 20 to 25 decibels (dB). The distortion of the human voice signal is not large, so using VAD information based on GEMS can effectively eliminate noise. Obviously, even if GEMS does not detect unvoiced speech, the VAD signal and the action of eliminating noise are still quite effective. Since the unvoiced sound energy is quite low, it does not significantly affect the cricket. Convergence and vocal quality after noise removal. VAD device / method for direct vocal cord motion measurement Refer to FIG. 1 and FIG. 1A. In one embodiment, the VAD system 102A includes a direct glottal motion measurement VAD device 130 for providing data to related Algorithm 140. In an embodiment, the VAD device for direct vocal cord motion measurement of the path finding system includes a vocal cord vibration detector (EGG), and any device that can directly measure vocal cord movement or position. EGG uses two or more electrodes placed on the side of the thyroid cartilage to return the corresponding signal of the vocal cord contact area 'a small amount of AC current transmitted from one or more currents will pass through the neck tissue (including Vocal cord) to the neck □ continuation sheet (if the description sheet of the invention is insufficient, please note and use the continuation sheet) 17 200304119 invention description on other electrodes on the other side of the continuation sheet part, The amount of current flowing from one electrode to the other electrode will increase. If the vocal folds do not touch each other, the current will decrease. For example, the electromagnetic vibrometer and SSM are the same. The signal of EGG is not affected by noise. Therefore, in this embodiment, the path finding system receives information from the EGG, and refers to the above-mentioned accelerometer-based VAD and the energy / critical threshold method of FIG. 3 to construct the VAD. FIG. 7 shows an embodiment In the output picture, the recorded English voice male voice data 702 plus digitally added noise, the corresponding EDG-based VAD signal 704, and the corresponding high-pass filtered EGG output signal 712. If you compare the sound data 702 with the EGG output signal 712, it can be seen that EGG is quite accurate in detecting the sound of someone making a sound, but when the vocal folds don't touch each other, such as when no one speaks. Cannot be detected. However, in the experiment, the problem of unvoiced or very soft speech (both of which have low energy) cannot be detected, and it will not significantly affect the ability of the system to eliminate human noise in the general environment. For more information, see "A Critical Review of Electroglottography," published by DG Childers and AK Krishnamurthy in the journal CRC Crit Rev Biomedical Engineering in 1985, No. 12, pages 131-161. • VAD device / method for detecting video as cattle Refer to FIG. 1 and FIG. 1A. In one embodiment, the VAD system 102A includes a VAD device 130 for video detection, which is used to provide data to related algorithms 140. In the embodiment Video cameras and processing systems detect movements of vocal organs including the jaw, lips, teeth, and tongue. The video and computer systems currently under development support two-dimensional computer images, so VAD systems based on video detection can be provided. Construction tools for this type of system can be accessed at http: // wwwjnteLcom / research / mrl / research / opencv / query. In the embodiment, the path-finding system can use the components of the video system to detect the movement of the vocal organs, and continue to the next page (when the description page of the invention is insufficient, please note and use the continued page) 18 200304119 Invention Description Generate VAD information. FIG. 8 shows a flowchart 800 of a method for determining a voice of a person using a video-based VAD in one embodiment. In block 802, the components of the video system will find the user's face and sound generator, and in block 804, calculate the actions of the starting sound organ. At block 806, the components of the video system and / or the path-finding system determine whether the calculated vocal organ moves faster than the critical speed and is vibrating (the vocal organ moves back and forth, it can be discerned that it is not a simple accidental action). If the action is slower than the critical speed and / or not vibrating, and if the action is slower than the critical speed and / or not vibrating, return to the action of block 802. If in block 806, the action is faster than the critical speed and is vibrating, then in block 808, the components of the video system φ and / or the path finding system will determine whether the vocal organ action is greater than the critical rate. If the action is greater than the critical threshold, in block 810, the components of the video system and / or the path finding system determine that the target is sounding, and then in block 812, the information is passed to the path finding system. This video-based VAD device / method is not affected by acoustic noise, and can be used at a distance from the user or talker, so it is suitable for monitoring applications. Acoustic information-based VAD setup / method Refer to the description of Figures 1 and 1B above. When using VAD in a noise suppression system, the VAD signal is processed independently of the noise suppression system, so it is received and processed. The action of VAD information is also independent of related processing of noise suppression, but the embodiment of the present invention is not limited thereto. The independence of the VAD device mainly based on sound information is through the processing method, that is, it can use the same hardware to receive signals to noise. The suppression system, but uses independent technology (software, algorithms, or routines) Processes the received signal, but in some cases, the sound microphone is used to construct the VAD function, not to suppress noise. In the embodiment, the audio information-based VAD device / method needs to rely on one or more traditional voice microphones to detect the human voice to be received, so it is relatively susceptible to interference from ambient sound noise, and it cannot be used in all noise. Reliable operation in the environment, however, VAD mainly based on audio information is simple and convenient] Continued page (When the description page of the invention is insufficient, please note and use the continued page) 19 200304119-Description of the advantages And the same microphone can be used for VAD and receiving sound signals. Therefore, this type of VAD solution will be more popular in situations where the cost is more important than the performance of noise removal. In the embodiment, the voice information is mainly VAD devices / methods include single-microphone VAD, path-finding VAD, stereo VAD (SVAD), array VAD (AVAD), and other traditional single-microphone VAD devices / methods, but not limited to this, which will be described in detail below. Single microphone-based VAD device / method This may be the easiest way to detect whether the user is speaking. Please refer to FIG. 1 and FIG. 1B. In one embodiment, the VAD system 102B includes a VAD algorithm 150 for Receive signal processing from the corresponding signal. Data 164 from the microphone of the system 100, the microphone (usually a "close-talk" or gradient microphone) is relatively close to the user's mouth, sometimes directly touch Touch your lips. Gradient microphones are very insensitive to sounds more than a few centimeters away from the microphone (and the frequency range received is usually less than 1KΗζ), so they can be used to record signals with a fairly high SNR. Of course, the performance achieved by a single microphone and the distance from the user's mouth to the microphone, the harshness of environmental noise, and whether the user is willing to put things as close to the lips will have an impact. Since at least a part of the frequency spectrum of the data or signal recorded from a close-range single microphone has a relatively high SNR, in the embodiment, the path-finding system receives a signal from a single microphone, referring to the accelerometer-based VAD and The energy / critical threshold method of Figure 3 is used to construct the VAD. Figure 9 shows a diagram in an embodiment. Noisy sound signal 902 (live recording) plus-corresponding single (gradient) microphone VAD signal 904, corresponding gradient microphone output signal 912, and the path finding system uses VAD signal 904 to eliminate noise from sound signal 922. Provided by Aliph Company of San Francisco, California, the GEMS radio vibrometer installed on the trachea emits VAD signal 604, and the sound signal 902 uses Aliph's microphone set and accelerometer. Recording in the room, there is noisy noise in the environment. [Find] Continued pages (Note when the invention description page is inadequate, please note and use the continuation page) 20 200304119 Invention description The page diameter system uses a real-time processing method, with a delay of about 10ms' from the original sound signal 902 and noise reduction The difference between the sound signal 922 can be seen, the noise suppression is about 25 to 30 dB, the distortion of the human voice signal is not large. These results show that the use of single microphone-based VAD information can effectively eliminate noise. The VAD (PVAD) device / method based on the path-finding system returns to FIG. 1 and FIG. 1B again. In one embodiment, the PVAD system 102B includes a PVAD algorithm 150 to receive a microphone from the corresponding signal processing system 100. 164 from the array. The microphone φ array contains two microphones, but this is not a limitation. In this embodiment, the PVAD operates in the time domain, and the two microphones are a few centimeters apart. At least one of the microphones is a unidirectional microphone. FIG. 10 shows a single cardioid curve unidirectional microphone 1002 in an embodiment, and a unidirectional microphone 1002 related to the spatial response curve 1010 on the force port is also called Noise microphone 1002, or MIC 2, its direction is to allow the user's mouth to be within or near the null zone 1012 of the frequency response 1010 of the noise microphone 1002, but the system is not limited to using a heart shape Curved directional microphone. FIG. 11 shows a microphone array 1100 of a PVAD system according to an embodiment. The microphone array 1100 includes two cardioid curve single-directional microphones MIC 1002 and MIC 2 1102, and their space The response curves are 1010 and 1110, respectively. There are no restrictions on the type of microphone used as the vocal microphone MIC 1 in the microphone array 1 100 '. However, when the vocal microphone MIC 1 is a unidirectional microphone, and its direction is such that the user's mouth is at or near the frequency response curve 1 When the maximum time is 10, the best effect will be achieved. This will ensure that the signals from the two microphones are significantly different when someone is speaking. In one embodiment, the microphone configuration includes Μία and MIC 2, and these microphones are placed close to the human ear. This configuration is directed to the vocal microphone MIC 1 towards the user's mouth, and the noise microphone [I continued page (When the description page of the invention is not enough, please note and use the continuation page) 21 200304119 Description of the page continued MIC 2 is away from the user's head, so that the spatial response curve of each microphone is about 90 degrees from each other The angle difference is so that the noise microphone MIC 2 can fully capture the noise at the front end of the head, but at the same time will not capture too much signal. The other two embodiments are to adjust the pointing positions of the microphones 1102 and 1002 'so that the maximum 値 of the spatial response curve of each microphone is 75 degrees and 135 degrees apart, respectively. The configuration of these PVAD systems is to put the microphones together as much as possible to simplify the calculation of Hl (z), and then place the microphone as a human microphone. MIC 1 can detect most human voices, while the noise microphone MIC 2 can detect Most noise is detected (that is, H2 (z) is relatively small), and the angular difference between the maximum chirps of the spatial response curve of each microphone can be up to about 180 degrees, but should not be less than About 45 degrees. The PVAD system uses a path finding method, which calculates the differential path between the vocal microphone and the noise microphone (referred to as Hi in the path finding method, described below) to calculate the VAD. The path finding system does not use this information for noise suppression. Instead, it uses chirp gain to determine when to eliminate noise. Comparing the signal energy ratio of a vocal microphone with the signal energy ratio of a noise microphone ’, the VAD gain (hereinafter referred to as the gain) can be obtained by the following formula

Gain = |Hj (z)| =Gain = | Hj (z) | =

Energy of speech mic Energy of noise micEnergy of speech mic Energy of noise mic

其中&爲人聲麥克風的數位化訊號的第i個樣本,而y.爲雜訊麥克風的數位化訊 號的第i個樣本,在此VAD應用中並不需要調整ft的計算,儘管這個範例是以 數位方式說明,但是用於類比方式也同樣有效,而增益也可以在時域或頻域內計 算。在頻域中,增益參數是ft係數平方的總和,如上所述,在能量的計算中並 沒有出現區間(計算區間)的長度,因爲在計算能量比率時會被抵銷掉,最後, 這個範例是用在單一副頻帶中,但是可適用於任何數目的副頻帶。 回到圖11,麥克風陣列1100的空間響應曲線1010與1110在第一半球1120 的增益大於1,而在第二半球1130的增益小於1,但此並非用以限制。這個特性 □續次頁(發明說明頁不敷使用時,請註記並使用續頁) 22 200304119Where & is the i-th sample of the digitized signal of the vocal microphone, and y. Is the i-th sample of the digitized signal of the noise microphone. In this VAD application, there is no need to adjust the calculation of ft, although this example is It is described digitally, but it is also effective for analogy, and the gain can also be calculated in the time or frequency domain. In the frequency domain, the gain parameter is the sum of the square of the ft coefficient. As mentioned above, the length of the interval (calculation interval) does not appear in the calculation of energy, because it will be offset when calculating the energy ratio. Finally, this example It is used in a single sub-band, but can be applied to any number of sub-bands. Returning to FIG. 11, the gain of the spatial response curves 1010 and 1110 of the microphone array 1100 in the first hemisphere 1120 is greater than 1 and the gain in the second hemisphere 1130 is less than 1, but this is not a limitation. This feature □ Continued (please note and use the continuation page when the invention description page is insufficient) 22 200304119

發明說明MM 和人聲麥克風MIC 1和使用者嘴巴的接近程度,有助於區隔人聲與雜訊。 在PVAD實施例中的麥克風陣列11〇〇能夠讓尋徑系統達到最佳的效能,同 時可以用二個相同的麥克風來處理VAD和消除雜訊的工作,因此可以降低系統 成本’提供額外的好處。不過爲了要讓VAD達到最佳的效能,二支麥克風要朝 向相反的方向,以便發揮此一配置下增益變化非常大的特性。 在另一個實施例中,PVAD還包含第三個單一指向性麥克風MIC 3(未顯示), 但並非用以限制。第三個麥克風MIC 3朝向與MIC 1相反的方向,只用於VAD 的用途,而MIC 2只用於雜訊抑制,而MIC 1用於VAD和雜訊抑制,如此一來鲁 會提昇整體系統效能,但是需增加一隻麥克風的成本,而且要處理多50%的聲音 . 資料。 在一實施例中,尋徑系統接收來自PVAD的訊號,參考上述加速計爲主的 VAD和圖3的能量/臨界値方法,用來建構VAD。因爲在麥克風資料中可能有 相當大的雜訊成份,所以並不是在每種情況下都能使用加速計爲主的能量/臨界 値VAD偵測方法。另一種VAD實施例使用過去的增益値(在只有雜訊的時候) 來偵測是否有人在說話,以下將作說明。 圖12所示爲在另一 PVAD實施例中,利用增益値決定有無人發聲的方法的 · 流程圖1200。一開始在1202區塊接收系統麥克風的資料。在區塊12〇4PVAD系 統的組件過濾資料以預先消除鬼影(aliasing),並將過濾後的資料數位化。在區塊 1206分割資料爲長度20微秒(ms)的區間,而資料以每8微秒一個步級(_)呈現。 . 在區塊1208內進一步處理區間內已分割的資料,移除不需要的頻譜資訊,計算在 只有雜訊的區間內最後50筆增益的標準變異(SD),稱爲向量OLD jTD,並且算 出〇LD_STD的平均値(AVE)。在區塊1210內,AVE和SD的値會拿來跟預先指 定的極小値比較,如果小於極小値,就分別增加至極小値。 在區塊1212,PVAD系統的組件接著將AVE加上SD的倍數,計算有人發聲 □續次頁(發明說明頁不敷使用時,請註記並使用續頁) 23 200304119 發明說明續頁 的臨界値。低臨界値爲AVE加上1.5倍的SD,而高臨界値則是AVE加上四倍的 SD ’在區塊1214,將振幅平方加起來可求得每個區間內的能量,另外,在區塊 1214中,可以計算MIC 1的能量對MIC 2能量的比率,以求得增益,在MIC 2的 會皂量中可加入一小截止値(small cutoff value),以保證穩定性,但是此一實施例並 不在此限。 在區塊1216中,比較所求得的增益與臨界値,會有3種可能的結果,當增益 小於低臨界値,那麼可以決定在這個區間中並沒有人聲,而OLD jTD向量就更 新爲新的增益値;如果增益大於低臨界値,同時小於高臨界値,就可以決定在這 個區間中並沒有人聲,但是懷疑是有人發聲的聲音,因此並不更新OLD jTD爲 新的增益値;當增益大於低與高臨界値,就可以決定在此區間有人發聲,而 〇LD_STD並不更新爲新的增益値。 不論這個方法是如何實施,它的精神在於利用有人聲時H《z) = M《z) / M2(z) 的增益較大,用以區隔背景雜訊。根據麥克風的配置,有人聲的時候所計算而得 的增益應該比較大,因爲人聲麥克風MIC 1所收到的人聲遠比雜訊麥克風MIC 2 所接收到的雜訊要大聲,反之,由於雜訊比較可能會隨地形擴散,因此MIC 2收 到的雜訊通常會比MIC 1要大聲。但是如果用全指向性的麥克風作爲人聲麥克風 的話,上述的說法就不一定是對的,而此時系統控制雜訊的能力也會受到限制。 請注意,只應用聲音來消除雜訊的方法比較容易受到環境雜訊的干擾。不過 有測試結果顯示,上述以單一指向性麥克風搭配單一指向性麥克風的配置可以達 到MIC 1的SNR略小於OdB,算是令人滿意的結果,因此,這個以PVAD爲主的 雜訊抑制系統在大多數使用者會遇到的雜訊環境中,都能夠有效的運作。同樣 的,若有需要,可以將麥克風移近使用者的嘴巴,以提高MIC 1的SNR。 圖13所示爲在一實施例中的圖表,有雜訊的聲音訊號1302 (現場錄音)加 上對應的單一(梯度)麥克風爲主的PVAD訊號1304,對應的梯度麥克風(gradient ϋ續次頁(發明說明頁不敷使用時,請註記並使用續頁) 24 200304119 發明說明續頁 microphone)輸出訊號1312,還有尋徑系統利用PVAD訊號1304來消除聲音訊號 1322的雜訊。聲音訊號1302是利用Aliph的麥克風組和加速計,裝置在一間6英 尺長,天花板高度8英尺的房間內錄音,環境中有嘈雜的噪音。尋徑系統是採用 即時處理的方式,延遲約l〇ms,由原始聲音訊號1302和消除雜訊後的聲音訊號 1322之間的差別可以看出,雜訊抑制大約在20至25 dB之間,人聲訊號的失真不 大。由這些結果顯示,採用以單麥克風爲主的PVAD資訊可以有效地消除雜訊。 立體聲VAD(SVAD)裝置/方法 再回到圖1與圖1B,在一實施例中SVAD系統102B包含SVAD演算法150’ 用以接收來自對應的訊號處理系統100的頻率式(frequency-based)雙麥克風陣列所 傳來的資料164。SVAD演算法的理論是在頻譜中,接收到的人聲和雜訊是可以 分辨的,因此在SVAD裝置/方法的處理中包含比較麥克風之間的平均FFT (快 速傅立葉轉換),SVAD採用2隻麥克風,指向可參考上述的PVAD與圖11,同 時也仰賴先前區間內的雜訊資料來決定目前的區間是否包含人聲,如同上述對 PVAD裝置/方法的說明,人聲麥克風在此處稱爲MIC 1,而雜訊麥克風稱爲MIC 2。 請參照圖1,此尋徑雜訊抑制系統採用雙麥克風來捕捉信號(MIC 1)和雜訊 (MIC 2)的特性,很自然地,二支麥克風都會接收到混雜人聲與雜訊的訊號,但是 在實施例中假設MIC 1的SNR大於MIC 2,意味著MIC 1比MIC 2要靠近訊號源 (使用者)或是方位比較好,而任何雜訊源與MIC 1和MIC 2的距離都比訊號源 遠。不過,利用全指向性和單一指向性或類似的麥克風,也可以達到相同的效果。 二支麥克風之間的SNR差異可以運用在時域或頻域,爲了要將雜訊從人聲中 分離,有需要計算一段時間內雜訊的平均頻譜,我們可以利用以下的指數函數平 均法(exponential averaging method)來計算: 二]續次頁(發明說明頁不敷使用時,請註記並使用續頁) 發明說明續頁 200304119 L(i,k) = aL(i -1,k) + (1 - a)S(i,k), 其中a控制平均的平滑程度,0.999表示平均非常平滑,而0.9並不是很平滑。變 數L(i,k)與S(i,k)分別是平均和瞬時變數(instantaneous variable),i代表離散時間樣 本,而k代表頻率區間(frequency bin) ’其數値是由FFT的長度來決定’這些數値 也可以用傳統的平均或移動平均(movinS averaSe)的方式來計算。 圖14所示爲在一實施例的流程圖1400,利用立體聲VAD(SVAD)決定有無人 發聲的方法。在這個範例中,可參考圖1的敘述,係利用2隻麥克風在8KHz處 記錄資料,並且事先有小心處理以去除鬼影。所使用的區間爲20ms長’每一步 _ 級 8ms〇 · 一開始在1402區塊用二支麥克風接收資料。來自麥克風的訊號經過適當的過 . 濾以預先消除鬼影,並將過濾後的資料數位化以便處理。此外,在區塊1404,由 MIC 1與MIC 2所收集的先前160個樣本利用Hamming區間(window)分配在各個 區間內,然後在區塊1406與1408,SVAD系統的組件會計算分配在區間內的資料 的FFT大小,以求得FFT1與FFT2。 在區塊1410,利用上述的指數函數平均法,並指定α的數値爲0.85,對FFT1 與FFT2進行指數平均以產生MF1與MF2。在區塊1412,系統利用MF1與MF2 ® 來計算MF1對MF2加上截止値(cutoff)的比率,接著求出平均,標示爲VAD_det, 算式如下= 1 ( MF1 、Description of the invention The closeness of the MM and the vocal microphone MIC 1 to the user's mouth helps to distinguish human voice from noise. The microphone array 1100 in the PVAD embodiment enables the path-finding system to achieve the best performance. At the same time, two identical microphones can be used to process VAD and eliminate noise, so the system cost can be reduced. 'Providing additional benefits . However, in order to achieve the best VAD performance, the two microphones should be oriented in opposite directions in order to take advantage of the very large gain variation in this configuration. In another embodiment, the PVAD also includes a third single directional microphone MIC 3 (not shown), but is not intended to be limiting. The third microphone, MIC 3, is facing the opposite direction of MIC 1. It is only used for VAD purposes, while MIC 2 is only used for noise suppression, and MIC 1 is used for VAD and noise suppression. In this way, Lu will improve the overall system Performance, but at the cost of a microphone, and more than 50% more sound. Data. In one embodiment, the path finding system receives signals from PVAD, and refers to the above-mentioned accelerometer-based VAD and the energy / critical threshold method of FIG. 3 to construct a VAD. Because there may be considerable noise components in the microphone data, accelerometer-based energy / critical 値 VAD detection methods cannot be used in every case. Another VAD embodiment uses the past gain (when there is only noise) to detect whether someone is talking, which will be described below. FIG. 12 shows a flowchart 1200 of a method for determining the presence or absence of a voice by using gain 値 in another PVAD embodiment. At the beginning, the data of the system microphone was received in block 1202. The components of the PVAD system at block 1204 filter the data to eliminate aliasing in advance, and digitize the filtered data. In block 1206, the data is divided into intervals of 20 microseconds (ms), and the data is presented in steps (_) every 8 microseconds. In block 1208, further process the segmented data in the interval, remove the unnecessary spectrum information, calculate the standard deviation (SD) of the last 50 gains in the interval with only noise, and call it the vector OLD jTD, and calculate 〇 LD_STD average 値 (AVE). In block 1210, the AVE and SD 値 will be compared with the pre-specified minimum 値. If they are smaller than the minimum 値, they will be increased to the minimum 分别 respectively. In block 1212, the components of the PVAD system then add AVE to the multiple of SD to calculate someone ’s voice. Continued page (when the description page of the invention is insufficient, please note and use the continued page) 23 200304119 The critical point of the continued page of the invention description 値. The low critical value is AVE plus 1.5 times the SD, and the high critical value is AVE plus four times the SD. At block 1214, the squared amplitude can be added to obtain the energy in each interval. In addition, In block 1214, the ratio of the energy of MIC 1 to the energy of MIC 2 may be calculated to obtain the gain. A small cutoff value may be added to the amount of soap in MIC 2 to ensure stability, but this The embodiment is not limited thereto. In block 1216, comparing the obtained gain with the critical threshold, there are 3 possible results. When the gain is less than the low critical threshold, then it can be determined that there is no human voice in this interval, and the OLD jTD vector is updated to new If the gain is greater than the low critical 値 and lower than the high critical 値, it can be determined that there is no human voice in this interval, but it is suspected that someone is vocal, so the OLD jTD is not updated to the new gain 値; when the gain Above the low and high critical thresholds, it can be determined that someone is speaking in this interval, and 〇LD_STD is not updated to the new gain threshold. No matter how this method is implemented, its spirit is to use the gain of H <z) = M <z) / M2 (z) when using human voice to isolate background noise. According to the configuration of the microphone, the calculated gain when there is human voice should be relatively large, because the human voice received by the vocal microphone MIC 1 is much louder than the noise received by the noise microphone MIC 2; otherwise, due to noise The noise is likely to spread with the terrain, so the noise received by MIC 2 is usually louder than that of MIC 1. However, if an omnidirectional microphone is used as the vocal microphone, the above statement is not necessarily correct, and the ability of the system to control noise is also limited at this time. Please note that methods that use only sound to eliminate noise are more susceptible to environmental noise. However, some test results show that the above configuration with a single directional microphone and a single directional microphone can reach an SNR of MIC 1 slightly less than OdB, which is a satisfactory result. Therefore, this PVAD-based noise suppression system has a large Most users will be able to operate effectively in the noisy environment. Similarly, if necessary, the microphone can be moved closer to the user's mouth to improve the SNR of MIC 1. FIG. 13 is a chart in an embodiment, a noise sound signal 1302 (live recording) plus a corresponding single (gradient) microphone-based PVAD signal 1304, and a corresponding gradient microphone (gradient ϋcontinued on the next page) (When the description page of the invention is insufficient, please note and use the continuation page.) 24 200304119 Description of the page continued (microphone) Output signal 1312, and the path finding system uses the PVAD signal 1304 to eliminate the noise of the sound signal 1322. The sound signal 1302 is made using Aliph's microphone set and accelerometer. The device is recorded in a 6-foot-long room with a ceiling height of 8 feet. There is noisy noise in the environment. The path-finding system uses a real-time processing method with a delay of about 10 ms. From the difference between the original sound signal 1302 and the noise signal 1322 after noise cancellation, it can be seen that noise suppression is between 20 and 25 dB. The distortion of the vocal signal is small. These results show that the use of single microphone-based PVAD information can effectively eliminate noise. Stereo VAD (SVAD) device / method returns to FIG. 1 and FIG. 1B. In one embodiment, the SVAD system 102B includes an SVAD algorithm 150 ′ to receive frequency-based dual signals from the corresponding signal processing system 100. Information 164 from the microphone array. The theory of the SVAD algorithm is that in the frequency spectrum, the received human voice and noise can be distinguished. Therefore, the SVAD device / method includes a comparison of the average FFT (fast Fourier transform) between microphones. SVAD uses 2 microphones For directions, please refer to the PVAD and Figure 11 above, and also rely on the noise data in the previous section to determine whether the current section contains human voice. As described above for the PVAD device / method, the vocal microphone is called MIC 1 here. The noise microphone is called MIC 2. Please refer to Figure 1. This path finding noise suppression system uses dual microphones to capture the characteristics of the signal (MIC 1) and noise (MIC 2). Naturally, both microphones will receive mixed human voice and noise signals. However, in the embodiment, it is assumed that the SNR of MIC 1 is greater than that of MIC 2, which means that MIC 1 is closer to the signal source (user) or better in orientation than MIC 2, and the distance between any noise source and MIC 1 and MIC 2 is better than The signal is far away. However, the same effect can be achieved with omnidirectional and single-directional or similar microphones. The SNR difference between the two microphones can be used in the time or frequency domain. In order to separate the noise from the human voice, it is necessary to calculate the average spectrum of the noise over a period of time. We can use the following exponential function average method (exponential averaging method) to calculate: Two] Continued pages (When the description page of the invention is insufficient, please note and use the continued page) Invention description sequel 200304119 L (i, k) = aL (i -1, k) + (1 -a) S (i, k), where a controls the smoothness of the average, 0.999 means that the average is very smooth, and 0.9 is not very smooth. The variables L (i, k) and S (i, k) are the average and instantaneous variables, respectively, i represents the discrete-time samples, and k represents the frequency bin (frequency bin) 'The number 値 is determined by the length of the FFT The decision 'These numbers can also be calculated using the traditional average or moving average (movinSaveraSe) method. Fig. 14 shows a flowchart 1400 of an embodiment of a method for determining the presence or absence of sound by using stereo VAD (SVAD). In this example, you can refer to the description in Figure 1, which uses two microphones to record data at 8KHz, and has been carefully processed in advance to remove ghosting. The interval used is 20ms long ’each step _ level 8ms 0 · At first, two microphones were used to receive data in block 1402. The signal from the microphone is properly filtered to eliminate ghosting in advance, and the filtered data is digitized for processing. In addition, in block 1404, the previous 160 samples collected by MIC 1 and MIC 2 were allocated in each interval using the Hamming interval (window), and then in blocks 1406 and 1408, the components of the SVAD system calculated and allocated in the interval. FFT size of the data to obtain FFT1 and FFT2. In block 1410, the above exponential function averaging method is used, and the number α of α is specified to be 0.85. FFT1 and FFT2 are exponentially averaged to generate MF1 and MF2. At block 1412, the system uses MF1 and MF2 ® to calculate the ratio of MF1 to MF2 plus cutoff, and then calculates the average and marks it as VAD_det. The formula is as follows: 1 (MF1,

VAD det1=—V —^. 一 128 V^MF2ik+cutoff J 其中i代表第i個區間,k爲頻率區間,而截止値用來合理的調整比例,以防有 MIC 2頻率區間的振幅過小的情況發生。由於FFT的長度爲128,因此將結果除 以128就可以得到比率的平均値。 在區塊1414,尋徑系統的組件比較VAD_det和發聲臨界値Vjhresh,經比較 〇續次頁(發明說明頁不敷使用時,請註記並使用續頁) 26 發明說明續頁 200304119 後,如果VAD_det的數値低於Vjhresh,系統組件就把VAD_state設爲0,如果 VAD_det的數値高於Vjhresh,則將VAD_state設爲1。 在區塊1416會決定VADjtate是否等於1,當VAD_state等於1,在區塊1417, 尋徑系統的組件會更新參數,並且在記錄連續發聲區段(contiguous voicing section) 的計數器中記錄最大的VAD_det値,然後進入區塊1420的操作,如果在有人發 聲的區間後出現一個無人發聲的區間,貝(1檢查先前連續發聲區段(可以包含一或 多個區間)所記錄的VAD_det最大値,看看發聲顯示結果是否有誤,如果區段中 最大的VADjet低於一預設的臨界値(比如說,低臨界値加上高低臨界値的差的 @ 40% ),則將該區間的發聲狀態設爲-1,這樣可以用來提醒消除雜訊的演算法, ” 先前有聲區段實際上不太可能有人發聲,而尋徑系統也可以修正係數的計算。 _ 在區塊1416,當SVAD系統決定VAD—state等於0,接著在區塊1418,SVAD 系統的組件會重設包含VAD_det的最大値的參數,同樣的,如果前一區間有人發 聲,系統會檢查先前有人發聲的區段是否爲誤判,在區塊1420,尋徑系統的組件 接著會更新高和低的決定等級(determinant level),用來計算發聲的臨界値 Vjhresh,接著會回到區塊1402的操作。 在本實施例中低與高決定等級都是利用指數平均方法計算而得,其中α的數 φ 値式根據目前的VAD_det是高於或低於低與高決定等級,如下所述。對於低決定 等級來說,如果VAD_det的數値比預設的低決定等級要高,則α的數値則設爲 0.999,不然就設爲0.9。對於高決定等級也是用相似的方法,只不過當目前VAD_det . 的値比現在的高決定等級要少的時候,α是設爲0.999,而目前VAD_det的値比 現在的高決定等級要高的時候,α是設爲0.9。在其他實施例中可以用傳統的平 均或移動平均方法來決定這些等級。 在一個實施例中,臨界値一般是設爲低決定等級加上高低決定等級的差的15 %,並且會訂出絕對最小臨界値,但是此一實施例並不在此限。絕對最小臨界値 ]續次頁(發明說明頁不敷使用時,請註記並使用續頁) 27 200304119 發明說明β胃 應該在VAD不會被任意觸發、安靜的環境中設定。 --- 在其他實施例中,利用SVAD以決定有無人發聲的情況下,可以利用不同的 參數,包括區間大小、FFT大小、截止數値以及α數値等,用來比較麥克風之間 的平均FFT値。只要麥克風之間的SNR差異足夠大,SVAD裝置/方法就可以用 來處理任何雜訊,2隻麥克風的SNR絕對値並不比相對的SNR値來得重要,因此, 如果在配置麥克風時讓麥克風之間的相對SNR差異變大,一般可以獲得較佳的 VAD表現。 SVAD裝置/方法已經成功地用於多種不同的麥克風配置、雜訊類型以及雜 φ 訊等級上。以圖15爲例,其中所示爲在一實施例中的圖表,有雜訊的聲音訊號 - 1502 (現場錄音)加上對應的單一(梯度)麥克風爲主的SVAD訊號1504,還有 _ 尋徑系統利用SVAD訊號1504來消除聲音訊號1522的雜訊。聲音訊號1502是利 用Aliph的麥克風組,裝置在一間6英尺長,天花板高度8英尺的房間內錄音, 環境中有嘈雜的噪音。尋徑系統是採用即時處理的方式,延遲約10ms,由原始聲 音訊號1502和消除雜訊後的聲音訊號1522之間的差別可以看出,雜訊抑制大約 在25至30 dB之間,顯示採用SVAD訊號1504所得的人聲訊號的失真不大。 • 隨列VAD (AVAD)裝置/方法 再回到圖1與圖1B,在一實施例中AVAD系統102B包含AVAD演算法150, 用以接收來自對應的訊號處理系統1〇〇的麥克風陣列所傳來的資料164。以 AVAD爲主的系統包含一個2或更多隻麥克風的陣列,用來分辨使用者的聲音與 環境噪音,但不在此限。在一個實施例中,二支麥克風彼此相距一預先指定的距 離,因此可以強調特定方向的聲音源,比如說位於連接麥克風的軸線上,或者是 在該線的中點上。在另一個實施例中’可使用波束成形(beamforming)或來源追蹤 的方式來找出陣列的視野範圍內想要取出的訊號,並建構相關的適應性雜訊抑制 —I續次頁(發明說明頁不敷使用時,請註記並使用續頁) 28 200304119 -- 發明說明續頁 系統,如尋徑系統所用的VAD訊號。熟悉此技藝者應可參考2001年由Μ. Brandstein與D· Ward所著之「麥克風陣列(Microphone Arrays)」一書,書號爲ISBN 3_540-41953_5,了解其他實施例的作法。 在一實施例中,AVAD包含一個雙麥克風的陣列,其中採用panasonic的單指 向性麥克風,麥克風的單指向性讓陣列能夠偵測到放在陣列之前或者是前方的聲 音源。不過,並不是一定要用單指向性麥克風,特S提當這些陣列放置的地方只 有一端會傳來聲音,比如說放在牆上,這時就不需要使用單指向性麥克風。二支 麥克風間的直線距離約爲30.5公分,而低雜訊放大器可以放大麥克風傳來的資 φ 料,利用國家儀器公司(National Instruments)的Labview 5.0軟體錄製在個人電腦(PC) - 上,但是此並非用以限制本實施例。AVAD系統的組件可以利用這個陣列,以12 . 位元、32KHz記錄麥克風的資料,然後用數位濾波的方式將資料降頻至16KHz 〇 在其他的實施例中可以利用低很多的解析度(也許8位元)和取樣頻率(降至幾 個KHz),加上足夠的預先類比濾波即可,因爲通常聲音資料的傳真度並不是考 量的重點。 當目標訊號源(人類談話者)在距離麥克風陣列的中線約30公分處,這種 配置方式讓MIC 1與MIC 2在收錄目標訊號源的聲音時,2者之間的延遲爲零,® 而對於其他訊號源則有一非零延遲,在其他的實施例中可以利用好幾種不同的配 置,其中延遲的數値不一,而每一個延遲定義一個目標訊號源可放置的活動區域 (active area)。 - 在這個實驗中,有2個喇叭用來提供雜訊,其中有一個喇叭距離麥克風陣列 · 右邊大約50公分,而第2個喇叭位於說話者右後方大約150公分處,喇D八所播放 的街上與卡車噪音大槪具有2至5dB的SNR。此外,有些錄音沒有加入雜訊,這 是爲了方便調整的關係。 圖16所示爲在一實施例中,利用AVAD決定有無人發聲的方法的流程圖 ]續次頁(發明說明頁不敷使用時,請註記並使用續頁) 29 200304119 - 發明說明_頁 1600。一開始在1602區塊用二支麥克風接收資料。在區塊1604,來自麥克風的 訊號經過適當的過濾以預先消除鬼影,並將過濾後的資料數位化以便處理。在區 塊1606,數位化的資料被分配在長度爲20ms的區間內,而資料爲每一步級8ms。 在區塊1608,進一步過濾分配於區間內的資料,以便移除被雜訊或其他不想要的 訊號所干擾的頻譜資訊。 在區塊1610,來自MIC 1的區間資料加上來自MIC 2的資料,結果再經過平 方,如以下算式所示: M12 =(M!+M2)2. 把麥克風資料加總的動作會強調所得資料中零延遲(zerodday)的資料成份’對於 相位相同的麥克風資料會有加成的效應,對於相位不一致的資料會有抵銷的效 應;由於目標訊號源在所有頻率下都是同相的(in phase),因此加起來會有建設性 的效益,而雜訊源的相位關係會隨著頻率改變,因此通常會有破壞性的抵銷結 果。接著,所得的訊號再經過平方後,會大幅加重訊號中的零延遲訊號成份,所 算出來的訊號可以利用簡單的能量/臨界値演算法來偵測是否有人發聲(請參考 上述加速計爲主的VAD以及圖3的說明),而此時零延遲訊號成份實際上已被增 加。 接下來,在區塊1612,將振幅的平方相加,可計算出所得向量的能量。在區 塊1614,計算最後50個只包含雜訊的區間的標準變異(standard deviation,SD) 〇LD_STD及其平均AVE。在區塊1616,拿AVE與SD的數値跟預先指定的最小 數値比較,如果小於最小値,貝[J分別增加至最小値。 而後在區塊1618,尋徑系統的組件將AVE加上SD的一個倍數,計算出發聲 臨界値。低臨界値爲AVE加上1.5倍的SD,而高臨界値則是AVE加上4倍的SD, 在區塊1620中,比較所求得的增益與臨界値,會有3種可能的結果,當增益小於 低臨界値,那麼可以決定在這個區間中並沒有人聲,而0LD_STD向量就更新爲 [□續次頁(發明說明頁不敷使用時,請註記並使用續頁) 30 200304119 發明說明_頁 新的增益値;如果增益大於低臨界値,同時小於高臨界値,就可以決定在這個區 間中並沒有人聲,但是懷疑是有人發聲的聲音,因此並不更新〇LD_STD爲新的 增益値;當增益大於低與高臨界値,就可以決定在此區間有人發聲,而OLD_STD 並不更新爲新的增益値。 圖17所示爲在一實施例中,AVAD系統中分別來自不同麥克風的聲音訊號 1710與1720,加上對應的VAD訊號1712與1722的圖表。圖中同樣標示出將聲 音訊號1710與1720加起來所得的訊號1730。喇叭放在距離麥克風陣列中線約30 公分處,所使用的雜訊爲卡車的噪音,而2隻麥克風的SNR均小於OdB。VAD 訊號1712與1722可以作爲尋徑系統或其他雜訊抑制系統的輸入訊號。 傳統單麥克風VAD裝置/方法 在一實施例中的雜訊抑制系統使用雙麥克風的系統中,一支麥克風的訊號來 產生VAD資訊,但並非用以限制。圖18所示爲在一實施例中,包含尋徑雜訊抑 制系統101與單麥克風VAD系統102B的訊號處理系統1800的方塊圖。系統1800 包含主麥克風MIC 1,或稱爲人聲麥克風,以及一隻參考麥克風MIC 2,或稱爲 雜訊麥克風。主麥克風MIC 1提供訊號至VAD系統102B以及尋徑系統101。參 考麥克風MIC 2提供訊號至尋徑系統1〇1。因此來自主麥克風MIC 1的訊號提供 人聲與雜訊資料給尋徑系統1〇1,並提供資料給VAD系統102B,用以產生VAD 資訊。 VAD系統i〇2B包含VAD演算法,如美國專利案號4,811,404與5,687,243的 發明所述,用以計算VAD訊號,而所求得的資料1〇4則提供給尋徑系統101,但 是此實施例並不在此限,經由參考麥克風MIC 2所接收到的訊號僅用在雜訊抑制 上。 圖19所示爲在一實施例中,利用單一麥克風產生有聲資訊的方法的流程圖 續次頁(發明說明頁不敷使用時,請註記並使用續頁) 31 200304119 - 發明說明續頁 1900。一開始在1902區塊用主麥克風接收訊號。在區塊1904,與VAD相關的處 理工作包括將主麥克風所接收到的訊號先過濾以預先消除鬼影,並將過濾後的資 料數位化,以合適的取樣頻率(一般爲8KHz)處理。在區塊1906,數位化的資 料被分割與過濾,以符合傳統VAD處理需求。在區塊1908,利用VAD演算法計 算VAD資訊,接著在區塊1910將VAD資訊提供給尋徑系統,用來消除雜訊。 以氣流推導之VAD奘置/方法 以氣流爲主的VAD裝置/方法利用從使用者口腔以及/或者鼻腔所送出的 氣流來建構VAD訊號,使用者可利用各種習知的方法來量測氣流,並且將氣流 與呼吸以及動作氣流(gross motion flow)分開,以便獲得正確的VAD資訊。由於呼 吸與動作氣流大多是由低頻(低於100Hz)的能量所組成,因此可以利用高通濾 波的方法來過濾氣流資料,並且將氣流與呼吸及動作氣流分開。可用來量測氣流 的器材可參考聲帶企業(Glottal Enterprise)的胸腔面罩(Pneumotach Masks),進一步的 資訊可參照 http://www.glottal.com ° 利用氣流爲主的VAD裝置/方法不太會受到聲音雜訊的影響,因爲要偵測 氣流時,必須非常靠近嘴巴和鼻子,因此,可參考上述加速計爲主的VAD和圖3 的能量/臨界値演算法,用來偵測有無人聲與建構VAD訊號。在其他氣流爲主 的VAD裝置以及/或者相關的雜訊抑制系統的實施例中,熟悉此技藝者可利用 其他以能量爲主的方法來產生VAD訊號。 圖20所示爲在一實施例中,利用氣流爲主的VAD決定有無人發聲的方法的 流程圖200。一開始在區塊2002,接收氣流資料。在區塊2004,與VAD相關的 處理工作包括將氣流資料先過濾以預先消除鬼影,並將過濾後的資料數位化。在 區塊2006,數位化的資料被分配在長度爲20ms的區間內,而資料爲每一步級 8ms。接下來在區塊2008,處理過程包括過濾分配在各個區間內的資料,以移除 ]續次頁(發明說明頁不敷使用時,請註記並使用續頁) 32 200304119VAD det1 = —V — ^.-128 V ^ MF2ik + cutoff J where i represents the i-th interval and k is the frequency interval, and the cutoff 値 is used to reasonably adjust the ratio to prevent the amplitude of the MIC 2 frequency interval from being too small. The situation happened. Since the length of the FFT is 128, divide the result by 128 to get the average 値 of the ratio. In block 1414, the components of the path-finding system compare VAD_det with the voicing threshold 値 Vjhresh. After comparison, the following page (continued if the description page is insufficient, please note and use the continued page) If the number of values is lower than Vjhresh, the system component sets VAD_state to 0. If the number of VAD_det is higher than Vjhresh, then set VAD_state to 1. In block 1416, it is determined whether VADjtate is equal to 1. When VAD_state is equal to 1, in block 1417, the components of the path finding system will update the parameters and record the largest VAD_det in the counter that records the continuous voicing section. , And then enter the operation of block 1420. If an unvoiced section appears after a vocalized section, Bay (1 checks the maximum VAD_det recorded in the previous continuous vocal section (which can contain one or more sections), see The vocalization shows whether the result is wrong. If the largest VADjet in the segment is lower than a preset critical threshold (for example, the difference between the low critical threshold and the high and low critical threshold @ 40%), the vocalization state of the interval is set. It is -1, which can be used to remind the algorithm of eliminating noise. ”In the previous vocal segment, it is actually impossible for someone to make a sound, and the path finding system can also modify the calculation of the coefficient. _ At block 1416, when the SVAD system decides VAD_state is equal to 0. Then in block 1418, the components of the SVAD system will reset the parameter containing the maximum value of VAD_det. Similarly, if someone made a sound in the previous interval, the system Check whether the previously vocalized segment was misjudged. In block 1420, the components of the path-finding system will then update the high and low determinant levels to calculate the critical threshold 声 Vjhresh, and then return to the block. Operation of 1402. In this embodiment, the low and high decision levels are both calculated using an exponential average method, where the number φ 値 is higher or lower than the low and high decision levels according to the current VAD_det, as described below. For low decision levels, if the number of VAD_det is higher than the preset low decision level, then the number of α is set to 0.999, otherwise it is set to 0.9. For high decision levels, a similar method is used, only However, when the 値 of VAD_det. Is lower than the current high decision level, α is set to 0.999, and when the 値 of VAD_det is higher than the current high decision level, α is set to 0.9. In other embodiments These levels can be determined using traditional average or moving average methods. In one embodiment, the critical threshold is generally set to a low determination level plus a 15% difference between the high and low determination levels, and will Absolute minimum critical value, but this embodiment is not limited to it. Absolute minimum critical value] Continued page (When the description page of the invention is insufficient, please note and use the continued page) 27 200304119 Invention description β stomach should not be in VAD It is set arbitrarily in a quiet environment. --- In other embodiments, in the case of using SVAD to determine whether there is a vocalization, different parameters can be used, including the interval size, FFT size, cutoff number 値, and α number 値Etc. to compare the average FFT 値 between microphones. As long as the SNR difference between the microphones is large enough, the SVAD device / method can be used to handle any noise. The SNR of the two microphones is definitely not more important than the relative SNR. Therefore, if you configure the microphones, The relative SNR difference becomes larger, and generally a better VAD performance can be obtained. The SVAD device / method has been successfully used in many different microphone configurations, noise types, and noise levels. Take Figure 15 as an example, which shows a diagram in an embodiment, with a noisy sound signal-1502 (live recording) plus a corresponding single (gradient) microphone SVAD signal 1504, and _ seek The path system uses the SVAD signal 1504 to eliminate noise from the audio signal 1522. The sound signal 1502 is a microphone set using Aliph. The device is recorded in a 6-foot long room with a ceiling height of 8 feet. There is noisy noise in the environment. The path finding system uses a real-time processing method with a delay of about 10ms. From the difference between the original sound signal 1502 and the noise signal 1522, it can be seen that the noise suppression is about 25 to 30 dB. The distortion of the vocal signal obtained by the SVAD signal 1504 is not great. • With the VAD (AVAD) device / method, return to FIG. 1 and FIG. 1B. In one embodiment, the AVAD system 102B includes an AVAD algorithm 150 for receiving the information transmitted by the microphone array of the corresponding signal processing system 100.来 的 资料 164. AVAD-based systems include an array of 2 or more microphones to distinguish the user's voice from ambient noise, but not limited to this. In one embodiment, the two microphones are at a predetermined distance from each other, so that a sound source in a particular direction can be emphasized, for example, on the axis connecting the microphones, or at the midpoint of the line. In another embodiment, 'beamforming or source tracking can be used to find the signals to be taken out within the field of view of the array, and to construct relevant adaptive noise suppression—I continued page (Explanation of the invention When the page is insufficient, please note and use the continuation page) 28 200304119-Description of the invention Continuation page system, such as the VAD signal used by the path finding system. Those familiar with this technique should refer to the 2001 book "Microphone Arrays" by M. Brandstein and D. Ward under the ISBN 3_540-41953_5 for the practice of other embodiments. In one embodiment, the AVAD includes an array of dual microphones. A unidirectional microphone from Panasonic is used. The unidirectionality of the microphone allows the array to detect sound sources placed in front of or in front of the array. However, it is not necessary to use a unidirectional microphone. When the array is placed, only one end of the array will emit sound, for example, placed on a wall. In this case, there is no need to use a unidirectional microphone. The linear distance between the two microphones is about 30.5 cm. The low-noise amplifier can amplify the data from the microphone. It was recorded on a personal computer (PC) using Labview 5.0 software of National Instruments, but This is not intended to limit the embodiment. The components of the AVAD system can use this array to record microphone data at 12 bit and 32KHz, and then use digital filtering to down-convert the data to 16KHz. In other embodiments, a much lower resolution can be used (maybe 8 Bit) and sampling frequency (down to a few KHz), plus sufficient pre-analog filtering, because the facsimile of audio data is usually not the focus of consideration. When the target signal source (human talker) is about 30 cm away from the center line of the microphone array, this configuration allows the MIC 1 and MIC 2 to record the sound of the target signal source. The delay between the two is zero. For other signal sources, there is a non-zero delay. In other embodiments, several different configurations can be used. The number of delays varies, and each delay defines an active area where the target signal source can be placed. area). -In this experiment, two speakers were used to provide noise, one of which was about 50 cm from the right of the microphone array, and the second speaker was located about 150 cm from the right behind the speaker. The noise of street and truck noise is 2 to 5dB. In addition, some recordings do not include noise, which is to facilitate the adjustment of the relationship. FIG. 16 shows a flowchart of a method for determining the presence or absence of utterance by using AVAD in one embodiment] Continued page (when the invention description page is insufficient, please note and use the continuation page) 29 200304119-Invention Description_Page 1600 . At the beginning, two microphones were used to receive data in block 1602. At block 1604, the signal from the microphone is properly filtered to remove ghosting in advance, and the filtered data is digitized for processing. In block 1606, the digitized data is allocated in intervals of 20ms in length, and the data is 8ms per step. In block 1608, the data allocated in the interval is further filtered in order to remove the spectrum information interfered by noise or other unwanted signals. In block 1610, the interval data from MIC 1 is added to the data from MIC 2, and the result is then squared, as shown in the following formula: M12 = (M! + M2) 2. The action of summing the microphone data will emphasize the result. The data component of zerodday in the data will have an additive effect on microphone data with the same phase, and an offsetting effect on data with inconsistent phase; because the target signal source is in-phase at all frequencies (in phase), so that adds up to constructive benefits, and the phase relationship of the noise source changes with frequency, so there is usually a destructive offset. Then, after the obtained signal is squared, the zero-delay signal component in the signal will be greatly aggravated. The calculated signal can use a simple energy / critical algorithm to detect whether someone is making a sound (please refer to the above accelerometers for details) VAD and the description of Figure 3), and at this time the zero delay signal component has actually been increased. Next, in block 1612, the squares of the amplitudes are added to calculate the energy of the resulting vector. In block 1614, calculate the standard deviation (SD) of the last 50 noise-only intervals, LD_STD, and its average AVE. In block 1616, the numbers AVE and SD are compared with the pre-specified minimum number 如果. If it is less than the minimum 値, J [J increases to the minimum 分别 respectively. Then in block 1618, the components of the path-finding system add AVE to a multiple of SD to calculate the critical threshold 出发. The low criticality is AVE plus 1.5 times SD, and the high criticality is AVE plus 4 times SD. In block 1620, comparing the obtained gain and criticality, there are 3 possible results. When the gain is less than the low critical threshold, then it can be determined that there is no human voice in this interval, and the 0LD_STD vector is updated to [□ Continued page (if the description page of the invention is insufficient, please note and use the continued page) 30 200304119 发明 说明 _ Page new gain 値; if the gain is greater than low critical 値 and lower than high critical 同时, you can decide that there is no human voice in this interval, but it is suspected that someone is vocal, so LD_STD is not updated to the new gain 値; When the gain is greater than the low and high thresholds, it can be determined that someone is speaking in this interval, and OLD_STD is not updated to the new gains. FIG. 17 is a graph showing sound signals 1710 and 1720 from different microphones in the AVAD system, and corresponding VAD signals 1712 and 1722 in an embodiment. The figure also shows the signal 1730 obtained by adding the audio signals 1710 and 1720 together. The speaker is placed about 30 cm from the center line of the microphone array. The noise used is truck noise, and the SNR of both microphones is less than OdB. VAD signals 1712 and 1722 can be used as input signals for path finding systems or other noise suppression systems. Conventional single-microphone VAD device / method In a system in which the noise suppression system of an embodiment uses dual microphones, the signal of one microphone is used to generate VAD information, but this is not a limitation. FIG. 18 is a block diagram of a signal processing system 1800 including a path finding noise suppression system 101 and a single microphone VAD system 102B in an embodiment. System 1800 includes a main microphone, MIC 1, or a vocal microphone, and a reference microphone, MIC 2, or a noise microphone. The main microphone MIC 1 provides signals to the VAD system 102B and the path finding system 101. The reference microphone MIC 2 provides a signal to the path finding system 101. Therefore, the signal from the main microphone MIC 1 provides vocal and noise data to the path finding system 101, and provides data to the VAD system 102B for generating VAD information. The VAD system i02B contains a VAD algorithm, as described in the inventions of U.S. Patent Nos. 4,811,404 and 5,687,243, for calculating the VAD signal, and the obtained data 104 is provided to the path finding system 101, but this The embodiment is not limited thereto, and the signal received through the reference microphone MIC 2 is only used for noise suppression. FIG. 19 is a flowchart of a method for generating audio information by using a single microphone in an embodiment. Continued page (When the description page of the invention is insufficient, please note and use the continued page) 31 200304119-Description of the invention continued page 1900. Initially, the signal was received with the main microphone in block 1902. In block 1904, the processing work related to VAD includes filtering the signal received by the main microphone to eliminate ghosts in advance, digitizing the filtered data, and processing at an appropriate sampling frequency (generally 8KHz). In block 1906, the digitized data was segmented and filtered to meet traditional VAD processing needs. In block 1908, the VAD information is calculated using the VAD algorithm, and then the VAD information is provided to the path finding system in block 1910 to eliminate noise. VAD settings / methods derived from airflow VAD devices / methods based on airflow use the airflow sent from the user's mouth and / or nasal cavity to construct VAD signals. Users can use various known methods to measure airflow. Separate airflow from breathing and gross motion flow in order to obtain the correct VAD information. Because breathing and action airflow are mostly composed of low-frequency (less than 100Hz) energy, high-pass filtering can be used to filter airflow data and separate airflow from breathing and action airflow. The equipment that can be used to measure airflow can refer to the Pneumotach Masks of Glottal Enterprise. For further information, please refer to http://www.glottal.com ° VAD devices / methods based on airflow are unlikely Affected by sound noise, because it is necessary to be very close to the mouth and nose when detecting airflow, you can refer to the above-mentioned accelerometer-based VAD and the energy / critical calculation algorithm in Figure 3 to detect the presence of human voice and Construct a VAD signal. In other embodiments of airflow-based VAD devices and / or related noise suppression systems, those skilled in the art may use other energy-based methods to generate VAD signals. FIG. 20 shows a flowchart 200 of a method for determining the presence or absence of a vocalization using a VAD based on airflow in an embodiment. Initially in block 2002, airflow data was received. In block 2004, VAD-related processing includes filtering airflow data to eliminate ghosting in advance, and digitizing the filtered data. In block 2006, the digitized data was allocated in a 20ms interval, and the data was 8ms per step. Next in block 2008, the process includes filtering the data allocated in each section to remove it.] Continued pages (when the invention description page is insufficient, please note and use the continuation page) 32 200304119

發明說明MM 低頻活動與呼吸產生的人爲聲音,其他還包含其他不想要的頻譜資訊。在區塊 2010內,計算振幅的平方的總和,算出每個區間內的能量。 在區塊2012中,計算後的能量値接著與臨界値比較,在區塊2014中,當區 間的能量等於或高於臨界値,則指定對應該氣流資料的區間爲有人發聲。在區塊 2016,有人發聲的資料訊息倍傳送到尋徑系統作爲VAD資訊,在其他雜訊抑制 系統的實施例中,可以利用多重臨界値來標示人聲訊號的相對強度與準確度,但 不在此限。 手動VAD裝置/方法 · 在一實施例中的手動VAD裝置包含可以由使用者或觀察者手動啓動,比如 說以按鈕或切換裝置啓動的VAD裝置,啓動這種手動VAD裝置,或者手動強制 修改自動化VAD裝置,可產生VAD訊號。 圖21所示爲在一實施例中,有雜訊的聲音訊號2102加上對應的手動啓動/ 計算的VAD訊號2104,還有尋徑系統利用手動VAD訊號2104來消除聲音訊號 2122的雜訊的圖表。聲音訊號2102是利用Aliph的麥克風組,裝置在一間6英尺 長,天花板尚度8英尺的房間內錄音,環境中有嘈雜的噪音。尋徑系統是採用即 _ 時處理的方式,延遲約l〇ms,由原始聲音訊號2102和消除雜訊後的聲音訊號2122 之間的差別可明顯地看出,雜訊抑制大約在25至30 dB之間,人聲訊號的失真不 大,因此,利用手動VAD資訊來消除雜訊是有效的。 熟悉此技藝者應可了解,用來處理訊號中混合聲音資訊與雜訊的電子系統, 可以借重上述的VAD裝置/方法以發揮效用。舉例來說,塞在耳朵或帶在頭上 的耳機若具有上述VAD裝置的其中之一,可以透過有線以及/或者無線連結的 方式連接手持機組(handset),比如說行動電話,而其上會提供對應的VAD演算法。 特別是’舉例來說,塞在耳朵或帶在頭上的耳機也許包含上述的皮膚表面麥克風 ]續次頁(發明說明頁不敷使用時,請註記並使用續頁) 33 發明說明續頁 200304119 (SSM),利用無線方式將訊號傳送給裝置在手持機組上的尋徑VAD 〇 尋徑雜訊抑制系統 如上所述,圖1爲一個實施例中訊號處理系統1〇〇的方塊圖示,其中包括尋 徑雜訊抑制系統101和VAD系統102。訊號處理系統100包含二支麥克風MIC 1 110和MIC 2 112,用以接收來自至少一個人聲訊號源120和至少一個雜訊源122 的訊號或資訊。從人聲訊號源120到MIC 1的路徑s(n),以及從雜訊源122到MIC 2的路徑η⑻被視爲單一(unity)。此外,Η《ζ)代表由雜訊源122到MIC 1的路徑, 而H2(z)代表由人聲訊號源120到MIC 2的路徑。 以某種方式推導出來的VAD訊號104,係用以控制雜訊移除的方法,MIC 1 所接收到的聲音資訊表示爲n^n),MIC 2所接收到的聲音資訊表示爲m2(n),在z (數位頻率)領域內,我們可以將它們表示爲Mi⑵與M2⑵,因此: MJz^S^+nWhJz) (1) M2 (z) = N(z)+S(z)H2 (z) 這種情況可適用於所有實際雙麥克風系統,MIC 1總會接收到一些雜訊,一 些訊號也會跑到MIC 2,在等式⑴有4個未知項,但是只有二個關係式,顯然不 能直接解出答案。 不過’也許有某種方法可以解決等式⑴中的一些未知項,以沒有訊號產生的 情形爲例,也就是說VAD顯示無人發聲,在這個情況下,等式(1)可以進一步表 示爲:Description of the Invention The artificial sounds produced by MM's low frequency activities and breathing, among other things, contain other unwanted spectral information. In block 2010, the sum of the squared amplitudes is calculated to calculate the energy in each interval. In block 2012, the calculated energy 値 is then compared with the critical ,. In block 2014, when the inter-regional energy is equal to or higher than the critical 値, then the section corresponding to the airflow data is designated as someone speaking. In block 2016, a voiced data message was transmitted to the path finding system as VAD information. In other embodiments of the noise suppression system, multiple critical thresholds can be used to indicate the relative strength and accuracy of the human voice signal, but not here limit. Manual VAD device / method · The manual VAD device in one embodiment includes a VAD device that can be manually activated by a user or an observer, such as a VAD device activated by a button or a switching device, activating such a manual VAD device, or manually forcing modification automation VAD device can generate VAD signal. FIG. 21 shows the noise signal 2102 with corresponding manual start / calculated VAD signal 2104 and the path finding system using the manual VAD signal 2104 to eliminate the noise of the sound signal 2122 in one embodiment. chart. The sound signal 2102 is a microphone set using Aliph. The device is recorded in a 6-foot-long room with a ceiling of 8 feet. There is noisy noise in the environment. The path-finding system uses immediate processing, with a delay of about 10ms. It can be clearly seen from the difference between the original sound signal 2102 and the noise-removed sound signal 2122 that the noise suppression is about 25 to 30. Between dB, the distortion of the human voice signal is not large, so it is effective to use manual VAD information to eliminate noise. Those skilled in this art should understand that the electronic system used to process mixed sound information and noise in the signal can take advantage of the above VAD device / method to achieve its effectiveness. For example, if the earphone or headphone has one of the above VAD devices, it can be connected to a handset, such as a mobile phone, through a wired and / or wireless connection. Corresponding VAD algorithm. In particular, 'for example, a headset plugged in the ear or a head may include the above-mentioned skin surface microphone] Continued page (If the description page is not enough, please note and use the continuation page) 33 Invention description Continued 200304119 ( SSM), which transmits signals wirelessly to the device's path finding VAD on the handheld unit. The path finding noise suppression system is as described above. Figure 1 is a block diagram of the signal processing system 100 in one embodiment, including Path finding noise suppression system 101 and VAD system 102. The signal processing system 100 includes two microphones, MIC 1 110 and MIC 2 112, for receiving signals or information from at least one human voice signal source 120 and at least one noise source 122. The path s (n) from the human voice signal source 120 to the MIC 1 and the path η⑻ from the noise source 122 to the MIC 2 are considered to be unity. In addition, Η (ζ) represents a path from the noise source 122 to the MIC 1, and H2 (z) represents a path from the human voice signal source 120 to the MIC 2. The VAD signal 104 derived in some way is a method for controlling noise removal. The sound information received by MIC 1 is represented as n ^ n), and the sound information received by MIC 2 is represented as m2 (n ), In the field of z (digital frequency), we can represent them as Mi⑵ and M2⑵, so: MJz ^ S ^ + nWhJz) (1) M2 (z) = N (z) + S (z) H2 (z ) This situation can be applied to all practical dual microphone systems. MIC 1 will always receive some noise, and some signals will also go to MIC 2. There are 4 unknown terms in equation 但是, but there are only two relations. Obviously The answer cannot be solved directly. However, there may be some way to solve some unknown terms in Equation ⑴. Take the case where no signal is generated as an example, that is, VAD shows that no one is speaking. In this case, equation (1) can be further expressed as:

Mln = M2n(z) = N(z) 其中Μ變數底下的n代表只有接收到雜訊,經過計算 續次頁(發明說明頁不敷使用時,請註記並使用續頁) 34 200304119Mln = M2n (z) = N (z) where n under the M variable means that only noise is received and calculated. Continued page (If the description page of the invention is insufficient, please note and use the continued page) 34 200304119

Mln(z) = M2n (2)^(2)Mln (z) = M2n (2) ^ (2)

Mln(z) M2n(z) 發明說明_頁 (2) 在只有雜訊存在的情況下,即可以利用任何可用的系統辨識演算法(system identification algorithms)和麥克風的輸出來計算Ηι(ζ),此一計算方式應以適應性方 法來進行,以便讓系統能夠追蹤任何雜訊的變化。 找出等式(1)中的一個未知項後,可以利用VAD來找出有人發聲而雜訊不多 的情況下圧⑵的數値,當VAD顯示有人發聲,不過最近(以1秒鐘左右或類似 爲準)麥克風的記錄顯示雜訊等級低,這時可假設n(s) = N⑵約近似於0,因此 等式(1)可以變成:Mln (z) M2n (z) Description of the invention_page (2) In the presence of noise only, we can use any available system identification algorithms and the output of the microphone to calculate Ηι (ζ), This calculation should be done in an adaptive way so that the system can track any changes in noise. After finding an unknown term in equation (1), you can use VAD to find out the number of cases where someone utters sound but not much noise. When VAD shows someone to utter, but recently (about 1 second) Or similar) The microphone record shows a low noise level. At this time, it can be assumed that n (s) = N⑵ is approximately 0, so equation (1) can become:

Mls(z) = S(z) M2s(z) = S(z)H2(z) 經過計算後 M2s(z) = M1s(z)H2(z) H2(z) = M2s(z)Mls (z) = S (z) M2s (z) = S (z) H2 (z) After calculation M2s (z) = M1s (z) H2 (z) H2 (z) = M2s (z)

Mls(z) 雖然計算ft⑵的等式看起來好像是在計算迚⑵時的等式倒過來一樣,但是請記 住所用的輸入訊號不同,要注意的是H2⑵應該相當固定,因爲總是會只有單一訊 號源(使用者),而使用者與麥克風的相對位置也應該是相當固定’如果在計算 压⑵時加入少量的適應性增益(adaptive gain),可以讓計算工作更有效’而且在雜 訊存在的狀況下同樣可以發揮強大的功能。 在計算出Η!⑵與H2(z)之後,接著用它們來移除訊號中的雜訊’將等式⑴改 寫爲· S(z) = M1(z)-N(z)H1(z) N(z) = M2(z)-S(z)H2(z) S(z) = Mj (z) - [Μ2 (z)- S(z)H2 (z)^! (z) S(z)[l - H2 (z^H, (z)] = M, (z) - M2 (z) □續次頁 (發明說明頁不敷使用時,請註記並使用續頁) 35 200304119 發明說明$賣Μ 如此可算出s⑵ (2) i-h2(z)Hi(z) 一般來說,H2⑵相當小’而Hi⑵小於1,所以對大多數情況下的大多數頻率來說: H2(z)H1(z)«1, 而訊號可以利用下列等式求得: 所以這裏假設B⑵不用計算,而乩⑵是唯一需要計算的轉移函數,雖然我們還 是可以求出B(z)的値,但是若麥克風的位置與方向擺得好,就可以省略計算h2(z) 的必要。 在處理聲音訊號時利用多重副頻帶可達到顯著的雜訊抑制效果。這是因爲用 來計算轉移函數的適應性濾波器(adaptive filter)大多數爲FIR類型,這類濾波器只 用零點(zero)而非極點(pole)來計算包含零點與極點的系統,如下所示: Η (ζ)_&gt; Β(ζ) iUl〇DELS &gt;Α(:) 在有足夠的輸入資訊時,這類模型可以很精確,但是這也會大幅提昇運算成本和 收歛時間。在以能量爲主的適應性濾波系統,比如最小均方根(LMS)系統中,一鲁 般來說系統對包含較多能量的小頻率範圍,可以計算出相當接近的振幅與相位, 這使得LMS系統可以發揮其最大的功能,將雜訊的能量(energy of error)降至最 低,不過這樣的結果可能會造成在匹配的頻率範圍外的雜訊增加,降低雜訊抑制 . 系統的有效性。 使用副頻帶的作法可以解決這個問題,來自主麥克風與次要麥克風的訊號經 過過濾變成多個副頻帶,而每個副頻帶所計算出來的資料(有可能是移頻和降頻 後的資料,視使用者需求而定)則傳送至個別的適應性濾波器,這麼一來適應性 濾波器必須想辦法在自己的副頻帶中匹配資料,而不是只找出在訊號中那個地方 D續次頁(發明說明頁不敷使用時,請註記並使用續頁) 36 200304119 發明說明 的能量最高,在每個副頻帶所得的訊號抑制結果可以加總得到最後消除雜訊後的 訊號。讓每個訊號都能配合時間以及補償濾波器飄移的工作並不容易,但是對系 統來說所得到的結果較佳,只不過在記憶體和處理器方面的需求較高。 乍看之下,尋徑演算法似乎很接近其他的演算法,比如說傳統的ANC (適應 性雜訊消除),如圖2所示。不過經過仔細比較後,可以看到有幾個地方造成雜 訊抑制效果的差異,包括使用VAD資訊來控制雜訊抑制系統對接收到的訊號的 調整,使用多個副頻帶以確保在目標頻譜上達到適當的收歛,還有支援系統的參 考麥克風對目標聲音訊號進行操作,以下將逐一說明。 關於使用VAD資訊來控制雜訊抑制系統對接收到的訊號的調整,傳統的 ANC並不使用VAD資訊,因爲在有人發聲的過程中,參考麥克風會接收到訊號, 因此在這段期間調整a(z)(由雜訊到主麥克風的路徑)的係數會造成目標訊號的 人聲能量被大幅移除,結果是訊號失真與消除(去訊號化)。因此,在以上所述 的各種方法,應用VAD資訊來建構相當精確的VAD,以便通知尋徑系統何時需 調整&amp; (僅包含雜訊)與H2 (在有人發聲時,有需要可採用)的係數。 如上所述,尋徑系統和傳統ANC有個重大的差別在於切割聲音資料成副頻 帶的動作。尋徑系統應用許多副頻帶,並應用LMS演算法計算個別副頻帶內的 資訊,因此可確保所有副頻帶內的資料經加總後可以適當地收歛,讓尋徑系統在 目標頻譜中發揮良好的效能。 由於ANC演算法通常會用到LMS適應性濾波器來建立迚的模型,而此一模 型利用所有的零點來建構濾波器,一個「實際」可用的系統不太可能利用這種模 型達到精確的結果。可用的系統幾乎都會包含極點(poles)和零點(zeros),因此和 LMS濾波器所計算而得的頻率響應大不相同。通常LMS只能在單一頻率上匹配 真實系統的相位和振幅,而在這個頻率以外所得的結果就很差,也會造成其他 地區的雜訊升高。所以針對目標訊號的整個頻譜使用LMS演算法,往往會造成 ]續次頁(發明說明頁不敷使用時,請註記並使用續頁) 37 200304119 _ 發明說明續頁 目標訊號在振幅/相位匹配不佳的地方產生訊號劣化的情形。 最後,尋徑系統支援系統的參考麥克風對目標聲音訊號進行操作,讓參考麥 克風可以接收聲音訊號,代表麥克風之間的位置可以靠得更近,和傳統ANC的 麥克風配置不一樣。較近的間距可以簡化適應性濾波計算,並且讓麥克風配置/ 解決方案更爲簡便。同時,現在已經有特殊的麥克風配置可以將訊號失真與去訊 號化的程度降至最低,並可建立目標信號源與參考麥克風之間的訊號路徑模型。 在一個實施例中,使用指向性麥克風可確保轉移函數不會接近1,不過即使 是採用指向性麥克風,雜訊麥克風還是會收到一些訊號,如果將它忽略,並且假 _ 定ft(z) = 0,那麼假設VAD爲理想的狀態下,會有一些失真存在。我們可以參考 * 等式⑵並且在H2(z)不包含在內的情況下,求出以下的等式: . S(z)[l - H2 (z)H1 (z)] = M, (z)- M2 (z)^ (z). (4) 這表示訊號會有一個[1 - H2(z)ft⑵]因數的失真,所以,失真的類型和失真量會受 到雜訊環境影響而改變,在雜訊非常小的環境下,H《z)大約是0,而失真也不大, 當有雜訊存在時,失真量會受到雜訊源的類型、位置以及強度而改變,好的麥克 風配置設計可以把這類失真降低至最低程度。 當VAD顯示無人發聲,或有人說話但是副頻帶的SNR相當低時,可以計算 鲁 出ft。同樣的,當VAD顯示有人說話而副頻帶的SNR又相當高的時候,可以算 出H2。不過,在適當的麥克風擺設和處理下,訊號失真的問題可以減至最低,而 我們只需計算B的値,這麼一來可大幅降低所需的處理工作,簡化尋徑演算法的 建構。傳統ANC的作法是不讓任何訊號跑到MIC 2,尋徑演算法在採用適當的麥 . 克風配置的情況下,可以容許MIC 2接收到訊號。一個適當的麥克風配置的實施 例如圖11所示,其中用了 2個心型曲線單指向性麥克風MIC 1和MIC 2。這種配 置讓MIC 1指向使用者的嘴巴,此外,這種配置讓MIC 2盡可能的靠近MIC 1, 而且讓MIC 2的指向跟MIC 1差了 90度。 〇續次頁(發明說明頁不敷使用時,請註記並使用續頁) 38 200304119 -—Π 發明說明續頁 或許要證明雜訊抑制跟VAD有關的最佳方式,是在VAD失誤的情況下,檢 查VAD錯誤對消除雜訊動作的影響。可能發生的2種錯誤包括僞陽性(False positives,FP),也就是在無人發聲發生的情況下VAD顯示有人在說話;還有僞陰 性(False negatives,FN),也就是VAD並沒有偵測到有人發聲。如果FP發生的次數 太頻繁會變得很麻煩,這是因爲偶然發生的FP只會短暫地停止系統更新IL的係 數,所以不會明顯的影響到雜訊抑制系統的表現,從另一個角度來看,FN可能 會造成問題,特別在未偵測到的人聲具有高SNR時更是如此。 假設系統中的麥克風都有接收到人聲和雜訊,但VAD未能偵測出訊號,並 _ 傳回一個FN,所有系統只有偵測到雜訊,則此時MIC 2所接收到的訊號爲: - M2 =h1n + h2s, 其中不標出z是爲了清楚表達之故,由於VAD顯示只有雜訊存在,因此系統會 根據以下的等式,試著將上述的系統表示爲單一雜訊與單一轉移函數的模型: TF model = 1¾^. 尋徑系統採用LMS演算法來計算氐,但是LMS演算法一般最好是用在建立非時 變、全零點(time-invariant,all-zero)的系統模型上。由於雜訊與人聲訊號不太可能有 關連性,所以一般來說,系統會根據MIC 1所接收的資料的SNR,建立乩與H2 ® 模型的能力,以及还與乩的非時變性,建立人聲與相關轉移函數的模型,或者 是雜訊與相關轉移函數的模型,如下所述。 關於MIC 1的資料的SNR,非常低的SNR (比zero⑼還小)傾向讓尋徑系統 收歛爲雜訊轉移函數,相形之下,高SNR (比(0)還小)傾向讓尋徑系統收歛爲人 聲轉移函數。至於在建立扎與H2模型的能力上,如果用LMS (all-zero模型)比 較容易建立ft或迅,則尋徑系統傾向會建立那樣的個別轉移函數。 系統建立模型時與还與压,的非時變性有關,而LMS系統在建立非時變系 統上表現最佳,不過就經驗來看,如果在時變系統中,系統變化比LMS的收歛 □續次頁(發明說明頁不敷使用時,請註記並使用續頁) 39 200304119 發明說明續頁 時間要慢,那麼也可以用LMS系統來建立時變系統的模型。所以,尋徑系統通 常會收歛至a,因爲通常a的變化速度比托要來得慢。 如果LMS選擇利用雜訊轉移函數來建立人聲轉移函數,那麼人聲會被歸類 爲雜訊,只要LMS濾波器的係數維持一定或相近,人聲就會被移除。所以,當 尋徑系統收歛至人聲轉移函數的ft (可以在幾個ms的時間內發生)的模型,任 何後續的人聲(即使VAD功能正常下所偵測到的人聲),都會因爲系統認爲它的 轉移函數和VAD發生失誤時所建立的轉移函數模型相似,使得能量被移除。在 這種情況下,主要是建立H2的模型,所以雜訊不受影響,或只有一部分被移除。φ 這個過程的最後結果是音量變小,而且原有乾淨的人聲會失真,失真的程度 . 跟以上所討論的變數有關係。如果系統傾向收歛至,那麼後續的增益損失和人 _ 聲失真就不會太明顯,如果系統傾向於收歛至H2,那就有可能會嚴重的失真。 VAD失誤分析並不是用來描述使用副頻帶的細節,或者是麥克風的位置、類型和 指向等,主要是要傳達VAD對於消除雜訊的重要性,以上所得的結果可以用在 單一副頻帶或任意數目的副頻帶上,因爲每個副頻帶內的動作其實是一樣的。 此外,在以上VAD失誤分析說明中,對VAD的依賴性和VAD失誤所造成 的問題並不限於尋徑雜訊抑制系統,任何使用VAD來決定如何去除雜訊的適應 籲 性濾波器雜訊抑制系統同樣也會受到影響。此處所揭示的內容是參考尋徑抑制系 統,不過要記得的是所有可以使用多隻麥克風來估計雜訊波型,並將雜訊由包含 人聲與雜訊的訊號中去除的雜訊抑制系統,還有依賴VAD以提供可靠操作的雜 . 訊抑制系統都在本發明的範圍內,尋徑系統只是爲方便之故所參考的實際系統。 本發明的各方面可以程式化爲各種電路中的功能,包括可編程邏輯裝置 (PLD),像是場可編程閘陣列(FPGA)、可編程閘陣列邏輯(PAL)裝置、電子可編程 邏輯和記憶體裝置以及標準細胞兀裝置,其他還包括特定應用積體電路(ASIQ 〇 其他可以滿足本發明的某些方面的裝置包括:附有記憶體的微控制器 [□續次頁(發明說明頁不敷使用時,請註記並使用續頁) 200304119 發明說明續頁 (microcontroller),比如說電子式可抹除之可編程胃®EEPRQM), 處理器、韌體、軟體等等。如果)。如果本發明的某些方面在製造的過程中至少有 一個步驟是以軟體來實施,例如內嵌至韌體或放在PLD中,那麼該軟體應該可以 用任何電腦可讀媒介,比如說磁性或光學可讀碟片(固定式或軟體),利用載體 訊號調變或其他的方式傳輸。 此外,本發明的各個方面也可以利用微處理器模擬軟體電路、離散邏輯(序 列式或組合式)、自定裝置、模糊(神經)邏輯、量子裝置以及上述各種裝置類 型的混合加以實施。當然這些裝置所用的技術也可以用在各種元件,例如金氧半 導體場效應電晶體(M0SFET)技術,像是互補金氧半導體;雙載子技術如射極耦合 邏輯(ECL);聚合物技術如結合矽的聚合物(silicon-conjugated polymer)以及結合金屬 的聚合物一金屬結構;混合類比與數位電路等。 除非特別提及,否則在詳細說明與申請專利範圍中所使用的包含(comprise, comprising)與其他類似的用語是用來包含可能的樣態,而不是用來排除其他的可 能,也就是說,應該是「包含,而不限定於」的用法。使用單數或複數的名詞同 樣也包括複數與單數的方式。另外/在此(herein)」、「在此之下(hereunder)」、 「以上(above)」、「以下(below)」或其他相似的敘述用語,應用在本專利申請案 中,應該參考本案全體而不是特定的敘述部份。若在本專利申請案中有用到「或 (or)」表示提到2或更多個項目的時候,表示有下列的可能,任何在名單中的項目, 名單中所有的項目以及名單中各種項目的組合。 以上對本發明實施例的敘述不是用來窮舉或是限制本發明在所揭示的範圍 內,本發明的各種特定實施例以及所舉的範例是用來提供說明,熟悉此技藝者當 可了解,在本發明中可以進行各種等效的修改,在此對發明所作的說明可以應用 到其它的處理系統和通訊系統,並不一定要限制在上述的處理系統上。 以上各個實施例所提到的元件以及動作可經過組合以衍生出進一步的實施方 ]續次頁(發明說明頁不敷使用時,請註記並使用續頁) 200304119 -- 發明說明續頁 式,可根據以上的詳細說明對本發明做各種的改變。 --— 本發明引用上述的說明與相關美國專利申請案作爲參考,若有需要,本發明 的各個方面也可以經過修改,以便應用上述專利與申請案所提出的系統、功能以 及槪念,提供進一步的實施方式。 一般來說,在接下來的專利申請範圍中所用的術語,不應該被用來推斷爲將 本發明限定在詳細說明與申請專利範圍所揭示的特定實施例中,而應該用來推斷 包含所有在申請專利範圍下運作的系統,以提供壓縮和解壓縮資料檔案或資料流 的用途。因此,本發明並不限定於所揭示的內容,應該完全以申請專利範圍來決 定本發明的範疇。 雖然本發明的某些方面已由接下來的專利申請像加以闡述,發明人仍考慮到 本發明可以用任意數目的專利申請項形式予以表達,舉例來說,雖然本發明中只 有在某一方面提到使用電腦可讀取的媒介,但是其他方面也可以利用電腦可讀取 的媒介予以實施,因此,發明人保留增加額外專利申請範圍的權利,以便在提出 專利申請之後,針對本發明的其他方面再提出額外的專利申請項。 【圖式簡單說明】 圖1係繪示在一實施例中,包含尋徑雜訊抑制系統與VAD系統的訊號處理 系統的方塊圖。 圖1A係繪示在一實施例中的VAD系統,其中包含用以接收與處理與VAD 有關的訊號的硬體。 圖1B係繪示在另一實施例中的VAD系統,其利用相關的雜訊抑制系統的硬 體來接收VAD資訊。 圖2係訊號處理系統的方塊圖,其中應用了習知的傳統適應性雜訊消除系統。 □續次頁 (發明說明頁不敷使用時,請註記並使用續頁) 42 200304119 發明說明續頁 圖3係在一實施例中,利用加速計爲主的VAD來決定有無人聲發音的方法 的流程圖。 圖4所示爲在一實施例中,有雜訊的聲音訊號(現場錄音)加上對應的加速 計爲主的VAD訊號,對應的加速計輸出訊號,還有尋徑系統利用VAD訊號來消 除聲音訊號的雜訊的圖表。 圖5所示爲在一實施例中,有雜訊的聲音訊號(現場錄音)加上對應的SSM 爲主的VAD訊號,對應的SSM輸出訊號,還有尋徑系統利用VAD訊號來消除 聲音訊號的雜訊的圖表。 圖6所示爲在一實施例中,有雜訊的聲音訊號(現場錄音)加上對應的GEMS 爲主的VAD訊號,對應的GEMS輸出訊號,還有尋徑系統利用VAD訊號來消除 聲音訊號的雜訊的圖表。 圖7所示爲在一實施例中,已錄的有人發聲資料加上數位加入的雜訊,還有 對應的EGG爲主的VAD訊號,以及對應的高通濾波EGG輸出訊號。 圖8所示爲在一實施例中,利用視訊爲主的VAD來決定有人發聲的方法的 流程圖800。 圖9所示爲在一實施例中,有雜訊的聲音訊號(現場錄音)加上對應的單一 (梯度)麥克風爲主的VAD訊號,對應的梯度麥克風(gradient microphone)輸出訊 號,還有尋徑系統利用VAD訊號來消除聲音訊號的雜訊的圖表。 圖1〇所示爲在一實施例中,單一心型曲線單一指向性麥克風的陣列,加上 相關的空間響應曲線的圖示。 圖11所示爲在一實施例中,一 PVAD系統的麥克風陣列。 圖12所示爲在另一 PVAD實施例中,利用增益値決定有無人發聲的方法的 流程圖。 續&amp;胃(發明說明頁不敷使用時,請註記並使用續頁) 43 200304119 - 發明說明續頁 圖13所示爲在一實施例中,有雜訊的聲音訊號(現場錄音)加上對應的麥 克風爲主的PVAD訊號,對應的PVAD增益相對於時間訊號,還有尋徑系統利用 PVAD訊號來消除聲音訊號的雜訊的圖表。 圖14所示爲在一實施例中,利用立體聲VAD決定有無人發聲的方法的流程 圖。 圖15所示爲在一實施例中,有雜訊的聲音訊號(現場錄音)加上對應的SVAD 的訊號,還有尋徑系統利用SVAD訊號來消除聲音訊號的雜訊的圖表。 圖16所示爲在一實施例中,利用AVAD決定有無人發聲的方法的流程圖。 圖17所示爲在一實施例中,來自AVAD系統的每一隻麥克風的聲音訊號力口 上對應的合倂能量訊號的圖表。 圖18所示爲在一實施例中,包含尋徑雜訊抑制系統與單一麥克風VAD系統 的訊號處理系統的方塊圖。 圖19所示爲在一實施例中,利用單一麥克風產生語音資訊的方法的流程圖。 圖20所示爲在一實施例中,利用空氣流爲主的VAD決定有無人發聲的方法 的流程圖。 圖21所示爲在一實施例中,有雜訊的聲音訊號加上對應的手動啓動/計算 的VAD訊號,還有尋徑系統利用手動VAD訊號來消除聲音訊號的雜訊的圖表。 在圖示中,相同的參考數目係標示相同或實質上相似的元件或動作。爲了方便找 出任何特定元件或動作的討論描述,最明顯的數字或參考數字中的數字係參考該 元件第一次所出現的圖號,比如說元件104係在圖1中第一次出現和討論。 【元件符號說明】 100訊號處理系統 1402接收訊號 □續次頁 (發明說明頁不敷使用時,請註記並使用續頁) 44 200304119 -發明說明續頁 102VAD系統 104VAD訊號 110麥克風MIC 1 112麥克風MIC 2 120語音訊號源 122雜訊源 102AVAD 系統 130VAD裝置 140VAD演算法 102BVAD 系統 150VAD演算法 164 VAD資訊 302接收加速計資料 304過濾加速計資料與數位化 306分割與步級化數位化資料 30δ移除被雜訊干擾的頻譜資訊 310計算出每個區間內的能量 312比較能量値與臨界値 314高於臨界値的能量顯示爲有人發聲 310低於臨界値的能量顯示爲無人發聲 802找出目標的臉和發聲器官 804計算出發聲器官的動作 806動作是否比臨界速度快且在振動? 808動作是否比臨界値大? 810目標正在發聲 1404分配MIC 1和MIC2的樣本 1406 計算 FFT1 與 FFT2 1408計算FFT1與KFT2的大小 1410計算FFT1與FFT2的指數平均 1412由上個步驟的値計算比(FFT_mtio)及 其平均値 1414比較平均値與臨界値,設定VAD狀 態 1416決定VAD_state是否等於1 1417更新參數,記鍵續發聲下最高的平 均値 1418重設參數,如果在有人發聲的區間後 出現第一個無人發聲的區段,貝[J檢查 先前有人發聲是否爲誤判。 1420計算高與低能量等級,計算新的發聲 臨界値,如果恰當的話加上絕對最小 臨界値。 1602接收麥克風信號/資料 1604過濾麥克風資料與數位化 1606分割與步級化數位化資料 1608過濾分配於區間內的資料 1610將區間內的資料相加,所得結果再予 以平方 1612計算所得向量的能量 □續次頁 (發明說明頁不敷使用時,請註記並使用續頁) 45 200304119 -發明說明胃 812將資訊傳遞至尋徑系統 1002單指向性麥克風MIC 1 1010 MIC 1的空間響應曲線 1012無頻率響應區 1014極大値 1100麥克風陣列 1102單指向性麥克風MIC 2 1110 MIC 2的空間響應曲線 1120第一半球 1130第二半球 1202接收系統麥克風資料 1204過濾訊號與數位化 1206分割、步級化與過濾數位化資料 1208計算標準變異和過去增益値的平均 1210根據最小値檢查標準變異與平均,並 進行調整 1212計算有人發聲的臨界値 1214計算區間內的能量和增益 1216比較臨界値與增益,決定區間內的資 料是有人發聲或無人發聲 1614計算過去能量數値的標準變異與平 均 1616拿標準變異及平均與最小數値比 較,並予以調整 1618計算發聲臨界値 1620比較能量和臨界値,並決定區間內的 資料係有人或無人發聲 1800訊號處理系統 1902接收來自主麥克風的訊號 1904過濾所接收到的訊號與數位化 1906分割與過濾數位化資料 1908計算VAD資訊 1910提供VAD資訊給雜訊抑制系統 2002接收氣流資料 2004過濾氣流資料以消除鬼影並數位化 2006收集下一個20ms區間的資料 2008過濾掉不想要的氣流頻譜 2010計算區間內的能量 2012區間能量是否大於臨界値 2014有人發聲 2016將資訊傳送給尋徑雜訊抑制系統Mls (z) Although the equation for calculating ft⑵ looks like the equation is reversed when calculating 迚 ⑵, remember that the input signals used are different. It should be noted that H2 相当 should be quite fixed, because there will always be only A single signal source (user), and the relative position of the user and the microphone should also be fairly fixed. 'If a small amount of adaptive gain is added when calculating the pressure, the calculation can be made more efficient' and the noise It can also play a powerful role in the existing conditions. After calculating Η! ⑵ and H2 (z), then use them to remove noise from the signal. 'Rewrite equation ⑴ as · S (z) = M1 (z) -N (z) H1 (z) N (z) = M2 (z) -S (z) H2 (z) S (z) = Mj (z)-[Μ2 (z)-S (z) H2 (z) ^! (Z) S (z ) [l-H2 (z ^ H, (z)] = M, (z)-M2 (z) □ Next page (when the description page of the invention is insufficient, please note and use the continued page) 35 200304119 Description of the invention $ Selling M can calculate s⑵ (2) i-h2 (z) Hi (z) Generally speaking, H2⑵ is quite small 'and Hi⑵ is less than 1, so for most frequencies in most cases: H2 (z) H1 (z) «1, and the signal can be obtained using the following equation: So it is assumed here that B⑵ is not calculated, and 乩 ⑵ is the only transfer function that needs to be calculated. Although we can still find 値 for B (z), but if the microphone If the position and direction of the signal are well placed, the need to calculate h2 (z) can be omitted. When processing sound signals, multiple sub-bands can be used to achieve significant noise suppression. This is because of the adaptive filter used to calculate the transfer function (adaptive filter) is mostly FIR type, this type of filter uses only zero (zero) instead of pole (count) A system containing zeros and poles is as follows: Η (ζ) _ &gt; Β (ζ) iUl〇DELS &gt; Α (:) This model can be accurate when there is sufficient input information, but this can also be significant Improve computing cost and convergence time. In energy-based adaptive filtering systems, such as the minimum root mean square (LMS) system, the system can be calculated to be fairly close to a small frequency range containing more energy. Amplitude and phase, which allows the LMS system to perform its maximum function, minimizing the energy of error, but this result may increase the noise outside the matched frequency range and reduce the noise. Signal suppression. The effectiveness of the system. The use of subbands can solve this problem. The signals from the primary and secondary microphones are filtered into multiple subbands, and the data calculated for each subband (possibly a shift Frequency and down-converted data, depending on user needs) are sent to individual adaptive filters, so the adaptive filter must find a way to match in its own sub-band Data, not just finding where in the signal the D continuation page (when the invention description page is insufficient, please note and use the continuation page) 36 200304119 The invention description has the highest energy, and the signal suppression results obtained in each sub-band Can sum up to get the signal after the last noise is eliminated. It is not easy to make each signal match the time and compensate the filter drift, but the result is better for the system, but only in the memory and processing The demand on the device is high. At first glance, the path finding algorithm seems to be very close to other algorithms, such as traditional ANC (Adaptive Noise Reduction), as shown in Figure 2. However, after careful comparison, we can see that there are differences in noise suppression effects in several places, including the use of VAD information to control the adjustment of the received signal by the noise suppression system, and the use of multiple sub-bands to ensure that the target spectrum To achieve proper convergence, and the reference microphone of the support system to operate the target sound signal, the following will be explained one by one. Regarding the use of VAD information to control the adjustment of the received signal by the noise suppression system, traditional ANC does not use VAD information, because in the process of someone speaking, the reference microphone will receive the signal, so adjust a ( z) (the path from noise to the main microphone) will cause the vocal energy of the target signal to be largely removed, with the result that the signal is distorted and eliminated (de-signalized). Therefore, in the various methods described above, the VAD information is used to construct a fairly accurate VAD in order to inform the path finding system when to adjust &amp; (contains only noise) and H2 (when someone speaks, it can be used if necessary). coefficient. As mentioned above, a significant difference between the path finding system and the conventional ANC is the action of cutting sound data into sub-bands. The path-finding system uses many sub-bands, and the LMS algorithm is used to calculate information in individual sub-bands. Therefore, it can ensure that the data in all sub-bands can be properly converged after being summed up, allowing the path-finding system to play a good role in the target spectrum efficacy. Since the ANC algorithm usually uses the LMS adaptive filter to build the 迚 model, and this model uses all the zeros to construct the filter, an "actual" usable system is unlikely to use this model to achieve accurate results . Almost all available systems contain poles and zeros, and therefore are quite different from the frequency response calculated by the LMS filter. Generally, LMS can only match the phase and amplitude of a real system at a single frequency, and the results obtained outside this frequency are poor, and noise in other regions will also increase. Therefore, using the LMS algorithm for the entire spectrum of the target signal often results in] Continued pages (when the description page of the invention is inadequate, please note and use the continued page) 37 200304119 _ Description of the continued page The target signal does not match in amplitude / phase Signal degradation occurs in the best place. Finally, the reference microphone of the path finding system support system operates the target sound signal so that the reference microphone can receive the sound signal, which means that the positions of the microphones can be closer, which is different from the traditional ANC microphone configuration. Closer distances simplify adaptive filtering calculations and make microphone configuration / solution easier. At the same time, there are special microphone configurations that can minimize signal distortion and de-signalization, and can establish a signal path model between the target signal source and the reference microphone. In one embodiment, the use of a directional microphone can ensure that the transfer function will not approach 1. However, even with a directional microphone, the noise microphone will still receive some signals. If you ignore it, and false _ fixed ft (z) = 0, then assuming VAD is ideal, there will be some distortion. We can refer to the * equation ⑵ and find the following equations without H2 (z): S (z) [l-H2 (z) H1 (z)] = M, (z )-M2 (z) ^ (z). (4) This means that the signal will have a distortion of [1-H2 (z) ft⑵] factor, so the type and amount of distortion will be affected by the noise environment. In the environment with very little noise, H <z is about 0, and the distortion is not large. When there is noise, the amount of distortion will be changed by the type, location and intensity of the noise source. Good microphone configuration The design minimizes this type of distortion. When VAD shows that no one is speaking, or someone is talking but the SNR of the sub-band is quite low, we can calculate ft. Similarly, when VAD shows that someone is talking and the SNR of the sub-band is quite high, H2 can be calculated. However, with proper microphone placement and processing, the problem of signal distortion can be minimized, and we only need to calculate the 値 of B. This can greatly reduce the processing required and simplify the construction of the path finding algorithm. The traditional ANC method is to not allow any signal to go to MIC 2. The path-finding algorithm can allow the MIC 2 to receive the signal under the condition of proper microphone and microphone configuration. An example of a suitable microphone configuration is shown in Fig. 11, where two cardioid curve unidirectional microphones MIC 1 and MIC 2 are used. This configuration allows the MIC 1 to point at the user's mouth. In addition, this configuration allows the MIC 2 to be as close to the MIC 1 as possible, and makes the MIC 2 pointing 90 degrees away from the MIC 1. 〇 Continued pages (If the description page of the invention is insufficient, please note and use the continuation page) 38 200304119 --- Π The description page of the invention may prove that the best way to suppress noise related to VAD is in the case of VAD errors. , Check the effect of VAD error on noise elimination action. Two types of errors that can occur include false positives (FP), that is, when no vocalization occurs, VAD shows that someone is talking; and false negatives (FN), that is, VAD is not detected Someone made a noise. If the FP occurs too often, it will become very troublesome. This is because the accidental FP will only temporarily stop the system from updating the coefficient of IL, so it will not significantly affect the performance of the noise suppression system. From another perspective See, FN can cause problems, especially when undetected human voices have high SNR. Assume that the microphone in the system has received human voice and noise, but VAD fails to detect the signal, and _ returns an FN. All systems only detect noise, then the signal received by MIC 2 is :-M2 = h1n + h2s, where z is not marked for clarity. Because VAD shows that only noise exists, the system will try to represent the above system as a single noise and a single noise according to the following equation. Model of the transfer function: TF model = 1¾ ^. Path finding systems use LMS algorithms to calculate 氐, but LMS algorithms are generally best used to build time-invariant (all-zero) systems On the model. Because noise and human voice signals are unlikely to be related, in general, the system will build the ability of the 乩 and H2 ® models based on the SNR of the data received by MIC 1, and also the non-temporal change of 乩 to create the human voice. Models with related transfer functions, or models with noise and related transfer functions, are described below. Regarding the SNR of MIC 1, the very low SNR (smaller than zero) tends to converge the path-finding system to a noise transfer function. In contrast, the high SNR (smaller than (0)) tends to converge the path-finding system. Is a vocal transfer function. As for the ability to build the model with H2, if it is easier to build ft or fast with LMS (all-zero model), the path-finding system tends to build such individual transfer functions. The model of the system is related to the non-time-varying nature of the pressure, and the LMS system performs best in building the non-time-varying system. However, empirically, if the time-varying system, the system change is more convergent than the LMS. Continued The next page (please note and use the continuation page when the description page of the invention is insufficient) 39 200304119 The description page is slow, so you can also use the LMS system to build a time-varying system model. Therefore, the path-finding system usually converges to a, because a usually changes at a slower rate than To. If the LMS chooses to use the noise transfer function to establish a vocal transfer function, then the vocal will be classified as noise, and as long as the coefficient of the LMS filter remains constant or similar, the vocal will be removed. Therefore, when the path-finding system converges to the ft model of the vocal transfer function (which can occur within a few ms), any subsequent vocals (even if the vocals are detected under normal VAD function), will Its transfer function is similar to the transfer function model established when VAD fails, so that energy is removed. In this case, the H2 model is mainly established, so noise is not affected, or only a part is removed. The final result of this process is that the volume becomes smaller, and the original clean human voice will be distorted. The degree of distortion is related to the variables discussed above. If the system tends to converge, the subsequent gain loss and vocal distortion will not be too obvious. If the system tends to converge to H2, then there may be severe distortion. VAD error analysis is not used to describe the details of using the sub-band, or the position, type, and orientation of the microphone. It is mainly to convey the importance of VAD for eliminating noise. The results obtained above can be used in a single sub-band or arbitrary The number of sub-bands, because the action in each sub-band is actually the same. In addition, in the above VAD error analysis description, the dependence on VAD and the problems caused by VAD errors are not limited to the path finding noise suppression system. Any adaptive filter noise suppression using VAD to determine how to remove noise The system is also affected. The content disclosed here is a reference path suppression system, but remember that all noise suppression systems that can use multiple microphones to estimate the noise pattern and remove the noise from the signal including human voice and noise, There are also clutter suppression systems that rely on VAD to provide reliable operation, which are all within the scope of the present invention. The path finding system is just an actual system that is referenced for convenience. Aspects of the invention can be programmed into functions in various circuits, including programmable logic devices (PLDs), such as field programmable gate arrays (FPGAs), programmable gate array logic (PAL) devices, electronic programmable logic, and Memory devices and standard cell devices, others also include application-specific integrated circuits (ASIQ 〇 Other devices that can meet certain aspects of the present invention include: Microcontroller with memory [□ Continued page (Invention page When not enough, please note and use the continuation sheet) 200304119 Description of the invention Continuation sheet (microcontroller), such as electronic erasable programmable stomach (EEPRQM), processor, firmware, software, etc. in case). If at least one step of some aspects of the invention is implemented in software during the manufacturing process, such as embedded in firmware or placed in PLD, then the software should be able to use any computer-readable medium, such as magnetic or Optically readable disc (fixed or software), using carrier signal modulation or other transmission methods. In addition, various aspects of the present invention can also be implemented using a microprocessor to simulate software circuits, discrete logic (sequential or combined), custom devices, fuzzy (neural) logic, quantum devices, and a mixture of the various device types described above. Of course, the technology used in these devices can also be used in various components, such as metal-oxide-semiconductor field-effect transistor (MOSFET) technology, such as complementary metal-oxide-semiconductor; bipolar technology such as emitter-coupled logic (ECL); polymer technology such as Silicon-conjugated polymer and metal-polymer-metal structure; mixed analog and digital circuits. Unless specifically mentioned, the terms "comprise, comprising" and other similar terms used in the detailed description and the scope of the patent application are used to include possible forms, not to exclude other possibilities, that is, It should be "include, not limited to" usage. The use of the singular or plural nouns also includes the plural and singular forms. In addition, "herein", "hereunder", "above", "below" or other similar narrative terms are applied in this patent application, and this case should be referred to The whole rather than a specific narrative. If "or (or)" is used in this patent application to refer to 2 or more items, it means that there are the following possibilities, any item in the list, all items in the list, and various items in the list The combination. The above description of the embodiments of the present invention is not intended to be exhaustive or to limit the present invention within the disclosed scope. Various specific embodiments of the present invention and the examples given are provided to provide explanations. Those skilled in the art will understand that Various equivalent modifications can be made in the present invention, and the description of the invention herein can be applied to other processing systems and communication systems, and is not necessarily limited to the above-mentioned processing systems. The elements and actions mentioned in each of the above embodiments can be combined to derive further implementations.] Continued pages (when the invention description page is insufficient, please note and use the continuation page) 200304119-invention description sequel, Various changes can be made to the present invention based on the above detailed description. --- The present invention refers to the above description and related US patent applications for reference. If necessary, various aspects of the present invention can also be modified in order to apply the systems, functions, and ideas proposed in the above patents and applications, to provide A further embodiment. In general, the terms used in the scope of the subsequent patent applications should not be used to infer the invention to be limited to the specific embodiments disclosed in the detailed description and the scope of the patent application, but should be used to infer that all A system operating under the scope of the patent application to provide the purpose of compressing and decompressing data files or streams. Therefore, the present invention is not limited to what is disclosed, and the scope of the present invention should be determined entirely by the scope of patent application. Although certain aspects of the present invention have been described by the following patent applications, the inventors have considered that the present invention can be expressed in the form of any number of patent applications, for example, although only one aspect of the invention Mentioned the use of computer-readable media, but other aspects can also be implemented using computer-readable media. Therefore, the inventor reserves the right to increase the scope of additional patent applications, so that after filing a patent application, It also filed additional patent applications. [Brief description of the drawings] FIG. 1 is a block diagram of a signal processing system including a path finding noise suppression system and a VAD system in an embodiment. FIG. 1A illustrates a VAD system according to an embodiment, which includes hardware for receiving and processing VAD-related signals. FIG. 1B illustrates a VAD system in another embodiment, which uses the hardware of the related noise suppression system to receive VAD information. FIG. 2 is a block diagram of a signal processing system in which a conventional conventional adaptive noise cancellation system is applied. □ Continued pages (note the insufficient pages of the invention, please note and use the continuation pages) 42 200304119 Continued pages of the invention Fig. 3 is an embodiment of the method of using the accelerometer-based VAD to determine the presence or absence of vocal pronunciation flow chart. Figure 4 shows the noise signal (live recording) plus the corresponding accelerometer-based VAD signal, the corresponding accelerometer output signal, and the path finding system using the VAD signal to eliminate Chart of noise of sound signal. Figure 5 shows an example of a noisy sound signal (live recording) plus the corresponding SSM VAD signal, the corresponding SSM output signal, and the path finding system uses the VAD signal to eliminate the sound signal. Noise chart. Figure 6 shows an example of a noisy sound signal (live recording) plus a corresponding GEMS VAD signal, a corresponding GEMS output signal, and a path finding system using the VAD signal to eliminate the sound signal Noise chart. FIG. 7 shows the recorded human voice data plus digitally added noise in one embodiment, the corresponding EDG-based VAD signal, and the corresponding high-pass filtered EGG output signal. FIG. 8 is a flowchart 800 of a method for determining a voice of a person using a video-based VAD in an embodiment. FIG. 9 shows a noise signal (live recording) plus a corresponding single (gradient) microphone-based VAD signal, and a corresponding gradient microphone output signal. The path system uses the VAD signal to eliminate noise from the sound signal. FIG. 10 is a diagram showing an array of a single cardioid curve and a single directional microphone, plus related spatial response curves in an embodiment. FIG. 11 shows a microphone array of a PVAD system in an embodiment. Fig. 12 is a flowchart of a method for determining whether or not a person makes a sound by using gain 値 in another PVAD embodiment. Continue &amp; Stomach (Note if the Instruction Sheet is inadequate, please note and use the Continue Sheet) 43 200304119-Description of the Invention Continued Figure 13 shows the noise signal (live recording) with noise in one embodiment The corresponding microphone is the main PVAD signal, the corresponding PVAD gain is relative to the time signal, and there is a graph of the path finding system using the PVAD signal to eliminate the noise of the sound signal. Fig. 14 is a flow chart showing a method for determining the presence or absence of sound by using a stereo VAD in one embodiment. FIG. 15 is a diagram showing noise signals (live recording) with corresponding SVAD signals and a path finding system using SVAD signals to eliminate noise from the sound signals in an embodiment. FIG. 16 is a flowchart of a method for determining the presence or absence of utterance using AVAD in an embodiment. FIG. 17 is a diagram showing a corresponding combined energy signal on the sound signal port of each microphone of the AVAD system in an embodiment. FIG. 18 is a block diagram of a signal processing system including a path finding noise suppression system and a single microphone VAD system in an embodiment. FIG. 19 is a flowchart of a method for generating voice information using a single microphone in an embodiment. Fig. 20 is a flowchart showing a method for determining the presence or absence of utterance using a VAD dominated by airflow in one embodiment. FIG. 21 is a diagram showing a noise signal with noise and a corresponding manually activated / calculated VAD signal, and a path finding system using a manual VAD signal to eliminate noise of the sound signal in one embodiment. In the drawings, the same reference numbers indicate the same or substantially similar elements or actions. In order to facilitate the discussion and description of any specific element or action, the most obvious number or reference number refers to the figure number where the element first appears, for example, the element 104 first appears in Figure 1 and discuss. [Explanation of component symbols] 100 signal processing system 1402 receives the signal □ Continued page (when the description page of the invention is insufficient, please note and use the continued page) 44 200304119-Description of the invention continued page 102VAD system 104VAD signal 110 microphone MIC 1 112 microphone MIC 2 120 voice signal source 122 noise source 102 AVAD system 130VAD device 140VAD algorithm 102BVAD system 150VAD algorithm 164 VAD information 302 receive accelerometer data 304 filter accelerometer data and digitization 306 segmentation and stepwise digitization data 30δ removed by The spectral information of noise interference 310 is calculated. The energy in each interval is calculated. 312 compares the energy 値 with the critical value. 314 is higher than the critical value. The energy is shown as someone ’s voice. 310 The energy below the critical value is shown as unvoiced. 802 The vocal organ 804 calculates whether the motion of the starting vocal organ 806 is faster than the critical speed and is vibrating? Is the 808 action greater than the critical threshold? 810 The target is sounding 1404 Allocating samples of MIC 1 and MIC2 1406 Calculating FFT1 and FFT2 1408 Calculating the size of FFT1 and KFT2 1410 Calculating the exponential average of FFT1 and FFT2 1412 Calculating the ratio (FFT_mtio) and its average 値 1414 of the previous step Average 値 and critical 値, set the VAD state 1416 to determine whether VAD_state is equal to 1 1417. Update the parameters. Remember the highest average 续 1418 to reset the parameters. If the first unvoiced zone appears after the voiced zone, Be [J checks if someone's previous voice was misjudged. 1420 Calculate the high and low energy levels and calculate the new vocal critical 値, plus the absolute minimum critical 値 if appropriate. 1602 Receive microphone signal / data 1604 Filter microphone data and digitize 1606 Segmentation and stepwise digitize data 1608 Filter the data allocated in the interval 1610 Add the data in the interval, and then square the result to calculate the energy of the vector 1612 □ Continued page (If the description page of the invention is not enough, please note and use the continued page) 45 200304119-Description of the invention Stomach 812 passes information to the path finding system 1002 Unidirectional microphone MIC 1 1010 MIC 1 Spatial response curve 1012 None Frequency response area 1014 値 1100 microphone array 1102 unidirectional microphone MIC 2 1110 MIC 2 spatial response curve 1120 first hemisphere 1130 second hemisphere 1202 receiving system microphone data 1204 filtering signal and digitization 1206 segmentation, stepping and filtering Digitized data 1208 Calculate standard variation and past gain 値 average 1210 Check standard variability and average based on minimum 値 Check and adjust 1212 Calculate critical threshold for vocalization 1212 Calculate energy and gain in the interval 1216 Compare critical 値 and gain to determine interval The information in it is voiced or unvoiced. Measure the standard variation of the number 値 and average 1616. Take the standard variation and the average and the minimum number 値 and adjust it. 1618 Calculate the voicing threshold. 1620 Compare the energy and the critical 値. Determine whether the data in the interval is a person or no one. 1902 Receive the signal from the main microphone 1904 Filter the received signal and digitize 1906 Divide and filter the digitized data 1908 Calculate VAD information 1910 Provide VAD information to the noise suppression system 2002 Receive airflow data 2004 Filter airflow data to eliminate ghosting and Digitize 2006 collect data for the next 20ms interval 2008 filter out unwanted airflow spectrum 2010 calculate the energy in the interval 2012 whether the energy in the interval 2012 is greater than the threshold 値 2014 someone vocalize 2016 send information to the path finding noise suppression system

4646

Claims (1)

200304119 申請專利範圍 ι· 一系統,用以消除聲音訊號的雜訊,包含: 一消除雜訊次系統(denoising subsystem)包含至少一個相連的接收器,用以提供 一環境的聲音訊號至該消除雜訊次系統的元件; 一語音偵測次系統(voice detection subsystem)連接至該消除雜訊次系統,該語音 偵測次系統接收包含人類語音活動資訊的語音活動訊號,其中該語音偵測次 系統利用該語音活動訊號的資訊,自動產生控制訊號; 其中消除雜訊次系統的元件利用該控制訊號,自動選擇至少一適合聲音訊號 的至少一個副頻帶(frequency subband)資料的消除雜訊之方法;以及 其中消除雜訊次系統的元件利用獲選的消除雜訊方法處理該聲音訊號,用以 產生消除雜訊的聲音訊號。 2. 如申請專利範圍第1項所述之系統,其中該接收器連接到至少一個可偵測該 聲音訊號之麥克風陣列。 3. 如申請專利範圍第2項所述之系統,其中該麥克風陣列包含至少2個緊密間 隔排列之麥克風。 4. 如申請專利範圍第1項所述之系統,其中該語音偵測次系統透過一感測器 (sensor)接收該語音活動訊號,其中該感測器係選自下列至少其中之一:一加 速度計(accelerometer)、接觸人體皮膚之皮膚表面麥克風(skin surface microphone)、一人體組織振動偵測器(human tissue vibration detector)、一身寸頻(RF) 振動偵測器、一雷射振動偵測器(laser vibration detector)、一聲帶振動測量器 (electroglottograph,EGG),以及一電腦影像組織振動偵測器(computer vision tissue vibration detector) ° 5.如申請專利範圍第1項所述之系統,其中該語音偵測次系統透過一連接至該 ]續次頁(申請專利範圍頁不敷使用時,請註記並使用續頁) 47 200304119 申請專利範圍續頁 接收器的麥克風陣列接收該語音活動訊號,其中該麥克風陣列至少包含一麥 克風、一梯度麥克風(gradient microphone),以及一對單指向性(unidirecti〇nal)麥 克風的其中之一。 6.如申請專利範圍第1項所述之系統,其中該語音偵測次系統透過一連接至該 接收器的麥克風陣列接收該語音活動訊號,其中該麥克風陣列包含一第一單 指向性麥克風與處於同一位置之一第二單指向性麥克風,其中該第一單指向 性麥克風的位置設爲該第一單指向性麥克風的一空間響應曲線最大値(spatial response curve maximum)與該第二單指向性麥克風的一空間響應曲線最大値的 方位差約在45度至180度之間。 7·如申請專利範圍第1項所述之系統,其中該語音偵測次系統透過一連接至該 接收器的麥克風陣列接收該語音活動訊號,其中該麥克風陣列包含一第一單 指向性麥克風與處於線性對應位置(positioned colinearly)之一第二單指向性麥 克風。 8· —方法,用以消除聲音訊號的雜訊,包含: 接收聲音訊號與語音活動訊號; 由該語音活動訊號的資料自動產生控制訊號; 利用該控制訊號,自動選擇至少一個適合聲音訊號的至少一個副頻帶資料的 '消除雜訊方法;以及 採用獲選的該消除雜訊方法並產生消除雜訊的該聲音訊號。 9. 如申請專利範圍第8項所述之方法,其中選擇至少一個消除雜訊方法中並進 一步包含選擇一第一消除雜訊方法,用於包含有人聲發音(voiced speech)的頻 率副頻帶。 10. 如申請專利範圍第9項所述之方法,其中選擇至少一個消除雜訊方法中並進 一步包含選擇一第二消除雜訊方法,用於包含無人聲發音(unvoiced speech)的 □續次頁(申請專利範圍頁不敷使用時,請註記並使用續頁) 200304119 _ 甲請專利範圍續頁 頻率副頻帶。 11·如申請專利範圍第8項所述之方法,其中選擇至少一個消除雜訊方法中並進 一步包含選擇一消除雜訊方法,用於包含無聲(devoid of speech)的頻率副頻帶。 12.如申請專利範圍第8項所述之方法,其中選擇至少一個消除雜訊方法中並進 一步包含選擇一消除雜訊方法,回應接收到的聲音訊號的雜訊資訊,其中該 雜訊資訊包含與一說話者相關的雜訊振幅、雜訊類型以及雜訊方向的至少其 中之一。 13·如申請專利範圍第8項所述之方法,其中選擇至少一個消除雜訊方法中並進 φ 一步包含,回應接收到的聲音訊號的雜訊資訊,選擇一消除雜訊方法,其中 . 該雜訊資訊包含與一說話者相關的雜訊來源動作(noise source motion) 〇 14. 一方法,用以消除聲音訊號的雜訊,包含: 接收聲音訊號; 接收與人類發聲活動(human voicing activity)相關的資訊; 產生至少一控制訊號,用以控制從聲音訊號移除雜訊; 回應該控制訊號,自動產生至少一轉移函數(transfer function),用於處理在至 少一副頻帶的該聲音訊號; · 應用所產生的該轉移函數於該聲音訊號;以及 移除該聲音訊號中的雜訊。 15. 如申請專利範圍第14項所述之方法,並進一步包含將所接收到的該聲音訊號 . 分割成一複數個頻率副頻帶。 16. 如申請專利範圍第14項所述之方法,其中產生該轉移函數的方法並進一步包 含,當該控制訊號顯示一副頻帶的該聲音訊號中沒有發聲資訊(voicing information)時,調整至少一代表一副頻帶的聲音訊號的第一轉移函數之係數。 17. 如申請專利範圍第14項所述之方法,其中產生該轉移函數的方法並進一步包 ]續次頁(申請專利範圍頁不敷使用時,請註記並使用續頁) 49 200304119 -—3 申請專利範圍_頁 含,當該控制訊號顯示一副頻帶的該聲音訊號中有發聲資訊時,調整至少一 代表一副頻帶的聲音訊號的第二轉移函數的係數。 18.如申請專利範圍第14項所述之方法,其中應用所產生的該轉移函數的方法並 進一步包含: 產生和該聲音訊號的雜訊相關之一雜訊波型估計(noise waveform estimate);以 及 當該聲音訊號包含人聲與雜訊時,將該雜訊波型估計自該聲音訊號扣除。200304119 Scope of patent application: A system for removing noise from sound signals, including: A denoising subsystem including at least one connected receiver for providing an environmental sound signal to the noise canceling system Components of a voice detection system; a voice detection subsystem is connected to the noise reduction subsystem, the voice detection subsystem receives a voice activity signal containing human voice activity information, and the voice detection subsystem Using the information of the voice activity signal to automatically generate a control signal; wherein the components of the noise reduction sub-system use the control signal to automatically select at least one frequency subband data removal method suitable for sound signals; And the components of the noise reduction sub-system use the selected noise reduction method to process the sound signal to generate a noise canceled sound signal. 2. The system described in item 1 of the patent application scope, wherein the receiver is connected to at least one microphone array capable of detecting the sound signal. 3. The system according to item 2 of the scope of patent application, wherein the microphone array includes at least two closely spaced microphones. 4. The system according to item 1 of the scope of patent application, wherein the voice detection sub-system receives the voice activity signal through a sensor, wherein the sensor is selected from at least one of the following: a Accelerometer, skin surface microphone in contact with human skin, human tissue vibration detector, human body vibration detector, RF vibration detector, laser vibration detection Device (laser vibration detector), an electroglottograph (EGG), and a computer vision tissue vibration detector ° 5. The system according to item 1 of the scope of patent application, wherein The voice detection sub-system receives the voice activity signal through a microphone array connected to the] continuation page (note when the patent application page is insufficient, please note and use the continuation page) 47 200304119 The microphone array includes at least a microphone, a gradient microphone, and a pair of unidirectional i〇nal) One of the microphones. 6. The system according to item 1 of the scope of patent application, wherein the voice detection sub-system receives the voice activity signal through a microphone array connected to the receiver, wherein the microphone array includes a first unidirectional microphone and A second unidirectional microphone in the same position, wherein the position of the first unidirectional microphone is set to a spatial response curve maximum of the first unidirectional microphone and the second unidirectional microphone The maximum azimuth difference of a spatial response curve of a sexual microphone is between 45 degrees and 180 degrees. 7. The system according to item 1 of the scope of patent application, wherein the voice detection sub-system receives the voice activity signal through a microphone array connected to the receiver, wherein the microphone array includes a first unidirectional microphone and One of the second unidirectional microphones that is positioned colinearly. 8 · —Method for eliminating noise of sound signals, including: receiving sound signals and voice activity signals; automatically generating control signals from the data of the voice activity signals; using the control signals to automatically select at least one at least one suitable for sound signals A 'noise canceling method for subband data; and the sound signal using the selected noise canceling method and generating noise canceling. 9. The method according to item 8 of the scope of patent application, wherein selecting at least one noise reduction method and further comprising selecting a first noise reduction method for frequency sub-bands including voiced speech. 10. The method as described in item 9 of the scope of patent application, wherein selecting at least one noise reduction method and further comprising selecting a second noise reduction method for unvoiced speech is continued on the next page (When the patent application page is insufficient, please note and use the continuation page) 200304119 _ A Please request the patent application page to renew the frequency sub-band. 11. The method according to item 8 of the scope of patent application, wherein selecting at least one noise canceling method and further comprising selecting a noise canceling method for a frequency sub-band including a void of speech. 12. The method according to item 8 of the scope of patent application, wherein selecting at least one noise reduction method further comprises selecting a noise reduction method in response to noise information of the received sound signal, wherein the noise information includes At least one of a noise amplitude, a noise type, and a noise direction associated with a speaker. 13. The method as described in item 8 of the scope of patent application, wherein the at least one noise reduction method is selected to further include a step of φ, in response to the noise information of the received sound signal, and a noise reduction method is selected, in which the noise The signal information includes a noise source motion related to a speaker. 14. A method for eliminating noise of a sound signal includes: receiving a sound signal; receiving a signal related to a human voicing activity Generating at least one control signal for controlling noise removal from the sound signal; responding to the control signal and automatically generating at least one transfer function for processing the sound signal in at least one sub-band; Applying the generated transfer function to the sound signal; and removing noise from the sound signal. 15. The method as described in item 14 of the scope of patent application, and further comprising dividing the received sound signal. Into a plurality of frequency sub-bands. 16. The method according to item 14 of the scope of patent application, wherein the method for generating the transfer function further comprises, when the control signal shows that there is no voicing information in the sound signal of a sub-band, adjusting at least one The coefficient of the first transfer function representing a sub-band sound signal. 17. The method as described in item 14 of the scope of patent application, in which the method of generating the transfer function is further included] Continued page (If the page of patent scope is insufficient, please note and use the continued page) 49 200304119 --3 The scope of patent application_page includes, when the control signal shows that there is sound information in the sound signal of a sub-band, adjusting the coefficient of at least one second transfer function representing the sound signal of a sub-band. 18. The method according to item 14 of the scope of patent application, wherein the method of applying the generated transfer function further comprises: generating a noise waveform estimate related to the noise of the sound signal; And when the sound signal includes human voice and noise, the noise wave pattern is estimated to be subtracted from the sound signal. 5050
TW92104696A 2002-03-05 2003-03-05 Voice activity detection (VAD) devices and methods for use with noise suppression systems TW200304119A (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US36216102P 2002-03-05 2002-03-05
US36216202P 2002-03-05 2002-03-05
US36217002P 2002-03-05 2002-03-05
US36210302P 2002-03-05 2002-03-05
US36198102P 2002-03-05 2002-03-05
US36834302P 2002-03-27 2002-03-27

Publications (1)

Publication Number Publication Date
TW200304119A true TW200304119A (en) 2003-09-16

Family

ID=51660900

Family Applications (1)

Application Number Title Priority Date Filing Date
TW92104696A TW200304119A (en) 2002-03-05 2003-03-05 Voice activity detection (VAD) devices and methods for use with noise suppression systems

Country Status (1)

Country Link
TW (1) TW200304119A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI407433B (en) * 2010-08-18 2013-09-01 Hon Hai Prec Ind Co Ltd Voice recording equipment and method for processing and recording voice
TWI423688B (en) * 2010-04-14 2014-01-11 Alcor Micro Corp Voice sensor with electromagnetic wave receiver
US8744849B2 (en) 2011-07-26 2014-06-03 Industrial Technology Research Institute Microphone-array-based speech recognition system and method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI423688B (en) * 2010-04-14 2014-01-11 Alcor Micro Corp Voice sensor with electromagnetic wave receiver
TWI407433B (en) * 2010-08-18 2013-09-01 Hon Hai Prec Ind Co Ltd Voice recording equipment and method for processing and recording voice
US8744849B2 (en) 2011-07-26 2014-06-03 Industrial Technology Research Institute Microphone-array-based speech recognition system and method

Similar Documents

Publication Publication Date Title
KR101402551B1 (en) Voice activity detection(vad) devices and methods for use with noise suppression systems
US20030179888A1 (en) Voice activity detection (VAD) devices and methods for use with noise suppression systems
TWI281354B (en) Voice activity detector (VAD)-based multiple-microphone acoustic noise suppression
KR101532153B1 (en) Systems, methods, and apparatus for voice activity detection
US10218327B2 (en) Dynamic enhancement of audio (DAE) in headset systems
KR101606966B1 (en) Systems, methods, apparatus, and computer-readable media for spatially selective audio augmentation
KR101434071B1 (en) Microphone and voice activity detection (vad) configurations for use with communication systems
JP5410603B2 (en) System, method, apparatus, and computer-readable medium for phase-based processing of multi-channel signals
US8682658B2 (en) Audio equipment including means for de-noising a speech signal by fractional delay filtering, in particular for a “hands-free” telephony system
ES2775799T3 (en) Method and apparatus for multisensory speech enhancement on a mobile device
CA2798512A1 (en) Vibration sensor and acoustic voice activity detection system (vads) for use with electronic systems
JP2013535915A (en) System, method, apparatus, and computer-readable medium for multi-microphone position selectivity processing
WO2008157421A1 (en) Dual omnidirectional microphone array
CA2798282A1 (en) Wind suppression/replacement component for use with electronic systems
AU2016202314A1 (en) Acoustic Voice Activity Detection (AVAD) for electronic systems
Kalgaonkar et al. Ultrasonic doppler sensor for voice activity detection
TW200304119A (en) Voice activity detection (VAD) devices and methods for use with noise suppression systems
US20230379621A1 (en) Acoustic voice activity detection (avad) for electronic systems
JP5249431B2 (en) Method for separating signal paths and methods for using the larynx to improve speech
WO2021239254A1 (en) A own voice detector of a hearing device