TWI489448B - Apparatus and computer-implemented method for low power voice detection, computer readable storage medium thereof, and system with the same - Google Patents

Info

Publication number
TWI489448B
TWI489448B TW101144776A
Authority
TW
Taiwan
Prior art keywords
audio signal
voltage
clock frequency
module
fft
Prior art date
Application number
TW101144776A
Other languages
Chinese (zh)
Other versions
TW201342362A (en)
Inventor
Arijit Raychowdhury
Willem Beltman
James W Tschanz
Carlos Tokunaga
Michael E Deisher
Thomas E Walsh
Original Assignee
Intel Corp
Priority date
Filing date
Publication date
Application filed by Intel Corp
Publication of TW201342362A
Application granted
Publication of TWI489448B

Classifications

    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G06F17/142: Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • G10L21/0208: Noise filtering
    • G10L25/18: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/84: Detection of presence or absence of voice signals for discriminating voice from noise
    • G06F1/3234: Power saving characterised by the action undertaken
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/285: Memory allocation or algorithm optimisation to reduce hardware requirements
    • G10L2015/223: Execution procedure of a spoken command
    • G10L2021/02087: Noise filtering, the noise being separate speech, e.g. cocktail party
    • G10L21/0232: Noise filtering characterised by the method used for estimating noise, processing in the frequency domain

Description

Apparatus, system, and computer-implemented method for low-power voice detection, and computer-readable storage medium thereof

Embodiments of the invention relate generally to audio processing and, more particularly, to speech recognition.

Voice commands and continuous speech recognition are important for mobile computing devices with limited keyboard capability. Continuously listening for possible speech in the environment, however, can consume considerable power, so most systems require the user to enter a command before listening begins. This approach can be inconvenient and limits the feasibility of many potential applications.

An apparatus according to an embodiment of the invention includes logic to store a time-domain audio signal in a memory configured to operate at a first clock frequency and a first voltage, and to perform a Fast Fourier Transform (FFT) on the time-domain audio signal at a second clock frequency and a second voltage to produce a frequency-domain audio signal.

A computer-implemented method according to an embodiment of the invention includes recording a time-domain audio signal at a first clock frequency and a first voltage, and performing a Fast Fourier Transform (FFT) on the time-domain audio signal at a second clock frequency to produce a frequency-domain audio signal, wherein the first clock frequency is faster than the second clock frequency.

Embodiments of the invention may include a computer-readable storage medium having a set of instructions which, if executed by a processor, cause a computer to record a time-domain audio signal at a first clock frequency and a first voltage, and to perform a Fast Fourier Transform (FFT) on the time-domain audio signal at a second clock frequency to produce a frequency-domain audio signal, wherein the first clock frequency is faster than the second clock frequency.

Referring now to FIG. 1, an example block diagram of an embodiment of a speech recognition system 100 is shown. The system may include a pre-processing module 101 configured to capture an audio signal, a front-end processing module 102 configured to process the audio signal and detect any human-voice information within it, and a back-end processing module 103 configured to analyze the voice information and perform operations associated with it. Note that the audio signal may include both background noise and voice information.

The pre-processing module 101 may include a recorder 105 (e.g., a microphone) for capturing the audio signal as a Pulse Density Modulation (PDM) stream. The PDM stream may include the time-domain audio signal in digital form. The pre-processing module 101 may include a PDM to Pulse-code Modulation (PCM) converter 110 configured to receive the PDM stream and generate a PCM stream. The PCM stream may be viewed as a digital representation of the PDM stream and contains unencoded, raw information. In some embodiments the PCM stream may be received directly; for example, the recorder 105 may include integrated functionality that produces a PCM stream.

The front-end processing module 102 (also referred to as the voice activity detection (VAD) module) may include a framing and windowing module 115 configured to frame and window the stream from the PDM-PCM converter 110. The framing and windowing module 115 may partition the stream into multiple frames according to a sampling rate and a frame size (as shown in FIG. 2). For example, the sampling rate may be set to 16 kHz and the frame size to 32 ms (milliseconds). Depending on the implementation, different sampling rates and frame sizes may be used. For some embodiments, frames may overlap one another while retaining non-overlapping windows: two consecutive 32 ms frames may overlap by 22 ms, leaving a 10 ms non-overlapping window. With a 16 kHz sampling rate and a 32 ms frame size, each frame contains 16 × 32 = 512 samples.
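
The framing arithmetic above can be sketched in a few lines of Python. The function and parameter names are illustrative, not from the patent; the numbers (16 kHz sampling, 32 ms frames, 10 ms hop) follow the example in the text.

```python
# Illustrative sketch of the framing and windowing step. With a 16 kHz
# sampling rate, a 32 ms frame holds 512 samples and the 10 ms
# non-overlapping window corresponds to a 160-sample hop.

SAMPLE_RATE_HZ = 16_000
FRAME_MS = 32
HOP_MS = 10

FRAME_SIZE = SAMPLE_RATE_HZ * FRAME_MS // 1000   # 512 samples per frame
HOP_SIZE = SAMPLE_RATE_HZ * HOP_MS // 1000       # 160 samples per hop

def frame_signal(samples):
    """Split a PCM sample list into overlapping 512-sample frames."""
    frames = []
    start = 0
    while start + FRAME_SIZE <= len(samples):
        frames.append(samples[start:start + FRAME_SIZE])
        start += HOP_SIZE
    return frames

pcm = [0] * SAMPLE_RATE_HZ          # one second of silence
frames = frame_signal(pcm)
print(FRAME_SIZE, HOP_SIZE, len(frames))
```

One second of audio yields 97 overlapping frames at this frame size and hop, since a new frame starts every 10 ms.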

A Fast Fourier Transform module (FFT module) 120 may be configured to receive the frames of the PCM stream and convert them from a time-domain representation to a frequency-domain representation. The frequency-domain representation of the audio signal shows the energy, or signal level, in each given frequency band across a range of frequencies (as shown in FIG. 2). After the FFT module 120 performs the transform, a noise estimation and suppression module 125 may analyze the frequency-domain audio signal and filter out any noise that does not share a frequency band with the voice information. For some embodiments, the noise estimation and suppression module 125 may be a programmable band-pass filter. In general, human voice occupies the band between roughly 20 Hz and 7 kHz (referred to herein as the voice band). The noise estimation and suppression module 125 may be configured to detect any energy or signal level outside the voice band and suppress it as out-of-band energy.

The numerical properties of human voice and background noise can differ. For some embodiments, the noise estimation and suppression module 125 may distinguish voice from background noise by assuming that voice typically takes the form of a short burst followed by a pause, which generally appears as a short burst of high-amplitude energy followed by low-amplitude energy. The energy profile of background noise is different: its average amplitude remains roughly constant, or changes only slowly from one period to the next. Background noise can therefore be tracked and estimated over time.

A human voice detection module 130 may be configured to use the background-noise estimate to determine whether a human voice is present in the voice band. For some embodiments, the voice detection module 130 may determine the total energy in a frequency-domain frame, compare it with the estimated noise energy, and decide whether voice is present in that frame. For example, when the total energy is greater than the background-noise energy multiplied by a threshold, voice information 135 may be present. When the total energy is less than or roughly equal to the background-noise energy, voice information 135 may be absent. When voice information 135 is absent, the front-end processing module 102 may continue with noise estimation and suppression of the next frame in the noise estimation and suppression module 125.
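
As a rough sketch, the frame-level decision described above reduces to a single comparison. The function name and the default threshold of 2.0 are illustrative assumptions; the patent states only that the total frame energy is compared against the background-noise energy times a threshold.

```python
# Hypothetical sketch of the per-frame voice-presence test: a frame is
# flagged as containing voice when its total energy exceeds the
# estimated background-noise energy multiplied by a threshold.

def has_voice(frame_energy, noise_energy, threshold=2.0):
    """Return True when the frame's energy clears noise * threshold."""
    return frame_energy > noise_energy * threshold

print(has_voice(10.0, 3.0))   # 10.0 > 3.0 * 2.0, likely voice
print(has_voice(5.0, 3.0))    # 5.0 <= 6.0, treated as noise
```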

The back-end processing module 103 may include a voice processing module 140 configured to receive the voice information 135 from the front-end processing module 102 and determine any command or instruction the voice information 135 may contain. The voice processing module 140 may then operate according to the determined command or instruction.

Referring next to FIG. 2, a chart 200 illustrates example energies and frames associated with an audio signal. The chart 200 plots the energy of an audio signal captured over a period of time by the recorder 105 (shown in FIG. 1). The vertical axis 205 represents the amplitude of the energy, and the horizontal axis 210 represents time. For some embodiments, the audio signal may be partitioned into several overlapping frames, for example frames 215, 220, and 225. In this example, each of the frames 215, 220, and 225 is 32 ms long, and consecutive frames are offset by a 10 ms non-overlapping window 230. The FFT module 120 of FIG. 1 may first process frame 215, covering the window from 0 ms to 31 ms. After 10 ms, the FFT module 120 may process frame 220, covering the window from 10 ms to 41 ms. After another 10 ms, the FFT module 120 may process frame 225, covering the window from 20 ms to 51 ms.

At a 16 kHz sampling rate, each of the frames 215, 220, and 225 may include 512 samples. The number of samples varies with the chosen sampling rate and frame size, but is usually a power of two. For some embodiments, the FFT module 120 (FIG. 1) should be able to complete the conversion of each frame (from the time-domain representation to the frequency-domain representation) within a time comparable to the non-overlapping window, e.g., 10 ms. In other embodiments, the FFT module should be able to complete the conversion within a small fraction of the non-overlapping window; for example, the FFT module may need only 10% of the 10 ms (or 1 ms) to complete processing. The operation of the FFT module can be expressed by the following equation:

X(k) = FFT(X(t))    (Equation 1)

where X(k) is the frequency-domain representation of the audio signal, X(t) is the time-domain representation of the audio signal, k ranges from 1 to the total number of frequency bands (e.g., 512), and t represents time. The result of Equation 1 may be a 512-point FFT (following the 512-sample example). The result of the FFT operation is then filtered by the noise estimation and suppression module 125 (FIG. 1) to remove any out-of-band noise. The filtering operation of the noise estimation and suppression module 125 can be expressed by the following equation:

Y(k) = H(k) * X(k)    (Equation 2)

where Y(k) is the result of the filtering operation, H(k) is the filtering function, and X(k) is the frequency-domain representation of the audio signal, with k ranging from 1 to the total number of frequency bands (e.g., 512). The filtering operation passes the frequency-domain representation X(k) of the audio signal through the filter to remove any out-of-band noise.
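
Equations 1 and 2 can be illustrated with a short Python sketch on one 512-sample frame, assuming a simple 20 Hz to 7 kHz band mask as a stand-in for the patent's programmable filter H(k). The naive DFT below stands in for the hardware FFT, and all names are illustrative.

```python
# Sketch of Equation 1 (transform to the frequency domain) and
# Equation 2 (per-bin multiplication by a filter H(k)) on a single
# frame containing a 1 kHz tone, which lies inside the voice band.
import cmath
import math

SAMPLE_RATE = 16_000
N = 512

def dft(frame):
    """Naive DFT; real hardware would use an FFT (Equation 1)."""
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / N)
                for t in range(N)) for k in range(N)]

frame = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(N)]
X = dft(frame)                        # Equation 1: X(k) = FFT(X(t))

def bin_freq(k):
    """Frequency in Hz of bin k, folding the upper half of the spectrum."""
    f = k * SAMPLE_RATE / N
    return f if k <= N // 2 else SAMPLE_RATE - f

# Illustrative stand-in for H(k): pass only the 20 Hz - 7 kHz voice band.
H = [1.0 if 20 <= bin_freq(k) <= 7000 else 0.0 for k in range(N)]
Y = [H[k] * X[k] for k in range(N)]   # Equation 2: Y(k) = H(k) * X(k)

in_band = sum(abs(y) ** 2 for y in Y)
total = sum(abs(x) ** 2 for x in X)
print(in_band / total > 0.999)        # the in-band tone survives the filter
```

Because the 1 kHz tone falls on an exact DFT bin, essentially all of its energy lies inside the voice band and passes through the mask unchanged.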

Referring to FIG. 3, a block diagram of an exemplary embodiment of noise suppression is shown. Once the filtering operation is complete, one or more noise suppression operations may be applied to remove or suppress any non-voice noise. For some embodiments, each noise suppression operation may be associated with a different noise suppression technique, and multiple techniques may be combined to perform noise suppression. Referring to FIG. 3, the filtered information 305 is passed to a first noise suppression module 310. It should be understood that a series of frames of the same size may be used to pass the filtered information 305 to the first noise suppression module 310; the result produced by the first noise suppression module 310 may be passed to a second noise suppression module 315, and so on, until an Nth noise suppression module 320 produces an enhanced audio signal (referred to herein as enhanced audio information) 325. For example, the first noise suppression module 310 may employ a delay and sum beamformer with fixed coefficients, while the second noise suppression module 315 may employ spectral tracking and sub-band domain Wiener filtering. After the noise suppression processing of FIG. 3, the enhanced audio signal 325 may have a higher signal-to-noise ratio than the received audio signal.
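
The cascade of FIG. 3 can be sketched as a pipeline in which each stage's output feeds the next. The two stages below are trivial placeholders (the patent names a delay-and-sum beamformer and sub-band Wiener filtering, which are beyond this sketch); all names and values are illustrative.

```python
# Illustrative pipeline sketch of the FIG. 3 cascade: a list of
# suppression stages applied in sequence to each frame. The stages here
# are hypothetical placeholders, not the patent's actual techniques.

def stage_dc_remove(frame):
    """Placeholder stage: subtract the frame's mean (DC offset)."""
    mean = sum(frame) / len(frame)
    return [x - mean for x in frame]

def stage_clip_small(frame, floor=0.05):
    """Placeholder stage: zero out residual samples below a floor."""
    return [x if abs(x) >= floor else 0.0 for x in frame]

def suppress(frame, stages):
    for stage in stages:          # output of each stage feeds the next
        frame = stage(frame)
    return frame

out = suppress([1.0, 1.02, 0.98, 1.0], [stage_dc_remove, stage_clip_small])
print(out)
```

After removing the constant offset, the small residuals fall below the floor, so this near-constant (noise-like) frame is suppressed to zeros.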

The enhanced audio signal 325 may include a series of frames of the same size. The voice detection module 130 of FIG. 1 may process the enhanced audio signal 325 to detect the presence of a human voice. Depending on the implementation, the manner of processing the enhanced audio signal 325 may vary. The following is an example of the pseudo code of a first algorithm that the voice detection module 130 may use when processing the enhanced audio signal 325:

Task 1: For each frame of the enhanced audio signal 325, determine the total energy L(n) as:

    L(n) = (abs(FFT Output) * H)^2

where "abs" is the absolute-value function, "FFT Output" is the result computed by the FFT module 120, and H is the filtering function.

Task 2: For each frame of the enhanced audio signal 325, estimate the background noise (or minimum noise energy) Lmin(n) as:

    If (L(n) > Lmin(n-1))
        Lmin(n) = (1-A) * Lmin(n-1) + A * L(n);
    Else
        Lmin(n) = (1-B) * Lmin(n-1) + B * L(n);
    End

where A and B are parameters with fixed values, Lmin(n) is the background-noise energy of the current frame, and Lmin(n-1) is the background-noise energy of the previous frame.
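
The asymmetric smoothing in Task 2 can be rendered as a runnable sketch. A and B are the patent's fixed parameters; the values below (a small A so that loud bursts barely raise the floor, a larger B so that quiet frames pull it back down) are illustrative assumptions.

```python
# Sketch of the Task 2 noise-floor tracker. A loud speech burst should
# raise the estimated floor only slightly, while quiet frames pull it
# back down more quickly. A and B values are assumptions.

A = 0.01   # smoothing weight when energy rises above the floor
B = 0.20   # smoothing weight when energy falls to or below the floor

def update_noise_floor(l_n, lmin_prev):
    if l_n > lmin_prev:
        return (1 - A) * lmin_prev + A * l_n
    return (1 - B) * lmin_prev + B * l_n

floor = 1.0
for energy in [1.0, 1.0, 50.0, 50.0, 1.0]:   # a loud burst over quiet frames
    floor = update_noise_floor(energy, floor)
print(round(floor, 3))   # the floor rose only slightly despite the burst
```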

Task 3: For each frame of the enhanced audio signal 325, determine whether a human voice V(n) is present. When voice is present, set V(n) = 1; when no voice is present, set V(n) = 0. The decision is made by comparing the total energy L(n) from Task 1 of the first algorithm with the minimum background-noise energy Lmin(n) from Task 2 of the first algorithm.

    If (L(n) < Lmin(n) * Tdown)
        V(n) = 0;
    Elseif (L(n) > Lmin(n) * Tup OR silentframe < 4)
        V(n) = 1;
    Else
        V(n) = V(n-1);

    If (L(n) < Lmin(n) * Tdown)
        silentframe++;
        speechframe = 0;
    Elseif (L(n) > Lmin(n) * Tup)
        silentframe = 0;
        speechframe++;

where Tup and Tdown are parameters with fixed values.
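
The two thresholds in Task 3 form a hysteresis band: the flag switches off below Tdown times the floor, on above Tup times the floor, and otherwise holds its previous value. The sketch below simplifies away the silent-frame counters, and the Tup/Tdown values are illustrative assumptions.

```python
# Simplified sketch of the Task 3 hysteresis decision. The silent-frame
# and speech-frame bookkeeping from the pseudocode is omitted; TUP and
# TDOWN values are assumptions, not from the patent.

TUP = 3.0
TDOWN = 1.5

def voice_flag(l_n, lmin_n, v_prev):
    if l_n < lmin_n * TDOWN:
        return 0
    if l_n > lmin_n * TUP:
        return 1
    return v_prev   # inside the hysteresis band: keep the last decision

v = 0
flags = []
for energy in [1.0, 5.0, 2.5, 1.0]:   # noise floor held at 1.0 for clarity
    v = voice_flag(energy, 1.0, v)
    flags.append(v)
print(flags)
```

The third frame (energy 2.5) sits between the thresholds, so the flag holds its previous "voice" state rather than flickering off.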

The following is an example of the pseudo code of a second algorithm that the voice detection module 130 may use to process the enhanced audio signal 325. The second algorithm is similar to the first, with the addition of median filtering and contour tracking functions.

Task 1: For each frame of the enhanced audio signal 325, determine the total energy L(n) as:

    L(n) = (abs(FFT Output) * H)^2

where "abs" is the absolute-value function, "FFT Output" is the frequency-domain result produced by the FFT module 120, and H is the filtering function.

Task 2: For each frame of the enhanced audio signal 325, apply a median filtering function H(n) to remove any high-frequency noise, and a contour tracking function CT(n) to remove any noise bursts and determine the average energy of each frame.

    H(n) = medianfilter(L(n-S) : L(n))
    CT(n) = mean(H(n-4) : H(n))
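
The median-then-mean smoothing of Task 2 can be sketched directly. The window lengths (S = 4 for the median, a 5-frame mean for the contour) follow the pseudocode's indices but should be treated as assumptions, as should the shortened windows used at the start of the sequence.

```python
# Sketch of Task 2 of the second algorithm: a sliding median filter
# H(n) over the frame energies, followed by a 5-frame mean CT(n)
# (contour tracking). Windows are truncated near the start.

def median(values):
    s = sorted(values)
    return s[len(s) // 2]

def smooth(energies, s=4):
    h = []   # H(n): median of L(n-s)..L(n)
    ct = []  # CT(n): mean of H(n-4)..H(n)
    for n in range(len(energies)):
        window = energies[max(0, n - s):n + 1]
        h.append(median(window))
        hwin = h[max(0, n - 4):n + 1]
        ct.append(sum(hwin) / len(hwin))
    return h, ct

# A single-frame spike is removed by the median, so the contour stays flat.
h, ct = smooth([2.0, 2.0, 9.0, 2.0, 2.0, 2.0])
print(h)
```

The isolated 9.0 spike never survives any 5-frame median window, which is exactly the kind of spurious high-frequency burst this stage is meant to reject.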

Task 3: For each frame of the enhanced audio signal 325, determine whether a human voice V(n) is present. When voice is present, set V(n) = 1; when no voice is present, set V(n) = 0. The decision is made by comparing the total energy L(n) from Task 1 of the second algorithm with the result of the contour tracking function CT(n) from Task 2 of the second algorithm.

    If (L(n) < CT(n) * DB)
        V(n) = 0;
    Elseif (L(n) > CT(n) * DB OR silentframe < 4)
        V(n) = 1;

    If (L(n) < Lmin(n) * Tdown)
        silentframe++;
        speechframe = 0;
    Elseif (L(n) > Lmin(n) * Tup)
        silentframe = 0;
        speechframe++;

where Tup and Tdown are parameters with fixed values, and the values of Tup and Tdown vary by implementation.

Note that the effectiveness of the first and second algorithms may depend on the background-noise conditions. The first algorithm performs better in the presence of uniform background noise. The second algorithm performs better when the background noise includes spurious non-voice high-frequency noise.

Referring now to FIG. 4, a chart 400 illustrates the false acceptance rate and false rejection rate associated with voice detection operations. When processing the enhanced audio signal 325 to determine whether voice is present, two types of error are possible. The first type (a false rejection error) rejects an audio signal that may include voice. The second type (a false acceptance error) accepts noise as voice even though the noise may include no voice. For some embodiments, one or more threshold parameters may be used to control the false rejection rate and the false acceptance rate. For example, when a threshold parameter is set to a low value, all noise may be accepted as voice; when it is set to a high value, all noise that does not include voice may be rejected. Different operating points can be achieved by programming one or more of the threshold parameters. Referring to the first and second algorithm examples above, the threshold parameters may include "A", "B", "DB", "Tup", and "Tdown".

The example chart 400 includes a vertical axis 405 representing the false acceptance rate for a frame of the enhanced audio signal 325 and a horizontal axis 410 representing the false rejection rate. Curve 420 may represent operating points associated with the first algorithm described above, while curve 425 may represent operating points associated with the second algorithm described above. Each point on curves 420 and 425 may represent an operating point. In this example, the background noise may be 5 dB. Note that the false acceptance and false rejection rates of curve 425 are generally lower than those of the first algorithm, which may be due to the additional median filtering and contour tracking functions.

FIG. 5 shows an example hardware-architecture implementation of a voice activity detection module. Diagram 500 may include some of the elements of the front-end processing module 102 (shown in FIG. 1). For some embodiments, the framing and windowing module 115 of FIG. 1 may be implemented in software and is therefore not included in diagram 500. The elements of the front-end processing module 102 that may appear in diagram 500 include the FFT module 120, the noise estimation and suppression module 125, and the voice detection module 130.

Note that diagram 500 has two sections. The first section includes the elements within dashed block 505; the second includes the elements outside dashed block 505. For some embodiments, the elements within dashed block 505 may be configured to operate at a low voltage (low Vcc) and at a low clock frequency (referred to herein as clock 1). The elements outside dashed block 505 may be configured to operate at a high voltage (high Vcc) and at a high clock frequency (referred to herein as clock 16, because it is 16 times faster). The elements within dashed block 505 may include the FFT module 525, the multiplication and filtering module 520, and the voice activity detection modules 550 and 555. The FFT module 525 may correspond to the FFT module 120 of FIG. 1, the multiplication and filtering module 520 may correspond to the noise estimation and suppression module 125 of FIG. 1, and the voice activity detection modules 550 and 555 may correspond to the voice detection module 130 of FIG. 1.

Information associated with the time-domain representation of the audio signal may be stored in memory modules 510 and 515. In this example, each of the memory modules 510 and 515 may include 512 lines of 48 bits, so the total memory size is 2 × 512 × 48 bits. When the memory modules 510 and 515 are read, the information may be transferred through multiplexers 511 and 516 to frame buffer 540, and then on to frame buffer 545. Note that frame buffer 540 is located outside the dashed block 505, while frame buffer 545 is located inside it; frame buffer 540 may therefore operate at a higher voltage and clock frequency (e.g., clock 16) than frame buffer 545.

The FFT module 525 may be configured as either a 32-point or a 16-point FFT module, with the configuration controlled by the control module 560. The FFT module 525 may convert the information from the memory modules 510 and 515 from a time-domain representation to a frequency-domain representation. The multiplication and filtering module 520 may receive the results from the FFT module 525 and perform noise filtering and noise suppression operations to produce the enhanced audio signal 325 (shown in Figure 3). The enhanced audio signal 325 may then be stored in frame buffer 535, where it may be processed by the voice activity detection module 550 or 555. Depending on the implementation, multiple voice activity detection modules may operate in parallel, each employing a different algorithm (e.g., the first or second algorithm described above). As noted, the elements inside the dashed block 505 may be configured to operate at the low frequency (clock 1) and the low voltage (low Vcc), while the elements outside it may be configured to operate at the high frequency (clock 16) and the high voltage (high Vcc). The clear advantage of this arrangement is that the elements inside the dashed block 505 consume less power.

Referring now to Figure 6, an exemplary block diagram of a 512-point fast Fourier transform is shown. Diagram 600 includes four planes: the X plane 610, the Y plane 620, the Z plane 630, and the W plane 640. The X plane 610 may have 16 rows and 32 columns, for a total of 16 × 32 = 512 information points. The information points of the X plane 610 may correspond to the information that the FFT module 525 of Figure 5 receives from the memory modules 510 and 515.

For some embodiments, the 512 information points of the X plane 610 may be transformed using 32-point FFT operations. Since the X plane 610 has 16 rows, the 32-point FFT operation is performed 16 times. The result of each 32-point FFT performed on a row of the X plane 610 appears in the corresponding row of the Y plane 620. For example, the result of the 32-point FFT of the first row of the X plane 610 (X(0), X(16), ..., X(495)) is reflected in the first row of the Y plane 620 (Y(0), Y(16), ..., Y(495)).

FFT operations are carried out on complex numbers, each of which has a real part and an imaginary part. The information points of the X plane 610 may include only real information, with no imaginary component, because they represent the actual audio input signal; the X plane 610 may therefore be called a real plane. The information points of the Y plane 620, however, may include both real and imaginary parts, and the Y plane 620 may be referred to as a complex plane. The information points of the Y plane 620 may then be multiplied by a set of complex twiddle factors 625. The twiddle factors 625 may correspond to the multiplication operations performed by the multiplication and filtering module 520 of Figure 5. For some embodiments, the twiddle-factor multiplication may be implemented with four complex multipliers operating in parallel. Since the Y plane 620 has 512 information points, 128 multiplication cycles are needed to compute the 512 information points of the Z plane 630. The Z plane 630 may also be referred to as a complex plane.

For some embodiments, the information points of the Z plane 630 may be transformed using 16-point FFT operations, with a 16-point FFT performed on each row of the Z plane 630 (e.g., Z(0), Z(1), ..., Z(15)). Since the Z plane 630 has 32 rows, the 16-point FFT operation is performed 32 times. The result of each 16-point FFT performed on a row of the Z plane 630 is reflected in the corresponding row of the W plane 640. For example, the result of the 16-point FFT of the first row of the Z plane 630 (Z(0), Z(1), ..., Z(15)) is reflected in the first row of the W plane 640 (W(0), W(32), ..., W(480)).
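The decomposition described above — 16 runs of a 32-point FFT, 512 twiddle multiplications, then 32 runs of a 16-point FFT — is a standard mixed-radix Cooley-Tukey factorization. A minimal NumPy sketch, assuming the usual index mapping n = 16·n1 + n2 on input and k = k1 + 32·k2 on output (the patent does not spell out its mapping), can be checked against a direct 512-point FFT:

```python
import numpy as np

def fft512_mixed_radix(x):
    """512-point FFT as 16 x (32-point FFT), twiddles, 32 x (16-point FFT)."""
    N, N1, N2 = 512, 32, 16
    assert len(x) == N
    # X plane: row n2 holds the stride-16 samples x[n2], x[16+n2], ...
    a = np.asarray(x, dtype=complex).reshape(N1, N2).T   # a[n2, n1], shape (16, 32)
    # Stage 1: a 32-point FFT on each of the 16 rows (Y plane)
    y = np.fft.fft(a, axis=1)                            # y[n2, k1]
    # Twiddle stage: one complex multiply per point, 512 in total (Z plane)
    n2 = np.arange(N2).reshape(-1, 1)
    k1 = np.arange(N1).reshape(1, -1)
    z = y * np.exp(-2j * np.pi * n2 * k1 / N)
    # Stage 2: a 16-point FFT down each of the 32 columns (W plane)
    w = np.fft.fft(z, axis=0)                            # w[k2, k1]
    # Output order k = k1 + 32*k2 is exactly C-order flattening of w
    return w.reshape(N)
```

Verifying `fft512_mixed_radix` against `np.fft.fft` on random input confirms that the serial two-stage schedule computes the same transform as a monolithic 512-point FFT.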

Figure 7 shows a block diagram of an example hardware implementation of a fast Fourier transform module according to an embodiment. The FFT module 700 may be referred to as a hybrid FFT module because it can be used to perform both 32-point and 16-point FFT operations. The FFT module 700 may correspond to the FFT module 525 of Figure 5. The decomposition of the 512 information points described above is well suited to audio, voice, or speech processing, because these applications tolerate operations executed serially. For example, the decomposition of the 512 information points may consist of 32-point FFT operations (16 times), followed by 512 complex multiplications and finally 16-point FFT operations (32 times). This may be somewhat slower than performing a 512-point FFT on all the information points of the X plane 610 in parallel.

To be able to operate with low power at a low frequency (e.g., 4 MHz), it may be necessary to shrink the hardware as much as possible. Note that at such low frequencies most of the power is leakage power, so executing the operations serially on the same hardware strikes a balance between active and leakage power. For some embodiments, rather than using two different FFT modules, one for the 32-point FFT operations and one for the 16-point FFT operations, the FFT module 700 can be used to perform both the 32-point and the 16-point FFT operations. The FFT module 700 may include two 16-point FFTs 710 and 720, configured to operate in parallel.

The first 16-point FFT 710 may be connected either to the 16-point FFT input 705 with its signals Y(0) through Y(15), or to the first 16 input signals X(0) through X(15) of the 32-point FFT input 715. The second 16-point FFT 720 may be connected to the next 16 input signals X(16) through X(31) of the 32-point FFT input 715.

One of the 16-point FFTs 710 and 720 within the FFT module 700 may be connected to a control signal 725, which may be coupled to a multiplexer 730. When the control signal 725 is at a first setting (e.g., 0), it may cause the multiplexer 730 to accept the input signals 705, and the FFT module 700 then operates as a 16-point FFT module. When the control signal 725 is at a second setting (e.g., 1), it may cause the multiplexer 730 to accept the input signals 715, and the FFT module 700 then operates as a 32-point FFT module.
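As an illustration of how a single control bit can switch one datapath between a 16-point mode and a 32-point mode built from two 16-point engines, here is a minimal behavioral model. It uses a radix-2 decimation-in-time combining stage; the actual butterfly wiring of FFT module 700 may differ, so this sketches only the hardware-reuse idea, not the patent's circuit:

```python
import numpy as np

def hybrid_fft(x, mode32):
    """Behavioral model of a reusable FFT unit: two 16-point engines, one control bit.

    mode32=False: a plain 16-point FFT (engine A alone).
    mode32=True : a 32-point FFT assembled from both 16-point engines plus
    one radix-2 combining stage (decimation in time).
    """
    x = np.asarray(x, dtype=complex)
    if not mode32:                       # control signal = 0: 16-point mode
        assert len(x) == 16
        return np.fft.fft(x)
    assert len(x) == 32                  # control signal = 1: 32-point mode
    e = np.fft.fft(x[0::2])              # engine A: even-indexed samples
    o = np.fft.fft(x[1::2])              # engine B: odd-indexed samples
    w = np.exp(-2j * np.pi * np.arange(16) / 32)   # combining twiddles
    return np.concatenate([e + w * o, e - w * o])
```

The combining stage costs only 16 extra complex multiplies and 32 adds, which is why sharing the two 16-point engines across both modes saves so many dedicated multipliers.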

By using the FFT module 700 in place of separate 32-point and 16-point FFT modules, the total number of adders can be reduced from roughly 9500 to roughly 8300, and the total number of multipliers from roughly 312 to roughly 56. This yields significant power and area savings, at the cost of some additional latency that remains within an acceptable range.

Figure 8 shows an example hardware implementation of a multiplication and filtering module. The multiplication and filtering module 800 may be configured to perform complex multiplication operations and filtering operations. For some embodiments, the complex multiplications of Figure 8 may serve as part of the twiddle-factor stage shown in Figure 6. For some embodiments, the filtering operations of Figure 8 may be performed after the FFT operations. The multiplication and filtering module 800 may correspond to the multiplication and filtering module 520 shown in Figure 5.

The multiplication and filtering module 800 may be configured to multiply two complex numbers, (a + jb) and (c + jd). In general, the multiplication of these two complex numbers proceeds as follows:

X = a + jb

Y = c + jd

Z = X * Y = (ac − bd) + j(ad + bc)

where X and Y are the input signals and Z is the output signal. To perform this multiplication, the conventional approach requires four multipliers and two adders. The complex multiplication may be implemented with four complex multipliers operating in parallel. The following are some examples of the hardware requirements when the conventional technique is used to implement the above operation:

Logic levels = 52

Leaf cells = 3264

For some embodiments, with a modification, the multiplication of the same two complex numbers may be rearranged as follows:

X = a + jb

Y = c + jd

(ac − bd) = a(c + d) − d(a + b) (here the "ad" terms cancel each other)

(ad + bc) = a(c + d) − c(a − b) (here the "ac" terms cancel each other)

Z = X * Y = (ac − bd) + j(ad + bc)

To perform this multiplication, three multipliers and five adders are needed. Note that, compared with the conventional technique, the modified approach requires fewer multipliers but more adders. This is an acceptable trade-off, because a multiplier costs more than an adder in both power and area. The following are some examples of the hardware requirements when the modified technique is used to implement the above operation:

Logic levels = 53

Leaf cells = 2848 (fewer cells here than with the conventional technique)
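This three-multiplier scheme is a form of the Gauss/Karatsuba complex-multiplication trick. One consistent realization shares the product t = a(c + d), after which the "ad" terms cancel in the real part and the "ac" terms cancel in the imaginary part; a small sketch can be checked numerically against a direct complex multiply:

```python
def cmul3(a, b, c, d):
    """(a + jb)(c + jd) using three real multiplies and five adds."""
    t = a * (c + d)           # multiplier 1 (shared product)
    re = t - d * (a + b)      # multiplier 2: yields ac - bd
    im = t - c * (a - b)      # multiplier 3: yields ad + bc
    return re, im             # adds: (c+d), (a+b), (a-b), and the two subtractions
```

Expanding: t − d(a + b) = ac + ad − ad − bd = ac − bd, and t − c(a − b) = ac + ad − ac + bc = ad + bc, so the result equals the four-multiplier product exactly, with no approximation.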

Referring to Figure 8, the three multipliers are the multipliers 810, 820, and 850, and the five adders are the adders 860, 865, and 870, plus the two at the inputs that form "c−b" and "b+d". The input signals of the multiplication and filtering module 800 may be routed through a set of multiplexers 802, 804, 806, and 808. When these multiplexers are set to one value (e.g., 0), the multiplication and filtering module 800 may be configured to perform the complex multiplication. For example, at the first multiplexer 802, the signal "c−b" may be passed to the multiplier 810, and at the second multiplexer 804, the signal "a" may be passed to the multiplier 810, so that the multiplier 810 produces the result "a(c−b)". At the third multiplexer 806, the signal "b+d" may be passed to the multiplier 820, and at the fourth multiplexer 808, the signal "a" may be passed to the multiplier 820, so that the multiplier 820 produces the result "a(b+d)". The outputs of the multipliers 810 and 820 may then be combined by the adders 860, 865, and 870 to produce the result Z, that is, X * Y = (ac − bd) + j(ad + bc).

The multiplication and filtering module 800 may be configured to perform the filtering operation when the multiplexers 802, 804, 806, and 808 are set to another value (e.g., 1). In this case, the multiplication and filtering module 800 may be configured to compute the filter expression "Coff*abs(xR+jxI)*abs(xR+jxI)", where "xR+jxI" is a complex FFT output, "abs" is the absolute-value function, and "Coff" is a coefficient. The mathematically equivalent expression is "Coff(xR² + xI²)", which appears on the right-hand side of Figure 8. The inputs xR and xI feed the multiplexers 802, 804, 806, and 808. The first multiplier 810 may then produce the result "xR²", and the second multiplier 820 the result "xI²". These results then pass through the coefficient 848, the multiplexer 840, and the multiplier 850 to produce the value of the expression "Coff(xR² + xI²)".

Figure 9 shows a flowchart of an exemplary method of processing an audio signal to detect a human voice signal. The method may correspond to the hardware architecture shown in Figure 5. The method may be implemented as a set of logic instructions stored in a machine- or computer-readable medium such as RAM, ROM, PROM, or flash memory; in configurable logic such as a PLA, FPGA, or CPLD; in fixed-function hardware logic using ASIC, CMOS, or TTL technology; or in any combination thereof. For example, computer program code for carrying out the operations of the method may be written in any combination of one or more programming languages, including an object-oriented programming language such as C++ or the like, or a conventional procedural programming language such as the "C" programming language or similar.

Block 905 stores the audio signal in memory. As noted above, the audio signal may include a human voice along with other noise, such as background noise. The audio signal may be captured by an audio recorder and stored in the time domain. The memory may be configured to operate at a first clock frequency (e.g., a high frequency) and at a first voltage (e.g., high Vcc).

Block 910 performs FFT operations on the audio signal to convert it from the time domain to the frequency domain. The FFT operations may be performed on frames of the audio signal; as noted above, the frames may be determined by framing and windowing operations. The FFT operations may use a configurable FFT module that can be configured as different types of FFT module (e.g., a 32-point or a 16-point FFT module). The configurable FFT module may operate at a second clock frequency (e.g., a low frequency) and at a second voltage (e.g., low Vcc).

Block 915 performs noise suppression and filtering operations on the frequency-domain results of the FFT operations of block 910. The filtering operations may use the configurable multiplication and filtering hardware shown in Figure 8, and the noise suppression operations may use one or more of the noise suppression techniques shown in Figure 3. The noise suppression and filtering operations of block 915 may operate at the second clock frequency (e.g., the low frequency) and at the second voltage (e.g., low Vcc).

Block 920 performs voice detection after the noise suppression and filtering operations of block 915 are completed. As shown in Figure 5, one or more voice detection algorithms may be employed. The total energy in a frame and the background noise may be used to determine the presence of a human voice. The voice detection operations of block 920 may be performed at the second clock frequency (e.g., the low frequency) and at the second voltage (e.g., the low voltage).

Embodiments of the present invention are applicable to all types of semiconductor integrated circuit (IC) chips. Examples of IC chips include, but are not limited to, processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like. In addition, in some of the drawings, signal conductor lines are represented by line segments. Some may be drawn differently to indicate additional constituent signal paths, may carry a numeric label to indicate a number of constituent signal paths, and/or may have arrows at one or more ends to indicate the primary direction of information flow. This, however, is not intended to limit the invention; rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate understanding of a circuit. Any represented signal lines, whether or not they carry additional information, may actually comprise one or more signals that can travel in multiple directions and may be implemented with any suitable signal type, such as digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.

Example sizes, models, values, and ranges may be given in this specification, although embodiments of the invention are not limited to them. As manufacturing techniques (e.g., photolithography) mature, devices of ever smaller size can be expected. In addition, well-known power and ground connections of IC chips and other components may not be shown in the figures, for simplicity of illustration and discussion, and so as not to obscure certain embodiments of the invention. Furthermore, arrangements may be shown in block-diagram form to avoid obscuring the embodiments, and also because the specifics of such arrangements are highly dependent on the platform on which the embodiment is implemented; that is, such specifics should be well within the purview of one skilled in the art. Where specific details have been set forth to describe exemplary embodiments of the invention, it should be apparent to one skilled in the art that the embodiments can be practiced without, or with variations of, these specific details. The examples given herein are illustrative, not limiting.

The term "coupled" may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical, or other connections. In addition, the terms "first", "second", and so on are used only to facilitate discussion and carry no particular temporal or chronological significance unless otherwise indicated.

Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments of the present invention can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples, their true scope should not be so limited, since other modifications will become apparent to the skilled practitioner upon a study of the drawings, the description of the embodiments, and the following claims.

100‧‧‧speech recognition system
101‧‧‧preprocessing module
102‧‧‧front-end processing module
103‧‧‧back-end processing module
105‧‧‧audio recorder
110‧‧‧pulse density modulation to pulse code modulation converter
115‧‧‧framing and windowing module
120‧‧‧fast Fourier transform module
125‧‧‧noise estimation and suppression module
130‧‧‧human voice detection module
135‧‧‧human voice information
140‧‧‧speech processing module
200‧‧‧chart
205‧‧‧vertical axis
210‧‧‧horizontal axis
215‧‧‧frame
220‧‧‧frame
225‧‧‧frame
230‧‧‧non-overlapping window
305‧‧‧filtered information
310‧‧‧first noise suppression module
315‧‧‧second noise suppression module
320‧‧‧Nth noise suppression module
325‧‧‧enhanced audio signal
400‧‧‧chart
405‧‧‧vertical axis
410‧‧‧horizontal axis
420‧‧‧curve
425‧‧‧curve
500‧‧‧diagram
505‧‧‧dashed block
510‧‧‧memory module
511‧‧‧multiplexer
515‧‧‧memory module
516‧‧‧multiplexer
520‧‧‧multiplication and filtering module
525‧‧‧FFT module
535‧‧‧frame buffer
540‧‧‧frame buffer
545‧‧‧frame buffer
550‧‧‧voice activity detection module
555‧‧‧voice activity detection module
560‧‧‧control module
600‧‧‧diagram
610‧‧‧X plane
620‧‧‧Y plane
625‧‧‧twiddle factors
630‧‧‧Z plane
700‧‧‧FFT module
705‧‧‧16-point FFT input
710‧‧‧16-point FFT
715‧‧‧32-point FFT input
720‧‧‧16-point FFT
725‧‧‧control signal
730‧‧‧multiplexer
800‧‧‧multiplication and filtering module
802‧‧‧multiplexer
804‧‧‧multiplexer
806‧‧‧multiplexer
808‧‧‧multiplexer
810‧‧‧multiplier
820‧‧‧multiplier
848‧‧‧coefficient
850‧‧‧multiplier
860‧‧‧adder
865‧‧‧adder
870‧‧‧adder

Various advantages of the invention will become apparent to those skilled in the art from the following specification and appended claims, taken in conjunction with the accompanying drawings, in which: Figure 1 is an example block diagram of an embodiment of a speech recognition system; Figure 2 is a chart showing example energies and frames of an audio signal according to an embodiment; Figure 3 is a block diagram of an exemplary embodiment of noise suppression; Figure 4 is an exemplary chart of false-accept and false-reject rates associated with human voice detection operations; Figure 5 shows an example hardware architecture implementation of a voice activity detection module; Figure 6 is an exemplary block diagram of a 512-point fast Fourier transform according to an embodiment; Figure 7 is a block diagram of an example hardware implementation of a fast Fourier transform module according to an embodiment; Figure 8 is an example diagram of a hardware implementation of a multiplication and filtering module according to an embodiment; and Figure 9 is a flowchart of an exemplary method of processing an audio signal to detect a human voice.


Claims (25)

一種用於低功率語音檢測的設備,包含:邏輯手段,用以將時域音頻訊號儲存在被設置以根據一第一時脈頻率與一第一電壓而運算的記憶體中,以及根據一第二時脈頻率與一第二電壓,對該時域音頻訊號,執行快速傅立葉轉換(Fast Fourier Transform,FFT)運算以產生一頻域音頻訊號。 An apparatus for low-power speech detection, comprising: logic means for storing a time domain audio signal in a memory set to operate according to a first clock frequency and a first voltage, and according to a first The second clock frequency and a second voltage are used to perform a Fast Fourier Transform (FFT) operation on the time domain audio signal to generate a frequency domain audio signal. 如申請專利範圍第1項所述之設備,其中該邏輯手段係用以:執行一第一組FFT運算;執行複數乘積運算;以及串聯該第一組FFT運算以執行一第二組FFT運算。 The apparatus of claim 1, wherein the logic means: performing a first set of FFT operations; performing a complex multiplication operation; and concatenating the first set of FFT operations to perform a second set of FFT operations. 如申請專利範圍第2項所述之設備,其中該第二時脈頻率慢於該第一時脈頻率,以及其中該第二電壓低於該第一電壓。 The device of claim 2, wherein the second clock frequency is slower than the first clock frequency, and wherein the second voltage is lower than the first voltage. 如申請專利範圍第3項所述之設備,其中該邏輯手段係用以:執行雜訊抑制運算;根據該第二時脈頻率與該第二電壓,對該頻域音頻訊號執行濾波運算,以產生一增強音頻訊號。 The device of claim 3, wherein the logic means is: performing a noise suppression operation; performing a filtering operation on the frequency domain audio signal according to the second clock frequency and the second voltage, An enhanced audio signal is generated. 如申請專利範圍第4項所述之設備,其中該等複數乘積運算與濾波運算係以一相同硬體組件實施。 The apparatus of claim 4, wherein the complex product operation and the filtering operation are performed in a same hardware component. 如申請專利範圍第4項所述之設備,其中該邏輯 手段係用以根據該第二時脈頻率與該第二電壓,對該增強音頻訊號執行人聲偵測運算。 Such as the device described in claim 4, wherein the logic The means is configured to perform a voice detection operation on the enhanced audio signal according to the second clock frequency and the second voltage. 
如申請專利範圍第6項所述之設備,其中該邏輯手段係用以決定該增強音頻訊號的一訊框內的總能量,以及決定在該增強音頻訊號的該訊框內的背景雜訊。 The device of claim 6, wherein the logic is used to determine a total energy in a frame of the enhanced audio signal and to determine background noise in the frame of the enhanced audio signal. 如申請專利範圍第7項所述之設備,其中該邏輯手段係用以執行中值濾波運算,以及執行輪廓追蹤運算。 The apparatus of claim 7, wherein the logic means is to perform a median filtering operation and perform a contour tracking operation. 如申請專利範圍第7項所述之設備,其中該邏輯手段係用以根據該第一時脈頻率與該第一電壓,執行與該被偵測人聲相關的命令。 The device of claim 7, wherein the logic means is configured to execute a command related to the detected human voice according to the first clock frequency and the first voltage. 一種用於低功率語音檢測的電腦實施方法,包含:記錄於第一時脈頻率與第一電壓之時域音頻訊號;以及對於第二時脈頻率之該時域音頻訊號執行快速傅立葉轉換(FFT)運算以產生頻域音頻訊號,其中該第一時脈頻率快於該第二時脈訊號。 A computer implemented method for low power speech detection, comprising: recording a time domain audio signal at a first clock frequency and a first voltage; and performing fast Fourier transform (FFT) on the time domain audio signal of the second clock frequency An operation to generate a frequency domain audio signal, wherein the first clock frequency is faster than the second clock signal. 如申請專利範圍第10項所述之方法,其中該等FFT運算係以低於該第一電壓之第二電壓執行。 The method of claim 10, wherein the FFT operations are performed at a second voltage that is lower than the first voltage. 如申請專利範圍第11項所述之方法,更包含:對於該第二時脈頻率與該第二電壓之該頻域音頻訊號,執行雜訊抑制運算以產生一增強音頻訊號。 The method of claim 11, further comprising: performing a noise suppression operation on the second clock frequency and the frequency domain audio signal of the second voltage to generate an enhanced audio signal. 如申請專利範圍第12項所述之方法,更包含:對於該第二時脈頻率與該第二電壓之該增強音頻訊 號,執行人聲偵測運算以偵測人聲。 The method of claim 12, further comprising: the enhanced audio signal for the second clock frequency and the second voltage No. Performs a vocal detection operation to detect vocals. 
14. The method of claim 13, wherein performing the voice detection operations comprises: determining the total energy within a frame of the enhanced audio signal; determining the energy associated with background noise within the frame of the enhanced audio signal; and detecting the voice by subtracting the energy associated with the background noise from the total energy within the frame of the enhanced audio signal. 15. The method of claim 13, further comprising: executing a command associated with the voice at the first clock frequency and the first voltage. 16. The method of claim 15, wherein the time-domain audio signal is recorded continuously at the first clock frequency and the first voltage and is converted from pulse density modulation (PDM) to pulse code modulation (PCM). 17. The method of claim 16, wherein the FFT operations are performed in cascade. 18. A computer-readable storage medium comprising a set of instructions for low-power voice detection which, if executed by a processor, cause a computer to: record a time-domain audio signal at a first clock frequency and a first voltage; and perform Fast Fourier Transform (FFT) operations on the time-domain audio signal at a second clock frequency to generate a frequency-domain audio signal, wherein the first clock frequency is faster than the second clock frequency. 19. The medium of claim 18, wherein the FFT operations are performed at a second voltage that is lower than the first voltage.
20. The medium of claim 19, further comprising a set of instructions which, if executed by the processor, cause the computer to: perform noise suppression operations on the frequency-domain audio signal at the second clock frequency and the second voltage to generate an enhanced audio signal; perform voice detection operations on the enhanced audio signal at the second clock frequency and the second voltage to detect a human voice; and execute a command associated with the voice at the first clock frequency and the first voltage. 21. The medium of claim 20, wherein the voice detection operations detect the voice by: determining the total energy within a frame of the enhanced audio signal; determining the energy associated with background noise within the frame of the enhanced audio signal; and subtracting the energy associated with the background noise from the total energy within the frame of the enhanced audio signal. 22. The medium of claim 21, wherein the time-domain audio signal is recorded continuously at the first clock frequency and the first voltage.
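Claims 14 and 21 detect voice by subtracting the energy associated with background noise from the total energy in a frame. A minimal sketch of that decision rule follows; the noise-tracking smoothing factor, initial noise floor, and 6 dB margin are illustrative values chosen for this example, not figures from the patent:

```python
import numpy as np

def detect_voice(frames, noise_init=1e-4, alpha=0.95, margin_db=6.0):
    """Frame-level voice detection by energy subtraction.

    Each frame's total energy is compared against a running estimate of the
    background-noise energy; the frame is flagged as voice when the excess
    energy exceeds a margin.  The noise estimate is updated only on frames
    judged to be noise, so it tracks the background rather than the speech.
    """
    noise = noise_init
    decisions = []
    for frame in frames:
        total = np.mean(np.asarray(frame, dtype=np.float64) ** 2)  # total frame energy
        excess_db = 10.0 * np.log10(max(total, 1e-12) / max(noise, 1e-12))
        is_voice = bool(excess_db > margin_db)          # energy above the noise estimate
        if not is_voice:                                # update noise on non-voice frames
            noise = alpha * noise + (1.0 - alpha) * total
        decisions.append(is_voice)
    return decisions
```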
23. A system for low-power voice detection, comprising: a pre-processing module configured to capture audio signals as pulse density modulation (PDM) streams according to a first clock frequency and a first voltage, and to convert the PDM streams into pulse code modulation (PCM) streams; a front-end processing module, coupled to the pre-processing module, configured to frame and window the PCM streams into multiple frames; and a Fast Fourier Transform (FFT) module, coupled to the front-end processing module, configured to receive the frames of the PCM streams and to convert the frames from a time-domain representation to a frequency-domain representation according to a second clock frequency and a second voltage, wherein the second clock frequency is different from the first clock frequency and the second voltage is different from the first voltage. 24. The system of claim 23, wherein the first clock frequency is faster than the second clock frequency, and wherein the second voltage is lower than the first voltage.
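Claim 23's pre-processing module converts PDM streams to PCM, and its front-end module splits the PCM stream into frames. A simplified sketch of both steps follows; real microphone front ends typically use CIC decimation filter cascades rather than the single boxcar filter assumed here, and the decimation ratio, frame length, and hop are illustrative:

```python
import numpy as np

def pdm_to_pcm(pdm_bits, decimation=64):
    """Convert a 1-bit PDM stream to PCM by low-pass filtering and decimation.

    The PDM bits are mapped from {0, 1} to {-1, +1}, smoothed with a
    moving-average (boxcar) low-pass filter, and downsampled to the PCM rate.
    """
    x = 2.0 * np.asarray(pdm_bits, dtype=np.float64) - 1.0
    kernel = np.ones(decimation) / decimation
    filtered = np.convolve(x, kernel, mode="valid")
    return filtered[::decimation]

def frame_signal(pcm, frame_len=160, hop=80):
    """Split a PCM stream into overlapping frames, as in the front-end module."""
    n = 1 + max(0, (len(pcm) - frame_len) // hop)
    return np.stack([pcm[i * hop : i * hop + frame_len] for i in range(n)])
```

Each row of the framed output is then handed to the FFT module for conversion to the frequency-domain representation.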
25. The system of claim 24, further comprising: a noise estimation and suppression module, coupled to the FFT module, configured to analyze the frames in the frequency-domain representation and to filter out noise information that is not within the same frequency band as human voice; a voice detection module, coupled to the noise estimation and suppression module, configured to determine whether a human voice is present in the frames based on a human-voice frequency band and using a background noise estimate; and a speech processing module, coupled to the voice detection module, configured to determine a command associated with the human voice and to execute an operation associated with the command.
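Claim 25's noise estimation and suppression module filters out noise that is not in the same frequency band as human voice. One minimal way to sketch that band filtering on a single FFT frame is to zero the out-of-band bins; the 300–3400 Hz band is an assumed telephony-speech range and the masking approach is an illustration, not the patent's specific filter:

```python
import numpy as np

def suppress_out_of_band(spectrum, sample_rate, band=(300.0, 3400.0)):
    """Zero the FFT bins of one frame that fall outside an assumed voice band.

    `spectrum` is the one-sided (rfft) spectrum of a real-valued frame; the
    bin frequencies are recovered from its length and the sample rate.
    """
    n_fft = 2 * (len(spectrum) - 1)                    # original frame length
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    mask = (freqs >= band[0]) & (freqs <= band[1])     # keep only voice-band bins
    return spectrum * mask
```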
TW101144776A 2011-12-06 2012-11-29 Apparatus and computer-implemented method for low power voice detection, computer readable storage medium thereof, and system with the same TWI489448B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2011/063622 WO2013085499A1 (en) 2011-12-06 2011-12-06 Low power voice detection

Publications (2)

Publication Number Publication Date
TW201342362A TW201342362A (en) 2013-10-16
TWI489448B true TWI489448B (en) 2015-06-21

Family

ID=48574714

Family Applications (1)

Application Number Title Priority Date Filing Date
TW101144776A TWI489448B (en) 2011-12-06 2012-11-29 Apparatus and computer-implemented method for low power voice detection, computer readable storage medium thereof, and system with the same

Country Status (5)

Country Link
US (1) US9633654B2 (en)
EP (1) EP2788979A4 (en)
CN (1) CN103959376B (en)
TW (1) TWI489448B (en)
WO (1) WO2013085499A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI419280B (en) * 2009-01-16 2013-12-11 Univ Nat Taiwan Electronic device for preventing diffusion of metals

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9633654B2 (en) 2011-12-06 2017-04-25 Intel Corporation Low power voice detection
US9626963B2 (en) * 2013-04-30 2017-04-18 Paypal, Inc. System and method of improving speech recognition using context
EP3575924B1 (en) 2013-05-23 2022-10-19 Knowles Electronics, LLC Vad detection microphone
US10020008B2 (en) 2013-05-23 2018-07-10 Knowles Electronics, Llc Microphone and corresponding digital interface
US9711166B2 (en) 2013-05-23 2017-07-18 Knowles Electronics, Llc Decimation synchronization in a microphone
US9502028B2 (en) * 2013-10-18 2016-11-22 Knowles Electronics, Llc Acoustic activity detection apparatus and method
US9147397B2 (en) 2013-10-29 2015-09-29 Knowles Electronics, Llc VAD detection apparatus and method of operating the same
US9406313B2 (en) 2014-03-21 2016-08-02 Intel Corporation Adaptive microphone sampling rate techniques
US10360926B2 (en) 2014-07-10 2019-07-23 Analog Devices Global Unlimited Company Low-complexity voice activity detection
US9830080B2 (en) 2015-01-21 2017-11-28 Knowles Electronics, Llc Low power voice trigger for acoustic apparatus and method
US9653079B2 (en) * 2015-02-12 2017-05-16 Apple Inc. Clock switching in always-on component
US10121472B2 (en) 2015-02-13 2018-11-06 Knowles Electronics, Llc Audio buffer catch-up apparatus and method with two microphones
US9478234B1 (en) 2015-07-13 2016-10-25 Knowles Electronics, Llc Microphone apparatus and method with catch-up buffer
US10048936B2 (en) * 2015-08-31 2018-08-14 Roku, Inc. Audio command interface for a multimedia device
KR20170051856A (en) * 2015-11-02 2017-05-12 주식회사 아이티매직 Method for extracting diagnostic signal from sound signal, and apparatus using the same
CN107786931B (en) * 2016-08-24 2021-03-23 中国电信股份有限公司 Audio detection method and device
EP3324407A1 (en) * 2016-11-17 2018-05-23 Fraunhofer Gesellschaft zur Förderung der Angewand Apparatus and method for decomposing an audio signal using a ratio as a separation characteristic
EP3324406A1 (en) 2016-11-17 2018-05-23 Fraunhofer Gesellschaft zur Förderung der Angewand Apparatus and method for decomposing an audio signal using a variable threshold
US20180224923A1 (en) * 2017-02-08 2018-08-09 Intel Corporation Low power key phrase detection
US10121494B1 (en) * 2017-03-30 2018-11-06 Amazon Technologies, Inc. User presence detection
EP3721429A2 (en) * 2017-12-07 2020-10-14 HED Technologies Sarl Voice aware audio system and method
EP3776552A1 (en) * 2018-03-29 2021-02-17 3M Innovative Properties Company Voice-activated sound encoding for headsets using frequency domain representations of microphone signals
JP6948609B2 (en) * 2018-03-30 2021-10-13 パナソニックIpマネジメント株式会社 Noise reduction device
CN110580919B (en) * 2019-08-19 2021-09-28 东南大学 Voice feature extraction method and reconfigurable voice feature extraction device under multi-noise scene
CN110556128B (en) * 2019-10-15 2021-02-09 出门问问信息科技有限公司 Voice activity detection method and device and computer readable storage medium
CN111093302B (en) * 2019-11-26 2023-05-12 深圳市奋达科技股份有限公司 Sound box light control method and sound box
KR20210122348A (en) * 2020-03-30 2021-10-12 삼성전자주식회사 Digital microphone interface circuit for voice recognition and including the same
CN111508516A (en) * 2020-03-31 2020-08-07 上海交通大学 Voice beam forming method based on channel correlation time frequency mask
US11646009B1 (en) * 2020-06-16 2023-05-09 Amazon Technologies, Inc. Autonomously motile device with noise suppression

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080069364A1 (en) * 2006-09-20 2008-03-20 Fujitsu Limited Sound signal processing method, sound signal processing apparatus and computer program
US20080304670A1 (en) * 2005-09-13 2008-12-11 Koninklijke Philips Electronics, N.V. Method of and a Device for Generating 3d Sound
TW201007701A (en) * 2008-07-11 2010-02-16 Fraunhofer Ges Forschung An apparatus and a method for generating bandwidth extension output data
US20100158137A1 (en) * 2008-12-22 2010-06-24 Samsung Electronics Co., Ltd. Apparatus and method for suppressing noise in receiver
TW201212007A (en) * 2010-09-06 2012-03-16 Byd Co Ltd Method and device of eliminating background noises for speech sounds (2)
TW201212008A (en) * 2010-09-06 2012-03-16 Byd Co Ltd Method and device for eliminating background noise of voice (1)

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0545581B1 (en) * 1991-12-06 1999-04-21 National Semiconductor Corporation Integrated data processing system including CPU core and parallel, independently operating DSP module
US7992067B1 (en) * 2001-11-09 2011-08-02 Identita Technologies International SRL Method of improving successful recognition of genuine acoustic authentication devices
TWI225640B (en) * 2002-06-28 2004-12-21 Samsung Electronics Co Ltd Voice recognition device, observation probability calculating device, complex fast fourier transform calculation device and method, cache device, and method of controlling the cache device
US7356466B2 (en) * 2002-06-28 2008-04-08 Samsung Electronics Co., Ltd. Method and apparatus for performing observation probability calculations
US8225112B2 (en) * 2005-07-14 2012-07-17 Nytell Software LLC Using historic load profiles to dynamically adjust operating frequency and available power to a handheld multimedia device processor core
JP4542978B2 (en) * 2005-10-27 2010-09-15 パナソニック株式会社 Power supply voltage control device
US9097783B2 (en) * 2006-04-28 2015-08-04 Telecommunication Systems, Inc. System and method for positioning using hybrid spectral compression and cross correlation signal processing
JP4808108B2 (en) 2006-08-29 2011-11-02 パナソニック株式会社 Processor system
JP5228468B2 (en) * 2007-12-17 2013-07-03 富士通セミコンダクター株式会社 System device and method of operating system device
US7619551B1 (en) * 2008-07-29 2009-11-17 Fortemedia, Inc. Audio codec, digital device and voice processing method
US8806245B2 (en) * 2010-11-04 2014-08-12 Apple Inc. Memory read timing margin adjustment for a plurality of memory arrays according to predefined delay tables
US9633654B2 (en) 2011-12-06 2017-04-25 Intel Corporation Low power voice detection
JP6050721B2 (en) * 2012-05-25 2016-12-21 株式会社半導体エネルギー研究所 Semiconductor device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080304670A1 (en) * 2005-09-13 2008-12-11 Koninklijke Philips Electronics, N.V. Method of and a Device for Generating 3d Sound
US20080069364A1 (en) * 2006-09-20 2008-03-20 Fujitsu Limited Sound signal processing method, sound signal processing apparatus and computer program
TW201007701A (en) * 2008-07-11 2010-02-16 Fraunhofer Ges Forschung An apparatus and a method for generating bandwidth extension output data
US20100158137A1 (en) * 2008-12-22 2010-06-24 Samsung Electronics Co., Ltd. Apparatus and method for suppressing noise in receiver
TW201212007A (en) * 2010-09-06 2012-03-16 Byd Co Ltd Method and device of eliminating background noises for speech sounds (2)
TW201212008A (en) * 2010-09-06 2012-03-16 Byd Co Ltd Method and device for eliminating background noise of voice (1)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI419280B (en) * 2009-01-16 2013-12-11 Univ Nat Taiwan Electronic device for preventing diffusion of metals

Also Published As

Publication number Publication date
CN103959376B (en) 2019-04-23
EP2788979A4 (en) 2015-07-22
WO2013085499A1 (en) 2013-06-13
TW201342362A (en) 2013-10-16
CN103959376A (en) 2014-07-30
US9633654B2 (en) 2017-04-25
EP2788979A1 (en) 2014-10-15
US20140236582A1 (en) 2014-08-21

Similar Documents

Publication Publication Date Title
TWI489448B (en) Apparatus and computer-implemented method for low power voice detection, computer readable storage medium thereof, and system with the same
CN102388416B (en) Signal processing apparatus and signal processing method
Gao et al. Real-time speech recognition for IoT purpose using a delta recurrent neural network accelerator
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
WO2019133153A1 (en) Dynamic enrollment of user-defined wake-up key-phrase for speech enabled computer system
US20070025564A1 (en) Sound source separation apparatus and sound source separation method
TW201513099A (en) Speech signal separation and synthesis based on auditory scene analysis and speech modeling
Zhang et al. Multi-channel multi-frame ADL-MVDR for target speech separation
CN108461081B (en) Voice control method, device, equipment and storage medium
JPH0312319B2 (en)
Chao et al. Cross-domain single-channel speech enhancement model with bi-projection fusion module for noise-robust ASR
Kim et al. Efficient implementation of the room simulator for training deep neural network acoustic models
Hou et al. Multi-task learning for end-to-end noise-robust bandwidth extension
Labied et al. An overview of automatic speech recognition preprocessing techniques
CN111667834B (en) Hearing-aid equipment and hearing-aid method
Yeh et al. Spectro-temporal modulations for robust speech emotion recognition
JP6966750B2 (en) Methods, devices and electronic devices for blind signal separation
Chi et al. Spectro-temporal modulation energy based mask for robust speaker identification
US20210398535A1 (en) Method and system of multiple task audio analysis with shared audio processing operations
Kasim et al. Real-time architecture and FPGA implementation of adaptive general spectral substraction method
Naresh et al. PSoC based isolated speech recognition system
CN111933111A (en) Voice wake-up method and device, electronic equipment and storage medium
JP2020012928A (en) Noise resistant voice recognition device, noise resistant voice recognition method, and computer program
Jung et al. A voice activity detection system based on fpga
Mashiana et al. Speech enhancement using residual convolutional neural network

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees