TWI749547B - Speech enhancement system based on deep learning - Google Patents
- Publication number: TWI749547B (Application TW109115334, filed 2020-05-08, published 2021-12-11)
- Authority: TW (Taiwan)
Description
The present invention relates to a speech enhancement system, and more particularly to a speech enhancement system that applies deep learning.
In real environments, voice communication often takes place under noise interference. For example, when a mobile phone is used on a train or the MRT, environmental noise degrades the quality and intelligibility of the speech signal, reducing the efficiency of person-to-person and person-to-machine communication as well as the audio quality. To address this problem, front-end speech signal processing techniques have been developed that extract a clean speech signal from a noisy one and reduce the noise components, thereby raising the signal-to-noise ratio and improving the quality and intelligibility of the speech signal. This processing is called speech enhancement. Speech enhancement algorithms can be divided into unsupervised and supervised approaches, most of which operate in the frequency domain.
Unsupervised speech enhancement has the advantage of not requiring labels for the input data to be prepared in advance. Widely used unsupervised methods are based on short-time spectral restoration; common algorithms such as spectral subtraction and Wiener filtering estimate a gain function of the signal in the spectral domain and apply that gain function to achieve enhancement. Other adaptive spectral-restoration methods first use a noise tracking algorithm, such as minima controlled recursive averaging (MCRA), to find the noise spectrum in the speech signal, derive the a priori SNR and the a posteriori SNR from it, and then compute the gain function used for enhancement.
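As an illustration of how such a frequency-domain gain is applied, the following is a minimal NumPy sketch of magnitude spectral subtraction with a fixed noise estimate; the spectral floor value and the assumption of a pre-computed noise PSD are illustrative choices, not values specified by this document.

```python
import numpy as np


def spectral_subtraction(noisy_spec, noise_psd, floor=0.002):
    """Subtract a noise power estimate from each frame's power spectrum.

    noisy_spec: complex STFT of the noisy speech, shape (frames, bins)
    noise_psd:  estimated noise power per bin, shape (bins,)
    floor:      spectral floor that keeps the power positive (limits musical noise)
    """
    power = np.abs(noisy_spec) ** 2
    clean_power = np.maximum(power - noise_psd, floor * power)
    gain = np.sqrt(clean_power / power)   # per-bin gain function
    return gain * noisy_spec              # enhanced complex spectrum
```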
Supervised speech enhancement requires "training data" prepared in advance to train the enhancement system. In recent years, most supervised speech enhancement algorithms have been based on deep learning with artificial neural networks, because this approach has shown strong advantages in regression analysis. Supervised methods such as the deep denoising auto-encoder (DDAE) model the relationship between clean and noisy signals and use a deep neural network to perform enhancement, effectively removing the noise from the noisy signal. Furthermore, results show that speech enhancement models trained with deep learning under a variety of noise conditions adapt well to unknown noise environments.
From the above, there is therefore a need for a speech enhancement system with good environmental adaptability in unknown noise environments.
The technical problem to be solved by the present invention is to provide, in view of the deficiencies of the prior art, a speech enhancement system that adapts well to unknown noise environments.
To solve the above technical problem, one technical solution adopted by the present invention is to provide a speech enhancement system applying deep learning, which includes a speech conversion module, a speech extraction module, a speech enhancement subsystem, and a speech restoration module. The speech conversion module receives a first speech signal and applies the short-time Fourier transform to convert it into a plurality of first speech spectra and a plurality of signal phases corresponding to a plurality of frames. The speech extraction module is connected to the speech conversion module and concatenates the first speech spectra corresponding to a plurality of adjacent frames to obtain a second speech spectrum. The speech enhancement subsystem is connected to the speech extraction module and includes a speaker feature extraction model and a speech enhancement network model. The speaker feature extraction model is connected to the speech extraction module and is configured to receive the second speech spectrum and feed it to a first deep neural network in order to extract at least one speaker feature code of the second speech spectrum. The speech enhancement network model is connected to the speech extraction module and the speaker feature extraction model and is configured to receive the at least one speaker feature code and the second speech spectrum and feed them to a second deep neural network, through which a gain function is estimated; the gain function and the second speech spectrum then undergo a spectral restoration process to produce an enhanced speech signal spectrum. The speech restoration module is connected to the speech conversion module and the speech enhancement subsystem; it receives the enhanced speech signal spectrum and the signal phases, combines them, and applies the inverse short-time Fourier transform to output an enhanced second speech signal.
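To make the data flow of the preceding paragraph concrete, the following is a minimal sketch that traces one utterance through the four modules. The speaker encoder and gain network are caller-supplied callables standing in for the two deep neural networks described above; the window length, sampling rate, and the assumption that the gain network returns one gain per time-frequency bin are illustrative choices, not values fixed by this document.

```python
import numpy as np
from scipy.signal import stft, istft


def enhance(noisy_waveform, speaker_encoder, gain_network, fs=16000):
    """One pass through the four modules of the system sketched above."""
    # Speech conversion module: STFT -> first speech spectra and signal phases
    _, _, spectra = stft(noisy_waveform, fs=fs, nperseg=512)       # (bins, frames)
    magnitudes, phases = np.abs(spectra).T, np.angle(spectra).T    # (frames, bins)

    # Speech extraction module: concatenate 2I+1 adjacent log-power frames (I = 5)
    logpow = np.log(magnitudes ** 2 + 1e-12)
    padded = np.pad(logpow, ((5, 5), (0, 0)), mode="edge")
    context = np.concatenate([padded[i:i + len(logpow)] for i in range(11)], axis=1)

    # Speech enhancement subsystem: speaker code, then gain estimation
    speaker_code = speaker_encoder(context)
    gain = gain_network(context, speaker_code)                     # (frames, bins)
    enhanced_magnitude = gain * magnitudes                         # spectral restoration

    # Speech restoration module: recombine with the phases, inverse STFT
    _, waveform = istft((enhanced_magnitude * np.exp(1j * phases)).T, fs=fs, nperseg=512)
    return waveform
```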
One of the beneficial effects of the present invention is that the speaker-aware speech enhancement system provided herein, when processing noisy utterances, achieves clear noise reduction and improvements in speech quality and intelligibility, and its performance exceeds that of conventional unsupervised and supervised speech enhancement.
For a further understanding of the features and technical content of the present invention, refer to the following detailed description and drawings; the drawings are provided for reference and illustration only and are not intended to limit the present invention.
The following specific embodiments illustrate the implementation of the disclosed "speech enhancement system applying deep learning"; those skilled in the art can understand the advantages and effects of the present invention from the content disclosed in this specification. The present invention may be implemented or applied through other specific embodiments, and the details herein may be modified and changed from different viewpoints and applications without departing from the concept of the present invention. It is stated in advance that the drawings of the present invention are merely schematic and are not drawn to scale. The following embodiments describe the related technical content of the present invention in further detail, but the disclosure is not intended to limit its scope of protection. In addition, the term "or" used herein may, as appropriate, include any one or a combination of more of the associated listed items.
Voice activity detection (VAD), also called speech endpoint detection or speech boundary detection, aims to detect the presence or absence of speech in a noisy speech signal and to locate the start and end points of speech segments; it is a basic processing step applied to raw speech data.
Voice activity detection is widely used in sound localization, speech coding, speech recognition, speaker identification, and so on. Among the many VAD algorithms, the present invention uses a statistical-based method that observes the changes of the speech signal in the time domain and the flatness of the speech signal in the frequency domain; with these measures a segment can be classified as speech or non-speech, so that speech and non-speech can be separated.
Deep learning is a branch of machine learning. It is a hierarchical learning method, usually regarded as a "deeper" neural network, combined with various kinds of neural network layers such as convolutional neural networks (CNN) and recurrent neural networks (RNN).
A neural network is a mathematical model that imitates biological neurons and conduction in the nervous system. A neural network usually has several layers, each containing tens to hundreds of neurons; each neuron sums the inputs from the previous layer and passes the result through an activation function that simulates the operation of neural conduction. Each neuron is connected to neurons in the next layer, so the output of one layer is weighted and passed on to the neurons of the next layer.
The architecture of a neural network comprises the number of layers, the number of neurons in each layer, the way neurons are connected between layers, and the type of activation function. These settings must be chosen manually before the network is used, and their quality strongly affects the network's performance; the learning and training process of the network is the search for the best weights. A network with more than one hidden layer is usually called a deep neural network.
FIG. 1 is a block diagram of the speech enhancement system applying deep learning of the present invention. As shown in FIG. 1, the speech enhancement system 10 applying deep learning of the present invention includes at least a speech conversion module 11, a speech extraction module 12, a speech enhancement subsystem 13, and a speech restoration module 14.
The speech conversion module 11 receives the first speech signal y and applies the short-time Fourier transform (STFT) to convert the first speech signal y into a plurality of first speech spectra Y_i and a plurality of signal phases θ_i corresponding to a plurality of frames, where Y_i denotes the spectral magnitude of the i-th frame of y. The speech extraction module 12 is connected to the speech conversion module 11 and concatenates the first speech spectra corresponding to a plurality of adjacent frames to obtain a second speech spectrum Y'_i. The speech enhancement subsystem 13 is connected to the speech extraction module 12 and includes a speaker feature extraction model 131 and a speech enhancement network model 132. The speaker feature extraction model 131 is connected to the speech extraction module 12 and is configured to receive the second speech spectrum Y'_i and feed it to a first deep neural network 133 in order to extract at least one speaker feature code λ_spk of the second speech spectrum Y'_i. The speech enhancement network model 132 is connected to the speech extraction module 12 and the speaker feature extraction model 131 and is configured to receive the at least one speaker feature code λ_spk and the second speech spectrum Y'_i and feed them to a second deep neural network 134, through which a gain function G_i is estimated; the gain function G_i and the second speech spectrum Y'_i then undergo a spectral restoration process to produce an enhanced speech signal spectrum X̂_i. The speech restoration module is connected to the speech conversion module 11 and the speech enhancement subsystem 13; it receives the enhanced speech signal spectrum X̂_i and the signal phases θ_i, combines the enhanced speech signal spectrum with the signal phases, and applies the inverse short-time Fourier transform to output an enhanced second speech signal x̂.
In the first embodiment, the speech enhancement subsystem 13, also referred to herein as the speaker-aware denoising neural network (SaDNN), is mainly composed of two neural networks: the speaker feature extraction model (SpkFE) 131 and the speaker-embedded speech enhancement network (SpE-DNN) 132, each of which contains one neural network. It differs from an ordinary gain-estimation speech enhancement model based on a deep neural network in that the speaker-aware model additionally integrates the speaker feature code λ_spk produced by the speaker feature extraction model.
The short-time Fourier transform (STFT) is a time-frequency analysis method: as the name implies, it applies the Fourier transform to short-time signals. A short-time signal is obtained by multiplying the long signal by a short frame (window). The long signal is thus cut into frames, each frame is multiplied by a window function and Fourier transformed, and the results of the overlapping frames are stacked along one dimension; taking the magnitude component yields a spectrogram of the signal.
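The following is a minimal NumPy sketch of the framing, windowing, and per-frame FFT described above; the frame length, hop size, and choice of Hann window are illustrative assumptions rather than parameters specified by this document.

```python
import numpy as np


def stft_frames(signal, frame_len=512, hop=256):
    """Cut the signal into overlapping frames, window each, and take its FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    spectra = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for m in range(n_frames):
        frame = signal[m * hop : m * hop + frame_len] * window
        spectra[m] = np.fft.rfft(frame)   # one frame (column) of the spectrogram
    return spectra                        # frames stacked along one dimension
```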
First, the first speech signal y in the time domain is defined as a noisy speech signal, the sum of the clean, noise-free speech signal x and the noise signal v, so it can be written in vector form as:

y = x + v,

After applying the Fourier transform to the above expression, the m-th frame of the noisy speech signal can be expressed as:

Y(m,k) = X(m,k) + V(m,k),

where k corresponds to the frequency:

ω_k = 2πk / L,

In the first speech spectrum Y(m,k), X(m,k) and V(m,k) are respectively the clean signal and the noise signal in the frequency domain, L is the frame length, and k is the frequency bin of the spectrum.
The purpose of the noise tracking stage is to compute the power spectral density (PSD) of the noisy speech signal Y(m,k), from which the a priori SNR and the a posteriori SNR can be calculated. In the gain-estimation stage, the gain function G(m,k) is estimated from the previously computed a priori and a posteriori SNRs. Finally, the enhanced signal Ŝ(m,k) is the result of applying G(m,k) to Y(m,k). For ease of notation, Y(m,k), Ŝ(m,k), V(m,k), and G(m,k) are hereafter written as Y, S, V, and G.
Decomposing the first speech spectra of the noisy and clean speech into amplitude and (signal) phase gives:

Y = Y_a e^{jθ_Y},

X = X_a e^{jθ_X},

where Y_a and X_a are amplitudes and θ_Y and θ_X are phases; in the present invention signal amplitudes are denoted with the subscript a.
To reconstruct the clean speech signal X from the noisy signal Y, the present invention estimates the phase of the clean speech signal as:

θ̂_X = argmax_{θ_X} p(θ_X | Y),

where the prior density of θ_X is assumed uniform:

p(θ) = 1 / 2π,

on the interval (−π, π). From the above equations it follows that:

θ̂_X = θ_Y;

the spectrum of the clean speech signal can therefore be expressed as:

X̂ = X̂_a e^{jθ_Y},
The purpose of spectral restoration is to estimate a gain function G.
Because the spectrum of the speech signal is obtained through the Fourier transform, to reduce the error caused by noise tracking the power spectrum of the speech signal |Y(m,k)|² is smoothed in both the frequency and time domains. In the frequency domain a window function is used for averaging, which increases the correlation between adjacent windows:

S_f(m,k) = Σ_{i=−w}^{w} b(i) |Y(m, k−i)|²,

where b(i) are the weighting coefficients and the window length is 2w+1. In the time domain, first-order recursive smoothing is applied over the frame positions:

S(m,k) = α_s S(m−1,k) + (1 − α_s) S_f(m,k),

where α_s is the smoothing parameter of the speech signal and S(m−1,k) is the power spectrum of the previous noisy frame. The minimum of the smoothed power spectrum is then tracked as the reference for speech presence, the speech presence probability is estimated, and a temporary minimum is stored as the initial value for the next interval:

S_min(m,k) = min{ S_min(m−1,k), S(m,k) },

S_tmp(m,k) = min{ S_tmp(m−1,k), S(m,k) },

Whenever L frames (m = 0, 1, …, L−1) have been read, the minimum and the previously stored temporary minimum are re-initialized as follows, and the minimum of the power spectrum in the above equations continues to be tracked:

S_min(m,k) = min{ S_tmp(m−1,k), S(m,k) },

S_tmp(m,k) = S(m,k),
Whether a frame of the speech signal contains speech can be determined by tracking the minimum of the smoothed power spectrum. The speech presence indicator is:

I(m,k) = 1 if S(m,k) / S_min(m,k) > δ, and I(m,k) = 0 otherwise,

where I(m,k) = 1 indicates that speech is present and I(m,k) = 0 indicates that it is absent. With the speech presence indicator, the speech presence probability of the observed speech can be tracked:

p̂(m,k) = α_p p̂(m−1,k) + (1 − α_p) I(m,k),
The power spectral density (PSD) of the noise signal and the gain function G are derived under the following two assumptions: the speech and noise signals are random processes, and the speech and noise signals are independent, uncorrelated, and additive.
The a priori SNR (ξ) and the a posteriori SNR (γ) are both statistically defined signal-to-noise ratios:

ξ(m,k) = σ_X²(m,k) / σ_V²(m,k),   γ(m,k) = |Y(m,k)|² / σ_V²(m,k),

where σ_X² = E[|X|²] and σ_V² = E[|V|²] are the power spectral densities of the clean speech signal X and the noise signal V, respectively.
Assuming that the spectra of the clean speech signal and the noise signal are both Gaussian distributed, the conditional probability density function (PDF) p(Y | X_a, θ_X) can be expressed as:

p(Y | X_a, θ_X) = (1 / (π σ_V²)) exp( −|Y − X_a e^{jθ_X}|² / σ_V² ),

The amplitude and phase of a complex Gaussian random variable with zero mean are statistically independent, so p(X_a, θ_X) can be written as:

p(X_a, θ_X) = p(X_a) · p(θ_X),

where p(X_a) follows a Rayleigh distribution:

p(X_a) = (2 X_a / σ_X²) exp( −X_a² / σ_X² ),

and σ_X² is the hyperparameter of the Rayleigh probability density.
The spectral amplitude X̂_a estimated by the maximum a posteriori spectral amplitude (MAPA) algorithm can be expressed as:

X̂_a = argmax_{X_a} p(X_a | Y) = argmax_{X_a} p(Y | X_a) p(X_a),

where the MAPA loss function can be written as:

J(X_a) = log p(Y | X_a) + log p(X_a),

Differentiating the above equation with respect to X_a and setting the derivative to zero yields the MAPA-based gain function G_MAPA, a closed-form function of the a priori SNR ξ and the a posteriori SNR γ. The speech signal spectrum enhanced by the maximum a posteriori spectral amplitude algorithm is:

X̂ = G_MAPA · Y,
The frequency-domain Wiener filter, proposed by the mathematician Norbert Wiener, is a linear filter that is optimal under the minimum mean square error criterion. The Wiener filter is one of the most classical spectral restoration methods and comes in time-domain and frequency-domain forms. The time-domain Wiener filter is derived by estimating the reconstructed speech signal x̂[n]. First, let a finite impulse response (FIR) filter of length L be h = [h_0 h_1 … h_{L−1}]^T; applying the filter to y[n] gives:

x̂[n] = h^T y,

There is an error between the reconstructed speech signal obtained from this estimate and the original speech signal:

e[n] = x[n] − x̂[n],

and the minimum mean square error (MMSE) criterion is taken:

J(h) = E[ e²[n] ],

Let the optimal reconstructed speech signal be x̂_0[n]; compared with y[n], x̂_0[n] must contain less noise, so the optimal filter h_0 minimizing J is the time-domain Wiener filter. The above is the time-domain part of the Wiener filter derivation; below, the derivation is continued in the frequency domain to obtain the frequency-domain Wiener filter.
Repeating the derivation of the time-domain Wiener filter in the frequency domain, and writing H in place of H[k], gives:

J(H) = E[ |X − H Y|² ],

Taking the partial derivative with respect to H in the above expression and setting it to zero yields the frequency-domain Wiener filter:

H_0 = Φ_X / Φ_Y,

where Φ_X and Φ_Y are the power spectral densities of X and Y, respectively. To construct the required filter, the power spectral densities of X and Y must be computed; Φ_Y is easy to obtain, and under the earlier assumption that the speech and noise signals are independent, uncorrelated, and additive:

Φ_Y = Φ_X + Φ_V,

where Φ_V is the PSD of the noise V. From the above equations for H_0, Φ_X, Φ_Y, and Φ_V, the gain function based on the frequency-domain Wiener filter is:

G_WIENER = Φ_X / (Φ_X + Φ_V) = ξ / (1 + ξ).
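A minimal sketch applying the Wiener gain G_WIENER = ξ/(1+ξ) per time-frequency bin is shown below. The a priori SNR is estimated here by simple power subtraction from a given noise PSD, which is an illustrative assumption; the decision-directed or DNN-based SNR estimates discussed elsewhere in this document could be substituted.

```python
import numpy as np


def wiener_enhance(noisy_spec, noise_psd, eps=1e-12):
    """Apply G_WIENER = xi / (1 + xi) to a complex noisy spectrum."""
    noisy_power = np.abs(noisy_spec) ** 2
    xi = np.maximum(noisy_power - noise_psd, eps) / (noise_psd + eps)   # a priori SNR
    gain = xi / (1.0 + xi)                                              # Wiener gain
    return gain * noisy_spec
```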
The purpose of the deep neural network based speech enhancement model (DNN-SE) is to use a neural network to enhance the noisy signal y and thereby reconstruct the clean signal x; a detailed block diagram of the system is shown in FIG. 1.
Still referring to FIG. 1, the feature extraction module 12 is connected to the speech conversion module 11 and concatenates the first speech spectra corresponding to adjacent frames to obtain a second speech spectrum. The signal y is transformed by the STFT into the first speech spectra Y_i, where Y_i denotes the spectral magnitude of the i-th frame of y; feature extraction then produces a new signal Y'_i by connecting each adjacent frame and taking the logarithm to form a continuous log-power spectrum, which can be expressed as:

Y'_i = [ Y_{i−I}; …; Y_{i−1}; Y_i; Y_{i+1}; …; Y_{i+I} ],   (Equation A)

where the symbol ";" denotes vertical concatenation and the length of the second speech spectrum Y'_i is 2I+1 (in the present invention, I = 5 is assumed). Each second speech spectrum Y'_i is then processed by the deep-neural-network-based speech enhancement to obtain X̂_i, the enhanced log-power spectrum. The log-power spectrum is then converted back into the original magnitude spectrum by spectral restoration and combined with the signal phase θ_i of the original signal y to obtain a new spectrum; finally, the enhanced time-domain signal x̂ is obtained through the inverse short-time Fourier transform.
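A minimal NumPy sketch of Equation A is shown below: the log-power spectra of the 2I+1 neighbouring frames are stacked into one context feature per frame. Padding the edge frames by repetition is an assumption made here for completeness.

```python
import numpy as np


def context_features(log_power, I=5):
    """Vertically concatenate frames i-I ... i+I for every frame i.

    log_power: array of shape (frames, bins)
    returns:   array of shape (frames, (2*I + 1) * bins)
    """
    padded = np.pad(log_power, ((I, I), (0, 0)), mode="edge")
    stacked = [padded[j : j + log_power.shape[0]] for j in range(2 * I + 1)]
    return np.concatenate(stacked, axis=1)
```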
The deep neural network block in FIG. 1 is composed of a deep neural network and is used to enhance the input signal Y'_i. Assuming the deep neural network here has L layers, the input-output relation of any layer l is:

z^{(l+1)} = F_l( W_l z^{(l)} + b_l ),   (Equation B)

where F_l and (W_l, b_l) are respectively the activation function and the linear regression (affine) transform of layer l, and the input layer and output layer correspond to the first and the L-th layer, respectively. From the deep neural network block we therefore obtain z^{(1)} = Y'_i and z^{(L)} = X̂_i.
In the model training phase, a training data set consisting of noisy speech signal–clean speech signal pairs is first prepared. The model parameters are trained by setting the model input to the second speech spectrum Y'_i and computing and minimizing the loss between the model output X̂_i and the target output X_i; the loss function commonly used in regression analysis, the mean square error, is used here.
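A minimal PyTorch sketch of Equation B and of the MSE training step described above is shown below; the layer widths, number of hidden layers, and optimizer settings are illustrative assumptions rather than values specified by this document.

```python
import torch
import torch.nn as nn


class EnhancementDNN(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=1024, n_hidden=3):
        super().__init__()
        layers, width = [], in_dim
        for _ in range(n_hidden):                       # Equation B: z = F(Wz + b)
            layers += [nn.Linear(width, hidden), nn.ReLU()]
            width = hidden
        layers.append(nn.Linear(width, out_dim))        # output layer (layer L)
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


def train_step(model, optimizer, noisy_context, clean_target):
    """One gradient step minimizing the mean square error loss."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(noisy_context), clean_target)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In use, `noisy_context` is a batch of second speech spectra Y'_i and `clean_target` the corresponding clean log-power spectra.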
Deep Neural Network with Gain Estimation Based Speech Enhancement (DNN-Gain)
A deep neural network is used to compute the a priori SNR and the a posteriori SNR; the detailed architecture is shown in FIG. 2.
In the DNN-Gain model, the deep neural network replaces the previous unsupervised noise tracking method: the network predicts ξ and γ, from which the gain function G is computed, and the enhanced speech signal spectrum is obtained by applying G to the log-power spectrum Y of the original input signal, which can be expressed as:

Ŝ = G · Y.
Average vector of speaker characteristics (d-vector)
For example, as shown in Table 1 below, the input is the Mel spectrum of the speech; the output of the last hidden layer is L2-regularized, the outputs of the deep neural network over the whole utterance are then collected, and an average vector (d-vector) is computed. Tested in quiet and noisy environments, a d-vector speaker verification system reduced the error rate by 14% and 25%, respectively.
Table 1
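A minimal PyTorch-style sketch of the d-vector computation described above: the last hidden layer's activations are L2-normalized frame by frame and averaged over the utterance. The speaker network producing these activations is assumed to be any frame-level speaker classifier whose penultimate-layer outputs are exposed.

```python
import torch
import torch.nn.functional as F


def d_vector(hidden_activations):
    """Average of L2-normalized last-hidden-layer outputs over one utterance.

    hidden_activations: tensor of shape (frames, hidden_dim) taken from the
    last hidden layer of the speaker network.
    """
    normalized = F.normalize(hidden_activations, p=2, dim=1)   # L2 per frame
    return normalized.mean(dim=0)                              # utterance-level d-vector
```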
To improve the processing capability of a general speech enhancement system across different speakers, the present invention builds, on the basis of the deep neural network gain-estimation speech enhancement model (DNN-Gain), a speaker-aware denoising neural network (SaDNN) 20 applying deep learning; its system block diagram is shown in FIG. 2. The speech enhancement system 20 applying deep learning of the second embodiment likewise includes at least a speech conversion module 21, a speech extraction module 22, a speech enhancement subsystem 23, and a speech restoration module 24. Since the speech conversion module 21, speech extraction module 22, and speech restoration module 24 of the second embodiment are the same as the speech conversion module 11, speech extraction module 12, and speech restoration module 14 of the first embodiment, their description is not repeated here.
The speech enhancement subsystem 23 of the second embodiment likewise includes a speaker feature extraction model 231 and a speech enhancement network model 232, and the speaker feature extraction model 231 of the second embodiment is the same as the speaker feature extraction model 131 of the first embodiment; they respectively include a first deep neural network 233 and a second deep neural network 234. The speech enhancement subsystem 23 of the second embodiment further includes a noise feature extraction model 235 and an (embedded) speech enhancement model 236. The noise feature extraction model 235 is connected to the speech extraction module 22 and the speech enhancement network model 232; it receives the second speech spectrum and performs noise feature extraction through a third deep neural network 237 to produce at least one noise feature code λ_noe.
The embedded speech enhancement model 236 is connected to the speech extraction module 22, the speech enhancement network model 232, and the noise feature extraction model 235; it receives the second speech spectrum, the at least one speaker feature code λ_spk, and the at least one noise feature code λ_noe as input to a fourth deep neural network 238, through which the gain function G is estimated.
The speaker-aware speech enhancement system of the present invention effectively integrates the characteristics of different speakers; the results show that the evaluation metrics after enhancement are all better than those of the speech enhancement system based on the DNN gain-estimation model. Considering further that several kinds of noise interference (non-speech signals) exist in the speaker's environment, a speaker and speaking environment-aware denoising neural network (SEaDNN), which combines both speaker features and noise features, is derived on the basis of the DNN gain-estimation speech enhancement model. The environment-aware speech enhancement system combines a deep-neural-network-based speech enhancement system, the personalized features of a specific speaker, and the noise characteristics of the speaker's environment.
Similar to the speech enhancement system 10 applying deep learning of the first embodiment described in the preceding sections, in the second embodiment the input of the environment-aware speech enhancement system 20 is composed by connecting each adjacent frame and taking the logarithm to form a continuous context feature, which can be expressed as Equation A above. What is particular about the deep-neural-network-based speech enhancement system 20 of the second embodiment is that it is composed of three deep neural network (DNN) models: the speaker feature extraction model (SpkFE module) 231, the environmental noise feature extraction model (NoeFE module) 235, and the embedded speech enhancement model (for example, the environment-embedded speech enhancement neural network model, EE-DNN module) 236.
FIG. 3 is a schematic diagram of the (speaker-embedded) speech enhancement network model. As shown in FIG. 3, the speech enhancement network model 132 or 232 (with embedded speaker features) is a deep neural network that combines the noisy speech feature Y'_i with the speaker feature code λ_spk produced by the speaker feature extraction model. The noisy speech feature Y'_i is placed at the input layer (layer 1), and the speaker feature code λ_spk is concatenated with the output of a specific layer (layer l), so the input feature to the next hidden layer 30 (layer l+1) can be expressed as [z^{(l)}; λ_spk]. The speaker-embedded speech enhancement network 132 or 232 of the present invention is therefore similar to the conventional DNN gain-estimation speech enhancement network model, except that the speech enhancement network 132 or 232 adds the speaker features at a specific hidden layer.
Before the (speaker-embedded) speech enhancement network 132 or 232 is trained, the training data set is assembled from the noisy speech features {Y'_i}, the clean speech features {X_i}, and the speaker features {λ_spk} produced by the speaker feature extraction model. Training takes {Y'_i} and {λ_spk} as input and produces the gain function output corresponding to {X_i}, for example by minimizing the error of a loss function.
To extract speaker features from the speaker's information, the speech enhancement system proposed by the present invention includes a speaker feature extraction model (SpkFE) 131 or 231, as shown in Table 2 below, where λ_spk is the speaker feature and p(spk_j) is the speaker class.
Table 2
The main purpose of constructing the speaker feature extraction model (SpkFE) 131 or 231 is to extract the speaker feature code λ_spk from the speech signal, as follows. The feature of each frame of the speech signal is classified into the speaker's class p(spk_j), so the number of speakers N determines the output size and dimension of the deep neural network. Considering also the non-speech frame class in the training data set, the deep neural network has (N+1) output classes, i.e., j = 1, 2, …, N+1. The class p(spk_j) is the speaker class obtained after the speech signal has been processed by voice activity detection (VAD), and it serves as the desired output of the deep neural network in the speaker feature extraction model. The class p(spk_j) is a one-hot encoded (N+1)-dimensional vector whose non-zero element corresponds to the respective speaker.
FIG. 4 shows the architecture of the first deep neural network in the speaker feature extraction model 131. The input-output relation between each hidden layer 40 in the network can be expressed as Equation B above; the activation function of the output layer is set to the softmax function, while the activation functions of the input layer and all hidden layers are set to the rectified linear unit (ReLU). When training of the deep neural network is complete, the output of the last hidden layer (i.e., the penultimate layer) is taken and defined as the speaker feature code λ_spk, which represents the speaker characteristics of each frame vector of the input; the speaker feature extraction model therefore has two outputs, and the obtained λ_spk is fed into the speech enhancement network model 132 for further processing. Notably, for unknown speakers, the speaker feature code λ_spk selected in this way generalizes speaker information better than the output of the output layer, and in preliminary studies this choice also gave the environment-aware speech enhancement system with noise features better speech enhancement performance.
The purpose of the deep-neural-network-based noise feature extraction model is to extract the noise feature code λ_noe of the environment in which the speaker is located, as illustrated in Table 3 below.
Table 3
The noise feature extraction model classifies each noise feature of the speech signal into its noise class, so the number of noise types M determines the output size and dimension of the noise feature extraction model, i.e., k = 1, 2, …, M. The class p(noe_k) is the noise class of the speech signal and serves as the desired output of the noise feature extraction model; p(noe_k) is a one-hot encoded M-dimensional vector whose non-zero element corresponds to the respective noise.
FIG. 5 is a schematic diagram of the third deep neural network and shows the architecture of the noise feature extraction model 235, whose network structure is the third deep neural network 237. The input-output relation between each hidden layer 50 can be expressed as Equation B above; the activation function of the output layer is set to the softmax function, while the activation functions of the input layer and all hidden layers are set to the rectified linear unit (ReLU). When training of the deep neural network is complete, the output of the last hidden layer (i.e., the penultimate layer) is taken and defined as the noise feature code λ_noe, which represents the characteristics of each frame vector of the input speech signal; thereafter both λ_noe and λ_spk are fed into the environment-embedded speech enhancement neural network model (the embedded speech enhancement model 236) for further processing.
Comparing the noise feature extraction model 235 with the speaker feature extraction model 231, both aim to extract the features of a specific speech signal in order to improve the performance of the environment-embedded speech enhancement neural network model (the embedded speech enhancement model 236). For unknown noises, the noise feature code λ_noe selected in this way generalizes unknown noise characteristics better than the output of the final output layer, and in preliminary experiments this choice gave the environment-aware speech enhancement system with noise features better speech enhancement performance.
In the environment-embedded denoising neural network (EE-DNN) (the embedded speech enhancement model 236), whereas the DNN gain-estimation speech enhancement model uses only the feature spectrum of the noisy speech as its input, the environment-embedded speech enhancement network 236 additionally incorporates the speaker and noise features produced by the speaker feature extraction model and the noise feature extraction model; these two kinds of features are collectively called environment features. The architecture of the environment-embedded speech enhancement network 236 is shown in FIG. 6.
FIG. 6 is a schematic diagram of the fourth deep neural network. As shown in FIG. 6, the input of the environment-embedded speech enhancement network 236 includes the noisy speech feature Y'_i, the speaker feature λ_spk, and the noise feature code λ_noe. The noisy speech feature Y'_i is placed at the input layer, while the noise feature code λ_noe and the speaker feature λ_spk are concatenated with the outputs of specific layers (layer l1 and layer l2, respectively); the inputs to the next hidden layers 60 (layers l1+1 and l2+1) can therefore be expressed as [z^{(l1)}; λ_noe] and [z^{(l2)}; λ_spk].
The environment-embedded speech enhancement neural network model is therefore similar to the conventional DNN-Gain network, except that it adds the speaker features and the noise features at specific hidden layers. Its training data set is constructed from the noisy speech features {Y'_i}, the corresponding clean speech features {X_i}, the features {λ_spk} produced by the speaker feature extraction model SpkFE, and the features {λ_noe} produced by the noise feature extraction model. Training the environment-embedded speech enhancement neural network model takes {Y'_i}, {λ_spk}, and {λ_noe} as input and produces the gain function output corresponding to {X_i}.
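A minimal PyTorch sketch of this forward pass, with the noise code injected after layer l1 and the speaker code after layer l2 as described above; the layer widths and the choice of l1 < l2 are illustrative assumptions.

```python
import torch
import torch.nn as nn


class EEDNN(nn.Module):
    """Gain-estimation DNN with noise and speaker codes injected at two layers."""

    def __init__(self, feat_dim, noe_dim, spk_dim, out_dim, hidden=1024):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())           # layers 1..l1
        self.block2 = nn.Sequential(nn.Linear(hidden + noe_dim, hidden), nn.ReLU())   # layers l1+1..l2
        self.block3 = nn.Sequential(nn.Linear(hidden + spk_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, out_dim))                       # layers l2+1..L

    def forward(self, noisy_feat, noe_code, spk_code):
        z1 = self.block1(noisy_feat)                              # z^(l1)
        z2 = self.block2(torch.cat([z1, noe_code], dim=-1))       # [z^(l1); lambda_noe]
        return self.block3(torch.cat([z2, spk_code], dim=-1))     # [z^(l2); lambda_spk]
```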
FIG. 7A is a distribution diagram of the speaker feature codes, and FIG. 7B is a distribution diagram of speaker noise. The training data are first divided into a noisy data set (Noisy) and a clean data set (Clean). The noisy data set contains 24 different speakers (24 × 8 = 192 audio files) mixed with 4 kinds of noise at 3 different SNRs, for a total of 2304 audio files (192 × 3 × 4 = 2304); the clean data set contains 24 different speakers (24 × 8 = 192 audio files). The test data are the union of the noisy and clean data sets (2304 + 192 = 2496 audio files). The SpkFE model is then trained separately on the noisy data set and on the clean data set, yielding two models referred to as the Noisy model and the Clean model; finally, both models are tested on the test data, producing two sets of speaker feature codes.
The speaker features shown in FIG. 7B cannot effectively separate different signal-to-noise ratios (SNRs), because the feature points are largely interleaved. Some trends can nevertheless be observed: under high-SNR conditions the distribution shifts toward the periphery, whereas under low-SNR conditions it shifts toward the center. This result indicates that the speaker model is more affected by the SNR than by the noise type.
FIG. 7C is a bar chart comparing the original noisy speech with speech processed by the speech enhancement system applying deep learning of the present invention. As shown in FIG. 7C, these results demonstrate the effectiveness of the speaker feature extraction model and the noise feature extraction model in the first and second embodiments of the present invention, and thus the robustness of the whole enhancement process to speech variation and to changes in environmental noise.
The present invention proposes a novel speaker-aware speech enhancement system that reduces the distortion in utterances caused by different speakers and different environments, thereby improving the quality of the speech signal. The speaker-aware speech enhancement system is composed of three deep neural networks (DNNs): the first captures the features of each speaker, the second captures the characteristics of the environmental noise around the speaker, and the third uses the speech signal features extracted by the first and second DNNs to restore the noisy speech to something close to the original clean speech. In particular, the speaker-aware speech enhancement system has proven effective at enhancing utterances from unknown speakers and utterances produced in unknown environments.
[Beneficial effects of the embodiments]
One of the beneficial effects of the present invention is that the speaker-aware speech enhancement system provided herein, when processing noisy utterances, achieves clear noise reduction and improvements in speech quality and intelligibility, and its performance exceeds that of conventional unsupervised and supervised speech enhancement.
The contents disclosed above are merely preferred and feasible embodiments of the present invention and do not thereby limit the scope of its claims; all equivalent technical changes made using the contents of the specification and drawings of the present invention are therefore included within the scope of the claims of the present invention.
10: speech enhancement system; 11: speech conversion module; 12: speech extraction module; 13: speech enhancement subsystem; 131: speaker feature extraction model; 132: speech enhancement network model; 133: first deep neural network; 134: second deep neural network; 14: speech restoration module; y: first speech signal; Y_i: first speech spectrum; θ_i: signal phase; Y'_i: second speech spectrum; λ_spk: speaker feature code; λ_noe: noise feature code; G, G_i: gain function; X̂_i: enhanced speech signal spectrum; x̂: second speech signal; 20: speech enhancement system; 21: speech conversion module; 22: speech extraction module; 23: speech enhancement subsystem; 231: speaker feature extraction model; 232: speech enhancement network model; 233: first deep neural network; 234: second deep neural network; 235: noise feature extraction model; 236: embedded speech enhancement model; 237: third deep neural network; 238: fourth deep neural network; 24: speech restoration module; 30, 40, 50, 60: hidden layer; PESQ: speech quality
FIG. 1 is a block diagram of the speech enhancement system applying deep learning according to the first embodiment of the present invention.
FIG. 2 is a block diagram of the speech enhancement system applying deep learning according to the second embodiment of the present invention.
FIG. 3 is a schematic diagram of the speech enhancement network model.
FIG. 4 is a schematic diagram of the first deep neural network architecture in the speaker feature extraction model.
FIG. 5 is a schematic diagram of the third deep neural network architecture.
FIG. 6 is a schematic diagram of the fourth deep neural network architecture.
FIG. 7A is a distribution diagram of the speaker feature codes.
FIG. 7B is a distribution diagram of speaker noise.
FIG. 7C is a bar chart of the original noisy speech and of speech processed by the speech enhancement system applying deep learning of the present invention.
Claims (7)