CN110751957B

CN110751957B - A speech enhancement method using stacked multi-scale modules

Info

Publication number: CN110751957B
Application number: CN201911182689.3A
Authority: CN
Inventors: 蓝天; 李森; 吕忆蓝; 刘峤; 钱宇欣; 叶文政; 惠国强; 李萌; 彭川
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2019-09-25
Filing date: 2019-11-27
Publication date: 2020-10-27
Anticipated expiration: 2039-11-27
Also published as: CN110751957A

Abstract

The present invention discloses an end-to-end speech enhancement method using stacked multi-scale modules, comprising the following steps: S1: constructing a cascaded end-to-end speech enhancement framework, and splicing the stacked multi-scale modules into a network structure; S2: in the preprocessing stage, transforming the time domain signal into a two-dimensional feature; S3: enhancing the two-dimensional feature using a speech enhancement module; S4: in the post-processing stage, transforming the enhanced feature representation into a one-dimensional time domain signal through decoding synthesis. In order to further improve the performance of the algorithm, the evaluation indicators STOI and SDR of speech enhancement are integrated into the loss function using a multi-objective joint optimization training strategy. Experiments show that the method proposed in the present invention can significantly improve the speech enhancement effect, and has good noise resistance under unknown noise and low signal-to-noise ratio conditions.

Description

A speech enhancement method using stacked multi-scale modules

技术领域technical field

本发明属于语音增强技术领域，尤其涉及一种使用堆叠多尺度模块的端到端语音增强方法。The invention belongs to the technical field of speech enhancement, and in particular relates to an end-to-end speech enhancement method using stacked multi-scale modules.

背景技术Background technique

语音增强是指去除或衰减含噪语音中附加噪声的任务，通过对噪声的抑制和分离来提升语音整体感知质量与语音可懂度，在鲁棒性语音识别、助听器设计、说话人验证等方面有着广泛的应用。传统的语音增强方法包括谱减法、维纳滤波法、基于统计模型的方法以及基于子空间的方法等，在过去的几年中，基于深度学习的监督性语音增强方法逐渐成为学者们关注的主要研究方向。Speech enhancement refers to the task of removing or attenuating additional noise in noisy speech, improving the overall perceptual quality of speech and speech intelligibility by suppressing and separating noise, in robust speech recognition, hearing aid design, speaker verification, etc. Has a wide range of applications. Traditional speech enhancement methods include spectral subtraction, Wiener filtering, statistical model-based methods, and subspace-based methods. In the past few years, deep learning-based supervised speech enhancement methods have gradually become the main focus of scholars. research direction.

一些学者考虑直接对语音的时域信号进行处理，而不是依赖于语音信号的频域表示，避免了语音信号在时域和频域之间来回切换，更充分的利用语音的时域特征表示。基于WaveNet框架，Qian等人提出引入语音先验分布的方法进行语音增强，Rethage等人通过非因果的扩张卷积来预测目标。Pascual等人提出SEGAN，使用卷积网络对时域语音直接增强，Fu等人提出全卷积神经网络,用于时域的整句语音增强，Pandey等人为了解决实时性语音增强，将序列建模网络与编解码器架构结合来处理时域信号。Some scholars consider processing the time domain signal of speech directly instead of relying on the frequency domain representation of the speech signal, avoiding the switch back and forth between the time domain and the frequency domain of the speech signal, and making full use of the time domain feature representation of speech. Based on the WaveNet framework, Qian et al. proposed the method of introducing speech prior distribution for speech enhancement, and Rethage et al. predicted the target through acausal dilated convolution. Pascual et al. proposed SEGAN, which used convolutional networks to directly enhance speech in the time domain. Fu et al. proposed a fully convolutional neural network for whole sentence speech enhancement in time domain. In order to solve real-time speech enhancement, Pandey et al. A modular network is combined with a codec architecture to process time-domain signals.

这些基于端到端的方法都是直接将一维时域波形映射到目标语音，然而时域波形信号本身不能表现出明显的特征结构，直接对时域信号建模比较困难，在低信噪比环境下建模难度会进一步提高。These end-to-end methods directly map the one-dimensional time-domain waveform to the target speech. However, the time-domain waveform signal itself cannot show obvious characteristic structure, and it is difficult to directly model the time-domain signal. In a low signal-to-noise ratio environment The next modeling difficulty will be further increased.

发明内容SUMMARY OF THE INVENTION

本发明提供一种使用堆叠多尺度模块的端到端语音增强方法，旨在解决上述存在的问题。The present invention provides an end-to-end speech enhancement method using stacked multi-scale modules, aiming to solve the above-mentioned problems.

本发明是这样实现的，一种使用堆叠多尺度模块的语音增强方法，包括以下步骤：The present invention is achieved in this way, a speech enhancement method using stacked multi-scale modules, comprising the following steps:

S1：构建级联端到端语音增强框架，并将堆叠的多尺度模块拼接到网络结构中；S1: Build a cascaded end-to-end speech enhancement framework and stitch stacked multi-scale modules into the network structure;

S2：在预处理阶段，将时域信号变换为二维特征；S2: In the preprocessing stage, transform the time domain signal into two-dimensional features;

S3：利用语音增强模块对二维特征进行增强；S3: Use the speech enhancement module to enhance the two-dimensional features;

S4：在后处理阶段，通过解码合成将增强后的特征表示变换为一维时域信号。S4: In the post-processing stage, the enhanced feature representation is transformed into a one-dimensional time domain signal by decoding synthesis.

进一步的，所述级联端到端语音增强架构包括语音时域信号预处理、语音增强模块以及目标语音合成后处理；具体步骤包括：Further, the cascaded end-to-end speech enhancement architecture includes speech time-domain signal preprocessing, speech enhancement module and post-processing of target speech synthesis; specific steps include:

a.在时域信号预处理阶段，一维卷积被用来对输入的语音片段进行卷积操作，每一个卷积核对带噪语音y作用的结果被逐行堆叠起来，形成一个二维的实数值特征Y，启发自卷积神经网络对图片像素值的处理方式，将二维特征分离，得到绝对值特征和sgn mask；a. In the time-domain signal preprocessing stage, one-dimensional convolution is used to perform the convolution operation on the input speech segment, and the results of each convolution check on the noisy speech y are stacked row by row to form a two-dimensional The real-valued feature Y is inspired by the processing method of the convolutional neural network on the pixel value of the image, and the two-dimensional feature is separated to obtain the absolute value feature and the sgn mask;

b.带噪语音y的绝对值特征被输入到语音增强模块中增强，得到绝对值特征的估计

将其与sgn mask相乘合成目标语音的特征表示：b. The absolute value feature of the noisy speech y is input to the speech enhancement module for enhancement, and the estimation of the absolute value feature is obtained

Multiply it with the sgn mask to synthesize the feature representation of the target speech:

c.经过转置卷积将

变换为时域信号

c. After transposed convolution, the

Transform to time domain signal

进一步的，所述多尺度模块包括平均池化层，卷积核为1×1和3×3的卷积，以及不同扩张率的扩张卷积。Further, the multi-scale module includes an average pooling layer, convolution kernels with convolution kernels of 1×1 and 3×3, and dilated convolutions with different dilation rates.

进一步的，运用多目标联合优化的训练策略将语音增强的评价指标STOI与SDR融入到损失函数中。Further, the training strategy of multi-objective joint optimization is used to integrate the speech enhancement evaluation indicators STOI and SDR into the loss function.

进一步的，将STOI指标融入到损失函数中的具体步骤包括：Further, the specific steps to incorporate the STOI indicator into the loss function include:

1)STOI输入为纯净语音X和退化语音

首先去除对语音可懂度无贡献的无声区域，然后用STFT将时域信号变换到时频域，通过将两个信号分割为50％重叠的带汉宁窗的帧；1) STOI input is pure speech X and degenerate speech

First remove the silent regions that do not contribute to speech intelligibility, then use STFT to transform the time-domain signal to the time-frequency domain by splitting the two signals into 50% overlapping frames with a Hanning window;

2)进行1/3倍频带分析，划分共15个1/3倍频带，其中频带中心频率范围为4.3kHz至150Hz，纯净语音的短时时间包络x_j，m表示如下：2) Carry out 1/3 octave frequency band analysis, divide a total of 15 1/3 octave frequency bands, in which the center frequency range of the frequency band is 4.3kHz to 150Hz, and the short-time envelope x _{j, m} of pure speech is expressed as follows:

[X_j(m-L+1)，X_j(m-L+2)，...X_j(m)]^T [X _j (m-L+1), X _j (m-L+2), ...X _j (m)] ^T

其中X∈R为由x得到的1/3倍频带，M是一段语音的总帧数，m为帧的索引，j是1/3倍频带的索引，L对应语音的长度；where X∈R is the 1/3 octave band obtained by x, M is the total number of frames of a speech, m is the index of the frame, j is the index of the 1/3 octave band, and L corresponds to the length of the speech;

3)对语音归一化与裁剪，得到退化语音的包络表示

可懂度表示为两个时间包络之间的相关系数：3) Normalize and crop the speech to obtain the envelope representation of the degraded speech

Intelligibility is expressed as the correlation coefficient between two temporal envelopes:

其中，||·||₂为L2范数，μ(·)示相应样本的均值向量。Among them, ||·|| ₂ is the L2 norm, and μ(·) is the mean vector of the corresponding sample.

4)计算所有波段和帧的可懂度的平均值，可以得到STOI计算指标：4) Calculate the average of the intelligibility of all bands and frames, and the STOI calculation indicator can be obtained:

5)将增强语音

带入到STOI计算公式中，即可得到训练过程中的STOI计算指标：5) Will enhance the voice

Bring it into the STOI calculation formula, and you can get the STOI calculation indicator during the training process:

其中，d_j，m表示为增强语音与纯净语音时间包络的相关系数。Among them, d _{j, m} represents the correlation coefficient between the enhanced speech and the pure speech temporal envelope.

进一步的，将SDR指标融入到损失函数中的具体步骤包括：Further, the specific steps for incorporating the SDR indicator into the loss function include:

1)SDR的输入为纯净语音x和增强语音

增强语音的SDR计算过程如下：1) The input of SDR is pure voice x and enhanced voice

The SDR calculation process for enhanced speech is as follows:

2)对SDR优化目标进行等价变换以简化计算得到：2) Equivalently transform the SDR optimization objective to simplify the calculation to obtain:

其中，最大化评价指标SDR的过程就等价于最小化

Among them, the process of maximizing the evaluation index SDR is equivalent to minimizing

进一步的，将STOI与SDR评价指标融合到损失函数中的具体步骤包括：Further, the specific steps of integrating the STOI and SDR evaluation indicators into the loss function include:

1)计算常规的均方根误差，过程如下：1) Calculate the conventional root mean square error, the process is as follows:

其中M和N是每条语音的采样点数与语音总条数。Among them, M and N are the number of sampling points of each voice and the total number of voices.

2)将均方根误差与基于STOI和SDR的评价指标损失函数合并：2) Combine the root mean square error with the loss function of the evaluation index based on STOI and SDR:

式中，α,β,γ对应损失函数中的不同部分的系数。In the formula, α, β, γ correspond to the coefficients of different parts in the loss function.

其中X∈R为由x得到的1/3倍频带，M是一段语音的总帧数，m为帧的索引，j∈{1，2，...15}是1/3倍频带的索引，L＝30对应分析语音的长度为384ms。where X∈R is the 1/3 octave band obtained by x, M is the total number of frames of a speech, m is the index of the frame, and j∈{1, 2,...15} is the index of the 1/3 octave band , L=30 corresponds to the length of the analyzed speech being 384ms.

与现有技术相比，本发明的有益效果是：为了提高神经网络对时域语音信号直接处理能力，本发明提出了一种新的多尺度端到端语音增强框架。在预处理阶段将时域信号变换为二维特征表示，然后利用语音增强模块对二维特征进行增强，最后通过解码合成将增强后的特征表示变换为一维时域信号。为进一步提高算法的性能，运用多目标联合优化的训练策略将语音增强的评价指标STOI与SDR融入到损失函数中。实验表明，本发明提出的方法能够显著提高语音增强效果，并且在未知噪声和低信噪比条件下具有较好的抗噪性。Compared with the prior art, the beneficial effect of the present invention is that in order to improve the direct processing capability of the neural network for the time domain speech signal, the present invention proposes a new multi-scale end-to-end speech enhancement framework. In the preprocessing stage, the time domain signal is transformed into a two-dimensional feature representation, then the two-dimensional feature is enhanced by the speech enhancement module, and finally the enhanced feature representation is transformed into a one-dimensional time domain signal through decoding and synthesis. In order to further improve the performance of the algorithm, the training strategy of multi-objective joint optimization is used to integrate the evaluation indicators STOI and SDR of speech enhancement into the loss function. Experiments show that the method proposed in the present invention can significantly improve the effect of speech enhancement, and has better noise immunity under unknown noise and low signal-to-noise ratio conditions.

附图说明Description of drawings

图1为本发明整体示意图；Fig. 1 is the overall schematic diagram of the present invention;

图2为本发明中堆叠多尺度模块示意图；2 is a schematic diagram of stacking multi-scale modules in the present invention;

具体实施方式Detailed ways

为了使本发明的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本发明进行进一步详细说明。应当理解，此处所描述的具体实施例仅仅用以解释本发明，并不用于限定本发明。In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention.

在本发明的描述中，需要理解的是，术语“长度”、“宽度”、“上”、“下”、“前”、“后”、“左”、“右”、“竖直”、“水平”、“顶”、“底”“内”、“外”等指示的方位或位置关系为基于附图所示的方位或位置关系，仅是为了便于描述本发明和简化描述，而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作，因此不能理解为对本发明的限制。此外，在本发明的描述中，“多个”的含义是两个或两个以上，除非另有明确具体的限定。In the description of the present invention, it should be understood that the terms "length", "width", "upper", "lower", "front", "rear", "left", "right", "vertical", The orientations or positional relationships indicated by "horizontal", "top", "bottom", "inside", "outside", etc. are based on the orientations or positional relationships shown in the accompanying drawings, which are only for the convenience of describing the present invention and simplifying the description, rather than An indication or implication that the referred device or element must have a particular orientation, be constructed and operate in a particular orientation, is not to be construed as a limitation of the invention. In addition, in the description of the present invention, "plurality" means two or more, unless otherwise expressly and specifically defined.

实施例Example

请参阅图1-2，本发明提供一种技术方案：一种使用堆叠多尺度模块的端到端语音增强方法，包括以下步骤：Referring to Figures 1-2, the present invention provides a technical solution: an end-to-end speech enhancement method using stacked multi-scale modules, comprising the following steps:

本发明提出的端到端语音增强框架包含语音时域信号预处理、语音增强模块以及目标语音合成后处理，如图1所示。The end-to-end speech enhancement framework proposed by the present invention includes speech time-domain signal preprocessing, speech enhancement module and post-processing of target speech synthesis, as shown in FIG. 1 .

假设时域干净语音为x，噪声信号为n，那么带噪语音y可以表示为：Assuming that the time domain clean speech is x and the noise signal is n, then the noisy speech y can be expressed as:

y＝x+ny=x+n

在时域信号预处理阶段，一维卷积被用来对输入的语音片段进行卷积操作，每一个卷积核对带噪语音y作用的结果被逐行堆叠起来，形成一个二维的实数值特征Y，启发自卷积神经网络对图片像素值的处理方式，将二维特征分离，得到绝对值特征和sgn mask，其中sgn 表示sign function(符号函数)，即取Y的符号，二维特征Y表示为绝对值特征和sgnmask之积：In the time-domain signal preprocessing stage, one-dimensional convolution is used to perform a convolution operation on the input speech segment, and the results of each convolution kernel on the noisy speech y are stacked row by row to form a two-dimensional real value. The feature Y is inspired by the way the convolutional neural network processes the pixel value of the image, and the two-dimensional feature is separated to obtain the absolute value feature and the sgn mask, where sgn represents the sign function (sign function), that is, the symbol of Y is taken, the two-dimensional feature Y is expressed as the product of absolute value features and sgnmask:

Y＝abs(Y)⊙sgn(Y)Y=abs(Y)⊙sgn(Y)

其中⊙表示对应元素相乘，随后带噪语音y的绝对值特征被输入到语音增强模块增强，得到绝对值特征的估计

将其与sgn mask相乘合成目标语音的特征表示：where ⊙ represents the multiplication of the corresponding elements, and then the absolute value feature of the noisy speech y is input to the speech enhancement module for enhancement, and the estimation of the absolute value feature is obtained.

最后再经过转置卷积将

变换为时域信号

Finally, through transposed convolution, the

Transform to time domain signal

其中，本框架采用的语音增强模块建立在全卷积网络的基础上。在该模块的编码过程中，每层卷积将特征大小减半，但是channels加倍，经过多层卷积将特征编码成一个小而深的表示形式，相应地在解码过程中，逐级扩大特征的尺寸，最后恢复到原有大小。在扩大特征尺寸的过程中，我们通过双线性插值上采样来获得更高的分辨率。Among them, the speech enhancement module adopted in this framework is based on the fully convolutional network. In the encoding process of this module, each layer of convolution reduces the feature size by half, but the channels are doubled, and the feature is encoded into a small and deep representation through multi-layer convolution, and correspondingly in the decoding process, the feature is expanded step by step. size, and finally returns to its original size. In the process of enlarging the feature size, we upsampling by bilinear interpolation to obtain higher resolution.

在语音增强模块中在相同层级的层之间添加跳过连接，这允许通过复制操作保留高水平的细节。低层级信息向高层级信息地直接流入可以有效引导模型对高分辨率特征的建模。Add skip connections between layers of the same hierarchy in the speech enhancement module, which allows to preserve a high level of detail through the copy operation. The direct inflow of low-level information to high-level information can effectively guide the model to model high-resolution features.

语音增强模块的对称结构确保了其输入和输出具有相同的形状大小，这种结构使其自然适用于任何像素密集的预测任务，尤其是对图像中每个像素标签的预测任务。The symmetrical structure of the speech enhancement module ensures that its input and output have the same shape size, which makes it naturally suitable for any pixel-intensive prediction task, especially the prediction task of each pixel label in an image.

为了更充分的利用语音特征中的多尺度上下文信息我们精心设计了多尺度块并将其堆叠，如图2所示，SMB(Stacked Multi-scale Block，堆叠多尺度Block)包含平均池化层，普通的1×1和3×3的卷积，以及不同扩张率的扩张卷积；为了对原始信息进行有效保留，我们将原始特征与多尺度特征拼接堆叠起来。In order to make full use of the multi-scale context information in speech features, we carefully designed multi-scale blocks and stacked them, as shown in Figure 2, SMB (Stacked Multi-scale Block, stacked multi-scale Block) contains an average pooling layer, Ordinary 1×1 and 3×3 convolutions, and dilated convolutions with different dilation rates; in order to effectively preserve the original information, we stack the original features with multi-scale features.

基于深度学习的语音增强方法常常采用均方误差MSE(Mean Squared Error)作为训练的损失函数，但是在语音增强过程中，往往是通过评估被增强的语音的可懂度和语音质量来检验模型的性能好坏，这种损失函数和评估指标之间的不一致并不能保证能得到最优的模型。Speech enhancement methods based on deep learning often use mean squared error (MSE) as the loss function for training, but in the process of speech enhancement, the model is often tested by evaluating the intelligibility and speech quality of the enhanced speech. Whether the performance is good or bad, this inconsistency between the loss function and the evaluation metric does not guarantee the optimal model.

为了从幅度值的角度来计算损失函数，我们使用RMSE(Root Mean Square Error，均方根误差)；STOI用于评估语音的可理解性，其输入是纯净语音X和退化语音

其首先去除对语音可懂度无贡献的无声区域，然后用STFT将时域信号编程时频域，通过将两个信号分割为50％重叠的带汉宁窗的帧。接着进行1/3倍频带分析，划分共15个1/3倍频带，其中频带中心频率范围为4.3kHz至150Hz。纯净语音的短时时间包络x_j，m可表示如下：In order to calculate the loss function from the perspective of magnitude value, we use RMSE (Root Mean Square Error, root mean square error); STOI is used to evaluate the intelligibility of speech, and its input is pure speech X and degenerate speech

It first removes the silent regions that do not contribute to speech intelligibility, and then uses STFT to program the time-domain signal into the time-frequency domain by dividing the two signals into 50% overlapping frames with Hanning windows. Then carry out 1/3 octave band analysis, and divide a total of 15 1/3 octave frequency bands, in which the center frequency of the frequency band ranges from 4.3kHz to 150Hz. The short-time envelope x _j,m of pure speech can be expressed as follows:

其中X∈R为由x得到的1/3倍频带，M是一段语音的总帧数，m为帧的索引，j是1/3倍频带的索引，L对应语音的长度；之后对语音的归一化与裁剪，归一化用来补偿全局差异，这种差异不应该对语音的可懂度产生影响；裁剪确保了严重退化语音上STOI评估的上界。退化语音的归一化和裁剪时间包络表示为

可懂度表示为两个时间包络之间的相关系数：where X∈R is the 1/3 octave band obtained by x, M is the total number of frames of a speech, m is the index of the frame, j is the index of the 1/3 octave band, and L corresponds to the length of the speech; Normalization and clipping. Normalization is used to compensate for global differences that should not have an impact on speech intelligibility; clipping ensures an upper bound for STOI evaluation on severely degraded speech. The normalized and cropped temporal envelope of degenerate speech is expressed as

其中，||·||₂为L2范数，μ(·)表示相应样本的均值向量。计算所有波段和帧的可懂度的平均值，可以得到退化语音的STOI计算指标：where ||·|| ₂ is the L2 norm, and μ(·) represents the mean vector of the corresponding sample. Calculate the average of intelligibility of all bands and frames, and the STOI calculation index of degenerate speech can be obtained:

将增强语音

带入到STOI计算公式中，即可得到训练过程中的STOI计算指标：will enhance the voice

其中，d_j，m表示为增强语音与纯净语音时间包络的相关系数.Among them, d _{j, m} represents the correlation coefficient between the enhanced speech and the pure speech temporal envelope.

另一方面，SDR是增强语音

中干净分量

与其他分量的能量比值，干净分量

是x在

上的投影：On the other hand, SDR is Enhanced Speech

medium clean

Energy ratio to other components, clean component

is x in

Projection on :

SDR定义为：SDR is defined as:

以上两式结合可以得到：Combining the above two formulas can get:

对SDR优化目标进行等价变换以简化计算：Equivalent transformation of the SDR optimization objective to simplify the calculation:

最后我们将这两个metrics与RMSE结合，组成一个loss function:Finally, we combine these two metrics with RMSE to form a loss function:

式中，α,β,γ对应loss function中的不同部分的系数。In the formula, α, β, γ correspond to the coefficients of different parts in the loss function.

试验例Test example

实验中使用的语音数据来自于TIMIT数据集，噪声数据集采用ESC-50作为训练集，为了验证本文提出模型的泛化性能，我们也将Noisex92噪声数据集用于测试。The speech data used in the experiment comes from the TIMIT data set, and the noise data set adopts ESC-50 as the training set. In order to verify the generalization performance of the model proposed in this paper, we also use the Noisex92 noise data set for testing.

本实验中，TIMIT数据集总共包含6300条语音，由630人每人录制10个句子得到，男女比率为7:3。其中，每人录制的句子中有7个是重复的，为了去除重复句子对模型训练与测试的影响，本实验只取句子均不相同的1890条语音。将其中约80％的语音作为训练集，另外20％作为测试语音，且男女比例与TIMIT总体分布相同。ESC-50数据集包含2000条带标签的环境录音集合，共分为5个主要类别：动物、自然音景与水声、非语音人声、室内声音、城区声音。所有的语音被重采样到16kHz，将所有的语音被截取为2秒的长度。Adam优化器被用于基于随机梯度下降(SGD)的优化。学习率设定为等于1×10^-4的恒定值。In this experiment, the TIMIT dataset contains a total of 6300 speeches, which were obtained by recording 10 sentences for each of 630 people, and the ratio of male to female is 7:3. Among them, 7 sentences recorded by each person are repeated. In order to remove the influence of repeated sentences on model training and testing, only 1890 speeches with different sentences were selected in this experiment. About 80% of the speech is used as the training set, and the other 20% is used as the test speech, and the ratio of male to female is the same as the overall distribution of TIMIT. The ESC-50 dataset contains a collection of 2000 labeled environmental recordings divided into 5 main categories: animals, natural soundscapes and water sounds, non-speech vocals, indoor sounds, and urban sounds. All speech was resampled to 16kHz, and all speech was truncated to 2 seconds length. The Adam optimizer is used for stochastic gradient descent (SGD) based optimization. The learning rate is set to a constant value equal to 1×10 ⁻⁴ .

对于基线模型，选取几种典型的编码器-解码器解决方案和本发明提出的方法做对比，包括基于频谱映射的以及端到端的方法，另外我们还将带噪语音作为一个基线对比：(a)带噪语音，(b)AET,(c)CED,(d)R-CED,(e)noSMB-SE，(f)SMB-SE。其中AET是端到端语音增强架构，CED和R-CED是卷积神经网络时频域的语音增强方法，noSMB-SE是我们提出的无SMB版本的基础框架，它只是单纯地将低层级信息连接到高层级，而SMB-SE则是以noSMB-SE为基础添加了4个SMB。For the baseline model, several typical encoder-decoder solutions are selected for comparison with the method proposed in the present invention, including spectral mapping-based and end-to-end methods, and we also use noisy speech as a baseline comparison: (a ) noisy speech, (b) AET, (c) CED, (d) R-CED, (e) noSMB-SE, (f) SMB-SE. Among them, AET is an end-to-end speech enhancement architecture, CED and R-CED are speech enhancement methods in the time-frequency domain of convolutional neural networks, and noSMB-SE is the basic framework of our proposed SMB-free version, which simply converts low-level information Connect to the high tier, while SMB-SE adds 4 SMBs based on noSMB-SE.

所有的模型都是在0dB的SNR条件下训练的，并在-15dB，-10dB，-5dB，0dB和5dB的信噪比下评估性能，为了评估所提出的框架的泛化性能，我们还在Noisex-92噪声数据集上测试了所提出的框架。All models are trained at 0dB SNR and evaluate performance at -15dB, -10dB, -5dB, 0dB and 5dB SNR. To evaluate the generalization performance of the proposed framework, we also The proposed framework is tested on the Noisex-92 noise dataset.

TABLE ITABLE I

在可见噪声条件下的测试结果.粗体为最佳性能Test results under visible noise conditions. Bold is best performance

TABLE IITABLE II

在不可见噪声条件下的测试结果.粗体为最佳性能Test results under invisible noise conditions. Bold is best performance

本发明提出了一个使用堆叠多尺度模块的端到端语音增强框架，首先将原始时域波形编码为二维特征表示，然后使用语音增强模块学习带噪语音到干净语音的映射关系，最后再通过解码合成时域语音信号。本发明提出的端到端框架可以有效提取时域信号的特征信息，SMB模块的应用帮助该模型挖掘更多的信息，STOI、SDR以及RMSE的整合可有效的提升该模型的整体增强性能。该框架展现了低SNR条件下的抗噪性和未知噪声环境下良好的泛化性。The present invention proposes an end-to-end speech enhancement framework using stacked multi-scale modules. First, the original time domain waveform is encoded into a two-dimensional feature representation, and then the speech enhancement module is used to learn the mapping relationship from noisy speech to clean speech. Decode the synthesized time-domain speech signal. The end-to-end framework proposed by the present invention can effectively extract the characteristic information of the time domain signal, the application of the SMB module helps the model to mine more information, and the integration of STOI, SDR and RMSE can effectively improve the overall enhancement performance of the model. The framework exhibits noise immunity under low SNR conditions and good generalization in unknown noise environments.

以上仅为本发明的较佳实施例而已，并不用以限制本发明，凡在本发明的精神和原则之内所作的任何修改、等同替换和改进等，均应包含在本发明的保护范围之内。The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention shall be included in the protection scope of the present invention. Inside.

Claims

1. an end-to-end speech enhancement method using stacked multi-scale modules, is characterized in that, comprises the following steps:

S1: Build a cascaded end-to-end speech enhancement framework and stitch stacked multi-scale modules into the network structure;

S2: In the preprocessing stage, transform the time domain signal into two-dimensional features;

S3: Use the speech enhancement module to enhance the two-dimensional features;

S4: In the post-processing stage, the enhanced feature representation is transformed into a one-dimensional time domain signal by decoding synthesis.

2. speech enhancement method according to claim 1 is characterized in that: described cascading end-to-end speech enhancement framework comprises speech time domain signal preprocessing, speech enhancement module and target speech synthesis post-processing; Concrete steps comprise:

a. In the time-domain signal preprocessing stage, one-dimensional convolution is used to perform the convolution operation on the input speech segment, and the results of each convolution check on the noisy speech y are stacked row by row to form a two-dimensional The real-valued feature Y is inspired by the processing method of the convolutional neural network on the pixel value of the image, and the two-dimensional feature is separated to obtain the absolute value feature and the sgn mask;

b. The absolute value feature of the noisy speech y is input to the speech enhancement module for enhancement, and the estimation of the absolute value feature is obtained

c. After transposed convolution, the

Transform to time domain signal

3. The speech enhancement method according to claim 1, wherein the multi-scale module comprises an average pooling layer, the convolution kernels are 1×1 and 3×3 convolutions, and dilated volumes with different dilation rates product.

4 . The speech enhancement method according to claim 1 , further comprising the steps of: incorporating the evaluation indicators STOI and SDR of speech enhancement into the loss function using a training strategy of multi-objective joint optimization. 5 .

5. speech enhancement method according to claim 4, is characterized in that, the concrete step of incorporating STOI index into the loss function comprises:

1) STOI input is pure speech x and degenerate speech

2) Carry out 1/3 octave frequency band analysis, divide a total of 15 1/3 octave frequency bands, in which the center frequency range of the frequency band is 4.3kHz to 150Hz, and the short-time envelope x _{j, m} of pure speech is expressed as follows:

[X _j (m-L+1), X _j (m-L+2), ...X _j (m)] ^T

where X∈R is the 1/3 octave band obtained by x, M is the total number of frames of a speech, m is the index of the frame, j is the index of the 1/3 octave band, and L corresponds to the length of the speech;

3) Normalize and crop the speech to obtain the envelope representation of the degraded speech

where ||·|| ₂ is the L2 norm, and μ(·) represents the mean vector of the corresponding sample.

4) Calculate the average of the intelligibility of all bands and frames, and the STOI calculation index of the degraded speech can be obtained:

5) Will enhance the voice

Among them, d _{j, m} represents the correlation coefficient between the enhanced speech and the pure speech temporal envelope.

6. speech enhancement method according to claim 4, is characterized in that, the concrete step that SDR index is incorporated in loss function comprises:

1) The input of SDR is pure voice x and enhanced voice

The SDR calculation process for enhanced speech is as follows:

2) Equivalently transform the SDR optimization objective to simplify the calculation to obtain:

7. speech enhancement method according to claim 4 is characterized in that, STOI and SDR evaluation index are fused into loss function, and concrete steps comprise:

1) Calculate the conventional root mean square error, the process is as follows:

Where M and N are the number of sampling points of each voice and the total number of voices;

2) Combine the root mean square error with the loss function of the evaluation index based on STOI and SDR:

In the formula, α, β, γ correspond to the coefficients of different parts in the loss function.