CN116013344A - Speech enhancement method under multiple noise environments - Google Patents
Speech enhancement method under multiple noise environments
- Publication number
- CN116013344A (application CN202211637892.7A)
- Authority
- CN
- China
- Prior art keywords
- module
- features
- feature
- audio
- term
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Complex Calculations (AREA)
Abstract
The invention discloses a speech enhancement method under multiple noise environments, which comprises the following steps: 1) completing audio preprocessing and data enhancement operations; 2) extracting multi-level audio features with a multi-scale encoder based on the Transformer architecture, and strengthening key features by means of a feature lifting module; 3) capturing long-term and short-term features along different dimensions by means of a long-short-term perception module based on a dual-path architecture; 4) obtaining a clean speech signal using a residual decoder and a mask estimation module; 5) jointly training the network model with a mean square error loss term and a signal-to-noise ratio loss term. The method is robust and runs in real time, and can effectively handle ten common noises such as car horns, crowd noise, applause and bird song, thereby improving the user experience of applications such as short video, live streaming, video conferencing and voice calls. Compared with some mainstream speech enhancement models, the method achieves an average improvement of about 16% on the relevant evaluation indices.
Description
Technical Field
The invention belongs to the technical field of speech noise reduction, and in particular relates to a speech enhancement method under multiple noise environments.
Background
Whether in short video or live streaming, one major problem arises: the surrounding background noise is captured together with the speaker's voice, which greatly degrades the audience's experience. In addition, different creators record in different environments, so the kinds of noise are also diverse, for example: car horns, music in public squares, children crying, construction machinery, crowd noise and so on. The interference of the surrounding environment and the complexity and variability of application scenarios require robust speech enhancement techniques to process noisy audio.
Of course, the application of speech enhancement techniques is not limited to short video or live streaming; it can also serve a variety of downstream speech-related tasks, including intelligent voice interaction, speech emotion analysis and intelligent voice input. In the field of intelligent voice interaction, smart speakers are a common example; in the field of intelligent voice input, voice input methods are common. Taking the smart home as an example, a user can issue commands by voice, truly freeing both hands and avoiding direct contact with the device. Although voice-based intelligent interaction is becoming a mainstream mode of human-computer interaction, it still cannot completely replace keyboard or touch-screen input in daily life because of the complex noise environments users are in. It is therefore important to obtain clean speech from noisy mixed audio in real time by means of speech enhancement techniques.
At present, speech enhancement algorithms fall mainly into the following categories: harmonic enhancement methods, which are only suitable for removing stationary white noise and cannot accurately estimate the pitch period of speech; spectral subtraction, which is more effective for wideband noise but may leave residual noise components in the enhanced result; Wiener filtering, whose residual noise resembles white noise rather than musical noise; enhancement methods based on speech model parameters, which perform poorly at low signal-to-noise ratios and often require multiple iterations; signal-subspace methods, whose computational cost is large and hard to reconcile with real-time requirements; wavelet-transform-based enhancement methods, whose denoising capability for non-stationary noise is poor; and deep-learning-based methods, which estimate the clean speech signal directly in a data-driven manner and offer strong robustness and real-time performance. Deep-learning-based approaches have performance advantages that conventional approaches cannot match, and have therefore become the dominant approach to speech enhancement.
However, current deep-learning methods for speech enhancement still suffer from poor noise removal and weak robustness, because they cannot effectively capture long-term features, short-term features and the key features relevant to enhancement.
Disclosure of Invention
In order to overcome the shortcomings of the prior art, the present invention aims to provide a speech enhancement method under multiple noise environments, so as to remove noise from speech more effectively while offering stronger robustness and real-time performance.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a method for speech enhancement in a multiple noise environment, comprising the steps of:
step 1: performing preprocessing and data enhancement operations on the acquired audio data, and feeding the processed audio data into a long-short-term perception enhancement model; the long-short-term perception enhancement model comprises: a multi-scale encoder, a long-short-term perception module and a residual decoder;
step 2: extracting deep audio features of the processed audio data by using the multi-scale encoder;
step 3: capturing features along different dimensions with the long-short-term perception module;
step 4: reconstructing the speech signal with the residual decoder, estimating the mask of the clean speech with a mask estimation module, and multiplying the mask with the original input audio to obtain the enhanced clean speech.
In one embodiment, the preprocessing operation includes one or more of the following operations: resampling the audio, clipping the audio to a fixed length, and compressing the audio channels;
the data enhancement operation includes one or more of the following operations: mixing in noise audio at a random signal-to-noise ratio, randomly changing the audio volume, and randomly adding a reverberation effect.
In one embodiment, the multi-scale encoder is based on the Transformer architecture, is formed by stacking a plurality of feature capture modules, and down-samples the features by means of a pooling operation; each feature capture module comprises: a feature lifting module, a normalization layer and a feedforward neural network;
the feature lifting module is used for capturing key audio features and the relationships between features in the global scope: it obtains attention weights with a convolution layer, a fully connected layer and a Sigmoid function, enhances key features by element-wise matrix multiplication, and captures the relationships between features in the global scope with a multi-head attention mechanism; the normalization layer performs the normalization operation; the feedforward neural network captures long-term and short-term features with a bidirectional gated recurrent unit and extracts deep features in combination with a fully connected layer;
wherein different feature capture modules use different dilated convolution operations to capture features of different scales.
In one embodiment, the calculation formulas of the feature capture module are as follows:

X_mid^e = LayerNorm(FBM(X_in^e) + X_in^e)
X_out^e = LayerNorm(FNN(X_mid^e) + X_mid^e)

where X_in^e, X_mid^e and X_out^e are the input features, intermediate process features and output features of the feature capture module respectively, LayerNorm(·) is the layer normalization operation, FBM(·) is the feature lifting module operation, and FNN(·) is the feedforward neural network;
the calculation formulas of the feature lifting module are as follows:

X_mid^b = X_in^b ⊗ σ(FC(R(C_1D(X_in^b))))
X_out^b = MAM(X_mid^b) ⊕ X_mid^b

where X_in^b, X_mid^b and X_out^b are the input features, intermediate features and output features of the feature lifting module respectively; C_1D(·), FC(·) and R(·) are the one-dimensional convolution, fully connected layer and channel-adjustment operations respectively; ⊗ and ⊕ denote element-wise multiplication and addition of matrices respectively; σ denotes the Sigmoid activation function; MAM(·) denotes the multi-head attention mechanism operation.
In one embodiment, the multi-head attention mechanism first applies learnable linear transformations to the input features X to obtain the queries Q_i, keys K_i and values V_i, with the following calculation formulas:

Q_i = X W_i^Q, K_i = X W_i^K, V_i = X W_i^V

where W_i^Q, W_i^K and W_i^V are the weights of the respective fully connected layers;
secondly, the similarity between the queries and the keys is computed as a dot product and divided by a scaling factor;
then, a Softmax activation function is applied to obtain the weight corresponding to each value, and the weights are multiplied by the corresponding values:

head_i = Softmax(Q_i K_i^T / √d) V_i

finally, the results of all heads are concatenated and a linear projection is applied again to obtain the final output;
the specific calculation formula of the multi-head attention mechanism is as follows:

MAM(Q, K, V) = Concat(head_1, …, head_h) W_mh

where W_mh is a linear transformation matrix, h is the number of parallel attention layers, and d is the scaling factor;
the output of the multi-head attention mechanism is used as the input of the feedforward neural network, so that the final output characteristic is obtained;
the feedforward neural network comprises a gated recurrent unit, an activation function and a fully connected layer, with the following calculation formula:

FNN(X) = δ(W_fc · BiGRU(X) + b_fc)

where W_fc and b_fc denote the weights and the corresponding bias of the fully connected layer, δ denotes the ReLU activation function, and BiGRU(·) denotes the bidirectional gated recurrent unit; the gated recurrent unit comprises an update gate and a reset gate, with the following calculation formulas:

z_t = σ(W_z · [h_{t−1}, x_t])
r_t = σ(W_r · [h_{t−1}, x_t])
h̃_t = γ(W · [r_t ⊙ h_{t−1}, x_t])
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

where σ and γ denote the Sigmoid and Tanh activation functions respectively, and x_t, h_{t−1} and h_t are the feature input at the current time step, the hidden state at the previous time step and the hidden state at the current time step respectively.
In one embodiment, the long-short-term perception module adopts a dual-path architecture and comprises a gated recurrent unit, a one-dimensional convolution module, an instant layer normalization module and a channel-adjustment module; the gated recurrent unit captures the long-term and short-term features, the one-dimensional convolution module extracts deep features, and the instant layer normalization module performs feature normalization.
In one embodiment, the calculation formulas of the long-short-term perception module are as follows:

X_mid^l = iLN(C_1D(GRU(X_in^l))) ⊕ X_in^l
X_out^l = R(iLN(C_1D(GRU(R(X_mid^l))))) ⊕ X_mid^l

where GRU(·) is the gated recurrent unit, C_1D(·) is the one-dimensional convolution operation, iLN(·) is the instant layer normalization operation, R(·) is the channel-adjustment operation, and X_in^l, X_mid^l and X_out^l are the input features, intermediate features and output features of the long-short-term perception module respectively;
the calculation formula of the instant layer normalization module is as follows:

iLN(X_tf) = ε ⊗ (X_tf − μ(X_tf)) / √(σ²(X_tf) + λ) ⊕ β

where X_tf is the input feature, N and K are the dimensions of the feature, μ(·) and σ²(·) are the mean and variance operations respectively, ε and β are learnable parameters, and λ is a regularization parameter.
In one embodiment, the residual decoder comprises a plurality of decoding units, each decoding unit comprising a one-dimensional deconvolution module, a normalization module and an activation function; the input of each decoding unit is the output X_out^{d−1} of the previous decoding unit and the output X_out^e of the feature capture module at the same level, and the calculation formula is as follows:

X_out^d = θ(B(TC_1D(X_out^{d−1} ⊕ X_out^e)))

where TC_1D(·) is the one-dimensional deconvolution operation, B(·) is the batch normalization operation, θ is the PReLU activation function, and X_out^d is the output feature of the current decoding unit; the output of the decoder is the reconstructed speech signal.
In one embodiment, the mask estimation module consists of a one-dimensional convolution module and several different activation functions, with the following calculation formulas:

X_mid^m = δ(C_1D(X_in^m))
M_out = γ(C_1D(X_mid^m)) ⊗ σ(C_1D(X_mid^m))

where X_in^m, X_mid^m and M_out are the input features, intermediate process features and output mask of the mask estimation module respectively, and γ, δ and σ are the Tanh, ReLU and Sigmoid activation functions respectively;
the output of the mask estimation module is multiplied by the originally input speech signal to obtain the clean speech signal estimated by the model, with the following calculation formula:

X_est = M_out ⊗ X_in

where X_in is the originally input audio signal and X_est is the clean speech estimated by the model.
The long-short-term perception enhancement model is trained with a joint loss function consisting of a mean square error loss term and a signal-to-noise ratio loss term; the signal-to-noise ratio loss term realizes optimization on the speech waveform, and the mean square error loss term realizes optimization on the speech spectrogram; the mean square error loss term is taken in logarithmic form to ensure that it is of the same order of magnitude as the signal-to-noise ratio loss term.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a real-time speech noise reduction method based on a long-short-term perception enhancement model built with deep learning, which has a small number of parameters, strong robustness and high real-time performance, and can be applied well to a variety of noise scenarios.
(2) The invention provides an encoder based on the Transformer architecture, which introduces an attention mechanism and a gated recurrent unit and helps to capture key features and to model long-term and short-term feature dependencies.
(3) The invention provides a feature lifting module based on an attention mechanism, which can effectively capture the relationships between audio features over different ranges and thereby strengthen the key audio features.
(4) The invention provides a long-short-term perception module based on a dual-path architecture, which can extract long-term and short-term features along different dimensions and thereby provide more discriminative features for speech enhancement.
Drawings
FIG. 1 is a flow chart of a real-time speech enhancement method according to the present invention.
FIG. 2 is a diagram of a long and short term perception enhancement model framework in the present invention.
FIG. 3 is a schematic diagram of a feature capture module of a multi-scale encoder according to the present invention.
FIG. 4 is a diagram of a multi-head attention mechanism according to the present invention.
Fig. 5 is a schematic diagram of a feature lifting module based on an attention mechanism in the present invention.
FIG. 6 is a schematic diagram of the gated recurrent unit in the present invention.
Fig. 7 is a schematic diagram of the long-short-term perception module based on a dual-path architecture in the present invention.
FIG. 8 is a graph showing the effect of speech enhancement under various noise conditions in the present invention.
Detailed Description
The following describes in detail, with reference to the drawings and embodiments, how the technical means are applied to solve the technical problems and achieve the technical effects. It should be understood that the detailed description is merely illustrative of the invention and is not intended to limit its scope. In addition, as long as no conflict arises, the embodiments of the present invention and the features of each embodiment may be combined with one another, and the resulting technical solutions all fall within the protection scope of the present invention.
The invention discloses a voice enhancement method under a plurality of noise environments, as shown in fig. 1, comprising the following steps:
step 1: and acquiring audio data, and performing preprocessing operation and data enhancement operation.
Step 1.1: completing an audio preprocessing operation
The deep-learning-based speech enhancement technique is a data-driven supervised learning method that requires the input audio to have a fixed length, so the audio must be divided into fixed-length segments. Considering that different recordings have different sampling rates, they first need to be resampled. The sampling rate of the audio can be adjusted to 16 kHz by means of an audio processing library and stored in WAV format. Since some audio may be multi-channel, a channel compression operation is required to convert it uniformly into single-channel audio data. For ease of computation, a fusion strategy of summing the channels and averaging is adopted directly, with the following calculation formula:

S_mono = (1/K) Σ_{i=1}^{K} S_i

where K is the number of audio channels, S_mono is the processed single-channel audio and S_i is the audio of a particular channel; the multi-channel audio signal is thus compressed into a single-channel audio signal.
Further, assuming the model's input audio length is 4 seconds, the audio must be clipped according to an audio clipping scheme so that each audio clip is 4 seconds long. Since the sampling rate is 16 kHz, each audio clip contains 64000 samples. Assuming the total number of samples in the audio is T, the audio clipping is computed as follows:

S_start = 64000 × (l − 1), S_end = min(64000 × l, T)

where l is a positive integer, and S_start and S_end are the indices of the start sample and the end sample of the l-th clip respectively. When the number of samples in a clipped segment does not reach 64000 but exceeds 50000, the segment is padded to 64000 samples by linear interpolation. When the number of samples in a clipped segment is below 50000, the segment is discarded.
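As an illustration of the preprocessing described above, the following sketch resamples to 16 kHz, averages the channels and cuts 4-second clips. It assumes the torchaudio library is available; the function name and thresholds simply mirror the description and are not a reference implementation.

```python
import torch
import torchaudio

TARGET_SR = 16000          # target sampling rate (16 kHz)
SEGMENT = 64000            # 4 s at 16 kHz
MIN_KEEP = 50000           # shorter remainders are discarded

def preprocess(path):
    wav, sr = torchaudio.load(path)                      # (channels, samples)
    if sr != TARGET_SR:                                  # resample to 16 kHz
        wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
    mono = wav.mean(dim=0)                               # channel compression by averaging
    segments = []
    for start in range(0, mono.numel(), SEGMENT):        # cut into 4-second clips
        clip = mono[start:start + SEGMENT]
        if clip.numel() == SEGMENT:
            segments.append(clip)
        elif clip.numel() > MIN_KEEP:                     # pad to 64000 samples by linear interpolation
            clip = torch.nn.functional.interpolate(
                clip.view(1, 1, -1), size=SEGMENT, mode="linear",
                align_corners=False).view(-1)
            segments.append(clip)
        # remainders below MIN_KEEP samples are dropped
    return segments
```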
Step 1.2: completing data enhancement operations for audio
Considering the complexity and variability of the model's application scenarios, the robustness of the model needs to be improved with data enhancement techniques. To increase the complexity of the audio, three audio data enhancement methods are introduced here: mixing in noise audio at a random signal-to-noise ratio, randomly changing the audio volume, and randomly adding a reverberation effect.
The random noise-mixing operation introduces additional background noise data and mixes it into the input audio according to a random signal-to-noise ratio. By way of example, a variety of common noises can be selected, such as electric drills, car horns, crowd noise, barking, applause, bird song, gunshots, frog calls, machinery and music. The specific flow of this data enhancement is to first draw a signal-to-noise ratio uniformly at random from the range [-15, 15], scale the original speech according to the drawn signal-to-noise ratio, and add the noise audio to the scaled result, thereby obtaining the noisy mixed audio.
The random volume-change operation amplifies or attenuates the volume of the input audio by a random scaling factor: a scaling factor is drawn uniformly at random from the range [0, 2] and multiplied with the original audio to obtain audio whose volume has been randomly adjusted.
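A minimal sketch of the first two data enhancement operations is given below. The noise-scaling formula used to reach the drawn signal-to-noise ratio is a common convention and an assumption here, not a quotation of the patented procedure; function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

def mix_at_random_snr(speech, noise, low=-15.0, high=15.0):
    """Mix noise into speech at an SNR drawn uniformly from [low, high] dB."""
    snr_db = rng.uniform(low, high)
    if len(noise) < len(speech):                          # tile noise to cover the speech length
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))  # scale noise to the target SNR
    return speech + scale * noise

def random_volume(audio, low=0.0, high=2.0):
    """Scale the volume by a factor drawn uniformly from [low, high]."""
    return rng.uniform(low, high) * audio
```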
The flow of randomly adding a reverberation effect includes the following aspects: creating a room (defining the room size, the desired reverberation time, the wall materials and the maximum number of reflections allowed), creating a signal source in the room, placing a microphone in the room, creating the room impulse response, simulating sound propagation, and synthesizing the reverberation effect. In this embodiment, the reverberation effect can be added to the speech data directly by means of the pyroomacoustics library.
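The room simulation itself can be sketched with the pyroomacoustics library mentioned above; the room geometry, absorption value and source/microphone positions below are placeholder assumptions.

```python
import numpy as np
import pyroomacoustics as pra

def add_reverb(speech, fs=16000):
    """Convolve speech with a simulated room impulse response (illustrative room geometry)."""
    room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs,
                       materials=pra.Material(0.3),   # wall absorption (assumed value)
                       max_order=10)                  # maximum number of reflections
    room.add_source([2.0, 2.5, 1.5], signal=speech)   # place the signal source
    mic = pra.MicrophoneArray(np.array([[3.5], [2.0], [1.2]]), fs)
    room.add_microphone_array(mic)                    # place a microphone
    room.simulate()                                   # build the RIR and propagate the sound
    wet = room.mic_array.signals[0]
    return wet[:len(speech)]
```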
Step 2: deep audio features are extracted by means of a multi-scale encoder.
The invention designs an efficient long-short-term perception enhancement model by means of deep learning, and feeds the audio processed in step 1 into this model, thereby realizing real-time speech enhancement under various kinds of noise. Fig. 2 shows the overall architecture of the model, which mainly comprises a multi-scale encoder, a long-short-term perception module and a residual decoder. The multi-scale encoder is mainly used to compress the audio features and extract deep features, and the residual decoder is mainly used to reconstruct the audio signal. In this embodiment, the multi-scale encoder is based on the Transformer architecture and is mainly composed of a stack of feature capture modules, five in this embodiment. Each feature capture module in turn comprises: a feature lifting module, a normalization layer and a feedforward neural network.
Fig. 3 shows the details of a feature capture module in the Transformer-based multi-scale encoder; its specific calculation formulas are as follows:

X_mid^e = LayerNorm(FBM(X_in^e) + X_in^e)
X_out^e = LayerNorm(FNN(X_mid^e) + X_mid^e)

where X_in^e, X_mid^e and X_out^e are the input features, intermediate process features and output features of the feature capture module respectively, LayerNorm(·) is the layer normalization operation, FBM(·) is the feature lifting module operation, and FNN(·) is the feedforward neural network. In addition, the feature capture module introduces residual connections to preserve the original features, and uses the attention-based feature lifting module to capture and strengthen key features. Fig. 4 shows the details of the multi-head attention mechanism used in this module. For the overall flow of the feature capture module, in order to effectively capture key audio features and model long-term and short-term feature dependencies, the acquired features X_in^e are first fed into the feature lifting module, which captures the key long-term and short-term features based on the attention mechanism; feature normalization is then realized by a layer normalization operation, deep features are captured by the feedforward neural network, and the result is finally processed by another layer normalization operation. Furthermore, feature down-sampling between different feature capture modules is realized by a max-pooling operation, and different feature capture modules use different dilated convolution operations to capture features of different scales.
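A minimal PyTorch-style sketch of one feature capture module is given below; the layer sizes are assumptions, and nn.MultiheadAttention stands in for the feature lifting module, which is sketched separately further on.

```python
import torch
import torch.nn as nn

class FeatureCaptureBlock(nn.Module):
    """One encoder block: feature lifting (attention) -> LayerNorm -> feed-forward -> LayerNorm,
    with residual connections around both sub-blocks (names and sizes are illustrative)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.boost = nn.MultiheadAttention(dim, heads, batch_first=True)  # stand-in for the FBM
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(), nn.Linear(dim * 2, dim))
        self.norm2 = nn.LayerNorm(dim)
        self.pool = nn.MaxPool1d(2)            # down-sampling between blocks

    def forward(self, x):                      # x: (batch, frames, dim)
        a, _ = self.boost(x, x, x)
        x = self.norm1(a + x)                  # residual + layer norm
        x = self.norm2(self.ffn(x) + x)        # residual + layer norm
        return self.pool(x.transpose(1, 2)).transpose(1, 2)   # halve the frame axis
```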
The feature lifting module is a core component of the feature capture module; it captures the key audio features and the relationships between features in the global scope, i.e., it effectively captures and enhances the important features. Fig. 5 shows the detailed architecture of this module. The module obtains attention weights by means of a convolution layer, a fully connected layer and a Sigmoid function, and enhances key features by element-wise matrix multiplication. At the same time, the relationships between features over a larger range can be captured by means of a multi-head attention mechanism, so that harmonic artifacts are suppressed as much as possible. The specific calculation formulas are as follows:

X_mid^b = X_in^b ⊗ σ(FC(R(C_1D(X_in^b))))
X_out^b = MAM(X_mid^b) ⊕ X_mid^b

where X_in^b, X_mid^b and X_out^b are the input features, intermediate features and output features of the feature lifting module respectively; C_1D(·), FC(·) and R(·) are the one-dimensional convolution, fully connected layer and channel-adjustment operations respectively; ⊗ and ⊕ denote element-wise multiplication and addition of matrices respectively; σ denotes the Sigmoid activation function, which facilitates computing the weights corresponding to the key features; and MAM(·) denotes the multi-head attention mechanism operation. The compression of the feature channels is realized with a one-dimensional convolution whose kernel size is 1, the corresponding weight matrix is obtained by means of the fully connected layer and the Sigmoid activation function, and the key features are finally strengthened by element-wise matrix multiplication. For the multi-head attention mechanism, learnable linear transformations are first applied to the input features X to obtain the queries Q_i, keys K_i and values V_i, with the following calculation formulas:

Q_i = X W_i^Q, K_i = X W_i^K, V_i = X W_i^V
where W_i^Q, W_i^K and W_i^V are the weights of the respective fully connected layers. Next, the similarity between the queries and the keys is computed as a dot product and divided by a scaling factor. Then, a Softmax activation function is applied to obtain the weight corresponding to each value, and the weights are multiplied by the corresponding values:

head_i = Softmax(Q_i K_i^T / √d) V_i

Finally, the results of all heads are concatenated and a linear projection is applied again to obtain the final output. The specific calculation formula of the multi-head attention mechanism is as follows:

MAM(Q, K, V) = Concat(head_1, …, head_h) W_mh

where W_mh is a linear transformation matrix, h is the number of parallel attention layers and d is the scaling factor. The output of the multi-head attention module serves as the input of the feedforward neural network to obtain the final output features. Residual connections and layer normalization operations are also introduced in this module to further improve feature extraction.
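The attention-based feature lifting module can be sketched as follows; the exact layer sizes and the placement of the residual connection are assumptions based on the description above.

```python
import torch
import torch.nn as nn

class FeatureLiftingModule(nn.Module):
    """Attention-weighted feature boosting followed by multi-head self-attention.
    Only the overall flow follows the description; layer sizes are assumptions."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.squeeze = nn.Conv1d(dim, dim, kernel_size=1)   # 1-D convolution with kernel size 1
        self.fc = nn.Linear(dim, dim)                        # fully connected layer
        self.mam = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                                    # x: (batch, frames, dim)
        w = self.squeeze(x.transpose(1, 2)).transpose(1, 2)  # channel interaction
        w = torch.sigmoid(self.fc(w))                        # attention weights in (0, 1)
        boosted = x * w                                      # element-wise key-feature boosting
        attn, _ = self.mam(boosted, boosted, boosted)        # global relationships between features
        return self.norm(attn + boosted)                     # residual connection + layer norm
```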
The feedforward neural network mainly comprises a gated recurrent unit, an activation function and a fully connected layer: the capture of long-term and short-term features is mainly realized by the bidirectional gated recurrent unit, and the extraction of deep features is realized in combination with the fully connected layer, with the following calculation formula:

FNN(X) = δ(W_fc · BiGRU(X) + b_fc)

where W_fc and b_fc denote the weights and the corresponding bias of the fully connected layer, δ denotes the ReLU activation function, and BiGRU(·) denotes the bidirectional gated recurrent unit. The bidirectional gated recurrent unit is used here to capture the audio features: it not only captures long-term and short-term features effectively, but also avoids the heavy computation of an LSTM. Moreover, this approach tends to give more satisfactory results than using a fully connected layer alone, and compared with a one-dimensional convolution, the bidirectional gated recurrent unit can perceive relationships between more distant features and automatically attend to the more important ones. Fig. 6 shows the implementation details of the gated recurrent unit, which mainly comprises an update gate and a reset gate, with the following calculation formulas:

z_t = σ(W_z · [h_{t−1}, x_t])
r_t = σ(W_r · [h_{t−1}, x_t])
h̃_t = γ(W · [r_t ⊙ h_{t−1}, x_t])
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

where σ and γ denote the Sigmoid and Tanh activation functions respectively, and x_t, h_{t−1} and h_t are the feature input at the current time step, the hidden state at the previous time step and the hidden state at the current time step respectively.
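A sketch of the feedforward sub-block with a bidirectional gated recurrent unit follows; the hidden sizes are assumptions.

```python
import torch
import torch.nn as nn

class GRUFeedForward(nn.Module):
    """Feed-forward sub-block: bidirectional GRU -> fully connected layer -> ReLU."""
    def __init__(self, dim=64, hidden=64):
        super().__init__()
        self.bigru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, dim)   # project the two directions back to dim
        self.act = nn.ReLU()

    def forward(self, x):                      # x: (batch, frames, dim)
        h, _ = self.bigru(x)                   # capture long- and short-term dependencies
        return self.act(self.fc(h))
```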
Step 3: Features along different dimensions are captured by means of the long-short-term perception module.
For the speech features extracted by the multi-scale encoder, the relationships between features along different dimensions need to be processed further. The invention therefore designs a long-short-term perception module with a dual-path architecture, which can effectively capture long-term and short-term audio features along different dimensions and thus resolve the long- and short-term dependencies among the features. Fig. 7 shows the detailed architecture of the long-short-term perception module. The module realizes long-term and short-term feature capture in the time dimension and the feature dimension respectively, by means of a gated recurrent unit, a one-dimensional convolution operation, an instant layer normalization operation and a channel-adjustment operation. It is worth noting that the instant layer normalization operation is adopted in this embodiment instead of the conventional layer normalization operation, which reduces the model's sensitivity to the energy of the input signal. Meanwhile, in order to preserve the original features, the module also introduces the idea of residual connections. In both the time dimension and the feature dimension, the gated recurrent unit extracts long-term and short-term features over different ranges, the one-dimensional convolution operation captures deep features, and the instant layer normalization operation then normalizes the features.
The specific calculation formulas of this module are as follows:

X_mid^l = iLN(C_1D(GRU(X_in^l))) ⊕ X_in^l
X_out^l = R(iLN(C_1D(GRU(R(X_mid^l))))) ⊕ X_mid^l

where GRU(·) is the gated recurrent unit, C_1D(·) is the one-dimensional convolution operation, iLN(·) is the instant layer normalization operation, and R(·) is the channel-adjustment operation; X_in^l, X_mid^l and X_out^l are the input features, intermediate features and output features of the module respectively. When the features enter the network, the GRU first captures the long-term and short-term features in the time dimension, the one-dimensional convolution then extracts deeper features, and the instant layer normalization operation performs feature normalization. The GRU is used here because it requires less computational resources and less time than an LSTM while achieving comparable results; it contains only a reset gate and an update gate, which effectively handles long-term memory. The instant layer normalization operation used in this module is computed as follows:

iLN(X_tf) = ε ⊗ (X_tf − μ(X_tf)) / √(σ²(X_tf) + λ) ⊕ β

where X_tf is the input feature, N and K are the dimensions of the feature, μ(·) and σ²(·) are the mean and variance operations respectively, ε and β are learnable parameters, and λ is a regularization parameter. This normalization reduces the model's sensitivity to the energy of the input signal. To capture long-term and short-term features in the feature dimension, the two axes of the feature map are exchanged, the GRU then captures the long-term and short-term features, the one-dimensional convolution extracts deep features, the instant layer normalization operation normalizes them, and the output features are finally obtained through the channel-adjustment operation.
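A dual-path sketch following the flow just described; treating the frame count as fixed and using GroupNorm as a stand-in for instant layer normalization are assumptions, as are all layer sizes.

```python
import torch
import torch.nn as nn

class _PathBlock(nn.Module):
    """GRU -> 1-D conv -> normalization with a residual connection, applied along one axis."""
    def __init__(self, feat):
        super().__init__()
        self.gru = nn.GRU(feat, feat, batch_first=True)
        self.conv = nn.Conv1d(feat, feat, kernel_size=1)
        self.norm = nn.GroupNorm(1, feat)     # stand-in for instant layer normalization

    def forward(self, x):                     # x: (batch, steps, feat)
        h, _ = self.gru(x)
        h = self.norm(self.conv(h.transpose(1, 2))).transpose(1, 2)
        return h + x                          # keep the original features

class LongShortTermPerception(nn.Module):
    """Dual-path module: one pass along the time axis, one along the feature axis."""
    def __init__(self, dim=64, frames=200):
        super().__init__()
        self.time_path = _PathBlock(dim)
        self.feat_path = _PathBlock(frames)

    def forward(self, x):                     # x: (batch, frames, dim)
        x = self.time_path(x)                 # long/short-term features over time
        x = self.feat_path(x.transpose(1, 2)) # exchange axes, process the feature dimension
        return x.transpose(1, 2)
```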
Step 4: the enhanced clean speech is obtained by means of a residual decoder.
In order to obtain clean speech, the speech signal must first be reconstructed by the residual decoder. The residual decoder mainly comprises a plurality of decoding units, five in this embodiment, which progressively realize the estimation used for the spectrogram mask. Each decoding unit mainly consists of a one-dimensional deconvolution operation, a normalization operation and an activation function. Meanwhile, in order to reconstruct the speech signal well, the input of each decoding unit comprises two parts: the output X_out^{d−1} of the previous decoding unit and the output X_out^e of the feature capture module at the same level. The decoding unit uses the one-dimensional deconvolution to realize feature extraction and up-sampling simultaneously, and increases the nonlinear capability of the model by means of the PReLU activation function. The specific calculation formula is as follows:

X_out^d = θ(B(TC_1D(X_out^{d−1} ⊕ X_out^e)))

where TC_1D(·) is the one-dimensional deconvolution operation, which is mainly used to realize feature extraction and up-sampling, B(·) is the batch normalization operation, θ is the PReLU activation function, and X_out^d is the output feature of the current decoding unit; the output of the decoder is the reconstructed speech signal. A mask estimation module is then required to process the reconstructed speech signal output by the decoder and estimate the mask of the clean speech signal, thereby generating the clean speech mask. The mask estimation module consists of a one-dimensional convolution operation and several different activation functions, with the following calculation formulas:

X_mid^m = δ(C_1D(X_in^m))
M_out = γ(C_1D(X_mid^m)) ⊗ σ(C_1D(X_mid^m))

where X_in^m, X_mid^m and M_out are the input features, intermediate process features and output mask of the mask estimation module respectively, and γ, δ and σ are the Tanh, ReLU and Sigmoid activation functions respectively. The clean speech signal estimated by the model is obtained by multiplying the output of the mask estimation module with the originally input speech signal, with the following calculation formula:

X_est = M_out ⊗ X_in

where X_in is the originally input audio signal and X_est is the clean speech estimated by the model.
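A sketch of one decoding unit and a gated mask head; how the skip connection is merged and how the gating is arranged are assumptions, since the text above only lists the operations involved.

```python
import torch
import torch.nn as nn

class DecodingUnit(nn.Module):
    """One residual-decoder unit: 1-D transposed convolution -> batch norm -> PReLU.
    The skip connection from the encoder is merged by addition here (an assumption)."""
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2, stride=2)  # up-sampling
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.PReLU()

    def forward(self, x, skip):                # x, skip: (batch, channels, frames), same shape
        return self.act(self.bn(self.deconv(x + skip)))

class MaskEstimator(nn.Module):
    """Mask head built from 1-D convolutions and Tanh/ReLU/Sigmoid activations;
    the exact gating arrangement is an assumption."""
    def __init__(self, ch=64):
        super().__init__()
        self.pre = nn.Conv1d(ch, ch, kernel_size=1)
        self.gate_a = nn.Conv1d(ch, 1, kernel_size=1)
        self.gate_b = nn.Conv1d(ch, 1, kernel_size=1)

    def forward(self, x, noisy):               # x: decoder output at waveform resolution
        h = torch.relu(self.pre(x))
        mask = torch.tanh(self.gate_a(h)) * torch.sigmoid(self.gate_b(h))
        return mask * noisy                    # element-wise masking of the input audio
```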
The model and its processing flow are as described above; the model further needs to be trained and tested to obtain a model that meets the requirements.
Specifically, to accomplish the supervised training of the model, the invention introduces a joint loss function comprising two parts: a signal-to-noise ratio loss term f(·) and a mean square error loss term MSE(·). The former is mainly used to realize optimization on the speech waveform, and the latter to realize optimization on the speech spectrogram. Furthermore, the mean square error loss term is taken in logarithmic form to ensure that it is of the same order of magnitude as the signal-to-noise ratio loss term.
The specific expression of the loss function is as follows:

L = −f(s, ŝ) + log( MSE(S_r, Ŝ_r) + MSE(S_i, Ŝ_i) + MSE(|S|, |Ŝ|) )

where s and ŝ are the clean audio and the audio estimated by the model respectively, S_r and Ŝ_r are the real parts of the clean spectrogram and of the spectrogram estimated by the model, S_i and Ŝ_i are the corresponding imaginary parts, and |S| and |Ŝ| are the magnitudes of the clean spectrogram and of the spectrogram estimated by the model. The mean square error loss term thus measures the differences in the real part, the imaginary part and the magnitude between the estimated spectrogram and the true spectrogram, and it is taken in logarithmic form so that it stays on the same order of magnitude as the signal-to-noise ratio loss term. The signal-to-noise ratio loss term constrains the amplitude of the output and avoids level offsets between the input and the output; its specific calculation formula is as follows:

f(s, ŝ) = 10 · log10( ||s||² / ||s − ŝ||² )
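A sketch of a joint objective of this shape (waveform SNR term plus log-compressed spectral MSE); the STFT settings and the exact weighting are assumptions.

```python
import torch

def joint_loss(clean, estimate, n_fft=512, hop=128):
    """Joint training objective: waveform SNR term plus log-compressed spectral MSE term."""
    eps = 1e-8
    # signal-to-noise-ratio term on the waveform (maximised, hence the minus sign)
    snr = 10 * torch.log10(clean.pow(2).sum(-1) / ((clean - estimate).pow(2).sum(-1) + eps) + eps)
    snr_loss = -snr.mean()
    # mean-square-error term on real part, imaginary part and magnitude of the spectrogram
    win = torch.hann_window(n_fft, device=clean.device)
    S = torch.stft(clean, n_fft, hop, window=win, return_complex=True)
    E = torch.stft(estimate, n_fft, hop, window=win, return_complex=True)
    mse = ((S.real - E.real) ** 2).mean() + ((S.imag - E.imag) ** 2).mean() \
        + ((S.abs() - E.abs()) ** 2).mean()
    return snr_loss + torch.log(mse + eps)     # log keeps both terms on a similar scale
```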
in order to be able to demonstrate the effectiveness of the method according to the invention, relevant experimental tests have been carried out. A large amount of noise audio frequencies are fused on the basis of the existing pure voice, so that the collected voice under various noises is simulated. The noise categories selected here are: electric drill sound, whistling sound, noise sound, barking sound, drum sound, bird song sound, gun sound, frog song sound, machine sound, music sound. Meanwhile, the effect of voice enhancement is measured by means of three commonly used evaluation indexes of voice enhancement, which are respectively as follows: perceptual speech quality assessment (PESQ), short-term speech intelligibility (STOI), and Source Artifact Ratio (SAR). Wherein, PESQ and STOI are both evaluation methods of perception level, which are both that the larger the value is, the better the effect of speech enhancement is. For STOI, the calculation process mainly comprises three steps: removing the mute frame; 1/3 times of frequency band decomposition of DFT is completed on the signal; the correlation coefficients before and after the enhancement are calculated and averaged. For PESQ, which requires a noisy attenuated signal and an original reference signal, the computation involves preprocessing, time alignment, perceptual filtering, masking effects, etc. It can provide a subjective predictor for objective speech quality assessment and can be mapped to a MOS scale range, with a score range between-0.5 and 4.5. In addition, the evaluation index SAR may be regarded as an evaluation index of a signal level, and the larger the value thereof is, the better the effect of speech enhancement is, and the specific calculation formula is as follows:
in the formula e interf 、e noise And e artif Error signals, s, introduced by interference, noise and artefacts, respectively target Then it is the target signal. Table 1 shows the effect of the present invention on the above evaluation index compared with the main stream method. It was found that it could be improved by about 16% over the mainstream speech enhancement model Demucs on PESQ evaluation index and about 16% over the mainstream speech enhancement model MannerNet on SAR evaluation index. At the same time, an excellent performance of 0.94 can be achieved on the evaluation index STOI. In addition, for ten different noise interference environments, fig. 8 shows a voice effect diagram after noise reduction based on the long-short-period perception enhancement model provided by the invention, which can obtain a satisfactory effect.
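PESQ and STOI can be computed with the third-party pesq and pystoi packages; the snippet below is an evaluation convenience, not part of the patented method.

```python
from pesq import pesq
from pystoi import stoi

def evaluate(clean, enhanced, fs=16000):
    """Perception-level scores for one utterance; both inputs are 1-D numpy arrays at 16 kHz."""
    pesq_score = pesq(fs, clean, enhanced, 'wb')             # wide-band PESQ, roughly -0.5 to 4.5
    stoi_score = stoi(clean, enhanced, fs, extended=False)   # STOI in the range 0 to 1
    return pesq_score, stoi_score
```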
TABLE 1 Comparison of the long-short-term perception enhancement model of the present invention with mainstream speech enhancement models

Model | PESQ | STOI | SAR
---|---|---|---
Demucs | 2.08 | 0.93 | 18.70
MannerNet | 2.22 | 0.94 | 17.41
Long-short-term perception enhancement model | 2.41 | 0.94 | 20.27
Claims (10)
1. A method for speech enhancement in a multiple noise environment, comprising the steps of:
step 1: performing preprocessing and data enhancement operations on the acquired audio data, and feeding the processed audio data into a long-short-term perception enhancement model; the long-short-term perception enhancement model comprises a multi-scale encoder, a long-short-term perception module and a residual decoder;
step 2: extracting deep audio features of the processed audio data by using the multi-scale encoder;
step 3: capturing features along different dimensions with the long-short-term perception module;
step 4: reconstructing the speech signal with the residual decoder, estimating the mask of the clean speech with a mask estimation module, and multiplying the mask with the original input audio to obtain the enhanced clean speech; training of the model is accomplished by means of a joint loss function.
2. The method of claim 1, wherein the preprocessing operation comprises one or more of the following operations: resampling the audio, clipping the audio to a fixed length, and compressing the audio channels;
the data enhancement operation comprises one or more of the following operations: mixing in noise audio at a random signal-to-noise ratio, randomly changing the audio volume, and randomly adding a reverberation effect.
3. The method according to claim 1, wherein the multi-scale encoder is based on the Transformer architecture, is formed by stacking a plurality of feature capture modules, and down-samples the features by means of a pooling operation; each feature capture module comprises: a feature lifting module, a normalization layer and a feedforward neural network;
the feature lifting module is used for capturing key audio features and the relationships between features in the global scope: it obtains attention weights with a convolution layer, a fully connected layer and a Sigmoid function, enhances key features by element-wise matrix multiplication, and captures the relationships between features in the global scope with a multi-head attention mechanism; the normalization layer performs the normalization operation; the feedforward neural network captures long-term and short-term features with a bidirectional gated recurrent unit and extracts deep features in combination with a fully connected layer;
wherein different feature capture modules use different dilated convolution operations to capture features of different scales.
4. The method for speech enhancement in multiple noise environments according to claim 3, wherein the calculation formulas of the feature capture module are as follows:

X_mid^e = LayerNorm(FBM(X_in^e) + X_in^e)
X_out^e = LayerNorm(FNN(X_mid^e) + X_mid^e)

where X_in^e, X_mid^e and X_out^e are the input features, intermediate process features and output features of the feature capture module respectively, LayerNorm(·) is the layer normalization operation, FBM(·) is the feature lifting module operation, and FNN(·) is the feedforward neural network;
the calculation formulas of the feature lifting module are as follows:

X_mid^b = X_in^b ⊗ σ(FC(R(C_1D(X_in^b))))
X_out^b = MAM(X_mid^b) ⊕ X_mid^b

where X_in^b, X_mid^b and X_out^b are the input features, intermediate features and output features of the feature lifting module respectively; C_1D(·), FC(·) and R(·) are the one-dimensional convolution, fully connected layer and channel-adjustment operations respectively; ⊗ and ⊕ denote element-wise multiplication and addition of matrices respectively; σ denotes the Sigmoid activation function; MAM(·) denotes the multi-head attention mechanism operation.
5. The method of claim 4, wherein the multi-head attention mechanism first applies learnable linear transformations to the input features X to obtain the queries Q_i, keys K_i and values V_i, with the following calculation formulas:

Q_i = X W_i^Q, K_i = X W_i^K, V_i = X W_i^V

where W_i^Q, W_i^K and W_i^V are the weights of the respective fully connected layers;
secondly, the similarity between the queries and the keys is computed as a dot product and divided by a scaling factor;
then, a Softmax activation function is applied to obtain the weight corresponding to each value, and the weights are multiplied by the corresponding values:

head_i = Softmax(Q_i K_i^T / √d) V_i

finally, the results of all heads are concatenated and a linear projection is applied again to obtain the final output;
the specific calculation formula of the multi-head attention mechanism is as follows:

MAM(Q, K, V) = Concat(head_1, …, head_h) W_mh

where W_mh is a linear transformation matrix, h is the number of parallel attention layers, and d is the scaling factor;
the output of the multi-head attention mechanism serves as the input of the feedforward neural network, so as to obtain the final output features;
the feedforward neural network comprises a gated recurrent unit, an activation function and a fully connected layer, with the following calculation formula:

FNN(X) = δ(W_fc · BiGRU(X) + b_fc)

where W_fc and b_fc denote the weights and the corresponding bias of the fully connected layer, δ denotes the ReLU activation function, and BiGRU(·) denotes the bidirectional gated recurrent unit; the gated recurrent unit comprises an update gate and a reset gate, with the following calculation formulas:

z_t = σ(W_z · [h_{t−1}, x_t])
r_t = σ(W_r · [h_{t−1}, x_t])
h̃_t = γ(W · [r_t ⊙ h_{t−1}, x_t])
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

where σ and γ denote the Sigmoid and Tanh activation functions respectively, and x_t, h_{t−1} and h_t are the feature input at the current time step, the hidden state at the previous time step and the hidden state at the current time step respectively.
6. The method for speech enhancement in multiple noise environments according to claim 1, wherein the long-short-term perception module adopts a dual-path architecture and comprises a gated recurrent unit, a one-dimensional convolution module, an instant layer normalization module and a channel-adjustment module; the gated recurrent unit captures the long-term and short-term features, the one-dimensional convolution module extracts deep features, and the instant layer normalization module performs feature normalization.
7. The method for speech enhancement in multiple noise environments according to claim 6, wherein the calculation formulas of the long-short-term perception module are as follows:

X_mid^l = iLN(C_1D(GRU(X_in^l))) ⊕ X_in^l
X_out^l = R(iLN(C_1D(GRU(R(X_mid^l))))) ⊕ X_mid^l

where GRU(·) is the gated recurrent unit, C_1D(·) is the one-dimensional convolution operation, iLN(·) is the instant layer normalization operation, R(·) is the channel-adjustment operation, and X_in^l, X_mid^l and X_out^l are the input features, intermediate features and output features of the long-short-term perception module respectively;
the calculation formula of the instant layer normalization module is as follows:

iLN(X_tf) = ε ⊗ (X_tf − μ(X_tf)) / √(σ²(X_tf) + λ) ⊕ β

where X_tf is the input feature, N and K are the dimensions of the feature, μ(·) and σ²(·) are the mean and variance operations respectively, ε and β are learnable parameters, and λ is a regularization parameter.
8. The method of claim 1, wherein the residual decoder comprises a plurality of decoding units, each decoding unit comprising a one-dimensional deconvolution module, a normalization module and an activation function; the input of each decoding unit is the output X_out^{d−1} of the previous decoding unit and the output X_out^e of the feature capture module at the same level, and the calculation formula is as follows:

X_out^d = θ(B(TC_1D(X_out^{d−1} ⊕ X_out^e)))

where TC_1D(·) is the one-dimensional deconvolution operation, B(·) is the batch normalization operation, θ is the PReLU activation function, and X_out^d is the output feature of the current decoding unit; the output of the decoder is the reconstructed speech signal.
9. The method for speech enhancement in multiple noise environments according to claim 1, wherein the mask estimation module consists of a one-dimensional convolution module and several different activation functions, with the following calculation formulas:

X_mid^m = δ(C_1D(X_in^m))
M_out = γ(C_1D(X_mid^m)) ⊗ σ(C_1D(X_mid^m))

where X_in^m, X_mid^m and M_out are the input features, intermediate process features and output mask of the mask estimation module respectively, and γ, δ and σ are the Tanh, ReLU and Sigmoid activation functions respectively;
the output of the mask estimation module is multiplied by the originally input speech signal to obtain the clean speech signal estimated by the model, with the following calculation formula:

X_est = M_out ⊗ X_in

where X_in is the originally input audio signal and X_est is the clean speech estimated by the model.
10. The method for speech enhancement in multiple noise environments according to claim 1, wherein the long-short-term perception enhancement model is trained with a joint loss function consisting of a mean square error loss term and a signal-to-noise ratio loss term; the signal-to-noise ratio loss term realizes optimization on the speech waveform, and the mean square error loss term realizes optimization on the speech spectrogram; the mean square error loss term is taken in logarithmic form to ensure that it is of the same order of magnitude as the signal-to-noise ratio loss term.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211637892.7A CN116013344A (en) | 2022-12-17 | 2022-12-17 | Speech enhancement method under multiple noise environments |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211637892.7A CN116013344A (en) | 2022-12-17 | 2022-12-17 | Speech enhancement method under multiple noise environments |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116013344A true CN116013344A (en) | 2023-04-25 |
Family
ID=86032687
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211637892.7A Pending CN116013344A (en) | 2022-12-17 | 2022-12-17 | Speech enhancement method under multiple noise environments |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116013344A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117351983A (en) * | 2023-12-06 | 2024-01-05 | 深圳市龙芯威半导体科技有限公司 | Transformer-based voice noise reduction method and system |
CN117351983B (en) * | 2023-12-06 | 2024-04-19 | 深圳市龙芯威半导体科技有限公司 | Transformer-based voice noise reduction method and system |
CN117711417A (en) * | 2024-02-05 | 2024-03-15 | 武汉大学 | Voice quality enhancement method and system based on frequency domain self-attention network |
CN117711417B (en) * | 2024-02-05 | 2024-04-30 | 武汉大学 | Voice quality enhancement method and system based on frequency domain self-attention network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |