CN116013344A - Speech enhancement method under multiple noise environments - Google Patents


Info

Publication number: CN116013344A
Application number: CN202211637892.7A
Authority: CN
Prior art keywords: module, features, feature, audio, term
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 张新曼, 李扬科, 杨剑锋, 彭豪鸿, 王静静, 贾士凡, 赵红超, 黄永文, 李桂成, 王歆叶
Current Assignee: Xian Jiaotong University
Original Assignee: Xian Jiaotong University
Application filed by Xian Jiaotong University
Priority date / Filing date: 2022-12-17
Publication date: 2023-04-25

Abstract

The invention discloses a speech enhancement method for multiple noise environments, which comprises the following steps: 1) completing the preprocessing and data enhancement of the audio; 2) extracting multi-level audio features with a multi-scale encoder based on a Transformer architecture, and strengthening key features by means of a feature lifting module; 3) capturing long- and short-term features in different dimensions by means of a long-short-term perception module based on a dual-path architecture; 4) obtaining a clean speech signal using a residual decoder and a mask estimation module; 5) jointly training the network model by means of a mean square error loss term and a signal-to-noise ratio loss term. The method has strong robustness and good real-time performance, and can effectively handle ten common noise types such as horn sounds, crowd noise, applause and birdsong, thereby improving the user experience of applications such as short video, live streaming, video conferencing and voice calls. Compared with some mainstream speech enhancement models, the method improves the relevant evaluation metrics by about 16% on average.

Description

Speech enhancement method under multiple noise environments
Technical Field
The invention belongs to the technical field of speech noise reduction, and particularly relates to a speech enhancement method for multiple noise environments.
Background
Whether in short video or live streaming, a major problem arises: background noise from the surroundings is captured along with the speaker's voice, which greatly degrades the audience's experience. Moreover, different creators record in different environments, so the types of noise are also varied, for example: car horns, plaza music, children crying, construction-site machinery and crowd noise. The interference of the surrounding environment and the complexity and variability of application scenarios require robust speech enhancement techniques to process the noisy audio.
Of course, the application of speech enhancement techniques is not limited to short video or live streaming; it can also serve a variety of downstream speech-related tasks, including intelligent voice interaction, speech emotion analysis and intelligent voice input. In the field of intelligent voice interaction, smart speakers are a common example; in the field of intelligent voice input, voice input methods are common. Taking the smart home as an example, a user can issue commands by voice, truly freeing both hands and avoiding direct contact with the device. Although voice-based intelligent interaction is becoming a mainstream mode of human-computer interaction, it still cannot completely replace keyboard or touch-screen input in daily life because of the complex noise environments users are in. It is therefore important to obtain clean speech from noisy mixed audio in real time by means of speech enhancement techniques.
At present, speech enhancement algorithms fall mainly into the following categories: harmonic enhancement methods, which are only suitable for removing stationary white noise and cannot accurately estimate the pitch period of speech; spectral subtraction, which is effective for wideband noise but may leave residual noise components in the enhanced result; Wiener filtering, whose residual noise resembles white noise rather than musical noise; enhancement methods based on speech model parameters, which perform poorly at low signal-to-noise ratios and often require multiple iterations; signal-subspace methods, which require a large amount of computation and have difficulty meeting real-time requirements; enhancement methods based on wavelet transforms, which have poor denoising capability for non-stationary noise; and deep learning-based methods, which estimate the clean speech signal directly in a data-driven manner and offer strong robustness and real-time performance. Deep learning-based approaches have performance advantages that conventional approaches cannot match, and have therefore become the dominant approach to speech enhancement.
However, current deep learning methods for speech enhancement still suffer from poor noise removal and poor robustness because they cannot effectively capture long-term features, short-term features and the key features needed for enhancement.
Disclosure of Invention
In order to overcome the shortcomings of the prior art, the present invention aims to provide a speech enhancement method for multiple noise environments, so as to remove noise from speech more effectively while offering stronger robustness and real-time performance.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a method for speech enhancement in a multiple noise environment, comprising the steps of:
step 1: preprocessing operation and data enhancement operation are carried out on the obtained audio data, and the processed audio data are input into a long-period perception enhancement model; the long-term perception enhancement model comprises the following steps: a multi-scale encoder, a long-short-term perception module and a residual error decoder;
step 2: extracting deep audio features of the processed audio data by using the multi-scale encoder;
step 3: the long-term sensing module is utilized to capture the characteristics in different dimensions respectively;
step 4: and reconstructing a voice signal by using the residual decoder, estimating the mask of the clean voice by using a mask estimation module, and multiplying the mask with the original input audio to obtain the enhanced clean voice.
In one embodiment, the preprocessing operation includes one or more of the following operations: resampling operation is carried out on the audio, clipping operation is carried out on the audio length, and channel compression operation is carried out on the audio;
the data enhancement operation includes one or more of the following operations: mixing noise audio according to random signal-to-noise ratio, randomly changing volume of the audio, and randomly adding reverberation effect.
In one embodiment, the multi-scale encoder is based on a Transformer architecture, is formed by stacking a plurality of feature capture modules, and realizes downsampling of features by means of a pooling operation; each feature capture module includes: a feature lifting module, a normalization layer and a feedforward neural network;
the feature lifting module is used to capture key audio features and the relationships between features in a global scope; it acquires attention weights using a convolution layer, a fully connected layer and a Sigmoid function, realizes key feature enhancement by element-wise multiplication of matrices, and captures the relationships between features in a global scope using a multi-head attention mechanism; the normalization layer performs the normalization operation; the feedforward neural network captures long- and short-term features using a bidirectional gated recurrent unit and extracts deep features in combination with a fully connected layer;
wherein different feature capture modules use different dilated convolution operations to capture features of different scales.
In one embodiment, the calculation formula of the feature capture module is as follows:
X_i^mid = LayerNorm(FBM(X_i^in) + X_i^in)
X_i^out = LayerNorm(FNN(X_i^mid) + X_i^mid)
where X_i^in, X_i^mid and X_i^out are the input features, intermediate process features and output features of the feature capture module, respectively; LayerNorm(·) is the layer normalization operation, FBM(·) is the feature lifting module operation, and FNN(·) is the feedforward neural network;
the calculation formulas of the feature lifting module are as follows:
X_j^mid = σ(FC(R(C_1D(X_j^in)))) ⊗ X_j^in
X_j^out = MAM(X_j^mid) ⊕ X_j^mid
where X_j^in, X_j^mid and X_j^out are the input features, intermediate features and output features of the feature lifting module, respectively; C_1D(·), FC(·) and R(·) are the one-dimensional convolution, fully connected layer and channel-adjustment operations, respectively; ⊗ and ⊕ denote element-wise multiplication and addition of matrices, respectively; σ denotes the Sigmoid activation function; MAM(·) denotes the multi-head attention operation.
In one embodiment, in the multi-head attention mechanism, the input features X^mid are first projected by learnable linear transformations to obtain the queries Q_i, keys K_i and values V_i; the calculation formulas are as follows:
Q_i = X^mid · W_i^Q
K_i = X^mid · W_i^K
V_i = X^mid · W_i^V
where W_i^Q, W_i^K and W_i^V are the weights of the fully connected layers;
secondly, the similarity between the queries and the keys is calculated by dot product and divided by a scaling factor;
then, a Softmax activation function is applied to obtain the weight corresponding to each value, and the weights are multiplied by the corresponding values;
finally, the results obtained by all heads are concatenated, and a linear projection operation is carried out again to obtain the final output;
the specific calculation formula of the multi-head attention mechanism is as follows:
head_i = Softmax(Q_i · K_i^T / √d) · V_i
MAM(Q, K, V) = Concat(head_1, …, head_h) · W^mh
where W^mh is a linear transformation matrix, h is the number of parallel attention layers, and d is the scaling factor;
the output of the multi-head attention mechanism is used as the input of the feedforward neural network to obtain the final output features;
the feedforward neural network comprises a gated recurrent unit, an activation function and a fully connected layer, with the following calculation formula:
FNN(X) = δ(W_fc · BiGRU(X) + b_fc)
where W_fc and b_fc represent the weights and the corresponding bias of the fully connected layer, and δ represents the ReLU activation function; the gated recurrent unit comprises an update gate and a reset gate, with the following calculation formulas:
z_t = σ(W_z · [h_(t-1), x_t])
r_t = σ(W_r · [h_(t-1), x_t])
h̃_t = γ(W · [r_t ⊗ h_(t-1), x_t])
h_t = (1 − z_t) ⊗ h_(t-1) + z_t ⊗ h̃_t
where σ and γ represent the Sigmoid and Tanh activation functions, respectively, and x_t, h_(t-1) and h_t are the input features at the current time step, the hidden state at the previous time step and the hidden state at the current time step, respectively.
In one embodiment, the long-short-term perception module adopts a dual-path architecture and comprises a gated recurrent unit, a one-dimensional convolution module, an instant layer normalization module and a channel-adjustment module; the gated recurrent unit captures long- and short-term features, the one-dimensional convolution module extracts deep features, and the instant layer normalization module performs feature normalization.
In one embodiment, the calculation formulas of the long-short-term perception module are as follows:
X_t^mid = iLN(C_1D(GRU(X_t^in))) ⊕ X_t^in
X_t^out = R(iLN(C_1D(GRU(R(X_t^mid))))) ⊕ X_t^mid
where GRU(·) is the gated recurrent unit, C_1D(·) is the one-dimensional convolution operation, iLN(·) is the instant layer normalization operation, R(·) is the channel-adjustment operation, and X_t^in, X_t^mid and X_t^out are the input features, intermediate features and output features of the long-short-term perception module, respectively;
the calculation formula of the instant layer normalization module is as follows:
μ(X_tf) = (1 / (N·K)) · Σ_N Σ_K X_tf
σ²(X_tf) = (1 / (N·K)) · Σ_N Σ_K (X_tf − μ(X_tf))²
iLN(X_tf) = ε ⊗ (X_tf − μ(X_tf)) / sqrt(σ²(X_tf) + λ) + β
where X_tf is the input feature and N and K are the dimensions of the feature; μ(·) and σ²(·) are the mean operation and the variance operation, respectively; the symbols ε and β are learnable parameters, and the symbol λ is a regularization parameter.
In one embodiment, the residual decoder comprises a plurality of decoding units, each decoding unit comprising a one-dimensional deconvolution module, a normalization module and an activation function; the input of each decoding unit is the output X_(i-1)^de of the previous decoding unit and the output X_i^en of the feature capture module at the same level, and the calculation formula is as follows:
X_i^de = θ(B(TC_1D(X_(i-1)^de ⊕ X_i^en)))
where TC_1D(·) is the one-dimensional deconvolution operation, B(·) is the batch normalization operation, θ is the PReLU activation function, and X_i^de is the output feature of the current decoding unit; the output of the decoder is the reconstructed speech signal.
In one embodiment, the mask estimation module is composed of a one-dimensional convolution module and a plurality of different activation functions, and the calculation formula is as follows:
X_m^mid = γ(C_1D(X_m^in)) ⊗ σ(C_1D(X_m^in))
M^out = δ(C_1D(X_m^mid))
where X_m^in, X_m^mid and M^out are the input features, intermediate process features and output mask of the mask estimation module, respectively, and γ, δ and σ are the Tanh, ReLU and Sigmoid activation functions, respectively;
the output features of the mask estimation module are multiplied element-wise with the originally input speech signal to obtain the clean speech signal estimated by the model, with the following calculation formula:
X_est = M^out ⊗ X_in
where X_in is the original input audio signal and X_est is the clean speech estimated by the model.
The long-short-term perception enhancement model is trained using a joint loss function, wherein the joint loss function consists of a mean square error loss term and a signal-to-noise ratio loss term; the signal-to-noise ratio loss term is used to realize optimization on the speech waveform, and the mean square error loss term is used to realize optimization on the speech spectrogram; the logarithm of the mean square error loss term is taken to ensure that it is of the same order of magnitude as the signal-to-noise ratio loss term.
Compared with the prior art, the invention has the beneficial effects that:
(1) The invention provides a real-time speech noise reduction method based on a deep learning long-short-term perception enhancement model, which has few parameters, strong robustness and good real-time performance, and can be applied to a variety of noise scenarios.
(2) The invention provides an encoder based on a Transformer architecture, which introduces an attention mechanism and a gated recurrent unit, helping to capture key features and to resolve long- and short-term feature dependencies.
(3) The invention provides a feature lifting module based on an attention mechanism, which can effectively capture the relationships between audio features over different ranges and thereby strengthen key audio features.
(4) The invention provides a long-short-term perception module based on a dual-path architecture, which can extract long- and short-term features in different dimensions, thereby providing more discriminative features for speech enhancement.
Drawings
FIG. 1 is a flow chart of a real-time speech enhancement method according to the present invention.
FIG. 2 is a diagram of a long and short term perception enhancement model framework in the present invention.
FIG. 3 is a schematic diagram of a feature capture module of a multi-scale encoder according to the present invention.
FIG. 4 is a diagram of a multi-head attention mechanism according to the present invention.
Fig. 5 is a schematic diagram of a feature lifting module based on an attention mechanism in the present invention.
FIG. 6 is a schematic diagram of a gated recurrent unit according to the present invention.
Fig. 7 is a schematic diagram of the long-short-term perception module based on a dual-path architecture in the present invention.
FIG. 8 is a graph showing the effect of speech enhancement under various noise conditions in the present invention.
Detailed Description
The following describes in detail how the technical means are applied to solve the technical problems and achieve the technical effects by combining the drawings and examples. It should be understood that the following detailed description is merely illustrative of the invention and is not intended to limit the scope of the invention. In addition, as long as no conflict is formed, each embodiment of the present invention and each feature of each embodiment may be combined with each other, and the formed technical solutions are all within the protection scope of the present invention.
The invention discloses a speech enhancement method for multiple noise environments, as shown in FIG. 1, comprising the following steps:
step 1: and acquiring audio data, and performing preprocessing operation and data enhancement operation.
Step 1.1: completing an audio preprocessing operation
A deep learning-based speech enhancement technique is a data-driven supervised learning method that requires the input audio data to have a fixed length, so the audio must be divided into fixed-length segments. Considering that different audio recordings have different sampling rates, they first need to be resampled. The sampling rate of the audio can be adjusted to 16 kHz by means of an audio processing library and stored in WAV format. Since some audio may be multi-channel, a channel compression operation is required to uniformly convert it into single-channel audio data. For ease of computation, a fusion strategy that averages the sum of the channels is adopted directly, with the following calculation formula:
S_mono = (1/K) · Σ_{i=1}^{K} S_i
where K is the number of audio channels, S_mono is the processed single-channel audio, and S_i is the audio of a particular channel; the multi-channel audio signal is compressed into a single-channel audio signal by this channel compression.
Further, assuming that the input audio length of the model is 4 seconds, a clipping operation needs to be performed on the audio according to an audio clipping algorithm to ensure that the length of each audio segment is 4 seconds. Since the sampling rate of the audio is 16 kHz, each audio segment contains 64000 samples. Assuming that the total number of samples in the audio is T, the specific audio clipping formulas are as follows:
S_start = 64000 × (l − 1) + 1
S_end = min(64000 × l, T)
wherein l is a positive integer, S start And S is end The ID of the start sampling point and the ID of the end sampling point, respectively. When the number of audio sampling points after clipping does not satisfy 64000 but the total number of sampling points is greater than 50000, the number of sampling points can be padded to 64000 by using a linear interpolation method. When the number of samples clipped is less than 50000, the clipped audio clip can be directly discarded.
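A minimal sketch of this preprocessing flow is given below, assuming the librosa library for loading and resampling; the helper name and the segment/padding thresholds simply mirror the values stated above and are not the patented implementation.

```python
# Minimal preprocessing sketch: resample to 16 kHz mono, split into 4 s
# (64000-sample) segments, pad tails longer than 50000 samples by linear
# interpolation, and discard shorter tails.
import numpy as np
import librosa

def preprocess(path, sr=16000, seg_len=64000, keep_thresh=50000):
    audio, _ = librosa.load(path, sr=sr, mono=True)   # resample + average channels
    segments = []
    for start in range(0, len(audio), seg_len):
        seg = audio[start:start + seg_len]
        if len(seg) == seg_len:
            segments.append(seg)
        elif len(seg) > keep_thresh:
            # stretch the short tail to 64000 samples by linear interpolation
            x_old = np.linspace(0.0, 1.0, num=len(seg))
            x_new = np.linspace(0.0, 1.0, num=seg_len)
            segments.append(np.interp(x_new, x_old, seg))
        # tails with 50000 samples or fewer are discarded
    return np.stack(segments) if segments else np.empty((0, seg_len))
```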
Step 1.2: completing data enhancement operations for audio
Considering the complexity and variability of the model's application scenarios, the robustness of the model needs to be improved using data enhancement techniques. To increase the complexity of the audio, three audio data enhancement methods are introduced here: mixing noise audio at a random signal-to-noise ratio, randomly changing the audio volume, and randomly adding a reverberation effect.
The random noise-mixing operation introduces additional background noise data and mixes it with the input audio according to a random signal-to-noise ratio. By way of example, a variety of common noises can be selected, such as electric drill sounds, horn sounds, crowd noise, dog barking, applause, birdsong, gunshots, frog croaking, machine sounds and music. The specific flow of this data enhancement is as follows: first, a signal-to-noise ratio is generated in the range [-15, 15] by uniform random sampling; the clean speech is scaled according to this random signal-to-noise ratio, and the scaled result is added to the noise audio to obtain the noisy mixture.
The random volume-change operation amplifies or attenuates the volume of the input audio by a random scaling factor: an audio scaling factor in the range [0, 2] is generated by uniform random sampling and multiplied with the original audio to obtain audio with randomly adjusted volume.
The flow of randomly adding a reverberation effect includes the following aspects: creating a room (defining the room size, the desired reverberation time, the wall materials and the maximum number of reflections allowed), creating a signal source in the room, placing a microphone in the room, creating the room impulse response, simulating sound propagation and synthesizing the reverberation effect. In this embodiment, the reverberation effect can be added to the speech data directly by means of the pyroomacoustics library.
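The noise-mixing and volume operations can be sketched as follows with NumPy; the helper names are illustrative, and the power-based gain computation is one common way to realize the random-SNR mixing described above rather than the exact patented procedure. The reverberation step via the pyroomacoustics library is omitted here.

```python
# Illustrative augmentation sketch: mix clean speech with noise at a random
# SNR in [-15, 15] dB and apply a random volume scale in [0, 2].
import numpy as np

def mix_at_random_snr(clean, noise, rng, snr_range=(-15.0, 15.0)):
    snr_db = rng.uniform(*snr_range)
    noise = np.resize(noise, clean.shape)             # loop/trim noise to match length
    clean_pow = np.mean(clean ** 2) + 1e-10
    noise_pow = np.mean(noise ** 2) + 1e-10
    gain = np.sqrt(clean_pow / (noise_pow * 10 ** (snr_db / 10)))
    return clean + gain * noise                       # noisy mixture at the target SNR

def random_volume(audio, rng, scale_range=(0.0, 2.0)):
    return rng.uniform(*scale_range) * audio

rng = np.random.default_rng(0)
# noisy = random_volume(mix_at_random_snr(clean, noise, rng), rng)
```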
Step 2: deep audio features are extracted by means of a multi-scale encoder.
With the help of deep learning techniques, the invention designs an efficient long-short-term perception enhancement model; the audio processed in Step 1 is input into this model to realize real-time speech enhancement under various kinds of noise. FIG. 2 shows the overall architecture of the model, which mainly comprises a multi-scale encoder, a long-short-term perception module and a residual decoder. The multi-scale encoder is mainly used to compress audio features and extract deep features, and the residual decoder is mainly used to reconstruct the audio signal. In this embodiment, the multi-scale encoder is based on a Transformer architecture and is mainly composed of a stack of feature capture modules, 5 in this embodiment. Each feature capture module in turn comprises: a feature lifting module, a normalization layer and a feedforward neural network.
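To make the overall data flow concrete, the following PyTorch skeleton sketches how an encoder, a long-short-term perception stage, a decoder and a mask can be chained. Every sub-module here is a simplified stand-in (a single convolution, GRU and transposed convolution), not the patented architecture, and the kernel sizes are assumptions; the real building blocks are detailed in the following sections.

```python
# High-level skeleton: encoder -> perception module -> decoder -> mask -> clean speech.
import torch
import torch.nn as nn

class LSPEnhancementModel(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel_size=16, stride=8)          # stand-in multi-scale encoder
        self.perception = nn.GRU(channels, channels, batch_first=True)           # stand-in long-short-term perception module
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel_size=16, stride=8) # stand-in residual decoder
        self.mask_act = nn.Sigmoid()

    def forward(self, wav):                                        # wav: (batch, 1, samples)
        feats = self.encoder(wav)                                  # compress audio and extract features
        feats, _ = self.perception(feats.transpose(1, 2))          # capture long- and short-term context
        mask = self.mask_act(self.decoder(feats.transpose(1, 2)))  # mask of the clean speech
        return mask * wav[..., :mask.shape[-1]]                    # enhanced clean speech
```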
Fig. 3 shows detailed information of a feature capture module in the multi-scale encoder based on the Transformer architecture, and the specific calculation formulas are as follows:
X_i^mid = LayerNorm(FBM(X_i^in) + X_i^in)
X_i^out = LayerNorm(FNN(X_i^mid) + X_i^mid)
where X_i^in, X_i^mid and X_i^out are the input features, intermediate process features and output features of the feature capture module, respectively; LayerNorm(·) is the layer normalization operation, FBM(·) is the feature lifting module operation, and FNN(·) is the feedforward neural network. In addition, the feature capture module introduces residual connections to preserve the original features, and uses the attention-based feature lifting module to capture and strengthen key features. Fig. 4 shows the details of the multi-head attention mechanism employed in this module. In the overall flow of the feature capture module, in order to effectively capture key audio features and resolve long- and short-term feature dependencies, the acquired features are first input into the feature lifting module, which captures key long- and short-term features by means of the attention mechanism; feature normalization is then realized by the layer normalization operation, deep features are captured by the feedforward neural network, and the result is finally processed by another layer normalization operation. Furthermore, downsampling of features between different feature capture modules is realized by means of a max pooling operation, and different feature capture modules adopt different dilated convolution operations to capture features of different scales.
The feature lifting module is the core component of the feature capture module; it captures key audio features and the relationships between features in a global scope, i.e., it effectively captures and strengthens the important features. Fig. 5 shows the detailed architecture of this module. The module mainly acquires attention weights by means of a convolution layer, a fully connected layer and a Sigmoid function, and realizes key feature enhancement by element-wise multiplication of matrices. At the same time, the relationships between features over a larger range are captured by means of a multi-head attention mechanism, so that harmonics are suppressed as much as possible. The specific calculation formulas are as follows:
X_j^mid = σ(FC(R(C_1D(X_j^in)))) ⊗ X_j^in
X_j^out = MAM(X_j^mid) ⊕ X_j^mid
where X_j^in, X_j^mid and X_j^out are the input features, intermediate features and output features of the feature lifting module, respectively; C_1D(·), FC(·) and R(·) are the one-dimensional convolution, fully connected layer and channel-adjustment operations, respectively; ⊗ and ⊕ denote element-wise multiplication and addition of matrices, respectively; σ denotes the Sigmoid activation function, which facilitates computing the weights corresponding to the key features; and MAM(·) denotes the multi-head attention operation. Compression of the feature channels is realized by a one-dimensional convolution with kernel size 1, the corresponding weight matrix is obtained by means of the fully connected layer and the Sigmoid activation function, and the key features are finally strengthened by element-wise multiplication. For the multi-head attention mechanism, the input features X_j^mid are first projected by learnable linear transformations to obtain the queries Q_i, keys K_i and values V_i; the specific calculation formulas are as follows:
Q_i = X_j^mid · W_i^Q
K_i = X_j^mid · W_i^K
V_i = X_j^mid · W_i^V
where W_i^Q, W_i^K and W_i^V are the weights of the fully connected layers. Secondly, the similarity between the queries and the keys is calculated by dot product and divided by a scaling factor. Then, a Softmax activation function is applied to obtain the weight corresponding to each value, which is multiplied with the corresponding value. Finally, the results obtained by all heads are concatenated and a linear projection is applied again to obtain the final output. The specific calculation formulas of the multi-head attention mechanism are as follows:
head_i = Softmax(Q_i · K_i^T / √d) · V_i
MAM(Q, K, V) = Concat(head_1, …, head_h) · W^mh
where W^mh is a linear transformation matrix and h is the number of parallel attention layers. The output of the multi-head attention module is used as the input of the feedforward neural network to obtain the final output features. Residual connection and layer normalization operations are also introduced in this module to further improve feature extraction.
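As an illustration of the structure just described, the following PyTorch sketch shows one way an attention-based feature lifting block could be assembled; the layer sizes, the channel-reduction ratio and the use of torch.nn.MultiheadAttention are assumptions for illustration rather than the patented implementation.

```python
# Sketch of an attention-based feature lifting block: Conv1d + FC + Sigmoid
# weighting of key features, followed by multi-head attention with a residual
# connection and layer normalization.
import torch
import torch.nn as nn

class FeatureLiftingModule(nn.Module):
    def __init__(self, channels: int = 64, n_heads: int = 4):
        super().__init__()
        self.compress = nn.Conv1d(channels, channels // 2, kernel_size=1)  # channel compression
        self.fc = nn.Linear(channels // 2, channels)                        # restore channel count
        self.mha = nn.MultiheadAttention(channels, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                             # x: (batch, time, channels)
        w = self.compress(x.transpose(1, 2))          # (batch, channels//2, time)
        w = torch.sigmoid(self.fc(w.transpose(1, 2))) # attention weights in (0, 1)
        x = x * w                                     # strengthen key features element-wise
        attn, _ = self.mha(x, x, x)                   # relationships over the global scope
        return self.norm(attn + x)                    # residual connection + layer norm
```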
The feedforward neural network mainly comprises a gated recurrent unit, an activation function and a fully connected layer. The capture of long- and short-term features is mainly realized by means of a bidirectional gated recurrent unit, and the extraction of deep features is realized in combination with the fully connected layer. The specific calculation formula is as follows:
FNN(X) = δ(W_fc · BiGRU(X) + b_fc)
where W_fc and b_fc represent the weights and the corresponding bias of the fully connected layer, and δ represents the ReLU activation function. A bidirectional gated recurrent unit is used here to capture the audio features: it not only captures long- and short-term features effectively, but also avoids the heavy computation of an LSTM. Moreover, this approach tends to achieve more satisfactory results than using a fully connected layer alone, and compared with one-dimensional convolution, the bidirectional gated recurrent unit can perceive relationships between more distant features and automatically attend to the more important ones. Fig. 6 shows the implementation details of the gated recurrent unit, which mainly comprises an update gate and a reset gate; the specific calculation formulas are as follows:
z_t = σ(W_z · [h_(t-1), x_t])
r_t = σ(W_r · [h_(t-1), x_t])
h̃_t = γ(W · [r_t ⊗ h_(t-1), x_t])
h_t = (1 − z_t) ⊗ h_(t-1) + z_t ⊗ h̃_t
where σ and γ represent the Sigmoid and Tanh activation functions, respectively, and x_t, h_(t-1) and h_t are the input features at the current time step, the hidden state at the previous time step and the hidden state at the current time step, respectively.
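A minimal PyTorch sketch of such a BiGRU-plus-fully-connected feedforward block is given below; the hidden size is an assumed illustrative value.

```python
# Feedforward sub-block: bidirectional GRU for long-/short-term context,
# then a fully connected projection with ReLU for deep features.
import torch
import torch.nn as nn

class FeedForwardGRU(nn.Module):
    def __init__(self, channels: int = 64, hidden: int = 128):
        super().__init__()
        self.bigru = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, channels)   # 2*hidden because bidirectional
        self.act = nn.ReLU()

    def forward(self, x):            # x: (batch, time, channels)
        h, _ = self.bigru(x)         # context in both directions
        return self.act(self.fc(h))  # back to (batch, time, channels)
```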
Step 3: features in different dimensions are captured by means of a long and short term perception module.
For the speech features extracted by the multi-scale encoder, the relationships between features in different dimensions need to be processed further. The invention therefore designs a long-short-term perception module with a dual-path architecture, which can effectively capture long- and short-term audio features in different dimensions and thus resolve the long- and short-term dependencies between features. FIG. 7 shows the detailed architecture of the long-short-term perception module. The module realizes long- and short-term feature capture in the time dimension and the feature dimension respectively, by means of a gated recurrent unit, a one-dimensional convolution operation, an instant layer normalization operation and a channel-adjustment operation. Note that in this embodiment the instant layer normalization operation is adopted instead of the conventional layer normalization operation, which reduces the sensitivity of the model to the energy of the input signal. To preserve the original features, the module also introduces the idea of residual connections. In both the time dimension and the feature dimension, the gated recurrent unit extracts long- and short-term features over different ranges, the one-dimensional convolution operation captures deep features, and the instant layer normalization operation then normalizes the features.
The specific calculation formula of the module is as follows:
X_t^mid = iLN(C_1D(GRU(X_t^in))) ⊕ X_t^in
X_t^out = R(iLN(C_1D(GRU(R(X_t^mid))))) ⊕ X_t^mid
where GRU(·) is the gated recurrent unit, C_1D(·) is the one-dimensional convolution operation, iLN(·) is the instant layer normalization operation, and R(·) is the channel-adjustment operation; X_t^in, X_t^mid and X_t^out are the input features, intermediate features and output features of the module, respectively. When the features are input into the network, the GRU is first used to capture long- and short-term features in the time dimension, the one-dimensional convolution operation is used to further extract deep features, and the instant layer normalization operation is then used for feature normalization. The GRU is used here because it requires fewer computational resources and less time than an LSTM while achieving comparable results: the GRU contains only a reset gate and an update gate, which effectively solves the long-term memory problem. In addition, the specific calculation formulas of the instant layer normalization operation used by this module are as follows:
μ(X_tf) = (1 / (N·K)) · Σ_N Σ_K X_tf
σ²(X_tf) = (1 / (N·K)) · Σ_N Σ_K (X_tf − μ(X_tf))²
iLN(X_tf) = ε ⊗ (X_tf − μ(X_tf)) / sqrt(σ²(X_tf) + λ) + β
where X_tf is the input feature and N and K are the dimensions of the feature; μ(·) and σ²(·) are the mean operation and the variance operation, respectively; the symbols ε and β are learnable parameters, and the symbol λ is a regularization parameter. This normalization operation reduces the sensitivity of the model to the energy of the input signal. To capture long- and short-term features in the feature dimension, the two feature channels are first exchanged, the GRU is then used to capture long- and short-term features, deep features are extracted by the one-dimensional convolution, the result is processed by the instant layer normalization operation, and the output features are finally obtained through the channel-adjustment operation.
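The following PyTorch sketch illustrates the dual-path idea described above: one GRU/convolution/normalization pass over the time axis, then the same pattern over the feature axis after transposing, each with a residual connection. The fixed number of frames, the unidirectional GRUs and the use of LayerNorm in place of the patent's instant layer normalization are simplifying assumptions.

```python
# Dual-path long-short-term perception block (illustrative).
import torch
import torch.nn as nn

class _PathBlock(nn.Module):
    def __init__(self, feat: int):
        super().__init__()
        self.gru = nn.GRU(feat, feat, batch_first=True)
        self.conv = nn.Conv1d(feat, feat, kernel_size=1)
        self.norm = nn.LayerNorm(feat)

    def forward(self, x):                            # x: (batch, seq, feat)
        h, _ = self.gru(x)                           # long-/short-term context along seq
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)  # deep features
        return self.norm(h) + x                      # residual connection

class DualPathBlock(nn.Module):
    def __init__(self, channels: int, n_frames: int):
        super().__init__()
        self.time_path = _PathBlock(feat=channels)   # sequence over time frames
        self.feat_path = _PathBlock(feat=n_frames)   # sequence over feature channels

    def forward(self, x):                            # x: (batch, time, channels)
        x = self.time_path(x)
        x = x.transpose(1, 2)                        # exchange the two feature axes
        x = self.feat_path(x)
        return x.transpose(1, 2)                     # back to (batch, time, channels)
```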
Step 4: the enhanced clean speech is obtained by means of a residual decoder.
In order to obtain clean speech, the speech signal must first be reconstructed by means of the residual decoder. The residual decoder mainly comprises a plurality of decoding units, 5 in this embodiment, which gradually realize the estimation of the spectrogram mask. Each decoding unit is mainly composed of a one-dimensional deconvolution operation, a normalization operation and an activation function. Meanwhile, in order to reconstruct the speech signal well, the input of each decoding unit comprises two parts: the output X_(i-1)^de of the previous decoding unit and the output X_i^en of the feature capture module at the same level. The decoding unit realizes feature extraction and upsampling simultaneously by means of the one-dimensional deconvolution, and increases the nonlinear capability of the model by means of the PReLU activation function. The specific calculation formula is as follows:
X_i^de = θ(B(TC_1D(X_(i-1)^de ⊕ X_i^en)))
where TC_1D(·) is the one-dimensional deconvolution operation, which is mainly used to realize feature extraction and upsampling; B(·) is the batch normalization operation; θ is the PReLU activation function; and X_i^de is the output feature of the current decoding unit. The output of the decoder is the reconstructed speech signal. The mask estimation module is then required to process the reconstructed speech signal output by the decoder and estimate the mask of the clean speech signal, thereby realizing the generation of the clean speech mask. The mask estimation module consists of a one-dimensional convolution operation and several different activation functions; the specific calculation formulas are as follows:
X_m^mid = γ(C_1D(X_m^in)) ⊗ σ(C_1D(X_m^in))
M^out = δ(C_1D(X_m^mid))
where X_m^in, X_m^mid and M^out are the input features, intermediate process features and output mask of the mask estimation module, respectively; γ, δ and σ are the Tanh, ReLU and Sigmoid activation functions, respectively. The clean speech signal estimated by the model is obtained by multiplying the output features of the mask estimation module element-wise with the originally input speech signal; the calculation formula is as follows:
X_est = M^out ⊗ X_in
where X_in is the original input audio signal and X_est is the clean speech estimated by the model.
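By way of illustration, the sketch below shows one possible decoding unit and a gated mask-estimation head in PyTorch; the kernel size, stride, the addition used to merge the skip connection and the Tanh×Sigmoid gating are all assumptions, since the patent only specifies the building blocks (deconvolution, batch normalization, PReLU, and convolution with Tanh/ReLU/Sigmoid activations).

```python
# One decoding unit (deconvolution + batch norm + PReLU on the sum of the
# previous decoder output and the same-level encoder output) and a gated
# mask-estimation head.
import torch
import torch.nn as nn

class DecodingUnit(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 2):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=4,
                                         stride=stride, padding=1)  # upsample x2
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.PReLU()

    def forward(self, x_prev, x_skip):               # both: (batch, in_ch, time)
        return self.act(self.bn(self.deconv(x_prev + x_skip)))

class MaskEstimator(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv_a = nn.Conv1d(channels, channels, kernel_size=1)
        self.conv_b = nn.Conv1d(channels, channels, kernel_size=1)
        self.conv_out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):                            # x: (batch, channels, time)
        gated = torch.tanh(self.conv_a(x)) * torch.sigmoid(self.conv_b(x))
        return torch.relu(self.conv_out(gated))      # mask to multiply with the input audio
```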
The model and its processing flow are as described above; the model further needs to be trained and tested to obtain a model that meets the requirements.
Specifically, to complete the supervised training of the model, the invention introduces a joint loss function comprising two parts: a signal-to-noise ratio loss term f(·) and a mean square error loss term MSE(·). The former mainly realizes optimization on the speech waveform, while the latter mainly realizes optimization on the speech spectrogram. In addition, the logarithm of the mean square error loss term is taken to ensure that it is of the same order of magnitude as the signal-to-noise ratio loss term.
The specific expression of the loss function is as follows:
Loss = f(s, ŝ) + log( MSE(S_r, Ŝ_r) + MSE(S_i, Ŝ_i) + MSE(|S|, |Ŝ|) )
where s and ŝ are the clean audio and the model-estimated audio, respectively; S_r and Ŝ_r are the real parts of the clean spectrogram and of the model-estimated spectrogram; S_i and Ŝ_i are the imaginary parts of the clean spectrogram and of the model-estimated spectrogram; and |S| and |Ŝ| are the magnitudes of the clean spectrogram and of the model-estimated spectrogram. The mean square error loss term measures the differences in real part, imaginary part and magnitude between the model-estimated spectrogram and the true spectrogram, and its logarithm is taken so that it is of the same order of magnitude as the signal-to-noise ratio loss term. The signal-to-noise ratio loss term constrains the amplitude of the output and avoids level shifts between input and output. Its specific calculation formula is as follows:
f(s, ŝ) = −10 · log10( ||s||² / ||s − ŝ||² )
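A hedged PyTorch sketch of such a joint loss is shown below; the STFT parameters and the exact sign convention of the signal-to-noise ratio term are assumptions, and only the overall structure (a waveform SNR term plus the logarithm of spectrogram MSE terms over real part, imaginary part and magnitude) follows the description above.

```python
# Joint loss sketch: negated SNR on the waveform + log of spectrogram MSE terms.
import torch

def joint_loss(est, clean, n_fft=512, hop=128, eps=1e-8):
    # est, clean: (batch, samples)
    snr = 10 * torch.log10(clean.pow(2).sum(-1) / ((clean - est).pow(2).sum(-1) + eps) + eps)
    snr_loss = -snr.mean()                                 # lower is better

    win = torch.hann_window(n_fft, device=est.device)
    S_est = torch.stft(est, n_fft, hop, window=win, return_complex=True)
    S_ref = torch.stft(clean, n_fft, hop, window=win, return_complex=True)
    mse = ((S_est.real - S_ref.real).pow(2).mean()
           + (S_est.imag - S_ref.imag).pow(2).mean()
           + (S_est.abs() - S_ref.abs()).pow(2).mean())
    return snr_loss + torch.log(mse + eps)
```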
in order to be able to demonstrate the effectiveness of the method according to the invention, relevant experimental tests have been carried out. A large amount of noise audio frequencies are fused on the basis of the existing pure voice, so that the collected voice under various noises is simulated. The noise categories selected here are: electric drill sound, whistling sound, noise sound, barking sound, drum sound, bird song sound, gun sound, frog song sound, machine sound, music sound. Meanwhile, the effect of voice enhancement is measured by means of three commonly used evaluation indexes of voice enhancement, which are respectively as follows: perceptual speech quality assessment (PESQ), short-term speech intelligibility (STOI), and Source Artifact Ratio (SAR). Wherein, PESQ and STOI are both evaluation methods of perception level, which are both that the larger the value is, the better the effect of speech enhancement is. For STOI, the calculation process mainly comprises three steps: removing the mute frame; 1/3 times of frequency band decomposition of DFT is completed on the signal; the correlation coefficients before and after the enhancement are calculated and averaged. For PESQ, which requires a noisy attenuated signal and an original reference signal, the computation involves preprocessing, time alignment, perceptual filtering, masking effects, etc. It can provide a subjective predictor for objective speech quality assessment and can be mapped to a MOS scale range, with a score range between-0.5 and 4.5. In addition, the evaluation index SAR may be regarded as an evaluation index of a signal level, and the larger the value thereof is, the better the effect of speech enhancement is, and the specific calculation formula is as follows:
Figure BDA0004004828550000151
in the formula e interf 、e noise And e artif Error signals, s, introduced by interference, noise and artefacts, respectively target Then it is the target signal. Table 1 shows the effect of the present invention on the above evaluation index compared with the main stream method. It was found that it could be improved by about 16% over the mainstream speech enhancement model Demucs on PESQ evaluation index and about 16% over the mainstream speech enhancement model MannerNet on SAR evaluation index. At the same time, an excellent performance of 0.94 can be achieved on the evaluation index STOI. In addition, for ten different noise interference environments, fig. 8 shows a voice effect diagram after noise reduction based on the long-short-period perception enhancement model provided by the invention, which can obtain a satisfactory effect.
TABLE 1. Effect comparison of the long-short-term perception enhancement model of the invention with mainstream speech enhancement models

Model                                          PESQ    STOI    SAR
Demucs                                         2.08    0.93    18.70
MannerNet                                      2.22    0.94    17.41
Long-short-term perception enhancement model   2.41    0.94    20.27

Claims (10)

1. A method for speech enhancement in a multiple noise environment, comprising the steps of:
step 1: preprocessing operation and data enhancement operation are carried out on the obtained audio data, and the processed audio data are input into a long-period perception enhancement model; the long-period perception enhancement model comprises a multi-scale encoder, a long-period perception module and a residual error decoder;
step 2: extracting deep audio features of the processed audio data by using the multi-scale encoder;
step 3: the long-term sensing module is utilized to capture the characteristics in different dimensions respectively;
step 4: and reconstructing a voice signal by using the residual decoder, estimating the mask of the clean voice by using a mask estimation module, and multiplying the mask with the original input audio to obtain the enhanced clean voice. Training of the model is done by means of joint loss functions.
2. The method of claim 1, wherein the preprocessing operation comprises one or more of the following operations: resampling operation is carried out on the audio, clipping operation is carried out on the audio length, and channel compression operation is carried out on the audio;
the data enhancement operation includes one or more of the following operations: mixing noise audio according to random signal-to-noise ratio, randomly changing volume of the audio, and randomly adding reverberation effect.
3. The method according to claim 1, wherein the multi-scale encoder is based on a Transformer architecture, is formed by stacking a plurality of feature capture modules, and performs feature downsampling by means of pooling; each feature capture module includes: a feature lifting module, a normalization layer and a feedforward neural network;
the feature lifting module is used to capture key audio features and the relationships between features in a global scope; it acquires attention weights using a convolution layer, a fully connected layer and a Sigmoid function, realizes key feature enhancement by element-wise multiplication of matrices, and captures the relationships between features in a global scope using a multi-head attention mechanism; the normalization layer performs the normalization operation; the feedforward neural network captures long- and short-term features using a bidirectional gated recurrent unit and extracts deep features in combination with a fully connected layer;
wherein different feature capture modules use different dilated convolution operations to capture features of different scales.
4. A method for enhancing speech in a multiple noise environment according to claim 3, wherein the calculation formula of the feature capturing module is as follows:
X_i^mid = LayerNorm(FBM(X_i^in) + X_i^in)
X_i^out = LayerNorm(FNN(X_i^mid) + X_i^mid)
where X_i^in, X_i^mid and X_i^out are the input features, intermediate process features and output features of the feature capture module, respectively; LayerNorm(·) is the layer normalization operation, FBM(·) is the feature lifting module operation, and FNN(·) is the feedforward neural network;
the calculation formulas of the feature lifting module are as follows:
X_j^mid = σ(FC(R(C_1D(X_j^in)))) ⊗ X_j^in
X_j^out = MAM(X_j^mid) ⊕ X_j^mid
where X_j^in, X_j^mid and X_j^out are the input features, intermediate features and output features of the feature lifting module, respectively; C_1D(·), FC(·) and R(·) are the one-dimensional convolution, fully connected layer and channel-adjustment operations, respectively; ⊗ and ⊕ denote element-wise multiplication and addition of matrices, respectively; σ denotes the Sigmoid activation function; MAM(·) denotes the multi-head attention operation.
5. The method of claim 4, wherein in the multi-head attention mechanism, the input features X^mid are first projected by learnable linear transformations to obtain the queries Q_i, keys K_i and values V_i; the calculation formulas are as follows:
Q_i = X^mid · W_i^Q
K_i = X^mid · W_i^K
V_i = X^mid · W_i^V
where W_i^Q, W_i^K and W_i^V are the weights of the fully connected layers;
secondly, the similarity between the queries and the keys is calculated by dot product and divided by a scaling factor;
then, a Softmax activation function is applied to obtain the weight corresponding to each value, and the weights are multiplied by the corresponding values;
finally, the results obtained by all heads are concatenated, and a linear projection operation is carried out again to obtain the final output;
the specific calculation formula of the multi-head attention mechanism is as follows:
head_i = Softmax(Q_i · K_i^T / √d) · V_i
MAM(Q, K, V) = Concat(head_1, …, head_h) · W^mh
where W^mh is a linear transformation matrix, h is the number of parallel attention layers, and d is the scaling factor;
the output of the multi-head attention mechanism is used as the input of the feedforward neural network to obtain the final output features;
the feedforward neural network comprises a gated recurrent unit, an activation function and a fully connected layer, with the following calculation formula:
FNN(X) = δ(W_fc · BiGRU(X) + b_fc)
where W_fc and b_fc represent the weights and the corresponding bias of the fully connected layer, and δ represents the ReLU activation function; the gated recurrent unit comprises an update gate and a reset gate, with the following calculation formulas:
z_t = σ(W_z · [h_(t-1), x_t])
r_t = σ(W_r · [h_(t-1), x_t])
h̃_t = γ(W · [r_t ⊗ h_(t-1), x_t])
h_t = (1 − z_t) ⊗ h_(t-1) + z_t ⊗ h̃_t
where σ and γ represent the Sigmoid and Tanh activation functions, respectively, and x_t, h_(t-1) and h_t are the input features at the current time step, the hidden state at the previous time step and the hidden state at the current time step, respectively.
6. The method for enhancing voice in a multiple noise environment according to claim 1, wherein the long-short-term perception module adopts a dual-path architecture and comprises a gated recurrent unit, a one-dimensional convolution module, an instant layer normalization module and a channel-adjustment module; the gated recurrent unit captures long- and short-term features, the one-dimensional convolution module extracts deep features, and the instant layer normalization module performs feature normalization.
7. The method for enhancing speech in a multiple noise environment according to claim 6, wherein the calculation formulas of the long-short-term perception module are as follows:
X_t^mid = iLN(C_1D(GRU(X_t^in))) ⊕ X_t^in
X_t^out = R(iLN(C_1D(GRU(R(X_t^mid))))) ⊕ X_t^mid
where GRU(·) is the gated recurrent unit, C_1D(·) is the one-dimensional convolution operation, iLN(·) is the instant layer normalization operation, R(·) is the channel-adjustment operation, and X_t^in, X_t^mid and X_t^out are the input features, intermediate features and output features of the long-short-term perception module, respectively;
the calculation formula of the instant layer normalization module is as follows:
μ(X_tf) = (1 / (N·K)) · Σ_N Σ_K X_tf
σ²(X_tf) = (1 / (N·K)) · Σ_N Σ_K (X_tf − μ(X_tf))²
iLN(X_tf) = ε ⊗ (X_tf − μ(X_tf)) / sqrt(σ²(X_tf) + λ) + β
where X_tf is the input feature, N and K are the dimensions of the feature, μ(·) and σ²(·) are the mean operation and the variance operation, respectively, the symbols ε and β are learnable parameters, and the symbol λ is a regularization parameter.
8. The method of claim 1, wherein the residual decoder comprises a plurality of decoding units, each decoding unit comprising a one-dimensional deconvolution module, a normalization module and an activation function; the input of each decoding unit is the output X_(i-1)^de of the previous decoding unit and the output X_i^en of the feature capture module at the same level, and the calculation formula is as follows:
X_i^de = θ(B(TC_1D(X_(i-1)^de ⊕ X_i^en)))
where TC_1D(·) is the one-dimensional deconvolution operation, B(·) is the batch normalization operation, θ is the PReLU activation function, and X_i^de is the output feature of the current decoding unit; the output of the decoder is the reconstructed speech signal.
9. The method for speech enhancement in a multiple noise environment according to claim 1, wherein the mask estimation module is composed of a one-dimensional convolution module and a plurality of different activation functions, and the calculation formula is as follows:
X_m^mid = γ(C_1D(X_m^in)) ⊗ σ(C_1D(X_m^in))
M^out = δ(C_1D(X_m^mid))
where X_m^in, X_m^mid and M^out are the input features, intermediate process features and output mask of the mask estimation module, respectively, and γ, δ and σ are the Tanh, ReLU and Sigmoid activation functions, respectively;
the output features of the mask estimation module are multiplied element-wise with the originally input speech signal to obtain the clean speech signal estimated by the model, with the following calculation formula:
X_est = M^out ⊗ X_in
where X_in is the original input audio signal and X_est is the clean speech estimated by the model.
10. The method for enhancing speech in a multiple noise environment according to claim 1, wherein the long-short-term perception enhancement model is trained using a joint loss function, the joint loss function is composed of a mean square error loss term and a signal-to-noise ratio loss term, the signal-to-noise ratio loss term is used to realize optimization on the speech waveform, and the mean square error loss term is used to realize optimization on the speech spectrogram; wherein the logarithm of the mean square error loss term is taken to ensure that it is of the same order of magnitude as the signal-to-noise ratio loss term.
CN202211637892.7A 2022-12-17 2022-12-17 Speech enhancement method under multiple noise environments Pending CN116013344A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211637892.7A CN116013344A (en) 2022-12-17 2022-12-17 Speech enhancement method under multiple noise environments

Publications (1)

Publication Number Publication Date
CN116013344A true CN116013344A (en) 2023-04-25

Family

ID=86032687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211637892.7A Pending CN116013344A (en) 2022-12-17 2022-12-17 Speech enhancement method under multiple noise environments

Country Status (1)

Country Link
CN (1) CN116013344A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117351983A (en) * 2023-12-06 2024-01-05 深圳市龙芯威半导体科技有限公司 Transformer-based voice noise reduction method and system
CN117351983B (en) * 2023-12-06 2024-04-19 深圳市龙芯威半导体科技有限公司 Transformer-based voice noise reduction method and system
CN117711417A (en) * 2024-02-05 2024-03-15 武汉大学 Voice quality enhancement method and system based on frequency domain self-attention network
CN117711417B (en) * 2024-02-05 2024-04-30 武汉大学 Voice quality enhancement method and system based on frequency domain self-attention network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination