CN116013344A - Speech enhancement method under multiple noise environments - Google Patents
Speech enhancement method under multiple noise environments
- Publication number
- CN116013344A (application CN202211637892.7A)
- Authority
- CN
- China
- Prior art keywords
- module
- features
- feature
- audio
- term
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Complex Calculations (AREA)
Abstract
The invention discloses a speech enhancement method under multiple noise environments, which comprises the following steps: 1) completing audio preprocessing and data enhancement operations; 2) extracting multi-level audio features with a multi-scale encoder based on the Transformer architecture, and strengthening key features by means of a feature lifting module; 3) capturing long-term and short-term features along different dimensions by means of a long-short-term perception module based on a dual-path architecture; 4) obtaining a clean speech signal using a residual decoder and a mask estimation module; 5) jointly training the network model with a mean square error loss term and a signal-to-noise ratio loss term. The method is robust and runs in real time, and can effectively handle ten common noises such as car horns, crowd noise, applause and bird song, thereby improving the user experience of applications such as short video, live streaming, video conferencing and voice calls. Compared with some mainstream speech enhancement models, the method achieves an average improvement of about 16% on the relevant evaluation indices.
Description
Technical Field
The invention belongs to the technical field of speech noise reduction, and in particular relates to a speech enhancement method under multiple noise environments.
Background
Whether in short video or live streaming, one major problem arises: the surrounding background noise is captured together with the speaker's voice, which greatly degrades the audience's experience. In addition, different creators record in different environments, so the kinds of noise are also diverse, for example: car horns, music in public squares, children crying, construction machinery, crowd noise and so on. The interference of the surrounding environment and the complexity and variability of application scenarios require robust speech enhancement techniques to process noisy audio.
Of course, the application of speech enhancement techniques is not limited to short video or live streaming; it can also serve a variety of downstream speech-related tasks, including intelligent voice interaction, speech emotion analysis and intelligent voice input. In the field of intelligent voice interaction, smart speakers are a common example; in the field of intelligent voice input, voice input methods are common. Taking the smart home as an example, a user can issue commands by voice, truly freeing both hands and avoiding direct contact with the device. Although voice-based intelligent interaction is becoming a mainstream mode of human-computer interaction, it still cannot completely replace keyboard or touch-screen input in daily life because of the complex noise environments users are in. It is therefore important to obtain clean speech from noisy mixed audio in real time by means of speech enhancement techniques.
At present, speech enhancement algorithms fall mainly into the following categories: harmonic enhancement methods, which are only suitable for removing stationary white noise and cannot accurately estimate the pitch period of speech; spectral subtraction, which is more effective for wideband noise but may leave residual noise components in the enhanced result; Wiener filtering, whose residual noise resembles white noise rather than musical noise; enhancement methods based on speech model parameters, which perform poorly at low signal-to-noise ratios and often require multiple iterations; signal-subspace methods, whose computational cost is large and hard to reconcile with real-time requirements; wavelet-transform-based enhancement methods, whose denoising capability for non-stationary noise is poor; and deep-learning-based methods, which estimate the clean speech signal directly in a data-driven manner and offer strong robustness and real-time performance. Deep-learning-based approaches have performance advantages that conventional approaches cannot match, and have therefore become the dominant approach to speech enhancement.
However, current deep-learning methods for speech enhancement still suffer from poor noise removal and weak robustness, because they cannot effectively capture long-term features, short-term features and the key features relevant to enhancement.
Disclosure of Invention
In order to overcome the shortcomings of the prior art, the present invention aims to provide a speech enhancement method under multiple noise environments, so as to remove noise from speech more effectively while offering stronger robustness and real-time performance.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a method for speech enhancement in a multiple noise environment, comprising the steps of:
step 1: performing preprocessing and data enhancement operations on the acquired audio data, and feeding the processed audio data into a long-short-term perception enhancement model; the long-short-term perception enhancement model comprises: a multi-scale encoder, a long-short-term perception module and a residual decoder;
step 2: extracting deep audio features of the processed audio data by using the multi-scale encoder;
step 3: capturing features along different dimensions with the long-short-term perception module;
step 4: reconstructing the speech signal with the residual decoder, estimating the mask of the clean speech with a mask estimation module, and multiplying the mask with the original input audio to obtain the enhanced clean speech.
In one embodiment, the preprocessing operation includes one or more of the following operations: resampling the audio, clipping the audio to a fixed length, and compressing the audio channels;
the data enhancement operation includes one or more of the following operations: mixing in noise audio at a random signal-to-noise ratio, randomly changing the audio volume, and randomly adding a reverberation effect.
In one embodiment, the multi-scale encoder is based on the Transformer architecture, is formed by stacking a plurality of feature capture modules, and down-samples the features by means of a pooling operation; each feature capture module comprises: a feature lifting module, a normalization layer and a feedforward neural network;
the feature lifting module is used for capturing key audio features and the relationships between features in the global scope: it obtains attention weights with a convolution layer, a fully connected layer and a Sigmoid function, enhances key features by element-wise matrix multiplication, and captures the relationships between features in the global scope with a multi-head attention mechanism; the normalization layer performs the normalization operation; the feedforward neural network captures long-term and short-term features with a bidirectional gated recurrent unit and extracts deep features in combination with a fully connected layer;
wherein different feature capture modules use different dilated convolution operations to capture features of different scales.
In one embodiment, the calculation formulas of the feature capture module are as follows:

X_mid^e = LayerNorm(FBM(X_in^e) + X_in^e)
X_out^e = LayerNorm(FNN(X_mid^e) + X_mid^e)

where X_in^e, X_mid^e and X_out^e are the input features, intermediate process features and output features of the feature capture module respectively, LayerNorm(·) is the layer normalization operation, FBM(·) is the feature lifting module operation, and FNN(·) is the feedforward neural network;
the calculation formulas of the feature lifting module are as follows:

X_mid^b = X_in^b ⊗ σ(FC(R(C_1D(X_in^b))))
X_out^b = MAM(X_mid^b) ⊕ X_mid^b

where X_in^b, X_mid^b and X_out^b are the input features, intermediate features and output features of the feature lifting module respectively; C_1D(·), FC(·) and R(·) are the one-dimensional convolution, fully connected layer and channel-adjustment operations respectively; ⊗ and ⊕ denote element-wise multiplication and addition of matrices respectively; σ denotes the Sigmoid activation function; MAM(·) denotes the multi-head attention mechanism operation.
In one embodiment, the multi-head attention mechanism first applies learnable linear transformations to the input features X to obtain the queries Q_i, keys K_i and values V_i, with the following calculation formulas:

Q_i = X W_i^Q, K_i = X W_i^K, V_i = X W_i^V

where W_i^Q, W_i^K and W_i^V are the weights of the respective fully connected layers;
secondly, the similarity between the queries and the keys is computed as a dot product and divided by a scaling factor;
then, a Softmax activation function is applied to obtain the weight corresponding to each value, and the weights are multiplied by the corresponding values:

head_i = Softmax(Q_i K_i^T / √d) V_i

finally, the results of all heads are concatenated and a linear projection is applied again to obtain the final output;
the specific calculation formula of the multi-head attention mechanism is as follows:

MAM(Q, K, V) = Concat(head_1, …, head_h) W_mh

where W_mh is a linear transformation matrix, h is the number of parallel attention layers, and d is the scaling factor;
the output of the multi-head attention mechanism is used as the input of the feedforward neural network, so that the final output characteristic is obtained;
the feedforward neural network comprises a gated recurrent unit, an activation function and a fully connected layer, with the following calculation formula:

FNN(X) = δ(W_fc · BiGRU(X) + b_fc)

where W_fc and b_fc denote the weights and the corresponding bias of the fully connected layer, δ denotes the ReLU activation function, and BiGRU(·) denotes the bidirectional gated recurrent unit; the gated recurrent unit comprises an update gate and a reset gate, with the following calculation formulas:

z_t = σ(W_z · [h_{t−1}, x_t])
r_t = σ(W_r · [h_{t−1}, x_t])
h̃_t = γ(W · [r_t ⊙ h_{t−1}, x_t])
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

where σ and γ denote the Sigmoid and Tanh activation functions respectively, and x_t, h_{t−1} and h_t are the feature input at the current time step, the hidden state at the previous time step and the hidden state at the current time step respectively.
In one embodiment, the long-short-term perception module adopts a dual-path architecture and comprises a gated recurrent unit, a one-dimensional convolution module, an instant layer normalization module and a channel-adjustment module; the gated recurrent unit captures the long-term and short-term features, the one-dimensional convolution module extracts deep features, and the instant layer normalization module performs feature normalization.
In one embodiment, the calculation formulas of the long-short-term perception module are as follows:

X_mid^l = iLN(C_1D(GRU(X_in^l))) ⊕ X_in^l
X_out^l = R(iLN(C_1D(GRU(R(X_mid^l))))) ⊕ X_mid^l

where GRU(·) is the gated recurrent unit, C_1D(·) is the one-dimensional convolution operation, iLN(·) is the instant layer normalization operation, R(·) is the channel-adjustment operation, and X_in^l, X_mid^l and X_out^l are the input features, intermediate features and output features of the long-short-term perception module respectively;
the calculation formula of the instant layer normalization module is as follows:

iLN(X_tf) = ε ⊗ (X_tf − μ(X_tf)) / √(σ²(X_tf) + λ) ⊕ β

where X_tf is the input feature, N and K are the dimensions of the feature, μ(·) and σ²(·) are the mean and variance operations respectively, ε and β are learnable parameters, and λ is a regularization parameter.
In one embodiment, the residual decoder comprises a plurality of decoding units, each decoding unit comprising a one-dimensional deconvolution module, a normalization module and an activation function; the input of each decoding unit is the output X_out^{d−1} of the previous decoding unit and the output X_out^e of the feature capture module at the same level, and the calculation formula is as follows:

X_out^d = θ(B(TC_1D(X_out^{d−1} ⊕ X_out^e)))

where TC_1D(·) is the one-dimensional deconvolution operation, B(·) is the batch normalization operation, θ is the PReLU activation function, and X_out^d is the output feature of the current decoding unit; the output of the decoder is the reconstructed speech signal.
In one embodiment, the mask estimation module consists of a one-dimensional convolution module and several different activation functions, with the following calculation formulas:

X_mid^m = δ(C_1D(X_in^m))
M_out = γ(C_1D(X_mid^m)) ⊗ σ(C_1D(X_mid^m))

where X_in^m, X_mid^m and M_out are the input features, intermediate process features and output mask of the mask estimation module respectively, and γ, δ and σ are the Tanh, ReLU and Sigmoid activation functions respectively;
the output of the mask estimation module is multiplied by the originally input speech signal to obtain the clean speech signal estimated by the model, with the following calculation formula:

X_est = M_out ⊗ X_in

where X_in is the originally input audio signal and X_est is the clean speech estimated by the model.
The long-short-term perception enhancement model is trained with a joint loss function consisting of a mean square error loss term and a signal-to-noise ratio loss term; the signal-to-noise ratio loss term realizes optimization on the speech waveform, and the mean square error loss term realizes optimization on the speech spectrogram; the mean square error loss term is taken in logarithmic form to ensure that it is of the same order of magnitude as the signal-to-noise ratio loss term.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides a real-time speech noise reduction method based on a long-short-term perception enhancement model built with deep learning, which has a small number of parameters, strong robustness and high real-time performance, and can be applied well to a variety of noise scenarios.
(2) The invention provides an encoder based on the Transformer architecture, which introduces an attention mechanism and a gated recurrent unit and helps to capture key features and to model long-term and short-term feature dependencies.
(3) The invention provides a feature lifting module based on an attention mechanism, which can effectively capture the relationships between audio features over different ranges and thereby strengthen the key audio features.
(4) The invention provides a long-short-term perception module based on a dual-path architecture, which can extract long-term and short-term features along different dimensions and thereby provide more discriminative features for speech enhancement.
Drawings
FIG. 1 is a flow chart of a real-time speech enhancement method according to the present invention.
FIG. 2 is a diagram of a long and short term perception enhancement model framework in the present invention.
FIG. 3 is a schematic diagram of a feature capture module of a multi-scale encoder according to the present invention.
FIG. 4 is a diagram of a multi-head attention mechanism according to the present invention.
Fig. 5 is a schematic diagram of a feature lifting module based on an attention mechanism in the present invention.
FIG. 6 is a schematic diagram of the gated recurrent unit in the present invention.
Fig. 7 is a schematic diagram of the long-short-term perception module based on a dual-path architecture in the present invention.
FIG. 8 is a graph showing the effect of speech enhancement under various noise conditions in the present invention.
Detailed Description
The following describes in detail, with reference to the drawings and embodiments, how the technical means are applied to solve the technical problems and achieve the technical effects. It should be understood that the detailed description is merely illustrative of the invention and is not intended to limit its scope. In addition, as long as no conflict arises, the embodiments of the present invention and the features of each embodiment may be combined with one another, and the resulting technical solutions all fall within the protection scope of the present invention.
The invention discloses a voice enhancement method under a plurality of noise environments, as shown in fig. 1, comprising the following steps:
step 1: and acquiring audio data, and performing preprocessing operation and data enhancement operation.
Step 1.1: completing an audio preprocessing operation
The deep-learning-based speech enhancement technique is a data-driven supervised learning method that requires the input audio to have a fixed length, so the audio must be divided into fixed-length segments. Considering that different recordings have different sampling rates, they first need to be resampled. The sampling rate of the audio can be adjusted to 16 kHz by means of an audio processing library and stored in WAV format. Since some audio may be multi-channel, a channel compression operation is required to convert it uniformly into single-channel audio data. For ease of computation, a fusion strategy of summing the channels and averaging is adopted directly, with the following calculation formula:

S_mono = (1/K) Σ_{i=1}^{K} S_i

where K is the number of audio channels, S_mono is the processed single-channel audio and S_i is the audio of a particular channel; the multi-channel audio signal is thus compressed into a single-channel audio signal.
Further, assuming the model's input audio length is 4 seconds, the audio must be clipped according to an audio clipping scheme so that each audio clip is 4 seconds long. Since the sampling rate is 16 kHz, each audio clip contains 64000 samples. Assuming the total number of samples in the audio is T, the audio clipping is computed as follows:

S_start = 64000 × (l − 1), S_end = min(64000 × l, T)

where l is a positive integer, and S_start and S_end are the indices of the start sample and the end sample of the l-th clip respectively. When the number of samples in a clipped segment does not reach 64000 but exceeds 50000, the segment is padded to 64000 samples by linear interpolation. When the number of samples in a clipped segment is below 50000, the segment is discarded.
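As an illustration of the preprocessing described above, the following sketch resamples to 16 kHz, averages the channels and cuts 4-second clips. It assumes the torchaudio library is available; the function name and thresholds simply mirror the description and are not a reference implementation.

```python
import torch
import torchaudio

TARGET_SR = 16000          # target sampling rate (16 kHz)
SEGMENT = 64000            # 4 s at 16 kHz
MIN_KEEP = 50000           # shorter remainders are discarded

def preprocess(path):
    wav, sr = torchaudio.load(path)                      # (channels, samples)
    if sr != TARGET_SR:                                  # resample to 16 kHz
        wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
    mono = wav.mean(dim=0)                               # channel compression by averaging
    segments = []
    for start in range(0, mono.numel(), SEGMENT):        # cut into 4-second clips
        clip = mono[start:start + SEGMENT]
        if clip.numel() == SEGMENT:
            segments.append(clip)
        elif clip.numel() > MIN_KEEP:                     # pad to 64000 samples by linear interpolation
            clip = torch.nn.functional.interpolate(
                clip.view(1, 1, -1), size=SEGMENT, mode="linear",
                align_corners=False).view(-1)
            segments.append(clip)
        # remainders below MIN_KEEP samples are dropped
    return segments
```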
Step 1.2: completing data enhancement operations for audio
Considering the complexity and variability of the model's application scenarios, the robustness of the model needs to be improved with data enhancement techniques. To increase the complexity of the audio, three audio data enhancement methods are introduced here: mixing in noise audio at a random signal-to-noise ratio, randomly changing the audio volume, and randomly adding a reverberation effect.
The random noise-mixing operation introduces additional background noise data and mixes it into the input audio according to a random signal-to-noise ratio. By way of example, a variety of common noises can be selected, such as electric drills, car horns, crowd noise, barking, applause, bird song, gunshots, frog calls, machinery and music. The specific flow of this data enhancement is to first draw a signal-to-noise ratio uniformly at random from the range [-15, 15], scale the original speech according to the drawn signal-to-noise ratio, and add the noise audio to the scaled result, thereby obtaining the noisy mixed audio.
The random volume-change operation amplifies or attenuates the volume of the input audio by a random scaling factor: a scaling factor is drawn uniformly at random from the range [0, 2] and multiplied with the original audio to obtain audio whose volume has been randomly adjusted.
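A minimal sketch of the first two data enhancement operations is given below. The noise-scaling formula used to reach the drawn signal-to-noise ratio is a common convention and an assumption here, not a quotation of the patented procedure; function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng()

def mix_at_random_snr(speech, noise, low=-15.0, high=15.0):
    """Mix noise into speech at an SNR drawn uniformly from [low, high] dB."""
    snr_db = rng.uniform(low, high)
    if len(noise) < len(speech):                          # tile noise to cover the speech length
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))  # scale noise to the target SNR
    return speech + scale * noise

def random_volume(audio, low=0.0, high=2.0):
    """Scale the volume by a factor drawn uniformly from [low, high]."""
    return rng.uniform(low, high) * audio
```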
The flow of randomly adding a reverberation effect includes the following aspects: creating a room (defining the room size, the desired reverberation time, the wall materials and the maximum number of reflections allowed), creating a signal source in the room, placing a microphone in the room, creating the room impulse response, simulating sound propagation, and synthesizing the reverberation effect. In this embodiment, the reverberation effect can be added to the speech data directly by means of the pyroomacoustics library.
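The room simulation itself can be sketched with the pyroomacoustics library mentioned above; the room geometry, absorption value and source/microphone positions below are placeholder assumptions.

```python
import numpy as np
import pyroomacoustics as pra

def add_reverb(speech, fs=16000):
    """Convolve speech with a simulated room impulse response (illustrative room geometry)."""
    room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs,
                       materials=pra.Material(0.3),   # wall absorption (assumed value)
                       max_order=10)                  # maximum number of reflections
    room.add_source([2.0, 2.5, 1.5], signal=speech)   # place the signal source
    mic = pra.MicrophoneArray(np.array([[3.5], [2.0], [1.2]]), fs)
    room.add_microphone_array(mic)                    # place a microphone
    room.simulate()                                   # build the RIR and propagate the sound
    wet = room.mic_array.signals[0]
    return wet[:len(speech)]
```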
Step 2: deep audio features are extracted by means of a multi-scale encoder.
The invention designs an efficient long-short-term perception enhancement model by means of deep learning, and feeds the audio processed in step 1 into this model, thereby realizing real-time speech enhancement under various kinds of noise. Fig. 2 shows the overall architecture of the model, which mainly comprises a multi-scale encoder, a long-short-term perception module and a residual decoder. The multi-scale encoder is mainly used to compress the audio features and extract deep features, and the residual decoder is mainly used to reconstruct the audio signal. In this embodiment, the multi-scale encoder is based on the Transformer architecture and is mainly composed of a stack of feature capture modules, five in this embodiment. Each feature capture module in turn comprises: a feature lifting module, a normalization layer and a feedforward neural network.
Fig. 3 shows the details of a feature capture module in the Transformer-based multi-scale encoder; its specific calculation formulas are as follows:

X_mid^e = LayerNorm(FBM(X_in^e) + X_in^e)
X_out^e = LayerNorm(FNN(X_mid^e) + X_mid^e)

where X_in^e, X_mid^e and X_out^e are the input features, intermediate process features and output features of the feature capture module respectively, LayerNorm(·) is the layer normalization operation, FBM(·) is the feature lifting module operation, and FNN(·) is the feedforward neural network. In addition, the feature capture module introduces residual connections to preserve the original features, and uses the attention-based feature lifting module to capture and strengthen key features. Fig. 4 shows the details of the multi-head attention mechanism used in this module. For the overall flow of the feature capture module, in order to effectively capture key audio features and model long-term and short-term feature dependencies, the acquired features X_in^e are first fed into the feature lifting module, which captures the key long-term and short-term features based on the attention mechanism; feature normalization is then realized by a layer normalization operation, deep features are captured by the feedforward neural network, and the result is finally processed by another layer normalization operation. Furthermore, feature down-sampling between different feature capture modules is realized by a max-pooling operation, and different feature capture modules use different dilated convolution operations to capture features of different scales.
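A minimal PyTorch-style sketch of one feature capture module is given below; the layer sizes are assumptions, and nn.MultiheadAttention stands in for the feature lifting module, which is sketched separately further on.

```python
import torch
import torch.nn as nn

class FeatureCaptureBlock(nn.Module):
    """One encoder block: feature lifting (attention) -> LayerNorm -> feed-forward -> LayerNorm,
    with residual connections around both sub-blocks (names and sizes are illustrative)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.boost = nn.MultiheadAttention(dim, heads, batch_first=True)  # stand-in for the FBM
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.ReLU(), nn.Linear(dim * 2, dim))
        self.norm2 = nn.LayerNorm(dim)
        self.pool = nn.MaxPool1d(2)            # down-sampling between blocks

    def forward(self, x):                      # x: (batch, frames, dim)
        a, _ = self.boost(x, x, x)
        x = self.norm1(a + x)                  # residual + layer norm
        x = self.norm2(self.ffn(x) + x)        # residual + layer norm
        return self.pool(x.transpose(1, 2)).transpose(1, 2)   # halve the frame axis
```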
The feature lifting module is a core component of the feature capture module; it captures the key audio features and the relationships between features in the global scope, i.e., it effectively captures and enhances the important features. Fig. 5 shows the detailed architecture of this module. The module obtains attention weights by means of a convolution layer, a fully connected layer and a Sigmoid function, and enhances key features by element-wise matrix multiplication. At the same time, the relationships between features over a larger range can be captured by means of a multi-head attention mechanism, so that harmonic artifacts are suppressed as much as possible. The specific calculation formulas are as follows:

X_mid^b = X_in^b ⊗ σ(FC(R(C_1D(X_in^b))))
X_out^b = MAM(X_mid^b) ⊕ X_mid^b

where X_in^b, X_mid^b and X_out^b are the input features, intermediate features and output features of the feature lifting module respectively; C_1D(·), FC(·) and R(·) are the one-dimensional convolution, fully connected layer and channel-adjustment operations respectively; ⊗ and ⊕ denote element-wise multiplication and addition of matrices respectively; σ denotes the Sigmoid activation function, which facilitates computing the weights corresponding to the key features; and MAM(·) denotes the multi-head attention mechanism operation. The compression of the feature channels is realized with a one-dimensional convolution whose kernel size is 1, the corresponding weight matrix is obtained by means of the fully connected layer and the Sigmoid activation function, and the key features are finally strengthened by element-wise matrix multiplication. For the multi-head attention mechanism, learnable linear transformations are first applied to the input features X to obtain the queries Q_i, keys K_i and values V_i, with the following calculation formulas:

Q_i = X W_i^Q, K_i = X W_i^K, V_i = X W_i^V
where W_i^Q, W_i^K and W_i^V are the weights of the respective fully connected layers. Next, the similarity between the queries and the keys is computed as a dot product and divided by a scaling factor. Then, a Softmax activation function is applied to obtain the weight corresponding to each value, and the weights are multiplied by the corresponding values:

head_i = Softmax(Q_i K_i^T / √d) V_i

Finally, the results of all heads are concatenated and a linear projection is applied again to obtain the final output. The specific calculation formula of the multi-head attention mechanism is as follows:

MAM(Q, K, V) = Concat(head_1, …, head_h) W_mh

where W_mh is a linear transformation matrix, h is the number of parallel attention layers and d is the scaling factor. The output of the multi-head attention module serves as the input of the feedforward neural network to obtain the final output features. Residual connections and layer normalization operations are also introduced in this module to further improve feature extraction.
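The attention-based feature lifting module can be sketched as follows; the exact layer sizes and the placement of the residual connection are assumptions based on the description above.

```python
import torch
import torch.nn as nn

class FeatureLiftingModule(nn.Module):
    """Attention-weighted feature boosting followed by multi-head self-attention.
    Only the overall flow follows the description; layer sizes are assumptions."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.squeeze = nn.Conv1d(dim, dim, kernel_size=1)   # 1-D convolution with kernel size 1
        self.fc = nn.Linear(dim, dim)                        # fully connected layer
        self.mam = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                                    # x: (batch, frames, dim)
        w = self.squeeze(x.transpose(1, 2)).transpose(1, 2)  # channel interaction
        w = torch.sigmoid(self.fc(w))                        # attention weights in (0, 1)
        boosted = x * w                                      # element-wise key-feature boosting
        attn, _ = self.mam(boosted, boosted, boosted)        # global relationships between features
        return self.norm(attn + boosted)                     # residual connection + layer norm
```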
The feedforward neural network mainly comprises a gated recurrent unit, an activation function and a fully connected layer: the capture of long-term and short-term features is mainly realized by the bidirectional gated recurrent unit, and the extraction of deep features is realized in combination with the fully connected layer, with the following calculation formula:

FNN(X) = δ(W_fc · BiGRU(X) + b_fc)

where W_fc and b_fc denote the weights and the corresponding bias of the fully connected layer, δ denotes the ReLU activation function, and BiGRU(·) denotes the bidirectional gated recurrent unit. The bidirectional gated recurrent unit is used here to capture the audio features: it not only captures long-term and short-term features effectively, but also avoids the heavy computation of an LSTM. Moreover, this approach tends to give more satisfactory results than using a fully connected layer alone, and compared with a one-dimensional convolution, the bidirectional gated recurrent unit can perceive relationships between more distant features and automatically attend to the more important ones. Fig. 6 shows the implementation details of the gated recurrent unit, which mainly comprises an update gate and a reset gate, with the following calculation formulas:

z_t = σ(W_z · [h_{t−1}, x_t])
r_t = σ(W_r · [h_{t−1}, x_t])
h̃_t = γ(W · [r_t ⊙ h_{t−1}, x_t])
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

where σ and γ denote the Sigmoid and Tanh activation functions respectively, and x_t, h_{t−1} and h_t are the feature input at the current time step, the hidden state at the previous time step and the hidden state at the current time step respectively.
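A sketch of the feedforward sub-block with a bidirectional gated recurrent unit follows; the hidden sizes are assumptions.

```python
import torch
import torch.nn as nn

class GRUFeedForward(nn.Module):
    """Feed-forward sub-block: bidirectional GRU -> fully connected layer -> ReLU."""
    def __init__(self, dim=64, hidden=64):
        super().__init__()
        self.bigru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, dim)   # project the two directions back to dim
        self.act = nn.ReLU()

    def forward(self, x):                      # x: (batch, frames, dim)
        h, _ = self.bigru(x)                   # capture long- and short-term dependencies
        return self.act(self.fc(h))
```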
Step 3: Features along different dimensions are captured by means of the long-short-term perception module.
For the speech features extracted by the multi-scale encoder, the relationships between features along different dimensions need to be processed further. The invention therefore designs a long-short-term perception module with a dual-path architecture, which can effectively capture long-term and short-term audio features along different dimensions and thus resolve the long- and short-term dependencies among the features. Fig. 7 shows the detailed architecture of the long-short-term perception module. The module realizes long-term and short-term feature capture in the time dimension and the feature dimension respectively, by means of a gated recurrent unit, a one-dimensional convolution operation, an instant layer normalization operation and a channel-adjustment operation. It is worth noting that the instant layer normalization operation is adopted in this embodiment instead of the conventional layer normalization operation, which reduces the model's sensitivity to the energy of the input signal. Meanwhile, in order to preserve the original features, the module also introduces the idea of residual connections. In both the time dimension and the feature dimension, the gated recurrent unit extracts long-term and short-term features over different ranges, the one-dimensional convolution operation captures deep features, and the instant layer normalization operation then normalizes the features.
The specific calculation formulas of this module are as follows:

X_mid^l = iLN(C_1D(GRU(X_in^l))) ⊕ X_in^l
X_out^l = R(iLN(C_1D(GRU(R(X_mid^l))))) ⊕ X_mid^l

where GRU(·) is the gated recurrent unit, C_1D(·) is the one-dimensional convolution operation, iLN(·) is the instant layer normalization operation, and R(·) is the channel-adjustment operation; X_in^l, X_mid^l and X_out^l are the input features, intermediate features and output features of the module respectively. When the features enter the network, the GRU first captures the long-term and short-term features in the time dimension, the one-dimensional convolution then extracts deeper features, and the instant layer normalization operation performs feature normalization. The GRU is used here because it requires less computational resources and less time than an LSTM while achieving comparable results; it contains only a reset gate and an update gate, which effectively handles long-term memory. The instant layer normalization operation used in this module is computed as follows:

iLN(X_tf) = ε ⊗ (X_tf − μ(X_tf)) / √(σ²(X_tf) + λ) ⊕ β

where X_tf is the input feature, N and K are the dimensions of the feature, μ(·) and σ²(·) are the mean and variance operations respectively, ε and β are learnable parameters, and λ is a regularization parameter. This normalization reduces the model's sensitivity to the energy of the input signal. To capture long-term and short-term features in the feature dimension, the two axes of the feature map are exchanged, the GRU then captures the long-term and short-term features, the one-dimensional convolution extracts deep features, the instant layer normalization operation normalizes them, and the output features are finally obtained through the channel-adjustment operation.
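A dual-path sketch following the flow just described; treating the frame count as fixed and using GroupNorm as a stand-in for instant layer normalization are assumptions, as are all layer sizes.

```python
import torch
import torch.nn as nn

class _PathBlock(nn.Module):
    """GRU -> 1-D conv -> normalization with a residual connection, applied along one axis."""
    def __init__(self, feat):
        super().__init__()
        self.gru = nn.GRU(feat, feat, batch_first=True)
        self.conv = nn.Conv1d(feat, feat, kernel_size=1)
        self.norm = nn.GroupNorm(1, feat)     # stand-in for instant layer normalization

    def forward(self, x):                     # x: (batch, steps, feat)
        h, _ = self.gru(x)
        h = self.norm(self.conv(h.transpose(1, 2))).transpose(1, 2)
        return h + x                          # keep the original features

class LongShortTermPerception(nn.Module):
    """Dual-path module: one pass along the time axis, one along the feature axis."""
    def __init__(self, dim=64, frames=200):
        super().__init__()
        self.time_path = _PathBlock(dim)
        self.feat_path = _PathBlock(frames)

    def forward(self, x):                     # x: (batch, frames, dim)
        x = self.time_path(x)                 # long/short-term features over time
        x = self.feat_path(x.transpose(1, 2)) # exchange axes, process the feature dimension
        return x.transpose(1, 2)
```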
Step 4: the enhanced clean speech is obtained by means of a residual decoder.
In order to obtain clean speech, the speech signal must first be reconstructed by the residual decoder. The residual decoder mainly comprises a plurality of decoding units, five in this embodiment, which progressively realize the estimation used for the spectrogram mask. Each decoding unit mainly consists of a one-dimensional deconvolution operation, a normalization operation and an activation function. Meanwhile, in order to reconstruct the speech signal well, the input of each decoding unit comprises two parts: the output X_out^{d−1} of the previous decoding unit and the output X_out^e of the feature capture module at the same level. The decoding unit uses the one-dimensional deconvolution to realize feature extraction and up-sampling simultaneously, and increases the nonlinear capability of the model by means of the PReLU activation function. The specific calculation formula is as follows:

X_out^d = θ(B(TC_1D(X_out^{d−1} ⊕ X_out^e)))

where TC_1D(·) is the one-dimensional deconvolution operation, which is mainly used to realize feature extraction and up-sampling, B(·) is the batch normalization operation, θ is the PReLU activation function, and X_out^d is the output feature of the current decoding unit; the output of the decoder is the reconstructed speech signal. A mask estimation module is then required to process the reconstructed speech signal output by the decoder and estimate the mask of the clean speech signal, thereby generating the clean speech mask. The mask estimation module consists of a one-dimensional convolution operation and several different activation functions, with the following calculation formulas:

X_mid^m = δ(C_1D(X_in^m))
M_out = γ(C_1D(X_mid^m)) ⊗ σ(C_1D(X_mid^m))

where X_in^m, X_mid^m and M_out are the input features, intermediate process features and output mask of the mask estimation module respectively, and γ, δ and σ are the Tanh, ReLU and Sigmoid activation functions respectively. The clean speech signal estimated by the model is obtained by multiplying the output of the mask estimation module with the originally input speech signal, with the following calculation formula:

X_est = M_out ⊗ X_in

where X_in is the originally input audio signal and X_est is the clean speech estimated by the model.
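A sketch of one decoding unit and a gated mask head; how the skip connection is merged and how the gating is arranged are assumptions, since the text above only lists the operations involved.

```python
import torch
import torch.nn as nn

class DecodingUnit(nn.Module):
    """One residual-decoder unit: 1-D transposed convolution -> batch norm -> PReLU.
    The skip connection from the encoder is merged by addition here (an assumption)."""
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=2, stride=2)  # up-sampling
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.PReLU()

    def forward(self, x, skip):                # x, skip: (batch, channels, frames), same shape
        return self.act(self.bn(self.deconv(x + skip)))

class MaskEstimator(nn.Module):
    """Mask head built from 1-D convolutions and Tanh/ReLU/Sigmoid activations;
    the exact gating arrangement is an assumption."""
    def __init__(self, ch=64):
        super().__init__()
        self.pre = nn.Conv1d(ch, ch, kernel_size=1)
        self.gate_a = nn.Conv1d(ch, 1, kernel_size=1)
        self.gate_b = nn.Conv1d(ch, 1, kernel_size=1)

    def forward(self, x, noisy):               # x: decoder output at waveform resolution
        h = torch.relu(self.pre(x))
        mask = torch.tanh(self.gate_a(h)) * torch.sigmoid(self.gate_b(h))
        return mask * noisy                    # element-wise masking of the input audio
```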
The model and its processing flow are as described above; the model further needs to be trained and tested to obtain a model that meets the requirements.
Specifically, to accomplish the supervised training of the model, the invention introduces a joint loss function comprising two parts: a signal-to-noise ratio loss term f(·) and a mean square error loss term MSE(·). The former is mainly used to realize optimization on the speech waveform, and the latter to realize optimization on the speech spectrogram. Furthermore, the mean square error loss term is taken in logarithmic form to ensure that it is of the same order of magnitude as the signal-to-noise ratio loss term.
The specific expression of the loss function is as follows:

L = −f(s, ŝ) + log( MSE(S_r, Ŝ_r) + MSE(S_i, Ŝ_i) + MSE(|S|, |Ŝ|) )

where s and ŝ are the clean audio and the audio estimated by the model respectively, S_r and Ŝ_r are the real parts of the clean spectrogram and of the spectrogram estimated by the model, S_i and Ŝ_i are the corresponding imaginary parts, and |S| and |Ŝ| are the magnitudes of the clean spectrogram and of the spectrogram estimated by the model. The mean square error loss term thus measures the differences in the real part, the imaginary part and the magnitude between the estimated spectrogram and the true spectrogram, and it is taken in logarithmic form so that it stays on the same order of magnitude as the signal-to-noise ratio loss term. The signal-to-noise ratio loss term constrains the amplitude of the output and avoids level offsets between the input and the output; its specific calculation formula is as follows:

f(s, ŝ) = 10 · log10( ||s||² / ||s − ŝ||² )
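A sketch of a joint objective of this shape (waveform SNR term plus log-compressed spectral MSE); the STFT settings and the exact weighting are assumptions.

```python
import torch

def joint_loss(clean, estimate, n_fft=512, hop=128):
    """Joint training objective: waveform SNR term plus log-compressed spectral MSE term."""
    eps = 1e-8
    # signal-to-noise-ratio term on the waveform (maximised, hence the minus sign)
    snr = 10 * torch.log10(clean.pow(2).sum(-1) / ((clean - estimate).pow(2).sum(-1) + eps) + eps)
    snr_loss = -snr.mean()
    # mean-square-error term on real part, imaginary part and magnitude of the spectrogram
    win = torch.hann_window(n_fft, device=clean.device)
    S = torch.stft(clean, n_fft, hop, window=win, return_complex=True)
    E = torch.stft(estimate, n_fft, hop, window=win, return_complex=True)
    mse = ((S.real - E.real) ** 2).mean() + ((S.imag - E.imag) ** 2).mean() \
        + ((S.abs() - E.abs()) ** 2).mean()
    return snr_loss + torch.log(mse + eps)     # log keeps both terms on a similar scale
```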
in order to be able to demonstrate the effectiveness of the method according to the invention, relevant experimental tests have been carried out. A large amount of noise audio frequencies are fused on the basis of the existing pure voice, so that the collected voice under various noises is simulated. The noise categories selected here are: electric drill sound, whistling sound, noise sound, barking sound, drum sound, bird song sound, gun sound, frog song sound, machine sound, music sound. Meanwhile, the effect of voice enhancement is measured by means of three commonly used evaluation indexes of voice enhancement, which are respectively as follows: perceptual speech quality assessment (PESQ), short-term speech intelligibility (STOI), and Source Artifact Ratio (SAR). Wherein, PESQ and STOI are both evaluation methods of perception level, which are both that the larger the value is, the better the effect of speech enhancement is. For STOI, the calculation process mainly comprises three steps: removing the mute frame; 1/3 times of frequency band decomposition of DFT is completed on the signal; the correlation coefficients before and after the enhancement are calculated and averaged. For PESQ, which requires a noisy attenuated signal and an original reference signal, the computation involves preprocessing, time alignment, perceptual filtering, masking effects, etc. It can provide a subjective predictor for objective speech quality assessment and can be mapped to a MOS scale range, with a score range between-0.5 and 4.5. In addition, the evaluation index SAR may be regarded as an evaluation index of a signal level, and the larger the value thereof is, the better the effect of speech enhancement is, and the specific calculation formula is as follows:
in the formula e interf 、e noise And e artif Error signals, s, introduced by interference, noise and artefacts, respectively target Then it is the target signal. Table 1 shows the effect of the present invention on the above evaluation index compared with the main stream method. It was found that it could be improved by about 16% over the mainstream speech enhancement model Demucs on PESQ evaluation index and about 16% over the mainstream speech enhancement model MannerNet on SAR evaluation index. At the same time, an excellent performance of 0.94 can be achieved on the evaluation index STOI. In addition, for ten different noise interference environments, fig. 8 shows a voice effect diagram after noise reduction based on the long-short-period perception enhancement model provided by the invention, which can obtain a satisfactory effect.
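PESQ and STOI can be computed with the third-party pesq and pystoi packages; the snippet below is an evaluation convenience, not part of the patented method.

```python
from pesq import pesq
from pystoi import stoi

def evaluate(clean, enhanced, fs=16000):
    """Perception-level scores for one utterance; both inputs are 1-D numpy arrays at 16 kHz."""
    pesq_score = pesq(fs, clean, enhanced, 'wb')             # wide-band PESQ, roughly -0.5 to 4.5
    stoi_score = stoi(clean, enhanced, fs, extended=False)   # STOI in the range 0 to 1
    return pesq_score, stoi_score
```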
TABLE 1 Comparison of the long-short-term perception enhancement model of the present invention with mainstream speech enhancement models

Model | PESQ | STOI | SAR
---|---|---|---
Demucs | 2.08 | 0.93 | 18.70
MannerNet | 2.22 | 0.94 | 17.41
Long-short-term perception enhancement model | 2.41 | 0.94 | 20.27
Claims (10)
1. A method for speech enhancement in a multiple noise environment, comprising the steps of:
step 1: performing preprocessing and data enhancement operations on the acquired audio data, and feeding the processed audio data into a long-short-term perception enhancement model; the long-short-term perception enhancement model comprises a multi-scale encoder, a long-short-term perception module and a residual decoder;
step 2: extracting deep audio features of the processed audio data by using the multi-scale encoder;
step 3: capturing features along different dimensions with the long-short-term perception module;
step 4: reconstructing the speech signal with the residual decoder, estimating the mask of the clean speech with a mask estimation module, and multiplying the mask with the original input audio to obtain the enhanced clean speech; training of the model is accomplished by means of a joint loss function.
2. The method of claim 1, wherein the preprocessing operation comprises one or more of the following operations: resampling the audio, clipping the audio to a fixed length, and compressing the audio channels;
the data enhancement operation comprises one or more of the following operations: mixing in noise audio at a random signal-to-noise ratio, randomly changing the audio volume, and randomly adding a reverberation effect.
3. The method according to claim 1, wherein the multi-scale encoder is based on the Transformer architecture, is formed by stacking a plurality of feature capture modules, and down-samples the features by means of a pooling operation; each feature capture module comprises: a feature lifting module, a normalization layer and a feedforward neural network;
the feature lifting module is used for capturing key audio features and the relationships between features in the global scope: it obtains attention weights with a convolution layer, a fully connected layer and a Sigmoid function, enhances key features by element-wise matrix multiplication, and captures the relationships between features in the global scope with a multi-head attention mechanism; the normalization layer performs the normalization operation; the feedforward neural network captures long-term and short-term features with a bidirectional gated recurrent unit and extracts deep features in combination with a fully connected layer;
wherein different feature capture modules use different dilated convolution operations to capture features of different scales.
4. The method for speech enhancement in multiple noise environments according to claim 3, wherein the calculation formulas of the feature capture module are as follows:

X_mid^e = LayerNorm(FBM(X_in^e) + X_in^e)
X_out^e = LayerNorm(FNN(X_mid^e) + X_mid^e)

where X_in^e, X_mid^e and X_out^e are the input features, intermediate process features and output features of the feature capture module respectively, LayerNorm(·) is the layer normalization operation, FBM(·) is the feature lifting module operation, and FNN(·) is the feedforward neural network;
the calculation formulas of the feature lifting module are as follows:

X_mid^b = X_in^b ⊗ σ(FC(R(C_1D(X_in^b))))
X_out^b = MAM(X_mid^b) ⊕ X_mid^b

where X_in^b, X_mid^b and X_out^b are the input features, intermediate features and output features of the feature lifting module respectively; C_1D(·), FC(·) and R(·) are the one-dimensional convolution, fully connected layer and channel-adjustment operations respectively; ⊗ and ⊕ denote element-wise multiplication and addition of matrices respectively; σ denotes the Sigmoid activation function; MAM(·) denotes the multi-head attention mechanism operation.
5. The method of claim 4, wherein the multi-head attention mechanism first applies learnable linear transformations to the input features X to obtain the queries Q_i, keys K_i and values V_i, with the following calculation formulas:

Q_i = X W_i^Q, K_i = X W_i^K, V_i = X W_i^V

where W_i^Q, W_i^K and W_i^V are the weights of the respective fully connected layers;
secondly, the similarity between the queries and the keys is computed as a dot product and divided by a scaling factor;
then, a Softmax activation function is applied to obtain the weight corresponding to each value, and the weights are multiplied by the corresponding values:

head_i = Softmax(Q_i K_i^T / √d) V_i

finally, the results of all heads are concatenated and a linear projection is applied again to obtain the final output;
the specific calculation formula of the multi-head attention mechanism is as follows:

MAM(Q, K, V) = Concat(head_1, …, head_h) W_mh

where W_mh is a linear transformation matrix, h is the number of parallel attention layers, and d is the scaling factor;
the output of the multi-head attention mechanism serves as the input of the feedforward neural network, so as to obtain the final output features;
the feedforward neural network comprises a gated recurrent unit, an activation function and a fully connected layer, with the following calculation formula:

FNN(X) = δ(W_fc · BiGRU(X) + b_fc)

where W_fc and b_fc denote the weights and the corresponding bias of the fully connected layer, δ denotes the ReLU activation function, and BiGRU(·) denotes the bidirectional gated recurrent unit; the gated recurrent unit comprises an update gate and a reset gate, with the following calculation formulas:

z_t = σ(W_z · [h_{t−1}, x_t])
r_t = σ(W_r · [h_{t−1}, x_t])
h̃_t = γ(W · [r_t ⊙ h_{t−1}, x_t])
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

where σ and γ denote the Sigmoid and Tanh activation functions respectively, and x_t, h_{t−1} and h_t are the feature input at the current time step, the hidden state at the previous time step and the hidden state at the current time step respectively.
6. The method for speech enhancement in multiple noise environments according to claim 1, wherein the long-short-term perception module adopts a dual-path architecture and comprises a gated recurrent unit, a one-dimensional convolution module, an instant layer normalization module and a channel-adjustment module; the gated recurrent unit captures the long-term and short-term features, the one-dimensional convolution module extracts deep features, and the instant layer normalization module performs feature normalization.
7. The method for speech enhancement in multiple noise environments according to claim 6, wherein the calculation formulas of the long-short-term perception module are as follows:

X_mid^l = iLN(C_1D(GRU(X_in^l))) ⊕ X_in^l
X_out^l = R(iLN(C_1D(GRU(R(X_mid^l))))) ⊕ X_mid^l

where GRU(·) is the gated recurrent unit, C_1D(·) is the one-dimensional convolution operation, iLN(·) is the instant layer normalization operation, R(·) is the channel-adjustment operation, and X_in^l, X_mid^l and X_out^l are the input features, intermediate features and output features of the long-short-term perception module respectively;
the calculation formula of the instant layer normalization module is as follows:

iLN(X_tf) = ε ⊗ (X_tf − μ(X_tf)) / √(σ²(X_tf) + λ) ⊕ β

where X_tf is the input feature, N and K are the dimensions of the feature, μ(·) and σ²(·) are the mean and variance operations respectively, ε and β are learnable parameters, and λ is a regularization parameter.
8. The method of claim 1, wherein the residual decoder comprises a plurality of decoding units, each decoding unit comprising a one-dimensional deconvolution module, a normalization module and an activation function; the input of each decoding unit is the output X_out^{d−1} of the previous decoding unit and the output X_out^e of the feature capture module at the same level, and the calculation formula is as follows:

X_out^d = θ(B(TC_1D(X_out^{d−1} ⊕ X_out^e)))

where TC_1D(·) is the one-dimensional deconvolution operation, B(·) is the batch normalization operation, θ is the PReLU activation function, and X_out^d is the output feature of the current decoding unit; the output of the decoder is the reconstructed speech signal.
9. The method for speech enhancement in multiple noise environments according to claim 1, wherein the mask estimation module consists of a one-dimensional convolution module and several different activation functions, with the following calculation formulas:

X_mid^m = δ(C_1D(X_in^m))
M_out = γ(C_1D(X_mid^m)) ⊗ σ(C_1D(X_mid^m))

where X_in^m, X_mid^m and M_out are the input features, intermediate process features and output mask of the mask estimation module respectively, and γ, δ and σ are the Tanh, ReLU and Sigmoid activation functions respectively;
the output of the mask estimation module is multiplied by the originally input speech signal to obtain the clean speech signal estimated by the model, with the following calculation formula:

X_est = M_out ⊗ X_in

where X_in is the originally input audio signal and X_est is the clean speech estimated by the model.
10. The method for speech enhancement in multiple noise environments according to claim 1, wherein the long-short-term perception enhancement model is trained with a joint loss function consisting of a mean square error loss term and a signal-to-noise ratio loss term; the signal-to-noise ratio loss term realizes optimization on the speech waveform, and the mean square error loss term realizes optimization on the speech spectrogram; the mean square error loss term is taken in logarithmic form to ensure that it is of the same order of magnitude as the signal-to-noise ratio loss term.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211637892.7A CN116013344A (en) | 2022-12-17 | 2022-12-17 | Speech enhancement method under multiple noise environments |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211637892.7A CN116013344A (en) | 2022-12-17 | 2022-12-17 | Speech enhancement method under multiple noise environments |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116013344A true CN116013344A (en) | 2023-04-25 |
Family
ID=86032687
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211637892.7A Pending CN116013344A (en) | 2022-12-17 | 2022-12-17 | Speech enhancement method under multiple noise environments |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116013344A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117351983A (en) * | 2023-12-06 | 2024-01-05 | 深圳市龙芯威半导体科技有限公司 | Transformer-based voice noise reduction method and system |
CN117351983B (en) * | 2023-12-06 | 2024-04-19 | 深圳市龙芯威半导体科技有限公司 | Transformer-based voice noise reduction method and system |
CN117711417A (en) * | 2024-02-05 | 2024-03-15 | 武汉大学 | Voice quality enhancement method and system based on frequency domain self-attention network |
CN117711417B (en) * | 2024-02-05 | 2024-04-30 | 武汉大学 | Voice quality enhancement method and system based on frequency domain self-attention network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |