CN113674753A - New speech enhancement method - Google Patents

New speech enhancement method

Info

Publication number
CN113674753A
CN113674753A (application CN202110916018.6A)
Authority
CN
China
Prior art keywords
output
speech
voice
convolution
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110916018.6A
Other languages
Chinese (zh)
Other versions
CN113674753B (en)
Inventor
陈彦男
邹波蓉
王伟东
景浩
李辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology
Priority to CN202110916018.6A
Publication of CN113674753A
Application granted
Publication of CN113674753B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/022 Blocking, i.e. grouping of samples in time; Choice of analysis windows; Overlap factoring
    • G10L19/025 Detection of transients or attacks for time/frequency resolution switching
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a new speech enhancement method. It addresses the problem that existing non-end-to-end deep neural network speech enhancement methods neglect the importance of learning the phase spectrum, so that the quality of the enhanced speech is unsatisfactory and user needs are not well met. In the method, the noisy speech is first divided into blocks; a non-local module is added to the convolutional layers of the encoding end; a gated recurrent unit network is then added to capture the temporal correlation information among the speech sequences; finally, a one-dimensional convolutional layer adjusts the dimensionality of the output result, and the output enhanced speech blocks are spliced in order, thereby improving the quality and intelligibility of the enhanced speech.

Description

New speech enhancement method
Technical Field
The invention provides a new speech enhancement method and relates to the field of speech signal processing.
Background
In the course of human development, speech has become the most practical means of everyday communication because it is convenient and efficient. In modern society in particular, advances in technology have driven the broad development of speech signal processing, giving rise to a series of products that use speech as a carrier, such as smart speakers and earphones, voice assistants, smart recording pens and translators; such intelligent devices greatly facilitate daily life. In actual use, however, environmental noise, electrical noise inside the device and the voices of other speakers seriously degrade device performance and can even cause speech distortion. Effectively solving the problem of using intelligent devices in noisy environments is therefore of great significance for improving the user experience, and speech enhancement, as a speech processing technology, can address this problem well. Speech enhancement aims to improve the quality and intelligibility of speech signals contaminated by noise; a corresponding enhancement algorithm is designed to extract the clean speech signal from the noisy speech and suppress the interference of background noise. Although speech enhancement technology can reduce the adverse effects of noise to a certain extent, the diversity, complexity and abruptness of noise in real environments undoubtedly place higher demands on it.
The speech enhancement task originates from the "cocktail party problem," which describes the following phenomenon: at a party filled with all kinds of environmental sounds, people can easily focus on a target voice of interest while paying less attention to other background sounds. This auditory selective attention is an adaptive capability of the human auditory system that machines do not possess, so designing a computational auditory model that can acquire target information in a noisy environment has long been the research focus of this problem. As a key step in speech signal processing, speech enhancement has broad application prospects in fields such as speech recognition and mobile communication. Most existing speech recognition systems are built under idealized conditions, that is, on speech data recorded without noise; in a noisy environment, and especially in a strong-noise environment, the recognition accuracy of such a system drops sharply. Speech enhancement can serve as the front end of a recognition system, preprocessing the noisy speech to improve the recognition rate. In mobile communication, both parties are often subject to noise interference in real scenes; in particular, when one party is on the street or in a restaurant, the signal is easily distorted during transmission, which seriously degrades communication quality and the subjective listening experience of both parties. Applying an appropriate speech enhancement technique at the transmitting end can effectively filter out the noise and thereby improve the speech quality at the receiving end.
The academic community has studied the speech enhancement problem for decades. Unlike tasks such as text classification or object detection, the processing of speech signals is inherently more complex: the noise in practical application scenarios is diverse, and the characteristics of different speakers and the perceptual properties of the human ear must also be considered, so all of these factors have to be weighed comprehensively when selecting a suitable speech enhancement algorithm. In recent years, deep learning techniques have been rapidly and widely applied in many fields and have achieved remarkable results thanks to their strong learning ability, and they are therefore favored by researchers. In the speech enhancement task, this type of data-driven approach completes the enhancement process by establishing a nonlinear mapping between noisy speech and clean speech. To promote the development of deep learning algorithms in the field of speech enhancement, this work studies them in depth and improves the enhancement model in view of the characteristics of the field, so as to enhance speech quality and intelligibility while improving the generalization of the model in unknown noise environments.
The invention provides a new speech enhancement method and constructs an encoder-decoder network model incorporating a non-local module. First, a sliding window of fixed length is applied to the time-domain noisy speech to divide it into blocks, and the blocks are spliced and used as the input of the encoding end of the model, so that both the amplitude information and the phase information of the speech are fully utilized. Second, a non-local module is added to the convolutional layers of the encoding end to extract the key features of the speech sequence while suppressing useless features, and a gated recurrent unit network is added to capture the temporal correlation information among the speech sequences. The output of the gated recurrent unit network is then fed into the non-local module of the decoding end, skip connections are introduced, and the high-resolution feature maps of the encoding end are spliced with the low-resolution feature maps of the decoding stage to supplement the detail information between the feature maps. Finally, a one-dimensional convolutional layer adjusts the dimensionality of the output result, and the output enhanced speech blocks are spliced in order to complete the overall synthesis of the enhanced speech. Test results on a Chinese speech data set show that the invention effectively improves the quality and intelligibility of the enhanced speech.
Disclosure of Invention
In view of the above, the main objective of the present invention is to remedy the unsatisfactory quality of enhanced speech caused by neglecting the importance of phase spectrum learning in non-end-to-end enhancement models.
To achieve this objective, the technical solution provided by the invention is as follows:
A new speech enhancement method, comprising the following steps:
Step 1, the input noisy speech data is preprocessed: the noisy speech is down-sampled, divided into blocks and spliced in the time domain in sequence, and the spliced results are fed into the model in order;
Step 2, the speech spliced in step 1 is fed into the convolutional layers of the encoding end; a non-local module is introduced after the last convolutional layer of the encoding end to extract the key features of the speech sequence while suppressing useless features, and a gated recurrent unit network is added to capture the temporal correlation information among the speech sequences;
Step 3, the output features of the parallel gated recurrent unit networks of step 2 are fused and fed into the non-local module of the decoding end; skip connections are then introduced, and the high-resolution feature maps of the encoding end are spliced with the low-resolution feature maps of the decoding end to supplement the detail information between the feature maps;
Step 4, a one-dimensional convolutional layer adjusts the output dimensionality of step 3, and the output enhanced speech blocks are spliced in order to complete the overall synthesis of the enhanced speech.
In summary, the present invention designs an encoder-decoder network that uses the time-domain representation of the speech as the input of the encoding end for deep feature extraction, thereby making full use of both the amplitude information and the phase information of the speech signal. A non-local module is added to the convolutional layers of the encoding and decoding ends, and a gated recurrent unit network is introduced to capture the temporal correlation information among the speech sequences, which improves the quality and intelligibility of the enhanced speech while reducing the interference of noise feature information.
Drawings
FIG. 1 is a schematic flow chart of a new speech enhancement method according to the present invention;
FIG. 2 is a schematic diagram of the operation flow of blocking and splicing noisy speech;
FIG. 3 is a schematic diagram of a deep feature extraction process performed at the encoding end;
FIG. 4 is a flow chart illustrating a decoding-side feature recovery process;
FIG. 5 is a schematic diagram of the overall synthesis flow of enhanced speech;
FIG. 6 is a diagram of an enhanced speech spectrogram obtained using the present invention;
the specific implementation mode is as follows:
the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is to be understood that the present invention is illustrative and not limited to the embodiments of the present invention, and the present invention may be implemented in other different specific embodiments. All other embodiments obtained by a person skilled in the art without making any inventive step are within the scope of protection of the present invention.
Fig. 1 is a schematic general flow chart of the new speech enhancement method according to the present invention. As shown in Fig. 1, the speech enhancement method based on a convolutional recurrent network and a non-local module according to the present invention comprises the following steps:
Step 1, the input noisy speech data is preprocessed by down-sampling, blocking and splicing in sequence;
Step 2, the spliced result of step 1 is fed into the down-sampling modules of the encoding end, each of which consists of a convolutional layer, a batch normalization layer and an activation function layer;
Step 3, the output features of the parallel gated recurrent unit networks of step 2 are fused and fed into the non-local module of the decoding end; skip connections are introduced, and the high-resolution feature maps of the encoding end are spliced with the low-resolution feature maps of the decoding end to supplement the detail information between the feature maps;
Step 4, a one-dimensional convolutional layer adjusts the output dimensionality of step 3, and the output enhanced speech blocks are spliced to complete the overall synthesis of the enhanced speech.
Fig. 2 is a schematic flow chart of the blocking and splicing of the noisy speech. As shown in Fig. 2, in step 1 the data is preprocessed by down-sampling, blocking and splicing the noisy speech in sequence, which comprises the following steps:
Step 11, the noisy speech is down-sampled to 16000 Hz and then divided into blocks using a sliding window with a window length of 16384 samples (about 1 s); the speech blocks do not overlap, and noisy speech whose length is not an integer multiple of the window length is zero-padded;
Step 12, the noisy speech blocks obtained in step 11 are spliced in order along the vertical direction, expressed as:

Y = [y_1; y_2; …; y_L]

where y_L denotes the L-th noisy speech block, and the length of every noisy speech block is the fixed value 16384;
Step 13, the feature matrix of step 12 is normalized to a mean of 0 and a variance of 1:

Ȳ = (Y − μ) / σ

where μ denotes the mean of the input data Y and σ denotes the standard deviation of the input data Y.
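For illustration, a minimal sketch of steps 11 to 13 is given below. PyTorch and the helper names (block_and_stack, normalize) are assumptions made for this sketch rather than part of the patent, and resampling to 16 kHz is assumed to have been done beforehand.

```python
import torch

BLOCK_LEN = 16384  # window length of steps 11-12 (about 1 s at 16 kHz)


def block_and_stack(noisy: torch.Tensor):
    """Split a 1-D noisy waveform into non-overlapping 16384-sample blocks,
    zero-padding the tail, and stack them vertically into Y = [y_1; ...; y_L]."""
    n_pad = (-noisy.numel()) % BLOCK_LEN               # zero padding for the last block
    padded = torch.nn.functional.pad(noisy, (0, n_pad))
    Y = padded.view(-1, BLOCK_LEN)                     # shape (L, 16384)
    return Y, n_pad


def normalize(Y: torch.Tensor):
    """Step 13: normalize the stacked blocks to zero mean and unit variance;
    mu and sigma are kept so that step 42 can undo the normalization."""
    mu, sigma = Y.mean(), Y.std()
    return (Y - mu) / sigma, mu, sigma
```

The returned n_pad, mu and sigma would be reused in steps 42 and 43, when the enhanced blocks are de-normalized, trimmed and spliced back together.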
Fig. 3 is a schematic diagram of the deep feature extraction process at the encoding end. As shown in Fig. 3, in step 2 the spliced result of step 1 is fed into the down-sampling modules of the encoding end, each of which consists of a convolutional layer, a batch normalization layer and an activation function layer; in addition, a non-local module is introduced after the last down-sampling module of the encoding end to extract the key features of the speech sequence while suppressing useless features, and a parallel gated recurrent unit network is added to capture the temporal correlation information among the speech sequences. This comprises the following steps:
and 21, continuously passing the input time domain matrix of the noisy speech through 12 down-sampling modules to extract deep features, wherein each down-sampling module comprises a one-dimensional convolution layer, an activation function layer and a batch normalization layer (BN). The convolution operation is represented as follows:
Mi=f(WgYi+b)
in the above formula
Figure BDA0003205608440000061
Is an output feature map of the convolutional layer, where C represents the number of channels, F represents the feature dimension, and Y representsiRepresenting the characteristic diagram of the ith input, b is a corresponding bias term, W is a weight matrix of a corresponding convolution kernel, the number of the convolution kernels is 24, 48, 72, 96, 120, 144, 168, 192, 216, 240, 264 and 288 in sequence, wherein the kernel size is 15, the step size is 1, f is a LeakyReLU activation function with a leakage correction linear unit, and the function is expressed as follows:
Figure BDA0003205608440000062
wherein a is a fixed value, and the value is generally 0.01;
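A sketch of one possible encoder down-sampling module under the hyper-parameters of step 21 (kernel size 15, stride 1, "same" padding, LeakyReLU with a = 0.01) is shown below. PyTorch and all identifier names are assumptions of this sketch, and the halving of the sequence dimension by simple decimation reflects the 16384 → 4 reduction quoted in the embodiment rather than an explicit statement in this step.

```python
import torch
import torch.nn as nn

ENC_CHANNELS = [24, 48, 72, 96, 120, 144, 168, 192, 216, 240, 264, 288]


class DownBlock(nn.Module):
    """One down-sampling module: Conv1d (k=15, stride=1, 'same' padding),
    LeakyReLU(a=0.01) and batch normalization, then a halving of the
    sequence dimension (here by decimation, which is an assumption)."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.conv = nn.Conv1d(c_in, c_out, kernel_size=15, stride=1, padding=7)
        self.act = nn.LeakyReLU(0.01)
        self.bn = nn.BatchNorm1d(c_out)

    def forward(self, x):                    # x: (batch, c_in, T)
        x = self.bn(self.act(self.conv(x)))  # M_i = f(W * Y_i + b), then BN
        return x[:, :, ::2]                  # halve the sequence dimension


encoder = nn.ModuleList(
    DownBlock(c_in, c_out)
    for c_in, c_out in zip([1] + ENC_CHANNELS[:-1], ENC_CHANNELS)
)
# one 16384-sample block enters as (batch, 1, 16384) and leaves as (batch, 288, 4)
```

Stacking the twelve modules with these channel counts reproduces the 288-channel, length-4 encoder output quoted in the embodiment.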
Step 22, the feature map M generated in step 21 is taken as the input of the non-local module. First, two one-dimensional convolutional layers are used to reduce the channel dimension of M from C to C/2; the outputs of the two convolutional layers are matrix-multiplied, and the product is normalized with a softmax function to generate the attention weights. The computation is expressed as:

α_{i,j} = softmax_j( θ(M_i)^T ψ(M_j) )

where θ and ψ both denote one-dimensional convolution operations;
Step 23, the attention weights α_{i,j} of step 22 are matrix-multiplied with the feature map m_j, which is likewise generated by a one-dimensional convolution, to obtain the output response of the i-th position:

y_i = Σ_j α_{i,j} · m_j

Step 24, a convolution operation is applied to the output y_i of step 23 to restore its dimensionality to that of the module input, and a residual connection then performs element-wise addition to obtain the enhanced feature z_i containing the global information of the speech sequence:

z_i = W_z · y_i + m_i

where W_z denotes a weight matrix learned during training. Note that the convolution kernels used in steps 22, 23 and 24 all have size 1 and stride 1;
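A minimal sketch of the non-local block of steps 22 to 24 follows. The branch that produces m_j is written here as a third 1×1 convolution g (as in common non-local network implementations); that choice, PyTorch, and all identifier names are assumptions of this sketch.

```python
import torch
import torch.nn as nn


class NonLocal1d(nn.Module):
    """Non-local block over a feature map M of shape (batch, C, T):
    attention = softmax(theta(M)^T psi(M)), y_i = sum_j a_ij m_j,
    z = W_z y + M via a residual connection (steps 22-24).
    All convolutions use kernel size 1 and stride 1."""

    def __init__(self, channels: int):
        super().__init__()
        c_half = channels // 2
        self.theta = nn.Conv1d(channels, c_half, kernel_size=1)
        self.psi = nn.Conv1d(channels, c_half, kernel_size=1)
        self.g = nn.Conv1d(channels, c_half, kernel_size=1)    # produces the m_j features
        self.w_z = nn.Conv1d(c_half, channels, kernel_size=1)  # restores the C channels

    def forward(self, m):                                # m: module input M, (batch, C, T)
        q = self.theta(m).transpose(1, 2)                # (batch, T, C/2)
        k = self.psi(m)                                  # (batch, C/2, T)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)    # (batch, T, T) attention weights
        v = self.g(m).transpose(1, 2)                    # (batch, T, C/2)
        y = torch.bmm(attn, v).transpose(1, 2)           # (batch, C/2, T)
        return self.w_z(y) + m                           # residual: z = W_z y + input
```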
Step 25, the output of the non-local module is fed into the parallel gated recurrent unit networks; the computation of the upper and lower parallel branches is identical. Taking the upper branch as an example, for the input x_t at time t the forward computation is:

r_t = σ(W_xr · x_t + W_hr · h_{t-1} + b_r)
z_t = σ(W_xz · x_t + W_hz · h_{t-1} + b_z)
h̃_t = tanh(W_xh · x_t + W_hh · (r_t ⊙ h_{t-1}) + b_h)
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where σ denotes the Sigmoid activation function, ⊙ denotes the element-wise product, W and b denote the weights and biases respectively, r_t and z_t denote the reset gate and the update gate, h_{t-1} is the hidden state at time t-1, and h̃_t denotes the candidate hidden state used to compute the hidden state h_t.
Fig. 4 is a schematic flow chart of the feature recovery process at the decoding end. As shown in Fig. 4, in step 3 the output features of the parallel gated recurrent unit networks of step 2 are fused and fed into the non-local module of the decoding end; skip connections are then introduced, and the high-resolution feature maps of the encoding end are spliced with the low-resolution feature maps of the decoding stage to supplement the detail information between the feature maps. This comprises the following steps:
and step 31, merging the output results of the parallel gating cycle network. Suppose the output results of the two networks are Out respectively1And Out2The fusion process is represented as:
InputD=Add(Out1;Out2)
wherein InputDRepresenting a fusion result (input at a decoding end), and Add represents a feature fusion mode;
Step 32, the output of step 31 is fed into the non-local module of the decoding end, and the computations of steps 22 to 24 are repeated to obtain the weighted feature vector;
Step 33, the result of step 32 is fed into the decoding end, which consists of 12 consecutive up-sampling modules and is linked to the encoding end through skip connections. Before each skip connection, the output of the preceding part is linearly interpolated with a scale factor of 2 along the sequence dimension and then spliced with the output of the corresponding down-sampling module along the channel dimension. The structure of each up-sampling module is the same as that of the down-sampling modules, except that the numbers of convolution kernels at the decoding end are 288, 264, 240, 216, 192, 168, 144, 120, 96, 72, 48 and 24 in sequence, the kernel size is 5, and the stride is 1.
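The following sketch shows one possible up-sampling module of step 33. PyTorch and the identifier names are assumptions, and the exact channel bookkeeping after the skip concatenation is not spelled out in the patent, so the input channel count c_in is left to the caller.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DEC_CHANNELS = [288, 264, 240, 216, 192, 168, 144, 120, 96, 72, 48, 24]  # decoder kernel counts


class UpBlock(nn.Module):
    """One decoder up-sampling module: linear interpolation with scale factor 2
    along the sequence dimension, channel-wise splicing with the matching
    encoder feature map (skip connection), then Conv1d (k=5, stride=1,
    'same' padding) + LeakyReLU + BatchNorm, mirroring the encoder modules."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.conv = nn.Conv1d(c_in, c_out, kernel_size=5, stride=1, padding=2)
        self.act = nn.LeakyReLU(0.01)
        self.bn = nn.BatchNorm1d(c_out)

    def forward(self, x, skip):
        # x: (batch, C, T); skip: corresponding encoder feature map of length 2T
        x = F.interpolate(x, scale_factor=2, mode="linear", align_corners=False)
        x = torch.cat([x, skip], dim=1)        # splice along the channel dimension
        return self.bn(self.act(self.conv(x)))
```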
Fig. 5 is a schematic diagram of the overall synthesis process of the enhanced speech. As shown in Fig. 5, in step 4 a one-dimensional convolutional layer adjusts the dimensionality of the output features of the decoding end, and the output enhanced speech blocks are spliced in order along the time dimension to complete the overall synthesis of the enhanced speech, which comprises the following steps:
Step 41, the output result of the decoding end is spliced with the original input of the network and then fed into the output layer for dimension adjustment; the number, size and stride of the convolution kernels of the output layer are 1, 5 and 1 respectively;
Step 42, the output result of step 41 is de-normalized, and it is determined whether zero padding was applied to the noisy speech in the preprocessing stage; for noisy speech that was zero-padded, the corresponding zero-padded part of the model prediction output is removed first, otherwise no operation is needed;
Step 43, the outputs of step 42 are spliced to complete the reconstruction of the enhanced speech.
Examples
In this embodiment, clean speech and noise are mixed into noisy speech at different signal-to-noise ratios, and the speech enhancement method of the invention is used to denoise the noisy speech, which specifically comprises the following steps:
1. The data is preprocessed: after down-sampling, the noisy speech is divided into blocks with a sliding window of length 16384, the part that cannot be evenly divided is zero-padded, and the blocks do not overlap, finally yielding noisy speech blocks of dimension 16384;
2. The noisy speech blocks are then fed into the down-sampling modules of the encoding end, each of which consists, in order, of a one-dimensional convolutional layer, a batch normalization layer and an activation function layer. In the convolutional layers, zero padding is performed in "Same" mode so that the input and output dimensions are identical; at the same time, as the data stream passes through the encoding end, the output feature of each down-sampling module is halved in the sequence dimension, so the final output feature size at the encoding end is 288 × 4, where 288 is the number of channels and 4 is the sequence dimension;
3. The output features of the encoding end are fed in turn into the non-local module and the gated recurrent unit network, which respectively weight the output features of the encoding end and compute the temporal correlation;
4. The output of the gated recurrent unit network is fed into the decoding end, which consists of the same number of up-sampling modules and whose structure is consistent with that of the encoding end. It is linked to the encoding end through skip connections; before each skip connection, the output of the preceding part is linearly interpolated with a scale factor of 2 along the sequence dimension and then spliced with the output of the corresponding down-sampling module along the channel dimension, so the final output feature size of the decoding end is 24 × 16384. This output is then spliced with the input of the encoding end, giving a spliced feature map with 25 channels and an unchanged sequence dimension; finally, the spliced result is fed into the output layer for dimension adjustment, and the final output feature size is 1 × 16384;
5. After the one-dimensional output result of the enhanced speech is obtained, the zero-padded part introduced in preprocessing is removed and the blocks are spliced, completing the synthesis of the enhanced speech.
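The feature sizes quoted above (288 × 4 at the encoder output, 24 × 16384 at the decoder output) can be checked with a few lines of arithmetic; this is plain Python for illustration only.

```python
# each of the 12 down-sampling modules halves the sequence dimension
seq = 16384
for _ in range(12):
    seq //= 2
print(seq)   # 4  -> encoder output is 288 channels x 4 samples

# each of the 12 up-sampling modules doubles it again via interpolation
for _ in range(12):
    seq *= 2
print(seq)   # 16384 -> decoder output is 24 channels x 16384 samples
# splicing in the 1-channel network input gives 25 x 16384, and the single-kernel
# output layer then yields the final 1 x 16384 enhanced block
```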
FIG. 6 shows spectrograms for speech contaminated by M109 (tank interior) noise at a signal-to-noise ratio of -3 dB and enhanced using the method of the present invention; the horizontal axis represents time, the vertical axis represents frequency, and the value at each coordinate point corresponds to the energy of the speech signal, represented by shading. Graph (a) is the spectrogram of the clean speech, graph (b) is the spectrogram of the noisy speech, and graph (c) is the spectrogram of the enhanced speech obtained with the method. The spectrograms show that the enhanced speech obtained by the method of the present invention restores the low-frequency detail information of the clean speech well, while the processing of the high-frequency part is less ideal: within the dashed box of FIG. 6(c) some noise redundancy remains.

Claims (5)

1. A new speech enhancement method, characterized in that it comprises the following steps:
step 1, a sliding window of fixed length is used to divide the noisy speech into blocks, and the blocked speech is spliced and then fed directly into the encoding end of the model;
step 2, a non-local module is added to the convolutional layers of the encoding end to extract the key features of the speech sequence while suppressing useless features, and a gated recurrent unit network is added to capture the temporal correlation information among the speech sequences;
step 3, the output of the gated recurrent unit network is fed into the non-local module of the decoding end; skip connections are then introduced, and the high-resolution feature maps of the encoding end are spliced with the low-resolution feature maps of the decoding end to supplement the detail information between the feature maps;
step 4, a one-dimensional convolutional layer adjusts the dimensionality of the output result, and the output enhanced speech blocks are spliced in order to complete the overall synthesis of the enhanced speech.
2. The method of claim 1, wherein in step 1, the noisy speech is segmented by using a sliding window with a fixed length, and the segmented speech is spliced and then sent directly to the encoding end of the model, comprising the following steps:
step 11, down-sampling the noisy speech to 16kHz, then performing block processing by adopting a sliding window with the length of 16384 (about 1 second), wherein each speech block is not overlapped, and for the noisy speech which cannot be evenly divided by the window length, zero padding is needed;
step 12, the noisy speech blocks obtained in step 11 are spliced in order along the vertical direction, and the splicing result is expressed as:

Y = [y_1; y_2; …; y_L]

where y_L denotes the L-th noisy speech block, and the length of each noisy speech block is the fixed value 16384;
step 13, the feature matrix of step 12 is normalized to a mean of 0 and a variance of 1:

Ȳ = (Y − μ) / σ

where μ denotes the mean of the input data Y and σ denotes the standard deviation of the input data Y.
3. The method of claim 1, wherein in step 2, a non-local module is added to the convolutional layers of the encoding end to extract the key features of the speech sequence and suppress useless features, and a gated recurrent unit network is added to capture the temporal correlation information among the speech sequences, comprising the following steps:
step 21, the input time-domain feature vector of the noisy speech is fed into 12 consecutive down-sampling modules for deep feature extraction; each down-sampling module comprises a one-dimensional convolutional layer, an activation function layer and a batch normalization (BN) layer. The convolution operation is expressed as:

M_i = f(W · Y_i + b)

where M_i ∈ R^(C×F) is the output feature map of the convolutional layer, C denotes the number of channels, F denotes the feature dimension, Y_i denotes the i-th input feature map, b is the corresponding bias term, and W is the weight matrix of the corresponding convolution kernels; the numbers of convolution kernels are 24, 48, 72, 96, 120, 144, 168, 192, 216, 240, 264 and 288 in sequence, the kernel size is 15 and the stride is 1; f(·) is the LeakyReLU (leaky rectified linear unit) activation function, defined as:

f(x) = x,    x > 0
f(x) = a·x,  x ≤ 0

where a is a fixed value, generally taken as 0.01;
step 22, the feature map M generated in step 21 is taken as the input of the non-local module. First, two one-dimensional convolutional layers are used to reduce the channel dimension of M from C to C/2; the outputs of the two convolutional layers are matrix-multiplied, and the product is normalized with a softmax function to generate the attention weights. The computation is expressed as:

α_{i,j} = softmax_j( θ(M_i)^T ψ(M_j) )

where θ and ψ both denote one-dimensional convolution operations;
step 23, the attention weights α_{i,j} of step 22 are matrix-multiplied with the feature map m_j, likewise generated by a one-dimensional convolution, to obtain the output response y_i of the i-th position;
step 24, a convolution operation is applied to the output y_i of step 23 to restore its dimensionality to that of the module input, and a residual connection performs element-wise addition to obtain the enhanced feature z_i containing the global information of the speech sequence:

z_i = W_z · y_i + m_i

where W_z denotes a weight matrix learned during training; the convolution kernels used in steps 22, 23 and 24 all have size 1 and stride 1;
step 25, the output of the non-local module is fed into the parallel gated recurrent networks; the computation of the upper and lower parallel branches is identical. Taking the upper branch as an example, for the input x_t at time t the forward computation is:

r_t = σ(W_xr · x_t + W_hr · h_{t-1} + b_r)
z_t = σ(W_xz · x_t + W_hz · h_{t-1} + b_z)
h̃_t = tanh(W_xh · x_t + W_hh · (r_t ⊙ h_{t-1}) + b_h)
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where σ denotes the Sigmoid activation function, ⊙ denotes the element-wise product, W and b denote the weights and biases respectively, r_t and z_t denote the reset gate and the update gate, h_{t-1} is the hidden state at time t-1, and h̃_t denotes the candidate hidden state used to compute the hidden state h_t.
4. The method of claim 1, wherein in step 3, the output of the gated recurrent unit network is fed into the non-local module of the decoding end, skip connections are then introduced, and the high-resolution feature maps of the encoding end are spliced with the low-resolution feature maps of the decoding end to supplement the detail information between the feature maps, comprising the following steps:
and step 31, merging the output results of the parallel gating cycle network. Suppose the output results of the two networks are Out respectively1And Out2The fusion mode is expressed as:
InputD=Add(Out1;Out2)
wherein InputDRepresenting a fusion result (input at a decoding end), and Add represents a feature fusion mode;
step 32, the output of step 31 is fed into the non-local module of the decoding end, and the computations of steps 22 to 24 are repeated to obtain the weighted feature vector;
step 33, the result of step 32 is fed into the decoding end, which consists of 12 consecutive up-sampling modules and is linked to the encoding end through skip connections; before each skip connection, the output of the preceding part is linearly interpolated with a scale factor of 2 along the sequence dimension and then spliced with the output of the corresponding down-sampling module along the channel dimension; the structure of each up-sampling module is the same as that of the down-sampling modules, except that the numbers of convolution kernels at the decoding end are 288, 264, 240, 216, 192, 168, 144, 120, 96, 72, 48 and 24 in sequence, the kernel size is 5, and the stride is 1.
5. The method of claim 1, wherein in step 4, a one-dimensional convolutional layer adjusts the dimensionality of the output result, and the output enhanced speech blocks are spliced in order to complete the overall synthesis of the enhanced speech, comprising the following steps:
step 41, the output result of the decoding end is spliced with the original input of the network and then fed into the output convolutional layer for dimension adjustment; the number, size and stride of the convolution kernels of the output layer are 1, 5 and 1 respectively;
step 42, because the output result of step 41 is one-dimensional and the preprocessed noisy speech blocks of step 11 do not overlap, the one-dimensional outputs predicted by the enhancement model can be spliced directly; note that, in the preprocessing, zero padding is applied to noisy speech whose length is not an integer multiple of the window length, so in that case the corresponding zero-padded part of the model prediction output is removed first and the splicing is then performed;
step 43, the outputs of step 42 are spliced to complete the reconstruction of the enhanced speech.
CN202110916018.6A 2021-08-11 2021-08-11 Voice enhancement method Active CN113674753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110916018.6A CN113674753B (en) 2021-08-11 2021-08-11 Voice enhancement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110916018.6A CN113674753B (en) 2021-08-11 2021-08-11 Voice enhancement method

Publications (2)

Publication Number Publication Date
CN113674753A (en) 2021-11-19
CN113674753B CN113674753B (en) 2023-08-01

Family

ID=78542169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110916018.6A Active CN113674753B (en) 2021-08-11 2021-08-11 Voice enhancement method

Country Status (1)

Country Link
CN (1) CN113674753B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826462A (en) * 2019-10-31 2020-02-21 上海海事大学 Human body behavior identification method of non-local double-current convolutional neural network model
CN113095106A (en) * 2019-12-23 2021-07-09 华为数字技术(苏州)有限公司 Human body posture estimation method and device
CN111242846A (en) * 2020-01-07 2020-06-05 福州大学 Fine-grained scale image super-resolution method based on non-local enhancement network
CN112509593A (en) * 2020-11-17 2021-03-16 北京清微智能科技有限公司 Voice enhancement network model, single-channel voice enhancement method and system
CN112967730A (en) * 2021-01-29 2021-06-15 北京达佳互联信息技术有限公司 Voice signal processing method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
袁文浩 et al.: "基于卷积门控循环神经网络的语音增强方法" [Speech enhancement method based on a convolutional gated recurrent neural network], 《华中科技大学学报(自然科学版)》 [Journal of Huazhong University of Science and Technology (Natural Science Edition)] *

Also Published As

Publication number Publication date
CN113674753B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
Yin et al. Phasen: A phase-and-harmonics-aware speech enhancement network
Zhang et al. Automatic modulation classification using CNN-LSTM based dual-stream structure
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
WO2020042707A1 (en) Convolutional recurrent neural network-based single-channel real-time noise reduction method
CN111081268A (en) Phase-correlated shared deep convolutional neural network speech enhancement method
CN109890043B (en) Wireless signal noise reduction method based on generative countermeasure network
CN110600018A (en) Voice recognition method and device and neural network training method and device
CN113470671B (en) Audio-visual voice enhancement method and system fully utilizing vision and voice connection
Liu et al. Cp-gan: Context pyramid generative adversarial network for speech enhancement
Shi et al. Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation.
CN112735460B (en) Beam forming method and system based on time-frequency masking value estimation
WO2022183806A1 (en) Voice enhancement method and apparatus based on neural network, and electronic device
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN115731505A (en) Video salient region detection method and device, electronic equipment and storage medium
Qi et al. Exploring deep hybrid tensor-to-vector network architectures for regression based speech enhancement
CN117174105A (en) Speech noise reduction and dereverberation method based on improved deep convolutional network
Qiu et al. Adversarial multi-task learning with inverse mapping for speech enhancement
Jiang et al. An improved unsupervised single-channel speech separation algorithm for processing speech sensor signals
CN112289338B (en) Signal processing method and device, computer equipment and readable storage medium
Jannu et al. Shuffle attention u-Net for speech enhancement in time domain
CN114830168A (en) Image reconstruction method, electronic device, and computer-readable storage medium
CN113674753B (en) Voice enhancement method
Zhou et al. Speech Enhancement via Residual Dense Generative Adversarial Network.
WO2022213825A1 (en) Neural network-based end-to-end speech enhancement method and apparatus
CN113035217B (en) Voice enhancement method based on voiceprint embedding under low signal-to-noise ratio condition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant