CN113674753B - Voice enhancement method - Google Patents

Voice enhancement method

Info

Publication number
CN113674753B
CN113674753B
Authority
CN
China
Prior art keywords
voice
output
speech
dimension
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110916018.6A
Other languages
Chinese (zh)
Other versions
CN113674753A (en)
Inventor
陈彦男
邹波蓉
王伟东
景浩
李辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology filed Critical Henan University of Technology
Priority to CN202110916018.6A priority Critical patent/CN113674753B/en
Publication of CN113674753A publication Critical patent/CN113674753A/en
Application granted granted Critical
Publication of CN113674753B publication Critical patent/CN113674753B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G Physics
    • G10 Musical instruments; Acoustics
    • G10L Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L19/022 Blocking, i.e. grouping of samples in time; choice of analysis windows; overlap factoring
    • G10L19/025 Detection of transients or attacks for time/frequency resolution switching
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a speech enhancement method. Existing non-end-to-end deep neural network speech enhancement methods neglect the importance of learning the phase spectrum, so the quality of the enhanced speech is not ideal. In the invention, the noisy speech is first split into blocks and a non-local module is added to the convolution layers at the encoding end; a gated recurrent unit (GRU) network is then added to capture the temporal correlation between speech sequences; finally, a one-dimensional convolution layer adjusts the dimension of the output, and the output enhanced speech blocks are spliced in order, thereby improving the quality and the intelligibility of the enhanced speech.

Description

Voice enhancement method
Technical Field
The invention provides a speech enhancement method and relates to the field of speech signal processing.
Background
Throughout human development, speech has been the most practical mode of everyday communication because it is convenient and efficient. In modern society in particular, as the level of technology has risen, speech signal processing has developed widely, and a series of products that use speech as a carrier, such as smart speakers and headphones, voice assistants, smart recording pens and translation devices, have gradually appeared; these smart devices have made daily life much more convenient. In actual use, however, device performance is severely degraded by environmental noise, electrical noise inside the device and the voices of other speakers, and speech distortion can even occur. Effectively solving the problem of using smart devices in noisy environments is therefore of great significance for improving the user experience of such products, and speech enhancement, as a speech processing technology, can address this problem well. Speech enhancement aims to improve the quality and intelligibility of speech signals contaminated by noise: enhancement algorithms are designed to extract the clean speech signal from the noisy speech and suppress the interference of background noise. Although speech enhancement techniques can effectively reduce the adverse effects of noise to some extent, the diversity, complexity and variability of noise in real environments clearly place higher demands on them.
The speech enhancement task originates from the "cocktail party problem", which describes the following phenomenon: at a party filled with many environmental sounds, a human listener can easily focus on a target voice of interest while paying less attention to other background sounds. This auditory selective attention is an adaptive ability of the human auditory system; machines do not have such an ability, so designing computational auditory models that can extract target information in noisy environments has long been a focus of research on this problem. As a key step of speech signal processing, speech enhancement has broad application prospects in fields such as speech recognition and mobile communication. In speech recognition, most existing systems are developed under ideal conditions, that is, on speech data recorded without noise; in noisy environments, and especially under strong noise, recognition accuracy drops sharply, and speech enhancement can serve as the front end of the recognition system, preprocessing the noisy speech to improve the recognition rate. In mobile communication, both parties are often subject to noise interference in real scenes; in particular, when one party is on a street or in a restaurant, the transmitted signal is easily distorted, communication quality is seriously degraded and the subjective listening experience of both parties suffers. Applying a speech enhancement technique at the transmitting end can effectively filter the noise and improve the speech quality at the receiving end.
Academic research on speech enhancement goes back several decades. Unlike tasks such as text classification or object detection, processing speech signals is inherently more complex: practical scenarios contain many types of noise, and the characteristics of different speakers and the perceptual properties of the human ear must also be considered, so all of these factors have to be weighed when selecting an appropriate speech enhancement algorithm. In recent years, deep learning has become popular with researchers because of its strong learning ability and has quickly been applied in a wide variety of fields with remarkable results. In the speech enhancement task, data-driven methods complete the enhancement process by establishing a nonlinear mapping between noisy speech and clean speech. To promote the development of deep learning in the field of speech enhancement, it is therefore necessary to study such algorithms in depth, improve enhancement models by exploiting the characteristics of the field, guarantee the quality and intelligibility of the enhanced speech, and improve the generalization of the models to unknown noise environments.
The invention provides a speech enhancement method that builds an encoder-decoder network model incorporating a non-local module. First, a fixed-length sliding window is applied to the time-domain noisy speech to split it into blocks, and the blocks are spliced and fed into the encoding end of the model, so that both the magnitude and the phase information of the speech are fully used. Second, a non-local module is added to the convolution layers of the encoding end to extract the key features of the speech sequence while suppressing useless features, and a gated recurrent unit (GRU) network is added to capture the temporal correlation between speech sequences. The output of the GRU network is then fed into the non-local module at the decoding end, skip connections are introduced, and the high-resolution feature maps of the encoding end are concatenated with the low-resolution feature maps of the decoding stage to supplement the detail information between feature maps. Finally, a one-dimensional convolution layer adjusts the dimension of the output, and the output enhanced speech blocks are spliced in order to complete the overall synthesis of the enhanced speech. Test results on a Chinese speech data set show that the invention effectively improves the quality and intelligibility of the enhanced speech.
Disclosure of Invention
Therefore, the main purpose of the present invention is to overcome the problem that the enhanced speech quality is not ideal because non-end-to-end enhancement models neglect the importance of learning the phase spectrum.
To achieve this purpose, the technical solution provided by the invention is as follows:
A speech enhancement method, comprising the following steps:
step 1, preprocessing the input noisy speech data by sequentially downsampling, blocking and splicing the noisy speech in the time domain, and feeding the spliced result into the model;
step 2, feeding the speech spliced in step 1 into the convolution layers at the encoding end, introducing a non-local module at the last convolution layer of the encoding end to extract the key features of the speech sequence while suppressing useless features, and adding a gated recurrent unit (GRU) network to capture the temporal correlation between speech sequences;
step 3, fusing the output features of the parallel gated recurrent unit networks of step 2 and feeding them into the non-local module at the decoding end, then introducing skip connections and concatenating the high-resolution feature maps of the encoding end with the low-resolution feature maps of the decoding end, thereby supplementing the detail information between feature maps;
step 4, using a one-dimensional convolution layer to adjust the output dimension of step 3, and splicing the output enhanced speech blocks in order, thereby completing the overall synthesis of the enhanced speech.
In summary, the invention designs an encoder-decoder network that takes the time-domain representation of the speech as the input of the encoding end to extract deep features, thereby fully using both the magnitude and the phase information of the speech signal. A non-local module is added to the convolution layers of the encoding and decoding ends, and a gated recurrent unit network is introduced to capture the temporal correlation between speech sequences, which reduces the interference of noise feature information and improves the quality and intelligibility of the enhanced speech.
Drawings
FIG. 1 is a schematic diagram of a speech enhancement method according to the present invention;
FIG. 2 is a schematic diagram of the block and splice operation of noisy speech;
FIG. 3 is a schematic diagram of a deep feature extraction process performed at the encoding end;
FIG. 4 is a schematic flow chart of a decoding-side feature recovery process;
FIG. 5 is a schematic diagram of the overall synthesis flow of enhanced speech;
FIG. 6 shows spectrograms of speech enhanced by the method of the present invention;
the specific embodiment is as follows:
the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the examples are given for illustration and not limitation of the embodiments of the present invention, and the present invention may be implemented by other different embodiments. All other embodiments, which can be made by those skilled in the art without the inventive effort, are intended to be within the scope of the present invention.
Fig. 1 is a general flow chart of a speech enhancement method according to the present invention, and as shown in fig. 1, the speech enhancement method of the convolutional network and the non-local module according to the present invention includes the following steps:
step 1, preprocessing the input noisy speech data by performing downsampling, speech blocking and splicing in order;
step 2, feeding the splicing result of step 1 into the down-sampling modules at the encoding end, each of which consists of a convolution layer, a batch normalization layer and an activation function layer; in addition, introducing a non-local module after the last down-sampling module of the encoding end to extract the key features of the speech sequence while suppressing useless features, and adding a parallel gated recurrent unit network to capture the temporal correlation between speech sequences;
step 3, fusing the output features of the parallel gated recurrent unit networks of step 2 and feeding them into the non-local module at the decoding end, introducing skip connections, and concatenating the high-resolution feature maps of the encoding end with the low-resolution feature maps of the decoding end, thereby supplementing the detail information between feature maps;
step 4, adjusting the output dimension of step 3 with a one-dimensional convolution layer and splicing the output enhanced speech blocks along the time dimension, thereby completing the overall synthesis of the enhanced speech.
Fig. 2 is a schematic flow chart of the blocking and splicing of the noisy speech. As shown in Fig. 2, in step 1 the data are preprocessed by performing downsampling, speech blocking and splicing in order, which comprises the following steps:
step 11, downsampling the noisy speech to 16000 Hz and then splitting it into blocks with a sliding window of length 16384; the speech blocks do not overlap, and noisy speech whose length is not divisible by the window length is zero-padded;
step 12, splicing the noisy speech blocks obtained in step 11 in order along the vertical direction, the splicing being expressed as:
Y = [y_1; y_2; ...; y_L]
where y_L denotes the L-th noisy speech block, and each noisy speech block has the fixed length 16384;
step 13, normalizing the feature matrix of step 12 to zero mean and unit variance:
Ȳ = (Y − μ) / σ
where μ denotes the mean of the input data Y and σ denotes its standard deviation;
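As a minimal sketch of this preprocessing stage, the NumPy snippet below splits a waveform into non-overlapping 16384-sample blocks with zero padding and applies the zero-mean, unit-variance normalization; the function name, the epsilon guard and the choice to compute the statistics over the whole padded signal are illustrative assumptions rather than details stated in the patent.

```python
import numpy as np

BLOCK_LEN = 16384  # sliding-window length from step 11

def block_and_normalize(noisy, block_len=BLOCK_LEN):
    """Split a 1-D noisy waveform (already resampled to 16 kHz) into
    non-overlapping blocks, stack them vertically (step 12), and
    z-score normalize (step 13)."""
    pad = (-len(noisy)) % block_len              # zero padding for the remainder
    padded = np.pad(noisy, (0, pad))
    blocks = padded.reshape(-1, block_len)       # Y = [y_1; y_2; ...; y_L]
    mu, sigma = padded.mean(), padded.std()
    normalized = (blocks - mu) / (sigma + 1e-8)  # zero mean, unit variance
    return normalized, mu, sigma, pad            # keep mu, sigma, pad to undo later

# usage sketch:
# blocks, mu, sigma, pad = block_and_normalize(np.random.randn(50000).astype(np.float32))
# blocks.shape -> (4, 16384)
```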
Fig. 3 is a schematic diagram of the deep feature extraction flow at the encoding end. As shown in Fig. 3, in step 2 the splicing result of step 1 is fed into the down-sampling modules of the encoding end, each consisting of a convolution layer, a batch normalization layer and an activation function layer; in addition, a non-local module is introduced after the last down-sampling module of the encoding end to extract the key features of the speech sequence while suppressing useless features, and a parallel gated recurrent unit network is added to capture the temporal correlation between speech sequences. This comprises the following steps:
step 21, deep feature extraction is performed on the input noisy speech time-domain matrix by 12 consecutive down-sampling modules, where each down-sampling module contains a one-dimensional convolution layer, an activation function layer and a batch normalization layer. The convolution operation is expressed as:
M_i = f(W·Y_i + b)
where M_i ∈ R^(C×F) is the output feature map of the convolution layer, C denotes the number of channels, F denotes the feature dimension, Y_i denotes the i-th input feature map, b is the corresponding bias term and W is the weight matrix of the corresponding convolution kernel. The number of convolution kernels is 24, 48, 72, 96, 120, 144, 168, 192, 216, 240, 264 and 288 in turn, the kernel size is 15 and the stride is 1. f(·) is the leaky rectified linear unit (leaky ReLU) activation function:
f(x) = x for x ≥ 0 and f(x) = a·x for x < 0
where a is a fixed value of 0.01;
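A hedged PyTorch sketch of one such down-sampling module follows. The convolution, batch normalization and leaky ReLU (a = 0.01) with kernel size 15 and stride 1 are taken from the text; halving the sequence dimension by decimation, the 'same' padding and the ordering of the layers are assumptions consistent with the shapes given in the example section.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """One encoder down-sampling module: Conv1d (kernel 15, stride 1),
    batch norm, leaky ReLU (a = 0.01). Halving the sequence dimension is
    done here by decimation, an assumption consistent with stride 1."""
    def __init__(self, in_ch, out_ch, kernel=15):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride=1, padding=kernel // 2)  # 'same' padding
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.LeakyReLU(0.01)

    def forward(self, x):
        x = self.act(self.bn(self.conv(x)))
        return x[:, :, ::2]                      # halve the sequence dimension

# encoder channel plan from the description: 1 -> 24 -> 48 -> ... -> 288
channels = [1] + [24 * i for i in range(1, 13)]
encoder = nn.Sequential(*[DownBlock(c_in, c_out) for c_in, c_out in zip(channels, channels[1:])])
# encoder(torch.randn(1, 1, 16384)) has shape (1, 288, 4)
```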
step 22, the feature map M generated in step 21 is taken as the input of the non-local module. First, two one-dimensional convolution layers θ and ψ reduce the number of channels from C to C/2; the outputs of the two convolution layers are then multiplied as matrices, and the product is normalized with a softmax function to generate the attention weights:
α_ij = softmax_j(θ(m_i)^T·ψ(m_j))
where θ and ψ both denote one-dimensional convolution operations;
step 23, the attention weights α_ij of step 22 are matrix-multiplied with the mapping of the features m_j, which is also generated by a one-dimensional convolution, to obtain the output response y_i of the i-th position:
y_i = Σ_j α_ij·g(m_j)
where g(·) denotes this one-dimensional convolution;
step 24, the dimension of the output y_i of step 23 is adjusted by a convolution operation so that it is consistent with the input of the module, and a residual connection then adds it element-wise to the input to obtain the enhanced feature z_i containing the global information of the speech sequence:
z_i = W_z·y_i + m_i
where W_z denotes a weight matrix learned during training. It should be noted that the convolution kernels used in steps 22, 23 and 24 all have size 1 and stride 1;
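The following sketch shows one possible realization of the non-local module of steps 22 to 24 in PyTorch: θ, ψ and a third 1 x 1 convolution reduce the channels from C to C/2, a softmax over positions produces the attention weights, and W_z with a residual connection restores the enhanced feature z. The tensor layout (batch, channels, time) and the exact placement of the transposes are assumptions.

```python
import torch
import torch.nn as nn

class NonLocal1d(nn.Module):
    """Non-local block over a (batch, C, T) feature map (steps 22-24)."""
    def __init__(self, channels):
        super().__init__()
        inter = channels // 2                       # channel reduction C -> C/2
        self.theta = nn.Conv1d(channels, inter, 1)
        self.psi = nn.Conv1d(channels, inter, 1)
        self.g = nn.Conv1d(channels, inter, 1)      # mapping m_j -> g(m_j)
        self.w_z = nn.Conv1d(inter, channels, 1)    # restores the channel dimension

    def forward(self, m):
        q = self.theta(m).permute(0, 2, 1)          # (b, T, C/2)
        k = self.psi(m)                             # (b, C/2, T)
        attn = torch.softmax(q @ k, dim=-1)         # alpha_ij, shape (b, T, T)
        v = self.g(m).permute(0, 2, 1)              # (b, T, C/2)
        y = (attn @ v).permute(0, 2, 1)             # y_i = sum_j alpha_ij g(m_j)
        return self.w_z(y) + m                      # z_i = W_z y_i + m_i
```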
step 25, the output of the non-local module is fed into a parallel gated recurrent unit network. The computations of the parallel networks are the same; taking the upper network as an example, given the input x_t at time t, the forward computation includes:
r_t = σ(W_xr·x_t + W_hr·h_{t-1} + b_r)
z_t = σ(W_xz·x_t + W_hz·h_{t-1} + b_z)
where σ denotes the Sigmoid activation function, ⊙ denotes the element-wise product, W and b denote weights and biases, r_t and z_t denote the reset gate and the update gate respectively, h_{t-1} is the hidden state at time t-1, and the candidate hidden state is used to assist the computation of the hidden state h_t.
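For reference, a manual single-step GRU sketch that mirrors the gate equations above is given below; the candidate-state and output equations follow the standard GRU convention used by PyTorch's nn.GRU, since the patent text only spells out the reset and update gates. Absorbing the bias terms into the input-side linear layers is also an assumption.

```python
import torch
import torch.nn as nn

class GRUCellManual(nn.Module):
    """Single GRU step implementing the update equations of step 25.
    nn.GRU provides the same computation; this is written out only to
    mirror the formulas above."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.w_xr, self.w_hr = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, hid_dim, bias=False)
        self.w_xz, self.w_hz = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, hid_dim, bias=False)
        self.w_xh, self.w_hh = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, hid_dim, bias=False)

    def forward(self, x_t, h_prev):
        r_t = torch.sigmoid(self.w_xr(x_t) + self.w_hr(h_prev))          # reset gate
        z_t = torch.sigmoid(self.w_xz(x_t) + self.w_hz(h_prev))          # update gate
        h_cand = torch.tanh(self.w_xh(x_t) + self.w_hh(r_t * h_prev))    # candidate state
        return z_t * h_prev + (1.0 - z_t) * h_cand                       # new hidden state h_t
```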
Fig. 4 is a schematic flow chart of the feature recovery process at the decoding end. As shown in Fig. 4, in step 3 the output features of the parallel gated recurrent unit networks of step 2 are fused and fed into the non-local module at the decoding end; skip connections are then introduced, and the high-resolution feature maps of the encoding end are concatenated with the low-resolution feature maps of the decoding stage, thereby supplementing the detail information between feature maps. This comprises the following steps:
and step 31, fusing the output results of the parallel gate control loop network. When the output results of the two networks are Out respectively 1 And Out 2 The fusion process is expressed as:
Input D =Add(Out 1 ;Out 2 )
wherein Input is D Representing a fusion result, wherein Add represents a feature fusion mode;
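A possible arrangement of the parallel gated recurrent unit network of step 25 together with the Add fusion above is sketched below with PyTorch's nn.GRU; treating Add as element-wise addition and setting the hidden size equal to the channel count are assumptions.

```python
import torch
import torch.nn as nn

class ParallelGRU(nn.Module):
    """Two GRUs run in parallel on the encoder output; their outputs are
    fused by addition (Input_D = Add(Out_1; Out_2), step 31)."""
    def __init__(self, channels=288):
        super().__init__()
        self.gru1 = nn.GRU(channels, channels, batch_first=True)
        self.gru2 = nn.GRU(channels, channels, batch_first=True)

    def forward(self, x):                  # x: (batch, C, T) from the non-local block
        seq = x.permute(0, 2, 1)           # GRU expects (batch, T, C)
        out1, _ = self.gru1(seq)
        out2, _ = self.gru2(seq)
        fused = out1 + out2                # feature fusion by element-wise addition
        return fused.permute(0, 2, 1)      # back to (batch, C, T)
```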
step 32, the output of step 31 is fed into the non-local module at the decoding end, and the computations of steps 22 to 24 are repeated to obtain the weighted feature vectors;
step 33, the result of step 32 is fed into the decoding end, which consists of 12 consecutive up-sampling modules and is associated with the encoding end through skip connections. Before each skip connection, the output of the previous module is linearly interpolated along the sequence dimension with a scale factor of 2 and then concatenated along the channel dimension with the output of the corresponding down-sampling module. Each up-sampling module has the same structure as the down-sampling modules, except that the numbers of convolution kernels at the decoding end are 288, 264, 240, 216, 192, 168, 144, 120, 96, 72, 48 and 24 in turn, the kernel size is 5 and the stride is 1.
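A hedged sketch of one decoder up-sampling module follows: linear interpolation with scale factor 2 along the sequence dimension, concatenation with the matching encoder feature map along the channel dimension, and a Conv1d with kernel size 5 and stride 1. How the skip channels of each encoder level pair with each decoder level is not fully specified in the text and is left as parameters here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """One decoder up-sampling module (step 33): upsample by 2, concatenate
    the encoder skip feature along the channel dimension, then
    Conv1d (kernel 5, stride 1) + batch norm + leaky ReLU."""
    def __init__(self, in_ch, skip_ch, out_ch, kernel=5):
        super().__init__()
        self.conv = nn.Conv1d(in_ch + skip_ch, out_ch, kernel, stride=1, padding=kernel // 2)
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.LeakyReLU(0.01)

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="linear", align_corners=False)
        x = torch.cat([x, skip], dim=1)          # skip connection in the channel dimension
        return self.act(self.bn(self.conv(x)))
```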
Fig. 5 is a schematic diagram of the overall synthesis flow of the enhanced speech. As shown in Fig. 5, in step 4 a one-dimensional convolution layer normalizes the dimension of the output features of the decoding end, and the output enhanced speech blocks are spliced in order along the time dimension, thereby completing the overall synthesis of the enhanced speech. This comprises the following steps:
step 41, the output of the decoding end is concatenated with the original network input and fed into the output layer for dimension normalization; the number, size and stride of the convolution kernels of the output layer are 1, 5 and 1 in turn;
step 42, the normalization applied to the output of step 41 is undone, and it is checked whether the noisy speech was zero-padded in the preprocessing stage; for noisy speech that was zero-padded, the corresponding zero-padded part of the model output is removed first, otherwise no such operation is performed;
step 43, the output features of step 42 are spliced, thereby completing the reconstruction of the enhanced speech.
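A minimal sketch of this post-processing (steps 42 and 43), assuming the mean, standard deviation and padding length were saved during preprocessing:

```python
import numpy as np

def synthesize(enhanced_blocks, mu, sigma, pad):
    """Steps 42-43: undo the z-score normalization, splice the enhanced
    blocks back into one waveform, and drop the zero-padded tail.
    mu, sigma and pad are the values saved during preprocessing."""
    blocks = enhanced_blocks * sigma + mu          # reverse the normalization
    waveform = blocks.reshape(-1)                  # splice blocks along time
    return waveform[:len(waveform) - pad] if pad else waveform
```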
Examples
The noisy speech is synthesized from clean speech and noise at different signal-to-noise ratios and is then denoised with the speech enhancement method of the invention. The specific steps are as follows:
1. Preprocess the data: first downsample, then split the noisy speech into blocks with a sliding window of length 16384, zero-padding the part that cannot be divided evenly and ensuring that the blocks do not overlap; each noisy speech block finally has dimension 16384;
2. Feed the noisy speech blocks into the down-sampling modules at the encoding end, each of which consists of a one-dimensional convolution layer, a batch normalization layer and an activation function layer in turn. In the convolution layers, zero padding in 'same' mode is used so that the input and output dimensions match; as the data stream passes through the encoding end, the sequence dimension of the feature map is halved by each down-sampling module in turn, so the final output feature of the encoding end has size 288 × 4, where 288 is the number of channels and 4 is the sequence dimension;
3. Feed the output features of the encoding end into the non-local module and the gated recurrent unit network in turn, which perform the weighting and the temporal-correlation computation on the encoder output respectively;
4. Feed the output of the gated recurrent unit network into the decoding end, which consists of the same number of up-sampling modules and whose structure mirrors that of the encoding end. It is associated with the encoding end through skip connections: before each skip connection, the output of the previous module is linearly interpolated along the sequence dimension with a scale factor of 2 and then concatenated along the channel dimension with the output of the corresponding down-sampling module, so the final output feature of the decoding end has size 24 × 16384. This is then concatenated with the input of the encoding end, giving a feature map with 25 channels and an unchanged sequence dimension; finally, the concatenated result is fed into the output layer for dimension normalization, and the final output feature has size 1 × 16384;
5. After the one-dimensional output of the enhanced speech is obtained, the zero-padded part introduced in preprocessing is removed and the blocks are spliced, completing the synthesis of the enhanced speech (a short shape sanity check in code follows this list).
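To make the shape bookkeeping of this example concrete, a short stand-alone sanity check of the stated sizes (288 × 4 at the encoder output, 24 × 16384 at the decoder output, 1 × 16384 after the output layer) might look like:

```python
seq = 16384
encoder_channels = [24 * i for i in range(1, 13)]        # 24, 48, ..., 288
for _ in encoder_channels:                               # each down-sampling module halves T
    seq //= 2
assert (encoder_channels[-1], seq) == (288, 4)           # encoder output: 288 x 4
decoder_channels = list(reversed(encoder_channels))      # 288, 264, ..., 24
for _ in decoder_channels:                               # each up-sampling module doubles T
    seq *= 2
assert (decoder_channels[-1], seq) == (24, 16384)        # decoder output: 24 x 16384
# splicing the raw input adds one channel (25 x 16384); the output layer
# Conv1d(25, 1, kernel_size=5, padding=2) yields the 1 x 16384 enhanced block
```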
Fig. 6 shows spectrograms for noisy speech contaminated with M109 (tank interior) noise at a signal-to-noise ratio of -3 dB and for the speech enhanced by the method of the present invention; the horizontal axis represents time, the vertical axis represents frequency, and the value at each point corresponds to the energy of the speech signal, represented by the colour intensity. Panel (a) is the spectrogram of the clean speech, panel (b) is the spectrogram of the noisy speech, and panel (c) is the spectrogram of the speech enhanced by the method. The spectrograms show that the enhanced speech obtained by the method restores the low-frequency details of the clean speech well, but the processing of the high-frequency part is less ideal: in the dashed box of Fig. 6(c), some residual noise remains.

Claims (3)

1. A speech enhancement method, comprising the following steps:
step 1, splitting noisy speech into blocks with a fixed-length sliding window, splicing the blocked speech, and then feeding it directly into the encoding end of a model;
step 2, adding a non-local module to the convolution layers at the encoding end to extract key features of the speech sequence while suppressing useless features, and adding a gated recurrent unit network to capture the temporal correlation between speech sequences;
step 3, feeding the output of the gated recurrent unit network into the non-local module at the decoding end, then introducing skip connections and concatenating the high-resolution feature maps of the encoding end with the low-resolution feature maps of the decoding end, thereby supplementing the detail information between feature maps;
step 4, adjusting the dimension of the output of the decoding end with a one-dimensional convolution layer, and splicing the output enhanced speech blocks in order along the time dimension, thereby completing the overall synthesis of the enhanced speech;
in step 1, splitting the noisy speech into blocks with a fixed-length sliding window, splicing the blocked speech and then feeding it directly into the encoding end of the model comprises the following steps:
step 11, downsampling the noisy speech to 16 kHz and then splitting it into blocks with a sliding window of length 16384, the speech blocks not overlapping, and zero-padding noisy speech whose length is not divisible by the window length;
step 12, splicing the noisy speech blocks obtained in step 11 in order along the vertical direction, the splicing result being expressed as:
Y = [y_1; y_2; ...; y_L]
where y_L denotes the L-th noisy speech block, each noisy speech block having the fixed length 16384;
step 13, normalizing the feature matrix of step 12 to zero mean and unit variance:
Ȳ = (Y − μ) / σ
where μ denotes the mean of the input data Y and σ denotes its standard deviation;
in step 2, adding a non-local module to the convolution layers at the encoding end to extract key features of the speech sequence while suppressing useless features, and adding a gated recurrent unit network to capture the temporal correlation between speech sequences comprises the following steps:
step 21, the input noisy speech time-domain feature vector is fed into 12 consecutive down-sampling modules for deep feature extraction, each down-sampling module comprising a one-dimensional convolution layer, an activation function layer and a batch normalization layer, the convolution operation being expressed as:
M_i = f(W·Y_i + b)
where M_i ∈ R^(C×F) is the output feature map of the convolution layer, C denotes the number of channels, F denotes the feature dimension, Y_i denotes the i-th input feature map, b is the corresponding bias term and W is the weight matrix of the corresponding convolution kernel; the number of convolution kernels is 24, 48, 72, 96, 120, 144, 168, 192, 216, 240, 264 and 288 in turn, the kernel size is 15 and the stride is 1; f(·) is the leaky rectified linear unit (leaky ReLU) activation function:
f(x) = x for x ≥ 0 and f(x) = a·x for x < 0
where a is a fixed value of 0.01;
step 22, the feature map M generated in step 21 is taken as the input of the non-local module: first, two one-dimensional convolution layers θ and ψ reduce the number of channels from C to C/2; the outputs of the two convolution layers are then multiplied as matrices, and the product is normalized with a softmax function to generate the attention weights:
α_ij = softmax_j(θ(m_i)^T·ψ(m_j))
where θ and ψ both denote one-dimensional convolution operations;
step 23, the attention weights α_ij of step 22 are matrix-multiplied with the mapping of the features m_j, which is also generated by a one-dimensional convolution, to obtain the output response y_i of the i-th position:
y_i = Σ_j α_ij·g(m_j)
where g(·) denotes this one-dimensional convolution;
step 24, the dimension of the output y_i of step 23 is adjusted by a convolution operation so that it is consistent with the input of the module, and a residual connection then adds it element-wise to the input to obtain the enhanced feature z_i containing the global information of the speech sequence:
z_i = W_z·y_i + m_i
where W_z denotes the weight matrix learned during training, and the convolution kernel sizes and strides used in steps 22, 23 and 24 are all 1;
step 25, the output of the non-local module is fed into a parallel gated recurrent unit network; the computations of the parallel networks are the same, and taking the upper network as an example, given the input x_t at time t, the forward computation includes:
r_t = σ(W_xr·x_t + W_hr·h_{t-1} + b_r)
z_t = σ(W_xz·x_t + W_hz·h_{t-1} + b_z)
where σ denotes the Sigmoid activation function, ⊙ denotes the element-wise product, W and b denote weights and biases, r_t and z_t denote the reset gate and the update gate respectively, h_{t-1} is the hidden state at time t-1, and the candidate hidden state is used to assist the computation of the hidden state h_t.
2. The speech enhancement method according to claim 1, wherein in step 3 the output of the gated recurrent unit network is fed into the non-local module at the decoding end, skip connections are then introduced, and the high-resolution feature maps of the encoding end are concatenated with the low-resolution feature maps of the decoding end, thereby supplementing the detail information between feature maps, comprising the following steps:
step 31, fusing the output results of the parallel gated recurrent networks; when the outputs of the two networks are Out_1 and Out_2 respectively, the fusion is expressed as:
Input_D = Add(Out_1; Out_2)
where Input_D denotes the fusion result and Add denotes the feature fusion operation;
step 32, feeding the output of step 31 into the non-local module at the decoding end and repeating the computations of steps 22 to 24 to obtain the weighted feature vectors;
step 33, feeding the result of step 32 into the decoding end, which consists of 12 consecutive up-sampling modules and is associated with the encoding end through skip connections; before each skip connection, the output of the previous module is linearly interpolated along the sequence dimension with a scale factor of 2 and then concatenated along the channel dimension with the output of the corresponding down-sampling module; each up-sampling module has the same structure as the down-sampling modules, except that the numbers of convolution kernels at the decoding end are 288, 264, 240, 216, 192, 168, 144, 120, 96, 72, 48 and 24 in turn, the convolution kernel size is 5 and the stride is 1.
3. The speech enhancement method according to claim 1, wherein in step 4 the dimension of the output of the decoding end is normalized with a one-dimensional convolution layer and the output enhanced speech blocks are spliced in order along the time dimension, thereby completing the overall synthesis of the enhanced speech, comprising the following steps:
step 41, concatenating the output of the decoding end with the original network input and feeding the result into the output convolution layer for dimension normalization, the number, size and stride of the convolution kernels of the output layer being 1, 5 and 1 in turn;
step 42, because the output of step 41 is one-dimensional and the preprocessed noisy speech blocks of step 11 do not overlap, the one-dimensional outputs predicted by the model are spliced directly; it should be pointed out that noisy speech whose length is not divisible by the window length is zero-padded during preprocessing, so in that case the corresponding zero-padded part of the model output is removed first and the splicing is then performed;
step 43, splicing the output features of step 42, thereby completing the reconstruction of the enhanced speech.
CN202110916018.6A 2021-08-11 2021-08-11 Voice enhancement method Active CN113674753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110916018.6A CN113674753B (en) 2021-08-11 2021-08-11 Voice enhancement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110916018.6A CN113674753B (en) 2021-08-11 2021-08-11 Voice enhancement method

Publications (2)

Publication Number Publication Date
CN113674753A CN113674753A (en) 2021-11-19
CN113674753B (en) 2023-08-01

Family

ID=78542169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110916018.6A Active CN113674753B (en) 2021-08-11 2021-08-11 Voice enhancement method

Country Status (1)

Country Link
CN (1) CN113674753B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826462A (en) * 2019-10-31 2020-02-21 上海海事大学 Human body behavior identification method of non-local double-current convolutional neural network model
CN113095106A (en) * 2019-12-23 2021-07-09 华为数字技术(苏州)有限公司 Human body posture estimation method and device
CN111242846B (en) * 2020-01-07 2022-03-22 福州大学 Fine-grained scale image super-resolution method based on non-local enhancement network
CN112509593B (en) * 2020-11-17 2024-03-08 北京清微智能科技有限公司 Speech enhancement network model, single-channel speech enhancement method and system
CN112967730A (en) * 2021-01-29 2021-06-15 北京达佳互联信息技术有限公司 Voice signal processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113674753A (en) 2021-11-19

Similar Documents

Publication Publication Date Title
Zhang et al. Automatic modulation classification using CNN-LSTM based dual-stream structure
CN110084734B (en) Big data ownership protection method based on object local generation countermeasure network
US10701303B2 (en) Generating spatial audio using a predictive model
CN110600018A (en) Voice recognition method and device and neural network training method and device
CN109890043B (en) Wireless signal noise reduction method based on generative countermeasure network
CN111028163A (en) Convolution neural network-based combined image denoising and weak light enhancement method
CN113470671B (en) Audio-visual voice enhancement method and system fully utilizing vision and voice connection
CN113256520B (en) Domain-adaptive underwater image enhancement method
CN111968666B (en) Hearing aid voice enhancement method based on depth domain self-adaptive network
Darrell et al. Audio-visual segmentation and “the cocktail party effect”
CN113269077A (en) Underwater acoustic communication signal modulation mode identification method based on improved gating network and residual error network
WO2022183806A1 (en) Voice enhancement method and apparatus based on neural network, and electronic device
CN114019467A (en) Radar signal identification and positioning method based on MobileNet model transfer learning
CN115731505A (en) Video salient region detection method and device, electronic equipment and storage medium
CN112289338A (en) Signal processing method and device, computer device and readable storage medium
Qi et al. Exploring deep hybrid tensor-to-vector network architectures for regression based speech enhancement
Jiang et al. An improved unsupervised single-channel speech separation algorithm for processing speech sensor signals
CN113674753B (en) Voice enhancement method
CN114830168A (en) Image reconstruction method, electronic device, and computer-readable storage medium
CN113035217B (en) Voice enhancement method based on voiceprint embedding under low signal-to-noise ratio condition
CN112927250B (en) Edge detection system and method based on multi-granularity attention hierarchical network
US11404055B2 (en) Simultaneous dereverberation and denoising via low latency deep learning
An et al. Channel estimation for one-bit massive MIMO based on improved cGAN
CN113902647A (en) Image deblurring method based on double closed-loop network
CN114022356A (en) River course flow water level remote sensing image super-resolution method and system based on wavelet domain

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant