CN113674753A - New speech enhancement method - Google Patents

New speech enhancement method

Info

Publication number
CN113674753A
CN113674753A (application CN202110916018.6A)
Authority
CN
China
Prior art keywords
output
speech
voice
convolution
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110916018.6A
Other languages
Chinese (zh)
Other versions
CN113674753B (en)
Inventor
陈彦男
邹波蓉
王伟东
景浩
李辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology
Priority to CN202110916018.6A
Publication of CN113674753A
Application granted
Publication of CN113674753B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 Speech or audio signals analysis-synthesis techniques using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L19/022 Blocking, i.e. grouping of samples in time; Choice of analysis windows; Overlap factoring
    • G10L19/025 Detection of transients or attacks for time/frequency resolution switching
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G10L25/45 Speech or voice analysis techniques characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a new speech enhancement method. It addresses the problem that existing non-end-to-end deep neural network speech enhancement methods neglect the importance of learning the phase spectrum, so that the quality of the enhanced speech is unsatisfactory and user needs are not well met. In the method, the noisy speech is first divided into blocks; a non-local module is added to the convolutional layers of the encoding end; a gated recurrent unit network is then added to capture the temporal correlation information among the speech sequences; finally, a one-dimensional convolutional layer adjusts the dimensionality of the output result, and the output enhanced speech blocks are spliced in order, thereby improving the quality and intelligibility of the enhanced speech.

Description

New speech enhancement method
Technical Field
The invention provides a new speech enhancement method and relates to the field of speech signal processing.
Background
In the course of human development, speech has become the most practical means of everyday communication because it is convenient and efficient. In modern society in particular, advances in technology have driven the broad development of speech signal processing, giving rise to a series of products that use speech as a carrier, such as smart speakers and earphones, voice assistants, smart recording pens and translators; such intelligent devices greatly facilitate daily life. In actual use, however, environmental noise, electrical noise inside the device and the voices of other speakers seriously degrade device performance and can even cause speech distortion. Effectively solving the problem of using intelligent devices in noisy environments is therefore of great significance for improving the user experience, and speech enhancement, as a speech processing technology, can address this problem well. Speech enhancement aims to improve the quality and intelligibility of speech signals contaminated by noise; a corresponding enhancement algorithm is designed to extract the clean speech signal from the noisy speech and suppress the interference of background noise. Although speech enhancement technology can reduce the adverse effects of noise to a certain extent, the diversity, complexity and abruptness of noise in real environments undoubtedly place higher demands on it.
The speech enhancement task originates from the "cocktail party problem," which describes the following phenomenon: at a party filled with all kinds of environmental sounds, people can easily focus on a target voice of interest while paying less attention to other background sounds. This auditory selective attention is an adaptive capability of the human auditory system that machines do not possess, so designing a computational auditory model that can acquire target information in a noisy environment has long been the research focus of this problem. As a key step in speech signal processing, speech enhancement has broad application prospects in fields such as speech recognition and mobile communication. Most existing speech recognition systems are built under idealized conditions, that is, on speech data recorded without noise; in a noisy environment, and especially in a strong-noise environment, the recognition accuracy of such a system drops sharply. Speech enhancement can serve as the front end of a recognition system, preprocessing the noisy speech to improve the recognition rate. In mobile communication, both parties are often subject to noise interference in real scenes; in particular, when one party is on the street or in a restaurant, the signal is easily distorted during transmission, which seriously degrades communication quality and the subjective listening experience of both parties. Applying an appropriate speech enhancement technique at the transmitting end can effectively filter out the noise and thereby improve the speech quality at the receiving end.
The academic community has studied the speech enhancement problem for decades. Unlike tasks such as text classification or object detection, the processing of speech signals is inherently more complex: the noise in practical application scenarios is diverse, and the characteristics of different speakers and the perceptual properties of the human ear must also be considered, so all of these factors have to be weighed comprehensively when selecting a suitable speech enhancement algorithm. In recent years, deep learning techniques have been rapidly and widely applied in many fields and have achieved remarkable results thanks to their strong learning ability, and they are therefore favored by researchers. In the speech enhancement task, this type of data-driven approach completes the enhancement process by establishing a nonlinear mapping between noisy speech and clean speech. To promote the development of deep learning algorithms in the field of speech enhancement, this work studies them in depth and improves the enhancement model in view of the characteristics of the field, so as to enhance speech quality and intelligibility while improving the generalization of the model in unknown noise environments.
The invention provides a new speech enhancement method and constructs an encoder-decoder network model incorporating a non-local module. First, a sliding window of fixed length is applied to the time-domain noisy speech to divide it into blocks, and the blocks are spliced and used as the input of the encoding end of the model, so that both the amplitude information and the phase information of the speech are fully utilized. Second, a non-local module is added to the convolutional layers of the encoding end to extract the key features of the speech sequence while suppressing useless features, and a gated recurrent unit network is added to capture the temporal correlation information among the speech sequences. The output of the gated recurrent unit network is then fed into the non-local module of the decoding end, skip connections are introduced, and the high-resolution feature maps of the encoding end are spliced with the low-resolution feature maps of the decoding stage to supplement the detail information between the feature maps. Finally, a one-dimensional convolutional layer adjusts the dimensionality of the output result, and the output enhanced speech blocks are spliced in order to complete the overall synthesis of the enhanced speech. Test results on a Chinese speech data set show that the invention effectively improves the quality and intelligibility of the enhanced speech.
Disclosure of Invention
In view of the above, the main objective of the present invention is to remedy the unsatisfactory quality of enhanced speech caused by neglecting the importance of phase spectrum learning in non-end-to-end enhancement models.
To achieve this objective, the technical solution provided by the invention is as follows:
A new speech enhancement method, comprising the following steps:
Step 1, the input noisy speech data is preprocessed: the noisy speech is down-sampled, divided into blocks and spliced in the time domain in sequence, and the spliced results are fed into the model in order;
Step 2, the speech spliced in step 1 is fed into the convolutional layers of the encoding end; a non-local module is introduced after the last convolutional layer of the encoding end to extract the key features of the speech sequence while suppressing useless features, and a gated recurrent unit network is added to capture the temporal correlation information among the speech sequences;
Step 3, the output features of the parallel gated recurrent unit networks of step 2 are fused and fed into the non-local module of the decoding end; skip connections are then introduced, and the high-resolution feature maps of the encoding end are spliced with the low-resolution feature maps of the decoding end to supplement the detail information between the feature maps;
Step 4, a one-dimensional convolutional layer adjusts the output dimensionality of step 3, and the output enhanced speech blocks are spliced in order to complete the overall synthesis of the enhanced speech.
In summary, the present invention designs an encoder-decoder network that uses the time-domain representation of the speech as the input of the encoding end for deep feature extraction, thereby making full use of both the amplitude information and the phase information of the speech signal. A non-local module is added to the convolutional layers of the encoding and decoding ends, and a gated recurrent unit network is introduced to capture the temporal correlation information among the speech sequences, which improves the quality and intelligibility of the enhanced speech while reducing the interference of noise feature information.
Drawings
FIG. 1 is a schematic flow chart of a new speech enhancement method according to the present invention;
FIG. 2 is a schematic diagram of the operation flow of blocking and splicing noisy speech;
FIG. 3 is a schematic diagram of a deep feature extraction process performed at the encoding end;
FIG. 4 is a flow chart illustrating a decoding-side feature recovery process;
FIG. 5 is a schematic diagram of the overall synthesis flow of enhanced speech;
FIG. 6 is a diagram of an enhanced speech spectrogram obtained using the present invention;
the specific implementation mode is as follows:
the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is to be understood that the present invention is illustrative and not limited to the embodiments of the present invention, and the present invention may be implemented in other different specific embodiments. All other embodiments obtained by a person skilled in the art without making any inventive step are within the scope of protection of the present invention.
Fig. 1 is a schematic general flow chart of the new speech enhancement method according to the present invention. As shown in Fig. 1, the speech enhancement method based on a convolutional recurrent network and a non-local module according to the present invention comprises the following steps:
Step 1, the input noisy speech data is preprocessed by down-sampling, blocking and splicing in sequence;
Step 2, the spliced result of step 1 is fed into the down-sampling modules of the encoding end, each of which consists of a convolutional layer, a batch normalization layer and an activation function layer;
Step 3, the output features of the parallel gated recurrent unit networks of step 2 are fused and fed into the non-local module of the decoding end; skip connections are introduced, and the high-resolution feature maps of the encoding end are spliced with the low-resolution feature maps of the decoding end to supplement the detail information between the feature maps;
Step 4, a one-dimensional convolutional layer adjusts the output dimensionality of step 3, and the output enhanced speech blocks are spliced to complete the overall synthesis of the enhanced speech.
Fig. 2 is a schematic flow chart of the blocking and splicing of the noisy speech. As shown in Fig. 2, in step 1 the data is preprocessed by down-sampling, blocking and splicing the noisy speech in sequence, which comprises the following steps:
Step 11, the noisy speech is down-sampled to 16000 Hz and then divided into blocks using a sliding window with a window length of 16384 samples (about 1 s); the speech blocks do not overlap, and noisy speech whose length is not an integer multiple of the window length is zero-padded;
Step 12, the noisy speech blocks obtained in step 11 are spliced in order along the vertical direction, expressed as:

Y = [y_1; y_2; …; y_L]

where y_L denotes the L-th noisy speech block, and the length of every noisy speech block is the fixed value 16384;
Step 13, the feature matrix of step 12 is normalized to a mean of 0 and a variance of 1:

Ȳ = (Y − μ) / σ

where μ denotes the mean of the input data Y and σ denotes the standard deviation of the input data Y.
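For illustration, a minimal sketch of steps 11 to 13 is given below. PyTorch and the helper names (block_and_stack, normalize) are assumptions made for this sketch rather than part of the patent, and resampling to 16 kHz is assumed to have been done beforehand.

```python
import torch

BLOCK_LEN = 16384  # window length of steps 11-12 (about 1 s at 16 kHz)


def block_and_stack(noisy: torch.Tensor):
    """Split a 1-D noisy waveform into non-overlapping 16384-sample blocks,
    zero-padding the tail, and stack them vertically into Y = [y_1; ...; y_L]."""
    n_pad = (-noisy.numel()) % BLOCK_LEN               # zero padding for the last block
    padded = torch.nn.functional.pad(noisy, (0, n_pad))
    Y = padded.view(-1, BLOCK_LEN)                     # shape (L, 16384)
    return Y, n_pad


def normalize(Y: torch.Tensor):
    """Step 13: normalize the stacked blocks to zero mean and unit variance;
    mu and sigma are kept so that step 42 can undo the normalization."""
    mu, sigma = Y.mean(), Y.std()
    return (Y - mu) / sigma, mu, sigma
```

The returned n_pad, mu and sigma would be reused in steps 42 and 43, when the enhanced blocks are de-normalized, trimmed and spliced back together.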
Fig. 3 is a schematic diagram of the deep feature extraction process at the encoding end. As shown in Fig. 3, in step 2 the spliced result of step 1 is fed into the down-sampling modules of the encoding end, each of which consists of a convolutional layer, a batch normalization layer and an activation function layer; in addition, a non-local module is introduced after the last down-sampling module of the encoding end to extract the key features of the speech sequence while suppressing useless features, and a parallel gated recurrent unit network is added to capture the temporal correlation information among the speech sequences. This comprises the following steps:
and 21, continuously passing the input time domain matrix of the noisy speech through 12 down-sampling modules to extract deep features, wherein each down-sampling module comprises a one-dimensional convolution layer, an activation function layer and a batch normalization layer (BN). The convolution operation is represented as follows:
Mi=f(WgYi+b)
in the above formula
Figure BDA0003205608440000061
Is an output feature map of the convolutional layer, where C represents the number of channels, F represents the feature dimension, and Y representsiRepresenting the characteristic diagram of the ith input, b is a corresponding bias term, W is a weight matrix of a corresponding convolution kernel, the number of the convolution kernels is 24, 48, 72, 96, 120, 144, 168, 192, 216, 240, 264 and 288 in sequence, wherein the kernel size is 15, the step size is 1, f is a LeakyReLU activation function with a leakage correction linear unit, and the function is expressed as follows:
Figure BDA0003205608440000062
wherein a is a fixed value, and the value is generally 0.01;
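A sketch of one possible encoder down-sampling module under the hyper-parameters of step 21 (kernel size 15, stride 1, "same" padding, LeakyReLU with a = 0.01) is shown below. PyTorch and all identifier names are assumptions of this sketch, and the halving of the sequence dimension by simple decimation reflects the 16384 → 4 reduction quoted in the embodiment rather than an explicit statement in this step.

```python
import torch
import torch.nn as nn

ENC_CHANNELS = [24, 48, 72, 96, 120, 144, 168, 192, 216, 240, 264, 288]


class DownBlock(nn.Module):
    """One down-sampling module: Conv1d (k=15, stride=1, 'same' padding),
    LeakyReLU(a=0.01) and batch normalization, then a halving of the
    sequence dimension (here by decimation, which is an assumption)."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.conv = nn.Conv1d(c_in, c_out, kernel_size=15, stride=1, padding=7)
        self.act = nn.LeakyReLU(0.01)
        self.bn = nn.BatchNorm1d(c_out)

    def forward(self, x):                    # x: (batch, c_in, T)
        x = self.bn(self.act(self.conv(x)))  # M_i = f(W * Y_i + b), then BN
        return x[:, :, ::2]                  # halve the sequence dimension


encoder = nn.ModuleList(
    DownBlock(c_in, c_out)
    for c_in, c_out in zip([1] + ENC_CHANNELS[:-1], ENC_CHANNELS)
)
# one 16384-sample block enters as (batch, 1, 16384) and leaves as (batch, 288, 4)
```

Stacking the twelve modules with these channel counts reproduces the 288-channel, length-4 encoder output quoted in the embodiment.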
Step 22, the feature map M generated in step 21 is taken as the input of the non-local module. First, two one-dimensional convolutional layers are used to reduce the channel dimension of M from C to C/2; the outputs of the two convolutional layers are matrix-multiplied, and the product is normalized with a softmax function to generate the attention weights. The computation is expressed as:

α_{i,j} = softmax_j( θ(M_i)^T ψ(M_j) )

where θ and ψ both denote one-dimensional convolution operations;
Step 23, the attention weights α_{i,j} of step 22 are matrix-multiplied with the feature map m_j, which is likewise generated by a one-dimensional convolution, to obtain the output response of the i-th position:

y_i = Σ_j α_{i,j} · m_j

Step 24, a convolution operation is applied to the output y_i of step 23 to restore its dimensionality to that of the module input, and a residual connection then performs element-wise addition to obtain the enhanced feature z_i containing the global information of the speech sequence:

z_i = W_z · y_i + m_i

where W_z denotes a weight matrix learned during training. Note that the convolution kernels used in steps 22, 23 and 24 all have size 1 and stride 1;
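A minimal sketch of the non-local block of steps 22 to 24 follows. The branch that produces m_j is written here as a third 1×1 convolution g (as in common non-local network implementations); that choice, PyTorch, and all identifier names are assumptions of this sketch.

```python
import torch
import torch.nn as nn


class NonLocal1d(nn.Module):
    """Non-local block over a feature map M of shape (batch, C, T):
    attention = softmax(theta(M)^T psi(M)), y_i = sum_j a_ij m_j,
    z = W_z y + M via a residual connection (steps 22-24).
    All convolutions use kernel size 1 and stride 1."""

    def __init__(self, channels: int):
        super().__init__()
        c_half = channels // 2
        self.theta = nn.Conv1d(channels, c_half, kernel_size=1)
        self.psi = nn.Conv1d(channels, c_half, kernel_size=1)
        self.g = nn.Conv1d(channels, c_half, kernel_size=1)    # produces the m_j features
        self.w_z = nn.Conv1d(c_half, channels, kernel_size=1)  # restores the C channels

    def forward(self, m):                                # m: module input M, (batch, C, T)
        q = self.theta(m).transpose(1, 2)                # (batch, T, C/2)
        k = self.psi(m)                                  # (batch, C/2, T)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)    # (batch, T, T) attention weights
        v = self.g(m).transpose(1, 2)                    # (batch, T, C/2)
        y = torch.bmm(attn, v).transpose(1, 2)           # (batch, C/2, T)
        return self.w_z(y) + m                           # residual: z = W_z y + input
```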
Step 25, the output of the non-local module is fed into the parallel gated recurrent unit networks; the computation of the upper and lower parallel branches is identical. Taking the upper branch as an example, for the input x_t at time t the forward computation is:

r_t = σ(W_xr · x_t + W_hr · h_{t-1} + b_r)
z_t = σ(W_xz · x_t + W_hz · h_{t-1} + b_z)
h̃_t = tanh(W_xh · x_t + W_hh · (r_t ⊙ h_{t-1}) + b_h)
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where σ denotes the Sigmoid activation function, ⊙ denotes the element-wise product, W and b denote the weights and biases respectively, r_t and z_t denote the reset gate and the update gate, h_{t-1} is the hidden state at time t-1, and h̃_t denotes the candidate hidden state used to compute the hidden state h_t.
Fig. 4 is a schematic flow chart of the feature recovery process at the decoding end. As shown in Fig. 4, in step 3 the output features of the parallel gated recurrent unit networks of step 2 are fused and fed into the non-local module of the decoding end; skip connections are then introduced, and the high-resolution feature maps of the encoding end are spliced with the low-resolution feature maps of the decoding stage to supplement the detail information between the feature maps. This comprises the following steps:
and step 31, merging the output results of the parallel gating cycle network. Suppose the output results of the two networks are Out respectively1And Out2The fusion process is represented as:
InputD=Add(Out1;Out2)
wherein InputDRepresenting a fusion result (input at a decoding end), and Add represents a feature fusion mode;
Step 32, the output of step 31 is fed into the non-local module of the decoding end, and the computations of steps 22 to 24 are repeated to obtain the weighted feature vector;
Step 33, the result of step 32 is fed into the decoding end, which consists of 12 consecutive up-sampling modules and is linked to the encoding end through skip connections. Before each skip connection, the output of the preceding part is linearly interpolated with a scale factor of 2 along the sequence dimension and then spliced with the output of the corresponding down-sampling module along the channel dimension. The structure of each up-sampling module is the same as that of the down-sampling modules, except that the numbers of convolution kernels at the decoding end are 288, 264, 240, 216, 192, 168, 144, 120, 96, 72, 48 and 24 in sequence, the kernel size is 5, and the stride is 1.
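The following sketch shows one possible up-sampling module of step 33. PyTorch and the identifier names are assumptions, and the exact channel bookkeeping after the skip concatenation is not spelled out in the patent, so the input channel count c_in is left to the caller.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DEC_CHANNELS = [288, 264, 240, 216, 192, 168, 144, 120, 96, 72, 48, 24]  # decoder kernel counts


class UpBlock(nn.Module):
    """One decoder up-sampling module: linear interpolation with scale factor 2
    along the sequence dimension, channel-wise splicing with the matching
    encoder feature map (skip connection), then Conv1d (k=5, stride=1,
    'same' padding) + LeakyReLU + BatchNorm, mirroring the encoder modules."""

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.conv = nn.Conv1d(c_in, c_out, kernel_size=5, stride=1, padding=2)
        self.act = nn.LeakyReLU(0.01)
        self.bn = nn.BatchNorm1d(c_out)

    def forward(self, x, skip):
        # x: (batch, C, T); skip: corresponding encoder feature map of length 2T
        x = F.interpolate(x, scale_factor=2, mode="linear", align_corners=False)
        x = torch.cat([x, skip], dim=1)        # splice along the channel dimension
        return self.bn(self.act(self.conv(x)))
```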
Fig. 5 is a schematic diagram of the overall synthesis process of the enhanced speech. As shown in Fig. 5, in step 4 a one-dimensional convolutional layer adjusts the dimensionality of the output features of the decoding end, and the output enhanced speech blocks are spliced in order along the time dimension to complete the overall synthesis of the enhanced speech, which comprises the following steps:
Step 41, the output result of the decoding end is spliced with the original input of the network and then fed into the output layer for dimension adjustment; the number, size and stride of the convolution kernels of the output layer are 1, 5 and 1 respectively;
Step 42, the output result of step 41 is de-normalized, and it is determined whether zero padding was applied to the noisy speech in the preprocessing stage; for noisy speech that was zero-padded, the corresponding zero-padded part of the model prediction output is removed first, otherwise no operation is needed;
Step 43, the outputs of step 42 are spliced to complete the reconstruction of the enhanced speech.
Examples
In this embodiment, clean speech and noise are mixed into noisy speech at different signal-to-noise ratios, and the speech enhancement method of the invention is used to denoise the noisy speech, which specifically comprises the following steps:
1. The data is preprocessed: after down-sampling, the noisy speech is divided into blocks with a sliding window of length 16384, the part that cannot be evenly divided is zero-padded, and the blocks do not overlap, finally yielding noisy speech blocks of dimension 16384;
2. The noisy speech blocks are then fed into the down-sampling modules of the encoding end, each of which consists, in order, of a one-dimensional convolutional layer, a batch normalization layer and an activation function layer. In the convolutional layers, zero padding is performed in "Same" mode so that the input and output dimensions are identical; at the same time, as the data stream passes through the encoding end, the output feature of each down-sampling module is halved in the sequence dimension, so the final output feature size at the encoding end is 288 × 4, where 288 is the number of channels and 4 is the sequence dimension;
3. The output features of the encoding end are fed in turn into the non-local module and the gated recurrent unit network, which respectively weight the output features of the encoding end and compute the temporal correlation;
4. The output of the gated recurrent unit network is fed into the decoding end, which consists of the same number of up-sampling modules and whose structure is consistent with that of the encoding end. It is linked to the encoding end through skip connections; before each skip connection, the output of the preceding part is linearly interpolated with a scale factor of 2 along the sequence dimension and then spliced with the output of the corresponding down-sampling module along the channel dimension, so the final output feature size of the decoding end is 24 × 16384. This output is then spliced with the input of the encoding end, giving a spliced feature map with 25 channels and an unchanged sequence dimension; finally, the spliced result is fed into the output layer for dimension adjustment, and the final output feature size is 1 × 16384;
5. After the one-dimensional output result of the enhanced speech is obtained, the zero-padded part introduced in preprocessing is removed and the blocks are spliced, completing the synthesis of the enhanced speech.
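The feature sizes quoted above (288 × 4 at the encoder output, 24 × 16384 at the decoder output) can be checked with a few lines of arithmetic; this is plain Python for illustration only.

```python
# each of the 12 down-sampling modules halves the sequence dimension
seq = 16384
for _ in range(12):
    seq //= 2
print(seq)   # 4  -> encoder output is 288 channels x 4 samples

# each of the 12 up-sampling modules doubles it again via interpolation
for _ in range(12):
    seq *= 2
print(seq)   # 16384 -> decoder output is 24 channels x 16384 samples
# splicing in the 1-channel network input gives 25 x 16384, and the single-kernel
# output layer then yields the final 1 x 16384 enhanced block
```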
FIG. 6 shows spectrograms for speech contaminated by M109 (tank interior) noise at a signal-to-noise ratio of -3 dB and enhanced using the method of the present invention; the horizontal axis represents time, the vertical axis represents frequency, and the value at each coordinate point corresponds to the energy of the speech signal, represented by shading. Graph (a) is the spectrogram of the clean speech, graph (b) is the spectrogram of the noisy speech, and graph (c) is the spectrogram of the enhanced speech obtained with the method. The spectrograms show that the enhanced speech obtained by the method of the present invention restores the low-frequency detail information of the clean speech well, while the processing of the high-frequency part is less ideal: within the dashed box of FIG. 6(c) some noise redundancy remains.

Claims (5)

1. A new speech enhancement method, characterized in that it comprises the following steps:
step 1, a sliding window of fixed length is used to divide the noisy speech into blocks, and the blocked speech is spliced and then fed directly into the encoding end of the model;
step 2, a non-local module is added to the convolutional layers of the encoding end to extract the key features of the speech sequence while suppressing useless features, and a gated recurrent unit network is added to capture the temporal correlation information among the speech sequences;
step 3, the output of the gated recurrent unit network is fed into the non-local module of the decoding end; skip connections are then introduced, and the high-resolution feature maps of the encoding end are spliced with the low-resolution feature maps of the decoding end to supplement the detail information between the feature maps;
step 4, a one-dimensional convolutional layer adjusts the dimensionality of the output result, and the output enhanced speech blocks are spliced in order to complete the overall synthesis of the enhanced speech.
2. The method of claim 1, wherein in step 1, the noisy speech is segmented by using a sliding window with a fixed length, and the segmented speech is spliced and then sent directly to the encoding end of the model, comprising the following steps:
step 11, down-sampling the noisy speech to 16kHz, then performing block processing by adopting a sliding window with the length of 16384 (about 1 second), wherein each speech block is not overlapped, and for the noisy speech which cannot be evenly divided by the window length, zero padding is needed;
step 12, the noisy speech blocks obtained in step 11 are spliced in order along the vertical direction, and the splicing result is expressed as:

Y = [y_1; y_2; …; y_L]

where y_L denotes the L-th noisy speech block, and the length of each noisy speech block is the fixed value 16384;
step 13, the feature matrix of step 12 is normalized to a mean of 0 and a variance of 1:

Ȳ = (Y − μ) / σ

where μ denotes the mean of the input data Y and σ denotes the standard deviation of the input data Y.
3. The method of claim 1, wherein in step 2, a non-local module is added to the convolutional layers of the encoding end to extract the key features of the speech sequence and suppress useless features, and a gated recurrent unit network is added to capture the temporal correlation information among the speech sequences, comprising the following steps:
step 21, the input time-domain feature vector of the noisy speech is fed into 12 consecutive down-sampling modules for deep feature extraction; each down-sampling module comprises a one-dimensional convolutional layer, an activation function layer and a batch normalization (BN) layer. The convolution operation is expressed as:

M_i = f(W · Y_i + b)

where M_i ∈ R^(C×F) is the output feature map of the convolutional layer, C denotes the number of channels, F denotes the feature dimension, Y_i denotes the i-th input feature map, b is the corresponding bias term, and W is the weight matrix of the corresponding convolution kernels; the numbers of convolution kernels are 24, 48, 72, 96, 120, 144, 168, 192, 216, 240, 264 and 288 in sequence, the kernel size is 15 and the stride is 1; f(·) is the LeakyReLU (leaky rectified linear unit) activation function, defined as:

f(x) = x,    x > 0
f(x) = a·x,  x ≤ 0

where a is a fixed value, generally taken as 0.01;
step 22, the feature map M generated in step 21 is taken as the input of the non-local module. First, two one-dimensional convolutional layers are used to reduce the channel dimension of M from C to C/2; the outputs of the two convolutional layers are matrix-multiplied, and the product is normalized with a softmax function to generate the attention weights. The computation is expressed as:

α_{i,j} = softmax_j( θ(M_i)^T ψ(M_j) )

where θ and ψ both denote one-dimensional convolution operations;
step 23, the attention weights α_{i,j} of step 22 are matrix-multiplied with the feature map m_j, likewise generated by a one-dimensional convolution, to obtain the output response y_i of the i-th position;
step 24, a convolution operation is applied to the output y_i of step 23 to restore its dimensionality to that of the module input, and a residual connection performs element-wise addition to obtain the enhanced feature z_i containing the global information of the speech sequence:

z_i = W_z · y_i + m_i

where W_z denotes a weight matrix learned during training; the convolution kernels used in steps 22, 23 and 24 all have size 1 and stride 1;
step 25, the output of the non-local module is fed into the parallel gated recurrent networks; the computation of the upper and lower parallel branches is identical. Taking the upper branch as an example, for the input x_t at time t the forward computation is:

r_t = σ(W_xr · x_t + W_hr · h_{t-1} + b_r)
z_t = σ(W_xz · x_t + W_hz · h_{t-1} + b_z)
h̃_t = tanh(W_xh · x_t + W_hh · (r_t ⊙ h_{t-1}) + b_h)
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where σ denotes the Sigmoid activation function, ⊙ denotes the element-wise product, W and b denote the weights and biases respectively, r_t and z_t denote the reset gate and the update gate, h_{t-1} is the hidden state at time t-1, and h̃_t denotes the candidate hidden state used to compute the hidden state h_t.
4. The method of claim 1, wherein in step 3, the output of the gated recurrent unit network is fed into the non-local module of the decoding end, skip connections are then introduced, and the high-resolution feature maps of the encoding end are spliced with the low-resolution feature maps of the decoding end to supplement the detail information between the feature maps, comprising the following steps:
and step 31, merging the output results of the parallel gating cycle network. Suppose the output results of the two networks are Out respectively1And Out2The fusion mode is expressed as:
InputD=Add(Out1;Out2)
wherein InputDRepresenting a fusion result (input at a decoding end), and Add represents a feature fusion mode;
step 32, the output of step 31 is fed into the non-local module of the decoding end, and the computations of steps 22 to 24 are repeated to obtain the weighted feature vector;
step 33, the result of step 32 is fed into the decoding end, which consists of 12 consecutive up-sampling modules and is linked to the encoding end through skip connections; before each skip connection, the output of the preceding part is linearly interpolated with a scale factor of 2 along the sequence dimension and then spliced with the output of the corresponding down-sampling module along the channel dimension; the structure of each up-sampling module is the same as that of the down-sampling modules, except that the numbers of convolution kernels at the decoding end are 288, 264, 240, 216, 192, 168, 144, 120, 96, 72, 48 and 24 in sequence, the kernel size is 5, and the stride is 1.
5. The method of claim 1, wherein in step 4, a one-dimensional convolutional layer adjusts the dimensionality of the output result, and the output enhanced speech blocks are spliced in order to complete the overall synthesis of the enhanced speech, comprising the following steps:
step 41, the output result of the decoding end is spliced with the original input of the network and then fed into the output convolutional layer for dimension adjustment; the number, size and stride of the convolution kernels of the output layer are 1, 5 and 1 respectively;
step 42, because the output result of step 41 is one-dimensional and the preprocessed noisy speech blocks of step 11 do not overlap, the one-dimensional outputs predicted by the enhancement model can be spliced directly; note that, in the preprocessing, zero padding is applied to noisy speech whose length is not an integer multiple of the window length, so in that case the corresponding zero-padded part of the model prediction output is removed first and the splicing is then performed;
step 43, the outputs of step 42 are spliced to complete the reconstruction of the enhanced speech.
CN202110916018.6A 2021-08-11 2021-08-11 Voice enhancement method Active CN113674753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110916018.6A CN113674753B (en) 2021-08-11 2021-08-11 Voice enhancement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110916018.6A CN113674753B (en) 2021-08-11 2021-08-11 Voice enhancement method

Publications (2)

Publication Number Publication Date
CN113674753A (en) 2021-11-19
CN113674753B CN113674753B (en) 2023-08-01

Family

ID=78542169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110916018.6A Active CN113674753B (en) 2021-08-11 2021-08-11 Voice enhancement method

Country Status (1)

Country Link
CN (1) CN113674753B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826462A (en) * 2019-10-31 2020-02-21 上海海事大学 Human body behavior identification method of non-local double-current convolutional neural network model
CN113095106A (en) * 2019-12-23 2021-07-09 华为数字技术(苏州)有限公司 Human body posture estimation method and device
CN111242846A (en) * 2020-01-07 2020-06-05 福州大学 Fine-grained scale image super-resolution method based on non-local enhancement network
CN112509593A (en) * 2020-11-17 2021-03-16 北京清微智能科技有限公司 Voice enhancement network model, single-channel voice enhancement method and system
CN112967730A (en) * 2021-01-29 2021-06-15 北京达佳互联信息技术有限公司 Voice signal processing method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
袁文浩 et al.: "基于卷积门控循环神经网络的语音增强方法" [Speech enhancement method based on a convolutional gated recurrent neural network], 《华中科技大学学报(自然科学版)》 [Journal of Huazhong University of Science and Technology (Natural Science Edition)] *

Also Published As

Publication number Publication date
CN113674753B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
Yin et al. Phasen: A phase-and-harmonics-aware speech enhancement network
Zhang et al. Automatic modulation classification using CNN-LSTM based dual-stream structure
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
WO2020042707A1 (en) Convolutional recurrent neural network-based single-channel real-time noise reduction method
CN111081268A (en) Phase-correlated shared deep convolutional neural network speech enhancement method
CN109890043B (en) Wireless signal noise reduction method based on generative countermeasure network
CN110600018A (en) Voice recognition method and device and neural network training method and device
CN113470671B (en) Audio-visual voice enhancement method and system fully utilizing vision and voice connection
Liu et al. Cp-gan: Context pyramid generative adversarial network for speech enhancement
Shi et al. Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation.
CN112735460B (en) Beam forming method and system based on time-frequency masking value estimation
WO2022183806A1 (en) Voice enhancement method and apparatus based on neural network, and electronic device
CN115602152B (en) Voice enhancement method based on multi-stage attention network
CN115731505A (en) Video salient region detection method and device, electronic equipment and storage medium
Qi et al. Exploring deep hybrid tensor-to-vector network architectures for regression based speech enhancement
CN117174105A (en) Speech noise reduction and dereverberation method based on improved deep convolutional network
Qiu et al. Adversarial multi-task learning with inverse mapping for speech enhancement
Jiang et al. An improved unsupervised single-channel speech separation algorithm for processing speech sensor signals
CN112289338B (en) Signal processing method and device, computer equipment and readable storage medium
Jannu et al. Shuffle attention u-Net for speech enhancement in time domain
CN114830168A (en) Image reconstruction method, electronic device, and computer-readable storage medium
CN113674753B (en) Voice enhancement method
Zhou et al. Speech Enhancement via Residual Dense Generative Adversarial Network.
WO2022213825A1 (en) Neural network-based end-to-end speech enhancement method and apparatus
CN113035217B (en) Voice enhancement method based on voiceprint embedding under low signal-to-noise ratio condition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant