CN112309411A - Phase-sensitive gated multi-scale void convolutional network speech enhancement method and system - Google Patents
- Publication number
- CN112309411A (application CN202011332442.8A)
- Authority
- CN
- China
- Prior art keywords
- scale
- real
- phase
- voice
- gated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention provides a phase-sensitive gated multi-scale dilated convolutional network speech enhancement method: a neural network model constructs a mapping between the complex spectra of speech signals, the real and imaginary spectra of noisy speech obtained by time-frequency analysis are mapped to enhanced real and imaginary spectra, and these are restored to an enhanced time-domain speech signal. The invention also provides a phase-sensitive gated multi-scale dilated convolutional network speech enhancement system. The beneficial effects of the invention are: the speech enhancement effect is improved, the enhanced speech retains good intelligibility, and speech distortion is largely avoided.
Description
Technical Field
The invention relates to a speech enhancement method, in particular to a phase-sensitive gated multi-scale dilated convolutional network speech enhancement method and system.
Background
Early hearing experiments showed that when the signal-to-noise ratio is above 6 dB, phase distortion has little influence on speech quality and intelligibility. Most current single-channel speech enhancement methods therefore perform noise reduction only in the amplitude domain of the speech signal and reconstruct the signal directly with the noisy phase. However, when a voice product faces a harsher acoustic scene, for example a signal-to-noise ratio below 0 dB, or noise that completely submerges the speech at certain moments, enhancing only the amplitude cannot guarantee good intelligibility of the enhanced speech, and distortion problems such as trembling and buzzing may even appear.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a phase-sensitive gated multi-scale dilated convolutional network speech enhancement method and system.
The invention provides a phase-sensitive gated multi-scale dilated convolutional network speech enhancement method in which a neural network model constructs a mapping between the complex spectra of speech signals: the real and imaginary spectra of noisy speech obtained by time-frequency analysis are mapped to enhanced real and imaginary spectra, which are then restored to an enhanced time-domain speech signal.
As a further improvement of the invention, the noisy speech signal is first framed and windowed, then short-time Fourier transformed to obtain its complex spectrum; the real and imaginary parts are separated and only the effective (non-redundant, one-sided) values are kept, giving two sets of input features: real-part features and imaginary-part features.
As a further improvement of the present invention, the two sets of input features are then fed into a gated multi-scale dilated convolutional network model.
As a further improvement of the invention, the processing flow of the gated multi-scale dilated convolutional network model comprises the following steps: a gated encoding module first performs a gated encoding operation to obtain a high-dimensional nonlinear feature representation; a multi-scale feature analysis module then performs temporal feature analysis on the encoded real-part and imaginary-part feature representations separately; and a gated decoding module performs gated decoding operations separately to obtain the enhanced real-imaginary spectrum.
As a further improvement of the invention, the enhanced real-imaginary spectrum is inverse-Fourier-transformed and overlap-added to finally obtain the enhanced speech signal.
As a further improvement of the present invention, the gated encoding module is formed by stacking at least two gated linear encoding units; each unit uses a 1 × 3 convolution kernel and performs a two-dimensional convolution with a stride of 1 × 2.
As a further improvement of the invention, the output of each gated linear encoding unit is passed through an exponential linear activation to perform a nonlinear transformation of the features.
As a further refinement of the present invention, the input to the multi-scale feature analysis module comprises two sets of features: (1) real or imaginary spectra of the original noisy speech; (2) real or imaginary features of the gated encoding module output.
As a further improvement of the present invention, the multi-scale feature analysis module is formed by stacking at least two multi-scale analysis units. Each multi-scale analysis unit concatenates two sets of feature tensors; before concatenation, the two tensors must be reshaped into three-dimensional tensors of shape [sentence number, sentence length, 322]. The concatenated feature tensor is then decomposed into 8 sub-bands: the first 7 have shape [sentence number, sentence length, 40] and the last has shape [sentence number, sentence length, 42]. The input of the current sub-band is concatenated with the convolution outputs of the adjacent sub-bands before the one-dimensional dilated convolution is applied, and each sub-band convolution is followed by an exponential linear activation. After several multi-scale analysis units, a 1024-dimensional fully connected layer expands the multi-scale-analyzed features, and the output feature tensor is reshaped into the 4-dimensional form [sentence number, sentence length, 4, 256]. The two reshaped feature tensors are then sent separately to the gated decoding module for decoding.
The invention also provides a phase-sensitive gated multi-scale dilated convolutional network speech enhancement system, comprising a readable storage medium in which execution instructions are stored; when executed by a processor, the instructions implement the method according to any one of the above.
The beneficial effects of the invention are: the scheme improves the speech enhancement effect, ensures that the enhanced speech retains good intelligibility, and largely avoids speech distortion.
Drawings
FIG. 1 is a block diagram of the processing flow of the phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of the present invention.
FIG. 2 is a diagram of the gated multi-scale dilated convolutional network structure of the method.
FIG. 3 is a block diagram of the gated linear encoding and decoding units of the method.
FIG. 4 is a diagram of the multi-scale analysis unit of the method.
Detailed Description
The invention is further described with reference to the following description and embodiments in conjunction with the accompanying drawings.
A phase-sensitive gated multi-scale dilated convolutional network speech enhancement method aims to use a neural network model to construct a mapping between the complex spectra of speech signals: the real and imaginary spectra of noisy speech, obtained by time-frequency analysis, are mapped to enhanced real and imaginary spectra, which are restored to an enhanced time-domain speech signal. The processing flow of the whole algorithm is shown in Fig. 1. The dashed-line part is the gated multi-scale dilated convolutional network designed by the invention, the core module of the whole algorithm; it performs noise reduction on the real and imaginary spectra of the noisy speech through three modules: gated encoding, multi-scale feature analysis and gated decoding.
As shown in Fig. 1, the noisy speech signal is first framed and windowed, then short-time Fourier transformed to obtain its complex spectrum; the real and imaginary parts are separated and only the effective (one-sided) values are kept, giving two sets of input features: real-part features and imaginary-part features. The two feature sets are then fed into the gated multi-scale dilated convolutional network model: a gated encoding operation first produces a high-dimensional nonlinear feature representation, the multi-scale feature analysis module then performs temporal feature analysis on the encoded real-part and imaginary-part representations separately, and separate decoding yields the enhanced real-imaginary spectrum. The modules of the gated multi-scale dilated convolutional network are detailed below.
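The time-frequency analysis step above can be sketched as follows. This is a minimal sketch, assuming the parameters stated later in the text (16 kHz sampling, 20 ms frames, 10 ms overlap); `scipy.signal.stft` stands in for whatever STFT implementation the authors actually used:

```python
import numpy as np
from scipy.signal import stft

# Parameters inferred from the description: 16 kHz sampling,
# 20 ms frames (320 samples), 10 ms (160-sample) overlap.
fs = 16000
frame_len = 320
hop = 160

rng = np.random.default_rng(0)
noisy = rng.standard_normal(fs)  # 1 s of stand-in "noisy speech"

# One-sided STFT: 320-point frames give 320 // 2 + 1 = 161 frequency bins,
# matching the per-frame feature length of 161 stated in the text.
_, _, Y = stft(noisy, fs=fs, nperseg=frame_len, noverlap=frame_len - hop)

X_real = Y.real.T  # (num_frames, 161): real-part input features
X_imag = Y.imag.T  # (num_frames, 161): imaginary-part input features
```

The one-sided spectrum is what the text calls the "effective value part": the remaining bins of a real signal's 320-point transform are redundant conjugates.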
The detailed structure of the gated multi-scale dilated convolutional network is shown in Fig. 2 and comprises three parts: gated encoding, multi-scale feature analysis and gated decoding. The real- and imaginary-part features of the input noisy speech, X_real(n, k) and X_imag(n, k), first enter the gated encoding part for feature transformation. The structure of the gated linear encoding unit is shown in Fig. 3(a). The tensor shape of the input real and imaginary features is [sentence number, sentence length, 161, 2]: with a 16 kHz sampling rate, the speech frames are 20 ms long with 10 ms overlap, so 161 in the third dimension is the feature length per frame of the real or imaginary part, and 2 in the fourth dimension indexes the real and imaginary parts. Five gated linear encoding units are stacked; each uses a 1 × 3 convolution kernel and performs a two-dimensional convolution with a stride of 1 × 2, with channel counts of 16, 32, 64, 128 and 256 respectively, so the output tensors of the 5 encoding units are, in turn: [sentence number, sentence length, 80, 16], [sentence number, sentence length, 39, 32], [sentence number, sentence length, 19, 64], [sentence number, sentence length, 9, 128] and [sentence number, sentence length, 4, 256]. To realize attention control among features, in each encoding unit a Sigmoid activation nonlinearly maps one convolution branch to probability values in [0, 1], which are then pointwise-multiplied onto the convolution output of the other branch in a gated attention manner.
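The gated attention mechanism just described can be sketched in numpy. This is illustrative only: plain matrix products stand in for the patent's 1 × 3 convolutions, and the projection sizes are our own choices, not the network's actual dimensions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_unit(x, w_lin, w_gate):
    """One branch is squashed to (0, 1) by a Sigmoid and pointwise-multiplied
    onto the parallel linear branch, i.e. gated attention between features."""
    gate = sigmoid(x @ w_gate)   # attention weights in (0, 1)
    return (x @ w_lin) * gate    # pointwise product with the other branch

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 161))       # 4 frames of 161-dim features
w_lin = rng.standard_normal((161, 80))  # illustrative projection sizes
w_gate = rng.standard_normal((161, 80))
y = gated_linear_unit(x, w_lin, w_gate)
```

Because the gate lies in (0, 1), it can only attenuate the linear branch, which is what lets it act as a soft attention mask over the features.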
In addition, the output of each gated linear encoding unit is passed through the exponential linear activation of formula (1) below for nonlinear feature transformation.
Here α is a parameter optimized during training; the exponential linear activation helps mitigate gradient vanishing during training and makes the model more robust to input noise.
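Given the description (a trainable parameter α and exponential behavior on the negative side), formula (1) is presumably the standard exponential linear unit (ELU):

```latex
f(x) =
\begin{cases}
x, & x > 0 \\
\alpha \left( e^{x} - 1 \right), & x \le 0
\end{cases}
\tag{1}
```

For negative inputs the output saturates smoothly at −α instead of cutting to zero, which is the property credited with easing gradient vanishing.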
Next, to fully exploit the context between speech frames, a multi-scale time-domain feature analysis method is adopted to analyze and synthesize feature information from past and current frames, capturing context that better supports estimating the current frame's features. The structure of the designed multi-scale analysis unit is shown in Fig. 4. Its input has two main parts: the real or imaginary spectrum of the original noisy speech, and the features output by the preceding module. These are concatenated; before concatenation, the two tensors must be reshaped into three-dimensional tensors of shape [sentence number, sentence length, 322]. The concatenated feature tensor is then decomposed into 8 sub-bands: the first 7 have shape [sentence number, sentence length, 40] and the last has shape [sentence number, sentence length, 42]. When convolving each sub-band, the input of the current sub-band is concatenated with the convolution outputs of its adjacent sub-bands before the one-dimensional dilated convolution is applied. Since 5 multi-scale analysis units are stacked, the dilation rates are gradually increased, to 1, 3, 5, 7 and 11 respectively, to better enlarge the receptive field of the convolutions. After each sub-band convolution, the exponential linear activation of equation (1) is applied. This sub-band concatenation scheme gives each convolutional layer a different receptive field range; along the decomposition direction the receptive field grows linearly, so the layers possess temporal feature analysis capabilities at different scales.
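A sketch of the sub-band decomposition and dilation schedule described above. The split widths and dilation rates come from the text; the kernel size of 3 used in the receptive-field arithmetic is our assumption:

```python
import numpy as np

# Sub-band decomposition as described: a 322-dim concatenated feature
# vector splits into 7 sub-bands of width 40 plus one of width 42.
widths = [40] * 7 + [42]                     # 7 * 40 + 42 = 322
features = np.arange(322.0)[None, None, :]   # [1 sentence, 1 frame, 322]
bounds = np.cumsum([0] + widths)
subbands = [features[..., bounds[i]:bounds[i + 1]] for i in range(8)]

# Dilation rates grow over the 5 stacked units: 1, 3, 5, 7 and 11.
# For stacked dilated convs of kernel size k (k = 3 is our assumption),
# the receptive field is 1 + sum((k - 1) * d) = 1 + 2 * 27 = 55 steps.
dilations = [1, 3, 5, 7, 11]
k = 3
receptive_field = 1 + sum((k - 1) * d for d in dilations)
```

The growing dilation rates are what make five thin layers cover a long temporal context without the cost of wide kernels.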
In addition, each multi-scale analysis unit should produce an intermediate estimate of the complex spectral features, which serves as input to the next multi-scale analysis unit. Therefore, a fully connected linear decoding layer is designed after the multi-scale convolutional layers; it linearly transforms the multi-scale features into an intermediate estimate of the real or imaginary part, whose tensor shape is [sentence number, sentence length, 161].
After the 5 multi-scale analysis units, a 1024-dimensional fully connected layer expands the multi-scale-analyzed features, and the output feature tensor is reshaped into the 4-dimensional form [sentence number, sentence length, 4, 256]. The two reshaped feature tensors are then sent separately to the gated linear decoding units for decoding. The operation of the linear decoding unit is shown in Fig. 3(b); unlike the encoding unit, it expands the feature tensor by a two-dimensional deconvolution (transposed convolution). Each decoding unit uses a 1 × 3 convolution kernel with a stride of 1 × 2, gradually widening each channel's features, while the channel count decreases through 128, 64, 32, 16 and 1, so the output tensors of the 5 decoding units are, in turn: [sentence number, sentence length, 9, 128], [sentence number, sentence length, 19, 64], [sentence number, sentence length, 39, 32], [sentence number, sentence length, 80, 16] and [sentence number, sentence length, 161, 1]. As with encoding, the output of each gated linear decoding unit undergoes the exponential linear activation of equation (1) for nonlinear feature transformation.
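The encoder's frequency-axis widths (161 → 80 → 39 → 19 → 9 → 4) are consistent with a kernel-3, stride-2 convolution without padding; the padding scheme is our assumption, but the arithmetic can be checked:

```python
# Width of a "valid" convolution along the frequency axis.  The stated
# encoder shapes 161 -> 80 -> 39 -> 19 -> 9 -> 4 follow from kernel 3,
# stride 2, no padding (the padding scheme is our assumption).
def conv_out(width, kernel=3, stride=2):
    return (width - kernel) // stride + 1

widths = [161]
for _ in range(5):
    widths.append(conv_out(widths[-1]))

# The decoder mirrors this with transposed convolutions; note that under
# these assumptions the 39 -> 80 step needs an output padding of 1,
# since (39 - 1) * 2 + 3 = 79.
```

This kind of shape bookkeeping is worth doing up front for any encoder-decoder pair, since odd widths (161, 39, 19, 9) do not invert cleanly under stride-2 deconvolution.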
After the neural network model is constructed, a large amount of data is needed to train it so that it acquires the ability to map to the clean real-imaginary spectrum. First, enough pairs of noisy-speech complex spectra and ideal (clean) speech complex spectra must be prepared as a training data set. We therefore select 4620 sentences from the TIMIT data set [1] as the clean training speech, and use 12 noise types from the NOISEX-92 noise library [2] (restaurant noise, 2 kinds of fighter-jet noise, 2 kinds of destroyer noise, factory noise, tank noise, Volvo car noise, high-frequency channel noise, white noise, Leopard combat-vehicle noise and machine-gun noise) as noise data to be randomly mixed with the clean speech. The mixing signal-to-noise ratio is uniformly distributed over [-5, 15] dB, yielding about 38 hours of noisy training data in total. To tune the model's parameters, a validation set is also required: 280 sentences are selected from the TIMIT test set as clean validation speech and uniformly mixed with the 12 training noises at signal-to-noise ratios from -5 dB to 15 dB. The loss function of the gated multi-scale dilated convolutional network during training is the mean squared error, computed as in equation (2), where n and k are the frame and frequency indices of the speech signal, X_real(n, k) and X_imag(n, k) are the ideal real and imaginary spectra, and X̂_real(n, k) and X̂_imag(n, k) are the real and imaginary spectra output by the neural network:

L = (1 / (N·K)) Σ_n Σ_k [ (X_real(n, k) − X̂_real(n, k))² + (X_imag(n, k) − X̂_imag(n, k))² ]   (2)
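The loss of equation (2) can be sketched directly in numpy over stand-in spectra (the array shapes are illustrative):

```python
import numpy as np

def complex_spectrum_mse(x_real, x_imag, est_real, est_imag):
    """Mean squared error summed over the real and imaginary spectra and
    averaged over frames n and frequency bins k, as in equation (2)."""
    return np.mean((x_real - est_real) ** 2 + (x_imag - est_imag) ** 2)

rng = np.random.default_rng(2)
ideal_r = rng.standard_normal((10, 161))  # stand-in ideal spectra
ideal_i = rng.standard_normal((10, 161))

loss_perfect = complex_spectrum_mse(ideal_r, ideal_i, ideal_r, ideal_i)
loss_noisy = complex_spectrum_mse(ideal_r, ideal_i, ideal_r + 0.1, ideal_i)
```

Summing the real and imaginary squared errors inside one loss is what makes the two spectra act as joint learning targets rather than independently trained outputs.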
during training, the overfitting problem of the model is reduced in a mode of 20% random neuron inactivation rate and batch normalization, backward propagation is carried out by using an Adam optimization algorithm, iteration is carried out for 50 times at a learning rate of 0.001, and then iteration is carried out for 10 times at a learning rate of 0.0001, so that a gated multi-scale cavity convolution network model with a mapped pure speech real-imaginary part spectrum can be obtained.
The following experiments verify the noise reduction effect of the proposed method. To evaluate the quality, intelligibility and distortion of the denoised speech, we adopt the PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility) and SDR (Signal-to-Distortion Ratio) indices. As shown in Table 1, all noise reduction results were measured on the test set; higher scores indicate better performance. The test set consists of another 320 sentences from the TIMIT test set, overlapping neither the training nor the validation set, mixed with the 12 trained noises and 3 untrained noises from NOISEX-92 (an untrained fighter-jet noise, an untrained factory noise, and pink noise) at five noise pollution levels: -5 dB, 0 dB, 5 dB, 10 dB and 15 dB. The results in Table 1 show that the proposed method not only achieves good noise reduction in trained noise scenarios but also generalizes well to untrained ones, demonstrating good model generalization. Even with transient noises such as factory noise and machine-gun fire, the method remains clearly effective: sharp background noise is almost inaudible and speech quality is well recovered. In low signal-to-noise-ratio environments the enhanced speech exhibits no humming or jitter. In addition, the latency of the designed method is below 30 ms, fully meeting the real-time requirements of most voice products.
TABLE 1 evaluation results of PESQ, STOI and SDR indexes under different noise environments
Unlike prior deep-neural-network noise reduction methods that enhance only in the amplitude domain, this method models the complex spectral information of the speech signal, i.e. the real-imaginary spectrum after the Fourier transform: it constructs a dilated convolutional neural network with a multi-scale encoding-decoding architecture and learns the mapping between noisy and clean signals in the complex domain, so that phase and amplitude information are optimized jointly. The main advantages of the algorithm are as follows:
(1) learning is carried out in a complex domain, enhancement of phase information is considered, and better speech intelligibility and speech quality can be realized in a low signal-to-noise ratio environment;
(2) real and imaginary part information of the complex spectrum is equivalent to two learning targets, and compared with a method for mapping a single magnitude spectrum, the multi-target model has better generalization performance;
(3) the modeling is carried out by utilizing a multi-scale convolution method, so that the context information in the voice can be captured more finely, and more voice details can be recovered;
(4) the designed model is a completely causal system, namely, the output of the model is only related to the information of the current frame and the past frame, and the time delay of the algorithm is reduced to the maximum extent.
References:
[1] J. S. Garofolo, "Getting started with the DARPA TIMIT CD-ROM: An acoustic-phonetic continuous speech database," NIST Tech Report, 1988.
[2] Andrew Varga, Herman J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, 1993.
The foregoing is a more detailed description of the invention with reference to specific preferred embodiments, and the invention is not to be considered limited to these details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered within the protection scope of the invention.
Claims (10)
1. A phase-sensitive gated multi-scale dilated convolutional network speech enhancement method, characterized by: constructing a mapping between complex spectra of speech signals with a neural network model, mapping the real-imaginary spectrum of noisy speech after time-frequency analysis to obtain an enhanced real-imaginary spectrum, and restoring the enhanced real-imaginary spectrum to an enhanced time-domain speech signal.
2. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 1, characterized by: first framing and windowing the noisy speech signal, then applying a short-time Fourier transform to obtain its complex spectrum, separating the real and imaginary parts and keeping only the effective (non-redundant) values, giving two sets of input features: real-part features and imaginary-part features.
3. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 2, characterized by: then feeding the two sets of input features into a gated multi-scale dilated convolutional network model.
4. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 3, characterized by: the processing flow of the gated multi-scale dilated convolutional network model comprises: a gated encoding module first performs a gated encoding operation to obtain a high-dimensional nonlinear feature representation; a multi-scale feature analysis module then performs temporal feature analysis on the encoded real-part and imaginary-part feature representations separately; and a gated decoding module performs gated decoding operations separately to obtain the enhanced real-imaginary spectrum.
5. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 4, characterized by: the enhanced real-imaginary spectrum is inverse-Fourier-transformed and overlap-added to finally obtain the enhanced speech signal.
6. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 4, characterized by: the gated encoding module is formed by stacking at least two gated linear encoding units, each using a 1 × 3 convolution kernel and performing a two-dimensional convolution with a stride of 1 × 2.
7. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 6, characterized by: the output of each gated linear encoding unit is passed through an exponential linear activation to perform a nonlinear transformation of the features.
8. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 4, characterized by: the input to the multi-scale feature analysis module comprises two sets of features: (1) the real or imaginary spectrum of the original noisy speech; (2) the real or imaginary features output by the gated encoding module.
9. The phase-sensitive gated multi-scale dilated convolutional network speech enhancement method of claim 8, characterized by: the multi-scale feature analysis module is formed by stacking at least two multi-scale analysis units; each multi-scale analysis unit concatenates two sets of feature tensors, which before concatenation must be reshaped into three-dimensional tensors of shape [sentence number, sentence length, 322]; the concatenated feature tensor is then decomposed into 8 sub-bands, the first 7 of shape [sentence number, sentence length, 40] and the last of shape [sentence number, sentence length, 42]; the input of the current sub-band is concatenated with the convolution outputs of the adjacent sub-bands before the one-dimensional dilated convolution is applied, and each sub-band convolution is followed by an exponential linear activation; after the stacked multi-scale analysis units, a 1024-dimensional fully connected layer expands the multi-scale-analyzed features and the output feature tensor is reshaped into the 4-dimensional form [sentence number, sentence length, 4, 256]; the two reshaped feature tensors are then sent separately to the gated decoding module for decoding.
10. A phase-sensitive gated multi-scale dilated convolutional network speech enhancement system, characterized by: comprising a readable storage medium storing execution instructions which, when executed by a processor, implement the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011332442.8A CN112309411A (en) | 2020-11-24 | 2020-11-24 | Phase-sensitive gated multi-scale void convolutional network speech enhancement method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011332442.8A CN112309411A (en) | 2020-11-24 | 2020-11-24 | Phase-sensitive gated multi-scale void convolutional network speech enhancement method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112309411A true CN112309411A (en) | 2021-02-02 |
Family
ID=74335732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011332442.8A Pending CN112309411A (en) | 2020-11-24 | 2020-11-24 | Phase-sensitive gated multi-scale void convolutional network speech enhancement method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112309411A (en) |
- 2020-11-24 CN CN202011332442.8A patent/CN112309411A/en active Pending
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113129873A (en) * | 2021-04-27 | 2021-07-16 | 思必驰科技股份有限公司 | Optimization method and system for stack type one-dimensional convolution network awakening acoustic model |
CN113707163A (en) * | 2021-08-31 | 2021-11-26 | 北京达佳互联信息技术有限公司 | Speech processing method and apparatus, and model training method and apparatus |
CN113707163B (en) * | 2021-08-31 | 2024-05-14 | 北京达佳互联信息技术有限公司 | Speech processing method and device and model training method and device |
CN114283829A (en) * | 2021-12-13 | 2022-04-05 | 电子科技大学 | Voice enhancement method based on dynamic gate control convolution cyclic network |
CN114283829B (en) * | 2021-12-13 | 2023-06-16 | 电子科技大学 | Voice enhancement method based on dynamic gating convolution circulation network |
CN114842863A (en) * | 2022-04-19 | 2022-08-02 | 电子科技大学 | Signal enhancement method based on multi-branch-dynamic merging network |
CN114842863B (en) * | 2022-04-19 | 2023-06-02 | 电子科技大学 | Signal enhancement method based on multi-branch-dynamic merging network |
CN115862581A (en) * | 2023-02-10 | 2023-03-28 | 杭州兆华电子股份有限公司 | Secondary elimination method and system for repeated pattern noise |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112309411A (en) | Phase-sensitive gated multi-scale void convolutional network speech enhancement method and system | |
CN109841226B (en) | Single-channel real-time noise reduction method based on convolution recurrent neural network | |
CN111971743B (en) | Systems, methods, and computer readable media for improved real-time audio processing | |
US20220004870A1 (en) | Speech recognition method and apparatus, and neural network training method and apparatus | |
CN112581973B (en) | Voice enhancement method and system | |
CN112735460A (en) | Beam forming method and system based on time-frequency masking value estimation | |
CN113936681A (en) | Voice enhancement method based on mask mapping and mixed hole convolution network | |
CN113096682B (en) | Real-time voice noise reduction method and device based on mask time domain decoder | |
JP2023546099A (en) | Audio generator, audio signal generation method, and audio generator learning method | |
Sun et al. | A model compression method with matrix product operators for speech enhancement | |
Guimarães et al. | Monaural speech enhancement through deep wave-U-net | |
Li et al. | A multi-objective learning speech enhancement algorithm based on IRM post-processing with joint estimation of SCNN and TCNN | |
Lim et al. | Harmonic and percussive source separation using a convolutional auto encoder | |
CN114818806A (en) | Gearbox fault diagnosis method based on wavelet packet and depth self-encoder | |
CN116486826A (en) | Voice enhancement method based on converged network | |
CN114067819A (en) | Speech enhancement method based on cross-layer similarity knowledge distillation | |
Ram et al. | Speech enhancement through improvised conditional generative adversarial networks | |
CN116391191A (en) | Generating neural network models for processing audio samples in a filter bank domain | |
CN117174105A (en) | Speech noise reduction and dereverberation method based on improved deep convolutional network | |
Raj et al. | Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients | |
CN116682444A (en) | Single-channel voice enhancement method based on waveform spectrum fusion network | |
Hao et al. | Optimizing the perceptual quality of time-domain speech enhancement with reinforcement learning | |
CN113571074B (en) | Voice enhancement method and device based on multi-band structure time domain audio frequency separation network | |
Parvathala et al. | Neural comb filtering using sliding window attention network for speech enhancement | |
Grzywalski et al. | Speech enhancement using U-nets with wide-context units |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||