CN112309411B - Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system - Google Patents
- Publication number
- CN112309411B (Application CN202011332442.8A)
- Authority
- CN
- China
- Prior art keywords
- gating
- voice
- real
- imaginary
- scale
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a phase-sensitive gated multi-scale dilated convolution network speech enhancement method, which uses a neural network model to construct a mapping between the complex spectra of speech signals: the real and imaginary spectra of noisy speech, obtained by time-frequency analysis, are mapped to enhanced real and imaginary spectra, which are then restored to an enhanced time-domain speech signal. The invention also provides a phase-sensitive gated multi-scale dilated convolution network speech enhancement system. The beneficial effects of the invention are as follows: the speech-enhancement effect is improved, the enhanced speech is guaranteed to retain good intelligibility, and speech-distortion problems are largely avoided.
Description
Technical Field
The invention relates to a speech enhancement method, in particular to a phase-sensitive gated multi-scale dilated convolution network speech enhancement method and system.
Background
Early auditory experiments showed that when the signal-to-noise ratio is above 6 dB, phase distortion has little influence on speech quality and intelligibility. Consequently, most current single-channel speech enhancement methods perform noise reduction only in the magnitude domain of the speech signal and reconstruct the waveform directly with the noisy phase. However, when the acoustic scene faced by a voice product is harsher, for example when the signal-to-noise ratio is below 0 dB or the noise locally drowns out the speech, enhancing only the magnitude cannot guarantee that the enhanced speech remains intelligible, and distortion problems such as trembling or buzzing sounds may appear.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a phase-sensitive gated multi-scale dilated convolution network speech enhancement method and system.
The invention provides a phase-sensitive gated multi-scale dilated convolution network speech enhancement method, which uses a neural network model to construct a mapping between the complex spectra of speech signals: the real and imaginary spectra of noisy speech, obtained by time-frequency analysis, are mapped to enhanced real and imaginary spectra, which are then restored to an enhanced time-domain speech signal.
As a further improvement of the present invention, the noisy speech signal is first framed and windowed, then a short-time Fourier transform yields its complex spectrum; the real and imaginary parts are separated, and only the non-redundant half of the spectrum is kept, giving two groups of input features: real-part features and imaginary-part features.
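As a rough illustration, the framing, windowing, and transform steps can be sketched in NumPy as follows. The 16 kHz sampling rate and 20 ms frame / 10 ms hop come from the detailed description below; the Hann window is an assumption, since the text only says "windowing", and `stft_features` is a hypothetical helper name.

```python
import numpy as np

def stft_features(x, sr=16000, frame_ms=20, hop_ms=10):
    """Frame, window, and FFT a waveform; return real/imag feature groups."""
    frame = sr * frame_ms // 1000              # 320 samples per frame
    hop = sr * hop_ms // 1000                  # 160-sample hop (10 ms overlap)
    win = np.hanning(frame)                    # window type assumed, not stated
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop: i * hop + frame] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)         # keep only the effective half: 161 bins
    return spec.real, spec.imag                # two groups, shape (n_frames, 161)
```

With a one-second 16 kHz signal this yields 99 frames of 161 real-part and 161 imaginary-part features, matching the per-frame feature length used by the network below.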
As a further improvement of the present invention, the two groups of input features are then fed into a gated multi-scale dilated convolution network model.
As a further improvement of the present invention, the processing flow of the gated multi-scale dilated convolution network model comprises: first, a gated encoding module performs a gated encoding operation to obtain a high-dimensional nonlinear feature representation; a multi-scale feature analysis module then performs temporal feature analysis separately on the encoded real-part and imaginary-part representations; finally, a gated decoding module performs gated decoding operations to obtain the enhanced real and imaginary spectra.
As a further improvement of the invention, the enhanced real and imaginary spectra are inverse-Fourier-transformed and overlap-added to finally obtain the enhanced speech signal.
As a further improvement of the invention, the gated encoding module is a stack of at least two gated linear encoding units, each of which applies a two-dimensional convolution with a 1×3 kernel and a 1×2 stride.
As a further improvement of the invention, the output of each gated linear encoding unit is passed through an exponential linear activation to apply a nonlinear transformation to the features.
As a further refinement of the invention, the input of the multi-scale feature analysis module comprises two groups of features: (1) the real or imaginary spectrum of the original noisy speech; (2) the real- or imaginary-part features output by the gated encoding module.
As a further improvement of the invention, the multi-scale feature analysis module is a stack of at least two multi-scale analysis units. Each unit concatenates the two groups of feature tensors; before concatenation, both tensors are reshaped into three-dimensional tensors of shape [number of sentences, sentence length, 322]. The concatenated feature tensor is then decomposed into 8 sub-bands: the first 7 have shape [number of sentences, sentence length, 40] and the last has shape [number of sentences, sentence length, 42]. The input of each sub-band is concatenated with the convolution output of the adjacent sub-band before a one-dimensional dilated convolution is applied, and each sub-band convolution is followed by an exponential linear activation. After several multi-scale analysis units, a 1024-dimensional fully connected layer expands the analyzed features, the output feature tensor is reshaped into a four-dimensional tensor of shape [number of sentences, sentence length, 4, 256], and the two reshaped feature tensors are sent separately to the gated decoding module for decoding.
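The sub-band decomposition described above partitions the 322-dimensional feature axis as 7 × 40 + 42 = 322. A minimal sketch (`split_subbands` is a hypothetical helper name):

```python
import numpy as np

def split_subbands(feat):
    """Split a [..., 322] feature tensor into the 8 sub-bands described
    in the text: seven of width 40 and a final one of width 42."""
    assert feat.shape[-1] == 322
    widths = [40] * 7 + [42]                      # 7*40 + 42 = 322
    edges = np.cumsum([0] + widths)
    return [feat[..., a:b] for a, b in zip(edges[:-1], edges[1:])]
```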
The invention also provides a phase-sensitive gated multi-scale dilated convolution network speech enhancement system, comprising a readable storage medium storing execution instructions which, when executed by a processor, implement any of the methods described above.
The beneficial effects of the invention are as follows: through this scheme, the speech-enhancement effect is improved, the enhanced speech is guaranteed to retain good intelligibility, and speech-distortion problems are largely avoided.
Drawings
FIG. 1 is a process flow diagram of the phase-sensitive gated multi-scale dilated convolution network speech enhancement method of the present invention.
FIG. 2 is a structure diagram of the gated multi-scale dilated convolution network of the phase-sensitive gated multi-scale dilated convolution network speech enhancement method of the present invention.
FIG. 3 is a block diagram of the gated linear encoding and decoding units of the phase-sensitive gated multi-scale dilated convolution network speech enhancement method of the present invention.
FIG. 4 is a block diagram of the multi-scale analysis unit of the phase-sensitive gated multi-scale dilated convolution network speech enhancement method of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and the detailed embodiments.
A phase-sensitive gated multi-scale dilated convolution network speech enhancement method uses a neural network model to construct a mapping between the complex spectra of speech signals: the real and imaginary spectra of noisy speech, obtained by time-frequency analysis, are mapped to enhanced real and imaginary spectra, which are then restored to an enhanced time-domain speech signal. The processing flow of the whole algorithm is shown in FIG. 1. The dashed part is the gated multi-scale dilated convolution network designed by the invention; it is the core module of the algorithm and denoises the real and imaginary spectra of the noisy speech through three modules: gated encoding, multi-scale feature analysis, and gated decoding.
As shown in FIG. 1, the noisy speech signal is first framed and windowed, then a short-time Fourier transform yields its complex spectrum; the real and imaginary parts are separated, and only the non-redundant half is kept, giving two groups of input features: real-part features and imaginary-part features. The two groups of features are then fed into the gated multi-scale dilated convolution network model: a gated encoding operation first produces a high-dimensional nonlinear feature representation, the multi-scale feature analysis module then performs temporal feature analysis on each encoded representation, and each is decoded to obtain the enhanced real and imaginary spectra. Each module of the gated multi-scale dilated convolution network is described in detail below.
The detailed structure of the gated multi-scale dilated convolution network is shown in FIG. 2; it consists of three parts: gated encoding, multi-scale feature analysis, and gated decoding. The real and imaginary features of the input noisy speech, X_real(n, k) and X_imag(n, k), first enter the gated encoding part for feature transformation. The structure of the gated linear encoding unit is shown in FIG. 3(a). The input real-imaginary feature tensor has shape [number of sentences, sentence length, 161, 2]: at a 16 kHz sampling rate with a 20 ms frame length and 10 ms frame overlap, 161 in the third dimension is the per-frame feature length of the real or imaginary part, and 2 in the fourth dimension indexes the two parts, real and imaginary. A total of 5 gated linear encoding units are stacked; each uses a 1×3 convolution kernel with a 1×2 stride, and the channel counts are 16, 32, 64, 128, and 256 respectively, so the output tensors of the 5 units are, in order: [number of sentences, sentence length, 80, 16], [number of sentences, sentence length, 39, 32], [number of sentences, sentence length, 19, 64], [number of sentences, sentence length, 9, 128], and [number of sentences, sentence length, 4, 256]. To realize attention control between features, a sigmoid activation squashes the convolution output of one branch of each encoding unit into a probability in [0, 1], which then point-multiplies the convolution output of the other branch in a gated-attention manner.
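The feature-axis lengths quoted above follow from the valid-convolution length rule for a 1×3 kernel with 1×2 stride, and the gating itself is a sigmoid-weighted point-wise product. A minimal sketch under those assumptions (`conv_out_len` and `gated_output` are hypothetical helper names):

```python
import numpy as np

def conv_out_len(n, k=3, s=2):
    # Valid convolution along the feature axis: kernel 1x3, stride 1x2.
    return (n - k) // s + 1

# Feature-axis lengths through the 5 stacked gated linear encoding units.
sizes = [161]
for _ in range(5):
    sizes.append(conv_out_len(sizes[-1]))
# sizes -> [161, 80, 39, 19, 9, 4], matching the tensor shapes above

def gated_output(conv_a, conv_b):
    """Gated attention: one branch is squashed into [0, 1] by a sigmoid
    and point-multiplies the other branch's convolution output."""
    return conv_a * (1.0 / (1.0 + np.exp(-conv_b)))
```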
In addition, the output of each gated linear encoding unit is passed through the exponential linear activation of equation (1) to apply a nonlinear transformation to the features:

f(x) = x for x > 0, and f(x) = α(e^x − 1) for x ≤ 0. (1)
Here α is a parameter optimized during training; the exponential linear activation helps alleviate vanishing gradients during training and makes the model more robust to input noise.
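A NumPy sketch of the exponential linear activation of equation (1); α is a plain scalar here, whereas in the network it is a trained parameter:

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential linear activation: identity for x > 0,
    alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```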
Next, to make full use of the contextual information in the speech signal, we use a multi-scale temporal feature analysis method to analyze and integrate the features of past and current frames and capture the context most useful for estimating the current frame. The structure of the designed multi-scale analysis unit is shown in FIG. 4. Its input has two main parts: the real or imaginary spectrum of the original noisy speech, and the features output by the previous module. These are concatenated; before concatenation, both feature tensors are reshaped into three-dimensional tensors of shape [number of sentences, sentence length, 322]. The concatenated tensor is then decomposed into 8 sub-bands: the first 7 have shape [number of sentences, sentence length, 40] and the last has shape [number of sentences, sentence length, 42]. When each sub-band is convolved, its input is concatenated with the output of the adjacent sub-band before a one-dimensional dilated convolution is applied. A total of 5 multi-scale analysis units are stacked, and to better enlarge the convolutional receptive field the dilation rates increase gradually as 1, 3, 5, 7, and 11. After each sub-band convolution, the exponential linear activation of equation (1) is applied. This sub-band splicing convolution scheme gives each convolution layer a different receptive-field range, growing linearly along the decomposition direction, so the layers have temporal feature analysis capability at different scales.
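The growth of the temporal receptive field across the stacked units can be sketched as follows. The dilation rates 1, 3, 5, 7, 11 come from the text; the kernel size 3 is an assumption, since it is not stated for the one-dimensional dilated layers, and `receptive_field` is a hypothetical helper name.

```python
def receptive_field(dilations, kernel=3):
    """Receptive field (in frames) of a stack of 1-D dilated convolutions:
    each layer with dilation d adds (kernel - 1) * d frames of context."""
    return 1 + (kernel - 1) * sum(dilations)

rf = receptive_field([1, 3, 5, 7, 11])   # 55 frames under the kernel-3 assumption
```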
We also want each multi-scale analysis unit to produce an intermediate estimate of the complex-spectrum features and pass it as input to the next unit. Therefore, after the multi-scale convolution layers we design a fully connected linear decoding layer that linearly transforms the multi-scale features into an intermediate estimate of the real or imaginary part, a tensor of shape [number of sentences, sentence length, 161].
After the 5 multi-scale analysis units, a 1024-dimensional fully connected layer expands the analyzed features, and the output feature tensor is reshaped into a four-dimensional tensor of shape [number of sentences, sentence length, 4, 256]. The two groups of reshaped feature tensors are then fed separately into the gated linear decoding units. The operation of a linear decoding unit is shown in FIG. 3(b). Unlike the encoding unit, it uses a two-dimensional deconvolution to expand the feature tensor: each decoding unit applies a 1×3 kernel with a 1×2 stride, gradually expanding the features of each channel, and the channel counts decrease as 128, 64, 32, 16, and 1, so the output tensors of the 5 decoding units are, in order: [number of sentences, sentence length, 9, 128], [number of sentences, sentence length, 19, 64], [number of sentences, sentence length, 39, 32], [number of sentences, sentence length, 80, 16], and [number of sentences, sentence length, 161, 1]. As with encoding, the output of each gated linear decoding unit is passed through the exponential linear activation of equation (1) to apply a nonlinear transformation.
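The decoder's feature-axis lengths follow the usual transposed-convolution length rule for a 1×3 kernel with 1×2 stride; note that reaching 80 from 39 exactly requires one unit of output padding at that stage, which is our inference rather than something the text states (`deconv_out_len` is a hypothetical helper name):

```python
def deconv_out_len(n, k=3, s=2, output_padding=0):
    """Length rule for the 1x3, stride-1x2 deconvolution used by the
    gated linear decoding units."""
    return (n - 1) * s + k + output_padding

sizes = [4]
for pad in (0, 0, 0, 1, 0):   # assumed: one stage needs padding to recover 80
    sizes.append(deconv_out_len(sizes[-1], output_padding=pad))
# sizes -> [4, 9, 19, 39, 80, 161], mirroring the encoder
```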
After the neural network model is constructed, it must be trained on a large amount of data so that it acquires the capability of mapping to the clean real-imaginary spectra. First, enough pairs of noisy and ideal speech complex spectra are prepared as the training set: we choose the 4620 sentences of the TIMIT dataset [1] as clean training speech, and use 12 kinds of noise from the NOISEX-92 [2] noise library (restaurant noise, 2 kinds of fighter-jet noise, 2 kinds of destroyer noise, factory noise, tank noise, Volvo car noise, high-frequency channel noise, white noise, Leopard military-vehicle noise, and machine-gun noise) as noise data, randomly mixed with the clean speech at signal-to-noise ratios in [−5, 15] dB, giving about 38 hours of noisy training data. To tune the model parameters a validation set is also needed: 280 sentences are selected from the TIMIT test set as clean validation speech and uniformly mixed with the 12 training noises at signal-to-noise ratios from −5 to 15 dB. The loss function for training the gated multi-scale dilated convolution network is the mean square error of equation (2), where n and k are the frame and frequency indices of the speech signal, X_real(n, k) and X_imag(n, k) are the ideal real-imaginary spectra, and X̂_real(n, k) and X̂_imag(n, k) are the real-imaginary spectra output by the network:

L = (1 / (N·K)) Σ_n Σ_k [ (X_real(n, k) − X̂_real(n, k))² + (X_imag(n, k) − X̂_imag(n, k))² ] (2)
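The mean-square-error loss of equation (2), computed jointly over the real and imaginary spectra, can be sketched as (`complex_mse` is a hypothetical helper name):

```python
import numpy as np

def complex_mse(x_real, x_imag, y_real, y_imag):
    """Mean square error over frames n and frequencies k, summed over
    the real-part and imaginary-part spectra."""
    return np.mean((x_real - y_real) ** 2 + (x_imag - y_imag) ** 2)
```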
During training, overfitting is reduced with a 20% random neuron dropout rate and batch normalization; back-propagation uses the Adam optimization algorithm, iterating 50 epochs at a learning rate of 0.001 and then 10 epochs at 0.0001, which yields a gated multi-scale dilated convolution network model that maps to the real-imaginary spectra of clean speech.
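The stepped learning-rate schedule above amounts to (a trivial sketch; `learning_rate` is a hypothetical helper name):

```python
def learning_rate(epoch):
    """50 epochs at 1e-3, then 10 further epochs at 1e-4."""
    return 1e-3 if epoch < 50 else 1e-4
```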
The following experiments verify the noise-reduction effect of the proposed method. To evaluate the quality, intelligibility, and distortion of the denoised speech, the PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and SDR (Signal-to-Distortion Ratio) metrics are used. As shown in Table 1, all results are measured on the test set, with higher values indicating better performance. The test set is a further 320 sentences selected from the TIMIT test set, disjoint from both the training and validation sets, mixed respectively with the 12 trained noises and 3 untrained noises from NOISEX-92 (an untrained fighter-jet noise, an untrained factory noise, and pink noise) at five noise-pollution levels: −5 dB, 0 dB, 5 dB, 10 dB, and 15 dB. The results in Table 1 show that the proposed method not only denoises well in trained noise scenes but also generalizes well to untrained ones. Even under transient noises such as factory and machine-gun noise the method remains clearly effective: abrupt background noise is barely audible and speech quality is well recovered. In some low signal-to-noise environments, the enhanced speech likewise shows no buzzing, trembling, or similar problems. Moreover, the delay of the designed method is below 30 ms, fully meeting the real-time requirements of most voice products.
TABLE 1 evaluation results of the PESQ, STOI and SDR metrics in different noise environments
Unlike prior deep-neural-network noise-reduction methods that enhance only in the magnitude domain, the proposed method models the complex-spectrum information of the speech signal, i.e., the real-imaginary spectra after the Fourier transform. It constructs a dilated convolutional neural network with a multi-scale encoder-decoder architecture and learns the mapping between noisy and clean signals in the complex domain, thereby jointly optimizing phase and magnitude information. The main advantages of the algorithm are as follows:
(1) Learning is carried out in the complex domain, so the enhancement of phase information is taken into account, and better speech intelligibility and quality can be achieved in low signal-to-noise environments;
(2) The real- and imaginary-part information of the complex spectrum amounts to two learning targets; compared with a method mapping the magnitude spectrum alone, the multi-target model generalizes better;
(3) Modeling with multi-scale convolution captures the contextual information in speech more finely and recovers more speech detail;
(4) The designed model is a fully causal system, that is, its output depends only on current- and past-frame information, minimizing the delay of the algorithm.
References:
[1] J. S. Garofolo, "Getting Started with the DARPA TIMIT CD-ROM: An Acoustic-Phonetic Continuous Speech Database," NIST Tech. Report, 1988.
[2] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, 1993.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and the invention is not to be considered limited to these specific embodiments. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.
Claims (5)
1. A phase-sensitive gated multi-scale dilated convolution network speech enhancement method, characterized in that: a neural network model is used to construct a mapping between the complex spectra of speech signals, the real and imaginary spectra of noisy speech obtained by time-frequency analysis are mapped to enhanced real and imaginary spectra, and the enhanced real and imaginary spectra are restored to an enhanced time-domain speech signal; the noisy speech signal is first framed and windowed, then a short-time Fourier transform yields its complex spectrum, the real and imaginary parts are separated, and only the non-redundant half is kept, giving two groups of input features: real-part features and imaginary-part features; the two groups of input features are then fed into a gated multi-scale dilated convolution network model whose processing flow comprises: first, a gated encoding module performs a gated encoding operation to obtain a high-dimensional nonlinear feature representation; a multi-scale feature analysis module then performs temporal feature analysis separately on the encoded real-part and imaginary-part representations; a gated decoding module performs gated decoding operations to obtain the enhanced real and imaginary spectra; the input of the multi-scale feature analysis module comprises two groups of features: (1) the real or imaginary spectrum of the original noisy speech; (2) the real- or imaginary-part features output by the gated encoding module; the multi-scale feature analysis module is a stack of at least two multi-scale analysis units, each of which concatenates the two groups of feature tensors, both tensors being reshaped before concatenation into three-dimensional tensors of shape [number of sentences, sentence length, 322]; the concatenated feature tensor is then decomposed into 8 sub-bands, the first 7 of shape [number of sentences, sentence length, 40] and the last of shape [number of sentences, sentence length, 42]; the input of each sub-band is concatenated with the convolution output of its adjacent sub-band before a one-dimensional dilated convolution is applied, each sub-band convolution being followed by an exponential linear activation; after the multi-scale analysis units, a 1024-dimensional fully connected layer expands the analyzed features, the output feature tensor is reshaped into a four-dimensional tensor of shape [number of sentences, sentence length, 4, 256], and the two reshaped feature tensors are sent separately to the gated decoding module for decoding.
2. The phase-sensitive gated multi-scale dilated convolution network speech enhancement method of claim 1, characterized in that: the enhanced real and imaginary spectra are inverse-Fourier-transformed and overlap-added to finally obtain the enhanced speech signal.
3. The phase-sensitive gated multi-scale dilated convolution network speech enhancement method of claim 1, characterized in that: the gated encoding module is a stack of at least two gated linear encoding units, each of which applies a two-dimensional convolution with a 1×3 kernel and a 1×2 stride.
4. The phase-sensitive gated multi-scale dilated convolution network speech enhancement method of claim 3, characterized in that: the output of each gated linear encoding unit is passed through an exponential linear activation to apply a nonlinear transformation to the features.
5. A phase-sensitive gated multi-scale dilated convolution network speech enhancement system, characterized in that: it comprises a readable storage medium storing execution instructions which, when executed by a processor, implement the method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011332442.8A CN112309411B (en) | 2020-11-24 | 2020-11-24 | Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112309411A CN112309411A (en) | 2021-02-02 |
CN112309411B true CN112309411B (en) | 2024-06-11 |
Family
ID=74335732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011332442.8A Active CN112309411B (en) | 2020-11-24 | 2020-11-24 | Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112309411B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113129873B (en) * | 2021-04-27 | 2022-07-08 | 思必驰科技股份有限公司 | Optimization method and system for stack type one-dimensional convolution network awakening acoustic model |
CN113707163B (en) * | 2021-08-31 | 2024-05-14 | 北京达佳互联信息技术有限公司 | Speech processing method and device and model training method and device |
CN114283829B (en) * | 2021-12-13 | 2023-06-16 | 电子科技大学 | Voice enhancement method based on dynamic gating convolution circulation network |
CN114842863B (en) * | 2022-04-19 | 2023-06-02 | 电子科技大学 | Signal enhancement method based on multi-branch-dynamic merging network |
CN115862581A (en) * | 2023-02-10 | 2023-03-28 | 杭州兆华电子股份有限公司 | Secondary elimination method and system for repeated pattern noise |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA3156908A1 (en) * | 2015-01-06 | 2016-07-14 | David Burton | Mobile wearable monitoring systems |
WO2019075267A1 (en) * | 2017-10-11 | 2019-04-18 | Google Llc | Self-gating activation neural network layers |
CN109935243A (en) * | 2019-02-25 | 2019-06-25 | 重庆大学 | Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model |
CN110136731A (en) * | 2019-05-13 | 2019-08-16 | 天津大学 | Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice |
CN110534123A (en) * | 2019-07-22 | 2019-12-03 | 中国科学院自动化研究所 | Sound enhancement method, device, storage medium, electronic equipment |
CN110674866A (en) * | 2019-09-23 | 2020-01-10 | 兰州理工大学 | Method for detecting X-ray breast lesion images by using transfer learning characteristic pyramid network |
CN111160040A (en) * | 2019-12-26 | 2020-05-15 | 西安交通大学 | Information reliability evaluation system and method based on multi-scale gating equilibrium interaction fusion network |
CN111401250A (en) * | 2020-03-17 | 2020-07-10 | 东北大学 | Chinese lip language identification method and device based on hybrid convolutional neural network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160111107A1 (en) * | 2014-10-21 | 2016-04-21 | Mitsubishi Electric Research Laboratories, Inc. | Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System |
Non-Patent Citations (1)
Title |
---|
A deep convolutional neural network speech enhancement method incorporating phase estimation; Yuan Wenhao, Liang Chunyan, Xia Bin, Sun Wenzhu; Acta Electronica Sinica; 2018-10-15 (No. 10) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112309411B (en) | Phase-sensitive gating multi-scale cavity convolution network voice enhancement method and system | |
CN111971743B (en) | Systems, methods, and computer readable media for improved real-time audio processing | |
US11069344B2 (en) | Complex evolution recurrent neural networks | |
CN109841226B (en) | Single-channel real-time noise reduction method based on convolution recurrent neural network | |
EP3926623B1 (en) | Speech recognition method and apparatus, and neural network training method and apparatus | |
US20210295859A1 (en) | Enhanced multi-channel acoustic models | |
CN107680611B (en) | Single-channel sound separation method based on convolutional neural network | |
CN110060690B (en) | Many-to-many speaker conversion method based on STARGAN and ResNet | |
CN110739003B (en) | Voice enhancement method based on multi-head self-attention mechanism | |
WO2019227586A1 (en) | Voice model training method, speaker recognition method, apparatus, device and medium | |
EP4235646A2 (en) | Adaptive audio enhancement for multichannel speech recognition | |
CN110060657B (en) | SN-based many-to-many speaker conversion method | |
CN112331224A (en) | Lightweight time domain convolution network voice enhancement method and system | |
CN113936681B (en) | Speech enhancement method based on mask mapping and mixed cavity convolution network | |
CN111899757B (en) | Single-channel voice separation method and system for target speaker extraction | |
Wang et al. | Recurrent deep stacking networks for supervised speech separation | |
Mundodu Krishna et al. | Single channel speech separation based on empirical mode decomposition and Hilbert transform | |
JP2023546099A (en) | Audio generator, audio signal generation method, and audio generator learning method | |
Hasannezhad et al. | PACDNN: A phase-aware composite deep neural network for speech enhancement | |
Du et al. | A joint framework of denoising autoencoder and generative vocoder for monaural speech enhancement | |
CN108806725A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
Sheeja et al. | Speech dereverberation and source separation using DNN-WPE and LWPR-PCA | |
CN113571074B (en) | Voice enhancement method and device based on multi-band structure time domain audio frequency separation network | |
Azam et al. | Urdu spoken digits recognition using classified MFCC and backpropgation neural network | |
CN113707172A (en) | Single-channel voice separation method, system and computer equipment of sparse orthogonal network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||