CN110751957B - Speech enhancement method using stacked multi-scale modules - Google Patents
- Publication number
- CN110751957B (application CN201911182689.3A)
- Authority
- CN
- China
- Prior art keywords
- speech
- voice
- enhancement
- stoi
- sdr
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
- G10L21/0364—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The invention discloses an end-to-end speech enhancement method using stacked multi-scale modules, which comprises the following steps: S1: constructing a cascaded end-to-end speech enhancement framework and splicing the stacked multi-scale modules into the network structure; S2: in the preprocessing stage, transforming the time-domain signal into a two-dimensional feature; S3: enhancing the two-dimensional feature with a speech enhancement module; S4: in the post-processing stage, transforming the enhanced feature representation back into a one-dimensional time-domain signal by decoding synthesis. To further improve performance, the speech enhancement evaluation indices STOI and SDR are merged into the loss function using a training strategy of multi-objective joint optimization. Experiments show that the proposed method significantly improves speech enhancement performance and exhibits better noise immunity under unknown-noise and low signal-to-noise-ratio conditions.
Description
Technical Field
The invention belongs to the technical field of voice enhancement, and particularly relates to an end-to-end voice enhancement method using stacked multi-scale modules.
Background
Speech enhancement is the task of removing or attenuating additive noise in noisy speech. By suppressing and separating the noise it improves overall perceptual quality and intelligibility, and it is widely used in robust speech recognition, hearing-aid design, speaker verification, and similar applications. Traditional speech enhancement methods include spectral subtraction, Wiener filtering, statistical-model-based methods and subspace-based methods; in recent years, supervised speech enhancement based on deep learning has become the main research direction.
Some researchers process the time-domain speech signal directly, without relying on its frequency-domain representation; this avoids switching back and forth between the time and frequency domains and makes fuller use of the time-domain feature representation of speech. Building on the WaveNet framework, Qian et al. introduced a prior distribution of speech for speech enhancement, and Rethage et al. predicted the target with non-causal dilated convolutions. Pascual et al. proposed SEGAN, which uses convolutional networks to enhance time-domain speech directly; Fu et al. proposed a fully convolutional neural network for time-domain whole-sentence speech enhancement; and Pandey et al. combined a sequence-modeling network with an encoder-decoder architecture to process time-domain signals for real-time speech enhancement.
These end-to-end methods map the one-dimensional time-domain waveform directly to the target speech. However, the raw waveform does not exhibit an obvious feature structure, which makes it difficult to model directly, and the modeling difficulty increases further in low signal-to-noise-ratio environments.
Disclosure of Invention
The present invention provides an end-to-end speech enhancement method using stacked multi-scale modules, which aims to solve the existing problems.
The invention is realized in such a way that a speech enhancement method using stacked multi-scale modules comprises the following steps:
S1: constructing a cascade end-to-end voice enhancement framework, and splicing the stacked multi-scale modules into a network structure;
S2: in the preprocessing stage, the time domain signals are transformed into two-dimensional features;
S3: enhancing the two-dimensional features by utilizing a voice enhancement module;
S4: in a post-processing stage, the enhanced feature representation is transformed into a one-dimensional time-domain signal by decoding synthesis.
Further, the cascade end-to-end voice enhancement architecture comprises voice time domain signal preprocessing, a voice enhancement module and target voice synthesis post-processing; the method comprises the following specific steps:
a. in the time-domain signal preprocessing stage, one-dimensional convolution is applied to the input speech segment, and the response of each convolution kernel acting on the noisy speech y is stacked row by row to form a two-dimensional real-valued feature Y; inspired by the way convolutional neural networks process image pixel values, the two-dimensional feature is separated to obtain an absolute-value feature and a sgn mask;
b. the absolute-value feature of the noisy speech y is input into the speech enhancement module for enhancement to obtain an estimate of the absolute-value feature; multiplying this estimate by the sgn mask synthesizes the feature representation of the target speech.
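A reconstruction of the synthesis formula, consistent with the relation Y = abs(Y) ⊙ sgn(Y) given in the detailed description (the hat denotes the enhanced estimate produced by the speech enhancement module):

\hat{X} = \widehat{\mathrm{abs}(Y)} \odot \mathrm{sgn}(Y)

where ⊙ denotes element-wise multiplication.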
Further, the multi-scale module includes an average pooling layer, convolutions with kernels of 1 × 1 and 3 × 3, and dilated convolutions with different dilation rates.
Furthermore, the evaluation indexes STOI and SDR of the voice enhancement are merged into a loss function by using a training strategy of multi-objective joint optimization.
Further, the specific step of incorporating the STOI indicator into the loss function includes:
1) the STOI inputs are the clean speech x and the degraded speech; silent regions that do not contribute to speech intelligibility are removed, the time-domain signals are then transformed into the time-frequency domain using an STFT, and both signals are divided into 50%-overlapping frames with a Hanning window;
2) one-third octave band analysis is performed, dividing the spectrum into 15 one-third octave bands with center frequencies ranging from 150 Hz to about 4.3 kHz; the short-time envelope x_{j,m} of the clean speech is represented as follows:
x_{j,m} = [X_j(m-L+1), X_j(m-L+2), ..., X_j(m)]^T
where X_j is the j-th one-third octave band obtained from x, M is the total number of frames of the utterance, m is the frame index, j is the one-third octave band index, and L corresponds to the analyzed speech length;
3) the speech is normalized and clipped to obtain the envelope representation of the degraded speech, and intelligibility is expressed as the correlation coefficient between the two temporal envelopes (see the reconstruction after step 5):
where ‖·‖₂ is the L2 norm and μ(·) denotes the mean vector of the corresponding samples.
4) Calculating the average value of the intelligibility of all bands and frames to obtain the STOI calculation index:
5) substituting the enhanced speech into the STOI calculation formula yields the STOI metric used during training:
where d_{j,m} denotes the correlation coefficient between the temporal envelopes of the enhanced speech and the clean speech.
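A reconstruction of the correlation of step 3) and the average of step 4), following the standard STOI definition (Taal et al., cited in the non-patent literature below):

d_{j,m} = \frac{(x_{j,m}-\mu_{x_{j,m}})^{T}(\bar{y}_{j,m}-\mu_{\bar{y}_{j,m}})}{\lVert x_{j,m}-\mu_{x_{j,m}}\rVert_{2}\,\lVert \bar{y}_{j,m}-\mu_{\bar{y}_{j,m}}\rVert_{2}}
\qquad
\mathrm{STOI} = \frac{1}{15M}\sum_{j=1}^{15}\sum_{m} d_{j,m}

where \bar{y}_{j,m} is the normalized and clipped temporal envelope of the degraded speech; substituting the envelope of the enhanced speech for \bar{y}_{j,m} gives the training-time STOI of step 5).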
Further, the specific step of incorporating the SDR index into the loss function includes:
1) the SDR inputs are the clean speech x and the enhanced speech; the SDR of the enhanced speech is calculated as follows:
2) an equivalent transformation of the SDR optimization objective is performed to simplify the computation, as reconstructed below:
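A reconstruction of the SDR computation from the definitions in the detailed description, where x is the clean speech, \hat{x} the enhanced speech, and x_{\mathrm{target}} the projection of \hat{x} onto x:

x_{\mathrm{target}} = \frac{\langle \hat{x},x\rangle}{\lVert x\rVert^{2}}\,x
\qquad
\mathrm{SDR} = 10\log_{10}\frac{\lVert x_{\mathrm{target}}\rVert^{2}}{\lVert \hat{x}-x_{\mathrm{target}}\rVert^{2}}
= 10\log_{10}\frac{\langle \hat{x},x\rangle^{2}}{\lVert x\rVert^{2}\lVert \hat{x}\rVert^{2}-\langle \hat{x},x\rangle^{2}}

The right-most form is the equivalent transformation of step 2): it avoids the explicit projection and so simplifies the computation during training.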
Further, the specific step of fusing the STOI and SDR evaluation index into the loss function includes:
1) the conventional root mean square error is calculated as follows:
where M and N are the number of sampling points per utterance and the total number of utterances, respectively;
2) the root mean square error is combined with the STOI- and SDR-based evaluation terms to form the loss function (a reconstruction is given below):
where α, β, γ correspond to the coefficients of different parts of the loss function.
In the STOI computation above, X_j ∈ ℝ is the j-th one-third octave band obtained from x, M is the total number of frames of an utterance, m is the frame index, j ∈ {1, 2, ..., 15} is the one-third octave band index, and L = 30 corresponds to an analyzed speech length of 384 ms.
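One plausible reconstruction of the combined objective; the RMSE of step 1) follows its conventional definition, while the exact way the STOI and SDR terms enter the sum is an assumption (only the ingredients RMSE, STOI, SDR and the coefficients α, β, γ come from the text):

L_{\mathrm{RMSE}} = \sqrt{\frac{1}{MN}\sum_{n=1}^{N}\sum_{m=1}^{M}\bigl(x_{n}(m)-\hat{x}_{n}(m)\bigr)^{2}}
\qquad
\mathcal{L} = \alpha\,L_{\mathrm{RMSE}} + \beta\,\bigl(1-\mathrm{STOI}(x,\hat{x})\bigr) - \gamma\,\mathrm{SDR}(x,\hat{x})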
Compared with the prior art, the invention has the beneficial effects that: in order to improve the direct processing capability of the neural network on time domain voice signals, the invention provides a novel multi-scale end-to-end voice enhancement framework. In the preprocessing stage, the time domain signal is transformed into a two-dimensional characteristic representation, then the two-dimensional characteristic is enhanced by a voice enhancement module, and finally the enhanced characteristic representation is transformed into a one-dimensional time domain signal through decoding and synthesis. In order to further improve the performance of the algorithm, the evaluation indexes STOI and SDR of the voice enhancement are merged into a loss function by using a training strategy of multi-objective joint optimization. Experiments show that the method provided by the invention can obviously improve the voice enhancement effect and has better noise immunity under the conditions of unknown noise and low signal-to-noise ratio.
Drawings
FIG. 1 is an overall schematic view of the present invention;
FIG. 2 is a schematic diagram of stacked multi-scale modules of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the description of the present invention, it is to be understood that the terms "length", "width", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on the orientations or positional relationships illustrated in the drawings, and are used merely for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention. Further, in the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Examples
Referring to fig. 1-2, the present invention provides a technical solution: a method of end-to-end speech enhancement using stacked multi-scale modules, comprising the steps of:
S1: constructing a cascade end-to-end voice enhancement framework, and splicing the stacked multi-scale modules into a network structure;
S2: in the preprocessing stage, the time domain signals are transformed into two-dimensional features;
S3: enhancing the two-dimensional features by utilizing a voice enhancement module;
S4: in a post-processing stage, the enhanced feature representation is transformed into a one-dimensional time-domain signal by decoding synthesis.
The end-to-end speech enhancement framework proposed by the present invention comprises speech time domain signal preprocessing, speech enhancement module and target speech synthesis post-processing, as shown in fig. 1.
Assuming that the time-domain clean speech is x and the noise signal is n, the noisy speech y can be expressed as:
y=x+n
In the time-domain signal preprocessing stage, one-dimensional convolution is applied to the input speech segment, and the response of each convolution kernel acting on the noisy speech y is stacked row by row to form a two-dimensional real-valued feature Y. Inspired by the way convolutional neural networks process image pixel values, the two-dimensional feature is separated to obtain an absolute-value feature and a sgn mask, where sgn denotes the sign function, i.e., the sign of Y is taken; the two-dimensional feature Y is thus represented as the product of the absolute-value feature and the sgn mask:
Y=abs(Y)⊙sgn(Y)
where ⊙ denotes element-wise multiplication. The absolute-value feature of the noisy speech y is then input to the speech enhancement module for enhancement, producing an estimate of the absolute-value feature; multiplying this estimate by the sgn mask synthesizes the feature representation of the target speech, as sketched below.
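As an illustration of this preprocessing, the following PyTorch sketch encodes a waveform with a bank of one-dimensional convolution kernels and splits the result into an absolute-value feature and a sgn mask; the kernel count, kernel size and stride are assumptions chosen for clarity, not values taken from the patent.

```python
import torch
import torch.nn as nn

class TimeDomainPreprocessor(nn.Module):
    """Encode a 1-D noisy waveform into a 2-D feature and split it into an
    absolute-value feature and a sgn mask, as described above."""

    def __init__(self, n_filters=256, kernel_size=32, stride=16):
        super().__init__()
        # Each learned kernel acts on the noisy speech y; stacking the kernel
        # responses row by row yields the 2-D real-valued feature Y.
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)

    def forward(self, y):                     # y: (batch, samples)
        Y = self.encoder(y.unsqueeze(1))      # (batch, n_filters, frames)
        abs_Y = Y.abs()                       # absolute-value feature
        sgn_Y = torch.sign(Y)                 # sgn mask, so Y = abs_Y * sgn_Y
        return abs_Y, sgn_Y

# After the enhancement module predicts an estimate of the absolute-value
# feature, the target-speech feature is synthesized element-wise:
#   X_hat = enhanced_abs * sgn_Y
```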
The speech enhancement module used in this framework is built on a fully convolutional network. During encoding, each convolutional layer halves the feature size while doubling the number of channels, so the multi-layer convolution encodes the features into a small, deep representation; correspondingly, during decoding the feature size is gradually enlarged until it is restored to the original size. When expanding the feature size, higher resolution is obtained by bilinear-interpolation upsampling.
Skip connections between layers at the same level of the speech enhancement module preserve fine detail through the copy operation; letting low-level information flow directly into the high-level layers effectively guides the model in reconstructing high-resolution features.
The symmetric structure of the speech enhancement module guarantees that its input and output have the same shape, which makes it naturally suited to dense prediction tasks such as per-pixel labeling of images.
To make fuller use of the multi-scale context information in speech features, we designed and stacked multi-scale blocks, as shown in fig. 2. An SMB (Stacked Multi-scale Block) contains an average pooling layer, ordinary 1 × 1 and 3 × 3 convolutions, and dilated convolutions with different dilation rates; to preserve the original information efficiently, the original features are concatenated with the multi-scale features.
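A sketch of what such an SMB could look like in PyTorch: the channel widths, the specific dilation rates, and the 1 × 1 fusion convolution are assumptions, while the branch types (average pooling, 1 × 1 and 3 × 3 convolutions, dilated convolutions, and concatenation of the original feature) follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedMultiScaleBlock(nn.Module):
    """One SMB: average pooling, 1x1 and 3x3 convolutions, and dilated
    convolutions at several rates, with the original feature concatenated
    back in before a 1x1 fusion convolution."""

    def __init__(self, channels, dilations=(2, 4, 8)):
        super().__init__()
        self.pool_branch = nn.Sequential(
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(channels, channels, kernel_size=1),
        )
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv3x3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.dilated = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        n_branches = 3 + len(dilations) + 1   # +1 for the original feature
        self.fuse = nn.Conv2d(n_branches * channels, channels, kernel_size=1)

    def forward(self, x):
        branches = [self.pool_branch(x), self.conv1x1(x), self.conv3x3(x)]
        branches += [conv(x) for conv in self.dilated]
        branches.append(x)                    # keep the original information
        return F.relu(self.fuse(torch.cat(branches, dim=1)))
```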
Deep-learning-based speech enhancement usually adopts the mean squared error (MSE) as the training loss, yet model performance is evaluated through the intelligibility and quality of the enhanced speech; this mismatch between the loss function and the evaluation metrics does not guarantee that an optimal model will be obtained.
To compute the loss from the perspective of amplitude values, we use the RMSE (Root Mean Square Error). STOI is used to assess speech intelligibility; its inputs are the clean speech x and the degraded speech. It first removes the silent regions that do not contribute to intelligibility, then transforms the time-domain signals into the time-frequency domain with an STFT, dividing the two signals into 50%-overlapping frames with a Hanning window. One-third octave band analysis is then carried out, giving 15 one-third octave bands whose center frequencies range from 150 Hz to about 4.3 kHz. The short-time envelope x_{j,m} of the clean speech can be expressed as follows:
x_{j,m} = [X_j(m-L+1), X_j(m-L+2), ..., X_j(m)]^T
where X_j is the j-th one-third octave band obtained from x, M is the total number of frames of the utterance, m is the frame index, j is the one-third octave band index, and L corresponds to the analyzed speech length. The speech is then normalized and clipped: normalization compensates for global level differences that should not affect intelligibility, and clipping places an upper bound on the STOI evaluation of severely degraded speech. With the normalized and clipped temporal envelope of the degraded speech, intelligibility is expressed as the correlation coefficient between the two temporal envelopes:
where ‖·‖₂ is the L2 norm and μ(·) denotes the mean vector of the corresponding samples. Averaging the intelligibility over all bands and frames gives the STOI index of the degraded speech:
Substituting the enhanced speech into the STOI calculation formula yields the STOI metric used during training:
where d_{j,m} denotes the correlation coefficient between the temporal envelopes of the enhanced speech and the clean speech.
SDR, on the other hand, is the energy ratio between the clean component of the enhanced speech and the remaining components, where the clean component is the projection of the enhanced speech onto x:
SDR is defined as:
Combining the two formulas above gives:
An equivalent transformation of the SDR optimization objective then simplifies the computation:
finally, we combine these two metrics with RMSE to form a loss function:
where α, β, γ correspond to coefficients of different parts of the loss function.
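A minimal sketch of this multi-objective loss, assuming a differentiable STOI estimator `stoi_fn` is available (the patent does not specify one; it is a hypothetical helper here) and using the simplified SDR form derived above. The way the three terms are combined is likewise an assumption; only the ingredients RMSE, STOI, SDR and the coefficients α, β, γ come from the text.

```python
import torch

def neg_sdr(clean, enhanced, eps=1e-8):
    """Negative of the simplified SDR objective derived above (lower is better)."""
    dot = torch.sum(clean * enhanced, dim=-1)
    num = dot ** 2
    den = torch.sum(clean ** 2, dim=-1) * torch.sum(enhanced ** 2, dim=-1) - num
    return -10.0 * torch.log10(num / (den + eps) + eps)

def joint_loss(clean, enhanced, stoi_fn, alpha=1.0, beta=1.0, gamma=1.0):
    """Combine RMSE with STOI- and SDR-based terms; stoi_fn is an assumed
    differentiable STOI estimator returning a value in [0, 1] (higher is better)."""
    rmse = torch.sqrt(torch.mean((clean - enhanced) ** 2))
    stoi_term = 1.0 - stoi_fn(clean, enhanced)
    sdr_term = neg_sdr(clean, enhanced).mean()
    return alpha * rmse + beta * stoi_term + gamma * sdr_term
```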
Test examples
The speech data used in the experiments come from the TIMIT dataset and the training noise comes from the ESC-50 dataset; the Noisex-92 noise dataset is also used for testing to verify the generalization performance of the proposed model.
In this experiment, the TIMIT dataset contains 6300 utterances in total, 10 sentences recorded by each of 630 speakers with a male-to-female ratio of 7:3. Since 7 of the 10 sentences recorded by each speaker are repeated across speakers, only the 1890 utterances whose sentences are all distinct were used, to remove the influence of repeated sentences on model training and testing. About 80% of these utterances form the training set and the remaining 20% are used for testing, with the male-to-female ratio matching the overall TIMIT distribution. The ESC-50 dataset contains 2000 labeled environmental recordings in 5 main categories: animals, natural soundscapes and water sounds, non-speech human sounds, indoor sounds, and urban sounds. All audio was resampled to 16 kHz and cut to a length of 2 seconds. The Adam optimizer was used for stochastic-gradient-descent (SGD) optimization, with the learning rate fixed at 1 × 10⁻⁴.
For baselines, several typical encoder-decoder solutions are selected for comparison with the proposed method, including spectral-mapping-based and end-to-end methods; noisy speech is also included as a baseline: (a) noisy speech, (b) AET, (c) CED, (d) R-CED, (e) NOSMB-SE, (f) SMB-SE. AET is an end-to-end speech enhancement architecture; CED and R-CED are convolutional-neural-network time-frequency-domain speech enhancement methods; NOSMB-SE is the proposed framework without SMBs, which simply connects low-level information to the high-level layers; and SMB-SE adds 4 SMBs on top of NOSMB-SE.
All models were trained at 0 dB SNR and evaluated at -15 dB, -10 dB, -5 dB, 0 dB and 5 dB signal-to-noise ratios; to evaluate the generalization performance of the proposed framework, it was also tested on the Noisex-92 noise dataset.
TABLE I
Test results under seen-noise conditions (best performance in bold)
TABLE II
Test results under unseen-noise conditions (best performance in bold)
The invention provides an end-to-end speech enhancement framework using stacked multi-scale modules: the original time-domain waveform is encoded into a two-dimensional feature representation, a speech enhancement module then learns the mapping from noisy speech to clean speech, and finally the time-domain speech signal is synthesized by decoding. The proposed end-to-end framework effectively extracts feature information from the time-domain signal, the SMB modules help the model mine more information, and the integration of STOI, SDR and RMSE effectively improves the overall enhancement performance. The framework exhibits noise immunity under low-SNR conditions and good generalization in unknown noise environments.
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (7)
1. A method of end-to-end speech enhancement using stacked multi-scale modules, comprising the steps of:
S1: constructing a cascade end-to-end voice enhancement framework, and splicing the stacked multi-scale modules into a network structure;
S2: in the preprocessing stage, the time domain signals are transformed into two-dimensional features;
S3: enhancing the two-dimensional features by utilizing a voice enhancement module;
S4: in a post-processing stage, the enhanced feature representation is transformed into a one-dimensional time-domain signal by decoding synthesis.
2. The speech enhancement method of claim 1, wherein: the cascade end-to-end voice enhancement framework comprises voice time domain signal preprocessing, a voice enhancement module and target voice synthesis post-processing; the method comprises the following specific steps:
a. in the time-domain signal preprocessing stage, one-dimensional convolution is applied to the input speech segment, and the response of each convolution kernel acting on the noisy speech y is stacked row by row to form a two-dimensional real-valued feature Y; inspired by the way convolutional neural networks process image pixel values, the two-dimensional feature is separated to obtain an absolute-value feature and a sgn mask;
b. the absolute-value feature of the noisy speech y is input into the speech enhancement module for enhancement to obtain an estimate of the absolute-value feature, and multiplying this estimate by the sgn mask synthesizes the feature representation of the target speech.
3. The speech enhancement method of claim 1, wherein: the multiscale module includes an average pooling layer, convolutions with convolution kernels of 1 x 1 and 3 x 3, and dilated convolutions of different dilation rates.
4. The speech enhancement method of claim 1, further comprising: integrating the speech enhancement evaluation indices STOI and SDR into the loss function using a training strategy of multi-objective joint optimization.
5. The speech enhancement method of claim 4 wherein the step of incorporating the STOI indicator into the loss function comprises:
1) the STOI inputs are the clean speech x and the degraded speech; silent regions that do not contribute to speech intelligibility are removed, the time-domain signals are then transformed into the time-frequency domain using an STFT, and both signals are divided into 50%-overlapping frames with a Hanning window;
2) one-third octave band analysis is performed, dividing the spectrum into 15 one-third octave bands with center frequencies ranging from 150 Hz to about 4.3 kHz; the short-time envelope x_{j,m} of the clean speech is represented as follows:
x_{j,m} = [X_j(m-L+1), X_j(m-L+2), ..., X_j(m)]^T
where X_j is the j-th one-third octave band obtained from x, M is the total number of frames of the utterance, m is the frame index, j is the one-third octave band index, and L corresponds to the analyzed speech length;
3) the speech is normalized and clipped to obtain the envelope representation of the degraded speech, and intelligibility is expressed as the correlation coefficient between the two temporal envelopes:
where ‖·‖₂ is the L2 norm and μ(·) denotes the mean vector of the corresponding samples.
4) Calculating the average value of the intelligibility of all bands and frames to obtain the STOI calculation index of the degraded speech:
5) substituting the enhanced speech into the STOI calculation formula yields the STOI metric used during training:
where d_{j,m} denotes the correlation coefficient between the temporal envelopes of the enhanced speech and the clean speech.
6. The speech enhancement method of claim 4 further characterized in that the step of incorporating the SDR indicator into the loss function comprises:
1) the SDR inputs are the clean speech x and the enhanced speech, and the SDR of the enhanced speech is calculated as follows:
2) an equivalent transformation of the SDR optimization objective is performed to simplify the computation, giving:
7. The speech enhancement method of claim 4, further characterized by fusing the STOI and SDR evaluation indices into the loss function, the specific steps comprising:
1) the conventional root mean square error is calculated as follows:
wherein M and N are the number of sampling points of each voice and the total number of the voices;
2) the root mean square error is combined with the STOI- and SDR-based evaluation terms to form the loss function:
where α, β, γ correspond to the coefficients of different parts of the loss function.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2019109136349 | 2019-09-25 | ||
CN201910913634 | 2019-09-25 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110751957A CN110751957A (en) | 2020-02-04 |
CN110751957B true CN110751957B (en) | 2020-10-27 |
Family
ID=69284766
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911182689.3A Active CN110751957B (en) | 2019-09-25 | 2019-11-27 | Speech enhancement method using stacked multi-scale modules |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110751957B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111524530A (en) * | 2020-04-23 | 2020-08-11 | 广州清音智能科技有限公司 | Voice noise reduction method based on expansion causal convolution |
CN111583947A (en) * | 2020-04-30 | 2020-08-25 | 厦门快商通科技股份有限公司 | Voice enhancement method, device and equipment |
US11538464B2 (en) | 2020-09-09 | 2022-12-27 | International Business Machines Corporation . | Speech recognition using data analysis and dilation of speech content from separated audio input |
US11495216B2 (en) | 2020-09-09 | 2022-11-08 | International Business Machines Corporation | Speech recognition using data analysis and dilation of interlaced audio input |
CN112862068A (en) * | 2021-01-15 | 2021-05-28 | 复旦大学 | Fault-tolerant architecture and method for complex convolutional neural network |
CN113129918B (en) * | 2021-04-15 | 2022-05-03 | 浙江大学 | Voice dereverberation method combining beam forming and deep complex U-Net network |
CN113870887A (en) * | 2021-09-26 | 2021-12-31 | 平安科技(深圳)有限公司 | Single-channel speech enhancement method and device, computer equipment and storage medium |
CN113936680B (en) * | 2021-10-08 | 2023-08-08 | 电子科技大学 | Single-channel voice enhancement method based on multi-scale information perception convolutional neural network |
CN115050379B (en) * | 2022-04-24 | 2024-08-06 | 华侨大学 | FHGAN-based high-fidelity voice enhancement model and application thereof |
CN114974283A (en) * | 2022-05-24 | 2022-08-30 | 云知声智能科技股份有限公司 | Training method and device of voice noise reduction model, storage medium and electronic device |
CN117174105A (en) * | 2023-11-03 | 2023-12-05 | 深圳市龙芯威半导体科技有限公司 | Speech noise reduction and dereverberation method based on improved deep convolutional network |
CN117219107B (en) * | 2023-11-08 | 2024-01-30 | 腾讯科技(深圳)有限公司 | Training method, device, equipment and storage medium of echo cancellation model |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101617342A (en) * | 2007-01-16 | 2009-12-30 | 汤姆科技成像系统有限公司 | The figured method and system that is used for multidate information |
CN109034162A (en) * | 2018-07-13 | 2018-12-18 | 南京邮电大学 | A kind of image, semantic dividing method |
CN109741260A (en) * | 2018-12-29 | 2019-05-10 | 天津大学 | A kind of efficient super-resolution method based on depth back projection network |
CN110010144A (en) * | 2019-04-24 | 2019-07-12 | 厦门亿联网络技术股份有限公司 | Voice signals enhancement method and device |
CN110136731A (en) * | 2019-05-13 | 2019-08-16 | 天津大学 | Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice |
CN110246510A (en) * | 2019-06-24 | 2019-09-17 | 电子科技大学 | A kind of end-to-end speech Enhancement Method based on RefineNet |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10971142B2 (en) * | 2017-10-27 | 2021-04-06 | Baidu Usa Llc | Systems and methods for robust speech recognition using generative adversarial networks |
CN108491856B (en) * | 2018-02-08 | 2022-02-18 | 西安电子科技大学 | Image scene classification method based on multi-scale feature convolutional neural network |
CN109473120A (en) * | 2018-11-14 | 2019-03-15 | 辽宁工程技术大学 | A kind of abnormal sound signal recognition method based on convolutional neural networks |
CN109935243A (en) * | 2019-02-25 | 2019-06-25 | 重庆大学 | Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model |
CN110059582B (en) * | 2019-03-28 | 2023-04-07 | 东南大学 | Driver behavior identification method based on multi-scale attention convolution neural network |
- 2019-11-27 CN CN201911182689.3A patent/CN110751957B/en active Active
Non-Patent Citations (4)
Title |
---|
An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech;Cees H.Taal;《IEEE Transactions on Audio, Speech, and Language Processing》;20110214;全文 * |
The enhancement of depth estimation based on multi-scale convolution kernels;Hua Heng;《Conference on Optoelectronic Imaging and Multimedia Technology V》;20181012;全文 * |
An end-to-end speech separation method based on convolutional neural networks; Fan Cunhang; Journal of Signal Processing; 2019-04-03 (No. 4); pp. 542-548 *
Research on in-vehicle speech recognition technology based on one-dimensional convolutional neural networks; Zhu Xixiang; China Masters' Theses Full-text Database; 2017-08-15 (No. 8); I136-37 *
Also Published As
Publication number | Publication date |
---|---|
CN110751957A (en) | 2020-02-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |