CN110751957A - Speech enhancement method using stacked multi-scale modules - Google Patents

Speech enhancement method using stacked multi-scale modules

Info

Publication number
CN110751957A
Authority
CN
China
Prior art keywords
speech
voice
enhancement
stoi
sdr
Prior art date
Legal status
Granted
Application number
CN201911182689.3A
Other languages
Chinese (zh)
Other versions
CN110751957B (en)
Inventor
蓝天
李森
吕忆蓝
刘峤
钱宇欣
叶文政
惠国强
李萌
彭川
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Publication of CN110751957A
Application granted
Publication of CN110751957B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses an end-to-end speech enhancement method using stacked multi-scale modules, which comprises the following steps: S1: constructing a cascaded end-to-end speech enhancement framework and splicing the stacked multi-scale modules into the network structure; S2: in the preprocessing stage, transforming the time-domain signal into a two-dimensional feature; S3: enhancing the two-dimensional feature with a speech enhancement module; S4: in the post-processing stage, transforming the enhanced feature representation back into a one-dimensional time-domain signal by decoding and synthesis. To further improve performance, the speech-enhancement evaluation metrics STOI and SDR are incorporated into the loss function using a training strategy of multi-objective joint optimization. Experiments show that the proposed method significantly improves the speech enhancement effect and exhibits good noise robustness under unseen-noise and low signal-to-noise-ratio conditions.

Description

Speech enhancement method using stacked multi-scale modules
Technical Field
The invention belongs to the technical field of speech enhancement, and in particular relates to an end-to-end speech enhancement method using stacked multi-scale modules.
Background
Speech enhancement is the task of removing or attenuating additive noise in noisy speech. By suppressing and separating the noise, it improves the overall perceptual quality and intelligibility of the speech, and it has wide application in robust speech recognition, hearing-aid design, speaker verification, and related areas. Traditional speech enhancement methods include spectral subtraction, Wiener filtering, statistical-model-based methods, and subspace-based methods. In recent years, supervised speech enhancement based on deep learning has become the main research direction of interest to scholars.
Some researchers process the time-domain speech signal directly, without relying on its frequency-domain representation, which avoids switching back and forth between the time and frequency domains and makes fuller use of the time-domain feature representation of speech. Based on the WaveNet framework, Qian et al. proposed introducing a prior distribution of speech for speech enhancement, and Rethage et al. predicted the target with non-causal dilated convolutions. Pascual et al. proposed SEGAN, which uses convolutional networks to enhance time-domain speech directly; Fu et al. proposed a fully convolutional neural network for whole-sentence time-domain speech enhancement; and Pandey et al. combined a sequence-modeling network with an encoder-decoder architecture to process time-domain signals for real-time speech enhancement.
These end-to-end methods map the one-dimensional time-domain waveform directly to the target speech. However, the time-domain waveform itself does not exhibit an obvious characteristic structure, so modeling the time-domain signal directly is difficult, and the modeling difficulty increases further in low signal-to-noise-ratio environments.
Disclosure of Invention
The present invention provides an end-to-end speech enhancement method using stacked multi-scale modules, which aims to solve the problems described above.
The invention is realized as a speech enhancement method using stacked multi-scale modules, comprising the following steps:
s1: constructing a cascaded end-to-end speech enhancement framework and splicing the stacked multi-scale modules into the network structure;
s2: in the preprocessing stage, transforming the time-domain signal into a two-dimensional feature;
s3: enhancing the two-dimensional feature with a speech enhancement module;
s4: in the post-processing stage, transforming the enhanced feature representation into a one-dimensional time-domain signal by decoding and synthesis.
Further, the cascaded end-to-end speech enhancement architecture comprises speech time-domain signal preprocessing, a speech enhancement module, and target speech synthesis post-processing; the specific steps are as follows:
a. in the time-domain signal preprocessing stage, one-dimensional convolutions are applied to the input speech segment, and the result of each convolution kernel acting on the noisy speech y is stacked row by row to form a two-dimensional real-valued feature Y; inspired by the way convolutional neural networks process image pixel values, the two-dimensional feature is separated into an absolute-value feature and a sgn mask;
b. the absolute-value feature of the noisy speech y is input into the speech enhancement module for enhancement, yielding an estimate Â of the absolute-value feature; multiplying it by the sgn mask synthesizes the feature representation of the target speech:
Ŷ = Â ⊙ sgn(Y)
c. a transposed convolution transforms Ŷ into the time-domain signal x̂.
Further, the multi-scale module includes an average pooling layer, convolutions with 1 × 1 and 3 × 3 kernels, and dilated convolutions with different dilation rates.
Further, the speech-enhancement evaluation metrics STOI and SDR are incorporated into the loss function using a training strategy of multi-objective joint optimization.
Further, the specific steps of incorporating the STOI metric into the loss function include:
1) the STOI inputs are the clean speech x and the degraded speech x̂; silent regions that do not contribute to speech intelligibility are removed, the time-domain signals are then transformed into the time-frequency domain with an STFT, and both signals are divided into 50%-overlapping frames with a Hanning window;
2) a 1/3-octave band analysis is performed, dividing the spectrum into 15 1/3-octave bands whose center frequencies range from 150 Hz to approximately 4.3 kHz; the short-time envelope x_{j,m} of the clean speech is represented as follows:
x_{j,m} = [X_j(m-L+1), X_j(m-L+2), ... X_j(m)]^T
where X is the 1/3-octave band representation obtained from x, M is the total number of frames of an utterance, m is the frame index, j is the 1/3-octave band index, and L corresponds to the analyzed speech length;
3) the speech is normalized and clipped to obtain the temporal envelope representation x̂_{j,m} of the degraded speech; intelligibility is expressed as the correlation coefficient between the two temporal envelopes:
d_{j,m} = (x_{j,m} - μ(x_{j,m}))^T (x̂_{j,m} - μ(x̂_{j,m})) / ( ||x_{j,m} - μ(x_{j,m})||_2 · ||x̂_{j,m} - μ(x̂_{j,m})||_2 )
where ||·||_2 is the L2 norm and μ(·) denotes the mean vector of the corresponding segment;
4) averaging the intelligibility over all bands and frames gives the STOI metric:
STOI = (1 / (15M)) Σ_j Σ_m d_{j,m}
5) substituting the enhanced speech x̂ into the STOI formula gives the STOI term computed during training:
STOI = (1 / (15M)) Σ_j Σ_m d_{j,m}
where d_{j,m} now denotes the correlation coefficient between the temporal envelopes of the enhanced speech and the clean speech.
Further, the specific steps of incorporating the SDR metric into the loss function include:
1) the SDR inputs are the clean speech x and the enhanced speech x̂; the SDR of the enhanced speech is computed as:
SDR = 10 · log10( ||x_target||^2 / ||x̂ - x_target||^2 )
where x_target = ( ⟨x̂, x⟩ / ||x||^2 ) · x is the projection of the enhanced speech onto the clean speech;
2) performing an equivalent transformation of the SDR optimization target to simplify computation gives:
L_SDR = ||x̂||^2 / ⟨x, x̂⟩^2
where the process of maximizing the evaluation metric SDR is equivalent to minimizing L_SDR.
Further, the specific steps of fusing the STOI and SDR evaluation metrics into the loss function include:
1) the conventional root mean square error is calculated as follows:
L_RMSE = sqrt( (1 / (M·N)) · Σ_{n=1}^{N} Σ_{m=1}^{M} (x_{n,m} - x̂_{n,m})^2 )
where M and N are the number of sampling points per utterance and the total number of utterances, respectively;
2) the root mean square error is combined with the STOI- and SDR-based evaluation-metric loss terms:
L = L_RMSE + α · L_STOI + γ · L_SDR
where α and γ are the coefficients of the corresponding parts of the loss function, and the STOI term enters with a sign such that minimizing the loss increases STOI.
Here X ∈ R is the 1/3-octave band representation obtained from x, M is the total number of frames of an utterance, m is the frame index, j ∈ {1, 2, ..., 15} is the 1/3-octave band index, and L = 30 corresponds to an analyzed speech length of 384 ms.
Compared with the prior art, the invention has the following beneficial effects: to improve the ability of neural networks to process time-domain speech signals directly, the invention proposes a novel multi-scale end-to-end speech enhancement framework. In the preprocessing stage, the time-domain signal is transformed into a two-dimensional feature representation; the two-dimensional feature is then enhanced by a speech enhancement module; finally, the enhanced feature representation is transformed into a one-dimensional time-domain signal by decoding and synthesis. To further improve performance, the speech-enhancement evaluation metrics STOI and SDR are incorporated into the loss function using a training strategy of multi-objective joint optimization. Experiments show that the proposed method significantly improves the speech enhancement effect and exhibits good noise robustness under unseen-noise and low signal-to-noise-ratio conditions.
Drawings
FIG. 1 is an overall schematic view of the present invention;
FIG. 2 is a schematic diagram of stacked multi-scale modules of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the description of the present invention, it is to be understood that the terms "length", "width", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on the orientations or positional relationships illustrated in the drawings, and are used merely for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, are not to be construed as limiting the present invention. Further, in the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
Examples
Referring to FIGS. 1-2, the present invention provides the following technical solution: a method of end-to-end speech enhancement using stacked multi-scale modules, comprising the steps of:
s1: constructing a cascaded end-to-end speech enhancement framework and splicing the stacked multi-scale modules into the network structure;
s2: in the preprocessing stage, transforming the time-domain signal into a two-dimensional feature;
s3: enhancing the two-dimensional feature with a speech enhancement module;
s4: in the post-processing stage, transforming the enhanced feature representation into a one-dimensional time-domain signal by decoding and synthesis.
The end-to-end speech enhancement framework proposed by the present invention comprises speech time-domain signal preprocessing, a speech enhancement module, and target speech synthesis post-processing, as shown in FIG. 1.
Assuming that the time-domain clean speech is x and the noise signal is n, the noisy speech y can be expressed as:
y=x+n
In the time-domain signal preprocessing stage, one-dimensional convolutions are applied to the input speech segment, and the result of each convolution kernel acting on the noisy speech y is stacked row by row to form a two-dimensional real-valued feature Y. Inspired by the way convolutional neural networks process image pixel values, the two-dimensional feature is separated into an absolute-value feature and a sgn mask, where sgn denotes the sign function, i.e. the sign of Y is taken; the two-dimensional feature Y is expressed as the product of the absolute-value feature and the sgn mask:
Y=abs(Y)⊙sgn(Y)
where ⊙ denotes element-wise multiplication. The absolute-value feature of the noisy speech y is then input into the speech enhancement module for enhancement, yielding an estimate Â of the absolute-value feature; multiplying it by the sgn mask synthesizes the feature representation of the target speech:
Ŷ = Â ⊙ sgn(Y)
Finally, a transposed convolution transforms Ŷ into the time-domain signal x̂.
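The preprocess-enhance-synthesize cascade above can be summarized in a short PyTorch sketch. The kernel length, stride, and channel count below are illustrative assumptions (the patent does not specify them), and the enhancement module is passed in as a stub:

```python
import torch
import torch.nn as nn

class EndToEndEnhancer(nn.Module):
    """Sketch of the cascade: 1-D conv analysis, magnitude/sign separation,
    2-D enhancement, transposed-conv synthesis. Kernel length, stride and
    channel count are illustrative assumptions, not values from the patent."""
    def __init__(self, enhancer: nn.Module, channels: int = 256,
                 kernel: int = 32, stride: int = 16):
        super().__init__()
        self.analysis = nn.Conv1d(1, channels, kernel, stride=stride, bias=False)
        self.synthesis = nn.ConvTranspose1d(channels, 1, kernel, stride=stride, bias=False)
        self.enhancer = enhancer  # any module mapping (B, 1, C, T) -> (B, 1, C, T)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, 1, samples) noisy time-domain speech
        Y = self.analysis(y)                    # two-dimensional feature (B, C, T)
        mag, sign = Y.abs(), torch.sign(Y)      # absolute-value feature and sgn mask
        mag_hat = self.enhancer(mag.unsqueeze(1)).squeeze(1)   # enhanced magnitude
        return self.synthesis(mag_hat * sign)   # recombine and decode to a 1-D signal
```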
The speech enhancement module adopted by this framework is built on a fully convolutional network. In the encoding process, each convolution layer halves the feature size while doubling the number of channels, so the features are encoded into a small, deep representation through multiple convolution layers; correspondingly, the decoding process gradually enlarges the feature size until the original size is restored. When expanding the feature size, higher resolution is obtained by bilinear interpolation upsampling.
Skip connections added between layers at the same level of the speech enhancement module allow high-resolution details to be preserved through the copy operation. Letting low-level information flow directly into high levels effectively guides the model in modeling high-resolution features.
The symmetric structure of the speech enhancement module ensures that its input and output have the same shape, which makes it naturally suitable for dense per-pixel prediction tasks, in particular labeling every pixel of an image.
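A minimal sketch of such a symmetric encoder-decoder with bilinear upsampling and copy-and-concatenate skip connections is given below; the depth and channel widths are assumptions, since the patent does not state them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancementModule(nn.Module):
    """Each encoder layer halves the spatial size and doubles the channels;
    the decoder upsamples bilinearly and concatenates the same-level encoder
    feature (skip connection) before convolving the channels back down."""
    def __init__(self, depth: int = 3, base: int = 16):
        super().__init__()
        ch = [1] + [base * 2 ** i for i in range(depth)]      # e.g. 1, 16, 32, 64
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch[i], ch[i + 1], 3, stride=2, padding=1), nn.ReLU())
            for i in range(depth))
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch[i + 1] + ch[i], ch[i], 3, padding=1), nn.ReLU())
            for i in reversed(range(depth)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips, h = [], x
        for enc in self.encoders:
            skips.append(h)                                   # keep feature for the skip
            h = enc(h)
        for dec, skip in zip(self.decoders, reversed(skips)):
            h = F.interpolate(h, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            h = dec(torch.cat([h, skip], dim=1))              # copy + concatenate skip
        return h                                              # same shape as the input
```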
To make fuller use of the multi-scale context information in speech features, we designed Stacked Multi-scale Blocks. As shown in FIG. 2, an SMB (Stacked Multi-scale Block) contains an average pooling layer, ordinary 1 × 1 and 3 × 3 convolutions, and dilated convolutions with different dilation rates; to preserve the original information efficiently, the original features are concatenated with the multi-scale features.
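A sketch of one such block follows. The number of branch channels, the specific dilation rates (2 and 4), and the 1 × 1 fusion convolution after concatenation are assumptions for illustration; the patent only specifies the kinds of branches and that the original feature is concatenated with the multi-scale features:

```python
import torch
import torch.nn as nn

class SMB(nn.Module):
    """Parallel multi-scale branches (average pooling, 1x1 conv, 3x3 conv,
    and 3x3 dilated convs) concatenated with the original feature."""
    def __init__(self, channels: int, branch: int = 8):
        super().__init__()
        self.pool = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(channels, branch, 1))
        self.conv1 = nn.Conv2d(channels, branch, 1)
        self.conv3 = nn.Conv2d(channels, branch, 3, padding=1)
        self.dil2 = nn.Conv2d(channels, branch, 3, padding=2, dilation=2)
        self.dil4 = nn.Conv2d(channels, branch, 3, padding=4, dilation=4)
        self.fuse = nn.Conv2d(channels + 5 * branch, channels, 1)   # assumed fusion conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x, self.pool(x), self.conv1(x), self.conv3(x), self.dil2(x), self.dil4(x)]
        return torch.relu(self.fuse(torch.cat(feats, dim=1)))
```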
Deep-learning-based speech enhancement methods usually adopt the mean squared error (MSE, Mean Squared Error) as the training loss function, but in speech enhancement the intelligibility and quality of the enhanced speech are what is usually evaluated to check model performance; this inconsistency between the loss function and the evaluation metrics cannot guarantee that an optimal model is obtained.
To compute the loss from the standpoint of amplitude values, we use the RMSE (Root Mean Square Error). STOI is used to assess speech intelligibility; its inputs are the clean speech x and the degraded speech x̂. It first removes the silent regions that do not contribute to speech intelligibility, then transforms the time-domain signals into the time-frequency domain with an STFT, dividing both signals into 50%-overlapping frames with a Hanning window. A 1/3-octave band analysis is performed, giving a total of 15 1/3-octave bands whose center frequencies range from 150 Hz to approximately 4.3 kHz. The short-time envelope x_{j,m} of the clean speech can be expressed as follows:
x_{j,m} = [X_j(m-L+1), X_j(m-L+2), ... X_j(m)]^T
where X is the 1/3-octave band representation obtained from x, M is the total number of frames of an utterance, m is the frame index, j is the 1/3-octave band index, and L corresponds to the analyzed speech length. The speech is then normalized and clipped: normalization compensates for global level differences, which should not affect intelligibility, and clipping bounds the STOI evaluation on severely degraded speech. The normalized and clipped temporal envelope of the degraded speech is denoted x̂_{j,m}.
Intelligibility is expressed as the correlation coefficient between the two temporal envelopes:
d_{j,m} = (x_{j,m} - μ(x_{j,m}))^T (x̂_{j,m} - μ(x̂_{j,m})) / ( ||x_{j,m} - μ(x_{j,m})||_2 · ||x̂_{j,m} - μ(x̂_{j,m})||_2 )
where ||·||_2 is the L2 norm and μ(·) denotes the mean vector of the corresponding segment. Averaging the intelligibility over all bands and frames gives the STOI metric of the degraded speech:
STOI = (1 / (15M)) Σ_j Σ_m d_{j,m}
Substituting the enhanced speech x̂ into the STOI formula gives the STOI term computed during training:
STOI = (1 / (15M)) Σ_j Σ_m d_{j,m}
where d_{j,m} now denotes the correlation coefficient between the temporal envelopes of the enhanced speech and the clean speech.
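As a rough illustration of how such a term can enter the training loss, the sketch below computes the mean envelope correlation over segments of L frames, assuming the 1/3-octave band envelopes of clean and enhanced speech have already been extracted as (bands × frames) tensors; silence removal and clipping are omitted for brevity, so this is a simplification of the full STOI computation:

```python
import torch

def stoi_term(env_clean: torch.Tensor, env_enh: torch.Tensor, seg_len: int = 30) -> torch.Tensor:
    """Negative mean envelope correlation: minimizing it pushes STOI up.
    env_*: (bands, frames) 1/3-octave band envelopes."""
    corrs = []
    for m in range(seg_len - 1, env_clean.shape[1]):
        x = env_clean[:, m - seg_len + 1:m + 1]                 # clean segments (bands, L)
        y = env_enh[:, m - seg_len + 1:m + 1]
        # normalize the degraded segment toward the clean segment's energy
        y = y * (x.norm(dim=1, keepdim=True) / (y.norm(dim=1, keepdim=True) + 1e-8))
        xc = x - x.mean(dim=1, keepdim=True)
        yc = y - y.mean(dim=1, keepdim=True)
        d = (xc * yc).sum(dim=1) / (xc.norm(dim=1) * yc.norm(dim=1) + 1e-8)
        corrs.append(d.mean())
    return -torch.stack(corrs).mean()
```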
SDR, on the other hand, is the ratio of the energy of the clean component x_target in the enhanced speech x̂ to the energy of the remaining components, where the clean component x_target is the projection of the enhanced speech x̂ onto the clean speech x:
x_target = ( ⟨x̂, x⟩ / ||x||^2 ) · x
SDR is defined as:
SDR = 10 · log10( ||x_target||^2 / ||x̂ - x_target||^2 )
Combining the two formulas above and performing an equivalent transformation of the SDR optimization target to simplify computation, maximizing SDR is equivalent to minimizing:
L_SDR = ||x̂||^2 / ⟨x, x̂⟩^2
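A one-function sketch of this SDR-based term follows; the quantity ||x̂||² / ⟨x, x̂⟩² is used directly as the value to minimize, with a small constant added for numerical stability (an implementation detail that is an assumption, not taken from the patent):

```python
import torch

def sdr_term(clean: torch.Tensor, enhanced: torch.Tensor) -> torch.Tensor:
    """SDR-based loss term for (batch, samples) waveforms: maximizing SDR is
    equivalent to minimizing ||x_hat||^2 / <x, x_hat>^2 for a fixed clean x."""
    dot = (clean * enhanced).sum(dim=-1)
    return (enhanced.pow(2).sum(dim=-1) / (dot.pow(2) + 1e-8)).mean()
```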
Finally, we combine these two terms with the RMSE to form the loss function:
L = L_RMSE + α · L_STOI + γ · L_SDR
where α and γ are the coefficients of the corresponding parts of the loss function, and the STOI term enters with a sign such that minimizing the loss increases STOI.
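Putting the pieces together, a sketch of the joint objective might look as follows; the weighting scheme (RMSE unweighted, α on the STOI term, γ on the SDR term) mirrors the description above, but the exact combination used in the patent is not reproduced here, so it should be read as an assumption:

```python
import torch

def joint_loss(clean, enhanced, env_clean, env_enh, alpha: float = 1.0, gamma: float = 1.0):
    """RMSE plus the STOI- and SDR-based terms defined in the sketches above."""
    rmse = torch.sqrt(torch.mean((clean - enhanced) ** 2))
    return rmse + alpha * stoi_term(env_clean, env_enh) + gamma * sdr_term(clean, enhanced)
```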
Test examples
The speech data used in the experiments come from the TIMIT dataset, the ESC-50 noise dataset is used for training, and the NoiseX-92 noise dataset is also used for testing in order to verify the generalization performance of the model presented herein.
The TIMIT dataset contains 6300 utterances in total, recorded as 10 sentences per speaker from 630 speakers, with a male-to-female ratio of 7:3. Since 7 of the sentences recorded by each speaker are repeated, only 1890 utterances whose sentences are all different were used in this experiment, in order to remove the influence of repeated sentences on model training and testing. About 80% of the utterances were used as the training set and the other 20% as test utterances, with the same male-to-female ratio as the overall TIMIT distribution. The ESC-50 dataset contains 2000 labeled environmental sound recordings in 5 main categories: animals; natural soundscapes and water sounds; human non-speech sounds; interior/domestic sounds; and exterior/urban sounds. All speech was resampled to 16 kHz and cut to a length of 2 seconds. The Adam optimizer was used for stochastic gradient descent (SGD) based optimization, with the learning rate set to a constant 1 × 10^-4.
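For mixing speech and noise at a target SNR (0 dB for training here), a sketch such as the following can be used; the helper name and the exact mixing convention are assumptions, not taken from the patent:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float = 0.0) -> np.ndarray:
    """Scale the noise so that the speech-to-noise power ratio equals snr_db,
    then return the noisy mixture (both 16 kHz clips of equal length)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```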
For the baseline models, several typical encoder-decoder solutions are selected for comparison with the method proposed by the present invention, including spectral-mapping-based and end-to-end methods, and noisy speech is also used as a baseline: (a) noisy speech, (b) AET, (c) CED, (d) R-CED, (e) NoSMB-SE, (f) SMB-SE. AET is an end-to-end speech enhancement architecture; CED and R-CED are convolutional-neural-network time-frequency-domain speech enhancement methods; NoSMB-SE is the SMB-free version of our proposed basic framework, which simply connects low-level information to high levels; and SMB-SE adds 4 SMBs on top of NoSMB-SE.
All models were trained under 0 dB SNR conditions and evaluated at -15 dB, -10 dB, -5 dB, 0 dB and 5 dB signal-to-noise ratios; to evaluate the generalization performance of the proposed framework, we also tested it on the NoiseX-92 noise dataset.
TABLE I. Test results under seen-noise conditions (best results in bold).
TABLE II. Test results under unseen-noise conditions (best results in bold).
The invention provides an end-to-end speech enhancement framework using stacked multi-scale modules: the original time-domain waveform is encoded into a two-dimensional feature representation, a speech enhancement module then learns the mapping from noisy speech to clean speech, and finally the time-domain speech signal is synthesized by decoding. The proposed end-to-end framework effectively extracts the feature information of the time-domain signal, the SMB modules help the model mine more information, and the integration of STOI, SDR and RMSE effectively improves the overall enhancement performance of the model. The framework exhibits noise robustness under low-SNR conditions and good generalization in unknown noise environments.
The present invention is not limited to the above preferred embodiments, and any modifications, equivalent substitutions and improvements made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. A method of end-to-end speech enhancement using stacked multi-scale modules, comprising the steps of:
s1: constructing a cascaded end-to-end speech enhancement framework and splicing the stacked multi-scale modules into the network structure;
s2: in the preprocessing stage, transforming the time-domain signal into a two-dimensional feature;
s3: enhancing the two-dimensional feature with a speech enhancement module;
s4: in the post-processing stage, transforming the enhanced feature representation into a one-dimensional time-domain signal by decoding and synthesis.
2. The speech enhancement method of claim 1, wherein: the cascaded end-to-end speech enhancement framework comprises speech time-domain signal preprocessing, a speech enhancement module, and target speech synthesis post-processing; the specific steps are as follows:
a. in the time-domain signal preprocessing stage, one-dimensional convolutions are applied to the input speech segment, and the result of each convolution kernel acting on the noisy speech y is stacked row by row to form a two-dimensional real-valued feature Y; inspired by the way convolutional neural networks process image pixel values, the two-dimensional feature is separated into an absolute-value feature and a sgn mask;
b. the absolute-value feature of the noisy speech y is input into the speech enhancement module for enhancement, yielding an estimate Â of the absolute-value feature; multiplying it by the sgn mask synthesizes the feature representation of the target speech:
Ŷ = Â ⊙ sgn(Y)
c. a transposed convolution transforms Ŷ into the time-domain signal x̂.
3. The speech enhancement method of claim 1, wherein: the multiscale module includes an average pooling layer, convolutions with convolution kernels of 1 x 1 and 3 x 3, and dilated convolutions of different dilation rates.
4. The speech enhancement method of claim 1, further comprising the step of: incorporating the speech-enhancement evaluation metrics STOI and SDR into the loss function by using a training strategy of multi-objective joint optimization.
5. The speech enhancement method of claim 4, wherein the specific steps of incorporating the STOI metric into the loss function include:
1) the STOI inputs are the clean speech x and the degraded speech x̂; silent regions that do not contribute to speech intelligibility are removed, the time-domain signals are then transformed into the time-frequency domain with an STFT, and both signals are divided into 50%-overlapping frames with a Hanning window;
2) a 1/3-octave band analysis is performed, dividing the spectrum into 15 1/3-octave bands whose center frequencies range from 150 Hz to approximately 4.3 kHz; the short-time envelope x_{j,m} of the clean speech is represented as follows:
x_{j,m} = [X_j(m-L+1), X_j(m-L+2), ... X_j(m)]^T
where X is the 1/3-octave band representation obtained from x, M is the total number of frames of an utterance, m is the frame index, j is the 1/3-octave band index, and L corresponds to the analyzed speech length;
3) the speech is normalized and clipped to obtain the temporal envelope representation x̂_{j,m} of the degraded speech; intelligibility is expressed as the correlation coefficient between the two temporal envelopes:
d_{j,m} = (x_{j,m} - μ(x_{j,m}))^T (x̂_{j,m} - μ(x̂_{j,m})) / ( ||x_{j,m} - μ(x_{j,m})||_2 · ||x̂_{j,m} - μ(x̂_{j,m})||_2 )
where ||·||_2 is the L2 norm and μ(·) denotes the mean vector of the corresponding segment;
4) averaging the intelligibility over all bands and frames gives the STOI metric of the degraded speech:
STOI = (1 / (15M)) Σ_j Σ_m d_{j,m}
5) substituting the enhanced speech x̂ into the STOI formula gives the STOI term computed during training:
STOI = (1 / (15M)) Σ_j Σ_m d_{j,m}
where d_{j,m} now denotes the correlation coefficient between the temporal envelopes of the enhanced speech and the clean speech.
6. The speech enhancement method of claim 4, wherein the specific steps of incorporating the SDR metric into the loss function include:
1) the SDR inputs are the clean speech x and the enhanced speech x̂; the SDR of the enhanced speech is computed as:
SDR = 10 · log10( ||x_target||^2 / ||x̂ - x_target||^2 )
where x_target = ( ⟨x̂, x⟩ / ||x||^2 ) · x is the projection of the enhanced speech onto the clean speech;
2) performing an equivalent transformation of the SDR optimization target to simplify computation gives:
L_SDR = ||x̂||^2 / ⟨x, x̂⟩^2
where the process of maximizing the evaluation metric SDR is equivalent to minimizing L_SDR.
7. The speech enhancement method of claim 4, wherein the specific steps of fusing the STOI and SDR evaluation metrics into the loss function include:
1) the conventional root mean square error is calculated as follows:
L_RMSE = sqrt( (1 / (M·N)) · Σ_{n=1}^{N} Σ_{m=1}^{M} (x_{n,m} - x̂_{n,m})^2 )
where M and N are the number of sampling points per utterance and the total number of utterances, respectively;
2) the root mean square error is combined with the STOI- and SDR-based evaluation-metric loss terms:
L = L_RMSE + α · L_STOI + γ · L_SDR
where α and γ are the coefficients of the corresponding parts of the loss function.
CN201911182689.3A 2019-09-25 2019-11-27 Speech enhancement method using stacked multi-scale modules Active CN110751957B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910913634 2019-09-25
CN2019109136349 2019-09-25

Publications (2)

Publication Number Publication Date
CN110751957A true CN110751957A (en) 2020-02-04
CN110751957B CN110751957B (en) 2020-10-27

Family

ID=69284766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911182689.3A Active CN110751957B (en) 2019-09-25 2019-11-27 Speech enhancement method using stacked multi-scale modules

Country Status (1)

Country Link
CN (1) CN110751957B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101617342A (en) * 2007-01-16 2009-12-30 汤姆科技成像系统有限公司 The figured method and system that is used for multidate information
US20190130903A1 (en) * 2017-10-27 2019-05-02 Baidu Usa Llc Systems and methods for robust speech recognition using generative adversarial networks
CN108491856A (en) * 2018-02-08 2018-09-04 西安电子科技大学 A kind of image scene classification method based on Analysis On Multi-scale Features convolutional neural networks
CN109034162A (en) * 2018-07-13 2018-12-18 南京邮电大学 A kind of image, semantic dividing method
CN109473120A (en) * 2018-11-14 2019-03-15 辽宁工程技术大学 A kind of abnormal sound signal recognition method based on convolutional neural networks
CN109741260A (en) * 2018-12-29 2019-05-10 天津大学 A kind of efficient super-resolution method based on depth back projection network
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110059582A (en) * 2019-03-28 2019-07-26 东南大学 Driving behavior recognition methods based on multiple dimensioned attention convolutional neural networks
CN110010144A (en) * 2019-04-24 2019-07-12 厦门亿联网络技术股份有限公司 Voice signals enhancement method and device
CN110136731A (en) * 2019-05-13 2019-08-16 天津大学 Empty cause and effect convolution generates the confrontation blind Enhancement Method of network end-to-end bone conduction voice
CN110246510A (en) * 2019-06-24 2019-09-17 电子科技大学 A kind of end-to-end speech Enhancement Method based on RefineNet

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
CEES H.TAAL: "An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech", 《IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》 *
HUA HENG: "The enhancement of depth estimation based on multi-scale convolution kernels", 《CONFERENCE ON OPTOELECTRONIC IMAGING AND MULTIMEDIA TECHNOLOGY V》 *
O. RONNEBERGER: "U-net: Convolutional networks for biomedical image segmentation", 《MEDICAL IMAGE COMPUTING AND COMPUTER-ASSISTED INTERVENTION》 *
SZU-WEI FU: "Raw waveform-based speech enhancement by fully convolutional networks", 《2017 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) 》 *
廖轩: "Depth image enhancement based on multi-scale mutual-feature convolutional neural networks", China Masters' Theses Full-text Database *
朱锡祥: "Research on in-vehicle speech recognition technology based on one-dimensional convolutional neural networks", China Masters' Theses Full-text Database *
杨远飞: "Research on optimized convolutional neural networks for image recognition", China Masters' Theses Full-text Database *
范存航: "An end-to-end speech separation method based on convolutional neural networks", Journal of Signal Processing *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111524530A (en) * 2020-04-23 2020-08-11 广州清音智能科技有限公司 Voice noise reduction method based on expansion causal convolution
CN111583947A (en) * 2020-04-30 2020-08-25 厦门快商通科技股份有限公司 Voice enhancement method, device and equipment
US11495216B2 (en) 2020-09-09 2022-11-08 International Business Machines Corporation Speech recognition using data analysis and dilation of interlaced audio input
US11538464B2 (en) 2020-09-09 2022-12-27 International Business Machines Corporation . Speech recognition using data analysis and dilation of speech content from separated audio input
CN112862068A (en) * 2021-01-15 2021-05-28 复旦大学 Fault-tolerant architecture and method for complex convolutional neural network
CN113129918A (en) * 2021-04-15 2021-07-16 浙江大学 Voice dereverberation method combining beam forming and deep complex U-Net network
CN113129918B (en) * 2021-04-15 2022-05-03 浙江大学 Voice dereverberation method combining beam forming and deep complex U-Net network
CN113936680B (en) * 2021-10-08 2023-08-08 电子科技大学 Single-channel voice enhancement method based on multi-scale information perception convolutional neural network
CN113936680A (en) * 2021-10-08 2022-01-14 电子科技大学 Single-channel speech enhancement method based on multi-scale information perception convolutional neural network
CN115050379A (en) * 2022-04-24 2022-09-13 华侨大学 High-fidelity voice enhancement model based on FHGAN and application thereof
CN117174105A (en) * 2023-11-03 2023-12-05 深圳市龙芯威半导体科技有限公司 Speech noise reduction and dereverberation method based on improved deep convolutional network
CN117219107A (en) * 2023-11-08 2023-12-12 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of echo cancellation model
CN117219107B (en) * 2023-11-08 2024-01-30 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of echo cancellation model

Also Published As

Publication number Publication date
CN110751957B (en) 2020-10-27

Similar Documents

Publication Publication Date Title
CN110751957B (en) Speech enhancement method using stacked multi-scale modules
US10777215B2 (en) Method and system for enhancing a speech signal of a human speaker in a video using visual information
CN110246510B (en) End-to-end voice enhancement method based on RefineNet
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN105957537B (en) Speech denoising method and system based on L1/2 sparse-constraint convolutional non-negative matrix factorization
Su et al. Bandwidth extension is all you need
CN108520753A (en) Speech lie detection method based on convolutional bidirectional long short-term memory network
CN112992121B (en) Voice enhancement method based on attention residual error learning
Zhu et al. FLGCNN: A novel fully convolutional neural network for end-to-end monaural speech enhancement with utterance-based objective functions
Geng et al. End-to-end speech enhancement based on discrete cosine transform
CN114360567A (en) Single-channel voice enhancement method based on deep rewinding product network
CN113823308A (en) Method for denoising voice by using single voice sample with noise
Qian et al. Combining equalization and estimation for bandwidth extension of narrowband speech
Islam et al. Supervised single channel speech enhancement based on stationary wavelet transforms and non-negative matrix factorization with concatenated framing process and subband smooth ratio mask
CN114360571A (en) Reference-based speech enhancement method
Jannu et al. Shuffle attention u-Net for speech enhancement in time domain
CN113571074B (en) Voice enhancement method and device based on multi-band structure time domain audio frequency separation network
CN115881156A (en) Multi-scale-based multi-modal time domain voice separation method
Meutzner et al. A generative-discriminative hybrid approach to multi-channel noise reduction for robust automatic speech recognition
CN115472168A (en) Short-time voice voiceprint recognition method, system and equipment coupling BGCC and PWPE characteristics
Hussain et al. A Novel Speech Intelligibility Enhancement Model based on Canonical Correlation and Deep Learning
Zhao Evaluation of multimedia popular music teaching effect based on audio frame feature recognition technology
CN111968627A (en) Bone conduction speech enhancement method based on joint dictionary learning and sparse representation
Soni et al. Comparing front-end enhancement techniques and multiconditioned training for robust automatic speech recognition
Jeon et al. Lightweight U-Net Based Monaural Speech Source Separation for Edge Computing Device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant