CN116364109A - Speech enhancement network signal-to-noise ratio estimator and loss optimization method - Google Patents

Speech enhancement network signal-to-noise ratio estimator and loss optimization method

Info

Publication number
CN116364109A
Authority
CN
China
Prior art keywords
noise ratio
signal
complex
layer
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310200774.8A
Other languages
Chinese (zh)
Inventor
崔立恒 (Cui Liheng)
周翊 (Zhou Yi)
刘宏清 (Liu Hongqing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Posts and Telecommunications
Original Assignee
Chongqing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Posts and Telecommunications
Priority to CN202310200774.8A priority Critical patent/CN116364109A/en
Publication of CN116364109A publication Critical patent/CN116364109A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
                • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
                    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
                        • G10L21/0208 Noise filtering
                            • G10L21/0216 Noise filtering characterised by the method used for estimating noise
                            • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
                • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
                        • G10L25/30 Speech or voice analysis techniques using neural networks
                    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use

Abstract

The invention discloses a signal-to-noise ratio estimator for a speech enhancement network comprising an encoder and a decoder, with a complex-valued CNN, a complex LSTM and a complex BN layer arranged between them. The encoder comprises a complex Conv2D layer, a complex BN layer and a real-valued PReLU layer. A 1-D convolution module is placed after the LSTM layers; it is formed by alternately connecting one-dimensional convolution layers in series with a fully connected layer carrying a sigmoid function. The signal-to-noise ratio estimator uses two one-dimensional convolution layers in series with the sigmoid fully connected layer. Its input is the concatenation of the real and imaginary parts of the noisy speech signal after the complex LSTM computation, and its output is the frame-level prior signal-to-noise ratio calculated according to the given formula, so that good speech quality is maintained.

Description

Speech enhancement network signal-to-noise ratio estimator and loss optimization method
Technical Field
The invention belongs to the field of speech enhancement, and particularly relates to a speech enhancement network signal-to-noise ratio estimator and a loss optimization method.
Background
In real acoustic scenes, speech signals are contaminated by noise to varying degrees, and speech quality suffers severely when the contamination is heavy. The purpose of speech enhancement is to recover the clean original speech signal from noisy speech, improving intelligibility and perceptual quality by suppressing the interference of background noise.  Speech enhancement techniques are widely used in modern voice communication systems, for example automatic speech recognition, mobile communication and conference systems.
Current speech enhancement algorithms fall into two broad classes: traditional algorithms and deep-learning-based algorithms. Traditional speech enhancement algorithms have low computational and hardware cost and perform well on stationary noise, but they cope poorly with non-stationary noise and low signal-to-noise ratios. In recent years, deep-learning-based speech enhancement has achieved very good results; the task is usually posed as supervised learning, enhancing noisy speech either in the time-frequency domain or directly in the time domain.
However, deep-learning-based noise reduction inevitably introduces speech distortion along with the noise suppression; when enhancing noisy speech that already has a high signal-to-noise ratio, the network can cause severe attenuation and even produce output worse than the original noisy speech.
Disclosure of Invention
The invention aims to provide a speech enhancement network signal-to-noise ratio estimator and a loss optimization method that solve the problems identified in the background above.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a speech enhancement network signal-to-noise ratio estimator, comprising an encoder and a decoder, wherein a complex-valued CNN, a complex LSTM and a complex BN layer are arranged between the encoder and the decoder, and the encoder comprises a complex Conv2D layer, a complex BN layer and a real-valued PReLU layer;
a signal-to-noise ratio estimator is arranged after the LSTM layers; it is formed by alternately connecting several one-dimensional convolution layers in series with a fully connected layer, and the fully connected layer carries a sigmoid function;
the process of 1-D convolution modulus calculating a priori signal to noise ratio is as follows:
Figure BDA0004108977150000011
wherein PSNR represents an a priori signal-to-noise ratio, A clean Represent clean speech spectrum, A noise Representing the noise spectrum. The PSNR is standardized, and the following steps are obtained:
Figure BDA0004108977150000021
wherein the method comprises the steps of
Figure BDA0004108977150000022
Represents the standard deviation of PSNR, and σ represents the mean value of PSRN.
Afterwards, PSNR is compressed to between 0 and 1 using ERFC (complementary error function)
PSNR=(erfc(PSNR+1))/2。
Preferably, the Conv2D operation is implemented by a complex-valued filter W defined as
W = W_r + jW_i
where the real matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel. Defining a complex input matrix X = X_r + jX_i, the complex output of the complex convolution operation is
F_out = (X_r * W_r - X_i * W_i) + j(X_r * W_i + X_i * W_r)
where F_out represents the output feature of the complex layer.
The invention further provides a loss optimization method for the above speech enhancement network, comprising the following steps:
S1: construct a speech dataset with reverberation and a speech dataset without reverberation using a dynamic speech-noise mixing method, the speech signal-to-noise ratios being distributed between 0 dB and 20 dB;
S2: input the speech produced in step S1 into a DCCRN speech enhancement model equipped with the signal-to-noise ratio estimator;
S3: the signal-to-noise ratio estimator calculates the prior signal-to-noise ratio Label_SNR, and the loss value is formed from the SI-SNR loss together with a prior-SNR loss built on Label_SNR; the SI-SNR calculation formula is as follows:
s_target = (<s_hat, s> / ||s||_2^2) · s
e_noise = s_hat - s_target
SI-SNR = 10 log10(||s_target||_2^2 / ||e_noise||_2^2)
where s represents the time-domain waveform of the clean speech, s_hat represents the time-domain waveform of the estimated speech, <·,·> denotes the dot product between two vectors and ||·||_2 denotes the Euclidean (L2) norm;
the prior-SNR loss Loss_SNR is computed from the estimated prior signal-to-noise ratio and the label Label_SNR (its exact formula is given only as an image in the original publication), and the total loss is
Loss = (-1) × Loss_SI-SNR + 2 × Loss_SNR
where the coefficient of Loss_SNR is set to 2 so that the scales of the two loss terms are unified;
S4: the prior signal-to-noise ratio calculated by the signal-to-noise ratio estimator is passed to the decoder, and the decoder output yields the enhanced speech.
The speech enhancement network signal-to-noise ratio estimator and loss optimization method have the following beneficial effects:
1. The speech enhancement model effectively combines the advantages of DCUNET and CRN; a complex-valued network is adopted between the encoder and the decoder, so the correlation between magnitude and phase can be modeled through complex multiplication, and the phase information of the speech is better utilized.
2. The invention places the signal-to-noise ratio estimator after the two LSTM layers and redesigns the loss function of the speech enhancement algorithm around it, enabling calculation of the frame-level prior signal-to-noise ratio, maintaining good speech quality and mitigating the speech attenuation caused by the neural network.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention; they do not limit the invention. In the drawings:
fig. 1 is a schematic diagram of a DCCRN speech enhancement model equipped with the speech enhancement network signal-to-noise ratio estimator according to the present invention;
fig. 2 is a schematic diagram of the signal-to-noise ratio estimator according to the present invention.
In the figures: 10. an encoder; 11. a complex Conv2D layer; 12. a complex BN layer; 13. a real-valued PReLU layer; 20. a decoder; 30. a complex-valued CNN; 40. a complex LSTM; 50. a complex BN layer; 60. a signal-to-noise ratio estimator (1-D convolution module); 61. a one-dimensional convolution layer; 62. a fully connected layer.
Detailed Description
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The invention provides a speech enhancement network signal-to-noise ratio estimator arranged in the DCCRN speech enhancement model shown in fig. 1.
The DCCRN speech enhancement model includes an encoder 10 and a decoder 20, between which a complex-valued CNN 30, a complex LSTM 40 and a complex BN layer 50 are disposed. The encoder 10 comprises a complex Conv2D layer 11, a complex BN layer 12 and a real-valued PReLU layer 13. The complex Conv2D layer 11 contains four conventional Conv2D operations that control the flow of complex information through the encoder.
The complex-valued filter W is defined as
W = W_r + jW_i
where the real matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel. Defining a complex input matrix X = X_r + jX_i, the complex output of the complex convolution operation is
F_out = (X_r * W_r - X_i * W_i) + j(X_r * W_i + X_i * W_r)
where F_out represents the output feature of the complex layer.
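For illustration only, a minimal PyTorch sketch of this complex convolution; the class name, input shapes and channel sizes are assumptions and not taken from the patent:

    import torch
    import torch.nn as nn

    class ComplexConv2d(nn.Module):
        # Implements F_out = (X_r * W_r - X_i * W_i) + j(X_r * W_i + X_i * W_r).
        def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
            super().__init__()
            # W_r and W_i are realized as two independent real-valued convolutions;
            # applying both to both input parts gives four real Conv2D operations.
            self.conv_r = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
            self.conv_i = nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)

        def forward(self, x_r, x_i):
            out_r = self.conv_r(x_r) - self.conv_i(x_i)  # real part
            out_i = self.conv_i(x_r) + self.conv_r(x_i)  # imaginary part
            return out_r, out_i

    # Example with the kernel/stride reported later in the text (shapes illustrative):
    conv = ComplexConv2d(1, 32, kernel_size=(5, 2), stride=(2, 1))
    x_r = torch.randn(8, 1, 257, 100)  # (batch, channel, freq bins, frames)
    x_i = torch.randn(8, 1, 257, 100)
    y_r, y_i = conv(x_r, x_i)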
The DCCRN speech enhancement model uses a complex-valued network and can model the correlation between magnitude and phase through complex multiplication, so the phase information of the speech is better utilized.
Referring to fig. 2, two LSTM layers 40 are disposed between the encoder 10 and the decoder 20, and a signal-to-noise ratio estimator 60 is disposed after the LSTM layers 40. The signal-to-noise ratio estimator 60 is formed by alternately connecting one-dimensional convolution layers 61 in series with a fully connected layer 62, and the fully connected layer 62 carries a sigmoid function.
The concatenation of the real and imaginary parts of the noisy speech signal computed by the LSTM layers 40 is input to the signal-to-noise ratio estimator 60, which processes it to calculate the frame-level prior signal-to-noise ratio so as to maintain good speech quality.
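A minimal sketch of such an estimator, assuming the input is arranged as (batch, channels, frames); the kernel sizes, hidden width and activation placement are assumptions:

    import torch
    import torch.nn as nn

    class SNREstimator(nn.Module):
        # Two Conv1d layers followed by a fully connected layer with sigmoid,
        # producing one prior-SNR value in (0, 1) per time frame.
        def __init__(self, in_channels=4, hidden=32):
            super().__init__()
            self.conv1 = nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1)
            self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
            self.act = nn.PReLU()
            self.fc = nn.Linear(hidden, 1)

        def forward(self, x):
            # x: (batch, in_channels, frames) - concatenated real/imaginary
            # features taken from the complex LSTM output.
            h = self.act(self.conv1(x))
            h = self.act(self.conv2(h))
            h = h.transpose(1, 2)                         # (batch, frames, hidden)
            return torch.sigmoid(self.fc(h)).squeeze(-1)  # (batch, frames)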
The signal-to-noise ratio estimator 60 calculates the prior signal-to-noise ratio as follows:
PSNR = 10 log10(A_clean^2 / A_noise^2)
where PSNR represents the prior signal-to-noise ratio, A_clean the clean speech magnitude spectrum and A_noise the noise magnitude spectrum. PSNR is then standardized:
PSNR_norm = (PSNR - μ) / σ
where μ represents the mean of PSNR and σ its standard deviation. Afterwards, PSNR_norm is compressed to between 0 and 1 using the complementary error function erfc:
PSNR = erfc(PSNR_norm + 1) / 2.
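As an illustration, a sketch of computing this compressed frame-level label in PyTorch; pooling the spectra over frequency per frame and the eps safety terms are assumptions not stated in the patent:

    import torch

    def prior_snr_label(mag_clean, mag_noise, eps=1e-8):
        # mag_clean, mag_noise: magnitude spectra of shape (frames, freq_bins).
        psnr = 10.0 * torch.log10(
            (mag_clean.pow(2).sum(dim=-1) + eps)
            / (mag_noise.pow(2).sum(dim=-1) + eps))       # PSNR per frame
        psnr = (psnr - psnr.mean()) / (psnr.std() + eps)  # standardize
        return torch.erfc(psnr + 1.0) / 2.0               # compress to (0, 1)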
The loss optimization method for the speech enhancement network comprises the following steps:
S1: construct a speech dataset with reverberation and a speech dataset without reverberation using a dynamic speech-noise mixing method, the speech signal-to-noise ratios being distributed between 0 dB and 20 dB;
S2: input the speech produced in S1 into a DCCRN speech enhancement model equipped with the signal-to-noise ratio estimator 60;
S3: the signal-to-noise ratio estimator 60 calculates the prior signal-to-noise ratio Label_SNR, and the loss value is formed from the SI-SNR loss together with a prior-SNR loss built on Label_SNR; the SI-SNR calculation formula is as follows:
s_target = (<s_hat, s> / ||s||_2^2) · s
e_noise = s_hat - s_target
SI-SNR = 10 log10(||s_target||_2^2 / ||e_noise||_2^2)
where s represents the time-domain waveform of the clean speech, s_hat represents the time-domain waveform of the estimated speech, <·,·> denotes the dot product between two vectors and ||·||_2 denotes the Euclidean (L2) norm.
The prior-SNR loss Loss_SNR is computed from the estimated prior signal-to-noise ratio and the label Label_SNR; its exact formula is given only as an image in the original publication. The total loss is
Loss = (-1) × Loss_SI-SNR + 2 × Loss_SNR
where the coefficient of Loss_SNR is set to 2 so that the scales of the two loss terms are unified (a code sketch of this combined loss is given after step S4).
S4: the prior signal-to-noise ratio calculated by the signal-to-noise ratio estimator 60 is passed to the decoder 20, and the decoder output yields the enhanced speech.
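A sketch of the combined loss under the stated weighting; Loss_SNR is taken here as the mean-squared error between the estimated and label prior SNR, which is an assumption since that formula is only given as an image in the original:

    import torch
    import torch.nn.functional as F

    def si_snr_db(est, clean, eps=1e-8):
        # est, clean: (batch, samples) time-domain waveforms.
        dot = (est * clean).sum(dim=-1, keepdim=True)  # <s_hat, s>
        s_target = dot * clean / (clean.pow(2).sum(dim=-1, keepdim=True) + eps)
        e_noise = est - s_target
        return 10.0 * torch.log10(
            s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps))

    def total_loss(est_wav, clean_wav, est_snr, label_snr):
        loss_si_snr = si_snr_db(est_wav, clean_wav).mean()
        # Assumption: Loss_SNR as MSE between estimated and label prior SNR.
        loss_snr = F.mse_loss(est_snr, label_snr)
        # Loss = (-1) x Loss_SI-SNR + 2 x Loss_SNR
        return -loss_si_snr + 2.0 * loss_snr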
To fully verify the effect of the loss function used with the signal-to-noise ratio estimator 60 on speech quality, the invention trains and evaluates the proposed model on the dataset provided by the DNS Challenge (INTERSPEECH 2020). The corpus contains 500 hours of clean speech clips from 2150 speakers, a 180-hour noise dataset covering 150 categories, and room impulse responses from the openSLR26 and openSLR28 datasets. The DNS Challenge also provides a publicly available test dataset comprising two categories of synthetic clips, reverberant and non-reverberant, with 150 clips per category and signal-to-noise ratios distributed between 0 dB and 20 dB.
To diversify the data, the invention dynamically mixes speech and noise, preparing 50-hour datasets with and without reverberation (100 hours in total); clean speech and noise are mixed at random signal-to-noise ratios between -5 dB and 20 dB. To produce reverberant speech, utterances are convolved with simulated room impulse responses randomly drawn from the RIR_noise dataset. This yields 60000 speech clips of 6 seconds each, with all audio files at a 16 kHz sampling rate.
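A sketch of dynamic speech-noise mixing at a random SNR in [-5, 20] dB as described; the function name and the placeholder signals are illustrative:

    import numpy as np

    def mix_at_snr(speech, noise, snr_db, eps=1e-8):
        # Scale the noise so the mixture reaches the requested SNR (in dB).
        if len(noise) < len(speech):
            noise = np.tile(noise, len(speech) // len(noise) + 1)
        noise = noise[:len(speech)]
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2) + eps
        scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
        return speech + scale * noise

    # Dynamic mixing: draw a fresh SNR per example, uniform in [-5, 20] dB.
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16000 * 6)  # placeholder 6-second utterance
    noise = rng.standard_normal(16000 * 2)   # placeholder noise clip
    mixture = mix_at_snr(speech, noise, rng.uniform(-5.0, 20.0))
    # For reverberant data, speech would first be convolved with a randomly
    # drawn room impulse response: np.convolve(speech, rir)[:len(speech)].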
In the test model of the invention, the frame size is set to 400 samples, the hop size to 256 samples, and the FFT length to 512 points. The encoder channel numbers are {32, 64, 128, 128, 256, 256}, the convolution kernel is set to (5, 2) and the stride to (2, 1). In the one-dimensional convolution module, the input channel count is 4 and the convolution kernel is set to (32, 1). The last LSTM layer is followed by a dense layer with 1024 units. Each convolutional layer of the decoder looks ahead one frame, six frames in total: 6 × 6.25 ms = 37.5 ms. Adam is used as the model optimizer.
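The stated analysis parameters correspond to the following torch.stft call; the window choice is an assumption, since the patent does not name one:

    import torch

    frame_size, hop_size, fft_len = 400, 256, 512
    window = torch.hann_window(frame_size)  # window type assumed, not stated

    def analyze(waveform):
        # waveform: (batch, samples) at 16 kHz -> complex STFT (batch, 257, frames)
        return torch.stft(waveform, n_fft=fft_len, hop_length=hop_size,
                          win_length=frame_size, window=window,
                          return_complex=True)

    spec = analyze(torch.randn(2, 16000))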
The invention trains the test model with the PyTorch deep learning framework. The model input is a 100-second-long sequence, and training runs for 100 epochs in total. The initial learning rate is set to 0.001. If the model does not improve on the validation set for 2 consecutive epochs, the learning rate is dropped by one third; training is terminated if the model does not improve on the validation set for 10 consecutive epochs.
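The described schedule maps onto the sketch below; "drop by one third" is read here as multiplying the rate by 2/3, which is an interpretation (dividing by 3 is the other plausible reading), and the model is a stand-in:

    import torch

    model = torch.nn.Linear(512, 512)  # stand-in for the actual DCCRN model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=2.0 / 3.0, patience=2)

    best_val, bad_epochs = float("inf"), 0
    for epoch in range(100):
        # train_one_epoch(model, optimizer)  # training loop omitted in this sketch
        val_loss = 1.0 / (epoch + 1)         # placeholder for the validation loss
        scheduler.step(val_loss)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= 10:             # early stop after 10 stagnant epochs
                break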
As shown in table 1, Proposed Model 1 corresponds to adding the signal-to-noise ratio estimator after the LSTM layers, and Proposed Model 2 adds the new loss function provided by the invention on top of Model 1.
Table 1: DNS Challenge synthetic development set, objective evaluation results (table provided as an image in the original publication)
In Table 1, peSQ is an objective speech quality evaluation, STOI is speech intelligibility evaluation, SI-SDR is a scale-invariant signal-to-distortion ratio, and the higher the index score, the better the speech quality, and the better the model effect. As can be seen from table 1, the method proposed by the present invention improves over other methods in 3 indexes under the condition of reverberation and no reverberation. The PESQ score for model 2 was increased by 4.62% over DCCRN under no reverberation conditions, and the scores at STOI and SI-SDR were increased by 1.53% and 1.77dB over NSnet, respectively. Model 2 has 8.22% higher PESQ score than DCCRN under reverberant conditions, 7.82% higher STOI score than DTLN, 5.927dB higher SI-SDR score than NSnet.
Table 2: DNS Challenge blind test set MOS results

Model            | no reverb | reverb | realrec | Ave.
Noisy            | 3.13      | 2.64   | 2.83    | 2.85
NSNet            | 3.49      | 2.64   | 3.00    | 3.03
DCCRN-E          | 4.00      | 2.94   | 3.37    | 3.42
Proposed model 1 | 3.95      | 2.98   | 3.42    | 3.45
Proposed model 2 | 4.03      | 3.02   | 3.57    | 3.54
Table 2 shows the results on the blind test set provided by the DNS Challenge. The proposed model improves the MOS score by 15.47% over NSNet without reverberation, and by 2.72% and 5.93% over DCCRN under the reverberant and real-recording conditions respectively.
Tables 3-5 show the objective evaluation results on the second test set without reverberation. The test results of the proposed model improve at all 5 typical signal-to-noise ratios. Without reverberation, the proposed model gains 2.05% on PESQ, 0.47% on STOI and 0.796 dB on SI-SDR, exceeding the best performance of DCCRN-E.
Table 3: PESQ results on the non-reverberant noisy speech test set

test SNR         | 0dB   | 5dB   | 10dB  | 15dB  | 20dB  | Ave.
Noisy            | 1.789 | 2.132 | 2.47  | 2.833 | 3.176 | 2.48
DCCRN-E          | 2.704 | 3.075 | 3.379 | 3.628 | 3.814 | 3.32
Proposed model 2 | 2.805 | 3.137 | 3.398 | 3.735 | 3.867 | 3.388
Table 4: STOI results on the non-reverberant noisy speech test set

test SNR         | 0dB   | 5dB   | 10dB  | 15dB  | 20dB  | Ave.
Noisy            | 77.55 | 84.78 | 90.44 | 94.34 | 96.76 | 88.77
DCCRN-E          | 87.44 | 92.31 | 95.31 | 97.13 | 98.18 | 94.07
Proposed model 2 | 89.07 | 92.33 | 95.53 | 97.43 | 98.24 | 94.52
Table 5: SI-SDR results on the non-reverberant noisy speech test set

test SNR         | 0dB    | 5dB    | 10dB   | 15dB   | 20dB   | Ave.
Noisy            | 0.005  | 5.003  | 10.003 | 14.899 | 20.001 | 9.982
DCCRN-E          | 11.19  | 14.905 | 18.22  | 21.343 | 24.039 | 17.939
Proposed model 2 | 12.181 | 15.905 | 19.208 | 21.896 | 24.487 | 18.735
In summary, the model with the signal-to-noise ratio estimator focuses on reducing the damage the neural network does to perceived speech quality. The experimental results and metric scores show that the proposed model exceeds the other models, demonstrating the effectiveness of the prior signal-to-noise ratio calculation method used by the signal-to-noise ratio estimator 60.
Finally, it is noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalents may be made without departing from the spirit and scope of the invention, which is intended to be covered by the claims of the present invention.

Claims (3)

1. A speech enhancement network signal-to-noise ratio estimator and loss optimization method, the speech enhancement network signal-to-noise ratio estimator comprising an encoder (10) and a decoder (20), characterized in that:
a complex-valued CNN (30), a complex LSTM (40) and a complex BN layer (50) are arranged between the encoder (10) and the decoder (20), and the encoder (10) comprises a complex Conv2D layer (11), a complex BN layer (12) and a real-valued PReLU layer (13);
a signal-to-noise ratio estimator (60) is arranged after the LSTM layer (40); the signal-to-noise ratio estimator (60) is formed by alternately connecting several one-dimensional convolution layers (61) in series with a fully connected layer (62), and the fully connected layer (62) carries a sigmoid function;
the signal-to-noise ratio estimator (60) calculates the prior signal-to-noise ratio as follows:
PSNR = 10 log10(A_clean^2 / A_noise^2)
where PSNR represents the prior signal-to-noise ratio, A_clean the clean speech magnitude spectrum and A_noise the noise magnitude spectrum; PSNR is standardized as
PSNR_norm = (PSNR - μ) / σ
where μ represents the mean of PSNR and σ its standard deviation; afterwards, PSNR_norm is compressed to between 0 and 1 using the complementary error function erfc:
PSNR = erfc(PSNR_norm + 1) / 2.
2. The speech enhancement network signal-to-noise ratio estimator and loss optimization method according to claim 1, wherein: the Conv2D operation is implemented by a complex-valued filter W defined as:
W = W_r + jW_i
where the real matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel; defining a complex input matrix X = X_r + jX_i, the complex output of the complex convolution operation is
F_out = (X_r * W_r - X_i * W_i) + j(X_r * W_i + X_i * W_r)
where F_out represents the output feature of the complex layer.
3. The speech enhancement network signal-to-noise ratio estimator and loss optimization method according to claim 1 or 2, wherein the speech enhancement network loss optimization method comprises the following steps:
S1: construct a speech dataset with reverberation and a speech dataset without reverberation using a dynamic speech-noise mixing method, the speech signal-to-noise ratios being distributed between 0 dB and 20 dB;
S2: input the speech produced in step S1 into a DCCRN speech enhancement model equipped with the signal-to-noise ratio estimator (60);
S3: the signal-to-noise ratio estimator (60) calculates the prior signal-to-noise ratio Label_SNR, and the loss value is formed from the SI-SNR loss together with a prior-SNR loss built on Label_SNR, the SI-SNR calculation formula being as follows:
s_target = (<s_hat, s> / ||s||_2^2) · s
e_noise = s_hat - s_target
SI-SNR = 10 log10(||s_target||_2^2 / ||e_noise||_2^2)
where s represents the time-domain waveform of the clean speech, s_hat represents the time-domain waveform of the estimated speech, <·,·> denotes the dot product between two vectors and ||·||_2 denotes the Euclidean (L2) norm;
the prior-SNR loss Loss_SNR in the loss function is computed from the estimated prior signal-to-noise ratio and the label Label_SNR (its exact formula is given only as an image in the original publication), and the total loss is
Loss = (-1) × Loss_SI-SNR + 2 × Loss_SNR
wherein the coefficient of Loss_SNR is set to 2 so that the scales of the two loss terms are unified;
S4: the prior signal-to-noise ratio calculated by the signal-to-noise ratio estimator (60) is passed to the decoder (20), and the decoder output yields the enhanced speech.
CN202310200774.8A 2023-03-03 2023-03-03 Speech enhancement network signal-to-noise ratio estimator and loss optimization method Pending CN116364109A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310200774.8A CN116364109A (en) 2023-03-03 2023-03-03 Speech enhancement network signal-to-noise ratio estimator and loss optimization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310200774.8A CN116364109A (en) 2023-03-03 2023-03-03 Speech enhancement network signal-to-noise ratio estimator and loss optimization method

Publications (1)

Publication Number Publication Date
CN116364109A true CN116364109A (en) 2023-06-30

Family

ID=86933661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310200774.8A Pending CN116364109A (en) 2023-03-03 2023-03-03 Speech enhancement network signal-to-noise ratio estimator and loss optimization method

Country Status (1)

Country Link
CN (1) CN116364109A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778913A (en) * 2023-08-25 2023-09-19 澳克多普有限公司 Speech recognition method and system for enhancing noise robustness
CN116778913B (en) * 2023-08-25 2023-10-20 澳克多普有限公司 Speech recognition method and system for enhancing noise robustness
CN117198290A (en) * 2023-11-06 2023-12-08 深圳市金鼎胜照明有限公司 Acoustic control-based multi-mode LED intelligent control method and apparatus

Similar Documents

Publication Publication Date Title
Lv et al., DCCRN+: Channel-wise subband DCCRN with SNR estimation for speech enhancement
CN116364109A (en) Speech enhancement network signal-to-noise ratio estimator and loss optimization method
CN110619885B (en) Method for generating confrontation network voice enhancement based on deep complete convolution neural network
CN110428849B (en) Voice enhancement method based on generation countermeasure network
KR100304666B1 (en) Speech enhancement method
CN112151059A (en) Microphone array-oriented channel attention weighted speech enhancement method
CN107452389A (en) A kind of general monophonic real-time noise-reducing method
CN113936681B (en) Speech enhancement method based on mask mapping and mixed cavity convolution network
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
CN105280193B (en) Priori signal-to-noise ratio estimation method based on MMSE error criterion
CN110867181A (en) Multi-target speech enhancement method based on SCNN and TCNN joint estimation
Mirsamadi et al. Causal speech enhancement combining data-driven learning and suppression rule estimation.
Ju et al. Tea-pse: Tencent-ethereal-audio-lab personalized speech enhancement system for icassp 2022 dns challenge
CN112331224A (en) Lightweight time domain convolution network voice enhancement method and system
CN110728989A (en) Binaural voice separation method based on long-time and short-time memory network LSTM
Shi et al. Robust speaker recognition based on improved GFCC
CN110808057A (en) Voice enhancement method for generating confrontation network based on constraint naive
CN112634927B (en) Short wave channel voice enhancement method
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
Yamashita et al. Improved spectral subtraction utilizing iterative processing
CN101533642B (en) Method for processing voice signal and device
CN113763984B (en) Parameterized noise elimination system for distributed multi-speaker
CN115273884A (en) Multi-stage full-band speech enhancement method based on spectrum compression and neural network
CN113411456B (en) Voice quality assessment method and device based on voice recognition
CN113393852B (en) Method and system for constructing voice enhancement model and method and system for voice enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination