CN116364109A - Speech enhancement network signal-to-noise ratio estimator and loss optimization method - Google Patents
- Publication number: CN116364109A
- Application number: CN202310200774.8A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L25/48 — Speech or voice analysis techniques specially adapted for particular use
- G10L21/0208 — Speech enhancement; noise filtering
- G10L21/0216 — Noise filtering characterised by the method used for estimating noise
- G10L25/30 — Speech or voice analysis characterised by the analysis technique using neural networks
- G10L2021/02082 — Noise filtering where the noise is echo or reverberation of the speech
Abstract
The invention discloses a signal-to-noise ratio (SNR) estimator for a speech enhancement network comprising an encoder and a decoder, with a complex-valued CNN, a complex LSTM, and a complex batch-normalization (BN) layer arranged between them. The encoder comprises a complex Conv2D layer, a complex BN layer, and a real PReLU layer. A 1-D convolution module is placed after the LSTM layers; it is formed by alternately combining one-dimensional convolution layers with a fully connected layer in series, the fully connected layer carrying a sigmoid activation. The SNR estimator uses two one-dimensional convolution layers in series with the sigmoid-activated fully connected layer. Its input is the concatenation of the real and imaginary parts of the noisy speech signal after the complex LSTM computation, and its output is the frame-level a priori signal-to-noise ratio, computed according to a formula, which is used to preserve good speech quality.
Description
Technical Field
The invention belongs to the technical field of speech enhancement, and in particular relates to a signal-to-noise ratio estimator for a speech enhancement network and a loss optimization method.
Background
In real acoustic scenes, speech signals are contaminated by noise to varying degrees, and heavy contamination severely degrades speech quality. The purpose of speech enhancement is to recover the clean original speech signal from noisy speech, improving intelligibility and perceptual quality by suppressing background-noise interference. Speech enhancement is widely used in modern voice communication systems, for example in automatic speech recognition, mobile communication, and conferencing systems.
Current speech enhancement algorithms fall into two main categories: traditional algorithms and deep-learning-based algorithms. Traditional algorithms have low computational and hardware requirements and perform well at removing stationary noise, but handle non-stationary noise and low-SNR conditions poorly. In recent years, deep-learning-based speech enhancement has achieved very good results; it is usually treated as a supervised learning task that enhances noisy speech either in the time-frequency domain or directly in the time domain.
However, deep-learning-based noise reduction inevitably introduces speech distortion. When enhancing noisy speech that already has a high signal-to-noise ratio, the network can cause severe attenuation, producing output that is even worse than the original noisy speech.
Disclosure of Invention
The invention aims to provide a speech enhancement network signal-to-noise ratio estimator and a loss optimization method that solve the problems identified in the background above.
In order to achieve the above purpose, the present invention provides the following technical solutions:
Compared with the prior art, the invention provides a speech enhancement network signal-to-noise ratio estimator comprising an encoder and a decoder. A complex-valued CNN, a complex LSTM, and a complex BN layer are arranged between the encoder and the decoder, and the encoder comprises a complex Conv2D layer, a complex BN layer, and a real PReLU layer.
A signal-to-noise ratio estimator is placed after the LSTM layers; it is formed by alternately combining one-dimensional convolution layers with a fully connected layer in series, the fully connected layer carrying a sigmoid function.
the process of 1-D convolution modulus calculating a priori signal to noise ratio is as follows:
wherein PSNR represents an a priori signal-to-noise ratio, A clean Represent clean speech spectrum, A noise Representing the noise spectrum. The PSNR is standardized, and the following steps are obtained:
wherein the method comprises the steps ofRepresents the standard deviation of PSNR, and σ represents the mean value of PSRN.
Afterwards, PSNR is compressed to between 0 and 1 using ERFC (complementary error function)
PSNR=(erfc(PSNR+1))/2。
2. Preferably, the Conv2D operation is implemented by a complex-valued filter W defined as:

W = W_r + j·W_i

where the real matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel. Defining a complex input matrix X = X_r + j·X_i, the complex output of the complex convolution operation is obtained as:

F_out = (X_r ∗ W_r − X_i ∗ W_i) + j·(X_r ∗ W_i + X_i ∗ W_r)

where F_out represents the output feature of the complex layer and ∗ denotes convolution.
3. A loss optimization method for the speech enhancement network of claim 1, comprising the following steps:
S1: using a dynamic speech-noise mixing method, produce a speech data set with reverberation and a speech data set without reverberation, with the speech signal-to-noise ratios distributed between 0 dB and 20 dB;
S2: input the speech produced in step S1 into a DCCRN speech enhancement model equipped with the signal-to-noise ratio estimator;
S3: the signal-to-noise ratio estimator calculates the a priori signal-to-noise ratio Lable_snr, and the loss value is obtained by combining it with the SI-SNR, where the SI-SNR is calculated as:

s_target = (⟨ŝ, s⟩ / ‖s‖²)·s,  e_noise = ŝ − s_target,  SI-SNR = 10·log10(‖s_target‖² / ‖e_noise‖²)

where s represents the time-domain waveform of the clean speech, ŝ the time-domain waveform of the estimated speech, ⟨·,·⟩ the dot product between two vectors, and ‖·‖ the Euclidean norm; the total loss is then:

Loss = (−1) × Loss_SI-SNR + 2 × Loss_SNR

where the coefficient of Loss_SNR is set to 2 to unify the scales of the two loss terms;
S4: the a priori signal-to-noise ratio calculated by the signal-to-noise ratio estimator is decoded by the decoder and output to obtain the enhanced speech.
The speech enhancement network signal-to-noise ratio estimator and loss optimization method have the following beneficial effects:
1. The speech enhancement model effectively combines the advantages of DCUNET and CRN. A complex-valued network is adopted between the encoder and the decoder, so the correlation between magnitude and phase can be modeled through complex multiplication and the phase information of speech can be better exploited.
2. The invention places the signal-to-noise ratio estimator after the two LSTM layers and redesigns the loss function of the speech enhancement algorithm around it. The estimator computes a frame-level a priori signal-to-noise ratio, which maintains good speech quality and mitigates the speech attenuation introduced by the neural network.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
fig. 1 is a schematic diagram of a DCCRN voice enhancement model provided with a voice enhancement network signal-to-noise estimator according to the present invention;
fig. 2 is a schematic diagram of the signal-to-noise ratio estimator according to the present invention.
In the figures: 10. encoder; 11. complex Conv2D layer; 12. complex BN layer; 13. real PReLU layer; 20. decoder; 30. complex-valued CNN; 40. complex LSTM; 50. complex BN layer; 60. signal-to-noise ratio estimator (1-D convolution module); 61. one-dimensional convolution layer; 62. fully connected layer.
Detailed Description
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
The invention provides a voice enhancement network signal-to-noise ratio estimator which is arranged in a DCCRN voice enhancement model shown in figure 1.
The DCCRN speech enhancement model includes an encoder 10 and a decoder 20; a complex-valued CNN 30, a complex LSTM 40, and a complex BN layer 50 are disposed between them. The encoder 10 comprises a complex Conv2D layer 11, a complex BN layer 12, and a real PReLU layer 13. The complex Conv2D layer 11 contains four conventional Conv2D operations that control the flow of complex information through the encoder.
The complex-valued filter W is defined as:

W = W_r + j·W_i

where the real matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel. Defining a complex input matrix X = X_r + j·X_i, the complex output of the complex convolution operation is obtained as:

F_out = (X_r ∗ W_r − X_i ∗ W_i) + j·(X_r ∗ W_i + X_i ∗ W_r)

where F_out represents the output feature of the complex layer and ∗ denotes convolution.
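The four-real-convolution form above can be sketched in NumPy. This is an illustrative sketch only: it uses 1-D convolution as a stand-in for the patent's Conv2D, and the helper name `complex_conv` is ours; the sanity check confirms the rule matches native complex arithmetic.

```python
import numpy as np

def complex_conv(x_r, x_i, w_r, w_i, conv):
    """Apply a complex filter W = W_r + j*W_i to input X = X_r + j*X_i
    using four real convolutions:
    F_out = (X_r * W_r - X_i * W_i) + j(X_r * W_i + X_i * W_r)."""
    out_r = conv(x_r, w_r) - conv(x_i, w_i)
    out_i = conv(x_r, w_i) + conv(x_i, w_r)
    return out_r, out_i

rng = np.random.default_rng(0)
x_r, x_i = rng.standard_normal(16), rng.standard_normal(16)
w_r, w_i = rng.standard_normal(5), rng.standard_normal(5)

# 1-D convolution as the linear operation (the patent uses Conv2D)
conv = lambda a, b: np.convolve(a, b, mode="valid")
out_r, out_i = complex_conv(x_r, x_i, w_r, w_i, conv)

# Cross-check against NumPy's built-in complex convolution
ref = np.convolve(x_r + 1j * x_i, w_r + 1j * w_i, mode="valid")
assert np.allclose(out_r + 1j * out_i, ref)
```

The same decomposition carries over to Conv2D unchanged, since it relies only on the linearity of the convolution operator.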
The DCCRN speech enhancement model uses a complex-valued network; the correlation between magnitude and phase can be modeled through complex multiplication, so the phase information of the speech is better exploited.
Referring to fig. 2, two LSTM layers 40 are disposed between the encoder 10 and the decoder 20, and a signal-to-noise ratio estimator 60 is arranged after the LSTM layers 40. The estimator 60 is formed by alternately combining one-dimensional convolution layers 61 with a fully connected layer 62 in series, the fully connected layer 62 carrying a sigmoid function.
The concatenation of the real and imaginary parts of the noisy speech signal produced by the LSTM layers 40 is fed into the signal-to-noise ratio estimator 60, which processes it to compute the frame-level a priori signal-to-noise ratio used to maintain good speech quality.
The signal-to-noise ratio estimator 60 calculates the a priori signal-to-noise ratio as follows:

PSNR = 10·log10(A_clean² / A_noise²)

where PSNR represents the a priori signal-to-noise ratio, A_clean the clean speech spectrum, and A_noise the noise spectrum. The PSNR is then standardized to obtain:

PSNR ← (PSNR − μ) / σ

where μ and σ represent the mean and standard deviation of the PSNR. Afterwards, the PSNR is compressed to between 0 and 1 using the complementary error function (erfc):

PSNR = erfc(PSNR + 1) / 2.
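A minimal NumPy sketch of this labeling pipeline follows. The dB energy-ratio form of PSNR and the small `eps` terms are our assumptions (the source formula is an image that did not survive extraction); the standardize-then-erfc steps follow the text.

```python
import math
import numpy as np

def prior_snr_labels(a_clean, a_noise, eps=1e-8):
    """Frame-level a priori SNR labels: dB ratio of clean to noise
    spectral energy, z-score standardized, then compressed into (0, 1)
    with the complementary error function."""
    psnr = 10.0 * np.log10((a_clean ** 2 + eps) / (a_noise ** 2 + eps))
    psnr = (psnr - psnr.mean()) / (psnr.std() + eps)  # standardize
    erfc = np.vectorize(math.erfc)                    # erfc maps R -> (0, 2)
    return erfc(psnr + 1.0) / 2.0                     # squash into (0, 1)

# Toy magnitude spectra for 4 frames
a_clean = np.array([1.0, 2.0, 0.5, 3.0])
a_noise = np.array([0.5, 0.5, 1.0, 0.2])
labels = prior_snr_labels(a_clean, a_noise)
assert np.all((labels > 0) & (labels < 1))
```

Because erfc is monotonically decreasing, the division by 2 guarantees a label strictly inside (0, 1), matching the sigmoid output range of the estimator's fully connected layer.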
A loss optimization method for the speech enhancement network comprises the following steps:
S1: using a dynamic speech-noise mixing method, produce a speech data set with reverberation and a speech data set without reverberation, with the speech signal-to-noise ratios distributed between 0 dB and 20 dB;
S2: input the speech produced in step S1 into the DCCRN speech enhancement model equipped with the signal-to-noise ratio estimator 60;
S3: the signal-to-noise ratio estimator 60 calculates the a priori signal-to-noise ratio Lable_snr, and the loss value is obtained by combining it with the SI-SNR, where the SI-SNR is calculated as:

s_target = (⟨ŝ, s⟩ / ‖s‖²)·s,  e_noise = ŝ − s_target,  SI-SNR = 10·log10(‖s_target‖² / ‖e_noise‖²)

where s represents the time-domain waveform of the clean speech, ŝ the time-domain waveform of the estimated speech, ⟨·,·⟩ the dot product between two vectors, and ‖·‖ the Euclidean norm.
The total loss, combining the a priori signal-to-noise ratio term with the SI-SNR term, is calculated as:

Loss = (−1) × Loss_SI-SNR + 2 × Loss_SNR

where the coefficient of Loss_SNR is set to 2 to unify the scales of the two loss terms.
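The SI-SNR metric and the combined objective can be sketched as follows. The mean-removal step and the `eps` guard are common conventions we assume, since the source formula is an image; `si_snr` and `total_loss` are our names.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a clean waveform."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the clean reference
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps) /
                           (np.dot(e_noise, e_noise) + eps))

def total_loss(loss_si_snr, loss_snr):
    """Loss = (-1) x Loss_SI-SNR + 2 x Loss_SNR, as in the text."""
    return -1.0 * loss_si_snr + 2.0 * loss_snr

t = np.linspace(0.0, 1.0, 16000)
clean = np.sin(2 * np.pi * 440 * t)
est = clean + 0.05 * np.random.default_rng(1).standard_normal(t.size)

good = si_snr(est, clean)
assert good > 15.0                      # a close estimate scores high
assert total_loss(good, 0.0) == -good   # maximizing SI-SNR lowers the loss
```

The negative sign turns the SI-SNR (higher is better) into a quantity to minimize, while the factor of 2 rescales the bounded SNR-estimator loss so neither term dominates.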
s4: the a priori signal to noise ratio calculated by the signal to noise ratio estimator 60 is decoded by the decoder 20 and output to yield enhanced speech.
To fully verify the effect of the loss function used by the signal-to-noise ratio estimator 60 on speech quality, the invention trains and evaluates the proposed model on the data set provided by the DNS Challenge (INTERSPEECH 2020). The corpus contains 500 hours of clean speech clips from 2150 speakers, a 180-hour noise data set covering 150 categories, and room impulse responses from the openSLR26 and openSLR28 data sets. The DNS Challenge also provides a publicly available test data set comprising two categories of synthetic clips, one with reverberation and one without, with 150 clips per category and signal-to-noise ratios distributed between 0 dB and 20 dB.
To diversify the data, the invention adopts a method of dynamically mixing speech and noise, producing 50-hour data sets with and without reverberation (100 hours in total); clean speech and noise are mixed at random signal-to-noise ratios between −5 dB and 20 dB. To produce reverberant speech, the speech is convolved with room impulse responses randomly drawn from the RIR_noise data set of simulated room reverberation. In total, 60000 speech clips of 6 seconds each were obtained, with all audio files at a 16 kHz sampling rate.
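The dynamic mixing step can be sketched in NumPy. The scaling derivation is the standard power-ratio construction, which we assume here since the patent does not spell it out; `mix_at_snr` is our name.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals `snr_db` dB,
    then add it to `clean` (dynamic speech-noise mixing sketch)."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(42)
clean = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))  # 1 s at 16 kHz
noise = rng.standard_normal(16000)
snr_db = rng.uniform(-5, 20)          # random SNR in [-5, 20] dB, as in the text
noisy = mix_at_snr(clean, noise, snr_db)

# Verify the realized SNR matches the requested target
scaled_noise = noisy - clean
realized = 10 * np.log10(np.mean(clean**2) / np.mean(scaled_noise**2))
assert abs(realized - snr_db) < 1e-6
```

Drawing `snr_db` freshly for every training example is what makes the mixing "dynamic": the same clean clip can appear at many different noise levels across epochs.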
In the test model, the frame size is set to 400 points, the hop size to 256 points, and the FFT length to 512 points. The encoder channel counts are {32, 64, 128, 128, 256, 256}, with convolution kernels of (5, 2) and strides of (2, 1). In the one-dimensional convolution module, the input channel count is set to 4 and the convolution kernel to (32, 1). The last LSTM layer is followed by a dense layer of 1024 units. The decoder looks ahead one frame at each convolutional layer, for a total of 6 frames, i.e. 6 × 6.25 = 37.5 ms. The model uses the Adam optimizer.
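Under these analysis settings, the framing can be sketched as follows (the Hann window choice is our assumption; the patent specifies only frame, hop, and FFT sizes):

```python
import numpy as np

# Analysis parameters from the test model: 400-sample frames,
# 256-sample hop, 512-point FFT, 16 kHz audio.
FRAME, HOP, N_FFT, SR = 400, 256, 512, 16000

def stft_frames(x):
    """Windowed, zero-padded one-sided spectra of all full frames of x."""
    n_frames = 1 + (len(x) - FRAME) // HOP
    window = np.hanning(FRAME)
    spec = np.stack([np.fft.rfft(x[i*HOP:i*HOP+FRAME] * window, n=N_FFT)
                     for i in range(n_frames)])
    return spec  # shape (n_frames, N_FFT // 2 + 1)

x = np.random.default_rng(0).standard_normal(SR)  # 1 s of audio
spec = stft_frames(x)
assert spec.shape == (1 + (SR - FRAME) // HOP, N_FFT // 2 + 1)  # (61, 257)
```

Each 400-sample frame is zero-padded to the 512-point FFT length, giving 257 one-sided frequency bins per frame.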
The invention trains the test model using the PyTorch deep learning framework. The input to the model is a sequence 100 seconds long, and training runs for 100 epochs in total. The initial learning rate is set to 0.001. If the model does not improve on the validation set for 2 consecutive epochs, the learning rate is reduced by one third; training is terminated if the model does not improve on the validation set for 10 consecutive epochs.
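The learning-rate and early-stopping rule can be sketched in plain Python. How the "drop by one third" interacts with the 2-epoch patience is our interpretation; `PlateauScheduler` is our name (in PyTorch this role is usually played by `ReduceLROnPlateau`).

```python
class PlateauScheduler:
    """Start at lr=0.001; cut the learning rate by one third after every
    2 epochs without validation improvement; stop after 10 such epochs."""
    def __init__(self, lr=1e-3, drop_patience=2, stop_patience=10):
        self.lr = lr
        self.drop_patience = drop_patience
        self.stop_patience = stop_patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs % self.drop_patience == 0:
                self.lr *= 2.0 / 3.0   # "reduced by one third"
        return self.bad_epochs >= self.stop_patience  # True => stop training

sched = PlateauScheduler()
stop = False
for loss in [1.0, 0.9] + [0.95] * 10:  # validation loss stalls after epoch 2
    stop = sched.step(loss)
    if stop:
        break
assert stop and sched.lr < 1e-3
```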
As shown in table 1, model 1 (Proposed Model 1) adds a signal-to-noise ratio estimator after the LSTM layers, and model 2 (Proposed Model 2) adds the new loss function provided by the invention on top of model 1.
Table 1 DNS Challenge synthetic development set objective evaluation results
In Table 1, PESQ is an objective speech quality metric, STOI measures speech intelligibility, and SI-SDR is the scale-invariant signal-to-distortion ratio; the higher the score, the better the speech quality and the model. As Table 1 shows, the proposed method improves on the other methods in all three metrics, both with and without reverberation. Without reverberation, the PESQ score of model 2 is 4.62% higher than DCCRN's, and its STOI and SI-SDR scores exceed NSNet's by 1.53% and 1.77 dB, respectively. With reverberation, model 2's PESQ score is 8.22% higher than DCCRN's, its STOI score 7.82% higher than DTLN's, and its SI-SDR score 5.927 dB higher than NSNet's.
Table 2 DNS Challenge blind test set MOS test results

Model | no reverb | reverb | realrec | Ave.
---|---|---|---|---
Noisy | 3.13 | 2.64 | 2.83 | 2.85
NSNet | 3.49 | 2.64 | 3.00 | 3.03
DCCRN-E | 4.00 | 2.94 | 3.37 | 3.42
Proposed model 1 | 3.95 | 2.98 | 3.42 | 3.45
Proposed model 2 | 4.03 | 3.02 | 3.57 | 3.54
Table 2 shows the results of tests performed on the blind test set provided by the DNS Challenge. The proposed model improves the MOS score by 15.47% over NSNet without reverberation, and by 2.72% and 5.93% over DCCRN under reverberant and real-recording conditions, respectively.
Tables 3-5 show the objective evaluation results for the second test set under no-reverberation conditions. The proposed model improves on all five typical signal-to-noise ratios. Without reverberation, it improves PESQ, STOI, and SI-SDR by 2.05%, 0.47%, and 0.796 dB respectively, exceeding the best performance of DCCRN-E.
TABLE 3 Test results on the non-reverberant noisy speech test set (PESQ)

test SNR | 0 dB | 5 dB | 10 dB | 15 dB | 20 dB | Ave.
---|---|---|---|---|---|---
Noisy | 1.789 | 2.132 | 2.47 | 2.833 | 3.176 | 2.48
DCCRN-E | 2.704 | 3.075 | 3.379 | 3.628 | 3.814 | 3.32
Proposed model 2 | 2.805 | 3.137 | 3.398 | 3.735 | 3.867 | 3.388
TABLE 4 Test results on the non-reverberant noisy speech test set (STOI)

test SNR | 0 dB | 5 dB | 10 dB | 15 dB | 20 dB | Ave.
---|---|---|---|---|---|---
Noisy | 77.55 | 84.78 | 90.44 | 94.34 | 96.76 | 88.77
DCCRN-E | 87.44 | 92.31 | 95.31 | 97.13 | 98.18 | 94.07
Proposed model 2 | 89.07 | 92.33 | 95.53 | 97.43 | 98.24 | 94.52
TABLE 5 Test results on the non-reverberant noisy speech test set (SI-SDR)

test SNR | 0 dB | 5 dB | 10 dB | 15 dB | 20 dB | Ave.
---|---|---|---|---|---|---
Noisy | 0.005 | 5.003 | 10.003 | 14.899 | 20.001 | 9.982
DCCRN-E | 11.19 | 14.905 | 18.22 | 21.343 | 24.039 | 17.939
Proposed model 2 | 12.181 | 15.905 | 19.208 | 21.896 | 24.487 | 18.735
In summary, the model incorporating the signal-to-noise ratio estimator focuses on reducing the damage the neural network does to perceptual speech quality. The experimental results and metric scores show that the proposed model outperforms the other models, demonstrating the effectiveness of the a priori signal-to-noise ratio calculation method used by the signal-to-noise ratio estimator 60.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.
Claims (3)
1. A speech enhancement network signal-to-noise ratio estimator and loss optimization method, the speech enhancement network signal-to-noise ratio estimator comprising an encoder (10) and a decoder (20), characterized in that:
a complex operation CNN (30), a complex LSTM (40) and a complex BN layer (50) are arranged between the encoder (10) and the decoder (20), and the encoder (10) comprises a complex Conv2D layer (11), a complex BN layer (12) and a real PReLU layer (13);
a signal-to-noise ratio estimator (60) is arranged after the LSTM layer (40); the signal-to-noise ratio estimator (60) is formed by alternately combining one-dimensional convolution layers (61) with a fully connected layer (62) in series, the fully connected layer (62) carrying a sigmoid function;
the process by which the signal-to-noise ratio estimator (60) calculates the a priori signal-to-noise ratio is as follows:

PSNR = 10·log10(A_clean² / A_noise²)

where PSNR represents the a priori signal-to-noise ratio, A_clean the clean speech spectrum, and A_noise the noise spectrum; the PSNR is then standardized to obtain:

PSNR ← (PSNR − μ) / σ

where μ and σ represent the mean and standard deviation of the PSNR; afterwards, the PSNR is compressed to between 0 and 1 using the complementary error function (erfc):

PSNR = erfc(PSNR + 1) / 2.
2. The speech enhancement network signal-to-noise ratio estimator and loss optimization method according to claim 1, characterized in that the Conv2D operation is implemented by a complex-valued filter W defined as:

W = W_r + j·W_i

where the real matrices W_r and W_i represent the real and imaginary parts of the complex convolution kernel; defining a complex input matrix X = X_r + j·X_i, the complex output of the complex convolution operation is obtained as:

F_out = (X_r ∗ W_r − X_i ∗ W_i) + j·(X_r ∗ W_i + X_i ∗ W_r)

where F_out represents the output feature of the complex layer and ∗ denotes convolution.
3. The speech enhancement network signal-to-noise ratio estimator and loss optimization method according to claim 1 or 2, the loss optimization method comprising the following steps:
S1: using a dynamic speech-noise mixing method, produce a speech data set with reverberation and a speech data set without reverberation, with the speech signal-to-noise ratios distributed between 0 dB and 20 dB;
S2: input the speech produced in step S1 into a DCCRN speech enhancement model equipped with the signal-to-noise ratio estimator (60);
S3: the signal-to-noise ratio estimator (60) calculates the a priori signal-to-noise ratio Lable_snr, and the loss value is obtained by combining it with the SI-SNR, where the SI-SNR is calculated as:

s_target = (⟨ŝ, s⟩ / ‖s‖²)·s,  e_noise = ŝ − s_target,  SI-SNR = 10·log10(‖s_target‖² / ‖e_noise‖²)

where s represents the time-domain waveform of the clean speech, ŝ the time-domain waveform of the estimated speech, ⟨·,·⟩ the dot product between two vectors, and ‖·‖ the Euclidean norm;
the a priori signal-to-noise ratio Lable_snr enters the loss function as:

Loss = (−1) × Loss_SI-SNR + 2 × Loss_SNR

where the coefficient of Loss_SNR is set to 2 to unify the scales of the two loss terms;
S4: the a priori signal-to-noise ratio calculated by the signal-to-noise ratio estimator (60) is decoded by the decoder (20) and output to obtain the enhanced speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310200774.8A CN116364109A (en) | 2023-03-03 | 2023-03-03 | Speech enhancement network signal-to-noise ratio estimator and loss optimization method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310200774.8A CN116364109A (en) | 2023-03-03 | 2023-03-03 | Speech enhancement network signal-to-noise ratio estimator and loss optimization method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116364109A true CN116364109A (en) | 2023-06-30 |
Family
ID=86933661
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310200774.8A Pending CN116364109A (en) | 2023-03-03 | 2023-03-03 | Speech enhancement network signal-to-noise ratio estimator and loss optimization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116364109A (en) |
-
2023
- 2023-03-03 CN CN202310200774.8A patent/CN116364109A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116778913A (en) * | 2023-08-25 | 2023-09-19 | 澳克多普有限公司 | Speech recognition method and system for enhancing noise robustness |
CN116778913B (en) * | 2023-08-25 | 2023-10-20 | 澳克多普有限公司 | Speech recognition method and system for enhancing noise robustness |
CN117198290A (en) * | 2023-11-06 | 2023-12-08 | 深圳市金鼎胜照明有限公司 | Acoustic control-based multi-mode LED intelligent control method and apparatus |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lv et al. | Dccrn+: Channel-wise subband dccrn with snr estimation for speech enhancement | |
CN116364109A (en) | Speech enhancement network signal-to-noise ratio estimator and loss optimization method | |
CN110619885B (en) | Method for generating confrontation network voice enhancement based on deep complete convolution neural network | |
CN110428849B (en) | Voice enhancement method based on generation countermeasure network | |
KR100304666B1 (en) | Speech enhancement method | |
CN112151059A (en) | Microphone array-oriented channel attention weighted speech enhancement method | |
CN107452389A (en) | A kind of general monophonic real-time noise-reducing method | |
CN113936681B (en) | Speech enhancement method based on mask mapping and mixed cavity convolution network | |
CN112735456B (en) | Speech enhancement method based on DNN-CLSTM network | |
CN105280193B (en) | Priori signal-to-noise ratio estimation method based on MMSE error criterion | |
CN110867181A (en) | Multi-target speech enhancement method based on SCNN and TCNN joint estimation | |
Mirsamadi et al. | Causal speech enhancement combining data-driven learning and suppression rule estimation. | |
Ju et al. | Tea-pse: Tencent-ethereal-audio-lab personalized speech enhancement system for icassp 2022 dns challenge | |
CN112331224A (en) | Lightweight time domain convolution network voice enhancement method and system | |
CN110728989A (en) | Binaural voice separation method based on long-time and short-time memory network LSTM | |
Shi et al. | Robust speaker recognition based on improved GFCC | |
CN110808057A (en) | Voice enhancement method for generating confrontation network based on constraint naive | |
CN112634927B (en) | Short wave channel voice enhancement method | |
CN115424627A (en) | Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm | |
Yamashita et al. | Improved spectral subtraction utilizing iterative processing | |
CN101533642B (en) | Method for processing voice signal and device | |
CN113763984B (en) | Parameterized noise elimination system for distributed multi-speaker | |
CN115273884A (en) | Multi-stage full-band speech enhancement method based on spectrum compression and neural network | |
CN113411456B (en) | Voice quality assessment method and device based on voice recognition | |
CN113393852B (en) | Method and system for constructing voice enhancement model and method and system for voice enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||