CN116913308A - Single-channel voice enhancement method for balancing noise reduction amount and voice quality - Google Patents

Single-channel voice enhancement method for balancing noise reduction amount and voice quality

Info

Publication number
CN116913308A
CN116913308A (application CN202310707811.4A)
Authority
CN
China
Prior art keywords
signal
posterior
smoothing
noise ratio
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310707811.4A
Other languages
Chinese (zh)
Inventor
汪大涵
卢晶
朱长宝
胡玉祥
程光伟
刘松
朱天一
张哲会
刘磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202310707811.4A priority Critical patent/CN116913308A/en
Publication of CN116913308A publication Critical patent/CN116913308A/en
Pending legal-status Critical Current

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/0316: Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude
    • G10L19/02: Speech or audio signal analysis-synthesis techniques for redundancy reduction, using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0232: Noise filtering with processing in the frequency domain
    • G10L25/18: Speech or voice analysis techniques characterised by the extracted parameters being spectral information of each sub-band
    • G10L25/21: Speech or voice analysis techniques characterised by the extracted parameters being power information
    • G10L25/24: Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum


Abstract

The invention discloses a single-channel speech enhancement method that balances noise reduction and speech quality. The method comprises the following steps: (1) transform the noisy signal to the time-frequency domain and estimate the fundamental frequency with the PEFAC method; (2) compute the posterior signal-to-noise ratio, smooth it in the cepstral domain according to the fundamental frequency estimate, and estimate the posterior speech presence probability with a fixed-prior method; (3) estimate the noise power spectral density from the posterior speech presence probability; (4) update the posterior signal-to-noise ratio and compute the maximum likelihood estimate of the speech power spectral density; (5) smooth the speech power spectral density in the cepstral domain according to the fundamental frequency estimate, with fundamental frequency enhancement, to obtain the prior signal-to-noise ratio estimate; (6) estimate the posterior speech presence probability again with an adaptive-prior method; (7) compute the log-spectral amplitude gain based on the generalized-Gamma χ prior, and combine it with the posterior speech presence probability to derive a gain estimate that accounts for speech presence uncertainty; (8) apply the gain to the speech spectrum and transform back to the time domain to obtain the enhanced signal.

Description

Single-channel voice enhancement method for balancing noise reduction amount and voice quality
Technical Field
The invention belongs to the field of voice enhancement algorithms, and particularly relates to a single-channel voice enhancement method for balancing noise reduction and voice quality.
Background
Speech enhancement is one of the core problems of audio signal processing. Its main purpose is to recover speech from signals contaminated by environmental noise, interfering talkers, reverberation, echo, etc., so as to improve speech quality and intelligibility; it has been widely applied in mobile communication, teleconferencing, Bluetooth headsets, hearing aids, and speech recognition front ends. Single-channel speech enhancement is an important technology, mainly applied in scenarios where only a single microphone collects the signal; it can also serve as a post-processing step for multi-channel techniques.
Single-channel speech enhancement techniques based on traditional signal algorithms have the advantages of low computational complexity, high stability, and strong parameter interpretability, and have developed into a variety of algorithms, such as spectral subtraction, Wiener filtering, statistical-model-based algorithms, wavelet-transform-based algorithms, and subspace-based algorithms. Among these, algorithms based on minimum mean-square error in the discrete Fourier transform domain are widely used owing to their low computational complexity, excellent processing performance, and easy combination with other systems.
Single-channel speech enhancement has no information from the spatial dimension and can therefore only process and enhance the signal based on the diversity of the spectrum and the characteristics of the source. In practical applications, owing to factors such as environmental complexity, noise type and intensity, and speech signal quality, it remains difficult for single-channel speech enhancement to achieve high noise reduction together with high clarity and intelligibility, which is one of the main difficulties in current single-channel speech enhancement research.
The optimally modified log-spectral amplitude (OMLSA) algorithm, which takes speech presence uncertainty (SPU) into account, is an excellent single-channel speech enhancement algorithm (Cohen I, Berdugo B. Speech enhancement for non-stationary noise environments [J]. Signal Processing, 2001, 81(11): 2403-2418.) and, with later improvements, remains one of the algorithms commonly used in the field. However, its noise tracking capability is poor; at low signal-to-noise ratios and under non-stationary noise it damages speech components and leaves clearly audible musical noise residue.
Temporal cepstrum smoothing (TCS) is an estimation method for the prior signal-to-noise ratio that can effectively preserve the harmonic components of speech while reducing estimation variance and suppressing musical noise. Combined with an unbiased minimum mean-square error noise power spectral density estimator, which has strong tracking capability and low computational complexity, it can achieve better speech enhancement performance. However, the TCS method retains many noise components alongside the speech harmonics, and in complex environments it still impairs speech quality to some extent. An important reason for this problem is that the cepstral-domain fundamental frequency estimator it uses is inaccurate both in estimating the fundamental frequency and in deciding voiced frames. Many studies have developed more accurate fundamental frequency estimation methods, such as the PEFAC (Pitch Estimation Filter with Amplitude Compression) method (Gonzalez S, Brookes M. A pitch estimation filter robust to high levels of noise (PEFAC) [C] // 2011 19th European Signal Processing Conference. IEEE, 2011: 451-455.).
Disclosure of Invention
Owing to factors such as environmental complexity, noise type and intensity, and speech signal quality, single-channel speech enhancement has always found it difficult to achieve high noise reduction together with high clarity and intelligibility. Improving system performance and achieving a good balance between noise reduction and speech quality, while keeping computational complexity low, is therefore of practical significance. In view of this, the present invention proposes a single-channel speech enhancement method that balances noise reduction and speech quality.
The invention adopts the technical scheme that:
a single channel speech enhancement method that balances noise reduction and speech quality, the method comprising the steps of:
step 1, transforming a noise-containing signal to a time-frequency domain, and estimating a fundamental frequency by using the PEFAC method;
step 2, calculating a posterior signal-to-noise ratio, smoothing the posterior signal-to-noise ratio in a cepstrum domain according to the fundamental frequency estimated value obtained in the step 1, and further estimating the posterior voice existence probability by using a fixed prior method;
step 3, estimating noise power spectral density according to the posterior speech existence probability obtained in the step 2 by using an unbiased minimum mean square error method;
step 4, calculating a posterior signal-to-noise ratio estimated value according to the noise power spectral density estimated value obtained in the step 3, and calculating the maximum likelihood estimation of the voice power spectral density;
step 5, according to the fundamental frequency estimate obtained in step 1, smoothing the maximum likelihood estimate of the speech power spectral density obtained in step 4 in the cepstral domain while performing cepstral fundamental frequency enhancement, so as to obtain the estimate of the prior signal-to-noise ratio;
step 6, estimating the existence probability of the posterior voice again according to the posterior signal-to-noise ratio estimated value obtained in the step 4 and the prior signal-to-noise ratio estimated value obtained in the step 5 by using a self-adaptive prior method;
step 7, according to the posterior signal-to-noise ratio estimate obtained in step 4 and the prior signal-to-noise ratio estimate obtained in step 5, calculating the log-spectral amplitude gain based on the generalized-Gamma χ prior, and further combining the posterior speech presence probability estimate obtained in step 6 to derive a gain estimate based on speech presence uncertainty;
and 8, enhancing the voice by using the gain estimation value obtained in the step 7, and transforming the enhanced spectrum back to the time domain to obtain an enhanced signal.
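The analysis-synthesis framework of steps 1 and 8 can be sketched as follows: a minimal illustration assuming a 512-point square-root Hann window with 50% overlap (matching the parameter settings of the embodiment); the function names are illustrative and not part of the patent.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Analysis: square-root Hann window, 50% overlap (as in step 1)."""
    win = np.sqrt(np.hanning(n_fft))
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        frames.append(np.fft.rfft(win * x[start:start + n_fft]))
    return np.array(frames)

def istft(X, n_fft=512, hop=256):
    """Synthesis (step 8): weighted overlap-add with the same sqrt-Hann window."""
    win = np.sqrt(np.hanning(n_fft))
    n = (len(X) - 1) * hop + n_fft
    y = np.zeros(n)
    wsum = np.zeros(n)
    for i, spec in enumerate(X):
        s = i * hop
        y[s:s + n_fft] += win * np.fft.irfft(spec, n_fft)
        wsum[s:s + n_fft] += win ** 2
    # Normalize by the summed squared window to undo the analysis weighting.
    y[wsum > 1e-8] /= wsum[wsum > 1e-8]
    return y
```

The per-frame gain of step 8 would be applied to each spectrum between these two calls.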
Compared with the prior art, the method has low computational complexity and good noise suppression capability, can preserve weak speech components and recover damaged harmonics, and achieves a good balance between noise reduction and speech quality.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a graph showing the average objective scoring results of test sets at different signal to noise ratios for the present invention and the comparative method, wherein (a) - (e) are the average scoring results of broadband PESQ, STOI, DNSMOS-OVRL, DNSMOS-SIG, DNSMOS-BAK, respectively.
FIG. 3 is a graph illustrating the speech enhancement of the inventive method and the comparative method, wherein (a)-(d) are respectively the spectrograms of the noisy signal, the clean speech, the enhancement result of the OMLSA algorithm, and the enhancement result of the inventive method.
Detailed Description
The technical scheme of the present invention will be clearly and completely described below with reference to the accompanying drawings and examples.
The invention comprises the following steps:
step 1, transforming a noise-containing signal to a time-frequency domain, and estimating a fundamental frequency by using the PEFAC method;
step 2, calculating a posterior signal-to-noise ratio, smoothing the posterior signal-to-noise ratio in a cepstrum domain according to the fundamental frequency estimation value in the step 1, and estimating the posterior voice existence probability by using a Fixed Prior (FP) method;
step 3, estimating noise power spectral density according to the posterior speech existence probability obtained in the step 2 by using an unbiased minimum mean square error method;
step 4, calculating a posterior signal-to-noise ratio estimated value according to the noise power spectral density estimated value obtained in the step 3, and calculating the maximum likelihood estimation of the voice power spectral density;
step 5, according to the fundamental frequency estimate of step 1, smoothing the maximum likelihood estimate of the speech power spectral density obtained in step 4 in the cepstral domain while performing cepstral fundamental frequency enhancement, so as to obtain an estimate of the prior signal-to-noise ratio;
step 6, estimating the existence probability of the posterior voice again according to the posterior signal-to-noise ratio estimated value obtained in the step 4 and the Prior signal-to-noise ratio estimated value obtained in the step 5 by using a self-Adaptive Prior (AP) method;
step 7, according to the posterior signal-to-noise ratio estimate obtained in step 4 and the prior signal-to-noise ratio estimate obtained in step 5, calculating the log-spectral amplitude gain based on the generalized-Gamma χ prior, and further combining the posterior speech presence probability estimate obtained in step 6 to derive a gain estimate based on speech presence uncertainty;
and 8, enhancing the voice by using the gain estimation value obtained in the step 7, and transforming the enhanced spectrum back to the time domain to obtain an enhanced signal.
Further, the cepstral domain smoothing method in the step 2 specifically includes the following steps:
Let $\hat{\gamma}(k,l)$ denote the estimate of the posterior signal-to-noise ratio, where $k$ and $l$ denote the band index and the frame index, respectively. It is computed from the noise power spectral density estimate of the previous frame $\hat{\sigma}_N^2(k,l-1)$ and the power spectral density of the current-frame noisy signal $Y(k,l)$:
$$\hat{\gamma}(k,l)=\frac{|Y(k,l)|^2}{\hat{\sigma}_N^2(k,l-1)}$$
Transform $\hat{\gamma}(k,l)$ to the cepstral domain, denoted $\gamma_{\mathrm{ceps}}(q,l)$, i.e. take the logarithm and apply the inverse discrete Fourier transform:
$$P_{\mathrm{ceps}}(q,l)=\mathrm{IDFT}\{\log P(k,l)|_{k=0,1,\ldots,N-1}\},\quad q=0,1,\ldots,N-1$$
where $\mathrm{IDFT}\{\cdot\}$ denotes the inverse discrete Fourier transform and $q$ denotes the cepstral (quefrency) index; $N$ denotes the length of the discrete Fourier transform used in step 1. Owing to symmetry, the following operations are carried out only for $q=0,1,\ldots,N/2$.
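The cepstral transform of this step (logarithm followed by an inverse DFT, exploiting spectral symmetry) can be illustrated as follows; the posterior-SNR frame here is purely synthetic toy data.

```python
import numpy as np

# Transform one frame of posterior SNR to the cepstral domain (step 2):
# take the logarithm and apply the inverse DFT.  gamma_hat covers the
# non-redundant bins k = 0..N/2 of an N = 512 point DFT.
rng = np.random.default_rng(1)
N = 512
gamma_hat = rng.chisquare(2, N // 2 + 1) + 1e-12   # toy posterior SNR frame

# Rebuild the full conjugate-symmetric spectrum before the inverse DFT.
full = np.concatenate([gamma_hat, gamma_hat[-2:0:-1]])
gamma_ceps = np.fft.ifft(np.log(full)).real        # cepstral coefficients

# Due to symmetry, only q = 0..N/2 need to be processed further.
gamma_ceps_half = gamma_ceps[: N // 2 + 1]
```

Because the log-spectrum is real and even, the IDFT is real, and a forward DFT recovers the log-spectrum exactly.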
Convert the fundamental frequency estimate $f_0(l)$ obtained in step 1 into a cepstral index $q_0(l)$:
$$q_0(l)=\lfloor f_s/f_0(l)\rfloor$$
where $f_s$ denotes the sampling rate and $\lfloor\cdot\rfloor$ denotes rounding down. Then expand $q_0(l)$ into a fundamental-frequency range of size $2\Delta q_0+1$ centered at $q_0(l)$:
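The conversion from a fundamental frequency estimate to a quefrency index and its expanded range can be illustrated with a small helper (the function name is hypothetical; fs = 16 kHz and Δq0 = 2 follow the parameter settings of the embodiment):

```python
import numpy as np

def f0_to_quefrency_range(f0_hz, fs=16000, dq0=2):
    """q0 = floor(fs / f0), expanded to 2*dq0 + 1 bins centered at q0 (step 2)."""
    q0 = int(np.floor(fs / f0_hz))
    return q0, list(range(q0 - dq0, q0 + dq0 + 1))

q0, qrange = f0_to_quefrency_range(200.0)   # a 200 Hz pitch maps to q0 = 80
```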
where $v(l)$ is the voiced-frame decision given by the fundamental frequency estimator of step 1: $v(l)=1$ indicates that the current frame is voiced and $v(l)=0$ that it is not. The smoothing factor $\alpha_{\mathrm{ceps}}(q,l)$ is then determined accordingly:
where $\beta_{\mathrm{ceps}}$ is used to recursively smooth $\alpha_{\mathrm{ceps}}(q,l)$; $\alpha_{\mathrm{const}}(q)$ is a preset quefrency-dependent smoothing factor, small at low quefrencies and larger elsewhere; and $\alpha_0$ is a small fundamental-frequency smoothing factor. In this way the cepstral coefficients corresponding to the speech fundamental frequency, its harmonics, and the spectral envelope are smoothed only weakly, while noise-dominated cepstral coefficients are smoothed strongly, preserving the speech components as much as possible during smoothing.
Smooth $\gamma_{\mathrm{ceps}}(q,l)$ recursively to obtain $\bar{\gamma}_{\mathrm{ceps}}(q,l)$:
$$\bar{\gamma}_{\mathrm{ceps}}(q,l)=\alpha_{\mathrm{ceps}}(q,l)\,\bar{\gamma}_{\mathrm{ceps}}(q,l-1)+\left(1-\alpha_{\mathrm{ceps}}(q,l)\right)\gamma_{\mathrm{ceps}}(q,l)$$
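A sketch of the quefrency-selective recursive smoothing described above, with illustrative parameter values: the exact piecewise definition of α_const(q) is not reproduced in the text, so the envelope-region width and the factor values used here are assumptions for demonstration only.

```python
import numpy as np

def smooth_cepstrum(gc, gc_prev, qrange, voiced, alpha0=0.2,
                    alpha_lo=0.2, alpha_hi=0.997, q_env=16):
    """Quefrency-selective first-order recursive smoothing (step 2 sketch).

    alpha is small at low quefrencies (spectral envelope) and, for voiced
    frames, within the fundamental-frequency range qrange, so those bins are
    barely smoothed; noise-dominated bins get strong smoothing (alpha_hi).
    """
    alpha = np.full(len(gc), alpha_hi)
    alpha[:q_env] = alpha_lo                     # envelope region (assumed width)
    if voiced:
        alpha[qrange] = alpha0                   # fundamental-frequency range
    return alpha * gc_prev + (1.0 - alpha) * gc  # first-order recursion

gc = np.ones(257)        # toy current-frame cepstrum
gc_prev = np.zeros(257)  # toy previous smoothed cepstrum
out = smooth_cepstrum(gc, gc_prev, [78, 79, 80, 81, 82], voiced=True)
```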
Transform back to the frequency domain to obtain the biased smoothed result $\gamma_b(k,l)$:
$$\gamma_b(k,l)=\exp\!\left(\mathrm{DFT}\{\bar{\gamma}_{\mathrm{ceps}}(q,l)\}\right)$$
where $\mathrm{DFT}\{\cdot\}$ denotes the discrete Fourier transform.
Apply bias compensation to obtain the unbiased smoothed result $\bar{\gamma}(k,l)=B(l)\,\gamma_b(k,l)$, where $B(l)$ is a bias-compensation factor computed on the basis of the $\chi^2$ distribution. Assume that before smoothing $\hat{\gamma}(k,l)$ follows a $\chi^2$ distribution with shape parameter $\mu_\gamma$, and that after smoothing it still follows a $\chi^2$ distribution, with a new shape parameter $\bar{\mu}_\gamma(l)$. For the preset shape parameter $\mu_\gamma$, the cepstral variance $\mathrm{var}\{\gamma_{\mathrm{ceps}}^q\}$ can be computed offline:
where $\zeta(\cdot,\cdot)$ denotes the generalized Riemann (Hurwitz) zeta function; $\kappa_m$ denotes a logarithmic covariance; and $M$ denotes the largest index $m$ for which $\kappa_m\neq 0$. $\kappa_m$ is computed as follows:
where $\Gamma(\cdot)$ denotes the Gamma function and $\psi(\cdot)$ the psi (digamma) function; the value of the correlation coefficient $\rho_m$ depends on the window function used in the short-time Fourier transform of step 1. The smoothed cepstral variance $\mathrm{var}\{\bar{\gamma}_{\mathrm{ceps}}^q\}$ is then obtained approximately from the smoothing factors as
where $\nu_q$ is a weighting coefficient with $\nu_0=\nu_{N/2}=1/2$ and $\nu_q=2$ otherwise. Solving the following equation yields the smoothed shape parameter $\bar{\mu}_\gamma(l)$:
from which the bias-compensation factor is obtained:
where $\psi(\cdot)$ denotes the psi (digamma) function.
Further, the specific step of estimating the voice existence probability by using the fixed prior method in the step 2 is as follows:
Using the cepstral-domain smoothed posterior signal-to-noise ratio $\bar{\gamma}(k,l)$ of step 2 and the shape parameter $\bar{\mu}_\gamma(l)$, compute the generalized likelihood ratio $\Lambda(k,l)$:
where the fixed prior speech presence probability and the fixed prior signal-to-noise ratio appear, and $\mathcal{H}_1$ denotes the speech-present state. The posterior speech presence probability is then computed as
$$P(\mathcal{H}_1|Y(k,l))=\frac{\Lambda(k,l)}{1+\Lambda(k,l)}$$
Recursively smooth the posterior speech presence probability and monitor its mean; if the mean becomes too large, constrain the probability so that it does not remain close to 1 for long periods, which would otherwise stall the noise power spectral density estimation of step 3.
Further, in step 3, the noise power spectral density $\hat{\sigma}_N^2(k,l)$ is estimated with the unbiased minimum mean-square error algorithm as
$$\tilde{\sigma}_N^2(k,l)=P(\mathcal{H}_1|Y(k,l))\,\hat{\sigma}_N^2(k,l-1)+\left(1-P(\mathcal{H}_1|Y(k,l))\right)|Y(k,l)|^2$$
$$\hat{\sigma}_N^2(k,l)=\alpha_N\,\hat{\sigma}_N^2(k,l-1)+(1-\alpha_N)\,\tilde{\sigma}_N^2(k,l)$$
where $\alpha_N$ denotes the noise power spectral density smoothing factor.
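Assuming the standard speech-presence-probability-weighted unbiased MMSE form of this estimator, the noise PSD update of step 3 can be sketched as follows (variable names are illustrative):

```python
import numpy as np

def update_noise_psd(noise_psd_prev, y_pow, p_h1, alpha_n=0.8):
    """SPP-weighted noise periodogram estimate, then recursive smoothing.

    Where speech is likely present (p_h1 near 1) the previous noise estimate
    is kept; where speech is likely absent the noisy periodogram y_pow is
    trusted.  alpha_n = 0.8 follows the embodiment's parameter settings.
    """
    n_tilde = p_h1 * noise_psd_prev + (1.0 - p_h1) * y_pow
    return alpha_n * noise_psd_prev + (1.0 - alpha_n) * n_tilde

psd_absent = update_noise_psd(np.array([1.0]), np.array([3.0]), np.array([0.0]))
psd_present = update_noise_psd(np.array([1.0]), np.array([3.0]), np.array([1.0]))
```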
Further, the cepstral domain smoothing method in the step 5 specifically includes the following steps:
Let $\hat{\lambda}_S(k,l)$ denote the estimate of the speech power spectral density. First update the posterior signal-to-noise ratio estimate using the noise power spectral density estimate of the current frame $\hat{\sigma}_N^2(k,l)$:
$$\hat{\gamma}(k,l)=\frac{|Y(k,l)|^2}{\hat{\sigma}_N^2(k,l)}$$
and compute the maximum likelihood estimate of the speech power spectral density:
$$\hat{\lambda}_S^{\mathrm{ML}}(k,l)=\hat{\sigma}_N^2(k,l)\,\max\{\hat{\gamma}(k,l)-1,\ \xi_{\min}\}$$
where $\xi_{\min}$ denotes the prior signal-to-noise ratio lower limit;
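The posterior SNR update and maximum likelihood speech PSD estimate of step 4 admit a direct sketch (assuming the standard ML form with the ξ_min floor; names are illustrative):

```python
import numpy as np

def speech_psd_ml(y_pow, noise_psd, xi_min_db=-25.0):
    """ML speech PSD: sigma_N^2 * max(gamma - 1, xi_min), step 4 sketch."""
    xi_min = 10.0 ** (xi_min_db / 10.0)       # -25 dB floor, per the settings
    gamma = y_pow / noise_psd                 # updated posterior SNR
    return noise_psd * np.maximum(gamma - 1.0, xi_min)

lam = speech_psd_ml(np.array([4.0, 0.5]), np.array([1.0, 1.0]))
```

The floor keeps the estimate positive in bins where the periodogram falls below the noise estimate.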
Transform $\hat{\lambda}_S^{\mathrm{ML}}(k,l)$ to the cepstral domain, denoted $\lambda_{\mathrm{ceps}}(q,l)$, i.e. take the logarithm and apply the inverse discrete Fourier transform. In the same manner as step 2, determine the smoothing factor $\alpha_{\mathrm{ceps}}(q,l)$ from the fundamental frequency estimate $f_0(l)$, and smooth $\lambda_{\mathrm{ceps}}(q,l)$ to obtain $\bar{\lambda}_{\mathrm{ceps}}(q,l)$.
Fundamental frequency enhancement is performed:
wherein ρ is a fundamental frequency amplification factor, which is a constant greater than 1;
Transform back to the frequency domain to obtain the biased smoothed result $\lambda_b(k,l)$:
$$\lambda_b(k,l)=\exp\!\left(\mathrm{DFT}\{\bar{\lambda}_{\mathrm{ceps}}(q,l)\}\right)$$
where $\mathrm{DFT}\{\cdot\}$ denotes the discrete Fourier transform.
Apply bias compensation to obtain the unbiased smoothed result $\bar{\lambda}_S(k,l)=B\,\lambda_b(k,l)$, where $B$ is a fixed bias-compensation factor:
$$B=\exp(0.5\times 0.5772)$$
From the unbiased smoothed result, the prior signal-to-noise ratio is computed as
$$\hat{\xi}(k,l)=\frac{\bar{\lambda}_S(k,l)}{\hat{\sigma}_N^2(k,l)}$$
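Assuming the multiplicative form of the fixed bias compensation stated in this step, the final prior SNR computation can be sketched as follows (illustrative names; the smoothed speech PSD input would come from the cepstral smoothing described above):

```python
import numpy as np

# Fixed bias compensation B = exp(0.5 * 0.5772); 0.5772 is the
# Euler-Mascheroni constant, per the patent's formula in step 5.
B = np.exp(0.5 * 0.5772)

def prior_snr(smoothed_speech_psd_biased, noise_psd):
    """Prior SNR from the bias-compensated smoothed speech PSD (sketch)."""
    return B * smoothed_speech_psd_biased / noise_psd

xi = prior_snr(np.array([2.0]), np.array([1.0]))
```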
further, in the step 6, the specific step of estimating the posterior speech existence probability by using the adaptive prior method is as follows:
Compute the time-domain recursive average of the prior signal-to-noise ratio obtained in step 5:
$$\bar{\xi}(k,l)=\alpha_S\,\bar{\xi}(k,l-1)+(1-\alpha_S)\,\hat{\xi}(k,l)$$
where $\alpha_S$ is a smoothing factor. Using a small smoothing window $w_L$ of size $2W_L+1$, smooth the recursive average $\bar{\xi}(k,l)$ over the local frequency domain,
and further compute the local prior speech presence probability:
where $P_{\min}$ is the lower probability limit, and the lower and upper limits of the local prior signal-to-noise ratio apply. The global prior speech presence probability is computed in the same way, i.e. with the superscript or subscript $L$ in the formula above replaced by $G$.
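The local frequency-domain smoothing with a small normalized Hann window (W_L = 1, per the parameter settings; the global case uses W_G = 15) can be illustrated as:

```python
import numpy as np

def local_average(xi_bar, w_half=1):
    """Smooth a prior-SNR track over frequency with a normalized Hann
    window of size 2*w_half + 1 (step 6 sketch)."""
    # np.hanning's endpoints are zero, so draw a longer window and strip them.
    win = np.hanning(2 * w_half + 3)[1:-1]
    win = win / win.sum()
    return np.convolve(xi_bar, win, mode="same")

z = local_average(np.array([0.0, 0.0, 4.0, 0.0, 0.0]))
```

The smoothing spreads an isolated peak over its neighbors while preserving the total mass, which is what makes the subsequent local presence decision robust to single-bin outliers.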
Compute the frame-average prior signal-to-noise ratio and its counterpart constrained by upper and lower bounds:
where $K_{\min}$ and $K_{\max}$ constrain the frequency-band range over which the frame-average prior signal-to-noise ratio is computed, and the constrained version is limited by its respective upper and lower bounds. Further compute the frame prior speech presence probability:
where the upper and lower limits of the frame prior signal-to-noise ratio apply, and the variable $p$ is given by:
Compute the adaptive prior speech presence probability:
where the lower limit of the prior speech presence probability is applied. Further, the posterior speech presence probability based on the adaptive method is computed:
Further, in step 7, the log-spectral amplitude gain based on speech presence uncertainty (SPU) is computed as
$$G(k,l)=\left\{G_{\mathcal{H}_1}(k,l)\right\}^{P(\mathcal{H}_1|Y(k,l))}\,G_{\min}^{1-P(\mathcal{H}_1|Y(k,l))}$$
where $G_{\min}$ denotes the lower gain limit and $G_{\mathcal{H}_1}(k,l)$ denotes the log-spectral amplitude gain based on the χ prior with shape parameter $\mu$, the χ prior being a generalized Gamma distribution computed as:
where ${}_1F_1(a,b;x)$ denotes the confluent hypergeometric function.
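The SPU gain combination can be sketched as follows. Note that the speech-present gain in the patent is the χ-prior log-spectral amplitude gain involving the confluent hypergeometric function, which is not reproduced here; a simple Wiener-type gain stands in purely for illustration.

```python
import numpy as np

def spu_gain(xi, p, g_min_db=-18.0):
    """Combine a speech-present gain with the floor G_min via the speech
    presence probability p: G = G_H1**p * G_min**(1 - p) (step 7 sketch)."""
    g_min = 10.0 ** (g_min_db / 20.0)         # -18 dB floor, per the settings
    g_h1 = xi / (1.0 + xi)                    # Wiener stand-in for the chi-prior gain
    return g_h1 ** p * g_min ** (1.0 - p)

g_present = spu_gain(np.array([9.0]), np.array([1.0]))  # speech surely present
g_absent = spu_gain(np.array([9.0]), np.array([0.0]))   # speech surely absent
```

With p = 1 the speech-present gain is applied unchanged; with p = 0 the output is attenuated to the gain floor, which is what limits musical noise in speech-absent bins.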
The advantages of the method proposed by the invention are more readily seen in the experimental evaluation:
1. evaluation setting
The clean speech used for evaluation is from the TIMIT dataset and comprises 128 randomly selected clean utterances, half from male and half from female speakers. The noise used for evaluation includes noise from the NOISEX-92 dataset; pink noise and railway, subway, car, and road noise from the ITU-T P.501 dataset; and modulated Gaussian white noise. The signal-to-noise ratios are set to -10, -5, 0, 5, 10 and 15 dB.
The comparison method is the OMLSA algorithm. The evaluation metrics include the Deep Noise Suppression Mean Opinion Score (DNSMOS), the wideband PESQ score mapped to the Mean Opinion Score-Listening Quality Objective (MOS-LQO), and STOI. DNSMOS simulates subjective MOS scores with a deep neural network and includes the OVRL, SIG, and BAK terms, which represent the overall quality, the speech quality, and the noise reduction effect, respectively.
2. Parameter setting
The sampling rate $f_s$ is 16 kHz; the frame length and the number of discrete Fourier transform points $N$ are 512 points with 50% overlap; and the short-time Fourier transform window function is a square-root Hann window.
The fixed prior speech presence probability is set to 0.5; the fixed prior signal-to-noise ratio is set to 15 dB; the shape parameter $\mu_\gamma$ used for smoothing the posterior signal-to-noise ratio is set to 0.7; the recursive smoothing factor for the posterior speech presence probability in the stagnation-avoidance step is set to 0.9; the noise power spectral density smoothing factor $\alpha_N$ is set to 0.8; the prior signal-to-noise ratio lower limit $\xi_{\min}$ is set to -25 dB; $\Delta q_0$ is set to 2; $\beta_{\mathrm{ceps}}$ is set to 0.9 and 0.96 in step 2 and step 5, respectively; for the square-root Hann window, $\rho_1=0.5$ and $M=1$; the fundamental frequency amplification factor $\rho$ is set to 2.5; among the upper and lower limits used in the adaptive estimation of the posterior speech presence probability, $P_{\min}=0.005$, $K_{\min}=3$, and $K_{\max}=257$; the time-domain recursive averaging factor $\alpha_S$ for the prior signal-to-noise ratio is set to 0.7; both the local and global smoothing windows are normalized Hann windows, with sizes determined by $W_L$ and $W_G$ set to 1 and 15, respectively; the χ prior shape parameter $\mu$ is set to 0.6; the lower gain limit $G_{\min}$ is -18 dB; the fundamental frequency smoothing factor $\alpha_0$ in step 2 and step 5 is set to 0.2 and 0.1, respectively; and the fixed smoothing factors $\alpha_1$, $\alpha_2$, $\alpha_3$ are set to 0.2, 0.4, 0.997 in step 2 and 0.1, 0.6, 0.95 in step 5, respectively.
3. Evaluation result
Figure 2 shows the average objective scores of the method of the present invention and the OMLSA algorithm at different signal-to-noise ratios, with (a)-(e) being the average score results for wideband PESQ, STOI, DNSMOS-OVRL, DNSMOS-SIG, and DNSMOS-BAK, respectively. It can be seen that the present invention has significant advantages.
FIG. 3 shows a spectrogram example in which the signal-to-noise ratio of the noisy signal is 10 dB; (a)-(d) are respectively the spectrograms of the noisy signal, the clean speech, the OMLSA algorithm enhancement result, and the enhancement result of the proposed method. The spectrograms show that the proposed method performs better in both noise reduction and speech recovery.

Claims (4)

1. A single-channel speech enhancement method for balancing noise reduction and speech quality, the method comprising the steps of:
step 1, transforming a noise-containing signal to a time-frequency domain, and estimating a fundamental frequency by using the PEFAC method;
step 2, calculating a posterior signal-to-noise ratio, smoothing the posterior signal-to-noise ratio in a cepstrum domain according to the fundamental frequency estimated value obtained in the step 1, and further estimating the posterior voice existence probability by using a fixed prior method;
step 3, estimating noise power spectral density according to the posterior speech existence probability obtained in the step 2 by using an unbiased minimum mean square error method;
step 4, calculating a posterior signal-to-noise ratio estimated value according to the noise power spectral density estimated value obtained in the step 3, and calculating the maximum likelihood estimation of the voice power spectral density;
step 5, according to the fundamental frequency estimate obtained in step 1, smoothing the maximum likelihood estimate of the speech power spectral density obtained in step 4 in the cepstral domain while performing cepstral fundamental frequency enhancement, so as to obtain the estimate of the prior signal-to-noise ratio;
step 6, estimating the existence probability of the posterior voice again according to the posterior signal-to-noise ratio estimated value obtained in the step 4 and the prior signal-to-noise ratio estimated value obtained in the step 5 by using a self-adaptive prior method;
step 7, according to the posterior signal-to-noise ratio estimate obtained in step 4 and the prior signal-to-noise ratio estimate obtained in step 5, calculating the log-spectral amplitude gain based on the generalized-Gamma χ prior, and further combining the posterior speech presence probability estimate obtained in step 6 to derive a gain estimate based on speech presence uncertainty;
and 8, enhancing the voice by using the gain estimation value obtained in the step 7, and transforming the enhanced spectrum back to the time domain to obtain an enhanced signal.
2. The single-channel speech enhancement method for balancing noise reduction and speech quality according to claim 1, wherein the cepstrum domain smoothing method in step 2 is specifically as follows:
is provided withAn estimated value representing a posterior signal-to-noise ratio, wherein k and l represent a band index and a frame index, respectively; />Using the noise power spectral density estimate of the previous frame +.>Power spectral density calculation with the current frame noisy signal Y (k, l):
$\hat\gamma(k,l)$ is transformed to the cepstral domain, denoted $\gamma_{\mathrm{ceps}}(q,l)$, by taking the logarithm and performing the inverse discrete Fourier transform:

$$\gamma_{\mathrm{ceps}}(q,l)=\mathrm{IDFT}\{\ln\hat\gamma(k,l)\}$$

where IDFT{·} denotes the inverse discrete Fourier transform and q denotes the quefrency (cepstral frequency) index; N denotes the length of the discrete Fourier transform used in step 1. Owing to symmetry, the following operations are carried out only for $q\in[0,N/2]$.
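The log-then-IDFT step can be checked numerically; assuming a real, symmetric half-band log spectrum, NumPy's `irfft`/`rfft` pair implements the inverse and forward transforms (a minimal sketch, with a toy N = 8 spectrum):

```python
import numpy as np

N = 8
# Toy posterior-SNR values on the half band k = 0..N/2 (real, positive).
gamma_half = np.array([1.0, 2.0, 4.0, 2.0, 1.0])

# ln, then IDFT under Hermitian (here: even) symmetry -> real cepstrum of length N.
gamma_ceps = np.fft.irfft(np.log(gamma_half), n=N)

# Transforming back and exponentiating recovers the original half-band spectrum.
recon = np.exp(np.fft.rfft(gamma_ceps).real)
```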
The fundamental frequency estimate $f_0(l)$ obtained in step 1 is converted into a quefrency $q_0(l)$:

$$q_0(l)=\lfloor f_s/f_0(l)\rfloor$$

where $f_s$ denotes the sampling rate and $\lfloor\cdot\rfloor$ denotes rounding down. The fundamental-frequency region is then expanded to a range of size $2\Delta q_0+1$ centred on $q_0(l)$:

$$\mathcal{Q}_0(l)=\{q_0(l)-\Delta q_0,\,\dots,\,q_0(l)+\Delta q_0\}$$

where v(l) is the voiced-frame decision, v(l)=1 indicating that the current frame is voiced and v(l)=0 that it is not. The smoothing factor $\alpha_{\mathrm{ceps}}(q,l)$ is then determined:

$$\tilde\alpha_{\mathrm{ceps}}(q,l)=\begin{cases}\alpha_0, & v(l)=1\ \text{and}\ q\in\mathcal{Q}_0(l)\\[2pt] \alpha_{\mathrm{const}}(q), & \text{otherwise}\end{cases}$$

$$\alpha_{\mathrm{ceps}}(q,l)=\beta_{\mathrm{ceps}}\,\alpha_{\mathrm{ceps}}(q,l-1)+(1-\beta_{\mathrm{ceps}})\,\tilde\alpha_{\mathrm{ceps}}(q,l)$$

where $\beta_{\mathrm{ceps}}$ is used to smooth $\alpha_{\mathrm{ceps}}(q,l)$ over time; $\alpha_{\mathrm{const}}(q)$ is a preset quefrency-dependent smoothing factor, taking smaller values at low quefrencies and larger values elsewhere; $\alpha_0$ is a smaller fundamental-frequency smoothing factor.
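A minimal sketch of the pitch-adaptive smoothing-factor construction described above; the constants (α_0, β, and the α_const profile) are hypothetical design values, not taken from the patent:

```python
import numpy as np

def smoothing_factor(q0, dq0, voiced, alpha_prev, alpha_const,
                     alpha0=0.2, beta=0.96):
    """Quefrency-dependent smoothing factor with a pitch-adaptive region."""
    target = alpha_const.copy()
    if voiced:
        # Weak smoothing (alpha0) in the 2*dq0+1 bins around the pitch quefrency,
        # so the pitch structure is not smeared away.
        target[q0 - dq0 : q0 + dq0 + 1] = alpha0
    # The factor itself is recursively smoothed over frames with beta.
    return beta * alpha_prev + (1.0 - beta) * target

nq = 65
alpha_const = np.full(nq, 0.9)     # strong smoothing by default
alpha_const[:8] = 0.5              # smaller at low quefrencies (spectral envelope)
alpha = smoothing_factor(q0=40, dq0=2, voiced=True,
                         alpha_prev=alpha_const.copy(),
                         alpha_const=alpha_const)
```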
$\gamma_{\mathrm{ceps}}(q,l)$ is smoothed to give $\bar\gamma_{\mathrm{ceps}}(q,l)$:

$$\bar\gamma_{\mathrm{ceps}}(q,l)=\alpha_{\mathrm{ceps}}(q,l)\,\bar\gamma_{\mathrm{ceps}}(q,l-1)+\big(1-\alpha_{\mathrm{ceps}}(q,l)\big)\,\gamma_{\mathrm{ceps}}(q,l)$$
The result is transformed back to the frequency domain, yielding the biased smoothed result $\gamma_b(k,l)$:

$$\gamma_b(k,l)=\exp\!\big(\mathrm{DFT}\{\bar\gamma_{\mathrm{ceps}}(q,l)\}\big)$$

where DFT{·} denotes the discrete Fourier transform.
Bias compensation is performed to obtain the unbiased smoothed result $\hat{\bar\gamma}(k,l)$:

$$\hat{\bar\gamma}(k,l)=B(l)\,\gamma_b(k,l)$$

where B(l) is a bias compensation factor computed on the basis of the χ² distribution. It is assumed that before smoothing $\hat\gamma(k,l)$ follows a χ² distribution with shape parameter $\mu_\gamma$, and that after smoothing it still follows a χ² distribution, with shape parameter $\bar\mu_\gamma(l)$. For a preset shape parameter $\mu_\gamma$, the cepstral variance $\mathrm{var}_q\{\gamma_{\mathrm{ceps}}\}$ can be computed offline from $\zeta(2,\mu_\gamma)$ and the logarithmic covariances $\kappa_m$:
Wherein ζ (·, ·) represents the Riemannzta function; kappa (kappa) m Representing a logarithmic covariance; m represents kappa which is not 0 m A corresponding maximum subscript; kappa (kappa) m The calculation formula of (2) is as follows:
wherein, · (·) represents the gamma function; psi (·) represents a psi function; correlation coefficient ρ m The value of (2) depends on the window function used in the short-time Fourier transform in step 1; and further obtaining the smoothed cepstrum variance according to the smoothing factorApproximately as
wherein ,νq Representing the weighting coefficient, v 0N/2 =1/2, the remainder being 2; solving the following equation to obtain smoothed shape parameters
Thereby obtaining a deviation compensation factor:
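The offline statistics rest on the identity that, for a χ²-type (Gamma-shape) variable, the variance of its logarithm equals the Hurwitz zeta value ζ(2, μ), i.e. the trigamma function ψ'(μ). This can be verified numerically (a sketch using SciPy; the Monte-Carlo sample size is arbitrary):

```python
import numpy as np
from scipy.special import zeta, polygamma

# Hurwitz zeta at s = 2 equals the trigamma function: zeta(2, mu) = psi'(mu).
mu = 1.0
var_log = zeta(2, mu)   # variance of ln X for a Gamma variable with shape mu = 1

# Monte-Carlo check of var{ln X} for X ~ Gamma(shape=mu) (an Exp(1) variable).
rng = np.random.default_rng(1)
x = rng.gamma(shape=mu, scale=1.0, size=200_000)
var_mc = np.var(np.log(x))
```

For μ = 1 this value is π²/6, the familiar log-variance of an exponentially distributed periodogram bin.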
3. The single-channel speech enhancement method for balancing noise reduction and speech quality according to claim 1, wherein the cepstral-domain smoothing method in step 5 is specifically as follows:
Let $\hat\sigma_X^2(k,l)$ denote the speech power spectral density estimate, where k and l denote the band index and the frame index, respectively. First, the posterior signal-to-noise ratio estimate is updated using the current-frame noise power spectral density estimate $\hat\sigma_N^2(k,l)$:

$$\hat\gamma(k,l)=\frac{|Y(k,l)|^{2}}{\hat\sigma_N^{2}(k,l)}$$

and the maximum likelihood estimate of the speech power spectral density is calculated:

$$\hat\sigma_{X,\mathrm{ML}}^{2}(k,l)=\max\big(\hat\gamma(k,l)-1,\ \xi_{\min}\big)\,\hat\sigma_N^{2}(k,l)$$

where $\xi_{\min}$ denotes the lower limit of the a priori signal-to-noise ratio.
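A sketch of the maximum-likelihood speech PSD estimate with the ξ_min floor (the floor value used here, -25 dB, is a hypothetical choice; the patent only requires a lower limit):

```python
import numpy as np

def speech_psd_ml(gamma, noise_psd, xi_min=10 ** (-25 / 10)):
    """ML speech PSD: max(gamma - 1, xi_min) * noise PSD."""
    return np.maximum(gamma - 1.0, xi_min) * noise_psd

gamma = np.array([0.5, 3.0])    # posterior SNR per bin (first bin below 1)
noise = np.array([2.0, 2.0])    # noise PSD per bin
sx = speech_psd_ml(gamma, noise)
```

The floor keeps the estimate strictly positive in noise-only bins, where gamma - 1 would go negative.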
$\hat\sigma_{X,\mathrm{ML}}^{2}(k,l)$ is transformed to the cepstral domain, denoted $\lambda_{\mathrm{ceps}}(q,l)$, i.e. the logarithm is taken and the inverse discrete Fourier transform is performed:

$$\lambda_{\mathrm{ceps}}(q,l)=\mathrm{IDFT}\{\ln\hat\sigma_{X,\mathrm{ML}}^{2}(k,l)\}$$

In the same manner as in step 2, the smoothing factor $\alpha_{\mathrm{ceps}}(q,l)$ is determined from the fundamental frequency estimate $f_0(l)$, and $\lambda_{\mathrm{ceps}}(q,l)$ is smoothed to obtain $\bar\lambda_{\mathrm{ceps}}(q,l)$:

$$\bar\lambda_{\mathrm{ceps}}(q,l)=\alpha_{\mathrm{ceps}}(q,l)\,\bar\lambda_{\mathrm{ceps}}(q,l-1)+\big(1-\alpha_{\mathrm{ceps}}(q,l)\big)\,\lambda_{\mathrm{ceps}}(q,l)$$
Fundamental-frequency enhancement is then performed: for voiced frames, the cepstral coefficients in the fundamental-frequency region are amplified,

$$\bar\lambda_{\mathrm{ceps}}(q,l)\leftarrow\rho\,\bar\lambda_{\mathrm{ceps}}(q,l),\qquad q\in[q_0(l)-\Delta q_0,\ q_0(l)+\Delta q_0],\ v(l)=1$$

where ρ is a fundamental-frequency amplification factor, a constant greater than 1.
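The fundamental-frequency enhancement amounts to scaling the cepstral bins around the pitch quefrency; a sketch (ρ and the region half-width are illustrative constants):

```python
import numpy as np

def enhance_pitch(ceps, q0, dq0, rho=1.2, voiced=True):
    """Amplify the cepstral bins around the pitch quefrency by rho > 1."""
    out = ceps.copy()
    if voiced:
        out[q0 - dq0 : q0 + dq0 + 1] *= rho
    return out

ceps = np.zeros(65)
ceps[40] = 1.0                           # a lone pitch peak at quefrency 40
boosted = enhance_pitch(ceps, q0=40, dq0=2)
```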
The result is transformed back to the frequency domain to obtain the biased smoothed result $\sigma_b^{2}(k,l)$:

$$\sigma_b^{2}(k,l)=\exp\!\big(\mathrm{DFT}\{\bar\lambda_{\mathrm{ceps}}(q,l)\}\big)$$

where DFT{·} denotes the discrete Fourier transform.
Bias compensation is performed to obtain the unbiased smoothed result $\hat\sigma_X^{2}(k,l)$:

$$\hat\sigma_X^{2}(k,l)=B\,\sigma_b^{2}(k,l)$$

where B is a fixed bias compensation factor:

$$B=\exp(0.5\times 0.5772)$$
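The constant 0.5772... in the fixed factor is the Euler-Mascheroni constant, so B can be evaluated directly:

```python
import numpy as np

# Fixed bias compensation factor from the claim; np.euler_gamma = 0.5772156649...
B = np.exp(0.5 * np.euler_gamma)   # approximately 1.3346
```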
4. The single-channel speech enhancement method for balancing noise reduction and speech quality according to claim 1, wherein in step 7 the log-spectral amplitude gain under speech presence uncertainty, $G(k,l)$, is calculated as:

$$G(k,l)=\big(G_{\chi}(k,l)\big)^{\hat P(H_1\mid Y(k,l))}\,G_{\min}^{\,1-\hat P(H_1\mid Y(k,l))}$$

where $\hat P(H_1\mid Y(k,l))$ denotes the posterior speech presence probability estimate obtained in step 6, with $H_1$ denoting the state in which speech is present; $G_{\min}$ denotes the lower gain limit; and $G_{\chi}(k,l)$ denotes the log-spectral amplitude gain based on a χ prior with shape parameter μ, whose closed-form expression involves the confluent hypergeometric function $_1F_1(a,b;x)$ together with $\hat\gamma(k,l)$ and $\hat\xi(k,l)$, the posterior signal-to-noise ratio estimate obtained in step 4 and the a priori signal-to-noise ratio estimate obtained in step 5, respectively.
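A sketch of the speech-presence-uncertainty weighting. Since the χ-prior gain's closed form is not reproduced above, the classical Gaussian-prior log-spectral amplitude gain (Ephraim-Malah) is substituted as a stand-in, and the -15 dB gain floor is a hypothetical choice:

```python
import numpy as np
from scipy.special import exp1

def lsa_gain(xi, gamma):
    """Ephraim-Malah LSA gain (Gaussian prior), a stand-in for the chi-prior gain."""
    v = xi * gamma / (1.0 + xi)
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(v))

def gain_with_spp(xi, gamma, p_h1, g_min=10 ** (-15 / 10)):
    """Exponentially weight the gain by the speech presence probability p_h1."""
    g = lsa_gain(xi, gamma)
    return g ** p_h1 * g_min ** (1.0 - p_h1)

g_speech = gain_with_spp(xi=10.0, gamma=12.0, p_h1=1.0)  # confident speech bin
g_noise = gain_with_spp(xi=10.0, gamma=12.0, p_h1=0.0)   # confident noise bin
```

With p_h1 = 0 the gain collapses to the floor G_min, which is what limits the noise-only attenuation and thus trades noise reduction against speech quality.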
CN202310707811.4A 2023-06-15 2023-06-15 Single-channel voice enhancement method for balancing noise reduction amount and voice quality Pending CN116913308A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310707811.4A CN116913308A (en) 2023-06-15 2023-06-15 Single-channel voice enhancement method for balancing noise reduction amount and voice quality


Publications (1)

Publication Number Publication Date
CN116913308A true CN116913308A (en) 2023-10-20

Family

ID=88352011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310707811.4A Pending CN116913308A (en) 2023-06-15 2023-06-15 Single-channel voice enhancement method for balancing noise reduction amount and voice quality

Country Status (1)

Country Link
CN (1) CN116913308A (en)

Similar Documents

Publication Publication Date Title
CN108831499B (en) Speech enhancement method using speech existence probability
CN111899752B (en) Noise suppression method and device for rapidly calculating voice existence probability, storage medium and terminal
Lin et al. Adaptive noise estimation algorithm for speech enhancement
CN110739005B (en) Real-time voice enhancement method for transient noise suppression
CN102982801B (en) Phonetic feature extracting method for robust voice recognition
CN108735225A (en) It is a kind of based on human ear masking effect and Bayesian Estimation improvement spectrum subtract method
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
CN105280193B (en) Priori signal-to-noise ratio estimation method based on MMSE error criterion
CN111091833A (en) Endpoint detection method for reducing noise influence
CN107045874B (en) Non-linear voice enhancement method based on correlation
CN111933169B (en) Voice noise reduction method for secondarily utilizing voice existence probability
CN116913308A (en) Single-channel voice enhancement method for balancing noise reduction amount and voice quality
EP1635331A1 (en) Method for estimating a signal to noise ratio
Surendran et al. Perceptual subspace speech enhancement with variance normalization
Naik et al. Modified magnitude spectral subtraction methods for speech enhancement
Sunitha et al. Speech enhancement based on wavelet thresholding the multitaper spectrum combined with noise estimation algorithm
Shen et al. A priori SNR estimator based on a convex combination of two DD approaches for speech enhancement
Zavarehei et al. Speech enhancement in temporal DFT trajectories using Kalman filters.
Surendran et al. Perceptual subspace speech enhancement with ssdr normalization
Deepa et al. Spectral Subtraction Method of Speech Enhancement using Adaptive Estimation of Noise with PDE method as a preprocessing technique
CN117711419B (en) Intelligent data cleaning method for data center
Gao et al. DNN-based speech separation with joint improved distortion constraints
Liu et al. MTF-based kalman filtering with linear prediction for power envelope restoration in noisy reverberant environments
Gouhar et al. Speech enhancement using new iterative minimum statistics approach
Wang et al. A Dual-microphone Sub-band Post-filter Using Simplified TBRR for Speech Enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination