WO2018086444A1

WO2018086444A1 - Method for estimating signal-to-noise ratio for noise suppression, and user terminal

Info

Publication number: WO2018086444A1
Application number: PCT/CN2017/106502
Authority: WO
Inventors: 谢单辉
Original assignee: 电信科学技术研究院
Priority date: 2016-11-10
Filing date: 2017-10-17
Publication date: 2018-05-17
Also published as: CN108074582B; CN108074582A

Abstract

A method for estimating a signal-to-noise ratio for noise suppression, and a user terminal. The method may comprise: estimating a pre-estimated a priori signal-to-noise ratio of a current audio frame (101); computing, according to the pre-estimated a priori signal-to-noise ratio, an estimated value of a minimum mean square error (MMSE) corresponding to the pre-estimated a priori signal-to-noise ratio of the current audio frame (102); computing a speech presence probability of the current audio frame (103); and estimating a final a priori signal-to-noise ratio of the current audio frame with reference to the speech presence probability and the estimated value (104).

Description

Noise suppression signal to noise ratio estimation method and user terminal

Cross-reference to related applications

The present application claims priority to Chinese Patent Application No. 201611039463.4, filed on November 10, 2016.

Technical field

The present disclosure relates to the field of voice technologies, and in particular, to a noise suppression signal to noise ratio estimation method and a user terminal.

Background technique

At present, a single microphone noise reduction method is generally used in a user terminal to perform noise reduction on an audio signal. The method mainly includes the following steps:

Decomposing the noisy speech into a frequency domain signal Y by using a fast Fourier transform (FFT) or another transform method;

Estimating the noise variance of the frequency domain signal Y;

Estimating the a priori signal to noise ratio and the a posteriori signal to noise ratio based on the noise variance described above;

Calculating a suitable gain based on the a priori signal to noise ratio and the a posteriori signal to noise ratio;

Multiplying each frequency bin of the frequency domain signal Y by the above gain to obtain a noise-reduced frequency domain signal;

The noise-reduced frequency domain signal is transformed into a time domain signal by Inverse Fast Fourier Transform (IFFT).
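The per-bin part of the steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the FFT and IFFT steps are omitted, the function name `suppress_frame` is hypothetical, the a priori SNR uses a simple maximum likelihood stand-in (γ - 1, floored at 0), and the Wiener gain ξ/(1+ξ) is an assumed choice for the "suitable gain".

```python
def suppress_frame(spectrum, noise_var):
    """Apply a per-bin suppression gain to one frame of the spectrum Y.

    spectrum  : list of complex values, one per frequency bin
    noise_var : list of estimated noise variances, one per bin
    """
    out = []
    for y, lam in zip(spectrum, noise_var):
        gamma = abs(y) ** 2 / lam      # a posteriori SNR
        xi = max(gamma - 1.0, 0.0)     # a priori SNR (ML stand-in, assumption)
        gain = xi / (1.0 + xi)         # Wiener gain (assumed gain rule)
        out.append(gain * y)           # noise-reduced bin
    return out


clean = suppress_frame([2 + 0j, 0.5 + 0j], [1.0, 1.0])
```

A bin well above the noise floor is only mildly attenuated, while a bin at or below the noise floor is suppressed entirely.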

However, in the above technique, the a priori signal-to-noise ratio is estimated using the direct decision (decision-directed) method, that is, estimated by the following formula:

$$\hat{\xi}(m) = \alpha \frac{|\hat{X}(m-1)|^2}{\lambda_d} + (1-\alpha)\,\max\big(\hat{\gamma}(m) - 1,\ 0\big)$$

where $\hat{\xi}(m)$ is the estimate of the a priori signal-to-noise ratio of the current frame; α is a smoothing number that usually needs to be close to 1, specifically 0.95 to 1; $\hat{X}(m-1)$ indicates the noise reduction processing result of the previous frame; $\lambda_d$ indicates the noise variance; and $\hat{\gamma}(m)$ represents the a posteriori signal to noise ratio estimate for the current frame.

It can be seen from the above formula that, because α is close to 1, the estimated value of the a priori SNR is heavily biased towards the noise reduction processing result of the previous frame, $|\hat{X}(m-1)|^2$, which can be seen as an instantaneous value of the speech variance of the previous frame. Therefore, the a priori signal-to-noise ratio estimated by the above formula is not an estimate of the a priori signal-to-noise ratio ξ(m) of the current frame, but can be regarded as an estimate of the a priori signal-to-noise ratio ξ(m-1) of the previous frame. It can be seen that the a priori signal to noise ratio currently estimated for the current audio frame correlates poorly with the current audio frame, which is not conducive to noise suppression of the current audio frame.
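The one-frame lag described above can be illustrated with a short sketch. It assumes the usual decision-directed form (smoothed previous-frame term plus a floored γ - 1 term); the function name `decision_directed` and the numeric values are hypothetical.

```python
def decision_directed(prev_clean_power, noise_var, gamma, alpha=0.98):
    """Decision-directed a priori SNR estimate for one frequency bin."""
    previous_term = prev_clean_power / noise_var   # previous frame's result
    current_term = max(gamma - 1.0, 0.0)           # current frame's evidence
    return alpha * previous_term + (1.0 - alpha) * current_term


# Speech onset: the previous frame was pure noise (clean power close to 0),
# while the current frame has an a posteriori SNR of 101.
xi_onset = decision_directed(prev_clean_power=0.0, noise_var=1.0, gamma=101.0)
```

With α = 0.98 the estimate at the onset is about 2, although the current frame alone suggests an SNR near 100, which is the poor correlation with the current frame that the text describes.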

Summary of the invention

The purpose of the present disclosure is to provide a noise suppression signal to noise ratio estimation method and a user terminal, which solve the problem that the estimated a priori signal to noise ratio of the current audio frame correlates poorly with the current audio frame, which is disadvantageous to noise suppression of the current audio frame.

In order to achieve the above object, an embodiment of the present disclosure provides a method for estimating a priori signal to noise ratio, including:

Estimating the estimated a priori signal to noise ratio of the current audio frame;

Calculating an estimated value of a minimum mean square error (MMSE) corresponding to the estimated a priori signal to noise ratio of the current audio frame according to the estimated a priori signal to noise ratio;

Calculating a voice existence probability of the current audio frame;

A final a priori signal to noise ratio of the current audio frame is estimated in conjunction with the speech presence probability and the estimate.

Optionally, the estimating an a priori signal to noise ratio of the current audio frame includes:

Estimating an a priori signal to noise ratio of the current audio frame based on the a posteriori signal to noise ratio estimate of the current audio frame.

Optionally, the estimating an a priori signal to noise ratio of the current audio frame based on the a posteriori signal to noise ratio estimation value of the current audio frame, including:

The estimated a priori SNR of the current audio frame is estimated by the following formula:

$$\hat{\xi}(m) = \alpha \frac{|\hat{X}(m-1)|^2}{\lambda_d} + (1-\alpha)\,\max\big(\hat{\gamma}(m) - 1,\ 0\big)$$

where $\hat{\xi}(m)$ represents the estimated a priori signal to noise ratio, α is a smoothing number, $\hat{X}(m-1)$ indicates the noise reduction processing result of the previous frame, $\lambda_d$ indicates the noise variance, and $\hat{\gamma}(m)$ represents the a posteriori signal to noise ratio estimate of the current audio frame;

Or,

where the left-hand side represents the estimated a priori signal to noise ratio, α is a smoothing number, and ξ(m-1) is the a priori signal to noise ratio of the previous frame.

Optionally, the method further includes:

The smoothing number required to estimate the estimated a priori signal to noise ratio is adjusted by the following formula:

where a₁ and a₂ are two preset smoothing numbers with a₁ > a₂, and γ_th and ξ_th are two empirical thresholds.

Optionally, the step of estimating an estimated a priori signal to noise ratio of the current audio frame based on the estimated probability of existence of the voice, further comprising:

The estimated a priori signal to noise ratio of the current audio frame is further estimated by the following formula:

or

where the left-hand side represents the estimated a priori signal to noise ratio; the two quantities being combined are, respectively, the a priori SNR estimate of the current audio frame obtained with the smoothing number a₁ and the a priori SNR estimate of the current audio frame obtained with the smoothing number a₂; p(H₁|Y) represents the voice existence probability; and p_th is a preset threshold.

Optionally, the calculating, according to the estimated a priori signal to noise ratio, an estimated value of a minimum mean square error corresponding to the estimated a priori signal to noise ratio of the current audio frame, including:

Calculating an estimated value of the minimum mean square error corresponding to the estimated a priori signal to noise ratio of the current audio frame according to the estimated a priori signal to noise ratio:

where the left-hand side is the estimated value of the minimum mean square error corresponding to the estimated a priori signal to noise ratio, and it is calculated from the estimated a priori signal to noise ratio and the a posteriori signal to noise ratio estimate of the current audio frame.

Optionally, the calculating a voice existence probability of the current audio frame includes:

Calculating the probability of existence of the current audio frame by the following formula:

or

where p(H₁|Y) represents the voice existence probability; p(H₁) and p(H₀) respectively represent the a priori probability that speech is present and the a priori probability that speech is absent; the a priori signal to noise ratio used in the formula is a fixed value; the a posteriori signal to noise ratio estimate is that of the current audio frame; exp() is the exponential function; γ_min and γ_max are two empirical values with γ_min < γ_max; and p_min and p_max are two empirical values with p_min < p_max.

Optionally, the estimating the final a priori signal to noise ratio of the current audio frame by combining the voice existence probability and the estimated value, including:

The final a priori signal to noise ratio of the current audio frame is estimated by the following formula:

where the left-hand side represents the final a priori signal to noise ratio of the current audio frame, which is obtained from the estimated value of the minimum mean square error of the estimated a priori signal to noise ratio; p(H₁|Y) represents the voice existence probability; and ξ_min is a fixed small value.

The embodiment of the present disclosure further provides a user terminal, including:

a first estimating module, configured to estimate an estimated a priori signal to noise ratio of the current audio frame;

a first calculating module, configured to calculate an estimated value of the MMSE corresponding to the estimated a priori signal to noise ratio of the current audio frame according to the estimated a priori signal to noise ratio;

a second calculating module, configured to calculate a voice existence probability of the current audio frame;

And a second estimating module, configured to estimate a final a priori signal to noise ratio of the current audio frame in combination with the voice presence probability and the estimated value.

Optionally, the first estimation module is configured to estimate an estimated a priori signal to noise ratio of the current audio frame based on the a posteriori signal to noise ratio estimation value of the current audio frame.

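Optionally, the first estimation module is configured to estimate an estimated a priori signal to noise ratio of the current audio frame by using the following formula:

$$\hat{\xi}(m) = \alpha \frac{|\hat{X}(m-1)|^2}{\lambda_d} + (1-\alpha)\,\max\big(\hat{\gamma}(m) - 1,\ 0\big)$$

where $\hat{\xi}(m)$ is the estimated a priori signal to noise ratio, α is a smoothing number, $\hat{X}(m-1)$ indicates the noise reduction processing result of the previous frame, $\lambda_d$ indicates the noise variance, and $\hat{\gamma}(m)$ is the a posteriori signal to noise ratio estimate of the current audio frame;

or,

The first estimation module is configured to estimate an estimated a priori signal to noise ratio of the current audio frame by using the following formula:

where ξ(m-1) is the a priori signal to noise ratio of the previous frame.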

Optionally, the user terminal further includes:

An adjustment module for adjusting a smoothing number required to estimate the estimated a priori signal to noise ratio by the following formula:

Optionally, the first estimation module is further configured to further estimate an estimated a priori signal to noise ratio of the current audio frame by using the following formula:

or

among them,

Representing the estimated a priori signal to noise ratio,

Optionally, the first calculating module is configured to calculate, according to the estimated a priori signal to noise ratio, an estimated value of the MMSE corresponding to the estimated a priori signal to noise ratio of the current audio frame by using:

among them,

Representing the estimated a priori signal to noise ratio,

Optionally, the second calculating module is configured to calculate a voice existence probability of the current audio frame by using the following formula:

or

For a fixed value,

Optionally, the second estimation module is configured to estimate a final a priori signal to noise ratio of the current audio frame by using the following formula:

among them,

The embodiment of the present disclosure further provides a user terminal, including: a processor, a memory, and a transceiver, where:

The processor is configured to read a program in the memory and perform the following process:

Estimating an estimated a priori signal to noise ratio of the current audio frame;

Calculating an estimated value of a minimum mean square error corresponding to the estimated a priori signal to noise ratio of the current audio frame according to the estimated a priori signal to noise ratio;

Calculating a voice existence probability of the current audio frame;

Estimating a final a priori signal to noise ratio of the current audio frame in conjunction with the speech presence probability and the estimated value;

The transceiver is configured to receive and transmit data, and the memory is capable of storing data used by the processor when performing operations.

The above technical solution of the present disclosure has at least the following beneficial effects:

In an embodiment of the present disclosure, an estimated a priori signal to noise ratio of the current audio frame is estimated; an estimated value of the MMSE corresponding to the estimated a priori signal to noise ratio of the current audio frame is calculated according to the estimated a priori signal to noise ratio; a speech presence probability of the current audio frame is calculated; and a final a priori signal to noise ratio of the current audio frame is estimated in conjunction with the speech presence probability and the estimated value. Because the final a priori signal-to-noise ratio is estimated by combining the speech presence probability of the current frame with the estimated value of the MMSE corresponding to the estimated a priori SNR of the current audio frame, rather than from the signal to noise ratio of the previous frame as in the prior art, the a priori signal to noise ratio estimated by the embodiments of the present disclosure is more strongly correlated with the current audio frame, thereby facilitating noise suppression of the current audio frame.

DRAWINGS

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments are briefly described below. The drawings in the following description show only some embodiments described in the present application, and those of ordinary skill in the art may obtain other drawings from them without inventive effort. The following figures are not necessarily drawn to scale; the emphasis is on showing the subject matter of the present application.

FIG. 1 is a schematic flowchart diagram of a noise suppression signal to noise ratio estimation method according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of another noise suppression signal to noise ratio estimation method according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of experimental data of a noise suppression signal to noise ratio estimation method according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of another experimental data of a noise suppression signal to noise ratio estimation method according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of another experimental data of a noise suppression signal to noise ratio estimation method according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a user terminal according to an embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of another user terminal according to an embodiment of the present disclosure;

FIG. 8 is a schematic structural diagram of another user terminal according to an embodiment of the present disclosure.

Detailed description

The technical solutions in the embodiments of the present disclosure will be clearly and completely described in conjunction with the drawings in the embodiments of the present disclosure. The described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without inventive effort fall within the scope of the disclosure.

Referring to FIG. 1, an embodiment of the present disclosure provides a noise suppression signal to noise ratio estimation method, including the following steps:

101. Estimate an estimated a priori signal to noise ratio of the current audio frame.

102. Calculate, according to the estimated a priori signal to noise ratio, an estimated value of the MMSE corresponding to the estimated a priori signal to noise ratio of the current audio frame.

103. Calculate a voice existence probability of the current audio frame.

104. Estimate a final a priori signal to noise ratio of the current audio frame in conjunction with the speech presence probability and the estimated value.

In the embodiment of the present disclosure, the current audio frame may be a current frame collected by a microphone of the user terminal, and the current frame may be a voice frame or a noise frame.

In addition, the above-mentioned estimated a priori signal-to-noise ratio may be an a priori signal-to-noise ratio estimated by a direct decision method or a maximum likelihood method. The estimated value of the MMSE corresponding to the estimated a priori SNR may be obtained by applying an MMSE algorithm to the estimated a priori SNR. The voice existence probability of the current audio frame may be calculated according to the a posteriori signal to noise ratio of the current audio frame, or according to a value obtained by averaging or smoothing the a posteriori signal to noise ratios of the same frequency bin over the preceding several frames.

It should be noted that the embodiment of the present disclosure does not limit the order of execution between step 103 and steps 101 and 102. For example, step 103 may be performed first and then step 101, or step 101 may be performed first and then step 103.

In addition, the final a priori signal to noise ratio of the current audio frame may be understood as the a priori signal to noise ratio used for gain calculation when performing noise reduction on the audio frame, or as the a priori signal-to-noise ratio that the embodiments of the present disclosure output for the current audio frame. Estimating the final a priori signal to noise ratio of the current audio frame according to the voice existence probability and the estimated value may be done as follows: the probability that the current audio frame is a voice frame is determined according to the voice existence probability. When the current audio frame is determined to be a pure noise frame, the final a priori SNR is set to a stable minimum, such as ξ_min, to ensure smooth processing of pure noise segments and reduce musical noise. When the current audio frame is determined to be an audio frame in a speech segment, the final a priori SNR is biased toward the estimated value of the minimum mean square error of the estimated a priori SNR, so that the final a priori SNR estimate is more accurate.

Through the above steps, the final a priori SNR is estimated from the voice existence probability of the current frame and the estimated value of the minimum mean square error of the estimated a priori SNR of the current audio frame. The final a priori SNR is therefore more strongly correlated with the current audio frame, which benefits noise suppression of the current audio frame and improves the noise suppression effect.

The a posteriori signal to noise ratio of the current audio frame is common knowledge and will not be described in detail herein. Estimating the estimated a priori signal to noise ratio of the current audio frame based on the a posteriori signal to noise ratio estimate of the current audio frame may consist of applying the direct decision method to that estimate; of course, the embodiments of the present disclosure are not limited thereto.

Optionally, the estimating the a priori signal to noise ratio of the current audio frame based on the a posteriori signal to noise ratio estimation value of the current audio frame, including:

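Estimate the estimated a priori signal-to-noise ratio of the current audio frame by the following formula:

$$\hat{\xi}(m) = \alpha \frac{|\hat{X}(m-1)|^2}{\lambda_d} + (1-\alpha)\,\max\big(\hat{\gamma}(m) - 1,\ 0\big)$$

where $\hat{\xi}(m)$ is the estimated a priori signal to noise ratio, α is a smoothing number, $\hat{X}(m-1)$ indicates the noise reduction processing result of the previous frame, $\lambda_d$ indicates the noise variance, and $\hat{\gamma}(m)$ is the a posteriori signal to noise ratio estimate of the current audio frame;

or, by the following formula:

where ξ(m-1) is the a priori signal to noise ratio of the previous frame.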

In this embodiment, the estimated a priori signal to noise ratio can be estimated by any one of the above two formulas. According to experiments

Corresponding formulas are better for calculating the above-mentioned estimated a priori signal-to-noise ratio. In this method, mainly the musical tone is less, so in the embodiment of the present disclosure, optionally,

The corresponding formula calculates the above-mentioned estimated prior signal-to-noise ratio.

Further, the smoothing number may be a value set in advance, for example, a value of 0.95 to 1, or a value of 0.98 or 0.3, which is not limited thereto, and the noise variance is common knowledge, and will not be described in detail.

Optionally, the foregoing method further includes:

In this embodiment, it is considered that the α factor needs to be as large as possible in pure noise, so that the estimated value is as stable as possible, and as small as possible when speech is present, so as to ensure fast tracking of the speech. The above-mentioned a₁ and a₂ may be 0.98 and 0.3, respectively. Of course, the embodiment of the present disclosure does not limit this; for example, they may be 0.95 and 0.28, and may be adjusted according to actual conditions.

In this embodiment, the accuracy of estimating the a priori signal to noise ratio can be improved by the above a ₁ and a ₂ .
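One way to realize such a switch between a₁ and a₂ is sketched below. The exact rule is an assumption, since the adjustment formula is not reproduced in this text: here the large smoothing number is chosen when both the a posteriori SNR and the previous a priori SNR fall below their thresholds. The function name `select_alpha` and the threshold values are hypothetical.

```python
def select_alpha(gamma, xi_prev, a1=0.98, a2=0.3, gamma_th=2.0, xi_th=1.0):
    """Pick the smoothing number for one bin (assumed switching rule).

    gamma   : a posteriori SNR of the current frame
    xi_prev : a priori SNR of the previous frame
    a1 > a2 : preset smoothing numbers (0.98 and 0.3 are the text's examples)
    gamma_th, xi_th : empirical thresholds (hypothetical values)
    """
    noise_like = gamma < gamma_th and xi_prev < xi_th
    return a1 if noise_like else a2   # stable in noise, fast when speech appears


alpha_noise = select_alpha(gamma=1.0, xi_prev=0.1)
alpha_speech = select_alpha(gamma=10.0, xi_prev=0.1)
```

The design intent follows the text: heavy smoothing keeps the estimate stable in pure noise, while light smoothing lets it track speech quickly.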

Optionally, in this implementation, the step of estimating the estimated a priori signal to noise ratio of the current audio frame based on the estimated probability of existence of the voice, further comprising:

or

among them,

Representing the estimated a priori signal to noise ratio,

with

In this implementation manner, the estimated a priori signal to noise ratio may be switched according to the voice existence probability of the current audio frame to improve the accuracy of the estimated a priori signal to noise ratio.

Optionally, calculating, according to the estimated a priori signal to noise ratio, an estimated value of a minimum mean square error corresponding to the estimated a priori signal to noise ratio of the current audio frame, including:

among them,

Representing the estimated a priori signal to noise ratio,

It should be noted that the estimated a priori signal to noise ratio used above is the one calculated in step 101, and is not limited to the estimated a priori signal-to-noise ratio calculated by the formulas mentioned above.

The above formula may be obtained according to a complex Gaussian model. In addition, a super-Gaussian model of speech can also be used to calculate E(X²|Y). The speech variance estimate can be taken as equivalent to E(X²|Y), because in practical applications estimating the a priori SNR mainly amounts to estimating the variance of the speech signal.

By definition, the speech variance is $\lambda_x = E(X^2)$, which depends only on the speech signal X. But X is not available, so most algorithms for estimating it have to estimate it from the noisy signal Y. This can also be seen from the direct decision method: in the second half of its formula, γ-1 is the maximum likelihood estimate of the speech variance (normalized by the noise variance) for the case where γ is known (i.e., Y is known), while the first half uses the instantaneous value $|\hat{X}(m-1)|^2$ in place of E(X²).

Therefore, most SNR estimation algorithms are established under the condition that the noisy signal Y is known. In other words, in reality the variance of the speech cannot be estimated directly, but only under the condition that Y is known.

Therefore, in the embodiments of the present disclosure, the conditional expectation E(X²|Y) is employed to estimate the variance of speech.

Based on this idea, it can be seen from the definition of conditional expectation that this is actually the MMSE estimate of X², the square of the speech amplitude spectrum. Considering the probability p(H₁|Y) of speech in Y, the final expression of the conditional expectation is:

$$E(X^2 \mid Y) = p(H_1 \mid Y)\,E(X^2 \mid Y, H_1) + p(H_0 \mid Y)\,E(X^2 \mid Y, H_0)$$

According to the complex Gaussian model:

where p(H₀|Y) represents the probability of the no-speech hypothesis H₀ under the condition that Y is known, that is, a conditional probability, with the binary hypothesis:

H₀: Y = N, indicating no voice

H₁: Y = X + N, indicating that there is voice

According to the above binary hypothesis, E(X²|Y, H₀) = 0.

In the above formula, the speech variance is the true speech variance, which needs to be further estimated. It can be estimated by the maximum likelihood or direct decision method. Alternatively, the speech can be assumed to obey other models, such as super-Gaussian models, for example the chi-square (chi) distribution:

After derivation, the resulting expression contains the confluent hypergeometric function. Because a transcendental function is included, the overall calculation is more complicated, and table lookup or similar methods are generally required.

According to the above analysis, the formula for the estimated value of the minimum mean square error given above can be derived from the complex Gaussian model or from a super-Gaussian model.

It should be noted that, in the embodiment of the present disclosure, the estimated value of the minimum mean square error of the estimated a priori signal to noise ratio may be calculated directly by using the above formula, without carrying out the conditional expectation derivation described above. That is, the conditional expectation derivation merely explains the principle behind the implementation in the embodiments of the present disclosure.
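As an illustration of such a direct calculation, the sketch below uses the standard complex Gaussian result E(X²|Y, H₁) = (ξ/(1+ξ))²·|Y|² + (ξ/(1+ξ))·λ_d, divided by the noise variance λ_d. Whether this matches the patent's formula term for term is an assumption, because the formula is not reproduced in this text; the function name `mmse_prior_snr` is hypothetical.

```python
def mmse_prior_snr(xi_hat, gamma):
    """MMSE-based refinement of the a priori SNR under a complex Gaussian model.

    xi_hat : pre-estimated a priori SNR of the current frame (step 101)
    gamma  : a posteriori SNR of the current frame
    Returns E(X^2 | Y, H1) / noise variance (assumed form).
    """
    w = xi_hat / (1.0 + xi_hat)     # Wiener-type weight
    return w * (1.0 + w * gamma)    # w^2 * gamma + w


xi_mmse = mmse_prior_snr(xi_hat=1.0, gamma=4.0)   # 0.5 * (1 + 0.5 * 4) = 1.5
```

Unlike the pre-estimate alone, this value depends directly on the a posteriori SNR of the current frame, which is what ties it to the current audio frame.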

or

For a fixed value,

In this embodiment, speech and noise are distinguished by the above formula. In addition, when the above formula is used to calculate the voice existence probability, the a posteriori signal-to-noise ratios of the same frequency bin in the preceding several frames may be averaged or smoothed, and the resulting value used to calculate the voice existence probability of the current audio frame. Additionally, the above formula may be derived directly from the complex Gaussian model provided above.
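Both alternatives can be sketched as follows, under stated assumptions. The first function is the standard complex Gaussian posterior with a fixed a priori SNR; the 15 dB default assumes that the value quoted in the experimental setup refers to this fixed value. The second is an assumed piecewise-linear mapping between (γ_min, p_min) and (γ_max, p_max). Function names and default thresholds are hypothetical.

```python
import math


def speech_presence_probability(gamma, xi_fixed_db=15.0, p_h1=0.5):
    """p(H1|Y) from the a posteriori SNR, complex Gaussian model."""
    xi = 10.0 ** (xi_fixed_db / 10.0)   # fixed a priori SNR, linear scale
    p_h0 = 1.0 - p_h1                   # a priori no-speech probability
    exponent = -gamma * xi / (1.0 + xi)
    return 1.0 / (1.0 + (p_h0 / p_h1) * (1.0 + xi) * math.exp(exponent))


def speech_presence_linear(gamma, gamma_min=1.0, gamma_max=10.0,
                           p_min=0.0, p_max=1.0):
    """Assumed linear interpolation of the probability between two bounds."""
    if gamma <= gamma_min:
        return p_min
    if gamma >= gamma_max:
        return p_max
    t = (gamma - gamma_min) / (gamma_max - gamma_min)
    return p_min + t * (p_max - p_min)


p_noise = speech_presence_probability(gamma=0.0)
p_speech = speech_presence_probability(gamma=20.0)
```

The exponential form gives a soft decision that saturates quickly once γ exceeds a few units, while the linear form trades accuracy for a cheaper computation.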

In the embodiment of the present disclosure, the voice existence probability allows the estimated a priori signal-to-noise ratio to be switched softly between pure noise segments and speech segments, thereby mitigating the tracking delay problem of the direct decision method while retaining the advantages of the direct decision method.

Optionally, the foregoing estimating the final a priori signal to noise ratio of the current audio frame by combining the voice existence probability and the estimated value, including:

among them,

In this embodiment, the above formula keeps the final a priori signal-to-noise ratio at a stable small value, such as ξ_min, in pure noise, while in speech segments the final a priori signal-to-noise ratio is biased toward the estimated value of the minimum mean square error of the estimated a priori signal-to-noise ratio.

Or understand that the estimated a priori signal-to-noise ratio is biased toward

In this embodiment, the speech-present state and the speech-absent state are distinguished. In the speech-present state, the optimal a priori signal to noise ratio estimate is derived according to the MMSE criterion. In the speech-absent state, a fixed minimum value is used as the limit of maximum suppression strength, which ensures smooth processing of pure noise segments and reduces musical noise. The speech-present and speech-absent states are weighted by the voice existence probability, which is calculated using a fixed a priori SNR value. This makes the a priori SNR estimate more accurate and resolves the tracking delay problem of the direct decision method.
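A minimal sketch of this combination is given below. The combining rule max(p·ξ_MMSE, ξ_min) is an assumption consistent with the behaviour described (a floor at ξ_min in noise, the MMSE value in speech); a weighted sum p·ξ_MMSE + (1-p)·ξ_min would behave similarly. The function name `final_prior_snr` is hypothetical, and the -20 dB default is taken from the experimental setup.

```python
def final_prior_snr(p_speech, xi_mmse, xi_min_db=-20.0):
    """Combine the voice existence probability with the MMSE estimate.

    p_speech : p(H1|Y) for the current frame and bin
    xi_mmse  : MMSE-based estimate of the a priori SNR
    xi_min_db: floor that limits the maximum suppression (assumed in dB)
    """
    xi_min = 10.0 ** (xi_min_db / 10.0)    # -20 dB corresponds to 0.01
    return max(p_speech * xi_mmse, xi_min)  # assumed combining rule


xi_in_noise = final_prior_snr(p_speech=0.0, xi_mmse=5.0)
xi_in_speech = final_prior_snr(p_speech=1.0, xi_mmse=5.0)
```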

It should be noted that the various embodiments described above may be implemented in combination with each other or separately, and the embodiments of the present disclosure are not limited in this respect. Additionally, in an embodiment of the present disclosure, the a priori signal to noise ratio estimated by the method may be used for gain calculation in the noise reduction processing of the audio signal, optionally in single-microphone noise reduction processing. For example, as shown in FIG. 2, the a posteriori signal-to-noise ratio and the power spectrum of the previous frame's processing result are obtained; the estimated a priori signal-to-noise ratio of the current audio frame is calculated using the direct decision method based on the a posteriori signal-to-noise ratio and that power spectrum; the voice existence probability of the current audio frame is calculated based on the a posteriori signal-to-noise ratio; the estimated value of the MMSE corresponding to the estimated a priori signal-to-noise ratio is calculated; and the final a priori signal-to-noise ratio of the current audio frame is estimated in combination with the voice existence probability and the estimated value, and is used for gain calculation.
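The flow of FIG. 2 for a single frequency bin can be sketched end to end as follows. Every formula in it is an assumption standing in for an equation not reproduced in this text: the decision-directed pre-estimate, the complex Gaussian MMSE refinement, the fixed-SNR voice existence probability, the max-based combination, and the Wiener gain. The function name `process_bin` and all default values are hypothetical.

```python
import math


def process_bin(y_power, noise_var, prev_clean_power, alpha=0.98,
                xi_fixed=10.0 ** 1.5, p_h1=0.5, xi_min=0.01):
    """Return (final a priori SNR, noise-reduced power) for one bin."""
    gamma = y_power / noise_var                       # a posteriori SNR
    # Step 101: pre-estimate by the direct decision method
    xi_hat = (alpha * prev_clean_power / noise_var
              + (1.0 - alpha) * max(gamma - 1.0, 0.0))
    # Step 102: MMSE estimate under the complex Gaussian model
    w = xi_hat / (1.0 + xi_hat)
    xi_mmse = w * (1.0 + w * gamma)
    # Step 103: voice existence probability with a fixed a priori SNR
    ratio = (1.0 - p_h1) / p_h1
    p = 1.0 / (1.0 + ratio * (1.0 + xi_fixed)
               * math.exp(-gamma * xi_fixed / (1.0 + xi_fixed)))
    # Step 104: final a priori SNR, floored at xi_min
    xi_final = max(p * xi_mmse, xi_min)
    gain = xi_final / (1.0 + xi_final)                # assumed Wiener gain
    return xi_final, gain * gain * y_power


xi_noise, _ = process_bin(y_power=1.0, noise_var=1.0, prev_clean_power=0.0)
xi_onset, _ = process_bin(y_power=101.0, noise_var=1.0, prev_clean_power=0.0)
```

At a speech onset after a noise-only frame, the direct decision pre-estimate alone is about 2, whereas the final value here is far larger, which illustrates the reduced tracking delay; in a noise-only bin the result stays at the floor.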

In the embodiment of the present disclosure, the above steps can eliminate the effect of the inherent one-frame delay, as well as the attenuation of the initial segment of the speech and the trailing at the end segment, thereby improving the noise reduction performance. The results are illustrated below with experimental data:

The experiment uses the NOIZEUS database with a data sampling rate of 8 kHz. The white noise is generated using Cool Edit (an audio processing software), and the other noise comes from the NOIZEUS database. The frame length is 20 ms, the overlap rate is 50%, and a square-root Hanning window is used for both analysis and synthesis.
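The framing parameters above translate into sample counts as sketched below. The periodic form of the Hanning window is an assumption (the text does not say which variant is used), and the function name `frame_setup` is hypothetical.

```python
import math


def frame_setup(sample_rate=8000, frame_ms=20, overlap=0.5):
    """Return (frame length, hop size, square-root Hann window)."""
    n = int(sample_rate * frame_ms / 1000)   # 8 kHz * 20 ms = 160 samples
    hop = int(n * (1.0 - overlap))           # 50% overlap gives 80 samples
    # Periodic Hann window (assumption), then its square root
    window = [math.sqrt(0.5 * (1.0 - math.cos(2.0 * math.pi * i / n)))
              for i in range(n)]
    return n, hop, window


n, hop, window = frame_setup()
```

Applying the square-root window once before the FFT and once after the IFFT yields an overall Hann weighting, and with 50% overlap the squared windows of adjacent frames sum to one, so overlap-add reconstructs an unmodified signal.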

Take 15dB. ξ _min takes -20dB, the suppression criterion uses MMSE-STSA (Short-Time Spectral Amplitude) algorithm, and the noise estimation uses unbiased MMSE algorithm.

Figures 3 and 4 show a comparison between the direct decision method and the method of the present disclosure when the signal to noise ratio is 0 dB and 5 dB, respectively. The speech in Figure 3 is sp01 and the noise is white noise; the speech in Figure 4 is sp04 and the noise is car noise, where sp01 and sp04 are utterance numbers in the data set. As can be seen at the arrows, the disclosed method is clearly superior to the comparison algorithm. In subjective comparison, musical noise is not noticeable in the processing results. Figure 5 shows the average segmental signal-to-noise ratio improvement at 0/5/10/15 dB for 30 utterances of the NOIZEUS database with car noise and white noise. It is easy to see from the figure that the performance of the method of the present disclosure is superior to that of the direct decision method.

It should be noted that the above method can be applied to any user terminal with a microphone, such as a mobile phone, a tablet personal computer, a laptop computer, a personal digital assistant (PDA), a mobile Internet device (MID), an in-vehicle device, or a wearable device. The specific type of the user terminal is not limited in the embodiment of the present disclosure.

In the embodiments of the present disclosure, an estimated a priori signal to noise ratio of the current audio frame is estimated; an estimated value of the MMSE corresponding to the estimated a priori signal to noise ratio of the current audio frame is calculated according to the estimated a priori signal to noise ratio; a speech presence probability of the current audio frame is calculated; and the final a priori signal to noise ratio of the current audio frame is estimated in conjunction with the speech presence probability and the estimated value. Because the final a priori signal-to-noise ratio is estimated by combining the speech presence probability of the current frame with the estimated value of the MMSE corresponding to the estimated a priori SNR of the current audio frame, rather than from the signal to noise ratio of the previous frame as in the prior art, the a priori signal to noise ratio estimated by the embodiments of the present disclosure is more strongly correlated with the current audio frame, thereby facilitating noise suppression of the current audio frame.

Referring to FIG. 6, an embodiment of the present disclosure provides a user terminal. As shown in FIG. 6, the user terminal 600 includes the following modules:

The first estimating module 601 is configured to estimate an estimated a priori signal to noise ratio of the current audio frame;

The first calculating module 602 is configured to calculate, according to the estimated a priori signal to noise ratio, an estimated value of a minimum mean square error corresponding to the estimated a priori signal to noise ratio of the current audio frame;

The second calculating module 603 is configured to calculate a voice existence probability of the current audio frame;

The second estimation module 604 is configured to estimate a final a priori signal to noise ratio of the current audio frame in conjunction with the voice presence probability and the estimated value.

Optionally, the first estimating module 601 is configured to estimate an estimated a priori signal to noise ratio of the current audio frame based on the a posteriori signal to noise ratio estimation value of the current audio frame.

Optionally, the first estimation module 601 is configured to estimate an estimated a priori signal to noise ratio of the current audio frame by using the following formula:

among them,

Indicates the noise reduction processing result of the previous frame,

Indicates the noise variance,

or,

The first estimation module 601 is configured to estimate an estimated a priori signal to noise ratio of the current audio frame by using the following formula:

among them,

For the a priori signal to noise ratio of the previous frame,

Optionally, as shown in FIG. 7, the user terminal 600 further includes:

The adjusting module 605 is configured to adjust, by using the following formula, a smoothing number required to estimate the estimated a priori signal to noise ratio:

Optionally, the first estimation module 601 is further configured to further estimate an estimated a priori signal to noise ratio of the current audio frame by using the following formula:

or

among them,

Representing the estimated a priori signal to noise ratio,

with

Optionally, the first calculating module 602 is configured to calculate, according to the estimated a priori signal to noise ratio, an estimated value of a minimum mean square error corresponding to the estimated a priori signal to noise ratio of the current audio frame by using the following formula:

among them,

Representing the estimated a priori signal to noise ratio,

Optionally, the second calculating module 603 is configured to calculate a voice existence probability of the current audio frame by using the following formula:

or

For a fixed value,

Optionally, the second estimation module 604 is configured to estimate a final a priori signal to noise ratio of the current audio frame by using the following formula:

among them,

It should be noted that, in this embodiment, the user terminal 600 may be the user terminal corresponding to the voice signal noise reduction method provided by the method embodiments of the present disclosure. Any implementation in the method embodiments can be implemented by the user terminal 600 in this embodiment and achieves the same beneficial effects; details are not described herein again.

Referring to FIG. 8, an embodiment of the present disclosure provides a structure of another user terminal, including: a processor 800, a transceiver 810, a memory 820, a user interface 830, and a bus interface, where:

The processor 800 is configured to read a program in the memory 820 and perform the following process:

Estimating an estimated a priori signal to noise ratio of the current audio frame;

Calculating an estimated value of the MMSE corresponding to the estimated a priori signal to noise ratio of the current audio frame according to the estimated a priori signal to noise ratio;

Calculating a voice existence probability of the current audio frame;

Estimating a final a priori signal to noise ratio of the current audio frame in conjunction with the voice existence probability and the estimated value.

The user interface 830 includes a microphone, and the transceiver 810 is configured to receive and transmit data under the control of the processor 800.

In FIG. 8, the bus architecture may include any number of interconnected buses and bridges, which link together various circuits of one or more processors, represented by the processor 800, and of memory, represented by the memory 820. The bus architecture can also link various other circuits such as peripherals, voltage regulators, and power management circuits. The bus interface provides an interface. The transceiver 810 can be a plurality of components, including a transmitter and a receiver, providing means for communicating with various other devices over a transmission medium. For different user equipments, the user interface 830 may also be an interface capable of connecting required external or internal devices, including but not limited to a keypad, a display, a speaker, a microphone, a joystick, and the like.

The processor 800 is responsible for managing the bus architecture and general processing, and the memory 820 can store data used by the processor 800 in performing operations.

among them,

Indicates the noise reduction processing result of the previous frame,

Indicates the noise variance,

or,

among them,

For the a priori signal to noise ratio of the previous frame,

Optionally, the processor 800 is further configured to:

or

among them,

Representing the estimated a priori signal to noise ratio,

with

among them,

Representing the estimated a priori signal to noise ratio,

or

For a fixed value,

Representing the a posteriori signal to noise ratio estimate of the current audio frame, exp() is an exponential function, γ_min and γ_max are two empirical values with γ_min < γ_max, and p_max and p_min are two empirical values with p_min < p_max.
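One way the empirical bounds γ_min, γ_max, p_min, and p_max could be used is to map the a posteriori signal to noise ratio onto a bounded speech presence probability. The sketch below assumes a linear interpolation between the bounds; the actual formula is not reproduced in this text, and the function name and default values are hypothetical.

```python
def bounded_speech_presence_probability(post_snr, gamma_min=1.0, gamma_max=10.0,
                                        p_min=0.05, p_max=0.95):
    """Map the a posteriori SNR to a probability in [p_min, p_max].

    Assumed rule: p_min below gamma_min, p_max above gamma_max, and a
    straight line in between. Requires gamma_min < gamma_max and p_min < p_max.
    """
    if not (gamma_min < gamma_max and p_min < p_max):
        raise ValueError("need gamma_min < gamma_max and p_min < p_max")
    if post_snr <= gamma_min:
        return p_min
    if post_snr >= gamma_max:
        return p_max
    fraction = (post_snr - gamma_min) / (gamma_max - gamma_min)
    return p_min + fraction * (p_max - p_min)
```

Keeping the probability strictly inside (0, 1) prevents the estimator from fully committing to speech presence or absence in any single frame.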

among them,

It should be noted that, in this embodiment, the user terminal may be the user terminal corresponding to the voice signal noise reduction method provided by the method embodiments of the present disclosure. Any implementation in the method embodiments can be implemented by the above user terminal in this embodiment and achieves the same beneficial effects; details are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. The device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division; in actual implementation, there may be other division manners: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device, or unit, and may be in an electrical, mechanical, or other form.

In addition, each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may be physically included separately, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of hardware plus software functional units.

The above-described integrated unit implemented in the form of a software functional unit can be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform part of the steps of the methods described in the various embodiments of the present disclosure. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above is a preferred embodiment of the present disclosure. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principles of the present disclosure, and such improvements and refinements should also be considered to fall within the protection scope of the present disclosure.

Claims

A signal to noise ratio estimation method for noise suppression, comprising:

Estimating an estimated a priori signal to noise ratio of the current audio frame;

Calculating an estimated value of a minimum mean square error corresponding to the estimated a priori signal to noise ratio of the current audio frame according to the estimated a priori signal to noise ratio;

Calculating a voice existence probability of the current audio frame;

Estimating a final a priori signal to noise ratio of the current audio frame in conjunction with the voice existence probability and the estimated value.
The method of claim 1 wherein said estimating an estimated a priori signal to noise ratio of a current audio frame comprises:

Estimating an a priori signal to noise ratio of the current audio frame based on the a posteriori signal to noise ratio estimate of the current audio frame.
The method of claim 2 wherein said estimating an a priori signal to noise ratio of a current audio frame based on an a posteriori signal to noise ratio estimate of said current audio frame comprises:

The estimated a priori SNR of the current audio frame is estimated by the following formula:

among them,
Representing the estimated a priori signal to noise ratio, α is a smoothing number,
Indicates the noise reduction processing result of the previous frame,
Indicates the noise variance,
Representing an a posteriori signal to noise ratio estimate of the current audio frame;

or,

The estimated a priori SNR of the current audio frame is estimated by the following formula:

among them,
Representing the estimated a priori signal to noise ratio, α is a smoothing number,
For the a priori signal to noise ratio of the previous frame,
Represents an a posteriori signal to noise ratio estimate for the current frame.
The method of claim 3 further comprising:

The smoothing number required to estimate the estimated a priori signal to noise ratio is adjusted by the following formula:

Where a_1 and a_2 are two preset smoothing numbers with a_1 > a_2, and γ_th and ξ_th are two empirical thresholds.
The method of claim 4, wherein the step of estimating an estimated a priori signal to noise ratio of the current audio frame based on the estimated probability of existence of the speech further comprises:

The estimated a priori signal to noise ratio of the current audio frame is further estimated by the following formula:

or

among them,
Representing the estimated a priori signal to noise ratio,
with
Representing the estimated a priori signal to noise ratios of the current audio frame obtained when the smoothing number is a_1 and a_2, respectively; p(H_1|Y) represents the voice existence probability, and p_th is a preset threshold.
The method according to any one of claims 1 to 5, wherein the calculating, according to the estimated a priori signal to noise ratio, of an estimated value of the minimum mean square error corresponding to the estimated a priori signal to noise ratio of the current audio frame comprises:

Calculating, according to the estimated a priori signal to noise ratio, the estimated value of the minimum mean square error corresponding to the estimated a priori signal to noise ratio of the current audio frame by the following formula:

among them,
An estimate of the minimum mean square error corresponding to the estimated a priori signal to noise ratio,
Representing the estimated a priori signal to noise ratio,
Representing an a posteriori signal to noise ratio estimate for the current audio frame.
The method of any of claims 1-5, wherein the calculating a voice presence probability of the current audio frame comprises:

Calculating the voice existence probability of the current audio frame by the following formula:

or

Where p(H_1|Y) represents the voice existence probability, and p(H_1) and p(H_0) respectively represent the a priori voice existence probability and the a priori voice absence probability,
For a fixed value,
Representing an a posteriori signal to noise ratio estimate of the current audio frame, exp() is an exponential function, γ_min and γ_max are two empirical values with γ_min < γ_max, and p_max and p_min are two empirical values with p_min < p_max.
The method of any of claims 1-5, wherein the estimating the final a priori signal to noise ratio of the current audio frame in conjunction with the speech presence probability and the estimate comprises:

The final a priori signal to noise ratio of the current audio frame is estimated by the following formula:

among them,
Representing the final a priori signal to noise ratio of the current audio frame,
An estimated value of the minimum mean square error corresponding to the estimated a priori signal to noise ratio, p(H_1|Y) represents the voice existence probability, and ξ_min is a certain fractional value.
A user terminal comprising:

a first estimating module, configured to estimate an estimated a priori signal to noise ratio of the current audio frame;

a first calculating module, configured to calculate, according to the estimated a priori signal to noise ratio, an estimated value of a minimum mean square error corresponding to the estimated a priori signal to noise ratio of the current audio frame;

a second calculating module, configured to calculate a voice existence probability of the current audio frame;

And a second estimating module, configured to estimate a final a priori signal to noise ratio of the current audio frame in combination with the voice presence probability and the estimated value.
The user terminal of claim 9, wherein the first estimating module is configured to estimate an estimated a priori signal to noise ratio of the current audio frame based on the a posteriori signal to noise ratio estimate of the current audio frame.
The user terminal of claim 10, wherein the first estimation module is configured to estimate an estimated a priori signal to noise ratio of the current audio frame by the following formula:

among them,
Representing the estimated a priori signal to noise ratio, α is a smoothing number,
Indicates the noise reduction processing result of the previous frame,
Indicates the noise variance,
Representing an a posteriori signal to noise ratio estimate of the current audio frame;

or,

The first estimation module is configured to estimate an estimated a priori signal to noise ratio of the current audio frame by using the following formula:

among them,
Representing the estimated a priori signal to noise ratio, α is a smoothing number,
For the a priori signal to noise ratio of the previous frame,
Represents an a posteriori signal to noise ratio estimate for the current frame.
The user terminal of claim 11, further comprising:

An adjustment module for adjusting a smoothing number required to estimate the estimated a priori signal to noise ratio by the following formula:

Where a_1 and a_2 are two preset smoothing numbers with a_1 > a_2, and γ_th and ξ_th are two empirical thresholds.
The user terminal of claim 12, wherein the first estimating module is further configured to further estimate an estimated a priori signal to noise ratio of the current audio frame by:

or

among them,
Representing the estimated a priori signal to noise ratio,
with
Representing the estimated a priori signal to noise ratios of the current audio frame obtained when the smoothing number is a_1 and a_2, respectively; p(H_1|Y) represents the voice existence probability, and p_th is a preset threshold.
The user terminal according to any one of claims 9 to 13, wherein the first calculation module is configured to calculate, according to the estimated a priori signal to noise ratio, the estimated value of the minimum mean square error corresponding to the estimated a priori signal to noise ratio of the current audio frame by the following formula:

among them,
An estimate of the minimum mean square error corresponding to the estimated a priori signal to noise ratio,
Representing the estimated a priori signal to noise ratio,
Representing an a posteriori signal to noise ratio estimate for the current audio frame.
The user terminal according to any one of claims 9 to 13, wherein the second calculation module is configured to calculate a voice existence probability of the current audio frame by the following formula:

or

Where p(H_1|Y) represents the voice existence probability, and p(H_1) and p(H_0) respectively represent the a priori voice existence probability and the a priori voice absence probability,
For a fixed value,
Representing an a posteriori signal to noise ratio estimate of the current audio frame, exp() is an exponential function, γ_min and γ_max are two empirical values with γ_min < γ_max, and p_max and p_min are two empirical values with p_min < p_max.
The user terminal according to any one of claims 9 to 13, wherein the second estimation module is configured to estimate a final a priori signal to noise ratio of the current audio frame by the following formula:

among them,
Representing the final a priori signal to noise ratio of the current audio frame,
An estimated value of the minimum mean square error corresponding to the estimated a priori signal to noise ratio, p(H_1|Y) represents the voice existence probability, and ξ_min is a certain fractional value.
A user terminal includes: a processor, a memory, and a transceiver, wherein:

The processor is configured to read a program in the memory and perform the following process:

Estimating the estimated a priori signal to noise ratio of the current audio frame;

Calculating an estimated value of a minimum mean square error corresponding to the estimated a priori signal to noise ratio of the current audio frame according to the estimated a priori signal to noise ratio;

Calculating a voice existence probability of the current audio frame;

Estimating a final a priori signal to noise ratio of the current audio frame in conjunction with the speech presence probability and the estimated value,

The transceiver is configured to receive and transmit data, and the memory is capable of storing data used by the processor when performing operations.