CN105869649B - Perceptual filtering method and perceptual filter - Google Patents
Perceptual filtering method and perceptual filter
- Publication number
- CN105869649B (application CN201510031872.9A)
- Authority
- CN
- China
- Prior art keywords
- frequency domain
- noise
- voice
- gain
- power
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Telephone Function (AREA)
Abstract
The invention provides a perceptual filtering method comprising the following steps: acquiring noisy speech and calculating the noise power from the noisy speech according to a noise estimation algorithm; calculating a frequency domain masking threshold from the noisy speech according to a masking model; converting the noisy speech into the frequency domain to obtain frequency domain noisy speech, which comprises frequency domain clean speech and frequency domain background noise; based on a speech estimation error algorithm, expressing the speech signal distortion as an expression in the frequency domain clean speech and the perceptual filter gain, and expressing the filtered background noise as an expression in the frequency domain background noise and the perceptual filter gain; constructing an equation for the perceptual filter gain based on the relation that the sum of the speech signal distortion power and the filtered background noise power is less than or equal to the frequency domain masking threshold; solving the equation to obtain the perceptual filter gain; and filtering the noisy speech according to the perceptual filter gain to obtain enhanced speech, so that the subjective perceptual quality of the enhanced speech is improved. A perceptual filter is also provided.
Description
Technical Field
The present invention relates to the field of speech signal processing, and in particular, to a perceptual filtering method and a perceptual filter.
Background
In real life, speech signals are inevitably contaminated by background noise. As a signal-processing approach, speech enhancement is an effective way to deal with this noise and has long been a research hotspot in the field of speech signal processing. The purpose of speech enhancement is to remove as much background noise as possible and improve the subjective auditory quality of the speech while preserving its intelligibility.
Conventional speech enhancement algorithms include spectral subtraction, Wiener filtering, minimum mean square error (MMSE) estimation, log-spectral amplitude MMSE, Discrete Cosine Transform (DCT)-based enhancement methods, and the like. Most of these methods rely on statistical models of the speech and noise components in the frequency domain, combined with various estimation theories, to design targeted noise-removal techniques. However, because the assumed models deviate from the actual situation, conventional speech enhancement algorithms still leave considerable speech distortion and residual noise in the enhanced signal, which degrades the enhancement result.
Disclosure of Invention
In view of the above, it is desirable to provide a perceptual filtering method and a perceptual filter that reduce the noise level below the auditory masking threshold of the human ear to improve the subjective perceptual quality of enhanced speech.
A method of perceptual filtering, the method comprising:
acquiring a voice with noise, and calculating the voice with noise according to a noise estimation algorithm to obtain noise power;
calculating the voice with noise according to a masking model to obtain a frequency domain masking threshold;
converting the voice with noise into a frequency domain to obtain the voice with noise of the frequency domain, wherein the voice with noise of the frequency domain comprises pure voice of the frequency domain and background noise of the frequency domain;
based on a speech estimation error algorithm, representing the speech signal distortion as a relational expression about the frequency domain pure speech and the gain of a perception filter, and representing the filtering background noise as a relational expression about the frequency domain background noise and the gain of the perception filter;
according to the voice signal distortion and the filtering background noise, constructing an equation related to the gain of the perception filter based on the relation that the sum of the voice signal distortion power and the filtering background noise power is less than or equal to the frequency domain masking threshold;
solving the equation to obtain the gain of the perceptual filter;
and according to the gain of the perception filter, carrying out filtering processing on the voice with noise to obtain enhanced voice.
In one embodiment, the step of constructing an equation regarding the perceptual filter gain based on the relationship that the sum of the speech signal distortion power and the filtered background noise power is less than or equal to the frequency domain masking threshold according to the speech signal distortion and the filtered background noise is as follows:
obtaining the distortion power of the voice signal according to the distortion of the voice signal;
obtaining the filtering background noise power according to the filtering background noise;
obtaining the equation (G(k)-1)^2 P_S(k) + (G(k))^2 P_Z(k) = T(k) based on the relation that the sum of the distortion power of the voice signal and the power of the filtering background noise is equal to the masking threshold of the frequency domain, where G(k) is the perceptual filter gain, k is the spectrum serial number, P_S(k) is the frequency domain pure speech power, P_Z(k) is the frequency domain background noise power, and T(k) is the frequency domain masking threshold.
In one embodiment, the step of solving the equation to obtain the perceptual filter gain comprises:
calculating to obtain the frequency domain background noise power by adopting an approximate algorithm according to the noise power;
calculating to obtain a posterior signal-to-noise ratio according to the frequency domain background noise power;
calculating to obtain a prior signal-to-noise ratio based on a direct decision algorithm according to the posterior signal-to-noise ratio;
and solving the equation according to the frequency domain background noise power, the prior signal-to-noise ratio and the frequency domain masking threshold to obtain the gain of the perception filter.
In one embodiment, before the step of converting the noisy speech into the frequency domain to obtain a frequency-domain noisy speech, where the frequency-domain noisy speech includes frequency-domain clean speech and frequency-domain background noise, the method further includes:
the method comprises the following steps of enhancing the voice with noise by adopting a short-time amplitude spectrum estimation method to obtain enhanced voice with noise, converting the voice with noise into a frequency domain to convert the enhanced voice with noise into the frequency domain, filtering the voice with noise to obtain enhanced voice, filtering the enhanced voice with noise to obtain enhanced voice, and calculating the background noise power of the frequency domain by adopting an approximate algorithm according to the noise power:
obtaining the frequency domain gain function G_H(k) based on the short-time amplitude spectrum estimation method, wherein k is a spectrum serial number;
obtaining the frequency domain background noise power P_Z(k) according to P_Z(k) = λ_d(k) - (1 - G_H(k))|Y(k)|^2, wherein λ_d(k) is the noise power and Y(k) is the frequency domain noisy speech.
In one embodiment, the step of calculating the prior snr based on a direct decision algorithm according to the a posteriori snr comprises:
obtaining the posterior signal-to-noise ratios of the current frame and the previous frame, γ'(k, l) and γ'(k, l-1) respectively, wherein k is a spectrum serial number, l is a frame serial number, and the current frame is frame l;
acquiring a previous frame perceptual filter gain G (k, l-1), wherein if the previous frame is a first frame, the previous frame perceptual filter gain is a preset value;
obtaining the prior signal-to-noise ratio of the current frame from the posterior signal-to-noise ratios and the perceptual filter gain according to the formula ξ'(k, l) = η·G^2(k, l-1)·γ'(k, l-1) + (1 - η)·max{γ'(k, l) - 1, 0}, wherein η is a smoothing factor and 0 < η < 1.
A perceptual filter, the perceptual filter comprising:
the acquisition module is used for acquiring the voice with noise;
the noise power calculation module is used for calculating the voice with noise according to a noise estimation algorithm to obtain noise power;
the masking threshold calculation module is used for calculating the voice with the noise according to a masking model to obtain a frequency domain masking threshold;
the frequency domain conversion module is used for converting the voice with noise into a frequency domain to obtain the voice with noise in the frequency domain, wherein the voice with noise in the frequency domain comprises pure voice in the frequency domain and background noise in the frequency domain;
an equation constructing module, configured to represent, based on a speech estimation error algorithm, speech signal distortion as a relational expression about the frequency-domain clean speech and the gain of the perceptual filter, represent filtering background noise as a relational expression about the frequency-domain background noise and the gain of the perceptual filter, and construct, according to the speech signal distortion and the filtering background noise, an equation about the gain of the perceptual filter based on a relational expression that a sum of speech signal distortion power and filtering background noise power is less than or equal to a frequency-domain masking threshold;
the gain solving module is used for solving the equation to obtain the gain of the perception filter;
and the filtering processing module is used for filtering the voice with noise according to the gain of the perception filter to obtain enhanced voice.
In one embodiment, the equation constructing module constructs an equation about the perceptual filter gain according to the speech signal distortion and the filtering background noise, based on a relationship that a sum of the speech signal distortion power and the filtering background noise power is less than or equal to a frequency domain masking threshold, specifically:
obtaining the distortion power of the voice signal according to the distortion of the voice signal;
obtaining the filtering background noise power according to the filtering background noise;
obtaining the equation (G(k)-1)^2 P_S(k) + (G(k))^2 P_Z(k) = T(k) based on the relation that the sum of the distortion power of the voice signal and the power of the filtering background noise is equal to the masking threshold of the frequency domain, where G(k) is the perceptual filter gain, k is the spectrum serial number, P_S(k) is the frequency domain pure speech power, P_Z(k) is the frequency domain background noise power, and T(k) is the frequency domain masking threshold.
In one embodiment, the gain solving module comprises:
the solving preparation unit is used for calculating to obtain the frequency domain background noise power by adopting an approximation algorithm according to the noise power, calculating to obtain a posterior signal-to-noise ratio according to the frequency domain background noise power, and calculating to obtain a prior signal-to-noise ratio based on a direct decision algorithm according to the posterior signal-to-noise ratio;
and the solving unit is used for solving the equation according to the frequency domain background noise power, the prior signal-to-noise ratio and the frequency domain masking threshold value to obtain the gain of the perceptual filter.
In one embodiment, the perceptual filter further comprises:
the enhancement module is used for enhancing the voice with noise by adopting a short-time amplitude spectrum estimation method to obtain the enhanced voice with noise;
the frequency domain conversion module converts the noisy speech into a frequency domain to convert the enhanced noisy speech into the frequency domain;
the filtering processing module is used for filtering the voice with noise to obtain enhanced voice, and filtering the enhanced voice with noise to obtain enhanced voice;
the calculation preparation unit calculates the frequency domain background noise power by adopting an approximation algorithm according to the noise power, and specifically comprises the following steps:
obtaining the frequency domain gain function G based on the short-time amplitude spectrum estimation methodH(k) Wherein k is a frequency spectrum serial number;
according to PZ(k)=λd(k)-(1-GH(k))|Y(k)|2Obtaining the frequency domain background noise power PZ(k) Wherein λ isd(k) Y (k) is the noise power, and Y (k) is the frequency domain noisy speech.
In one embodiment, the prior snr obtained by the solution preparation unit based on the direct decision algorithm according to the a posteriori snr is specifically:
obtaining the posterior signal-to-noise ratios of the current frame and the previous frame, γ'(k, l) and γ'(k, l-1) respectively, wherein k is a spectrum serial number, l is a frame serial number, and the current frame is frame l;
acquiring a previous frame perceptual filter gain G (k, l-1), wherein if the previous frame is a first frame, the previous frame perceptual filter gain is a preset value;
obtaining the prior signal-to-noise ratio of the current frame from the posterior signal-to-noise ratios and the perceptual filter gain according to the formula ξ'(k, l) = η·G^2(k, l-1)·γ'(k, l-1) + (1 - η)·max{γ'(k, l) - 1, 0}, wherein η is a smoothing factor and 0 < η < 1.
According to the perceptual filtering method and the perceptual filter described above, noisy speech is acquired; the noise power is calculated from the noisy speech according to a noise estimation algorithm; a frequency domain masking threshold is calculated from the noisy speech according to a masking model; the noisy speech is converted into the frequency domain to obtain frequency domain noisy speech, which comprises frequency domain clean speech and frequency domain background noise; based on a speech estimation error algorithm, the speech signal distortion is expressed in terms of the frequency domain clean speech and the perceptual filter gain, and the filtered background noise is expressed in terms of the frequency domain background noise and the perceptual filter gain; from the speech signal distortion and the filtered background noise, an equation for the perceptual filter gain is constructed based on the relation that the sum of the speech signal distortion power and the filtered background noise power is less than or equal to the frequency domain masking threshold; the equation is solved to obtain the perceptual filter gain; and the noisy speech is filtered according to the perceptual filter gain to obtain enhanced speech. Because the sum of the speech signal distortion power and the filtered background noise power is kept at or below the frequency domain masking threshold, the speech signal distortion power is kept small while the noise level stays below the auditory masking threshold of the human ear and cannot be heard, thereby improving the subjective perceptual quality of the enhanced speech.
Drawings
FIG. 1 is a flow diagram of a method of perceptual filtering in one embodiment;
FIG. 2 is a flow diagram for constructing equations relating perceptual filter gains in one embodiment;
FIG. 3 is a flow diagram of solving equations to obtain perceptual filter gains in one embodiment;
FIG. 4 is a block diagram of the architecture of the speech enhancement system in one embodiment;
FIG. 5 is a block diagram of the structure of a perceptual filter in one embodiment;
FIG. 6 is a block diagram of a gain solver module in one embodiment;
fig. 7 is a block diagram of the structure of a perceptual filter in another embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, there is provided a perceptual filtering method comprising the steps of:
step S110, obtaining the voice with noise, and calculating the voice with noise according to a noise estimation algorithm to obtain noise power.
In this embodiment, the obtained noisy speech is represented in the time domain as y(n) = s(n) + z(n), where s(n) is the clean speech signal and z(n) is the background noise in the original noisy speech. The noise estimation algorithm may be an existing algorithm; the frequency domain noise power λ_d(k) is calculated from the noisy speech y(n) = s(n) + z(n) according to the noise estimation algorithm, where k is the spectrum serial number.
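The patent leaves the choice of noise estimation algorithm open ("may adopt an existing algorithm"). Purely for illustration, the sketch below estimates λ_d(k) by averaging the power spectra of an assumed noise-only lead-in segment; the frame length, hop size, number of noise-only frames, and the noise-only assumption itself are illustrative choices, not part of the patent.

```python
import numpy as np

def estimate_noise_power(y, frame_len=512, hop=256, noise_frames=10):
    """Rough estimate of the frequency domain noise power lambda_d(k).

    Illustrative assumption: the first `noise_frames` frames of the noisy
    speech y(n) contain background noise only, so their power spectra are
    averaged. Any existing noise estimator could be substituted here.
    """
    win = np.hanning(frame_len)
    spectra = []
    for i in range(noise_frames):
        frame = y[i * hop: i * hop + frame_len]
        if len(frame) < frame_len:
            break
        spectra.append(np.abs(np.fft.rfft(frame * win)) ** 2)
    return np.mean(spectra, axis=0)  # lambda_d(k), one value per frequency bin k
```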
And step S120, calculating the voice with noise according to the masking model to obtain a frequency domain masking threshold.
In this embodiment, the masking model may be an existing masking model, such as a psychoacoustic model, and the frequency domain masking threshold T(k) of the frequency domain noisy speech Y(k) is calculated according to the masking model.
Step S130, converting the speech with noise into a frequency domain to obtain a frequency domain speech with noise, including a frequency domain pure speech and a frequency domain background noise.
In this embodiment, the noisy speech y(n) = s(n) + z(n) is transformed into the frequency domain through an FFT to obtain the frequency domain noisy speech Y(k), written as Y(k) = S(k) + Z(k), where S(k) is the frequency domain pure speech, Z(k) is the frequency domain background noise, and k is the spectrum serial number. It will be appreciated that the noisy speech may already have been processed by a speech enhancement algorithm, such as a speech enhancement method based on short-time spectral amplitude estimation; in that case z(n) is the residual noise remaining after that method.
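For concreteness, a minimal sketch of the frequency domain conversion in step S130 using a per-frame windowed FFT. The Hann window, frame length, and hop size are illustrative assumptions; the patent only requires that y(n) = s(n) + z(n) be transformed so that Y(k) = S(k) + Z(k).

```python
import numpy as np

def to_frequency_domain(y, frame_len=512, hop=256):
    """Split the time domain noisy speech y(n) into overlapping windowed
    frames and return the per-frame spectra Y(k, l) (bin k, frame l)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    Y = np.empty((frame_len // 2 + 1, n_frames), dtype=complex)
    for l in range(n_frames):
        Y[:, l] = np.fft.rfft(y[l * hop: l * hop + frame_len] * win)
    return Y
```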
Step S140, based on the speech estimation error algorithm, representing the speech signal distortion as a relational expression about the frequency domain pure speech and the gain of the perception filter, and representing the filtering background noise as a relational expression about the frequency domain background noise and the gain of the perception filter.
In this embodiment, the frequency domain enhanced speech after denoising by the perceptual filter is Ŝ(k) = G(k)Y(k). The speech estimation error is defined as E(k) = S(k) - Ŝ(k), giving E(k) = S(k) - G(k)Y(k), where E(k) is the speech estimation error, S(k) is the frequency domain pure speech, G(k) is the perceptual filter gain, and Y(k) is the frequency domain noisy speech. Substituting Y(k) = S(k) + Z(k) gives E(k) = S(k) - G(k)(S(k) + Z(k)), where Z(k) is the frequency domain background noise. Rewriting the speech estimation error as E(k) = (1 - G(k))S(k) - G(k)Z(k) yields the speech signal distortion ε_S(k) = |(1 - G(k))S(k)| and the filtered background noise ε_Z(k) = |-G(k)Z(k)| = |G(k)Z(k)|.
And step S150, constructing an equation related to the gain of the perception filter according to the distortion of the voice signal and the filtering background noise and based on the relation that the sum of the distortion power of the voice signal and the power of the filtering background noise is less than or equal to the masking threshold of the frequency domain.
In this embodiment, the speech signal distortion power is E_S(k) = E{ε_S^T(k) ε_S(k)} and the filtered background noise power is E_Z(k) = E{ε_Z^T(k) ε_Z(k)}, where E{·} denotes expectation and T denotes the matrix transpose. Taking the masking effect of the human ear into account, the optimal gain function G(k) should make the speech distortion as small as possible while keeping the background noise below the masking threshold of the human ear; if the speech distortion is too large, audible distortion appears and the subjective perceptual quality suffers. This embodiment therefore requires that the sum of the speech signal distortion power E_S(k) and the filtered background noise power E_Z(k) be less than or equal to the frequency domain masking threshold T(k), i.e. E_S(k) + E_Z(k) ≤ T(k). An equation for G(k) can be constructed by choosing a customized relationship between E_S(k) + E_Z(k) and T(k) that satisfies E_S(k) + E_Z(k) ≤ T(k), e.g. E_S(k) + E_Z(k) = T(k)/2.
In one embodiment, as shown in fig. 2, the equation for G(k) is constructed from the relation that the sum of the speech signal distortion power and the filtered background noise power equals the frequency domain masking threshold, and step S150 includes the following steps:
and step S151, obtaining the distortion power of the voice signal according to the distortion of the voice signal.
Specifically, substituting the speech signal distortion ε_S(k) = |(1 - G(k))S(k)| into E_S(k) = E{ε_S^T(k) ε_S(k)} gives the speech signal distortion power E_S(k) = (G(k)-1)^2 P_S(k), where P_S(k) = E{S^T(k) S(k)} is the frequency domain pure speech power.
And S152, obtaining the filtering background noise power according to the filtering background noise.
Specifically, substituting the filtered background noise ε_Z(k) = |G(k)Z(k)| into E_Z(k) = E{ε_Z^T(k) ε_Z(k)} gives the filtered background noise power E_Z(k) = (G(k))^2 P_Z(k), where P_Z(k) = E{Z^T(k) Z(k)} is the frequency domain background noise power.
Step S153, based on the relation that the sum of the distortion power of the voice signal and the power of the filtering background noise is equal to the masking threshold of the frequency domain, obtaining the equation (G(k)-1)^2 P_S(k) + (G(k))^2 P_Z(k) = T(k), where G(k) is the perceptual filter gain, k is the spectrum serial number, P_S(k) is the frequency domain pure speech power, P_Z(k) is the frequency domain background noise power, and T(k) is the frequency domain masking threshold.
Specifically, substituting the speech signal distortion power E_S(k) = (G(k)-1)^2 P_S(k) and the filtered background noise power E_Z(k) = (G(k))^2 P_Z(k) into E_S(k) + E_Z(k) = T(k) yields (G(k)-1)^2 P_S(k) + (G(k))^2 P_Z(k) = T(k), where G(k) is the perceptual filter gain, k is the spectrum serial number, P_S(k) is the frequency domain pure speech power, P_Z(k) is the frequency domain background noise power, and T(k) is the frequency domain masking threshold.
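Gathering the formulas of steps S151 to S153 in one place, in standard notation:

```latex
\begin{aligned}
E_S(k) &= E\{\varepsilon_S^{T}(k)\,\varepsilon_S(k)\} = (G(k)-1)^2 P_S(k),
  &P_S(k) &= E\{S^{T}(k)S(k)\},\\
E_Z(k) &= E\{\varepsilon_Z^{T}(k)\,\varepsilon_Z(k)\} = (G(k))^2 P_Z(k),
  &P_Z(k) &= E\{Z^{T}(k)Z(k)\},\\
E_S(k) + E_Z(k) &= T(k)
  \;\Longrightarrow\; (G(k)-1)^2 P_S(k) + (G(k))^2 P_Z(k) = T(k).
\end{aligned}
```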
And step S160, solving the equation to obtain the perceptual filter gain.
In this embodiment, (G(k)-1)^2 P_S(k) + (G(k))^2 P_Z(k) = T(k) is a quadratic equation in the single unknown G(k). It may be solved by first calculating P_S(k) and P_Z(k) and then applying the quadratic root formula, or the equation may first be transformed and then solved. Since the quadratic equation may have no real solution, G(k) can be set to a customized value in that case.
In one embodiment, as shown in fig. 3, step S160 includes the following steps:
and step S161, calculating to obtain frequency domain background noise power by adopting an approximation algorithm according to the noise power.
Specifically, the noise power λ_d(k) is approximately equal to the frequency domain background noise power P_Z(k), so the frequency domain background noise power is obtained as P_Z(k) = λ_d(k). It will be appreciated that if the acquired noisy speech has been processed by a speech enhancement algorithm, the approximation may differ and a customized approximation algorithm can be used.
And step S162, calculating according to the frequency domain background noise power to obtain a posterior signal-to-noise ratio, and calculating according to the posterior signal-to-noise ratio based on a direct decision algorithm to obtain a prior signal-to-noise ratio.
In the present embodiment, the posterior signal-to-noise ratio γ'(k) is defined as γ'(k) = |Y(k)|^2 / P_Z(k), where Y(k) is the frequency domain noisy speech, |Y(k)| is its spectral amplitude, and P_Z(k) is the frequency domain background noise power. An existing direct-decision (decision-directed) algorithm can then be used to calculate the prior signal-to-noise ratio ξ'(k).
And step S163, solving an equation according to the frequency domain background noise power, the prior signal-to-noise ratio and the frequency domain masking threshold to obtain the gain of the perception filter.
In this embodiment, suppose the equation constructed from E_S(k) + E_Z(k) = T(k) is (G(k)-1)^2 P_S(k) + (G(k))^2 P_Z(k) = T(k). Taking the solution of this equation as an example, dividing both sides by P_Z(k) converts it into (G(k)-1)^2 ξ'(k) + (G(k))^2 = C(k), where ξ'(k) = P_S(k)/P_Z(k) is the prior signal-to-noise ratio and C(k) = T(k)/P_Z(k). This is a quadratic equation in the single unknown G(k) in which ξ'(k) and C(k) are known; applying the quadratic root formula gives G(k) = (ξ'(k) ± sqrt(C(k)(1 + ξ'(k)) - ξ'(k))) / (1 + ξ'(k)), which has a real solution under the condition C(k)(1 + ξ'(k)) ≥ ξ'(k). The root formula is not used when this condition is not satisfied or when C(k) ≥ 1. If the condition is not satisfied, G(k) is set to a customized value. If C(k) ≥ 1, the frequency domain background noise power P_Z(k) is already below the frequency domain masking threshold T(k); the noise level is then smaller than the auditory masking threshold of the human ear, no filtering of the frequency domain noisy speech Y(k) is needed to achieve a good subjective auditory effect, and G(k) = 1 is defined. Combining the above analysis, the perceptual filter gain G(k) is the real root of the quadratic equation when it exists and C(k) < 1, a customized value when no real root exists, and 1 when C(k) ≥ 1.
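A minimal per-bin sketch of the gain computation described above, working on the normalized equation (G(k)-1)^2 ξ'(k) + (G(k))^2 = C(k). Two details are assumptions made here for illustration rather than statements of the patented method: the larger root of the quadratic is taken, and a fixed floor value `g_floor` stands in for the customized gain used when no real root exists.

```python
import numpy as np

def perceptual_gain(xi, C, g_floor=0.1):
    """Solve (G-1)^2*xi + G^2 = C per frequency bin for the gain G(k).

    xi : prior SNR xi'(k) = P_S(k) / P_Z(k)        (array over bins k)
    C  : C(k) = T(k) / P_Z(k), masking threshold over noise power
    """
    xi = np.asarray(xi, dtype=float)
    C = np.asarray(C, dtype=float)
    disc = C * (1.0 + xi) - xi             # (discriminant / 4) of the quadratic
    G = np.full_like(xi, g_floor)          # assumed fallback when no real root exists
    real = disc >= 0.0
    # larger root of (1 + xi) G^2 - 2 xi G + (xi - C) = 0  (root choice is an assumption)
    G[real] = (xi[real] + np.sqrt(disc[real])) / (1.0 + xi[real])
    G[C >= 1.0] = 1.0                      # noise already below the masking threshold
    return G
```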
it is understood that the present embodiment is only to solve the equation (G (k) -1)2PS(k)+(G(k))2PZ(k) Given as an example t (k), the equation may be according to ES(k)+EZ(k) Other equations of construction ≦ T (k).
And step S170, filtering the noisy speech according to the perceptual filter gain to obtain the enhanced speech.
In this embodiment, according to the perceptual filter gain G(k), the enhanced frequency domain speech is obtained from the frequency domain noisy speech as Ŝ(k) = G(k)Y(k) and is then converted back to the time domain to obtain the enhanced speech ŝ(n). Alternatively, the perceptual filter gain G(k) is first converted to the time domain to obtain g(n), and the enhanced speech is obtained as ŝ(n) = g(n) * y(n), where y(n) is the time domain noisy speech and * denotes convolution.
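A sketch of this filtering step: the gain is applied per bin and the result is brought back to the time domain by inverse FFT with overlap-add. The overlap-add reconstruction details are illustrative assumptions; the patent only specifies Ŝ(k) = G(k)Y(k) followed by the inverse transform (or, equivalently, time domain convolution with g(n)).

```python
import numpy as np

def apply_gain_and_reconstruct(Y, G, frame_len=512, hop=256):
    """Apply per-frame gains G(k, l) to the noisy spectra Y(k, l) and
    rebuild the enhanced time domain speech by inverse FFT and overlap-add.
    The Hann analysis window at 50% overlap sums approximately to one,
    so plain overlap-add is used here without a synthesis window."""
    n_bins, n_frames = Y.shape
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    for l in range(n_frames):
        s_hat = np.fft.irfft(G[:, l] * Y[:, l], n=frame_len)
        out[l * hop: l * hop + frame_len] += s_hat
    return out
```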
In this embodiment, noisy speech is acquired; the noise power is calculated from the noisy speech according to a noise estimation algorithm; a frequency domain masking threshold is calculated from the noisy speech according to a masking model; the noisy speech is converted into the frequency domain to obtain frequency domain noisy speech, which comprises frequency domain pure speech and frequency domain background noise; based on a speech estimation error algorithm, the speech signal distortion is expressed in terms of the frequency domain pure speech and the perceptual filter gain, and the filtered background noise is expressed in terms of the frequency domain background noise and the perceptual filter gain; from the speech signal distortion and the filtered background noise, an equation for the perceptual filter gain is constructed based on the relation that the sum of the speech signal distortion power and the filtered background noise power is less than or equal to the frequency domain masking threshold; the equation is solved to obtain the perceptual filter gain; and the noisy speech is filtered according to the perceptual filter gain to obtain enhanced speech. Because the sum of the speech signal distortion power and the filtered background noise power is kept at or below the frequency domain masking threshold, the speech signal distortion power remains small while the noise level stays below the auditory masking threshold of the human ear and cannot be heard, thereby improving the subjective perceptual quality of the enhanced speech.
In one embodiment, before step S130 the method further includes: enhancing the noisy speech by a short-time amplitude spectrum estimation method to obtain enhanced noisy speech. In this case, converting the noisy speech into the frequency domain in step S130 means converting the enhanced noisy speech into the frequency domain, and filtering the noisy speech in step S170 to obtain enhanced speech means filtering the enhanced noisy speech to obtain the enhanced speech. Step S161 then includes:
obtaining the frequency domain gain function G_H(k) of the short-time amplitude spectrum estimation method, where k is the spectrum serial number; and obtaining the frequency domain background noise power P_Z(k) according to P_Z(k) = λ_d(k) - (1 - G_H(k))|Y(k)|^2, where λ_d(k) is the noise power and Y(k) is the frequency domain noisy speech.
In this embodiment, because residual noise remains in speech enhanced by the short-time amplitude spectrum estimation method, the perceptual filtering method of this embodiment can further improve the enhancement result. In this case, to calculate the frequency domain background noise power P_Z(k) from the noise power λ_d(k) with an approximation algorithm, the frequency domain gain function G_H(k) of the short-time amplitude spectrum estimation method is obtained first, and the frequency domain background noise power is then approximated as P_Z(k) = λ_d(k) - (1 - G_H(k))|Y(k)|^2, where Y(k) is the frequency domain noisy speech and |Y(k)| is its spectral amplitude.
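A one-line sketch of this approximation, assuming the input was pre-enhanced by a short-time amplitude spectrum method with gain G_H(k); the small positive floor is an added safeguard against negative values and is not part of the patent.

```python
import numpy as np

def residual_noise_power(lambda_d, G_H, Y):
    """Approximate the frequency domain residual noise power for
    pre-enhanced input: P_Z(k) = lambda_d(k) - (1 - G_H(k)) * |Y(k)|^2."""
    return np.maximum(lambda_d - (1.0 - G_H) * np.abs(Y) ** 2, 1e-12)  # floor is an added safeguard
```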
In one embodiment, the step of calculating the prior signal-to-noise ratio from the posterior signal-to-noise ratio with the direct-decision algorithm includes: obtaining the posterior signal-to-noise ratios of the current frame and the previous frame, γ'(k, l) and γ'(k, l-1) respectively, where k is the spectrum serial number, l is the frame serial number, and the current frame is frame l; obtaining the previous-frame perceptual filter gain G(k, l-1), where if the previous frame is the first frame, the previous-frame perceptual filter gain is a preset value; and obtaining the prior signal-to-noise ratio of the current frame from the posterior signal-to-noise ratios and the perceptual filter gain according to the formula ξ'(k, l) = η·G^2(k, l-1)·γ'(k, l-1) + (1 - η)·max{γ'(k, l) - 1, 0}, where η is a smoothing factor and 0 < η < 1.
In this embodiment, the first-frame perceptual filter gain G(k, 1) is set to a preset value, preferably 1. The posterior signal-to-noise ratios of the second frame and the first frame, γ'(k, 2) and γ'(k, 1), are obtained, and the prior signal-to-noise ratio of the second frame is computed as ξ'(k, 2) = η·G^2(k, 1)·γ'(k, 1) + (1 - η)·max{γ'(k, 2) - 1, 0}, where η is a smoothing factor that can be any number between 0 and 1, preferably η = 0.92. After the prior signal-to-noise ratio of the second frame is obtained, the second-frame perceptual filter gain G(k, 2) can be obtained from the equation solved in the subsequent steps, and so on for later frames.
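A sketch of this direct-decision (decision-directed) recursion. The combining formula is the reconstructed form stated above, which is an interpretation of the text rather than a reproduction of the original expression.

```python
import numpy as np

def prior_snr(gamma_curr, gamma_prev, G_prev, eta=0.92):
    """Decision-directed prior SNR xi'(k, l) from the posterior SNRs of the
    current and previous frames and the previous-frame gain G(k, l-1).

    Assumed form: xi' = eta * G_prev**2 * gamma_prev
                        + (1 - eta) * max(gamma_curr - 1, 0)."""
    return eta * (G_prev ** 2) * gamma_prev + (1.0 - eta) * np.maximum(gamma_curr - 1.0, 0.0)
```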
The above embodiment can be applied to the speech enhancement system shown in fig. 4: the noisy speech is input, the frequency domain masking threshold T(k) is obtained by the masking threshold estimation 164, and the noise power λ_d(k) is obtained by the noise estimation 166; T(k) and λ_d(k) are fed into the perceptual enhancement filter 165, which constructs and solves the equation to obtain the perceptual filter gain and performs the filtering to obtain the enhanced speech.
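Tying the pieces together, the sketch below shows one possible per-frame arrangement of the system in FIG. 4. The helper functions (to_frequency_domain, estimate_noise_power, prior_snr, perceptual_gain, apply_gain_and_reconstruct) are the hypothetical sketches given earlier in this description, and masking_threshold is assumed to wrap an existing psychoacoustic model; none of these names or defaults come from the patent itself.

```python
import numpy as np

def perceptual_filtering(y, masking_threshold, frame_len=512, hop=256, eta=0.92):
    """Illustrative end-to-end pass: time domain noisy speech in, enhanced speech out."""
    Y = to_frequency_domain(y, frame_len, hop)           # Y(k, l)
    lam_d = estimate_noise_power(y, frame_len, hop)      # lambda_d(k)
    P_Z = lam_d                                          # approximation P_Z(k) ~ lambda_d(k)
    n_bins, n_frames = Y.shape
    G = np.ones((n_bins, n_frames))                      # first-frame gain preset to 1
    gamma = np.abs(Y) ** 2 / P_Z[:, None]                # posterior SNR gamma'(k, l)
    for l in range(1, n_frames):
        T = masking_threshold(Y[:, l])                   # frequency domain masking threshold T(k)
        xi = prior_snr(gamma[:, l], gamma[:, l - 1], G[:, l - 1], eta)
        G[:, l] = perceptual_gain(xi, T / P_Z)
    return apply_gain_and_reconstruct(Y, G, frame_len, hop)
```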
In one embodiment, as shown in fig. 5, there is provided a perceptual filter comprising:
the obtaining module 210 is configured to obtain a noisy speech.
And the noise power calculation module 220 is configured to calculate the noise power of the voice with noise according to a noise estimation algorithm.
And the masking threshold calculation module 230 is configured to calculate a frequency domain masking threshold for the voice with noise according to the masking model.
The frequency domain converting module 240 is configured to convert the voice with noise into a frequency domain to obtain a frequency domain voice with noise, where the frequency domain voice with noise includes a frequency domain pure voice and a frequency domain background noise.
And the equation constructing module 250 is used for expressing the distortion of the voice signal as a relational expression about the pure voice of the frequency domain and the gain of the perception filter based on a voice estimation error algorithm, expressing the filtering background noise as a relational expression about the background noise of the frequency domain and the gain of the perception filter, and constructing an equation about the gain of the perception filter based on a relational expression that the sum of the distortion power of the voice signal and the power of the filtering background noise is less than or equal to a masking threshold of the frequency domain according to the distortion of the voice signal and the filtering background noise.
The gain solving module 260 is used for solving an equation to obtain the gain of the perception filter;
and a filtering processing module 270, configured to perform filtering processing on the noisy speech according to the gain of the perceptual filter to obtain an enhanced speech.
In one embodiment, the equation constructing module 250 constructs the equation for the perceptual filter gain, from the speech signal distortion and the filtered background noise, based on the relation that the sum of the speech signal distortion power and the filtered background noise power is less than or equal to the frequency domain masking threshold, specifically by: obtaining the speech signal distortion power from the speech signal distortion; obtaining the filtered background noise power from the filtered background noise; and obtaining the equation (G(k)-1)^2 P_S(k) + (G(k))^2 P_Z(k) = T(k) based on the relation that the sum of the speech signal distortion power and the filtered background noise power equals the frequency domain masking threshold, where G(k) is the perceptual filter gain, k is the spectrum serial number, P_S(k) is the frequency domain pure speech power, P_Z(k) is the frequency domain background noise power, and T(k) is the frequency domain masking threshold.
In one embodiment, as shown in FIG. 6, the gain solving module 260 includes:
the solution preparation unit 261 is configured to calculate a frequency domain background noise power by using an approximation algorithm according to the noise power, calculate a posterior signal-to-noise ratio according to the frequency domain background noise power, and calculate a prior signal-to-noise ratio based on a direct decision algorithm according to the posterior signal-to-noise ratio.
And the solving unit 262 is configured to solve the equation according to the frequency domain background noise power, the prior signal-to-noise ratio, and the frequency domain masking threshold to obtain the gain of the perceptual filter.
In one embodiment, as shown in fig. 7, on the basis of the above embodiment, the perceptual filter further includes:
and the enhancing module 280 is configured to enhance the noisy speech by using a short-time amplitude spectrum estimation method to obtain an enhanced noisy speech.
The frequency domain converting module 240 converts the noisy speech into a frequency domain to convert the enhanced noisy speech into a frequency domain, and the filtering processing module 270 performs filtering processing on the noisy speech to obtain an enhanced speech, or performs filtering processing on the enhanced noisy speech to obtain an enhanced speech.
The solution preparation unit 261 calculates the frequency domain background noise power from the noise power with an approximation algorithm, specifically by: obtaining the frequency domain gain function G_H(k) of the short-time amplitude spectrum estimation method, where k is the spectrum serial number, and obtaining the frequency domain background noise power P_Z(k) according to P_Z(k) = λ_d(k) - (1 - G_H(k))|Y(k)|^2, where λ_d(k) is the noise power and Y(k) is the frequency domain noisy speech.
In one embodiment, the solution preparation unit 261 calculates the prior signal-to-noise ratio from the posterior signal-to-noise ratio with the direct-decision algorithm, specifically by: obtaining the posterior signal-to-noise ratios of the current frame and the previous frame, γ'(k, l) and γ'(k, l-1) respectively, where k is the spectrum serial number, l is the frame serial number, and the current frame is frame l; obtaining the previous-frame perceptual filter gain G(k, l-1), where if the previous frame is the first frame, the previous-frame perceptual filter gain is a preset value; and obtaining the prior signal-to-noise ratio of the current frame from the posterior signal-to-noise ratios and the perceptual filter gain according to the formula ξ'(k, l) = η·G^2(k, l-1)·γ'(k, l-1) + (1 - η)·max{γ'(k, l) - 1, 0}, where η is a smoothing factor and 0 < η < 1.
The embodiments described above express only several embodiments of the present invention, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and such variations and modifications fall within the scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (6)
1. A method of perceptual filtering, the method comprising:
acquiring a voice with noise, and calculating the voice with noise according to a noise estimation algorithm to obtain noise power;
calculating the voice with noise according to a masking model to obtain a frequency domain masking threshold;
converting the voice with noise into a frequency domain to obtain the voice with noise of the frequency domain, wherein the voice with noise of the frequency domain comprises pure voice of the frequency domain and background noise of the frequency domain;
based on a speech estimation error algorithm, representing the speech signal distortion as a relational expression about the frequency domain pure speech and the gain of a perception filter, and representing the filtering background noise as a relational expression about the frequency domain background noise and the gain of the perception filter, wherein the filtering background noise is ε_Z(k) = |G(k)Z(k)|, G(k) is the perceptual filter gain, and Z(k) is the frequency domain background noise;
obtaining the distortion power of the voice signal according to the distortion of the voice signal;
obtaining the filtering background noise power according to the filtering background noise;
based on the relation that the sum of the distortion power of the speech signal and the power of the filtering background noise is equal to the masking threshold of the frequency domain, obtaining the equation (G(k)-1)^2 P_S(k) + (G(k))^2 P_Z(k) = T(k), where G(k) is the perceptual filter gain, k is the spectrum serial number, P_S(k) is the frequency domain pure speech power, P_Z(k) is the frequency domain background noise power, and T(k) is the frequency domain masking threshold;
according to the voice signal distortion and the filtering background noise, constructing an equation related to the gain of the perception filter based on the relation that the sum of the voice signal distortion power and the filtering background noise power is less than or equal to the frequency domain masking threshold;
obtaining posterior signal-to-noise ratios of a current frame and a previous frame, γ'(k, l) and γ'(k, l-1) respectively, wherein k is a spectrum serial number, l is a frame serial number, and the current frame is frame l; acquiring a previous frame perceptual filter gain G(k, l-1), wherein if the previous frame is a first frame, the previous frame perceptual filter gain is a preset value; obtaining the prior signal-to-noise ratio of the current frame according to the posterior signal-to-noise ratios and the gain of the perception filter by the formula ξ'(k, l) = η·G^2(k, l-1)·γ'(k, l-1) + (1 - η)·max{γ'(k, l) - 1, 0}, wherein η is a smoothing factor and 0 < η < 1;
solving the equation (G(k)-1)^2 P_S(k) + (G(k))^2 P_Z(k) = T(k) to obtain the perceptual filter gain;
and according to the gain of the perception filter, carrying out filtering processing on the voice with noise to obtain enhanced voice.
2. The method of claim 1, wherein the step of solving the equation to obtain the perceptual filter gain comprises:
calculating to obtain the frequency domain background noise power by adopting an approximate algorithm according to the noise power;
calculating to obtain a posterior signal-to-noise ratio according to the frequency domain background noise power;
calculating to obtain a prior signal-to-noise ratio based on a direct decision algorithm according to the posterior signal-to-noise ratio;
and solving the equation according to the frequency domain background noise power, the prior signal-to-noise ratio and the frequency domain masking threshold to obtain the gain of the perception filter.
3. The method according to claim 2, wherein before the step of converting the noisy speech into the frequency domain to obtain frequency-domain noisy speech, the frequency-domain noisy speech comprising frequency-domain clean speech and frequency-domain background noise, further comprising:
the method comprises the following steps of enhancing the voice with noise by adopting a short-time amplitude spectrum estimation method to obtain enhanced voice with noise, converting the voice with noise into a frequency domain to convert the enhanced voice with noise into the frequency domain, filtering the voice with noise to obtain enhanced voice, filtering the enhanced voice with noise to obtain enhanced voice, and calculating the background noise power of the frequency domain by adopting an approximate algorithm according to the noise power:
obtaining the frequency domain gain function G_H(k) based on the short-time amplitude spectrum estimation method, wherein k is a spectrum serial number;
obtaining the frequency domain background noise power P_Z(k) according to P_Z(k) = λ_d(k) - (1 - G_H(k))|Y(k)|^2, wherein λ_d(k) is the noise power and Y(k) is the frequency domain noisy speech.
4. A perceptual filter, the perceptual filter comprising:
the acquisition module is used for acquiring the voice with noise;
the noise power calculation module is used for calculating the voice with noise according to a noise estimation algorithm to obtain noise power;
the masking threshold calculation module is used for calculating the voice with the noise according to a masking model to obtain a frequency domain masking threshold;
the frequency domain conversion module is used for converting the voice with noise into a frequency domain to obtain the voice with noise in the frequency domain, wherein the voice with noise in the frequency domain comprises pure voice in the frequency domain and background noise in the frequency domain;
an equation constructing module for expressing the speech signal distortion as a relational expression about the frequency domain pure speech and the gain of the perception filter and expressing the filtering background noise as a relational expression about the frequency domain background noise and the gain of the perception filter based on the speech estimation error algorithm, wherein the filtering background noise is ε_Z(k) = |G(k)Z(k)|, G(k) is the perceptual filter gain, and Z(k) is the frequency domain background noise; obtaining the distortion power of the speech signal according to the distortion of the speech signal; obtaining the filtering background noise power according to the filtering background noise; obtaining the equation (G(k)-1)^2 P_S(k) + (G(k))^2 P_Z(k) = T(k) based on the relation that the sum of the distortion power of the speech signal and the power of the filtering background noise is equal to the masking threshold of the frequency domain, where G(k) is the perceptual filter gain, k is the spectrum serial number, P_S(k) is the frequency domain pure speech power, P_Z(k) is the frequency domain background noise power, and T(k) is the frequency domain masking threshold; and constructing, according to the speech signal distortion and the filtering background noise, an equation related to the gain of the perception filter based on the relation that the sum of the speech signal distortion power and the filtering background noise power is less than or equal to the frequency domain masking threshold;
the gain solving module is used for acquiring the posterior signal-to-noise ratios of the current frame and the previous frame, γ'(k, l) and γ'(k, l-1) respectively, wherein k is a spectrum serial number, l is a frame serial number, and the current frame is frame l; acquiring a previous frame perceptual filter gain G(k, l-1), wherein if the previous frame is a first frame, the previous frame perceptual filter gain is a preset value; obtaining the prior signal-to-noise ratio of the current frame according to the posterior signal-to-noise ratios and the gain of the perception filter by the formula ξ'(k, l) = η·G^2(k, l-1)·γ'(k, l-1) + (1 - η)·max{γ'(k, l) - 1, 0}, wherein η is a smoothing factor and 0 < η < 1; and solving the equation (G(k)-1)^2 P_S(k) + (G(k))^2 P_Z(k) = T(k) to obtain the perceptual filter gain;
and the filtering processing module is used for filtering the voice with noise according to the gain of the perception filter to obtain enhanced voice.
5. The perceptual filter of claim 4, wherein the gain solving module comprises:
the solving preparation unit is used for calculating to obtain the frequency domain background noise power by adopting an approximation algorithm according to the noise power, calculating to obtain a posterior signal-to-noise ratio according to the frequency domain background noise power, and calculating to obtain a prior signal-to-noise ratio based on a direct decision algorithm according to the posterior signal-to-noise ratio;
and the solving unit is used for solving the equation according to the frequency domain background noise power, the prior signal-to-noise ratio and the frequency domain masking threshold value to obtain the gain of the perceptual filter.
6. The perceptual filter of claim 4, wherein the perceptual filter further comprises:
the enhancement module is used for enhancing the voice with noise by adopting a short-time amplitude spectrum estimation method to obtain the enhanced voice with noise;
the frequency domain conversion module converts the noisy speech into a frequency domain to convert the enhanced noisy speech into the frequency domain;
the filtering processing module is used for filtering the voice with noise to obtain enhanced voice, and filtering the enhanced voice with noise to obtain enhanced voice;
the calculation preparation unit calculates the frequency domain background noise power by adopting an approximation algorithm according to the noise power, and specifically comprises the following steps:
obtaining the frequency domain gain function G based on the short-time amplitude spectrum estimation methodH(k) Wherein k is a frequency spectrum serial number;
according to PZ(k)=λd(k)-(1-GH(k))|Y(k)|2Obtaining the frequency domain background noise power PZ(k) Wherein λ isd(k) Y (k) is the noise power, and Y (k) is the frequency domain noisy speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510031872.9A CN105869649B (en) | 2015-01-21 | 2015-01-21 | Perceptual filtering method and perceptual filter |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510031872.9A CN105869649B (en) | 2015-01-21 | 2015-01-21 | Perceptual filtering method and perceptual filter |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105869649A CN105869649A (en) | 2016-08-17 |
CN105869649B true CN105869649B (en) | 2020-02-21 |
Family
ID=56623456
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510031872.9A Active CN105869649B (en) | 2015-01-21 | 2015-01-21 | Perceptual filtering method and perceptual filter |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105869649B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106448696A (en) * | 2016-12-20 | 2017-02-22 | 成都启英泰伦科技有限公司 | Adaptive high-pass filtering speech noise reduction method based on background noise estimation |
CN109979478A (en) * | 2019-04-08 | 2019-07-05 | 网易(杭州)网络有限公司 | Voice de-noising method and device, storage medium and electronic equipment |
JP7449803B2 (en) * | 2020-07-22 | 2024-03-14 | 三菱重工業株式会社 | Abnormality factor estimation method, abnormality factor estimation device, and program |
CN112951262B (en) * | 2021-02-24 | 2023-03-10 | 北京小米松果电子有限公司 | Audio recording method and device, electronic equipment and storage medium |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003514264A (en) * | 1999-11-15 | 2003-04-15 | ノキア コーポレイション | Noise suppression device |
CN1684143A (en) * | 2004-04-14 | 2005-10-19 | 华为技术有限公司 | Method for strengthening sound |
JP2014232331A (en) * | 2007-07-06 | 2014-12-11 | オーディエンス,インコーポレイテッド | System and method for adaptive intelligent noise suppression |
CN103824562A (en) * | 2014-02-10 | 2014-05-28 | 太原理工大学 | Psychological acoustic model-based voice post-perception filter |
Non-Patent Citations (1)
Title |
---|
Two-stage speech enhancement algorithm combining human auditory perception; Zhang Yong et al.; Journal of Signal Processing (《信号处理》); 2014-04-30; pp. 1-11 *
Also Published As
Publication number | Publication date |
---|---|
CN105869649A (en) | 2016-08-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105741849B (en) | The sound enhancement method of phase estimation and human hearing characteristic is merged in digital deaf-aid | |
CN102903368B (en) | Method and equipment for separating convoluted blind sources | |
CN105869649B (en) | Perceptual filtering method and perceptual filter | |
CN103778920B (en) | Speech enhan-cement and compensating for frequency response phase fusion method in digital deaf-aid | |
CN101976566B (en) | Voice enhancement method and device using same | |
CN108735225A (en) | It is a kind of based on human ear masking effect and Bayesian Estimation improvement spectrum subtract method | |
CN104810024A (en) | Double-path microphone speech noise reduction treatment method and system | |
CN105679330B (en) | Based on the digital deaf-aid noise-reduction method for improving subband signal-to-noise ratio (SNR) estimation | |
CN103761974B (en) | Cochlear implant | |
CN105489226A (en) | Wiener filtering speech enhancement method for multi-taper spectrum estimation of pickup | |
CN106328155A (en) | Speech enhancement method of correcting priori signal-to-noise ratio overestimation | |
CN108665054A (en) | Based on the Mallat algorithms of genetic algorithm optimization threshold value cardiechema signals noise reduction application | |
Min et al. | Mask estimate through Itakura-Saito nonnegative RPCA for speech enhancement | |
WO2019024621A1 (en) | Acoustic echo canceller output voice signal post-processing method and apparatus | |
Zou et al. | Speech signal enhancement based on MAP algorithm in the ICA space | |
CN111543984A (en) | Method for removing ocular artifacts of electroencephalogram signals based on SSDA (steady state data acquisition) | |
CN105869652B (en) | Psychoacoustic model calculation method and device | |
CN107731242A (en) | A kind of gain function sound enhancement method of the spectral amplitude estimation of broad sense maximum a posteriori | |
CN110602621B (en) | Noise reduction method and system for digital hearing aid and special DSP | |
CN108962275A (en) | A kind of music noise suppressing method and device | |
WO2013067714A1 (en) | Method for reducing burst noise | |
KR20130109793A (en) | Audio encoding method and apparatus for noise reduction | |
Inoue et al. | Theoretical analysis of musical noise in generalized spectral subtraction: why should not use power/amplitude subtraction? | |
CN102945674A (en) | Method for realizing noise reduction processing on speech signal by using digital noise reduction algorithm | |
Sun et al. | An RNN-based speech enhancement method for a binaural hearing aid system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |