WO2021007841A1

WO2021007841A1 - Noise estimation method, noise estimation apparatus, speech processing chip and electronic device

Info

Publication number: WO2021007841A1
Application number: PCT/CN2019/096503
Authority: WO
Inventors: 何婷婷; 王鑫山; 朱虎; 李国梁; 郭红敬
Original assignee: 深圳市汇顶科技股份有限公司
Priority date: 2019-07-18
Filing date: 2019-07-18
Publication date: 2021-01-21
Also published as: CN112602150A

Abstract

A noise estimation method. The method comprises: determining an initial estimated noise power spectrum of noisy speech; calculating a smoothing factor according to the probability of the existence of target speech; and updating the initial estimated noise power spectrum according to the noisy speech and the smoothing factor so as to obtain an effective noise power spectrum. By means of the method, the effective noise power spectrum is made as close as possible to the real noise power spectrum, thereby avoiding residual noise and improving the overall performance of speech enhancement.

Description

Noise estimation method, noise estimation device, speech processing chip and electronic equipment

Technical field

The embodiments of the present application relate to the field of signal processing technology, and in particular, to a noise estimation method, a noise estimation device, a speech processing chip, and electronic equipment.

Background art

Voice is an important means of communication between people. With the development of electronic information technology, the forms of communication between people are becoming more and more diversified. In addition to traditional face-to-face communication, they include various types of voice communication, such as phone calls, WeChat voice messages, and video calls. In addition, voice communication is no longer limited to people: in recent years, voice interaction between people and machines, and between machines, has become common in daily life. However, because people or machines are often in noisy public places, the voice is inevitably interfered with by surrounding noise during voice communication or human-computer interaction, such as car noise on the street, air-conditioning noise in the office, machine noise in a factory, and interference from other audio sources in a restaurant. As a result, the receiver no longer receives pure speech, but noisy speech mixed with various noises. These noises seriously interfere with the speech and reduce the product quality of voice communication products. For example, they cause voice distortion during voice communication, leading to communication failure; in a speech recognition system, they cause a sharp drop in the recognition rate, which seriously affects the performance of the system. Noise not only reduces the product quality of voice communication products, but also gives users a poor experience. Therefore, it is particularly important to suppress noise and extract a relatively pure voice signal (also called target speech).

Speech enhancement technology is an important means of suppressing noise. Speech enhancement serves two main purposes: first, it suppresses noise through signal processing to obtain relatively pure speech, thereby improving the intelligibility and comfort of speech and reducing the listening fatigue caused by noise; second, it is a necessary part of various voice communication and voice interaction systems, where it can effectively reduce the bit error rate of voice communication and the error rate of speech recognition, thereby improving the performance of the voice processing system.

Speech enhancement technology is an important branch of the field of signal processing. There are several representative types of speech enhancement technology in the prior art, mainly including non-parametric methods, parametric methods, statistical methods, wavelet transforms, and neural networks. Among the non-parametric methods, the more typical techniques include spectral subtraction and its improved variants. This type of method is widely used because its principle is simple and it is easy to implement. However, non-parametric methods produce severe musical noise under strong noise. Typical parametric methods include the subspace method. The subspace method requires eigenvalue decomposition, which introduces a large amount of computation, so it is rarely used in engineering implementations. The typical representatives of statistical methods are the Minimum Mean Square Error (MMSE) estimator and its improved variants. This type of method is the optimal estimator under the minimum mean square error criterion and can better suppress residual noise, but its principle is more complicated and its hardware cost is higher. The emerging wavelet transform and neural network techniques are still at the research stage and are currently seldom applied in engineering implementations.

In fact, all of the above speech enhancement technologies assume that the noise is known. However, in the actual enhancement process, the noise characteristics cannot be obtained in advance. Instead, the noise has to be estimated from the noisy speech, and the accuracy of the noise estimate directly affects the overall performance of speech enhancement. Therefore, there is an urgent need for an effective noise estimation method to improve the overall performance of speech enhancement.

Summary of the invention

In view of this, the embodiments of the present application provide a noise estimation method, a noise estimation device, a speech processing chip, and an electronic device to overcome the above-mentioned defects in the prior art.

The embodiment of the present application provides a noise estimation method, which includes:

determining the initial estimated noise power spectrum of noisy speech;

calculating a smoothing factor according to the probability of the existence of the target speech; and

updating the initial estimated noise power spectrum according to the noisy speech and the smoothing factor to obtain an effective noise power spectrum.

An embodiment of the present application provides a noise estimation device, which includes:

The initial noise estimation unit is used to determine the initial estimated noise power spectrum of noisy speech;

The noise update unit is configured to update the initial estimated noise power spectrum according to the noisy speech and a smoothing factor to obtain an effective noise power spectrum, where the smoothing factor is calculated according to the probability of the existence of the target speech.

An embodiment of the present application provides a voice processing chip, which includes the noise estimation device in any embodiment of the present application.

An embodiment of the present application provides an electronic device, which includes the voice processing chip in any embodiment of the present application.

In the embodiments of the present application, the initial estimated noise power spectrum of the noisy speech is determined; the smoothing factor is calculated according to the probability of the existence of the target speech; and the initial estimated noise power spectrum is updated according to the noisy speech and the smoothing factor to obtain the effective noise power spectrum. This makes the effective noise power spectrum as close as possible to the real noise power spectrum, so that noise is eliminated as far as possible in the subsequent noise elimination process, residual noise is avoided, and the overall performance of speech enhancement is improved.

Description of the drawings

Hereinafter, some specific embodiments of the present application will be described in detail in an exemplary but not restrictive manner with reference to the accompanying drawings. The same reference numerals in the drawings indicate the same or similar components or parts. Those skilled in the art should understand that these drawings are not necessarily drawn to scale. In the drawings:

FIG. 1 is a schematic diagram of the structure of a speech enhancement system in Embodiment 1 of this application;

FIG. 2 is a schematic flowchart of a speech enhancement method in Embodiment 2 of this application;

FIGS. 3 and 4 are, respectively, the first and second schematic diagrams of the mapping curves between the posterior probability of the existence of the target speech and the smoothing factor.

Detailed description

The implementation of any technical solution of the embodiments of the present application does not necessarily need to achieve all the above advantages at the same time.

The specific implementation of the embodiments of the present application will be further described below in conjunction with the drawings of the embodiments of the present application.

Fig. 1 is a schematic structural diagram of a speech enhancement system in Embodiment 1 of this application. As shown in Fig. 1, the noise estimation scheme of this application is applied in the speech enhancement system. However, the specific structure of the speech enhancement system in this embodiment is only an example and is not a limitation. In fact, those of ordinary skill in the art can also simplify some of the modules according to the needs of the application scenario, or add other modules on this basis. The functions of the modules can also be integrated with each other.

As shown in Figure 1, in this embodiment, the voice enhancement system includes: a collection module, a preprocessing module, a voice enhancement device, a restoration module, and an output module.

(1) Collection module

In this embodiment, the collection module may specifically be a voice receiving device such as a microphone. It is mainly used to collect the target speech generated by the sound source of interest; in doing so it also picks up environmental noise and interference from other sound sources, so that what is obtained is noisy speech that includes both target speech and noise. The collection module also performs processing such as sampling and encoding on the noisy speech and converts it into a binary code group, that is, noisy speech in digital form, or simply the original digital noisy speech.

(2) Preprocessing module

The preprocessing module is used to sequentially perform windowing and framing, pre-emphasis, and fast Fourier transform (FFT) processing on the noisy speech, and finally convert the noisy speech from the time domain to the frequency domain. The preprocessing module includes but is not limited to the above processing steps.

Further, in this embodiment, the pre-processing module may include a windowing unit, a pre-emphasis unit, and an FFT unit, but is not limited to the above processing unit.

Specifically, in this embodiment, the windowing unit is mainly used for windowing and framing the input noisy speech through a window function. In view of the short-term stationarity of the target speech, the duration of each frame of noisy speech is between 10 and 30 ms. In addition, in order to maintain a smooth transition from frame to frame, adjacent frames of noisy speech are overlapped and windowed, with an overlap of 50%. The window function is selected according to the application scenario, for example a rectangular window, a Hamming window, or a Kaiser window.
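The windowing and framing step described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the 16 kHz sampling rate, and the 20 ms frame length are assumptions chosen for the example; only the Hamming window and the 50% overlap come from the text.

```python
import math


def hamming(length):
    """Symmetric Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def frame_signal(samples, frame_len, overlap=0.5):
    """Split samples into windowed frames.

    Adjacent frames overlap by the given fraction (50% in the text).
    A trailing partial frame is dropped, which is an assumption of this sketch.
    """
    hop = max(1, int(frame_len * (1.0 - overlap)))
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + i] * window[i] for i in range(frame_len)])
    return frames


# Assumed values: 16 kHz sampling, 20 ms frames -> 320 samples per frame.
frames = frame_signal([1.0] * 1600, frame_len=320)
```

With a 320-sample frame and 50% overlap the hop is 160 samples, so each input sample is covered by two frames except at the edges.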

Specifically, in this embodiment, the pre-emphasis unit performs pre-emphasis processing on each frame of noisy speech after windowing and framing, so as to enhance the high-frequency components of the noisy speech and at the same time remove the influence of lip radiation. The pre-emphasis unit can be specifically, but not limited to, a high-pass filter.
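As an illustration of pre-emphasis, the sketch below uses a first-order high-pass filter y[n] = x[n] - a·x[n-1]. The text only says the unit may be a high-pass filter; the first-order form, the function name, and the coefficient 0.97 are assumptions commonly used in speech processing.

```python
def pre_emphasis(frame, coeff=0.97):
    """First-order high-pass pre-emphasis: y[n] = x[n] - coeff * x[n-1].

    The coefficient is an assumed, typical value; the source does not give one.
    """
    if not frame:
        return []
    out = [frame[0]]  # no previous sample for the first element
    for i in range(1, len(frame)):
        out.append(frame[i] - coeff * frame[i - 1])
    return out
```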

Specifically, in this embodiment, the FFT unit performs a fast Fourier transform on each frame of noisy speech after pre-emphasis to obtain the frequency domain signal of each frame of noisy speech, so that noise reduction can be performed on the noisy speech in the frequency domain.

(3) Voice enhancement device

In this embodiment, the speech enhancement device is mainly used to estimate the noise in the noisy speech in the frequency domain, and further remove the noise from the noisy speech by means of filtering.

Specifically, the speech enhancement device includes a noise estimation device, a noise update control module, and a filtering module. In addition, since this embodiment performs noise estimation and filter coefficient calculation based on the noisy speech power spectrum, the speech enhancement device also includes a power spectrum calculation module. Therefore, the speech enhancement device actually includes four main modules: a power spectrum calculation module, a noise estimation device, a noise update control module, and a filtering module. Here, it should be noted that the speech enhancement device does not necessarily include the power spectrum calculation module and the filtering module. In fact, those of ordinary skill in the art can also place the power spectrum calculation module and the filtering module on other modules of the speech enhancement system as needed, or implement them as independent modules.

The power spectrum calculation module is configured to calculate the power spectrum of the noisy speech according to the spectrum of the noisy speech, and the initial noise estimation unit is further configured to determine the initial estimated noise power spectrum of the noisy speech according to the power spectrum of the noisy speech.
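The power spectrum calculation can be sketched as the squared magnitude of each spectral bin. The naive DFT below stands in for the FFT unit only to keep the example self-contained; the function names are illustrative and a real implementation would use an FFT routine.

```python
import cmath


def dft(frame):
    """Naive O(N^2) discrete Fourier transform, used here in place of an FFT."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def power_spectrum(spectrum):
    """Per-bin power: real part squared plus imaginary part squared."""
    return [z.real * z.real + z.imag * z.imag for z in spectrum]


# A unit impulse has a flat spectrum, so every bin has power 1.
impulse_power = power_spectrum(dft([1.0, 0.0, 0.0, 0.0]))
```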

The noise update control module is used to calculate the smoothing factor according to the posterior probability of the target voice.

The noise estimation device is used to determine the initial estimated noise power spectrum according to the noisy speech power spectrum, and to update or correct the initial estimated noise power spectrum according to the smoothing factor output by the noise update control module to obtain the effective noise power spectrum.

The filtering module is configured to calculate filter coefficients according to the effective noise power spectrum; according to the filter coefficients, the real part and the imaginary part of the noisy speech spectrum are respectively filtered to obtain an enhanced speech spectrum.

As mentioned earlier, the noise in the noisy speech is estimated in the frequency domain, and the noise is further eliminated from the noisy speech through frequency domain filtering.

(3.1) Noise estimation device

Specifically, in order to make the effective noise power spectrum closer to the real noise power spectrum, in this embodiment the noise estimation device adopts a two-step estimation, namely, determining the initial estimated noise power spectrum and then updating the initial estimated noise power spectrum to obtain the effective noise power spectrum.

As shown in Figure 1, the noise estimation device includes an initial noise estimation unit and a noise update unit. The initial noise estimation unit is used to: perform windowing on the noisy speech power spectrum, that is, smoothing between frequency points; smooth the windowed noisy speech power spectrum across preceding and following frames, that is, inter-frame smoothing, to obtain a smoothed noisy speech power spectrum; and search the inter-frame-smoothed noisy speech power spectrum for the minimum power spectrum within a certain time window, the minimum power spectrum found being used as the initial estimated noise power spectrum. The noise update unit is configured to update the initial estimated noise power spectrum according to the smoothing factor output by the noise update control module to obtain the effective noise power spectrum.
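The three operations of the initial noise estimation unit (smoothing between frequency points, inter-frame smoothing, minimum search within a time window) can be sketched as below. The three-tap window, the inter-frame coefficient 0.7, the 50-frame search window, and all names are assumptions for illustration; the source does not specify them.

```python
from collections import deque


def smooth_across_bins(power, window=(0.25, 0.5, 0.25)):
    """Smooth a power spectrum between frequency points with a short window.

    Edge bins reuse the nearest valid bin (clamping), an assumption of this sketch.
    """
    half = len(window) // 2
    n = len(power)
    out = []
    for k in range(n):
        acc = 0.0
        for j, w in enumerate(window):
            idx = min(max(k + j - half, 0), n - 1)
            acc += w * power[idx]
        out.append(acc)
    return out


class MinimumTracker:
    """Tracks the per-bin minimum of the smoothed power over recent frames."""

    def __init__(self, alpha=0.7, window_frames=50):
        self.alpha = alpha          # assumed inter-frame smoothing coefficient
        self.smoothed = None        # inter-frame-smoothed power spectrum
        self.history = deque(maxlen=window_frames)

    def update(self, power):
        """Feed one frame's power spectrum; return the initial noise estimate."""
        current = smooth_across_bins(power)
        if self.smoothed is None:
            self.smoothed = current
        else:
            self.smoothed = [self.alpha * prev + (1.0 - self.alpha) * cur
                             for prev, cur in zip(self.smoothed, current)]
        self.history.append(self.smoothed)
        # Per-bin minimum over the frames still inside the search window.
        return [min(column) for column in zip(*self.history)]
```

Storing every frame in the window is the simplest way to express the minimum search; an implementation concerned with memory would more likely keep sub-window minima instead.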

In addition, those of ordinary skill in the art can also choose other methods to determine the initial estimated noise power spectrum, such as quantiles, histograms, or time-recursive averaging, taking into account hardware overhead, algorithm simplicity, application scenarios, and algorithm performance. Since the initial noise estimation unit provides only a rough estimate of the noise in the noisy speech, there is still a large deviation between the initial estimated noise power spectrum and the real noise power spectrum; generally, the initial estimated noise power spectrum is smaller than the real noise power spectrum. As mentioned earlier, the accuracy of noise estimation directly affects the performance of the subsequent filter and the overall performance of the speech enhancement system. Therefore, this application adds a noise update unit to correct (or update) the initial estimated noise power spectrum, so that the effective noise power spectrum obtained after the correction is as close as possible to the real noise power spectrum. This effectively solves the problem of large residual noise in the filtered enhanced speech and improves the overall performance of the speech enhancement system.

(3.2) Noise update control module

In this embodiment, the smoothing factor is calculated in real time for each frame of noisy speech, and the initial estimated noise power spectrum is updated with the smoothing factor to obtain the effective noise power spectrum. The effective noise power spectrum is closer to the real noise power spectrum, which solves the problem that the initial estimated noise power spectrum is small compared with the real noise power spectrum. At the same time, since the smoothing factor is a function of the posterior probability of the existence of the target speech, the size of the smoothing factor can be controlled according to the posterior probability of the existence of the target speech at each frequency point in the current frame, which effectively avoids the problem of the effective noise power spectrum being too large compared with the real noise power spectrum. Therefore, adding a noise update control module effectively solves both the excessive residual noise caused by a small initial estimated noise power spectrum during speech enhancement and the speech loss caused by a large effective noise power spectrum. A specific noise update control module is described below.
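The exact update rule is not given in this passage. One common form, assumed here purely for illustration, is a per-bin first-order recursive average in which the smoothing factor weights the existing noise estimate against the current noisy speech power. The function name and the formula are assumptions, not the applicant's stated method.

```python
def update_noise(initial_noise, noisy_power, smoothing):
    """Per-bin update: alpha * initial estimate + (1 - alpha) * noisy power.

    Assumed recursive-averaging form. A smoothing factor near 1 leaves the
    estimate almost unchanged; a smaller factor pulls it toward the power of
    the current noisy frame.
    """
    return [a * n + (1.0 - a) * y
            for n, y, a in zip(initial_noise, noisy_power, smoothing)]
```

Under this assumed form, bins where speech is likely present should receive a smoothing factor near 1 so that speech energy does not leak into the noise estimate, which matches the text's concern about the effective noise power spectrum becoming too large.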

As shown in Figure 1, the noise update control module includes a likelihood ratio calculation unit, a prior probability calculation unit for the existence of the target speech, a posterior probability calculation unit for the existence of the target speech, and a smoothing factor calculation unit. Specifically, in this embodiment, these four units all carry out their respective processing based on the power spectrum of the noisy speech in the frequency domain.

The specific structure of the noise update control module in this embodiment is only an example and not a limitation. In fact, a person of ordinary skill in the art can also simplify some of the modules according to the needs of the application scenario, or add other modules on this basis. The functions of the modules can also be integrated with each other.

Specifically, in this embodiment, the likelihood ratio calculation unit is used to calculate the likelihood ratio according to the probability density distribution function of the noisy speech spectrum when the target speech is assumed to exist and the probability density distribution function of the noisy speech spectrum when the target speech is assumed to be absent. Further, when calculating the likelihood ratio from these two probability density distribution functions, the likelihood ratio calculation unit replaces the target speech power spectrum with the estimated enhanced speech power spectrum and replaces the real noise power spectrum with the initial estimated noise power spectrum, and uses the noisy speech power spectrum, the enhanced speech power spectrum, and the initial estimated noise power spectrum to calculate the likelihood ratio. The specific method by which the likelihood ratio calculation unit calculates the likelihood ratio depends on the probability density distribution characteristics of the target speech spectrum and the noise spectrum in the specific application scenario. For details, please refer to the description of the following method embodiments.
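As one concrete instance, if both the target speech spectrum and the noise spectrum are assumed to follow zero-mean complex Gaussian distributions (an assumption of this sketch; the text leaves the distribution to the application scenario), the ratio of the two densities reduces to exp(γξ/(1+ξ))/(1+ξ), where ξ is the enhanced speech power divided by the initial noise estimate and γ is the noisy speech power divided by the initial noise estimate. The function name, the `eps` guard, and the exponent cap are illustrative.

```python
import math


def likelihood_ratio(noisy_power, enhanced_power, noise_power,
                     eps=1e-12, max_exponent=50.0):
    """Per-bin likelihood ratio under an assumed complex Gaussian model.

    enhanced_power stands in for the target speech power and noise_power for
    the real noise power, as the text describes. The exponent is capped to
    avoid floating-point overflow (a practical guard, not from the source).
    """
    ratios = []
    for y, x, n in zip(noisy_power, enhanced_power, noise_power):
        n = max(n, eps)
        xi = x / n       # speech-to-noise power ratio
        gamma = y / n    # noisy-to-noise power ratio
        exponent = min(gamma * xi / (1.0 + xi), max_exponent)
        ratios.append(math.exp(exponent) / (1.0 + xi))
    return ratios
```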

Specifically, the prior probability calculation unit for the existence of the target speech is used to determine the effective prior probability of the existence of the target speech according to the power spectrum of the noisy speech, so as to determine the possibility that the target speech exists in the noisy speech.

Further, the prior probability calculation unit for the existence of the target speech calculates the effective prior probability of the existence of the target speech in two steps: in the first step, it preliminarily judges, according to the power spectrum of the noisy speech, whether the target speech exists in the noisy speech of the current frame; in the second step, it determines the estimated prior probability of the existence of the target speech according to the result of that preliminary judgment, and determines the effective prior probability of the existence of the target speech according to the estimated prior probability.

Further, the prior probability calculation unit for the existence of the target speech is further configured to: if the target speech does not exist, perform inter-frequency smoothing on the power spectrum of the noisy speech without the target speech to obtain the inter-frequency-smoothed noisy speech power spectrum; or, if the target speech exists, use the historical inter-frame-smoothed noisy speech power spectrum as the inter-frequency-smoothed noisy speech power spectrum; perform inter-frame smoothing on the inter-frequency-smoothed noisy speech power spectrum to obtain the inter-frame-smoothed noisy speech power spectrum; and determine the estimated prior probability of the existence of the target speech according to the inter-frame-smoothed noisy speech power spectrum. For the noisy speech of the current frame, the historical inter-frame-smoothed noisy speech power spectrum may simply be the one obtained for the noisy speech of the previous frame. Of course, it is not limited to the previous frame; in practice it can be selected flexibly according to the requirements of the application scenario.

Further, the prior probability calculation unit for the existence of the target speech further determines the first detection factor of each frequency point in the current frame according to the power spectrum of the noisy speech and the minimum power spectrum of the inter-frame-smoothed noisy speech, and determines the second detection factor of each frequency point of the current frame according to the power spectrum of the noisy speech after windowing and inter-frame smoothing and the minimum power spectrum of the inter-frame-smoothed noisy speech. Based on the first detection factor and the second detection factor calculated at each frequency point of the current frame, it is preliminarily determined whether the target speech exists at each frequency point of the noisy speech.

If the first detection factor calculated at a certain frequency point of the noisy speech in the current frame is less than the set first detection factor threshold, and the second detection factor is less than the set second detection factor threshold, it is preliminarily determined that the noisy speech does not contain the target speech at this frequency point; otherwise, it is preliminarily determined that the noisy speech contains the target speech at this frequency point.

Further, the prior probability calculation unit for the existence of the target speech uses a defined indicator function to indicate whether the target speech exists at a given frequency point of the noisy speech in the current frame: when it is determined that the target speech exists at the frequency point in the current frame, the value of the indicator function at that frequency point is 0, and when it is determined that the target speech does not exist at the frequency point in the current frame, the value of the indicator function at that frequency point is 1.
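The preliminary decision and the indicator function can be sketched as follows. This passage does not define how the detection factors are computed from their inputs; the sketch assumes each is a ratio to the tracked minimum power, and the threshold values are placeholders. Only the decision rule (both factors below their thresholds means speech absent) and the 0/1 convention come from the text.

```python
def speech_absence_indicator(noisy_power, smoothed_power, min_power,
                             threshold1=4.0, threshold2=2.0, eps=1e-12):
    """Per-bin indicator: 1 when target speech is judged absent, 0 when present.

    factor1 and factor2 are assumed to be power ratios relative to the
    tracked minimum; the thresholds are hypothetical placeholder values.
    """
    flags = []
    for y, s, m in zip(noisy_power, smoothed_power, min_power):
        m = max(m, eps)
        factor1 = y / m   # first detection factor (assumed ratio form)
        factor2 = s / m   # second detection factor (assumed ratio form)
        absent = factor1 < threshold1 and factor2 < threshold2
        flags.append(1 if absent else 0)
    return flags
```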

Further, the prior probability calculation unit for the existence of the target speech is further configured to perform inter-frequency smoothing (also called the first smoothing) on the noisy speech of the current frame according to the values of the indicator function calculated at each frequency point of the current frame. If the values of the indicator function at the frequency points of the current frame are not all zero, it is determined that the target speech does not exist in the frame, and the indicator function and the window function are then used to smooth, between frequency points, the power spectrum of the noisy speech without the target speech; further, inter-frame smoothing (also called the second smoothing) is performed on the noisy speech power spectrum after the first smoothing to obtain the twice-smoothed noisy speech power spectrum. If the values of the indicator function at the frequency points of the current frame are all zero, it is determined that the target speech exists in the frame, and the twice-smoothed noisy speech power spectrum obtained for the previous frame is used as the twice-smoothed noisy speech power spectrum of the current frame.

Further, the prior probability calculation unit for the existence of the target speech is further configured to: determine the third detection factor at each frequency point of the current frame according to the noisy speech power spectrum and the minimum power spectrum of the twice-smoothed noisy speech power spectrum; determine the fourth detection factor at each frequency point of the current frame according to the twice-smoothed noisy speech power spectrum and its minimum power spectrum; and determine, according to the third detection factor and the fourth detection factor at each frequency point, the estimated prior probability of the existence of the target speech at the corresponding frequency point.

Further, the prior probability calculation unit for the existence of the target speech is further configured to compare the third detection factor and the fourth detection factor calculated at each frequency point of the current frame with corresponding thresholds, and to determine, according to the different comparison results, the estimated prior probability of the existence of the target speech at the corresponding frequency point in the current frame.

Further, the prior probability calculation unit for the existence of the target speech is further configured to compare the estimated prior probability of the existence of the target speech at each frequency point of the current frame with the minimum value of the prior probability of the existence of the target speech, and to take the larger of the two as the effective prior probability of the existence of the target speech at that frequency point in the current frame. In this way, the effective prior probability of the existence of the target speech is obtained at each frequency point of each frame of noisy speech.

Specifically, in this embodiment, the posterior probability calculation unit for the existence of the target speech is configured to determine the posterior probability of the existence of the target speech according to the likelihood ratio and the prior probability of the existence of the effective target speech.
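Combining a likelihood ratio with a prior probability follows directly from Bayes' rule: with prior q and likelihood ratio Λ, the posterior is qΛ / (qΛ + 1 - q). The sketch below also shows the flooring of the estimated prior at a minimum value to obtain the effective prior. The function names and the floor value 0.005 are assumptions for illustration.

```python
def effective_prior(estimated_prior, prior_floor=0.005):
    """Take the larger of the estimated prior and a minimum value, per bin.

    The floor value is a hypothetical placeholder.
    """
    return [max(q, prior_floor) for q in estimated_prior]


def posterior_probability(likelihood_ratios, priors):
    """Per-bin posterior of speech presence from Bayes' rule.

    posterior = q * L / (q * L + (1 - q)), where q is the effective prior and
    L is the likelihood ratio p(Y | speech present) / p(Y | speech absent).
    """
    posteriors = []
    for ratio, q in zip(likelihood_ratios, priors):
        numerator = q * ratio
        denominator = numerator + (1.0 - q)
        # Degenerate case (q == 1 and ratio == 0): report 0 rather than divide by 0.
        posteriors.append(numerator / denominator if denominator > 0.0 else 0.0)
    return posteriors
```

Flooring the prior keeps the posterior from being pinned at zero, so a sufficiently large likelihood ratio can still indicate speech in a bin the prior stage judged unlikely.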

Specifically, in this embodiment, the smoothing factor calculation unit is used to determine the mapping model between the posterior probability of the existence of the target speech and the smoothing factor according to different noise reduction scenarios; the posterior probability of the existence of the target speech is used as the input of the mapping model, and the smoothing factor is the output of the mapping model. In this embodiment, the smoothing factor is a function of the posterior probability of the existence of the target speech. For the frequency domain speech signal corresponding to each frame of noisy speech, the posterior probability of the existence of the target speech can be calculated at each frequency point. The posterior probabilities at different frequency points can be mapped to different smoothing factors, and the smoothing factors obtained by the mapping are then used to correct the initial estimated noise power spectrum at each frequency point.

Specifically, a mapping table between the posterior probability of the existence of the target voice and the smoothing factor used in the noise update can be established. In the implementation process, the amount of calculation can be reduced by using a table lookup method, thereby saving hardware resource overhead.
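A table lookup of this kind can be sketched as below. The mapping curves themselves are shown in the drawings and are not reproduced here, so the linear curve, the table size, the lower bound 0.8, and the function names are all assumptions used only to show the lookup mechanism.

```python
def build_smoothing_table(size=11, alpha_min=0.8):
    """Precompute smoothing factors for evenly spaced posterior probabilities.

    Assumed linear mapping: alpha = alpha_min + (1 - alpha_min) * p, so a
    posterior of 1 gives a factor of 1. The real curve is scenario dependent.
    """
    return [alpha_min + (1.0 - alpha_min) * i / (size - 1) for i in range(size)]


def lookup_smoothing(table, posterior):
    """Quantize the posterior to the nearest table entry and return its factor."""
    p = min(max(posterior, 0.0), 1.0)          # clamp to [0, 1]
    index = int(round(p * (len(table) - 1)))
    return table[index]


table = build_smoothing_table()
```

The table replaces a per-bin function evaluation with an index computation, which is where the saving in computation comes from; the table size trades memory against quantization error.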

(3.3) Filter module

In this embodiment, the filtering module is configured to calculate filter coefficients according to the effective noise power spectrum, and to filter the noisy speech according to the filter coefficients to obtain an enhanced speech spectrum. Specifically, the filter coefficients are calculated from the target speech power spectrum (also called the enhanced speech power spectrum) calculated for the previous frame of noisy speech, the effective noise power spectrum of the current frame of noisy speech, and the noisy speech power spectrum of the current frame of noisy speech; according to the filter coefficients, the real and imaginary parts of the noisy speech spectrum of the current frame are filtered separately to obtain the enhanced speech spectrum.

Further, as shown in Fig. 1, the filtering module may include a filter coefficient calculation unit and a filter unit. The filter coefficient calculation unit calculates the filter coefficients based on the effective noise power spectrum of the noisy speech in the previous frame and the current frame, the target speech power spectrum (also called the enhanced speech power spectrum) calculated for the noisy speech in the previous frame, and the noisy speech power spectrum of the current frame of noisy speech. The filter unit filters the real part and the imaginary part of the noisy speech spectrum of the current frame separately according to the filter coefficients to obtain the enhanced speech spectrum. In this embodiment, the filter may be a Wiener filter, an MMSE estimator, etc.
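For the Wiener filter case, a hedged sketch of the coefficient calculation and the filtering is given below. It assumes a decision-directed estimate of the a priori SNR, which is one common way to combine the previous frame's enhanced speech power with the current frame's noisy power and noise estimate; the source does not state that this estimator is used. The weight 0.98 and all names are illustrative.

```python
def wiener_gains(prev_speech_power, noise_power, noisy_power,
                 beta=0.98, eps=1e-12):
    """Per-bin Wiener gain G = xi / (1 + xi).

    xi is an a priori SNR estimated with an assumed decision-directed rule:
    xi = beta * prev_speech / noise + (1 - beta) * max(noisy / noise - 1, 0).
    """
    gains = []
    for x_prev, n, y in zip(prev_speech_power, noise_power, noisy_power):
        n = max(n, eps)
        posterior_snr = y / n
        prior_snr = beta * x_prev / n + (1.0 - beta) * max(posterior_snr - 1.0, 0.0)
        gains.append(prior_snr / (1.0 + prior_snr))
    return gains


def apply_gains(spectrum, gains):
    """Scale the real and imaginary parts of each bin separately by its gain."""
    return [complex(g * z.real, g * z.imag) for z, g in zip(spectrum, gains)]
```

Because the gain is real-valued, scaling the real and imaginary parts separately changes only the magnitude of each bin and leaves its phase untouched.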

(4) Restoration module

In this embodiment, the restoration module is mainly used to restore the noise-reduced enhanced speech from the frequency domain back to the time domain, while eliminating the influence of some operations of the preprocessing module.

Specifically, the restoration module includes an Inverse Fast Fourier Transform (IFFT) unit, a de-emphasis unit, and a window-removing unit. The IFFT unit performs an IFFT operation on the enhanced speech spectrum, and restores the enhanced speech from the frequency domain back to the time domain to obtain the time domain waveform of the enhanced speech. The de-emphasis unit is mainly used to eliminate the influence of the high-pass filter in the pre-emphasis process. The de-emphasis unit is mainly realized by a low-pass filter; the window-removing unit is mainly used to remove the effect of the previous windowing. The enhanced time-domain speech is restored to the original time-domain sequence, and the influence of the windowing operation on the amplitude is also removed. For this reason, in this embodiment, the windowing unit and the window removing unit are preferably designed at the same time.

(5) Output module

In this embodiment, the output module performs related operations such as decoding and transmission of the time-domain binary code group input by the restoration module, and then plays it through the speaker.

Here, it should be noted that the embodiment in FIG. 1 above is an exemplary explanation of the application of the noise estimation device in the embodiment of the present application from a system perspective, and is not a unique limitation. In addition, according to the needs of the application scenario, in the embodiment shown in FIG. 1, further or specific technical implementation manners are only examples, and are not uniquely limited.

Fig. 2 is a schematic flowchart of the speech enhancement method in the second embodiment of this application; it corresponds to the structure of the speech enhancement system in Fig. 1; specifically, in this embodiment, the following steps are included:

S201. The collection module collects noisy speech;

In this embodiment, the collected noisy speech is represented by the following formula (1).

y(n)=x(n)+n(n) (1)

Among them, y(n) is the collected noisy speech, x(n) is the target speech, n(n) is the noise, and the n in parentheses represents the sampling time sequence.

S202. The preprocessing module preprocesses the noisy speech to transform the noisy speech into the frequency domain.

In this embodiment, step S202 specifically includes steps S212-S232:

S212. The windowing unit performs windowing and framing on the noisy speech through a window function;

S222. The pre-emphasis unit performs pre-emphasis processing on each frame of noisy speech after windowing and framing;

S232. The FFT unit performs fast Fourier transform on each frame of noisy speech after pre-emphasis to transform the noisy speech into the frequency domain.

After the above steps S212-S232 are processed, the frequency domain signal of the λth frame of noisy speech is obtained, as shown in formula (2):

Among them, Y(λ,k) represents the frequency spectrum of the noisy speech in the λth frame, X(λ,k) represents the frequency spectrum of the target speech in the λth frame, N(λ,k) represents the frequency spectrum of the noise in the λth frame, and k represents the frequency points of the frequency domain signal, 0≤k≤N-1. w(l-m) is the window function used in the windowing operation, where m is the parameter representing the position of the window, l is the parameter representing the window length, and N is the number of FFT points. In a specific application scenario, the window function satisfies the following characteristic.

w²(M) + w²(M+L) = 1 (3)

Among them, L is the specific length of each frame of noisy speech participating in the windowing operation, that is, the specific window length, and M represents the specific position of the window, that is, in the above formula, l=L, m=M.

S203: The power spectrum calculation module calculates the power spectrum of the noisy speech;

In this embodiment, the noisy speech power spectrum |Y(λ,k)|² of the λth frame can be obtained by squaring the real and imaginary parts of the noisy speech spectrum Y(λ,k) of that frame and adding the squares. However, in some application scenarios, considering that calculating and storing |Y(λ,k)|² occupies a lot of hardware resources, the noisy speech modulus |Y(λ,k)| can be used in place of the noisy speech power spectrum; that is, the square root of the noisy speech power spectrum is taken to obtain the noisy speech modulus |Y(λ,k)|.
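As a minimal Python sketch of step S203, the power spectrum and the modulus can be computed per frequency point from the real and imaginary parts. The function names and the list-based representation of a frame are illustrative assumptions.

```python
import math


def power_spectrum(real, imag):
    """|Y(λ,k)|² for one frame: square the real and imaginary parts and add them."""
    return [re * re + im * im for re, im in zip(real, imag)]


def modulus_spectrum(real, imag):
    """|Y(λ,k)| for one frame: square root of the power spectrum."""
    return [math.sqrt(p) for p in power_spectrum(real, imag)]
```

The modulus has a smaller dynamic range than the power spectrum, which is one reason it can be cheaper to store in fixed-point hardware.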

S204a. The initial noise estimation unit determines the initial estimated noise power spectrum according to the noisy speech power spectrum.

In this embodiment, step S204a specifically includes the following steps when determining the initial estimated noise power spectrum:

S214a: Perform windowing processing on the noisy speech power spectrum, that is, perform smoothing processing on the noisy speech power spectrum between frequency points;

P_w(λ,k) = cov(|Y(λ,k)|², hamming(n)) (4)

Among them, hamming(n) is the normalized Hamming window, cov is the convolution operation, P_w(λ,k) is the windowed noisy speech power spectrum of the λth frame, n is the parameter representing the window length, and k represents the frequency points.

S224a: Perform inter-frame smoothing processing on the windowed noisy speech power spectrum;

P(λ,k) = α₁P(λ-1,k) + (1-α₁)P_w(λ,k) (5)

Where α₁ is the smoothing factor; P(λ-1,k) represents the noisy speech power spectrum of the (λ-1)th frame after windowing and inter-frame smoothing; P_w(λ,k) represents the windowed noisy speech power spectrum of the λth frame; and P(λ,k) represents the result of inter-frame smoothing of P_w(λ,k), that is, the smoothed noisy speech power spectrum of the λth frame.
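Steps S214a and S224a can be sketched in Python as below. The sketch assumes an odd window length, replicates the edge bins when the window extends past the spectrum, and uses illustrative function names; none of these details are specified in this application.

```python
import math


def normalized_hamming(length):
    """Hamming window scaled so that its coefficients sum to 1."""
    w = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (length - 1)) for i in range(length)]
    total = sum(w)
    return [v / total for v in w]


def smooth_over_frequency(power, window):
    """Formula (4): convolve |Y(λ,k)|² with the window along the frequency axis."""
    half = len(window) // 2
    last = len(power) - 1
    out = []
    for k in range(len(power)):
        acc = 0.0
        for i, w in enumerate(window):
            idx = min(max(k + i - half, 0), last)  # replicate edge bins (assumption)
            acc += w * power[idx]
        out.append(acc)
    return out


def smooth_over_frames(previous, windowed, alpha1):
    """Formula (5): P(λ,k) = α₁·P(λ-1,k) + (1-α₁)·P_w(λ,k)."""
    return [alpha1 * p + (1.0 - alpha1) * pw for p, pw in zip(previous, windowed)]
```

Because the window is normalized, a flat spectrum passes through the frequency smoothing unchanged, so the smoothing does not bias the overall power level.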

S234a: Perform a minimum power spectrum search on the noisy speech power spectrum after being windowed and smoothed between the preceding and following frames (or called the smoothed noisy speech power spectrum) in a certain time window.

In this implementation, the minimum power spectrum found is used as the initial estimated noise power spectrum.

if mod(λ, D) == 0

P_min(λ,k) = min{P_temp(λ-1,k), P(λ,k)} (6)

P_temp(λ,k) = P(λ,k) (7)

else

P_min(λ,k) = min{P_min(λ-1,k), P(λ,k)} (8)

P_temp(λ,k) = min{P_temp(λ-1,k), P(λ,k)} (9)

end

Among them, D is the minimum power spectrum search window length. If D is too small, the noise power spectrum will fluctuate greatly; if D is too large, the initial estimated noise will lag the real noise by a long time delay. Therefore, D is a compromise chosen for the specific application.

It can be seen from the above formulas (6)-(9) that the remainder of dividing the number λ of the noisy speech frame currently being processed by the minimum power spectrum search window length D is calculated, and it is judged whether the remainder is 0. If it is 0, the smoothed noisy speech power spectrum P(λ,k) of the λth frame is saved in the temporary array P_temp(λ,k), and the minimum, at each frequency point k, of the data saved in the temporary array P_temp(λ-1,k) of the (λ-1)th frame and the smoothed noisy speech power spectrum P(λ,k) of the λth frame is taken as the minimum power spectrum P_min(λ,k) of the λth frame. If it is not 0, the minimum, at each frequency point k, of the data saved in the temporary array P_temp(λ-1,k) of the (λ-1)th frame and the smoothed noisy speech power spectrum P(λ,k) of the λth frame is saved in the temporary array P_temp(λ,k) of the current frame; further, the minimum, at each frequency point k, of the minimum power spectrum P_min(λ-1,k) of the (λ-1)th frame and the smoothed noisy speech power spectrum P(λ,k) of the current frame is taken as the minimum power spectrum P_min(λ,k) of the current frame.
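The search described by formulas (6)-(9) can be written as one per-frame update. The following Python sketch follows those four formulas directly; the function and argument names are illustrative assumptions.

```python
def update_minimum(frame_index, search_len, p_smooth, p_min_prev, p_temp_prev):
    """One step of the minimum power spectrum search.

    frame_index is λ, search_len is D, p_smooth is P(λ,k), and the two *_prev
    lists hold P_min(λ-1,k) and P_temp(λ-1,k). Returns (P_min(λ,k), P_temp(λ,k)).
    """
    if frame_index % search_len == 0:
        # Window boundary: formulas (6) and (7).
        p_min = [min(t, p) for t, p in zip(p_temp_prev, p_smooth)]
        p_temp = list(p_smooth)
    else:
        # Inside the window: formulas (8) and (9).
        p_min = [min(m, p) for m, p in zip(p_min_prev, p_smooth)]
        p_temp = [min(t, p) for t, p in zip(p_temp_prev, p_smooth)]
    return p_min, p_temp
```

Restarting P_temp at each window boundary lets the tracked minimum rise again when the noise level increases, instead of staying locked to an old, lower value.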

Referring to formula (10), the minimum power spectrum of the smoothed noisy speech output after the comparison in each frame is taken as the initial estimated noise power spectrum of the current frame; that is, P_min(λ,k), the minimum power spectrum of the smoothed noisy speech output for the λth frame, is assigned to the initial estimated noise power spectrum of that frame.

S204b: Determine the likelihood ratio according to the probability density distribution of the noisy speech spectrum when the target speech is assumed to exist and the probability density distribution of the noisy speech spectrum when the target speech is assumed to not exist.

In statistical theory, the likelihood ratio changes with the distribution characteristics of the probability density functions of the target speech spectrum and the noise spectrum. The following assumes that the target speech spectrum and the noise spectrum obey Gaussian distributions, which leads to formulas (11)-(14).

Specifically, in engineering implementation, when the likelihood ratio is determined according to the probability density distribution of the noisy speech spectrum when the target speech is assumed to exist and the probability density distribution of the noisy speech spectrum when the target speech is assumed not to exist, the estimated target speech power spectrum of the current frame (the λth frame), also called the enhanced speech power spectrum, is used in place of the real target speech power spectrum |X(λ,k)|² of the current frame. Specifically, the filter coefficients obtained for the previous frame (the (λ-1)th frame) can be used to filter the noisy speech power spectrum of the current frame, and the resulting enhanced speech power spectrum is taken as the estimated target speech power spectrum. Likewise, the true noise power spectrum is replaced by the initial estimated noise power spectrum calculated by the above formula (10). The likelihood ratio in engineering implementation is then calculated as in formula (15).

Specifically, the above formula (15) can be simplified to obtain likelihood ratio calculation formulas in different simplified forms to save hardware resource overhead.

In addition, in the above formulas (11) and (12), H₀ indicates that there is no target speech and H₁ indicates that there is target speech. Therefore, p(Y(λ,k)|H₀) is the probability density distribution function of the noisy speech spectrum of the λth frame when there is no target speech, and p(Y(λ,k)|H₁) is the probability density distribution function of the noisy speech spectrum of the λth frame when there is target speech. Referring to formula (13), the likelihood ratio corresponding to the kth frequency point is the ratio of p(Y(λ,k)|H₁) to p(Y(λ,k)|H₀) at that frequency point. Therefore, once the specific forms of formulas (11) and (12) are determined, formula (13) gives the likelihood ratio Δ_k corresponding to each frequency point. Formula (14) is the specific expression of the likelihood ratio obtained after one form of formulas (11) and (12) is determined, and formula (15) is the concrete expression of the likelihood ratio in engineering implementation.
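Formulas (11)-(15) are not reproduced in this text, so the following Python sketch uses the textbook likelihood ratio for zero-mean complex Gaussian spectra, which may differ in detail from the formulas of this application. The function name and arguments are assumptions: `speech_power` stands for the estimated target speech power spectrum and `noise_power` for the initial estimated noise power spectrum at one frequency point.

```python
import math


def likelihood_ratio(noisy_power, speech_power, noise_power):
    """p(Y|H1) / p(Y|H0) at one frequency point under a complex Gaussian model.

    Under H0 the variance of Y is noise_power; under H1 it is
    speech_power + noise_power. The ratio of the two densities simplifies to
    exp(gamma * xi / (1 + xi)) / (1 + xi).
    """
    xi = speech_power / noise_power      # ratio of speech power to noise power
    gamma = noisy_power / noise_power    # ratio of observed power to noise power
    return math.exp(gamma * xi / (1.0 + xi)) / (1.0 + xi)
```

When the estimated speech power is zero the two hypotheses coincide and the ratio is 1; the ratio grows as the observed power exceeds what noise alone would explain.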

S204c: Determine a priori probability of the existence of a valid target speech according to the power spectrum of the noisy speech.

In this embodiment, the prior probability of the existence of valid target speech is determined in step S204c in two steps. The first step is to preliminarily determine, according to the noisy speech power spectrum, whether target speech exists in the noisy speech of the current frame. The second step is to determine the estimated prior probability of the existence of target speech according to the result of this preliminary determination, and then to determine the prior probability of the existence of valid target speech.

Further, in step S204c, determining the estimated prior probability of the existence of target speech according to the noisy speech power spectrum includes: performing inter-frequency smoothing and inter-frame smoothing on the power spectrum of the noisy speech in which no target speech exists; and determining the estimated prior probability of the existence of target speech according to the twice-smoothed noisy speech power spectrum.

Further, in step S204c, the preliminary determination of whether target speech exists in the noisy speech proceeds as follows: the first detection factor at each frequency point of the current frame is determined according to the noisy speech power spectrum, namely the above |Y(λ,k)|², and the minimum power spectrum of the noisy speech after windowing and inter-frame smoothing, namely the above P_min(λ,k); the second detection factor at each frequency point of the current frame is determined according to the noisy speech power spectrum after windowing and inter-frame smoothing, namely the above P(λ,k), and the minimum power spectrum P_min(λ,k); and whether target speech exists in the noisy speech at each frequency point of the current frame is preliminarily determined according to the first and second detection factors at that frequency point.

Specifically, if the first detection factor calculated at a certain frequency point of the noisy speech in the current frame is less than the set first detection factor threshold, and the second detection factor at this frequency point is less than the set second detection factor threshold, it is preliminarily determined that the noisy speech does not contain target speech at this frequency point; otherwise, it is preliminarily determined that the noisy speech contains target speech at this frequency point.

In a specific application scenario, the following formulas (16)-(18) are used to preliminarily determine whether the target voice exists in the noisy voice.

Where γ₀ and ζ₀ are thresholds; B_min = 1.66 is the estimated deviation factor; P_min is the minimum power spectrum of the smoothed noisy speech power spectrum output by formula (6) or formula (8); and P(λ,k) is the smoothed noisy speech power spectrum calculated by formula (5). B_min is used to compensate P_min: if P_min is too small, for example, B_min corrects it and makes it more accurate.

Referring to the above formula (17), the first detection factor γ_min(λ,k) is determined according to the noisy speech power spectrum |Y(λ,k)|² of the λth frame and the minimum power spectrum P_min calculated according to the above formula (6) or (8); γ_min(λ,k) is used to detect whether there is target speech in the frequency domain signal at each frequency point of the noisy speech in the λth frame.

Referring to the above formula (18), the second detection factor ζ(λ,k) is determined according to the smoothed noisy speech power spectrum P(λ,k) of the λth frame and the minimum power spectrum P_min calculated according to the above formula (6) or (8); ζ(λ,k) is likewise used to detect whether there is target speech in the frequency domain signal at each frequency point of the noisy speech in the λth frame.

If there is no target speech, or in other words there is a high probability that only noise is present, the noise is relatively stable, so the first and second detection factors calculated according to the above formulas (17) and (18) are relatively small. Therefore, after the first detection factor γ_min(λ,k) and the second detection factor ζ(λ,k) are obtained according to formulas (17) and (18), they are compared with the corresponding thresholds γ₀ and ζ₀. If the first detection factor γ_min(λ,k) at a certain frequency point in the current frame is less than the corresponding threshold γ₀, and the second detection factor ζ(λ,k) at that frequency point is less than the corresponding threshold ζ₀, it is preliminarily determined that the noisy speech includes only noise at this frequency point and does not include target speech. Otherwise, it is preliminarily determined that the noisy speech at this frequency point includes both noise and target speech.

The result of judging whether target speech exists at each frequency point of the current frame is represented by 0 and 1, where 0 means it is determined that target speech exists at the current frequency point of the current frame, and 1 means it is determined that target speech does not exist at that frequency point. I(λ,k) is defined as an indicator function, and the result of the judgment is saved at the corresponding frequency point of I(λ,k). That is, when the noisy speech does not contain target speech at a frequency point, the value of the indicator function I(λ,k) at that frequency point is 1; otherwise, the value is 0.

In the above formula (16), the thresholds γ ₀ and ζ ₀ can be flexibly set according to application scenarios.
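The preliminary judgment can be sketched in Python as follows. Formulas (16)-(18) are not reproduced in this text, so the sketch assumes that each detection factor is the ratio of the corresponding power spectrum to B_min·P_min, which is consistent with the description above but remains an assumption; the function name is also illustrative.

```python
B_MIN = 1.66  # estimated deviation factor given in the description


def preliminary_indicator(noisy_power, p_smooth, p_min, gamma0, zeta0, b_min=B_MIN):
    """Return I(λ,k) per frequency point: 1 = only noise, 0 = target speech present."""
    indicator = []
    for y2, p, pm in zip(noisy_power, p_smooth, p_min):
        gamma_min = y2 / (b_min * pm)  # first detection factor (assumed ratio form)
        zeta = p / (b_min * pm)        # second detection factor (assumed ratio form)
        indicator.append(1 if (gamma_min < gamma0 and zeta < zeta0) else 0)
    return indicator
```

Both factors must fall below their thresholds for a frequency point to be marked as noise only, so a single strong frame or a sustained rise in the smoothed spectrum is enough to mark the point as containing speech.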

Specifically, when the prior probability of the existence of valid target speech is determined in step S204c, whether target speech exists in the noisy speech of the current frame is determined according to the indicator function. If target speech does not exist in the current frame, that is, the indicator function is not all zero at the frequency points, the noisy speech power spectrum is smoothed between frequency points using the indicator function and the window function. If target speech exists in the noisy speech of the current frame, that is, the indicator function is all zero at the frequency points, the twice-smoothed noisy speech power spectrum obtained in the previous frame is used as the inter-frequency smoothed noisy speech power spectrum of the current frame. Then, inter-frame smoothing is performed on the inter-frequency smoothed noisy speech power spectrum of the current frame to obtain the inter-frame smoothed noisy speech power spectrum; because both inter-frequency smoothing and inter-frame smoothing have been applied, the result is also called the twice-smoothed noisy speech power spectrum. Further, the third detection factor at each frequency point of the current frame is determined according to the minimum power spectrum of the twice-smoothed noisy speech power spectrum and the noisy speech power spectrum; the fourth detection factor at each frequency point of the current frame is determined according to the twice-smoothed noisy speech power spectrum and its minimum power spectrum; and the estimated prior probability of the existence of target speech at each frequency point of the current frame is determined according to the third and fourth detection factors at that frequency point.

Obtaining the minimum power spectrum of the twice-smoothed noisy speech power spectrum can be specifically implemented with reference to the above formulas (6)-(9).

Furthermore, according to the following formulas (19) and (20), the noisy speech power spectrum |Y(λ,k)|² of the λth frame is smoothed between frequency points and between frames to obtain the twice-smoothed noisy speech power spectrum.

Where α₂ is the smoothing factor. Formula (19) gives the power spectrum obtained by smoothing the noisy speech power spectrum |Y(λ,k)|² of the λth frame (the current frame) between frequency points using the indicator function I(λ,k) and the window function w(L), where Lw represents the length of the window and i represents the position of the window. Formula (20) gives the twice-smoothed noisy speech power spectrum of the λth frame, obtained by inter-frame smoothing of the result of formula (19) with the twice-smoothed noisy speech power spectrum of the (λ-1)th frame.

Combining the above formulas (16)-(18) with formula (19), it can be seen that formula (19) is equivalent to smoothing the noisy speech power spectrum between frequency points based on the result of the preliminary judgment. Formula (20) is then equivalent to applying inter-frame smoothing to the noisy speech power spectrum that has already been smoothed between frequency points.
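A Python sketch of the two smoothing passes is given below. Formulas (19) and (20) are not reproduced in this text, so the exact weighting is an assumption: here each bin is a window-weighted average over the neighbouring bins marked as noise only, and the fallback to the previous frame is applied per bin when no such neighbour lies inside the window, which is one possible reading of the description. Function names are illustrative.

```python
def gated_frequency_smoothing(noisy_power, indicator, window, previous_twice_smoothed):
    """Inter-frequency smoothing that uses only bins with indicator I(λ,k) = 1."""
    half = len(window) // 2
    n = len(noisy_power)
    out = []
    for k in range(n):
        num = 0.0
        den = 0.0
        for i, w in enumerate(window):
            idx = k + i - half
            if 0 <= idx < n:
                num += w * indicator[idx] * noisy_power[idx]
                den += w * indicator[idx]
        # No noise-only bin inside the window: reuse the previous frame's value.
        out.append(num / den if den > 0.0 else previous_twice_smoothed[k])
    return out


def inter_frame_smoothing(previous_twice_smoothed, freq_smoothed, alpha2):
    """Recursive smoothing between frames with smoothing factor α₂."""
    return [alpha2 * prev + (1.0 - alpha2) * cur
            for prev, cur in zip(previous_twice_smoothed, freq_smoothed)]
```

Gating by the indicator keeps strong speech bins out of the average, so the twice-smoothed spectrum tracks the noise floor rather than the speech energy.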

After the twice-smoothed noisy speech power spectrum is obtained, its minimum power spectrum is determined with reference to the above formulas (6)-(9).

Then according to the following formulas (21)-(24), determine the estimated prior probability of the existence of the target speech q _s (λ, k) and the effective prior probability of the existence of the target speech q (λ, k).

q(λ,k) = max(q_s(λ,k), q_min) (22)

Among them, γ₁ and ζ₀ are thresholds, and q_min is the minimum value of the prior probability of the existence of target speech. Once the application scenario is determined, q_min is approximately fixed; that is, q_min can be set according to the application scenario. q(λ,k) is the prior probability of the existence of valid target speech.

With reference to the above formula (23) and formula (24), for the noisy speech in the λth frame at the kth frequency point, the third and fourth detection factors are determined in a manner similar to the first detection factor above. Determining the estimated prior probability of the existence of target speech includes: determining the third detection factor according to the minimum power spectrum of the twice-smoothed noisy speech power spectrum and the noisy speech power spectrum |Y(λ,k)|²; determining the fourth detection factor according to the twice-smoothed noisy speech power spectrum and its minimum power spectrum; and determining, according to the third detection factor and the fourth detection factor, the estimated prior probability of the existence of target speech at each frequency point of the current frame.

Further, referring to the above formulas (21)-(22), the third detection factor and the fourth detection factor calculated for each frequency point of the noisy speech in the current frame are compared with the corresponding thresholds, and the estimated prior probability of the existence of target speech at each frequency point of the current frame is determined according to the comparison result.

Referring to the above formulas (21)-(22), for the noisy speech in the λth frame: if the third detection factor is less than or equal to 1 at a certain frequency point and the fourth detection factor is less than the corresponding threshold ζ₀ at that frequency point, the estimated prior probability q_s(λ,k) of the existence of target speech at that frequency point is determined to be 0. If the third detection factor is greater than 1 but less than the threshold γ₁ at a certain frequency point and the fourth detection factor is less than the corresponding threshold ζ₀ at that frequency point, q_s(λ,k) at that frequency point is calculated according to the above formula (21), specifically from the third detection factor and the corresponding threshold. In all other cases, q_s(λ,k) is 1.

Further, referring to the above formula (22), the larger of the estimated prior probability q_s(λ,k) of the existence of target speech at each frequency point and the minimum value q_min of the prior probability of the existence of target speech is taken as the prior probability q(λ,k) of the existence of valid target speech at the corresponding frequency point.

Here, it should be noted that the foregoing embodiment provides, by way of example, one way to calculate the prior probability q(λ,k) of the existence of valid target speech. However, other methods can also be used to solve for q(λ,k).
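The three cases for q_s(λ,k) and the floor of formula (22) can be sketched in Python as below. Formula (21) is not reproduced in this text; the linear ramp between 1 and γ₁ used for the middle case is an assumption chosen so that q_s moves continuously from 0 to 1, and the function names are illustrative.

```python
def estimated_prior(third_factor, fourth_factor, gamma1, zeta0):
    """Estimated prior probability q_s(λ,k) that target speech exists at one frequency point."""
    if fourth_factor < zeta0:
        if third_factor <= 1.0:
            return 0.0
        if third_factor < gamma1:
            # Assumed linear form computed from the third factor and its threshold.
            return (third_factor - 1.0) / (gamma1 - 1.0)
    return 1.0


def effective_prior(q_s, q_min):
    """Formula (22): q(λ,k) = max(q_s(λ,k), q_min)."""
    return max(q_s, q_min)
```

Keeping q(λ,k) at or above q_min prevents the prior from reaching exactly zero, so the posterior probability can still respond if speech starts at that frequency point.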

S204d: Determine the posterior probability of the existence of the target speech at each frequency point of the current frame according to the likelihood ratio and the prior probability of the existence of the effective target speech.

According to Bayesian theory, the posterior probability of the existence of the target speech is calculated by the following formula (25):

Simplify the above formula (25) to get:

In the above formulas, the likelihood ratio of formula (13) is the likelihood ratio of the noisy speech in the λth frame at each frequency point, and q(λ,k) is the prior probability of the existence of valid target speech described above. Since the likelihood ratio and the prior probability of the existence of valid target speech are already known from the above formulas, they can be substituted into formula (26) to obtain the posterior probability p(H₁|Y(λ,k)) of the existence of target speech in the λth frame of noisy speech at each frequency point.
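Formulas (25) and (26) are not reproduced in this text. The following Python sketch applies Bayes' rule directly, taking the prior as the probability that target speech exists and the likelihood ratio as p(Y|H₁)/p(Y|H₀); whether the simplified formula (26) has exactly this form is an assumption, as is the function name.

```python
def speech_presence_posterior(likelihood_ratio, prior):
    """p(H1|Y) from the likelihood ratio and the prior probability q of speech presence.

    Bayes' rule: q·p(Y|H1) / (q·p(Y|H1) + (1-q)·p(Y|H0)); dividing through by
    p(Y|H0) leaves only the likelihood ratio and the prior.
    """
    weighted = prior * likelihood_ratio
    return weighted / (weighted + (1.0 - prior))
```

With a likelihood ratio of 1 the observation carries no information and the posterior equals the prior; larger ratios push the posterior toward 1.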

S204e. The smoothing factor calculation unit calculates the smoothing factor according to the posterior probability of the existence of the target voice;

According to different noise reduction scenarios, a mapping model between the posterior probability of the existence of target speech and the smoothing factor is determined. Correspondingly, calculating the smoothing factor according to the posterior probability of the existence of target speech includes: using the posterior probability of the existence of target speech as the input of the mapping model, the output of the mapping model being the smoothing factor.

Specifically, refer to the following formula (27) to calculate the smoothing factor corresponding to the noisy speech in the λth frame at each frequency point.

α(k) = f(p(H₁|Y(λ,k))) (27)

It can be seen from the above formula (27) that, for the frequency domain signal of the noisy speech in the λth frame at the kth frequency point, the smoothing factor α(k) can be calculated from the corresponding posterior probability of the existence of target speech at the kth frequency point. The specific functional relationship between p(H₁|Y(λ,k)) and α(k) can be linear, exponential, logarithmic, etc. The specific mapping model used depends on the noise characteristics in the application environment.

Fig. 3 and Fig. 4 exemplarily show the first and second schematic diagrams of the mapping curve between the posterior probability of the existence of target speech and the smoothing factor. The abscissa is the posterior probability of the existence of target speech, and the ordinate is the smoothing factor. The curves show the relationship between α(k) and p(H₁|Y(λ,k)) under different parameter configurations.

α(k)=min{β+(1-β)*P(k),0.96} (28)

Among them, β, γ, μ, ε are all configurable parameters, and different parameter configurations produce different functional relationships and curves between p(H₁|Y(λ,k)) and α(k). As can be seen from formulas (28)-(29), the relationship between the posterior probability of the existence of target speech and the smoothing factor includes a non-linear relationship.
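Formula (28) can be sketched in Python as follows. The sketch assumes that P(k) in formula (28) denotes the posterior probability p(H₁|Y(λ,k)); the function name and the value of β used in any call are illustrative. Formula (29) is not reproduced in this text and is not modeled.

```python
def smoothing_factor(posterior, beta, cap=0.96):
    """Formula (28): α(k) = min{β + (1-β)·P(k), 0.96}."""
    return min(beta + (1.0 - beta) * posterior, cap)
```

The mapping is linear in the posterior probability until it reaches the cap, after which it is constant; this piecewise shape is what makes the overall relationship non-linear.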

S205. The noise update unit is configured to update the initial estimated noise power spectrum according to the smoothing factor and the noisy speech power spectrum to obtain an effective noise power spectrum;

In this embodiment, when step S205 is executed, it is assumed that updating of the initial estimated noise power spectrum is stopped when target speech is present, so as to avoid damaging the target speech, and that the initial estimated noise power spectrum is updated when target speech is absent, so as to improve the accuracy of noise estimation. For this reason, the update modes for the case without target speech and the case with target speech are obtained separately.

Referring to the above formula (30), for the noisy speech power spectrum of the λth frame, the initial estimated noise power spectrum is updated when there is no target speech; referring to the above formula (31), the initial estimated noise power spectrum is not updated when there is target speech. Updating only in the absence of target speech avoids speech loss and avoids excessive noise residue.

Therefore, based on the assumptions of the above formulas (30) and (31), the effective noise power spectrum corresponding to the noisy speech in the λth frame is as shown in formula (32).

Substituting (30) and (31) into (32) gives formula (33).

Among them, α₃ is the aforementioned smoothing factor, which has a functional relationship with the posterior probability of the existence of target speech. For the noisy speech in the λth frame, p(H₀|Y(λ,k)) is the posterior probability that no target speech exists, p(H₁|Y(λ,k)) is the posterior probability that target speech exists, and p(H₀|Y(λ,k)) = 1 - p(H₁|Y(λ,k)). These three parameters are calculated by the noise update control module. The historical noise term is the initial estimated noise power spectrum obtained for the noisy speech in the (λ-1)th frame; if the noise reduction capability needs to be enhanced, the effective noise power spectrum corresponding to the noisy speech in the (λ-1)th frame can be used instead.

From the above formulas (30)-(33), it can be seen that the effective noise power spectrum is obtained by updating the initial estimated noise power spectrum according to the noisy speech power spectrum, the smoothing factor, the posterior probability that no target speech exists, the historical initial estimated noise power spectrum, and the posterior probability that target speech exists. For the noisy speech in the λth frame, the historical initial estimated noise power spectrum can be taken directly as the initial estimated noise power spectrum corresponding to the noisy speech in the (λ-1)th frame; if the noise reduction capability needs to be enhanced, it can instead be the effective noise power spectrum corresponding to the noisy speech in the (λ-1)th frame.
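The update can be sketched in Python as follows. Formulas (30)-(33) are not reproduced in this text, so the recursive form used for the case without target speech is an assumption based on the description; the function and argument names are illustrative, and `prev_noise` may hold either the initial estimated or the effective noise power spectrum of the (λ-1)th frame.

```python
def update_noise(noisy_power, prev_noise, posterior, alpha):
    """Effective noise power spectrum of the current frame, per frequency point.

    posterior is p(H1|Y(λ,k)) and alpha is the smoothing factor at each point.
    """
    updated = []
    for y2, n_prev, p1, a in zip(noisy_power, prev_noise, posterior, alpha):
        p0 = 1.0 - p1                           # posterior probability of no target speech
        absent = a * n_prev + (1.0 - a) * y2    # assumed update when speech is absent
        present = n_prev                        # no update when speech is present
        updated.append(p0 * absent + p1 * present)
    return updated
```

Weighting the two cases by their posterior probabilities gives a soft decision: the noise estimate follows the noisy speech power spectrum only to the extent that the frame is believed to contain noise alone.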

S206. The filter coefficient calculation module calculates the filter coefficient according to the effective noise power spectrum.

S207. The filter module separately filters the real part and the imaginary part of the noisy speech spectrum according to the filter coefficient to obtain an enhanced speech spectrum.

The classic frequency domain Wiener filter structure is as follows:

Wherein a and b are variable parameters.

In practice, the real target speech power spectrum and the real noise power spectrum cannot be obtained, so the following classic decision-directed method is used to approximate ξ_k.

Where a is the smoothing factor and ξ_min is the minimum allowed value of the estimate of ξ_k. The formula uses the effective noise power spectrum estimated for the λth frame, the effective noise power spectrum estimated for the (λ-1)th frame, the target speech power spectrum obtained in the (λ-1)th frame (that is, the enhanced target speech power spectrum), and |Y(λ,k)|², the power spectrum of the noisy speech in the λth frame.

The filter mainly includes an adder and a multiplier. The filter coefficients calculated by formulas (34) and (35) are used to reduce the noise in the real and imaginary parts of the noisy speech spectrum of the λth frame; that is, the coefficients are multiplied and added with the real and imaginary parts to obtain the enhanced complex speech spectrum.
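Steps S206 and S207 can be sketched in Python as follows. Formulas (34) and (35) are not reproduced in this text, so the sketch uses the textbook decision-directed estimate of ξ_k and the plain Wiener gain ξ/(1+ξ); the variable parameters a and b of the filter structure in this application are not modeled, and all function names are illustrative.

```python
def decision_directed_snr(noisy_power, noise_now, noise_prev, speech_prev, a, xi_min):
    """Estimate of ξ_k at one frequency point.

    Combines the previous frame's enhanced speech power over the previous noise
    estimate with the current frame's excess of observed power over noise.
    """
    excess = max(noisy_power / noise_now - 1.0, 0.0)
    xi = a * speech_prev / noise_prev + (1.0 - a) * excess
    return max(xi, xi_min)


def wiener_gain(xi):
    """Textbook Wiener gain for a given ξ_k."""
    return xi / (1.0 + xi)


def apply_gain(real, imag, gains):
    """Filter the real and imaginary parts separately with the same real-valued gain."""
    return ([g * re for g, re in zip(gains, real)],
            [g * im for g, im in zip(gains, imag)])
```

Because the gain is real and applied equally to both parts, the phase of the noisy speech spectrum is preserved and only its magnitude is reduced.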

S208. The restoration module restores the enhanced speech spectrum from the frequency domain back to the time domain to obtain a time domain binary code group;

S209. The output module decodes and transmits the time-domain binary code group to be played through the speaker.

Here, the "user" in the foregoing embodiment is a relative concept, and is not specifically limited to a person, but may also be a machine. The above embodiments can be applied to various application scenarios such as human-to-human voice calls, human-to-robot voice calls, and robot-to-robot voice calls. In fact, the term can be generalized to any object that can produce effective voice.

In the second embodiment mentioned above, steps S204a-S204e and step S205 are actually an exemplary embodiment of the noise estimation method. However, it should be noted that further or specific technical implementations are not limited to this one.

An embodiment of the present application also provides an electronic device, which includes the solution described in any of the embodiments of the present application.

In addition, the specific formulas described in the foregoing embodiments are merely examples and are not uniquely limited. Those of ordinary skill in the art can modify them without departing from the idea of the present application.

The above-mentioned technical solutions of the embodiments of the present application can be specifically applied to various types of electronic equipment, and the electronic equipment exists in various forms, including but not limited to:

(1) Mobile communication equipment: This type of equipment is characterized by mobile communication functions, and its main goal is to provide voice and data communications. Such terminals include: smart phones (such as iPhone), multimedia phones, feature phones, and low-end phones.

(2) Ultra-mobile personal computer equipment: This type of equipment belongs to the category of personal computers, has calculation and processing functions, and generally also has mobile Internet features. Such terminals include: PDA, MID and UMPC devices, such as iPad.

(3) Portable entertainment equipment: This type of equipment can display and play multimedia content. Such devices include: audio and video players (such as iPod), handheld game consoles, e-books, as well as smart toys and portable car navigation devices.

(4) Other electronic devices with data interaction functions.

So far, specific embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desired results. In addition, the processes depicted in the drawings do not necessarily require the specific order or sequential order shown in order to achieve the desired result. In certain embodiments, multitasking and parallel processing may be advantageous.

The systems, devices, modules, or units illustrated in the above embodiments may be specifically implemented by computer chips or entities, or implemented by products with certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cell phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or any combination of these devices.

For the convenience of description, the above device is described with its functions divided into various units. Of course, when implementing this application, the functions of the units can be implemented in one or more pieces of software and/or hardware.

Those skilled in the art should understand that the embodiments of the present application can be provided as methods, systems, or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.

This application is described with reference to flowcharts and/or block diagrams of methods, equipment (systems), and computer program products according to the embodiments of this application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

It should also be noted that the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or equipment including a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to the process, method, commodity, or equipment. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, commodity, or equipment that includes the element.

This application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.

The above descriptions are only examples of this application and are not used to limit this application. For those skilled in the art, this application can have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included in the scope of the claims of this application.

Claims

1. A noise estimation method, characterized by comprising:

determining an initial estimated noise power spectrum of noisy speech;

calculating a smoothing factor according to the probability of the existence of target speech; and

updating the initial estimated noise power spectrum according to the noisy speech and the smoothing factor to obtain an effective noise power spectrum.
2. The method according to claim 1, wherein the probability of the existence of the target speech comprises the posterior probability of the existence of the target speech, and the relationship between the posterior probability of the existence of the target speech and the smoothing factor comprises a nonlinear relationship.
3. The method according to claim 2, further comprising: determining a likelihood ratio according to the probability density distribution of the noisy speech spectrum when the target speech is assumed to exist and the probability density distribution of the noisy speech spectrum when the target speech is assumed to be absent; and determining the posterior probability of the existence of the target speech according to the likelihood ratio and the prior probability of the existence of the effective target speech;

correspondingly, the smoothing factor is calculated according to the posterior probability of the existence of the target speech.
4. The method according to claim 3, further comprising: determining the prior probability of the existence of the effective target speech according to the noisy speech power spectrum.
5. The method according to claim 4, wherein the determining the prior probability of the existence of the effective target speech according to the noisy speech power spectrum comprises: preliminarily judging, according to the noisy speech power spectrum, whether the target speech exists in the noisy speech; determining the estimated prior probability of the existence of the target speech according to the result of the preliminary judgment; and determining the prior probability of the existence of the effective target speech according to the estimated prior probability of the existence of the target speech.
6. The method according to claim 5, wherein the determining the estimated prior probability of the existence of the target speech according to the result of the preliminary judgment comprises: if the target speech does not exist, performing smoothing between frequency points on the noisy speech power spectrum in which the target speech does not exist, to obtain the noisy speech power spectrum smoothed between frequency points, or, if the target speech exists, taking the historical noisy speech power spectrum smoothed between frames as the noisy speech power spectrum smoothed between frequency points; performing inter-frame smoothing on the noisy speech power spectrum smoothed between frequency points to obtain the noisy speech power spectrum smoothed between frames; and determining the estimated prior probability of the existence of the target speech according to the noisy speech power spectrum smoothed between frames.
7. The method according to claim 5, wherein the preliminarily judging whether the target speech exists in the noisy speech according to the noisy speech power spectrum comprises: determining a first detection factor according to the noisy speech power spectrum and the minimum power spectrum of the noisy speech power spectrum after windowing and inter-frame smoothing; determining a second detection factor according to the noisy speech power spectrum after windowing and inter-frame smoothing and its minimum power spectrum; and preliminarily judging whether the target speech exists in the noisy speech according to the first detection factor and the second detection factor.
8. The method according to claim 7, wherein if the first detection factor is less than a set first detection factor threshold and the second detection factor is less than a set second detection factor threshold, it is preliminarily determined that the target speech does not exist in the noisy speech; otherwise, it is preliminarily determined that the target speech exists in the noisy speech.
9. The method according to any one of claims 6-8, wherein the determining the estimated prior probability of the existence of the target speech according to the noisy speech power spectrum smoothed between frames comprises: determining a third detection factor according to the noisy speech power spectrum and the minimum power spectrum of the noisy speech power spectrum smoothed between frames; determining a fourth detection factor according to the noisy speech power spectrum smoothed between frames and the minimum power spectrum; and determining the estimated prior probability of the existence of the target speech according to the third detection factor and the fourth detection factor.
10. The method according to claim 9, wherein the determining the estimated prior probability of the existence of the target speech according to the third detection factor and the fourth detection factor comprises: comparing the third detection factor and the fourth detection factor with corresponding thresholds, and determining the estimated prior probability of the existence of the target speech according to the comparison result.
11. The method according to any one of claims 5-10, wherein the determining the prior probability of the existence of the effective target speech according to the estimated prior probability of the existence of the target speech comprises: determining the prior probability of the existence of the effective target speech according to the estimated prior probability of the existence of the target speech and the minimum value of the prior probability of the existence of the target speech.
12. The method according to any one of claims 4-11, further comprising: calculating the noisy speech power spectrum, so as to determine the initial estimated noise power spectrum of the noisy speech according to the noisy speech power spectrum.
13. The method according to claim 12, wherein the determining the initial estimated noise power spectrum of the noisy speech according to the noisy speech power spectrum comprises: performing windowing processing on the noisy speech power spectrum; performing inter-frame smoothing processing on the windowed noisy speech power spectrum; and performing a minimum power spectrum search on the noisy speech power spectrum smoothed between frames, and using the searched minimum power spectrum as the initial estimated noise power spectrum.
14. The method according to any one of claims 4-13, wherein the updating the initial estimated noise power spectrum according to the noisy speech and the smoothing factor to obtain an effective noise power spectrum comprises: updating the initial estimated noise power spectrum according to the noisy speech power spectrum, the smoothing factor, the posterior probability of the absence of the target speech, the historical initial estimated noise power spectrum, and the posterior probability of the existence of the target speech, to obtain the effective noise power spectrum.
15. The method according to any one of claims 1-14, further comprising: calculating a filter coefficient according to the effective noise power spectrum; and filtering the noisy speech according to the filter coefficient to obtain an enhanced speech spectrum.
16. A noise estimation device, characterized by comprising:

an initial noise estimation unit, configured to determine an initial estimated noise power spectrum of noisy speech; and

a noise update unit, configured to update the initial estimated noise power spectrum according to the noisy speech and a smoothing factor to obtain an effective noise power spectrum, wherein the smoothing factor is calculated according to the probability of the existence of target speech.
17. The device according to claim 16, wherein the probability of the existence of the target speech comprises the posterior probability of the existence of the target speech, and the relationship between the posterior probability of the existence of the target speech and the smoothing factor comprises a nonlinear relationship.
18. The device according to claim 17, further comprising: a likelihood ratio calculation unit, configured to determine a likelihood ratio according to the probability density distribution of the noisy speech spectrum when the target speech is assumed to exist and the probability density distribution of the noisy speech spectrum when the target speech is assumed to be absent; and a posterior probability calculation unit for the existence of the target speech, configured to determine the posterior probability of the existence of the target speech according to the likelihood ratio and the prior probability of the existence of the effective target speech;

correspondingly, the smoothing factor is calculated according to the posterior probability of the existence of the target speech.
19. The device according to claim 18, further comprising: a prior probability calculation unit for the existence of the target speech, configured to determine the prior probability of the existence of the effective target speech according to the noisy speech power spectrum.
20. The device according to claim 19, wherein the prior probability calculation unit for the existence of the target speech is further configured to: preliminarily judge, according to the noisy speech power spectrum, whether the target speech exists in the noisy speech; determine the estimated prior probability of the existence of the target speech according to the result of the preliminary judgment of whether the target speech exists in the noisy speech of the current frame; and determine the prior probability of the existence of the effective target speech according to the estimated prior probability of the existence of the target speech.
21. The device according to claim 20, wherein the prior probability calculation unit for the existence of the target speech is further configured to: if the target speech does not exist, perform smoothing between frequency points on the noisy speech power spectrum in which the target speech does not exist, to obtain the noisy speech power spectrum smoothed between frequency points, or, if the target speech exists, use the historical noisy speech power spectrum smoothed between frames as the noisy speech power spectrum smoothed between frequency points; perform inter-frame smoothing on the noisy speech power spectrum smoothed between frequency points to obtain the noisy speech power spectrum smoothed between frames; and determine the estimated prior probability of the existence of the target speech according to the noisy speech power spectrum smoothed between frames.
22. The device according to claim 20, wherein the prior probability calculation unit for the existence of the target speech is further configured to: determine a first detection factor according to the noisy speech power spectrum and the minimum power spectrum of the noisy speech power spectrum after windowing and inter-frame smoothing; determine a second detection factor according to the noisy speech power spectrum after windowing and inter-frame smoothing and its minimum power spectrum; and preliminarily judge whether the target speech exists in the noisy speech according to the first detection factor and the second detection factor.
23. The device according to claim 22, wherein if the first detection factor is less than a set first detection factor threshold and the second detection factor is less than a set second detection factor threshold, it is preliminarily determined that the target speech does not exist in the noisy speech; otherwise, it is preliminarily determined that the target speech exists in the noisy speech.
24. The device according to any one of claims 21-23, wherein the prior probability calculation unit for the existence of the target speech is further configured to: determine a third detection factor according to the noisy speech power spectrum and the minimum power spectrum of the noisy speech power spectrum smoothed between frames; determine a fourth detection factor according to the noisy speech power spectrum smoothed between frames and the minimum power spectrum; and determine the estimated prior probability of the existence of the target speech according to the third detection factor and the fourth detection factor.
25. The device according to claim 24, wherein the prior probability calculation unit for the existence of the target speech is further configured to: compare the third detection factor and the fourth detection factor with corresponding thresholds; and determine the estimated prior probability of the existence of the target speech according to the comparison result.
26. The device according to any one of claims 20-25, wherein the prior probability calculation unit for the existence of the target speech is further configured to: determine the prior probability of the existence of the effective target speech according to the estimated prior probability of the existence of the target speech and the minimum value of the prior probability of the existence of the target speech.
27. The device according to any one of claims 19-26, further comprising: a power spectrum calculation module, configured to calculate the noisy speech power spectrum, so that the initial estimated noise power spectrum of the noisy speech is determined according to the noisy speech power spectrum.
28. The device according to claim 27, wherein the initial noise estimation unit is further configured to: perform windowing processing on the noisy speech power spectrum; perform inter-frame smoothing processing on the windowed noisy speech power spectrum; and perform a minimum power spectrum search on the noisy speech power spectrum smoothed between frames, and use the searched minimum power spectrum as the initial estimated noise power spectrum.
29. The device according to any one of claims 19-28, wherein the noise update unit is further configured to: update the initial estimated noise power spectrum according to the noisy speech power spectrum, the smoothing factor, the posterior probability of the absence of the target speech, the historical initial estimated noise power spectrum, and the posterior probability of the existence of the target speech, to obtain the effective noise power spectrum.
30. The device according to any one of claims 16-29, further comprising: a filtering module, configured to calculate a filter coefficient according to the effective noise power spectrum, and to filter the noisy speech according to the filter coefficient to obtain an enhanced speech spectrum.
31. A speech processing chip, characterized by comprising the noise estimation device according to any one of claims 16-30.
32. An electronic device, characterized by comprising the speech processing chip according to claim 31.