WO2020107269A1

WO2020107269A1 - Self-adaptive speech enhancement method, and electronic device

Info

Publication number: WO2020107269A1
Application number: PCT/CN2018/117972
Authority: WO
Inventors: 朱虎; 王鑫山; 李国梁; 曾端; 郭红敬
Original assignee: 深圳市汇顶科技股份有限公司
Priority date: 2018-11-28
Filing date: 2018-11-28
Publication date: 2020-06-04
Also published as: CN109643554A; CN109643554B

Abstract

A self-adaptive speech enhancement method and an electronic device. The self-adaptive speech enhancement method comprises: after receiving a speech signal, calculating, according to the speech signal, the power of a current frame of the speech signal (101); comparing the power of the current frame with the noise power of a previous frame (102); acquiring, according to a comparison result and the noise power of the previous frame, a noise estimation value of the current frame (103); and acquiring, according to the noise estimation value, a pure speech signal (104). The use of this method makes the estimation of noise more accurate and reduces the complexity of an algorithm, thereby facilitating the enhancement of a speech signal and improving the quality of human auditory perception.

Description

Adaptive speech enhancement method and electronic equipment

Technical field

This application relates to the field of information processing technology, in particular to an adaptive speech enhancement method and electronic equipment.

Background technique

In real life, since the speaker is often in a variety of noisy environments, the voice signal is inevitably polluted by background noise. The background noise degrades the performance of many voice processing systems. As a signal processing method, speech enhancement is an efficient way to solve noise pollution. On the one hand, through speech enhancement, the clarity, intelligibility and comfort of the noise in a noisy environment can be improved, and the quality of human hearing perception can be improved; on the other hand, speech enhancement is also an indispensable part of the speech processing system. Before performing various speech signal processing operations, speech enhancement must first be performed to reduce the impact of noise on the speech processing system and improve the working skills of the system.

Speech enhancement mainly includes two parts: noise estimation and filter coefficient solution. Representative speech enhancement methods include spectral subtraction, Wiener filtering, minimum mean square error estimation, subspace methods, wavelet transform-based enhancement methods, and so on. Most of these methods are based on statistical models of speech and noise components in frequency, and combined with various estimation theories to design targeted noise cancellation techniques.

In the speech enhancement algorithm in the prior art, there are problems of inaccurate noise estimation and complicated algorithm.

Summary of the invention

The purpose of some embodiments of the present application is to provide an adaptive speech enhancement method, which makes the estimation of noise more accurate, and reduces the complexity of the algorithm, thereby facilitating the enhancement of speech signals and improving the quality of human auditory perception.

An embodiment of the present application provides an adaptive speech enhancement method, which includes: after receiving a speech signal, calculating the power of the current frame of the speech signal according to the speech signal; comparing the power of the current frame with the noise power of the previous frame ;According to the result of the comparison and the noise power of the previous frame, the noise estimate value of the current frame is obtained; according to the noise estimate value, the pure voice signal is obtained.

An embodiment of the present application further provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores executable by the at least one processor Instructions that are executed by the at least one processor to enable the at least one processor to perform the adaptive speech enhancement method described above.

Compared with the prior art, the embodiment of the present application calculates the power of the current frame of the voice signal according to the received voice signal, compares the power of the current frame with the noise power of the previous frame, and according to the result of the comparison and the previous frame To obtain the noise estimate of the current frame. For the estimation of noise, it is not necessary to use the VAD algorithm to detect whether the current frame is a speech frame or a noise frame, so that the inaccurate detection of the VAD algorithm will lead to a large deviation of the noise estimation, which is beneficial to quickly estimate the noise component in the speech signal. This application uses an iterative estimation method. The noise power of each frame is adaptively updated. The power of the current frame is compared with the noise power of the previous frame to estimate the noise value of the current frame. During the continuous iteration process, It also makes the estimated noise value more and more accurate. Moreover, in this application, the power is recalculated for each frame, which can achieve continuous estimation and continuous update of noise. It only needs to compare the power of the current frame with the noise power of the previous frame, and does not need to store the previous D frame data and Sorting according to power size reduces the algorithm resource overhead and algorithm complexity. Obtaining pure voice signals based on noise estimates is helpful for enhancing voice signals and improving the quality of human auditory perception.

For example, the power of the current frame is specifically: the log power spectrum of the current frame; the noise power of the previous frame is specifically: the log quantile of the previous frame. The logarithmic coordinates can amplify the details, and can extract signals that cannot be extracted under the general coordinate scale, which is conducive to compressing the dynamic range of the values, so that in the logarithmic coordinate system, the log power spectrum of the current frame and the log of the previous frame The comparison between quantiles is more precise, which facilitates subsequent accurate processing.

For example, according to the comparison result and the noise power of the previous frame, obtaining the noise estimate value of the current frame includes: according to the comparison result of the log power spectrum of the current frame and the log quantile of the previous frame, obtaining the current frame Incremental step; obtain the log quantile of the current frame according to the log quantile of the previous frame and the incremental step of the current frame; obtain the noise of the current frame according to the log quantile of the current frame estimated value. The incremental step size of the current frame provides a meaningful reference for obtaining the log quantile of the current frame, which is beneficial to accurately obtain the log quantile of the current frame, so as to accurately estimate the noise value of the previous frame .

For example, the log quantile of the current frame is obtained according to the log quantile of the previous frame and the incremental step size of the current frame, which specifically includes: if the log power spectrum of the current frame is greater than or equal to the log of the previous frame Quantile, the log quantile of the previous frame is adaptively increased according to the incremental step size to obtain the log quantile of the current frame; if the log power spectrum of the current frame is less than the log quantile of the previous frame For the number of bits, the log quantile of the previous frame is adaptively reduced according to the incremental step size to obtain the log quantile of the current frame. By adaptively increasing or decreasing the log quantile of the previous frame according to the incremental step size, it is beneficial to accurately obtain the log quantile of the current frame.

For example, according to the comparison result of the log power spectrum of the current frame and the log quantile of the previous frame, the incremental step of the current frame is obtained, which specifically includes: according to the log power spectrum of the current frame and the previous frame. The comparison result of the quantiles is used to obtain the density function; obtaining the incremental step size of the current frame according to the density function provides a way to obtain the incremental step size of the current frame.

For example, to obtain the density function, specifically use the following formula to obtain the density function density:

Where λ is the frame number of the current frame, k is the number of frequency points, β is the empirical value of the experiment, ξ is the preset threshold, and log(|Y _w (λ)| ² ) Is the log power spectrum of the current frame, where lq(λ-1, k) is the log quantile of the previous frame; a specific calculation formula for obtaining the density function is provided, which is beneficial to quickly and accurately obtain the density function.

For example, the incremental step of the current frame is obtained according to the density function, and the incremental step delta is specifically obtained by the following formula:

Among them, λ is the frame number of the current frame, K is the incremental step size control factor, density(λ-1, k) is the density function of the previous frame, and provides a specific calculation formula for obtaining the incremental step size. It is helpful to obtain the incremental step quickly and accurately.

For example, according to the increment step, the log quantile of the previous frame is adaptively increased to obtain the log quantile of the current frame. Specifically, the log quantile of the current frame is obtained by the following formula: lq(λ,k )=lq(λ-1,k)+α·delta(λ,k)/β; adaptively reduce the log quantile of the previous frame according to the increment step to obtain the log quantile of the current frame The number includes: the log quantile of the current frame is obtained by the following formula: lq(λ,k)=lq(λ-1,k)-(1-α)·delta(λ,k)/β; Where λ is the frame number of the current frame, k is the number of frequency points, α is the empirical value of the experiment, and delta(λ, k) is the incremental step size. The calculation formulas for adaptively increasing and decreasing the log quantile are provided, which is beneficial to obtain the log quantile of the current frame directly, quickly and accurately.

For example, obtaining the pure voice signal according to the noise estimate includes: obtaining the power spectrum of the current frame of the voice signal; obtaining the spectral gain coefficient according to the noise estimate; obtaining the pure voice signal of the current frame according to the spectral gain coefficient, which is beneficial to self Adapt to track the change of noise in each frame, enhance the voice of each frame, improve the clarity, intelligibility and comfort of the voice in a noisy environment, reduce the impact of noise on the voice processing system, and improve the working skills of the system .

For example, obtaining the spectral gain coefficient based on the noise estimate includes: calculating the prior signal-to-noise ratio based on the previous frame's noise estimate and the previous frame's pure speech signal; based on the current frame's noise estimate and the current frame's Power, calculate the posterior signal-to-noise ratio; obtain the spectral gain coefficient according to the a priori signal-to-noise ratio and the posterior signal-to-noise ratio, providing a way to obtain the spectral gain coefficient.

For example, obtaining the spectral gain coefficient according to the a priori signal-to-noise ratio and the posterior signal-to-noise ratio specifically includes: obtaining the spectral gain coefficient according to the following formula:

Where γ _k is the posterior signal-to-noise ratio, and ξ _k is the a priori signal-to-noise ratio,

p is the perceptual weighted order, and β is the order of the higher-order amplitude spectrum. A specific calculation formula for obtaining the spectral gain coefficient is provided, which is beneficial to obtain the spectral gain coefficient accurately and quickly.

For example, calculating the signal-to-noise ratio of several subbands specifically includes: calculating the signal-to-noise ratio of the several subbands by the following formula:

Where b is the serial number of the subband, k is the number of frequency points, B _low (b) is the starting point of the frequency point of the b subband in the Bark domain, and B _up (b) is the end point of the frequency point of the b subband in the Bark domain. The human ear is more sensitive to speech in the Bark domain and the human ear masking mechanism is conducive to improving the quality of human auditory perception.

For example, calculating the perceptual weighted order based on the signal-to-noise ratio of several subbands, specifically: calculating the perceptual weighted order p by the following formula:

p(b,k)=max{min[α ₁ SNR(b,k)+α ₂ ,p _max ],p _min }

Among them, α ₁ , α ₂ , p _min and p _max are experimental empirical values. A specific calculation formula for obtaining the perceptual weighted order is provided, which is beneficial to obtain the perceptual weighted order accurately and quickly.

E.g,

with

It can be obtained by the following method: query according to the input/output correspondence relationship of the pre-stored Γ function

with

It can be obtained in the following way: query according to the input/output correspondence between the pre-stored Φ function

with

The query method based on the correspondence relationship greatly reduces the calculation complexity of the method, reduces the calculation amount, and has more engineering applicability.

For example, the pure voice signal is obtained according to the spectral gain coefficient, specifically obtained by the following formula:

Wherein, Y _w (k) is the signal amplitude of the current frame, and provides a specific formula for obtaining a pure voice signal, which is beneficial to quickly and accurately obtain the pure voice signal of the current frame.

BRIEF DESCRIPTION

One or more embodiments are exemplarily illustrated by the pictures in the corresponding drawings. These exemplary descriptions do not constitute a limitation on the embodiments, and elements with the same reference numerals in the drawings represent similar elements. Unless otherwise stated, the figures in the drawings do not constitute a scale limitation.

FIG. 1 is a flowchart of the adaptive speech enhancement method according to the first embodiment of the present application;

2 is a schematic diagram of the Kaiser window function according to the first embodiment of the present application;

3 is a schematic diagram of the sub-steps of step 104 in the first embodiment of the present application;

4 is a flowchart of an adaptive speech enhancement method according to the second embodiment of the present application;

5 is a schematic diagram of modules for implementing an adaptive speech enhancement method according to the second embodiment of the present application;

6 is a flowchart of an adaptive speech enhancement method according to the third embodiment of the present application;

7 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.

detailed description

In order to make the purpose, technical solutions and advantages of the present application more clear, the following describes some embodiments of the present application in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, and are not used to limit the present application. The division of the following embodiments is for convenience of description, and should not constitute any limitation on the specific implementation manner of the present invention. The embodiments can be combined with each other and referenced without contradiction.

The first embodiment of the present application relates to an adaptive speech enhancement method, which includes: after receiving the speech signal, calculating the power of the current frame of the speech signal according to the speech signal; and performing the power of the current frame with the adaptively updated noise power Comparison; wherein, the adaptively updated noise power is the noise power of the previous frame of the speech signal; according to the comparison result, the noise estimate value of the current frame is obtained; according to the noise estimate value, the pure speech signal is obtained so that the noise The estimation is more accurate and reduces the complexity of the algorithm, which is beneficial to enhance the speech signal and improve the quality of human auditory perception. The implementation details of the adaptive speech enhancement method of this embodiment are described below in detail. The following content is only for implementation details provided for easy understanding, and is not necessary for implementing this solution.

The adaptive speech enhancement method of this embodiment can be applied in the field of speech signal processing technology and is applicable to low-power speech enhancement, speech recognition, and voice interaction products, including but not limited to headphones, stereos, mobile phones, televisions, automobiles, and wearable Electronic equipment such as equipment and smart home.

The specific process of the adaptive speech enhancement method in this embodiment is shown in FIG. 1 and includes:

Step 101: After receiving the voice signal, calculate the power of the current frame of the voice signal according to the voice signal.

Specifically, after receiving the voice signal, the voice signal can be transformed in the time domain and the frequency domain to obtain the frequency domain voice. The frequency domain voice is a coordinate system used to describe the frequency characteristics of the voice signal. The transformation of the speech signal from the time domain to the frequency domain is mainly realized by Fourier series and Fourier transform. The periodic signal depends on the Fourier series, and the non-periodic signal depends on the Fourier transform. Generally, the wider the time domain of a speech signal, the shorter the frequency domain. The power of the current frame is obtained according to the amplitude of the current frame of the frequency domain speech signal.

In one example, assuming that the sampling rate of the voice signal is Fs = 8000 Hz, the data length is generally between 8 ms and 30 ms. The processing of the voice signal can be 64 points and overlap 64 points in the previous frame, then the Processing 128 points, that is, the overlap rate between the current frame and the previous frame is 50%, but in practical applications, it is not limited to this. Perform pre-emphasis processing on the received voice signal to improve the high-frequency component of the voice signal. The specific operation can be:

Among them, α is a smoothing factor. In this embodiment, α can take a value of 0.98, but in actual applications, different settings can be made according to actual needs. y(n) is the sampled speech signal of the current frame, and y(n-1) is the sampled speech signal of the previous frame.

Further, after pre-emphasis processing, in order to reduce the spectral energy leakage, the interception function can be used to truncate the signal. The truncation function is called the window function, that is, the speech signal is windowed. The design of the window function can be selected according to different application scenarios Rectangular windows, Hamming windows, Hanning windows, and Gaussian window functions can be flexibly selected in actual design. In this embodiment, the Kaiser window function shown in FIG. 2 is used, and the overlap is 50%.

In addition, since the calculation of the power of the current frame of the speech signal is usually processed in the frequency domain, the windowed data can be subjected to the fast Fourier transform FFT by the following formula to obtain the frequency domain signal.

Where k represents the number of frequency points, w(n) is the Kasier window function, and N is 128, that is, 128 points are actually processed at a time. This embodiment only takes N=128 as an example, but it is not limited to this in practical applications. . m is the number of frames, and the value of n can be from 1 to 128. For the calculation of the power of the current frame, the amplitude of the transformed 128-frequency frequency domain signal can be obtained, and the amplitude of the 128 frequency points can be squared respectively.

Step 102: Compare the power of the current frame with the noise power of the previous frame.

Specifically, the noise power of the previous frame is the adaptively updated noise power. In practical applications, the noise power can be initialized according to the experimental value first. If the current frame is the first frame, the power of the current frame can be compared with the initialized noise power. The adaptively updated noise power means that the noise power of different frames is different. After the initial value of the noise power is set, the noise power of the current frame can be adaptively updated during the iteration process. For example, compare the power of 128 frequency points of the current frame with the power of 128 frequency points of the previous frame, and adaptively update the noise power corresponding to each frequency point of the current frame.

Step 103: Acquire the noise estimate value of the current frame according to the comparison result and the noise power of the previous frame.

Specifically, if the power of the current frame is greater than the noise power of the previous frame, the noise power of the previous frame can be adaptively increased as the noise estimate of the current frame, for example, an incremental step can be preset Long, according to the incremental step to increase adaptively. Preferably, the incremental step size can also be updated adaptively during the iteration. If the power of the current frame is less than the noise power of the previous frame, the noise power of the previous frame can be adaptively reduced, and the reduced noise power can be used as the noise estimation value of the current frame.

Step 104: Obtain a pure voice signal according to the noise estimate.

Specifically, step 104 may include the following sub-steps as shown in FIG. 3:

Step 1041: Calculate the a priori signal-to-noise ratio based on the noise estimate of the previous frame and the pure voice signal of the previous frame.

Specifically, the calculation of the a priori signal-to-noise ratio can use the classic improved decision guidance method, and the a priori signal-to-noise ratio can be calculated according to the following formula

Where a is the smoothing factor and ξ _min is the preset empirical value,

Is the pure voice signal power of the previous frame, and λ is the frame number of the current frame. In an example, the value of a may be 0.98, and ξ _min may be -15dB according to experience, but it is not limited to this in practical applications.

It should be noted that, in this embodiment, the prior signal-to-noise ratio is calculated by the above formula as an example, but it is not limited to this in practical applications.

Step 1042: Calculate the posterior signal-to-noise ratio according to the current frame noise estimate and the current frame power.

Specifically, the posterior signal-to-noise ratio can be calculated according to the following formula:

among them,

Is the power of the current frame, and λ _d (k) is the noise estimate of the current frame.

It should be noted that, in this embodiment, the posterior signal-to-noise ratio is calculated by the above formula as an example, but it is not limited to this in practical applications. Moreover, in this embodiment, the execution order of step 1041 and step 1042 is not limited. In practical applications, step 1042 may be executed first and then step 1041 may be executed, or step 1041 and step 1042 may be executed simultaneously.

Step 1043: Calculate the perceptual weighted order p.

Specifically, the parameter p can be calculated adaptively according to the sub-band signal-to-noise ratio and the characteristics of the Bark domain. Specifically, in the frequency spectrum of the voice signal, the Bark domain can be divided into several subbands, for example: the Bark domain can be divided into 18 subbands, and the upper limit frequency of each subband is: [100,200,300,400,510,630,770,920,1080,1270,1480 , 1720, 2000, 2320, 2700, 3150, 3700, 4400], according to the human ear is more sensitive to speech in the Bark domain, calculate the signal-to-noise ratio of subbands, and calculate the signal-to-noise ratio of several subbands by the following formula:

Where b is the serial number of the sub-band, the serial number of the sub-band is 1≤b≤18, k is the number of frequency points, B _low (b) is the starting point of the frequency point of the b sub-band of the Bark domain, and B _up (b) is the first of the Bark domain. b The end point of the frequency of the subband. Further, the parameter p can be calculated by the following formula:

p(b,k)=max{min[α ₁ SNR(b,k)+α ₂ ,p _max ],p _min }

Among them, α ₁ , α ₂ , p _min and p _max are experimental experience values. In this embodiment, for example, the experimental experience values can be as follows: α ₁ =0.251, α ₂ =-1.542, p _max = 4, p _min = -1, but it is not limited to this in practical applications.

Step 1044: Calculate the order β of the higher-order amplitude spectrum.

Specifically, the order β of the higher-order amplitude spectrum is calculated by the following formula:

Among them, F _s is the sampling frequency, f(k)=kFs/N, which represents the number of frequencies represented by each frequency point after FFT, and β _max and A are experimental empirical values. For example, in this embodiment, the values of the above empirical values may be as follows: β _max = 0.8, β _min = 0.2, A = 165.4 Hz, but it is not limited to this in practical applications.

It should be noted that this embodiment does not limit the execution order of step 1043 and step 1044. In practical applications, step 1044 can be executed first and then step 1043, or step 1043 and step 1044 can be executed simultaneously.

Step 1045: Obtain the spectral gain coefficient according to the a priori signal-to-noise ratio, the posterior signal-to-noise ratio, the perceptual weighted order, and the order of the higher-order amplitude spectrum.

Specifically, the core idea of obtaining the spectral gain coefficient can be Bayesian short-term amplitude spectrum estimation, and its cost function is:

Similar to the derivation process of the classic MMSE estimator, we can get:

Assuming that both X _k and D _k are complex Gaussian random distributions, we can obtain:

among them,

It is a theoretical formula of a priori signal-to-noise ratio. Since it is difficult to obtain the pure voice power λ _x (k) of the current frame in practice, the following formula can usually be used to estimate and approximate the a priori signal-to-noise ratio ξ _k :

The formula for calculating the spectral gain coefficient G from the above derivation formula is as follows:

From the above expression of the spectral gain coefficient G, it can be known that the spectral gain coefficient can be calculated according to the prior signal-to-noise ratio ξ _k and the posterior signal-to-noise ratio γ _k , parameters β and p.

Further, considering the complexity of the Γ function and the Φ function, the spectral gain coefficient can be calculated in the form of a look-up table. The specific correspondence between the input and output of the Γ function and the Φ function can be pre-stored. Input and output the corresponding relationship table to query, when the input is

, The corresponding output value

When the input is

, The corresponding output value

Query in the input and output correspondence table of the pre-stored Φ function: When the input is

, The corresponding output value

When the input is

Corresponding output value

Finally, the found output value is brought into the calculation expression of the spectral gain coefficient to obtain the spectral gain coefficient, which greatly reduces the calculation complexity of the method.

It should be noted that, in this embodiment, the spectral gain coefficient is obtained by the expression of the spectral gain coefficient G as an example, but it is not limited to this in practical applications.

Step 1046: Obtain the pure voice signal of the current frame according to the spectral gain coefficient.

Specifically, after obtaining the spectral gain coefficient, the pure voice signal of the current frame can be calculated according to the following formula

Where Y _w (k) is the signal amplitude of the current frame.

It should be noted that in this embodiment, only pure voice signals are used.

The pure voice signal obtained by the above calculation formula is taken as an example. In practical applications, any method for obtaining the pure voice signal of the current frame through the spectral gain coefficient is within the protection scope of this embodiment.

Compared with the prior art, this embodiment has the following technical effects: First, compared with the traditional noise estimation, there is no need to detect voiced and unvoiced speech, the noise is updated at the same time in the noise frame and the speech frame, and the noise can be tracked adaptively The change. Second, compared with the traditional quantile noise estimation, there is no need to store the previous D frame data and sort by power, which reduces the algorithm resource overhead. Third, when calculating the spectral gain coefficient, the human ear masking mechanism and the sensitivity to noise and spectral amplitude are also considered. The adaptive update parameters p and β are compared with the traditional generalized weighted higher-order spectral estimator for speech enhancement. Reduced the amount of calculation, and has more engineering applicability.

The second embodiment of the present application relates to an adaptive speech enhancement method. The power of the current frame in this embodiment is specifically: the log power spectrum of the current frame; the noise power in this embodiment is specifically the log quantile . Under the logarithmic coordinate system, the comparison between the log power spectrum of the current frame and the log quantile of the previous frame is more accurate, thereby facilitating subsequent accurate processing.

The specific process of the adaptive speech enhancement method in this embodiment is shown in FIG. 4 and includes:

Step 201: After receiving the voice signal, calculate the log power spectrum of the current frame of the voice signal according to the voice signal.

Specifically, step 201 is substantially the same as step 101, the difference is that the power calculated in step 101 is the current frame power, and the log power spectrum calculated in this step is the current frame, that is, the calculated current The power of the frame is logarithmic. For example, the processing of the voice signal of the current frame can be 64 points and overlap the 64 points of the previous frame, then the actual processing of 128 points at a time can obtain the power value of 128 points, and the power value of 128 points The logarithmic power corresponding to 128 frequency points can be obtained by taking the logarithms respectively, and the 128 logarithmic powers constitute the logarithmic power spectrum of the current frame.

Step 202: Acquire a density function according to the comparison result of the log power spectrum of the current frame and the log quantile of the previous frame.

Specifically, in this embodiment, the initial log quantile and the initial density function may be preset. That is, the density function and the log quantile can be initialized according to the experimental value first. For example, the log quantile after the initialization based on the experimental value can be: lq(1,k)=8. If the current frame is the first frame, the log power spectrum of the first frame can be compared with the initial log quantile. In the subsequent processing, the density function of the current frame can be updated according to the log power spectrum of the current frame and the log quantile of the previous frame. Specifically, it can be updated according to the following formula:

Where λ is the frame number of the current frame, k is the number of frequency points, β is the empirical value of the experiment, ξ is the preset threshold threshold, and log(|Y _w (λ)| ² ) is the logarithmic power spectrum of the current frame, lq(λ-1,k) is the log quantile of the previous frame.

It should be noted that, in this embodiment, the density function of the current frame is obtained by using the above density function calculation formula as an example, but it is not limited to this in practical applications.

Step 203: Obtain the incremental step size of the current frame according to the density function.

Specifically, the initial incremental step size can be set in advance. For example, the initial incremental step size obtained after initialization based on the experimental value may be: delta(1,k)=40. In the subsequent processing, the incremental step of the current frame is updated according to the density function of the previous frame, which can be specifically updated according to the following formula:

Among them, K is the incremental step size control factor. If the current frame is the first frame, the incremental step size control factor K is the initial incremental step size.

It should be noted that, in this embodiment, the incremental step of the current frame is obtained by the above incremental step calculation formula as an example. In practical applications, any method for obtaining the incremental step of the current frame according to the density function is This embodiment is within the protection scope.

Step 204: Obtain the log quantile of the current frame according to the log quantile of the previous frame and the incremental step size of the current frame.

Specifically, if the log power spectrum of the current frame is greater than or equal to the log quantile of the previous frame, the log quantile of the previous frame can be adaptively increased according to the incremental step size to obtain the current frame’s Log quantile; if the log power spectrum of the current frame is less than the log quantile of the previous frame, the log quantile of the previous frame can be adaptively reduced according to the increment step to get the current frame Of the log quantile.

Step 205: Acquire the noise estimate of the current frame according to the log quantile of the current frame.

Specifically, after obtaining the log quantile lq(λ,k) of the current frame, the noise estimate can be calculated by the following formula:

Step 206: Obtain a pure voice signal according to the noise estimate.

Step 206 is substantially the same as step 104 in the first embodiment, and will not be repeated here to avoid repetition.

For convenience of description, this embodiment provides a block diagram as shown in FIG. 5 to explain the adaptive speech enhancement method in this embodiment:

The pre-emphasis module 301 mainly implements the function of a high-pass filter to filter out low-frequency components and enhance high-frequency voice components, that is, to filter the received noisy voice signal y(n)=x(n)+d(n), to filter out low frequencies Components, where x(n) is a pure voice signal and d(n) is a noise signal. The de-pre-emphasis module 310 is mainly a low-pass filter. The de-pre-emphasis module 310 and the pre-emphasis module 301 are reciprocal processes, and the combination of the two can achieve the effect of de-reverberation.

The windowing module 302 is mainly to avoid the occurrence of sudden changes in overlapping signals. The window synthesis module 309 mainly removes the effect of the window function on the output of pure voice signals. In this embodiment, the windowing module 302 and the window synthesis module 309 use the same window function in the implementation process. Therefore, the window function must be a power-preserving mapping, that is, the sum of the squared windows of the overlapping parts of the voice signal must be 1, as shown in the following formula Show:

w ² (N)+w ² (N+M)=1

Among them, N is the number of FFT processing points, the value is 128, M is the frame length value is 64.

The fast Fourier transform FFT module 303 is mainly used for mutual conversion between the time domain signal and the frequency domain signal. The FFT module 303 and the inverse FFT module 308 are inverse processes of each other. The FFT module 303 converts the time domain signal into a frequency domain signal, and after conversion into the frequency domain signal, the signal amplitude Y _w can be obtained according to the frequency domain signal. The inverse FFT module 308 converts the frequency domain signal into a time domain signal.

The power spectrum calculation module 304 is configured to obtain the power P of the current frame by squaring the amplitude obtained from the frequency domain signal. The log power spectrum calculation module 305 is configured to log the power of the current frame to obtain the log power spectrum of the current frame. The power spectrum calculation module 304 and the logarithm calculation module 305 are mainly a pre-processing process before noise estimation.

The noise value estimation module 306 mainly performs noise estimation on a noisy speech signal, and estimates an accurate noise signal as much as possible. The noise estimation value is mainly obtained through noise estimation according to the principle of adaptive quantile noise estimation.

The calculating spectral gain coefficient module 307 is mainly to complete the calculation of the spectral gain coefficient according to the noise estimate value and the power of the noisy speech signal to obtain the spectral gain coefficient G. Specifically, the calculation of the spectral gain coefficient is mainly based on the principle of the generalized weighted high-order short-time spectral amplitude estimator.

Further, the pure voice signal in the frequency domain is obtained according to the spectral gain coefficient G and the signal amplitude Y _w

Then, the frequency domain signal is converted into a time domain signal through the inverse FFT module 308, and then processed by the window synthesis module 309 and the de-pre-emphasis module 310 to output a pure voice signal in the time domain.

Thus completing the enhancement of the voice signal.

Compared with the prior art, this embodiment compares the log power spectrum of the current frame of the noisy speech with the log quantile of the previous frame to modify the log quantile to obtain the noise estimate value. It can avoid the detection of voice signals, a large amount of data storage and power spectrum sorting operations in the prior art, and reduce the algorithm resource overhead. Moreover, the logarithmic coordinates can amplify the details, and can extract signals that cannot be extracted under the general coordinate scale, which is conducive to compressing the dynamic range of the values, so that in the logarithmic coordinate system, the logarithmic power spectrum of the current frame and the previous frame The comparison between the quantiles is more accurate, which facilitates subsequent accurate processing.

The third embodiment of the present application relates to an adaptive speech enhancement method. In this embodiment, a specific formula is provided to adaptively increase the log quantile of the previous frame according to the incremental step size to obtain The log quantile of the current frame is beneficial to obtain the log quantile of the current frame directly, quickly and accurately.

The specific process of the adaptive speech enhancement method in this embodiment is shown in FIG. 6 and includes:

Step 401: After receiving the voice signal, calculate the log power spectrum of the current frame of the voice signal according to the voice signal.

Step 402: Acquire a density function according to the comparison result of the log power spectrum of the current frame and the log quantile of the previous frame.

Step 403: Obtain the incremental step size of the current frame according to the density function.

Steps 401 to 403 are substantially the same as steps 201 to 203 in the second embodiment, and will not be repeated here to avoid repetition.

Step 404: Determine whether the log power spectrum of the current frame is greater than or equal to the log quantile of the previous frame. If yes, perform step 405; otherwise, perform step 406.

Step 405: Calculate the log quantile of the current frame according to the formula lq(λ,k)=lq(λ-1,k)+α·delta(λ,k)/β.

That is to say, when log(|Y _w (λ)| ² )≥lq(λ-1,k), the log quantile of the current frame is: according to the increment of the log quantile of the previous frame The step size is adaptively increased. Specifically, it is adaptively increased by the following formula, and the log quantile of the current frame is calculated: lq(λ,k)=lq(λ-1,k)+α·delta(λ,k)/ β. Where λ is the current number of frames, k is the number of frequency points, and α and β are the experimental experience values. In this embodiment, the experimental experience values may be: α=0.25, β=67, but in practical applications This is limited.

Step 406: Calculate the log quantile of the current frame according to the formula lq(λ,k)=lq(λ-1,k)-(1-α)·delta(λ,k)/β.

In other words, when log(|Y _w (λ)| ² )<lq(λ-1,k), the log quantile of the current frame is: according to the increment of the log quantile of the previous frame The step size is adaptively reduced. Specifically, it is adaptively reduced by the following formula, and the log quantile of the current frame is calculated: lq(λ,k)=lq(λ-1,k)-(1-α)·delta (λ,k)/β.

Step 407: According to the formula

Get the noise estimate of the current frame.

Step 408: Obtain the pure voice signal according to the noise estimate.

Steps 407 to 408 are substantially the same as steps 205 to 206 in the second embodiment, and will not be repeated here to avoid repetition.

Compared with the prior art, this embodiment provides a specific formula for adaptively increasing the log quantile of the previous frame according to the incremental step size to obtain the log quantile of the current frame, which is beneficial to the current frame. The incremental step size of is directly, quickly and accurately obtains the log quantile of the current frame, which is beneficial to the noise estimation according to the log quantile of the current frame.

The fourth embodiment of the present application relates to an electronic device, as shown in FIG. 7, including at least one processor 501; and, a memory 502 communicatively connected to the at least one processor 501; wherein, the memory 502 stores An instruction executed by one processor 501, the instruction is executed by at least one processor 501, so that the at least one processor 501 can execute the adaptive speech enhancement method described above.

Among them, the memory 502 and the processor 501 are connected by a bus. The bus may include any number of interconnected buses and bridges. The bus connects one or more processors 501 and various circuits of the memory 502 together. The bus can also connect various other circuits such as peripheral devices, voltage regulators, and power management circuits, etc., which are well known in the art, and therefore, they will not be described further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver can be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other devices on the transmission medium. The data processed by the processor 501 is transmitted on the wireless medium through the antenna. Further, the antenna also receives the data and transmits the data to the processor 501.

The processor 501 is responsible for managing the bus and general processing, and can also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 502 may be used to store data used by the processor 501 when performing operations.

Those skilled in the art can understand that all or part of the steps in the method of the above embodiments can be completed by instructing relevant hardware through a program, which is stored in a storage medium, and includes several instructions to make a device (may be A single chip microcomputer, a chip, etc.) or a processor (processor) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program code .

Persons of ordinary skill in the art may understand that the above-mentioned embodiments are specific embodiments for implementing the present application, and in practical applications, various changes may be made in form and detail without departing from the spirit and range.

Claims

An adaptive speech enhancement method, characterized in that it includes:

After receiving the voice signal, according to the voice signal, calculate the power of the current frame of the voice signal;

Compare the power of the current frame with the noise power of the previous frame;

Obtaining the noise estimate value of the current frame according to the comparison result and the noise power of the previous frame;

Based on the noise estimate, a pure voice signal is obtained.
The adaptive speech enhancement method according to claim 1, wherein

The power of the current frame is: the log power spectrum of the current frame;

The noise power of the previous frame is: the log quantile of the previous frame.
The adaptive speech enhancement method according to claim 2, wherein the acquiring the noise estimate value of the current frame according to the comparison result and the noise power of the previous frame includes:

Obtain the incremental step size of the current frame according to the comparison result of the log power spectrum of the current frame and the log quantile of the previous frame;

Acquiring the log quantile of the current frame according to the log quantile of the previous frame and the incremental step size of the current frame;

Acquire the noise estimate of the current frame according to the log quantile of the current frame.
The adaptive speech enhancement method according to claim 3, wherein the log score of the current frame is obtained according to the log quantile of the previous frame and the incremental step size of the current frame Number of digits, including:

If the log power spectrum of the current frame is greater than or equal to the log quantile of the previous frame, the log quantile of the previous frame is adaptively increased according to the incremental step size to obtain the current frame Of the log quantile;

If the log power spectrum of the current frame is less than the log quantile of the previous frame, the log quantile of the previous frame is adaptively reduced according to the incremental step size to obtain the current frame’s Log quantile.
The adaptive speech enhancement method according to claim 3, further comprising:

Set the initial log quantile and initial increment step in advance.
The adaptive speech enhancement method according to claim 3, wherein the current frame is obtained according to a comparison result of the log power spectrum of the current frame and the log quantile of the previous frame The incremental step size includes:

Obtain a density function according to a comparison result of the log power spectrum of the current frame and the log quantile of the previous frame;

Obtain the incremental step size of the current frame according to the density function.
The adaptive speech enhancement method according to claim 6, wherein the acquisition probability density includes:

The density function density is obtained by the following formula:

Where λ is the frame number of the current frame, k is the number of frequency points, β is the experimental value, ξ is the preset threshold threshold, and log(|Y w (λ)| 2 ) is The log power spectrum of the current frame, where lq(λ-1,k) is the log quantile of the previous frame.
The adaptive speech enhancement method according to claim 6, wherein the obtaining the incremental step size of the current frame according to the density function includes:

The incremental step delta is obtained by the following formula:

Wherein, λ is the frame number of the current frame, K is an incremental step size control factor, and density(λ-1, k) is the density function of the previous frame.
The adaptive speech enhancement method according to claim 4, wherein the log quantile of the previous frame is adaptively increased according to the incremental step size to obtain the log quantile of the current frame Number, including:

The log quantile of the current frame is obtained by the following formula:

lq(λ,k)=lq(λ-1,k)+α·delta(λ,k)/β

The step of adaptively reducing the log quantile of the previous frame according to the increment step to obtain the log quantile of the current frame includes:

The log quantile of the current frame is obtained by the following formula:

lq(λ,k)=lq(λ-1,k)-(1-α)·delta(λ,k)/β

Where λ is the frame number of the current frame, k is the number of frequency points, α is the empirical value of the experiment, and delta(λ, k) is the incremental step size.
The adaptive speech enhancement method according to claim 3, wherein the obtaining the noise estimate value of the current frame according to the log quantile of the current frame includes:

The noise estimate value of the current frame is obtained by the following formula:

Among them, the
For noise estimates, the lq(λ, k) is the log quantile of the current frame, the λ is the frame number of the current frame, and the k is the number of frequency points.
The adaptive speech enhancement method according to claim 1, wherein the acquiring the pure speech signal according to the noise estimation value comprises:

Obtain the spectral gain coefficient according to the noise estimate;

Obtain the pure voice signal of the current frame according to the spectral gain coefficient.
The adaptive speech enhancement method according to claim 11, wherein the acquiring the spectral gain coefficient according to the noise estimation value comprises:

Calculate the a priori signal-to-noise ratio according to the noise estimate value of the previous frame and the pure speech signal of the previous frame;

Calculate the posterior signal-to-noise ratio according to the estimated noise value of the current frame and the power of the current frame;

The spectral gain coefficient is obtained according to the a priori signal-to-noise ratio and the a posteriori signal-to-noise ratio.
The adaptive speech enhancement method according to claim 12, wherein the obtaining of the spectral gain coefficient according to the prior signal-to-noise ratio and the posterior signal-to-noise ratio includes:

Obtain the spectral gain coefficient G according to the following formula:

Where γ k is the a posteriori signal-to-noise ratio, and ξ k is the a priori signal-to-noise ratio, the
The p is the perceptual weighted order, and the β is the order of the higher-order amplitude spectrum.
The adaptive speech enhancement method according to claim 13, wherein the perceptual weighting order is obtained in the following manner:

In the frequency spectrum of the voice signal, the frequency band of the Bark domain is divided into several subbands;

Calculate the signal-to-noise ratio of the several subbands:

The perceptual weighting order is calculated according to the signal-to-noise ratio of the several subbands.
The adaptive speech enhancement method according to claim 14, wherein the calculating the signal-to-noise ratio of the several sub-bands includes:

The signal-to-noise ratio SNR of the several sub-bands is calculated by the following formula:

Where b is the serial number of the subband, k is the number of frequency points, B low (b) is the starting point of the frequency point of the b subband in the Bark domain, and B up (b) is the Bark The end of the frequency point of the b-th subband of the domain.
The adaptive speech enhancement method according to claim 15, wherein the calculation of the perceptual weighting order according to the signal-to-noise ratio of the several sub-bands is:

The perceptual weighted order p is calculated by the following formula:

p(b,k)=max{min[α 1 SNR(b,k)+α 2 ,p max ],p min }

Wherein, the α 1 , the α 2 , the p min and the p max are experimental empirical values.
The adaptive speech enhancement method according to claim 13, wherein the order of the higher-order amplitude spectrum is obtained in the following manner:

In the frequency spectrum of the speech signal, the Bark domain is divided into several subbands;

The order β of the higher-order amplitude spectrum is calculated by the following formula:

Where F s is the sampling frequency, the β min , the β max , the p min , the p max and the A are experimental empirical values, and the b is the serial number of the subband, k is the number of frequency points, the B low (b) is the starting point of the frequency point of the b-th sub-band of the Bark domain, the B up (b) is the end point of the frequency point of the b-th sub-band of the Bark domain, and the f (k)=kFs/N is the frequency of the k-th frequency point after performing fast Fourier transform on the received speech signal.
The adaptive speech enhancement method according to claim 13, wherein:

Said
with
By querying the corresponding relationship between the input and output of the pre-stored Γ function:

Said
with
It can be obtained by querying the corresponding relationship between input and output of the pre-stored Φ function.
The adaptive speech enhancement method according to claim 13, wherein the obtaining of pure speech signals according to the spectral gain coefficients includes:

Get pure voice signal by the following formula

Wherein, Y w (k) is the signal amplitude of the current frame.
An electronic device, characterized in that it includes:

At least one processor; and,

A memory communicatively connected to the at least one processor; wherein,

The memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to execute any one of claims 1 to 19 Adaptive speech enhancement method.