CN111145768B - Speech enhancement method based on WSHRRPCA algorithm

Info

Publication number: CN111145768B
Application number: CN201911290388.2A
Authority: CN
Inventors: 罗勇江; 杨腾飞; 杨家利; 毕鲁浩; 汤建龙; 王钟慧
Original assignee: Xidian University
Current assignee: Xi'an Shengxin Technology Co ltd
Priority date: 2019-12-16
Filing date: 2019-12-16
Publication date: 2022-05-17
Anticipated expiration: 2039-12-16
Also published as: CN111145768A

Abstract

The invention discloses a speech enhancement method based on the WSHRRPCA algorithm, which mainly addresses the poor speech enhancement performance of existing algorithms in colored-noise environments. The method comprises the following steps: establishing a whitening model from samples of the noisy speech, whitening the noisy speech in the time domain with this model, and then obtaining the time-frequency amplitude spectrum and time-frequency phase spectrum of the whitened noisy speech by short-time Fourier transform; scrambling and rearranging the order of the spectral elements in each column of the time-frequency amplitude spectrum by hash function mapping, decomposing the rearranged spectrum with a robust principal component analysis algorithm to obtain an enhanced time-frequency amplitude spectrum, and restoring the original order; and combining the enhanced time-frequency amplitude spectrum with the time-frequency phase spectrum to form an enhanced time-frequency spectrum, reconstructing the complete time-domain whitened enhanced speech signal, and applying inverse whitening to that signal with the whitening model to obtain the enhanced speech. The invention can effectively remove various kinds of noise from noisy speech, and can be applied to speech receiving systems, speech coding systems and speech recognition systems.

Description

Speech enhancement method based on WSHRRPCA algorithm

Technical Field

The invention belongs to the technical field of signal processing, further relates to speech signal processing, and in particular relates to a speech enhancement method based on the Whitened Short-time-Fourier-spectrum Hash-Rearranged Robust Principal Component Analysis (WSHRRPCA) algorithm. The method can be used in speech receiving systems and speech recognition systems: in a speech receiving system it performs speech enhancement and noise reduction, and in the front-end preprocessing part of a speech recognition system it improves the signal-to-noise ratio of the input signal, thereby improving the anti-interference capability and the recognition rate of the system.

Background

Speech enhancement technology is widely used in voice calls, teleconferencing, on-site recording, military eavesdropping, hearing aids, speech recognition equipment and similar fields, and the preprocessing modules of many speech coding and speech recognition systems rely on it. Traditional speech enhancement algorithms fall into three main categories: spectral subtraction, statistical-model-based algorithms, and subspace algorithms. Each has limitations in practice. Spectral subtraction usually enhances the signal on the basis of an estimate of the noise spectrum; when non-stationary noise is present, the noise spectrum estimate becomes inaccurate, the enhancement degrades, and the algorithm tends to produce unnatural-sounding "musical noise". Statistical-model-based algorithms generally need to assume that the speech and noise signals are statistically independent and Gaussian distributed. Subspace algorithms need to assume that the clean speech subspace and the noise subspace are orthogonal, an assumption that is often unrealistic in practice. To overcome these limitations, researchers have turned to new theories. In recent years, compressed sensing based on convex optimization, together with the matrix rank minimization and low-rank matrix recovery theories derived from it, has become a research hotspot in digital signal processing. Robust principal component analysis, a low-rank and sparse matrix decomposition algorithm from low-rank matrix recovery theory, has also been applied to speech enhancement with good results.
However, speech enhancement based on robust principal component analysis has the following disadvantages. First, it performs well in white noise, but the energy distribution of colored noise differs from that of white noise, so its ability to remove colored noise is insufficient. Second, when noise is removed, part of the low-rank speech components are removed as well, which loses speech content and degrades the enhancement.

Sun et al., in their paper "A novel speech enhancement method based on constrained low-rank and sparse matrix decomposition" (Speech Communication, 60: 44-55, 2014), propose a speech enhancement method based on a matrix decomposition algorithm with low-rank and sparsity constraints. The method is implemented in three steps. First, a short-time Fourier transform is used to obtain the time-frequency amplitude spectrum and time-frequency phase spectrum of the noisy speech, and a three-point median filter smooths the time-frequency amplitude spectrum. Second, the time-frequency amplitude spectrum of the noisy speech is decomposed by a constrained low-rank and sparse matrix decomposition algorithm into a low-rank matrix and a sparse matrix, and binary time-frequency masking is applied to the sparse matrix. Third, the time-frequency spectrum of the enhanced speech is rebuilt from the sparse matrix and the phase spectrum of the noisy speech, and the enhanced speech is reconstructed in the time domain by inverse short-time Fourier transform. The main problem of this method is that it only reduces the likelihood of low-rank speech components being wrongly removed by limiting the rank of the low-rank matrix; it does not solve the problem fundamentally, so part of the low-rank speech is still removed as noise. In addition, the method tightens the sparsity constraint on the sparse matrix, so that under strong background noise a large number of speech components are removed and speech quality is reduced.

Disclosure of Invention

In view of the deficiencies of the prior art described above, the invention aims to provide a speech enhancement method based on the whitened short-time Fourier spectrum hash-rearranged robust principal component analysis algorithm, which obtains high-quality enhanced speech in noisy environments and is mainly applicable to speech receiving systems, speech coding systems and speech recognition systems.

The specific idea for achieving this purpose is as follows. First, a whitening model is established from part of the samples of the noisy speech, and the noisy speech is whitened in the time domain with this model. A short-time Fourier transform is then applied to the whitened noisy speech to obtain its time-frequency amplitude spectrum and time-frequency phase spectrum. Next, the order of the spectral elements in each column of the time-frequency amplitude spectrum is scrambled and rearranged by hash function mapping to obtain a rearranged time-frequency amplitude spectrum. The rearranged time-frequency amplitude spectrum is decomposed by a robust principal component analysis algorithm to obtain an enhanced time-frequency amplitude spectrum, and the original order of the spectral elements in each column is restored. The enhanced time-frequency amplitude spectrum and the time-frequency phase spectrum are then combined into an enhanced time-frequency spectrum, from which the complete time-domain whitened enhanced speech signal is reconstructed. Finally, inverse whitening is applied to this signal with the established whitening model to obtain the enhanced speech. The invention can be used for speech enhancement in various speech processing systems to restore the quality and intelligibility of speech heavily corrupted by noise.

The method specifically comprises the following steps:

(1) generating whitened noisy speech x_w(n):

(1a) selecting an integer N in the range [1000, 1500] as the number of sample points, and using the first N sampling points of the noisy speech x(n) to establish a whitening filter;

(1b) whitening the noisy speech x(n) with the whitening filter obtained in step (1a) to obtain the whitened noisy speech x_w(n);

(2) generating the time-frequency amplitude spectrum |D_w| and the time-frequency phase spectrum ∠D_w of the whitened noisy speech x_w(n):

(2a) selecting the duration of each speech frame in the range of [20, 40] milliseconds, selecting a value in the range of [25%, 75%] of the frame length as the shift of each frame relative to the previous frame, and dividing the whitened noisy speech x_w(n) into a plurality of short-time speech frames;

(2b) selecting, in time order, one unprocessed short-time speech frame from all the short-time speech frames as the frame currently to be processed;

(2c) performing a Fourier transform on the short-time speech frame currently to be processed to obtain the Fourier spectrum of the frame, and calculating the amplitude and phase of the Fourier spectrum to obtain a Fourier magnitude spectrum and a Fourier phase spectrum;

(2d) judging whether all short-time speech frames have been processed; if so, executing step (2e), otherwise returning to step (2b);

(2e) taking the Fourier magnitude spectrum of each frame as a column vector and arranging the column vectors in time order to form the time-frequency amplitude spectrum |D_w| of the whitened noisy speech; taking the Fourier phase spectrum of each frame as a column vector and arranging the column vectors in time order to form the time-frequency phase spectrum ∠D_w of the whitened noisy speech;

(3) generating the rearranged time-frequency amplitude spectrum |D_w|_r:

(3a) selecting, in time order, one unprocessed column from all the column vectors of the time-frequency amplitude spectrum |D_w| as the Fourier magnitude spectrum currently to be processed;

(3b) generating a new arrangement order for the spectral elements in the current Fourier magnitude spectrum with a hash function, and rearranging the spectral elements according to this order to obtain a rearranged Fourier magnitude spectrum;

(3c) judging whether all column vectors of |D_w| have been processed; if so, executing step (3d), otherwise returning to step (3a);

(3d) taking all the rearranged Fourier magnitude spectra as column vectors and arranging them in time order to form the rearranged time-frequency amplitude spectrum |D_w|_r;

(4) generating the enhanced time-frequency amplitude spectrum |S_w|:

(4a) selecting an integer Q in the range [6, 10] as the number of columns used to estimate the noise intensity in the rearranged time-frequency amplitude spectrum |D_w|_r, and estimating the noise intensity in |D_w|_r from the first Q columns of rearranged Fourier magnitude spectra of |D_w|_r;

(4b) according to the noise intensity estimated in (4a), enhancing the rearranged time-frequency amplitude spectrum |D_w|_r with a robust principal component analysis algorithm to generate the sparse rearranged time-frequency amplitude spectrum |S_w|_r;

(4c) restoring, according to the arrangement order generated in (3b), the order of the Fourier magnitude spectrum elements in all columns of |S_w|_r to obtain the enhanced time-frequency amplitude spectrum |S_w|;

(5) composing the enhanced time-frequency spectrum S_w:

combining the enhanced time-frequency amplitude spectrum |S_w| and the time-frequency phase spectrum ∠D_w to form the enhanced time-frequency spectrum S_w;

(6) reconstructing the whitened enhanced speech y_w(n):

(6a) selecting, in time order, one unprocessed column from all the column vectors of the enhanced time-frequency spectrum S_w as the enhanced Fourier spectrum currently to be processed;

(6b) performing an inverse Fourier transform on the enhanced Fourier spectrum currently to be processed to obtain one frame of whitened short-time enhanced speech;

(6c) judging whether all column vectors of S_w have been processed; if so, executing step (6d), otherwise returning to step (6a);

(6d) reconstructing all the whitened short-time enhanced speech frames into the complete whitened enhanced speech y_w(n) using the overlap-add method;

(7) generating the enhanced speech y(n):

applying inverse whitening to the whitened enhanced speech y_w(n), using the whitening filter obtained in (1a), to obtain the enhanced speech y(n).

Compared with the prior art, the invention has the following advantages:

First, the invention adds a whitening step. When the background noise is colored, whitening converts it into approximately white noise, which improves the ability to remove colored noise; moreover, the whitening step does not impair the noise reduction capability of the invention in white-noise environments.

Second, before generating the enhanced time-frequency amplitude spectrum, the invention uses hash function mapping to scramble the order of the spectral elements in each column of the original time-frequency amplitude spectrum, so that the low-rank speech components become close to full rank and no longer exhibit low-rank structure. These speech components are therefore retained in the enhanced speech instead of being removed together with the noise, which improves the quality of the enhanced speech.

Drawings

FIG. 1 is a flow chart of an implementation of the method of the present invention;

FIG. 2 compares the top view of the time-frequency amplitude spectrum of the colored F16 noise with that of its whitened signal in simulation experiment 1 of the invention;

FIG. 3 is a visual time-frequency amplitude spectrum comparison of the speech enhancement effect of the method of the invention and of the speech enhancement method based on the robust principal component analysis algorithm under colored F16 noise in simulation experiment 2 of the invention;

FIG. 4 compares objective indexes of the average speech enhancement effect of the method of the invention and of the speech enhancement method based on the robust principal component analysis algorithm under six different types of colored noise in simulation experiment 3.

Detailed Description

The implementation steps of the method of the invention are described in further detail below with reference to fig. 1.

Step 1, generating whitened noisy speech x_w(n).

(1.1) Select an integer N in the range [1000, 1500] as the number of sample points, and use the first N sampling points of the noisy speech x(n) to establish a whitening filter. The specific steps for establishing the whitening filter are as follows:

First, select an integer p in the range [30, 50] as the order of the whitening filter, and establish a p-order linear predictor from the first N sampling points of the noisy speech x(n), wherein the transfer function of the linear predictor is

and the coefficients a_i (i = 1, 2, …, p) of the linear predictor are solved by the autocorrelation method;

Second, use the p-order linear predictor to establish a p-order whitening filter whose transfer function is

(1.2) Whiten the noisy speech x(n) with the whitening filter obtained in (1.1) to obtain the whitened noisy speech x_w(n). Whitening the noisy speech x(n) means filtering x(n) with the p-order whitening filter established in (1.1) of this step.
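The whitening step can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. It assumes the common linear-prediction convention in which the predictor is P(z) = a_1 z^-1 + … + a_p z^-p and the whitening filter is W(z) = 1 - P(z); the text above does not reproduce either transfer function, so this sign convention is an assumption. The function names, the small diagonal loading term and the synthetic AR(1) test signal are likewise illustrative.

```python
import numpy as np

def lpc_coefficients(x, p):
    """Order-p linear-prediction coefficients a_1..a_p (autocorrelation method)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Autocorrelation values r[0], ..., r[p] of the training samples.
    r = np.array([np.dot(x[: n - k], x[k:]) for k in range(p + 1)])
    # Normal equations R a = [r[1], ..., r[p]], with R Toeplitz in r[0..p-1].
    idx = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    R = r[idx] + 1e-9 * r[0] * np.eye(p)  # tiny diagonal loading: our safeguard, not in the source
    return np.linalg.solve(R, r[1:])

def whiten(x, a):
    """Filter x with the FIR filter W(z) = 1 - sum_i a_i z^-i (assumed form)."""
    w = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    return np.convolve(x, w)[: len(x)]

# Illustrative usage with N = 1024 and p = 40 on synthetic colored noise.
rng = np.random.default_rng(0)
e = rng.standard_normal(8000)
x = np.zeros_like(e)
for t in range(1, len(x)):
    x[t] = 0.9 * x[t - 1] + e[t]  # AR(1) process: strongly colored
a = lpc_coefficients(x[:1024], 40)
x_w = whiten(x, a)
```

The filter is estimated only from the first N samples but applied to the whole signal, which matches the description of taking the first N sampling points to build the whitening filter.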

Step 2, generating the time-frequency amplitude spectrum |D_w| and the time-frequency phase spectrum ∠D_w of the whitened noisy speech x_w(n).

(2.1) Select the duration of each speech frame in the range of [20, 40] milliseconds, select a value in the range of [25%, 75%] of the frame length as the shift of each frame relative to the previous frame, and divide the whitened noisy speech x_w(n) into a plurality of short-time speech frames.

(2.2) From all the short-time speech frames, select, in time order, one unprocessed short-time speech frame as the frame currently to be processed.

(2.3) Perform a Fourier transform on the short-time speech frame currently to be processed to obtain the Fourier spectrum of the frame, and calculate the amplitude and phase of the Fourier spectrum to obtain a Fourier magnitude spectrum and a Fourier phase spectrum.

(2.4) Judge whether all short-time speech frames have been processed; if so, execute (2.5) of this step, otherwise execute (2.2) of this step.

(2.5) Take the Fourier magnitude spectrum of each frame as a column vector and arrange the column vectors in time order to form the time-frequency amplitude spectrum |D_w| of the whitened noisy speech; take the Fourier phase spectrum of each frame as a column vector and arrange the column vectors in time order to form the time-frequency phase spectrum ∠D_w of the whitened noisy speech. Here the time-frequency amplitude spectrum |D_w| and the time-frequency phase spectrum ∠D_w are both matrices, with |D_w| ∈ R^(m×n) and ∠D_w ∈ R^(m×n), where ∈ denotes set membership, R indicates that the elements of the matrices |D_w| and ∠D_w are real numbers, m is the number of rows of |D_w| and ∠D_w, and n is the number of columns of |D_w| and ∠D_w.
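The framing and transform of step 2 can be sketched as below. The text does not name an analysis window or say whether the one-sided spectrum is used, so the periodic Hann window and `np.fft.rfft` are assumptions made for this sketch; the function name `stft_mag_phase` is also hypothetical.

```python
import numpy as np

def stft_mag_phase(x_w, frame_len, hop):
    """Return (|D_w|, angle(D_w)), each of shape (m, n) with one column per frame."""
    # Periodic Hann window (assumption: the source does not specify a window).
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    starts = range(0, len(x_w) - frame_len + 1, hop)
    frames = np.stack([win * x_w[s : s + frame_len] for s in starts], axis=1)
    D_w = np.fft.rfft(frames, axis=0)  # one-sided spectrum: m = frame_len // 2 + 1 rows
    return np.abs(D_w), np.angle(D_w)

# 32 ms frames with a 16 ms shift at 8000 Hz give frame_len = 256 and hop = 128.
x_w = np.random.default_rng(0).standard_normal(8000)
mag, phase = stft_mag_phase(x_w, frame_len=256, hop=128)
```

Trailing samples that do not fill a whole frame are dropped here for brevity; padding the signal would be an equally reasonable choice.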

Step 3, generating the rearranged time-frequency amplitude spectrum |D_w|_r.

(3.1) From all the column vectors of the time-frequency amplitude spectrum |D_w|, select, in time order, one unprocessed column as the Fourier magnitude spectrum currently to be processed.

(3.2) Generate a new arrangement order for the spectral elements in the current Fourier magnitude spectrum with a hash function, and rearrange the spectral elements according to this order to obtain a rearranged Fourier magnitude spectrum. The specific steps are as follows:

Let the current Fourier magnitude spectrum be X = [x_1, x_2, …, x_m]^T ∈ R^(m×1), where X is a column vector, x_1, x_2, …, x_m are the m spectral elements, the subscripts 1, 2, …, m of the spectral elements represent their arrangement order in the Fourier magnitude spectrum, T denotes vector transposition, ∈ denotes set membership, and R indicates that the spectral elements are real numbers;

(3.2.1) select an integer a in the range [2, m) that is coprime to m, select an integer b in the range [0, m), and construct the hash function f(k) = (ak + b)_m + 1, where (·)_m denotes the modulo-m operation, k denotes the subscript of a spectral element, and k = 1, 2, …, m;

(3.2.2) use the hash function f(k) to map the original subscript sequence 1, 2, …, m of the spectral elements to a new subscript sequence f(1), f(2), …, f(m);

(3.2.3) rearrange the spectral elements according to the new subscript sequence f(1), f(2), …, f(m) to obtain the rearranged Fourier magnitude spectrum X_r = [x_f(1), x_f(2), …, x_f(m)]^T ∈ R^(m×1).

(3.3) Judge whether all column vectors of |D_w| have been processed; if so, execute (3.4) of this step, otherwise execute (3.1) of this step.

(3.4) Take all the rearranged Fourier magnitude spectra as column vectors and arrange them in time order to form the rearranged time-frequency amplitude spectrum |D_w|_r.
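The hash rearrangement and its inverse can be sketched as follows. The text does not state how a and b vary from column to column; this sketch assumes a separate (a, b) pair per column, because applying one identical permutation to every column would leave the matrix rank unchanged. The function names and the example parameter values are illustrative.

```python
import math
import numpy as np

def hash_order(m, a, b):
    """0-based form of f(k) = ((a*k + b) mod m) + 1 for k = 1..m, i.e. f(k) - 1."""
    if not (2 <= a < m and 0 <= b < m and math.gcd(a, m) == 1):
        raise ValueError("need a in [2, m) coprime to m and b in [0, m)")
    k = np.arange(1, m + 1)
    return (a * k + b) % m  # a permutation of 0..m-1 because gcd(a, m) == 1

def rearrange(D, params):
    """Column j becomes [x_f(1), ..., x_f(m)]^T using its own (a, b) pair."""
    D_r = np.empty_like(D)
    for j, (a, b) in enumerate(params):
        D_r[:, j] = D[hash_order(D.shape[0], a, b), j]
    return D_r

def restore(S_r, params):
    """Undo rearrange(): put every element back at its original row."""
    S = np.empty_like(S_r)
    for j, (a, b) in enumerate(params):
        S[hash_order(S_r.shape[0], a, b), j] = S_r[:, j]
    return S

# A rank-1 matrix (three identical columns) loses its low-rank structure.
D = np.outer(np.arange(1.0, 8.0), np.ones(3))  # m = 7 rows, n = 3 columns
params = [(2, 0), (3, 1), (5, 4)]              # hypothetical per-column (a, b)
D_r = rearrange(D, params)
```

The `restore` function corresponds to the order recovery performed after the decomposition, when the enhanced spectrum is mapped back to the original element order.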

Step 4, generating the enhanced time-frequency amplitude spectrum |S_w|.

(4.1) Select an integer Q in the range [6, 10] as the number of columns used to estimate the noise intensity in the rearranged time-frequency amplitude spectrum |D_w|_r, and estimate the noise intensity in |D_w|_r from the first Q columns of rearranged Fourier magnitude spectra of |D_w|_r.

(4.2) According to the noise intensity estimated in (4.1) of this step, enhance the rearranged time-frequency amplitude spectrum |D_w|_r with a robust principal component analysis algorithm to generate the sparse rearranged time-frequency amplitude spectrum |S_w|_r.

The robust principal component analysis algorithm is as follows: the augmented Lagrange multiplier (ALM) method is used to solve the robust principal component analysis model of the rearranged time-frequency amplitude spectrum |D_w|_r, decomposing |D_w|_r into a sparse rearranged time-frequency amplitude spectrum |S_w|_r and a low-rank rearranged time-frequency amplitude spectrum |L_w|_r. The specific optimization is:

subject to |D_w|_r = |L_w|_r + |S_w|_r, find the sparse matrix |S_w|_r and the low-rank matrix |L_w|_r that minimize ‖|L_w|_r‖_* + λ‖|S_w|_r‖_1, i.e.

$$\min_{|L_w|_r,\;|S_w|_r}\ \big\|\,|L_w|_r\,\big\|_* + \lambda\,\big\|\,|S_w|_r\,\big\|_1 \quad \text{s.t.}\quad |D_w|_r = |L_w|_r + |S_w|_r$$

where ‖·‖_* denotes the nuclear norm (the sum of the singular values), ‖·‖_1 denotes the sum of the absolute values of the matrix elements, and λ is the weight parameter.

(4.3) According to the arrangement order generated in (3.2) of step 3, restore the order of the Fourier magnitude spectrum elements in all columns of |S_w|_r to obtain the enhanced time-frequency amplitude spectrum |S_w|.
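For orientation, the decomposition D = L + S can be sketched with the inexact ALM iteration shown below. This is a swapped-in, simpler variant and not necessarily the solver used in the patent. The initialisation, the step-size constants (1.25, 1.5), the stopping rule and the choice λ = 1/sqrt(max(m, n)) in the usage lines are conventional defaults assumed here, and the function name `rpca_ialm` is illustrative.

```python
import numpy as np

def rpca_ialm(D, lam, tol=1e-7, max_iter=500):
    """Approximately solve min ||L||_* + lam * ||S||_1 subject to D = L + S."""
    norm_D = np.linalg.norm(D, "fro")
    spec = np.linalg.norm(D, 2)                 # largest singular value of D
    Y = D / max(spec, np.abs(D).max() / lam)    # Lagrange multiplier initialisation
    mu, rho = 1.25 / spec, 1.5                  # penalty weight and its growth factor
    L = np.zeros_like(D)
    S = np.zeros_like(D)
    for _ in range(max_iter):
        # L-update: singular value thresholding of D - S + Y/mu.
        U, s, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # S-update: element-wise soft thresholding of D - L + Y/mu.
        T = D - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        # Multiplier update on the constraint residual.
        R = D - L - S
        Y = Y + mu * R
        mu = rho * mu
        if np.linalg.norm(R, "fro") <= tol * norm_D:
            break
    return L, S

# Synthetic check: a rank-2 matrix plus about 5 % large sparse entries.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((80, 2)) @ rng.standard_normal((2, 80))
S0 = np.where(rng.random((80, 80)) < 0.05, 10.0 * rng.standard_normal((80, 80)), 0.0)
D = L0 + S0
L, S = rpca_ialm(D, lam=1.0 / np.sqrt(80))
```

In the method described above, the sparse part S plays the role of the enhanced speech spectrum |S_w|_r, while the low-rank part L absorbs the whitened noise.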

Step 5, forming the enhanced time-frequency spectrum S_w.

Combine the enhanced time-frequency amplitude spectrum |S_w| generated in step 4 with the time-frequency phase spectrum ∠D_w obtained in (2.5) of step 2 to form the enhanced time-frequency spectrum S_w.
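The text does not give a formula for this combination. Under the usual reading, in which each enhanced magnitude is paired with the phase of the noisy spectrum at the same time-frequency position, it can be written element by element as follows (the indices i and j are introduced here for illustration):

```latex
S_w(i,j) = |S_w|(i,j)\, e^{\,\mathrm{j}\,\angle D_w(i,j)},
\qquad i = 1,\dots,m,\quad j = 1,\dots,n
```

In other words, only the magnitudes are modified by the enhancement, and the phase of the whitened noisy speech is reused unchanged.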

Step 6, reconstructing the whitened enhanced speech y_w(n).

(6.1) From all the column vectors of the enhanced time-frequency spectrum S_w, select, in time order, one unprocessed column as the enhanced Fourier spectrum currently to be processed.

(6.2) Perform an inverse Fourier transform on the enhanced Fourier spectrum currently to be processed to obtain one frame of whitened short-time enhanced speech.

(6.3) Judge whether all column vectors of S_w have been processed; if so, execute (6.4) of this step, otherwise execute (6.1) of this step.

(6.4) Reconstruct all the whitened short-time enhanced speech frames into the complete whitened enhanced speech y_w(n) using the overlap-add method.
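The inverse transform and overlap-add of step 6 can be sketched as below. The sketch assumes a one-sided spectrum with an even frame length, and assumes that the analysis stage used a window whose shifted copies sum to one at the chosen frame shift (for example a periodic Hann window at 50 % overlap), so that no extra normalisation is needed. The function name `overlap_add` is illustrative.

```python
import numpy as np

def overlap_add(S_w, hop):
    """Inverse-FFT every column of the one-sided spectrum S_w and overlap-add the frames."""
    frames = np.fft.irfft(S_w, axis=0)         # shape (frame_len, n), real-valued
    frame_len, n = frames.shape
    y_w = np.zeros(hop * (n - 1) + frame_len)
    for j in range(n):
        y_w[j * hop : j * hop + frame_len] += frames[:, j]
    return y_w
```

With other windows or frame shifts, the summed window envelope would have to be divided out after the accumulation.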

Step 7, generating the enhanced speech y(n).

Inverse whitening is applied to the whitened enhanced speech y_w(n) obtained in step 6, using the whitening filter obtained in (1.1) of step 1, to obtain the enhanced speech y(n), as follows.

(7a) Use the whitening filter obtained in (1.1) of step 1 to establish an inverse whitening filter whose transfer function is W_I(z) = 1/W(z).

(7b) Filter the whitened enhanced speech y_w(n) with the inverse whitening filter to obtain the enhanced speech y(n).
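Inverse whitening with W_I(z) = 1/W(z) amounts to all-pole (recursive) filtering. The sketch below assumes the convention W(z) = 1 - a_1 z^-1 - … - a_p z^-p, which the text does not spell out; the function name and the two example coefficients are illustrative.

```python
import numpy as np

def inverse_whiten(y_w, a):
    """Apply 1/W(z): y[n] = y_w[n] + sum_{i=1..p} a_i * y[n - i]."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    y = np.zeros(len(y_w))
    for n in range(len(y_w)):
        k = min(p, n)                            # number of past outputs available
        past = y[n - k : n][::-1]                # y[n-1], y[n-2], ..., y[n-k]
        y[n] = y_w[n] + np.dot(a[:k], past)
    return y

# Round trip with a small, stable example filter (coefficients are arbitrary).
a = np.array([0.5, -0.2])
x = np.random.default_rng(0).standard_normal(500)
x_w = np.convolve(x, np.concatenate(([1.0], -a)))[: len(x)]  # whitening: W(z) applied
y = inverse_whiten(x_w, a)
```

If no enhancement is applied between the two filters, the round trip returns the original signal, which is a convenient sanity check for the filter pair.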

The application effect of the invention is further explained by combining the following simulation:

1. Simulation conditions

The simulation experiments of the invention are implemented in MATLAB. The speech sampling rate is set to 8000 Hz, the duration of each short-time speech frame is 32 milliseconds, and the shift of each frame relative to the previous frame is 16 milliseconds. The first 1024 sampling points of the noisy speech are used to establish a 40-order whitening filter. In the simulation, the robust principal component analysis algorithm is solved by the Exact ALM (Exact Augmented Lagrange Multiplier) method, and the weight parameter of the robust principal component analysis algorithm is determined adaptively from the noise intensity in the rearranged time-frequency amplitude spectrum |D_w|_r by the following formula:

λ = -0.004 × ζ + 0.1181

where λ denotes the weight parameter of the robust principal component analysis algorithm and ζ denotes an estimate of the signal-to-noise ratio in the rearranged time-frequency amplitude spectrum |D_w|_r. Specifically, ζ can be determined by the following formula:

where ζ denotes the estimate of the signal-to-noise ratio in |D_w|_r, log₁₀(·) denotes the base-10 logarithm, Σ denotes summation, the squared term is the square of the spectral element in the i-th row and j-th column of the matrix |D_w|_r, m is the number of rows of |D_w|_r, n is the number of columns of |D_w|_r, and Q is the number of columns of |D_w|_r used for noise estimation, which is set to 8 in the simulation experiments of the invention.
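As a quick arithmetic check of these settings, the sketch below converts the frame duration and shift to samples and evaluates the weight formula. The function name `rpca_weight` and the sample value ζ = 5 dB are illustrative; how ζ itself is computed is not reproduced here.

```python
fs = 8000                        # sampling rate in Hz
frame_len = fs * 32 // 1000      # 32 ms frame duration in samples
hop = fs * 16 // 1000            # 16 ms frame shift in samples

def rpca_weight(zeta_db):
    """Weight parameter lambda = -0.004 * zeta + 0.1181 from the formula above."""
    return -0.004 * zeta_db + 0.1181

lam_at_5db = rpca_weight(5.0)    # roughly 0.0981
```

Because the slope is negative, a higher estimated signal-to-noise ratio gives a smaller λ, that is, a weaker penalty on the sparse (speech) part.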

2. Simulation content

Three simulation experiments are performed. Simulation experiment 1 is a whitening experiment on colored noise, which illustrates the effectiveness of the whitening step of the invention. FIG. 2 compares the top view of the time-frequency amplitude spectrum of the colored F16 noise with that of the whitened signal obtained in simulation experiment 1. FIG. 2(a) shows the top view of the time-frequency amplitude spectrum of the colored F16 noise, and FIG. 2(b) shows the top view of the time-frequency amplitude spectrum of the signal obtained by whitening the colored F16 noise. In each time-frequency amplitude spectrum in FIG. 2, the horizontal axis is time in seconds and the vertical axis is frequency in kilohertz; each spectrum is shown as a logarithmic spectrum with values in decibels.

Simulation experiment 2 visually compares the speech enhancement effect of the method of the invention with that of the speech enhancement method based on the robust principal component analysis algorithm, giving the time-frequency amplitude spectrum comparison of FIG. 3. In simulation experiment 2, a segment of clean speech is corrupted by colored F16 noise at a signal-to-noise ratio of 5 dB, and speech enhancement is performed with the method of the invention and with the existing speech enhancement method based on the robust principal component analysis algorithm. FIG. 3(a) shows the top view of the time-frequency amplitude spectrum of the clean speech; FIG. 3(b) shows that of the colored F16 noise; FIG. 3(c) shows that of the speech component obtained by the speech enhancement method based on the robust principal component analysis algorithm; FIG. 3(d) shows that of the noise component obtained by the speech enhancement method based on the robust principal component analysis algorithm; FIG. 3(e) shows that of the speech component obtained by the method of the invention; and FIG. 3(f) shows that of the noise component obtained by the method of the invention. In each time-frequency amplitude spectrum in FIG. 3, the horizontal axis is time in seconds and the vertical axis is frequency in kilohertz; each spectrum is shown as a logarithmic spectrum with values in decibels.

Simulation experiment 3 compares the average speech enhancement effect of the method of the invention and of the existing speech enhancement method based on the robust principal component analysis algorithm under six different types of colored noise (buccaneer1, buccaneer2, f16, factory1, hfchannel and pink); the result is shown in FIG. 4. The speech enhancement effect is measured by two objective indexes: the source-to-distortion ratio and the perceptual evaluation of speech quality. The source-to-distortion ratio is the ratio of the speech signal energy to the noise energy contained in the enhanced speech, in decibels; the perceptual evaluation of speech quality is an index for evaluating the subjective quality of the enhanced speech. For both indexes, a larger value indicates a better speech enhancement effect. In FIG. 4(a), the curve marked with circles shows how the average source-to-distortion ratio of the enhanced speech obtained by the method of the invention under the six types of colored noise varies with the signal-to-noise ratio, and the curve marked with diamonds shows the same for the speech enhancement method based on the robust principal component analysis algorithm. The abscissa in FIG. 4(a) is the signal-to-noise ratio of the noisy speech in decibels, and the ordinate is the source-to-distortion ratio in decibels. In FIG. 4(b), the curve marked with circles shows how the average perceptual evaluation of speech quality of the enhanced speech obtained by the method of the invention under the six types of colored noise varies with the signal-to-noise ratio, and the curve marked with diamonds shows the same for the speech enhancement method based on the robust principal component analysis algorithm. The abscissa in FIG. 4(b) is the signal-to-noise ratio in decibels, and the ordinate is the perceptual evaluation of speech quality.

3. And (3) simulation result analysis:

as can be seen from fig. 2, the colored noise F16 is mainly concentrated in the frequency bands of 0 hz-1.5 khz and 2.5 khz-3 khz, and after whitening, the signal energy is substantially uniformly distributed in the entire frequency band of 0 hz-4 khz, which is similar to white noise. Therefore, the whitening processing step in the present invention is effective in converting color noise into white noise.

FIG. 3 shows that the noise component time-frequency amplitude spectrum obtained by the speech enhancement method based on the robust principal component analysis algorithm in FIG. 3(d) has a large amount of speech components remaining, while the noise component time-frequency amplitude spectrum obtained by the method of the present invention in FIG. 3(f) has very few speech components remaining; meanwhile, the energy of the voice component in fig. 3(e) is larger than that in fig. 3(c), which intuitively shows that the algorithm of the present invention has a better voice enhancement effect. The method has the advantages that the arrangement sequence of each frame of Fourier spectrum elements is disturbed, so that the similarity between the time-frequency amplitude spectrum frame and the frame of the low-rank voice component is reduced, the condition that the low-rank voice component is wrongly decomposed is relieved, and the voice enhancement effect is improved.

As can be seen from fig. 4, the source-to-distortion ratio (SDR) curve obtained by the method of the present invention lies above the one obtained by the speech enhancement method based on the robust principal component analysis algorithm. For the perceptual evaluation of speech quality (PESQ) index, the score of the method of the present invention is slightly lower than that of the method based on the robust principal component analysis algorithm when the signal-to-noise ratio of the noisy speech is 0 dB and -5 dB. Taking the two indexes together, the method of the present invention has better noise elimination capability in various colored noise environments while retaining as much of the speech component as possible in the enhanced speech, and therefore achieves a good speech enhancement effect.

While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims

1. A speech enhancement method based on the whitening short-time Fourier spectrum hash rearrangement robust principal component analysis (WSHRRPCA) algorithm, characterized by comprising the following steps:

(1) generating whitened noisy speech x_w(n):

(1a) selecting an integer value in the range [1000, 1500] as the number of sample points N, and taking the first N sampling points of the noisy speech x(n) to establish a whitening filter, with the following specific steps:

(1a1) selecting an integer p in the range [30, 50] as the order of the whitening filter, and establishing a p-order linear predictor from the first N sampling points of the noisy speech x(n), wherein the transfer function of the linear predictor is

and solving the coefficients a_i (i = 1, 2, …, p) of the linear predictor by the autocorrelation method;

(1a2) using the p-order linear predictor to build a p-order whitening filter having a transfer function of

(1b) whitening the noisy speech x(n) with the whitening filter obtained in (1a), specifically: filtering the noisy speech x(n) with the p-order whitening filter established in step (1a) to obtain the whitened noisy speech x_w(n);

(2a) selecting the duration of each speech frame from the range [20, 40] milliseconds, selecting a value in the range [25%, 75%] of the frame length as the shift of each frame relative to the previous frame, and dividing the whitened noisy speech x_w(n) into a plurality of short-time speech frames;

(2c) performing Fourier transform on a short-time speech frame to be processed currently to obtain a Fourier spectrum of the frame, calculating the amplitude and the phase of the Fourier spectrum, and obtaining a Fourier magnitude spectrum and a Fourier phase spectrum;

(2e) taking the Fourier magnitude spectrum of each frame as a column vector and arranging the column vectors in time order to form the time-frequency magnitude spectrum |D_w| of the whitened noisy speech; taking the Fourier phase spectrum of each frame as a column vector and arranging the column vectors in time order to form the time-frequency phase spectrum ∠D_w of the whitened noisy speech;

(3) generating a rearranged time-frequency amplitude spectrum |D_w|_r:

(3a) among all column vectors of the time-frequency amplitude spectrum |D_w|, sequentially selecting an unprocessed column in time order as the Fourier magnitude spectrum currently to be processed;

(4) generating an enhanced time-frequency magnitude spectrum |S_w|:

(4c) restoring, according to the new arrangement order generated in (3b), the arrangement order of the Fourier magnitude spectrum elements in all columns of |S_w|_r to obtain the enhanced time-frequency magnitude spectrum |S_w|;

(5) composing an enhanced time-frequency spectrum S_w:

(6) reconstructing the whitened enhanced speech y_w(n):

(6a) among all column vectors of the enhanced time-frequency spectrum S_w, sequentially selecting an unprocessed column in time order as the enhanced Fourier spectrum currently to be processed;

(7) generating the enhanced speech y(n):
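Steps (2) and (5) above can be sketched roughly in Python as follows. The sketch assumes an 8 kHz sampling rate, a 32 ms frame, a 50% frame shift, a Hamming window, and a one-sided FFT; the window type and the FFT form are assumptions, and the function names are hypothetical.

```python
import numpy as np


def stft_magnitude_phase(x, fs, frame_ms=32, shift_ratio=0.5):
    """Split x into windowed frames and return the time-frequency
    magnitude and phase spectra, one column per frame."""
    x = np.asarray(x, dtype=float)
    frame_len = int(fs * frame_ms / 1000)      # samples per frame
    hop = int(frame_len * shift_ratio)         # shift between frames
    window = np.hamming(frame_len)             # assumed window type
    starts = range(0, len(x) - frame_len + 1, hop)
    frames = np.stack([x[s:s + frame_len] * window for s in starts], axis=1)
    spectrum = np.fft.rfft(frames, axis=0)     # one-sided spectrum per column
    return np.abs(spectrum), np.angle(spectrum)


def combine_spectrum(magnitude, phase):
    """Form a complex time-frequency spectrum from a magnitude spectrum
    and a phase spectrum, as in step (5)."""
    return magnitude * np.exp(1j * phase)
```

In the method, the magnitude passed to the second function would be the enhanced magnitude spectrum, while the phase would be the phase spectrum of the whitened noisy speech.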

3. The method of claim 1, wherein in step (3b) a hash function is used to generate a new arrangement order for the spectral elements of the current Fourier magnitude spectrum, and the spectral elements are rearranged according to the new arrangement order to obtain a rearranged Fourier magnitude spectrum, with the following steps:

letting the current Fourier magnitude spectrum be X = [x_1, x_2, …, x_m]^T ∈ R^{m×1}, wherein X is a column vector, x_1, x_2, …, x_m denote the m spectral elements, the subscripts 1, 2, …, m indicate the arrangement order of the spectral elements in the Fourier magnitude spectrum, T denotes the vector transpose operation, ∈ denotes set membership, and R indicates that the spectral elements are real numbers;

(3b1) selecting an integer a coprime to m in the range [2, m) and an integer b in the range [0, m), and constructing the hash function f(k) = (ak + b)_m + 1, wherein (·)_m denotes the modulo-m operation, k denotes the subscript of a spectral element, and k = 1, 2, …, m;

(3b2) mapping the original subscript sequence 1, 2, …, m of the spectral elements to the new subscript sequence f(1), f(2), …, f(m) using the hash function f(k);

(3b3) rearranging the spectral elements according to the new subscript sequence f(1), f(2), …, f(m) to obtain the rearranged Fourier amplitude spectrum X_r = [x_{f(1)}, x_{f(2)}, …, x_{f(m)}]^T ∈ R^{m×1}.
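The mapping in (3b1) to (3b3) can be illustrated with a short Python sketch. The function names and the values m = 8, a = 3, b = 2 are chosen only for illustration; the restore function shows one way the original order could be recovered afterwards, as step (4c) requires.

```python
from math import gcd


def hash_order(m, a, b):
    """Return [f(1), ..., f(m)] with f(k) = ((a*k + b) mod m) + 1."""
    if not (2 <= a < m and gcd(a, m) == 1):
        raise ValueError("a must lie in [2, m) and be coprime to m")
    if not 0 <= b < m:
        raise ValueError("b must lie in [0, m)")
    return [((a * k + b) % m) + 1 for k in range(1, m + 1)]


def rearrange(x, order):
    """Build X_r, whose k-th element is x_{f(k)} (1-based subscripts)."""
    return [x[f - 1] for f in order]


def restore(x_r, order):
    """Undo rearrange(): put the k-th element of X_r back at position f(k)."""
    x = [None] * len(order)
    for k, f in enumerate(order):
        x[f - 1] = x_r[k]
    return x


order = hash_order(8, 3, 2)
print(order)  # [6, 1, 4, 7, 2, 5, 8, 3]
```

Because a is coprime to m, the map k → (ak + b) mod m is a bijection on the residues modulo m, so f(1), …, f(m) is a permutation of 1, …, m and no spectral element is lost or duplicated.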

4. The method of claim 1, wherein the robust principal component analysis algorithm in step (4b) refers to: solving the robust principal component analysis model of the rearranged time-frequency amplitude spectrum |D_w|_r by the augmented Lagrange multiplier (ALM) method, thereby decomposing |D_w|_r into a sparse rearranged time-frequency amplitude spectrum |S_w|_r and a low-rank rearranged time-frequency amplitude spectrum |L_w|_r.
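For illustration, the decomposition can be sketched with one common inexact ALM iteration for the model min ||L||_* + λ||S||_1 subject to D = L + S. The claim does not state the weight λ, the penalty schedule, or the stopping rule, so the defaults below (λ = 1/sqrt(max(m, n)), growth factor 1.5, relative tolerance 1e-7) are assumptions borrowed from typical practice, and the function names are hypothetical.

```python
import numpy as np


def soft_threshold(x, tau):
    """Element-wise shrinkage: sign(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def rpca_ialm(d, lam=None, tol=1e-7, max_iter=500):
    """Decompose d into (low_rank, sparse) with an inexact ALM scheme."""
    d = np.asarray(d, dtype=float)
    if lam is None:
        lam = 1.0 / np.sqrt(max(d.shape))      # assumed default weight
    norm_two = np.linalg.norm(d, 2)            # spectral norm
    norm_fro = np.linalg.norm(d, "fro")
    y = d / max(norm_two, np.abs(d).max() / lam)   # Lagrange multiplier
    mu = 1.25 / norm_two                       # penalty parameter
    rho = 1.5                                  # penalty growth factor
    low = np.zeros_like(d)
    sparse = np.zeros_like(d)
    for _ in range(max_iter):
        # Low-rank update: singular value thresholding.
        u, s, vt = np.linalg.svd(d - sparse + y / mu, full_matrices=False)
        low = (u * soft_threshold(s, 1.0 / mu)) @ vt
        # Sparse update: element-wise shrinkage.
        sparse = soft_threshold(d - low + y / mu, lam / mu)
        residual = d - low - sparse
        y = y + mu * residual
        mu *= rho
        if np.linalg.norm(residual, "fro") <= tol * norm_fro:
            break
    return low, sparse
```

Applied to the method, d would be |D_w|_r, and the two returned matrices would play the roles of |L_w|_r and |S_w|_r respectively.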

5. The method of claim 1, wherein in step (7) the whitening filter obtained in step (1a) is used to perform inverse whitening on the whitened enhanced speech y_w(n) to obtain the enhanced speech y(n), with the following steps:

(7a) using the whitening filter obtained in (1a) to build an inverse whitening filter having the transfer function W_I(z) = 1/W(z);