Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a speech enhancement method based on robust principal component analysis with Hash rearrangement of a whitened short-time Fourier spectrum. The method obtains high-quality enhanced speech in a noisy environment and is mainly applicable to speech receiving systems, speech coding systems and speech recognition systems.
The specific idea for realizing the purpose of the invention is as follows. First, a whitening model is established from a portion of the samples of the noisy speech, and the noisy speech is whitened in the time domain with this model. A short-time Fourier transform is then applied to the whitened noisy speech to obtain its time-frequency amplitude spectrum and time-frequency phase spectrum. Next, the order of the spectral elements in each column of the time-frequency amplitude spectrum is scrambled by a Hash function mapping to obtain a rearranged time-frequency amplitude spectrum. The rearranged time-frequency amplitude spectrum is decomposed with a robust principal component analysis algorithm to obtain an enhanced time-frequency amplitude spectrum, and the original order of the spectral elements in each column is restored. An enhanced time-frequency spectrum is then formed from the enhanced time-frequency amplitude spectrum and the time-frequency phase spectrum, and a complete time-domain whitened enhanced speech signal is reconstructed. Finally, this signal is inverse-whitened with the established whitening model to obtain the enhanced speech. The invention can be used for speech enhancement in various speech processing systems to recover the quality and intelligibility of speech that is severely corrupted by noise.
The method specifically comprises the following steps:
(1) Generating the whitened noisy speech $x_w(n)$:
(1a) selecting an integer in the range [1000, 1500] as the number of sample points N, and using the first N sampling points of the noisy speech $x(n)$ to establish a whitening filter;
(1b) whitening the noisy speech $x(n)$ with the whitening filter obtained in step (1a) to obtain the whitened noisy speech $x_w(n)$;
(2) Generating the time-frequency amplitude spectrum $|D_w|$ and the time-frequency phase spectrum $\angle D_w$ of the whitened noisy speech $x_w(n)$:
(2a) selecting the duration of each speech frame in the range of [20, 40] milliseconds, selecting a value in the range of [25%, 75%] of the frame length as the displacement of each frame relative to the previous frame, and dividing the whitened noisy speech $x_w(n)$ into a plurality of short-time speech frames;
(2b) selecting, in time order, one unprocessed short-time speech frame from all the short-time speech frames as the frame currently to be processed;
(2c) performing a Fourier transform on the short-time speech frame currently to be processed to obtain the Fourier spectrum of the frame, and calculating the amplitude and phase of the Fourier spectrum to obtain a Fourier amplitude spectrum and a Fourier phase spectrum;
(2d) judging whether all short-time speech frames have been processed; if so, executing step (2e), otherwise returning to step (2b);
(2e) taking the Fourier amplitude spectrum of each frame as a column vector and arranging the column vectors in time order to form the time-frequency amplitude spectrum $|D_w|$ of the whitened noisy speech; taking the Fourier phase spectrum of each frame as a column vector and arranging the column vectors in time order to form the time-frequency phase spectrum $\angle D_w$ of the whitened noisy speech;
(3) Generating the rearranged time-frequency amplitude spectrum $|D_w|_r$:
(3a) selecting, in time order, one unprocessed column from all the column vectors of the time-frequency amplitude spectrum $|D_w|$ as the Fourier amplitude spectrum currently to be processed;
(3b) generating a new order for the spectral elements of the current Fourier amplitude spectrum with a Hash function, and rearranging the spectral elements in that order to obtain a rearranged Fourier amplitude spectrum;
(3c) judging whether all column vectors of $|D_w|$ have been processed; if so, executing step (3d), otherwise returning to step (3a);
(3d) taking all the rearranged Fourier amplitude spectra as column vectors and arranging them in time order to form the rearranged time-frequency amplitude spectrum $|D_w|_r$;
(4) Generating the enhanced time-frequency amplitude spectrum $|S_w|$:
(4a) selecting an integer Q in the range [6, 10] as the number of columns used to estimate the noise intensity in the rearranged time-frequency amplitude spectrum $|D_w|_r$, and estimating the noise intensity in $|D_w|_r$ from its first Q columns of rearranged Fourier amplitude spectra;
(4b) enhancing the rearranged time-frequency amplitude spectrum $|D_w|_r$ with a robust principal component analysis algorithm according to the noise intensity estimated in step (4a), to generate a sparse rearranged time-frequency amplitude spectrum $|S_w|_r$;
(4c) restoring the order of the Fourier amplitude spectrum elements in all columns of $|S_w|_r$ according to the orders generated in step (3b), to obtain the enhanced time-frequency amplitude spectrum $|S_w|$;
(5) Composing the enhanced time-frequency spectrum $S_w$:
composing the enhanced time-frequency spectrum $S_w$ from the enhanced time-frequency amplitude spectrum $|S_w|$ and the time-frequency phase spectrum $\angle D_w$;
(6) Reconstructing the whitened enhanced speech $y_w(n)$:
(6a) selecting, in time order, one unprocessed column from all the column vectors of the enhanced time-frequency spectrum $S_w$ as the enhanced Fourier spectrum currently to be processed;
(6b) performing an inverse Fourier transform on the enhanced Fourier spectrum currently to be processed to obtain one frame of whitened short-time enhanced speech;
(6c) judging whether all column vectors of $S_w$ have been processed; if so, executing step (6d), otherwise returning to step (6a);
(6d) reconstructing all the whitened short-time enhanced speech frames into the complete whitened enhanced speech $y_w(n)$ by the overlap-add method;
(7) Generating the enhanced speech $y(n)$:
performing inverse whitening on the whitened enhanced speech $y_w(n)$ with the whitening filter obtained in step (1a) to obtain the enhanced speech $y(n)$.
Compared with the prior art, the invention has the following advantages:
First, the invention adds a whitening step. When the background noise is colored, the whitening converts it into approximately white noise, which improves the ability to remove colored noise; moreover, the whitening does not affect the noise reduction capability of the invention in a white noise environment.
Second, before the enhanced time-frequency amplitude spectrum is generated, the invention uses a Hash function mapping to scramble the order of the spectral elements in each column of the original time-frequency amplitude spectrum. Speech components that would otherwise be low-rank thereby become close to full rank, so they are no longer assigned to the low-rank part of the decomposition; they are effectively retained in the enhanced speech, and the quality of the enhanced speech is improved.
Detailed Description
The implementation steps of the method of the invention are described in further detail below with reference to fig. 1.
Step 1, generating the whitened noisy speech $x_w(n)$.
(1.1) Select an integer in the range [1000, 1500] as the number of sample points N, and use the first N sampling points of the noisy speech $x(n)$ to establish a whitening filter. The specific steps for establishing the whitening filter are as follows:
First, select an integer p in the range [30, 50] as the order of the whitening filter, and establish a p-order linear predictor from the first N sampling points of the noisy speech $x(n)$, wherein the transfer function of the linear predictor is
$$P(z) = \sum_{i=1}^{p} a_i z^{-i},$$
and solve for the coefficients $a_i$ ($i = 1, 2, \ldots, p$) of the linear predictor by the autocorrelation method;
Second, use the p-order linear predictor to build a p-order whitening filter whose transfer function is
$$W(z) = 1 - \sum_{i=1}^{p} a_i z^{-i}.$$
(1.2) Whiten the noisy speech $x(n)$ with the whitening filter obtained in step (1.1) to obtain the whitened noisy speech $x_w(n)$. Whitening the noisy speech $x(n)$ means filtering $x(n)$ with the p-order whitening filter established in step (1.1) of this step.
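As an illustration of steps (1.1) and (1.2), the following minimal Python sketch solves for the linear predictor coefficients by the autocorrelation method, using the Levinson-Durbin recursion, and then filters a signal with the resulting whitening filter. It assumes the conventional forms $P(z) = \sum_i a_i z^{-i}$ for the predictor and $W(z) = 1 - P(z)$ for the whitening filter. The function names autocorr, lpc_autocorrelation and whiten are hypothetical and serve only this illustration.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[k] = sum_n x[n] * x[n + k] for k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def lpc_autocorrelation(x, p):
    """Return [a_1, ..., a_p] of the predictor x_hat[n] = sum_i a_i * x[n - i].

    Solved with the Levinson-Durbin recursion on the autocorrelation
    sequence (assumed convention; the source only names the method).
    """
    r = autocorr(x, p)
    a = [0.0] * (p + 1)          # a[0] is unused
    err = r[0]                   # prediction error energy
    for i in range(1, p + 1):
        if err <= 0.0:           # degenerate signal: stop early
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]


def whiten(x, a):
    """Filter x with W(z) = 1 - sum_i a_i z^-i (the prediction residual)."""
    p = len(a)
    out = []
    for n in range(len(x)):
        pred = sum(a[i] * x[n - 1 - i] for i in range(min(p, n)))
        out.append(x[n] - pred)
    return out


# Usage: estimate the filter from the first N samples, then whiten everything.
# N = 1024 and p = 40 are the values used in the simulation section.
# a = lpc_autocorrelation(x[:1024], 40)
# xw = whiten(x, a)
```

The coefficients are estimated once from the first N samples and then applied to the whole signal, as in step (1.2).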
Step 2, generating the time-frequency amplitude spectrum $|D_w|$ and the time-frequency phase spectrum $\angle D_w$ of the whitened noisy speech $x_w(n)$.
(2.1) Select the duration of each speech frame in the range of [20, 40] milliseconds, select a value in the range of [25%, 75%] of the frame length as the displacement of each frame relative to the previous frame, and divide the whitened noisy speech $x_w(n)$ into a plurality of short-time speech frames.
(2.2) From all the short-time speech frames, select in time order one unprocessed short-time speech frame as the frame currently to be processed.
(2.3) Perform a Fourier transform on the short-time speech frame currently to be processed to obtain the Fourier spectrum of the frame, and calculate the amplitude and phase of the Fourier spectrum to obtain a Fourier amplitude spectrum and a Fourier phase spectrum.
(2.4) Judge whether all short-time speech frames have been processed; if so, execute step (2.5) of this step, otherwise execute step (2.2) of this step.
(2.5) Take the Fourier amplitude spectrum of each frame as a column vector and arrange the column vectors in time order to form the time-frequency amplitude spectrum $|D_w|$ of the whitened noisy speech; take the Fourier phase spectrum of each frame as a column vector and arrange the column vectors in time order to form the time-frequency phase spectrum $\angle D_w$ of the whitened noisy speech. Here, the time-frequency amplitude spectrum $|D_w|$ and the time-frequency phase spectrum $\angle D_w$ are both matrices, with $|D_w| \in \mathbb{R}^{m \times n}$ and $\angle D_w \in \mathbb{R}^{m \times n}$, where $\in$ denotes set membership, $\mathbb{R}$ indicates that the elements of the matrices $|D_w|$ and $\angle D_w$ are real numbers, m is the number of rows of the matrices $|D_w|$ and $\angle D_w$, and n is their number of columns.
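The framing and transform of steps (2.1) to (2.5) can be sketched in Python as follows. This is a minimal illustration and makes several assumptions that the description does not state: no analysis window is applied (a rectangular window), all m = frame_len frequency bins are kept, and trailing samples that do not fill a whole frame are dropped. The function names are hypothetical, and the direct DFT is written for clarity rather than speed.

```python
import cmath


def split_frames(x, frame_len, hop):
    """Cut x into overlapping frames; a partial last frame is dropped."""
    return [x[s:s + frame_len]
            for s in range(0, len(x) - frame_len + 1, hop)]


def dft(frame):
    """Direct discrete Fourier transform of one frame (O(n^2), for clarity)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def stft_mag_phase(x, frame_len, hop):
    """Return (mag, phase) as m x n nested lists.

    Rows are frequency bins (m = frame_len); columns are frames in time order.
    """
    spectra = [dft(f) for f in split_frames(x, frame_len, hop)]
    mag = [[abs(s[k]) for s in spectra] for k in range(frame_len)]
    phase = [[cmath.phase(s[k]) for s in spectra] for k in range(frame_len)]
    return mag, phase


# With the simulation settings (8000 Hz, 32 ms frames, 16 ms displacement)
# the frame length is 256 samples and the hop is 128 samples:
# mag, phase = stft_mag_phase(xw, 256, 128)
```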
Step 3, generating the rearranged time-frequency amplitude spectrum $|D_w|_r$.
(3.1) From all the column vectors of the time-frequency amplitude spectrum $|D_w|$, select in time order one unprocessed column as the Fourier amplitude spectrum currently to be processed.
(3.2) Generate a new order for the spectral elements of the current Fourier amplitude spectrum with a Hash function, and rearrange the spectral elements in that order to obtain a rearranged Fourier amplitude spectrum. The specific steps are as follows:
Let the current Fourier amplitude spectrum be $X = [x_1, x_2, \ldots, x_m]^T \in \mathbb{R}^{m \times 1}$, where X is a column vector, $x_1, x_2, \ldots, x_m$ are the m spectral elements, the subscripts 1, 2, …, m give the order of the spectral elements in the Fourier amplitude spectrum, T denotes vector transposition, $\in$ denotes set membership, and $\mathbb{R}$ indicates that the spectral elements are real numbers;
(3.2.1) select an integer a in the range [2, m) that is coprime to m, select an integer b in the range [0, m), and construct the Hash function $f(k) = (ak + b)_m + 1$, where $(\cdot)_m$ denotes the modulo-m operation, k denotes the subscript of a spectral element, and k = 1, 2, …, m;
(3.2.2) map the original subscript sequence 1, 2, …, m of the spectral elements to the new subscript sequence f(1), f(2), …, f(m) with the Hash function f(k);
(3.2.3) rearrange the spectral elements according to the new subscript sequence f(1), f(2), …, f(m) to obtain the rearranged Fourier amplitude spectrum $X_r = [x_{f(1)}, x_{f(2)}, \ldots, x_{f(m)}]^T \in \mathbb{R}^{m \times 1}$.
(3.3) Judge whether all column vectors of $|D_w|$ have been processed; if so, execute step (3.4) of this step, otherwise execute step (3.1) of this step.
(3.4) Take all the rearranged Fourier amplitude spectra as column vectors and arrange them in time order to form the rearranged time-frequency amplitude spectrum $|D_w|_r$.
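The Hash rearrangement of step (3.2), together with the inverse mapping needed later to restore the order, can be sketched in Python as follows. Because a is coprime to m, f(k) = ((a·k + b) mod m) + 1 visits every subscript in 1, …, m exactly once, so the rearrangement is a permutation and can be undone. The function names are hypothetical. The sketch also assumes that a and b may differ from column to column, with each column's order stored for the restoration; applying one identical order to every column would only permute the rows of the matrix and leave its rank unchanged.

```python
from math import gcd


def hash_order(m, a, b):
    """New subscript sequence f(1), ..., f(m) with f(k) = ((a*k + b) mod m) + 1."""
    if not (2 <= a < m and gcd(a, m) == 1):
        raise ValueError("a must lie in [2, m) and be coprime to m")
    if not (0 <= b < m):
        raise ValueError("b must lie in [0, m)")
    return [((a * k + b) % m) + 1 for k in range(1, m + 1)]


def rearrange(column, order):
    """X_r = [x_f(1), ..., x_f(m)]: position k receives element f(k)."""
    return [column[f - 1] for f in order]


def restore(column_r, order):
    """Undo rearrange(): put each element back at its original subscript."""
    out = [None] * len(order)
    for pos, f in enumerate(order):
        out[f - 1] = column_r[pos]
    return out


# Worked example with m = 5, a = 2, b = 1:
order = hash_order(5, 2, 1)                      # [4, 1, 3, 5, 2]
x_r = rearrange([10, 20, 30, 40, 50], order)     # [40, 10, 30, 50, 20]
x_back = restore(x_r, order)                     # [10, 20, 30, 40, 50]
```

When m is a power of two, such as a 256-point frame, any odd a in [3, m) is coprime to m.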
Step 4, generating the enhanced time-frequency amplitude spectrum $|S_w|$.
(4.1) Select an integer Q in the range [6, 10] as the number of columns used to estimate the noise intensity in the rearranged time-frequency amplitude spectrum $|D_w|_r$, and estimate the noise intensity in $|D_w|_r$ from its first Q columns of rearranged Fourier amplitude spectra.
(4.2) According to the noise intensity estimated in step (4.1) of this step, enhance the rearranged time-frequency amplitude spectrum $|D_w|_r$ with a robust principal component analysis algorithm to generate the sparse rearranged time-frequency amplitude spectrum $|S_w|_r$.
The robust principal component analysis algorithm is as follows: the robust principal component analysis model of the rearranged time-frequency amplitude spectrum $|D_w|_r$ is solved by the augmented Lagrange multiplier (ALM) method, which decomposes $|D_w|_r$ into a sparse rearranged time-frequency amplitude spectrum $|S_w|_r$ and a low-rank rearranged time-frequency amplitude spectrum $|L_w|_r$. The specific optimization process is as follows:
Under the constraint $|D_w|_r = |L_w|'_r + |S_w|'_r$, find the sparse rearranged time-frequency amplitude spectrum matrix $|S_w|_r$ and the low-rank rearranged time-frequency amplitude spectrum matrix $|L_w|_r$ that minimize $\| |L_w|'_r \|_* + \lambda \| |S_w|'_r \|_1$, i.e.
$$\left(|L_w|_r, |S_w|_r\right) = \arg\min_{|L_w|'_r,\, |S_w|'_r} \; \| |L_w|'_r \|_* + \lambda \| |S_w|'_r \|_1 \quad \text{s.t.} \quad |D_w|_r = |L_w|'_r + |S_w|'_r$$
where $|S_w|'_r$ denotes a sparse rearranged time-frequency amplitude spectrum containing the speech information, $|L_w|'_r$ denotes a low-rank rearranged time-frequency amplitude spectrum containing the noise information, $\|\cdot\|_*$ denotes the nuclear norm, $\lambda$ denotes the weight, and $\|\cdot\|_1$ denotes the 1-norm.
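For reference, the augmented Lagrange multiplier treatment of this model can be written out as below. This is the standard ALM formulation for robust principal component analysis and is given as background, not as the exact iteration of the invention. For brevity D, L and S stand for $|D_w|_r$, $|L_w|'_r$ and $|S_w|'_r$; the multiplier matrix Y, the penalty parameter $\mu > 0$ and the operators $\mathcal{T}$ and $\mathcal{D}$ are symbols introduced here for illustration.

```latex
% Augmented Lagrangian of  min ||L||_* + lambda ||S||_1  s.t.  D = L + S
\mathcal{L}(L, S, Y, \mu) = \|L\|_* + \lambda \|S\|_1
    + \langle Y,\, D - L - S \rangle + \frac{\mu}{2}\, \|D - L - S\|_F^2

% Element-wise soft thresholding and singular value thresholding
\mathcal{T}_{\tau}(x) = \operatorname{sign}(x)\, \max(|x| - \tau,\, 0), \qquad
\mathcal{D}_{\tau}(M) = U\, \mathcal{T}_{\tau}(\Sigma)\, V^{T}
    \quad \text{for } M = U \Sigma V^{T}

% One pass of the alternating updates
L \leftarrow \mathcal{D}_{1/\mu}\!\left(D - S + Y/\mu\right), \qquad
S \leftarrow \mathcal{T}_{\lambda/\mu}\!\left(D - L + Y/\mu\right), \qquad
Y \leftarrow Y + \mu\,(D - L - S)
```

In the exact variant of the method, the L and S updates are repeated until they converge before Y is updated; the inexact variant performs each update once per iteration.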
(4.3) Restore the order of the Fourier amplitude spectrum elements in all columns of $|S_w|_r$ according to the orders generated in step (3.2) of step 3, to obtain the enhanced time-frequency amplitude spectrum $|S_w|$.
Step 5, composing the enhanced time-frequency spectrum $S_w$.
Compose the enhanced time-frequency spectrum $S_w$ from the enhanced time-frequency amplitude spectrum $|S_w|$ generated in step 4 and the time-frequency phase spectrum $\angle D_w$ obtained in step (2.5) of step 2.
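One natural reading of this composition is the element-wise product of the enhanced magnitude with the unit phasor of the original phase, that is, each element of $S_w$ equals the corresponding element of $|S_w|$ multiplied by $e^{j \angle D_w}$. The following Python sketch shows that reading; the function name compose_spectrum is hypothetical, and the element-wise form is an assumption, since the description does not spell out the formula.

```python
import cmath


def compose_spectrum(mag, phase):
    """Element-wise S = |S| * exp(j * angle) for two equally sized nested lists."""
    return [[cmath.rect(r, th) for r, th in zip(mag_row, phase_row)]
            for mag_row, phase_row in zip(mag, phase)]


# Example: magnitude 2 at phase 0 and magnitude 1 at phase pi/2.
S = compose_spectrum([[2.0, 1.0]], [[0.0, cmath.pi / 2]])
```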
Step 6, reconstructing the whitened enhanced speech $y_w(n)$.
(6.1) From all the column vectors of the enhanced time-frequency spectrum $S_w$, select in time order one unprocessed column as the enhanced Fourier spectrum currently to be processed.
(6.2) Perform an inverse Fourier transform on the enhanced Fourier spectrum currently to be processed to obtain one frame of whitened short-time enhanced speech.
(6.3) Judge whether all column vectors of $S_w$ have been processed; if so, execute step (6.4) of this step, otherwise execute step (6.1) of this step.
(6.4) Reconstruct all the whitened short-time enhanced speech frames into the complete whitened enhanced speech $y_w(n)$ by the overlap-add method.
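Steps (6.2) and (6.4) can be sketched in Python as follows. The sketch assumes rectangular (unwindowed) frames, so overlapping samples are averaged by dividing by the number of frames that cover each sample; with a tapered analysis window a different normalization would be needed. The function names are hypothetical, and the direct inverse DFT is written for clarity rather than speed.

```python
import cmath


def idft_real(spectrum):
    """Inverse DFT of one column, keeping the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def overlap_add(frames, hop):
    """Overlap-add equal-length frames spaced hop samples apart.

    Each output sample is divided by the number of frames covering it
    (an assumption that matches unwindowed analysis frames).
    """
    frame_len = len(frames[0])
    total = hop * (len(frames) - 1) + frame_len
    out = [0.0] * total
    cover = [0.0] * total
    for idx, frame in enumerate(frames):
        start = idx * hop
        for t, v in enumerate(frame):
            out[start + t] += v
            cover[start + t] += 1.0
    return [o / c for o, c in zip(out, cover)]


# Usage, given the columns of S_w as a list of complex lists:
# frames = [idft_real(col) for col in columns_of_Sw]
# yw = overlap_add(frames, hop=128)
```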
Step 7, generating the enhanced speech $y(n)$.
Using the whitening filter obtained in step (1.1) of step 1, inverse whitening is performed on the whitened enhanced speech $y_w(n)$ obtained in step 6 to obtain the enhanced speech $y(n)$, as follows:
(7a) Use the whitening filter obtained in step (1.1) of step 1 to build an inverse whitening filter whose transfer function is $W_I(z) = 1/W(z)$.
(7b) Filter the whitened enhanced speech $y_w(n)$ with the inverse whitening filter to obtain the enhanced speech $y(n)$.
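Assuming the whitening filter has the conventional form $W(z) = 1 - \sum_i a_i z^{-i}$, the inverse filter $W_I(z) = 1/W(z)$ is an all-pole recursion that feeds past output samples back through the same coefficients. The following Python sketch shows that recursion and checks that it undoes the whitening; the function names are hypothetical, and whiten is repeated here only so that the example is self-contained.

```python
def whiten(x, a):
    """FIR whitening: xw[n] = x[n] - sum_i a_i * x[n - i]."""
    p = len(a)
    return [x[n] - sum(a[i] * x[n - 1 - i] for i in range(min(p, n)))
            for n in range(len(x))]


def inverse_whiten(yw, a):
    """All-pole inverse 1/W(z): y[n] = yw[n] + sum_i a_i * y[n - i]."""
    p = len(a)
    y = []
    for n in range(len(yw)):
        feedback = sum(a[i] * y[n - 1 - i] for i in range(min(p, n)))
        y.append(yw[n] + feedback)
    return y


# Round trip on a short signal with illustrative coefficients a_1, a_2.
a = [0.5, -0.2]
x = [1.0, 2.0, -1.0, 0.5, 3.0]
x_back = inverse_whiten(whiten(x, a), a)   # equals x up to rounding
```

The recursion is stable only if all zeros of W(z) lie inside the unit circle; predictor coefficients obtained by the autocorrelation method satisfy this condition.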
The effect of the invention is further described below with reference to simulations:
1. Simulation conditions
The simulation experiments of the invention are implemented in MATLAB. The speech sampling rate is 8000 Hz, each short-time speech frame is 32 milliseconds long, and the displacement of each frame relative to the previous frame is 16 milliseconds. The first 1024 sampling points of the noisy speech are used to establish a 40th-order whitening filter. In the simulation, the robust principal component analysis algorithm is solved by the Exact ALM (Exact Augmented Lagrange Multiplier) method, and the weight parameter of the robust principal component analysis algorithm is determined adaptively from the noise level in the rearranged time-frequency amplitude spectrum $|D_w|_r$ by the following formula:
$$\lambda = -0.004 \times \zeta + 0.1181$$
where $\lambda$ denotes the weight parameter of the robust principal component analysis algorithm and $\zeta$ denotes an estimate of the signal-to-noise ratio in the rearranged time-frequency amplitude spectrum $|D_w|_r$. Specifically, $\zeta$ can be determined by the following formula:
where $\zeta$ denotes the estimate of the signal-to-noise ratio in $|D_w|_r$, $\log_{10}(\cdot)$ denotes the base-10 logarithm, $\sum$ denotes summation, the summed terms are the squares of the spectral elements at the i-th row and j-th column of the matrix $|D_w|_r$, n is the number of columns of the matrix $|D_w|_r$, and Q is the number of columns of $|D_w|_r$ used for the noise estimate, which is set to 8 in the simulation experiments of the invention.
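As a small worked illustration of the weight formula, the following Python sketch evaluates $\lambda$ for a given $\zeta$. It assumes that $\zeta$ is expressed in decibels, as the base-10 logarithm in its definition suggests; the function name rpca_weight is hypothetical.

```python
def rpca_weight(zeta):
    """Weight of the sparse term: lambda = -0.004 * zeta + 0.1181."""
    return -0.004 * zeta + 0.1181


# An estimated signal-to-noise ratio of 5 gives a weight of about 0.0981,
# and an estimate of 0 gives 0.1181.
lam = rpca_weight(5.0)
```

The weight decreases linearly as the estimated signal-to-noise ratio increases, and the formula yields a non-positive value once $\zeta$ reaches 29.525, so it is meaningful only below that point.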
2. Simulation content
Three simulation experiments are carried out. Simulation experiment 1 is a whitening experiment on colored noise, which illustrates the effectiveness of the whitening process in the invention. Fig. 2 compares the top view of the time-frequency amplitude spectrum of the colored F16 noise with that of the whitened signal obtained in simulation experiment 1. Fig. 2(a) shows the top view of the time-frequency amplitude spectrum of the colored F16 noise, and Fig. 2(b) shows the top view of the time-frequency amplitude spectrum of the signal obtained by whitening the colored F16 noise. In each time-frequency amplitude spectrum in Fig. 2, the horizontal axis is time in seconds and the vertical axis is frequency in kilohertz; each spectrum is shown as a logarithmic spectrum with spectral values in decibels.
Simulation experiment 2 visually compares the speech enhancement effect of the method of the invention with that of the existing speech enhancement method based on the robust principal component analysis algorithm; the result is the time-frequency amplitude spectrum comparison of Fig. 3. In simulation experiment 2, a segment of clean speech is corrupted by colored F16 noise at a signal-to-noise ratio of 5 dB, and speech enhancement is carried out with the method of the invention and with the existing speech enhancement method based on the robust principal component analysis algorithm, respectively. Fig. 3(a) shows the top view of the time-frequency amplitude spectrum of the clean speech, and Fig. 3(b) shows that of the colored F16 noise. Fig. 3(c) and Fig. 3(d) show the top views of the time-frequency amplitude spectra of the speech component and the noise component, respectively, obtained by the speech enhancement method based on the robust principal component analysis algorithm. Fig. 3(e) and Fig. 3(f) show the top views of the time-frequency amplitude spectra of the speech component and the noise component, respectively, obtained by the method of the invention. In each time-frequency amplitude spectrum in Fig. 3, the horizontal axis is time in seconds and the vertical axis is frequency in kilohertz; each spectrum is shown as a logarithmic spectrum with spectral values in decibels.
Simulation experiment 3 compares the average speech enhancement effect of the method of the invention and of the existing speech enhancement method based on the robust principal component analysis algorithm under six different types of colored noise (buccaneer1, buccaneer2, f16, factory1, hfchannel and pink); the result is shown in Fig. 4. Fig. 4 shows the objective-index comparison obtained in simulation experiment 3 between the average speech enhancement effect of the method of the invention and that of the speech enhancement method based on the robust principal component analysis algorithm under the six types of colored noise. The speech enhancement effect is measured by two objective indexes, the source-to-distortion ratio (SDR) and the perceptual evaluation of speech quality (PESQ). The source-to-distortion ratio is the ratio of the speech signal energy to the noise energy contained in the enhanced speech, in decibels; the perceptual evaluation of speech quality is an index that evaluates the subjective intelligibility of the enhanced speech. For both indexes, a larger value indicates a better speech enhancement effect.
In Fig. 4(a), the curve marked with circles shows how the average source-to-distortion ratio of the enhanced speech obtained by the method of the invention under the six types of colored noise varies with the signal-to-noise ratio, and the curve marked with diamonds shows the same for the speech enhancement method based on the robust principal component analysis algorithm. The abscissa of Fig. 4(a) is the signal-to-noise ratio of the noisy speech in decibels, and the ordinate is the source-to-distortion ratio in decibels.
In Fig. 4(b), the curve marked with circles shows how the average perceptual evaluation of speech quality score of the enhanced speech obtained by the method of the invention under the six types of colored noise varies with the signal-to-noise ratio, and the curve marked with diamonds shows the same for the speech enhancement method based on the robust principal component analysis algorithm. The abscissa of Fig. 4(b) is the signal-to-noise ratio in decibels, and the ordinate is the perceptual evaluation of speech quality score.
3. Analysis of simulation results:
As can be seen from Fig. 2, the colored F16 noise is mainly concentrated in the frequency bands of 0 Hz to 1.5 kHz and 2.5 kHz to 3 kHz, whereas after whitening the signal energy is distributed substantially uniformly over the entire band of 0 Hz to 4 kHz, similar to white noise. The whitening step of the invention is therefore effective in converting colored noise into white noise.
As can be seen from Fig. 3, a large amount of speech remains in the noise-component time-frequency amplitude spectrum obtained by the speech enhancement method based on the robust principal component analysis algorithm in Fig. 3(d), while very little speech remains in the noise-component time-frequency amplitude spectrum obtained by the method of the invention in Fig. 3(f). At the same time, the energy of the speech component in Fig. 3(e) is larger than that in Fig. 3(c), which shows intuitively that the method of the invention has a better speech enhancement effect. The advantage of the method is that scrambling the order of the Fourier spectrum elements of each frame reduces the frame-to-frame similarity of the low-rank speech components in the time-frequency amplitude spectrum, which alleviates the erroneous decomposition of low-rank speech components and improves the speech enhancement effect.
As can be seen from Fig. 4, the source-to-distortion ratio curve obtained by the method of the invention lies above the curve obtained by the speech enhancement method based on the robust principal component analysis algorithm. For the perceptual evaluation of speech quality index, the score of the method of the invention is slightly lower than that of the speech enhancement method based on the robust principal component analysis algorithm when the signal-to-noise ratio of the noisy speech is 0 dB and -5 dB. Taking the two indexes together, the method of the invention has better noise removal capability in various colored noise environments while retaining as much of the speech component as possible in the enhanced speech, and therefore has a good speech enhancement effect.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.