CN104269180B

CN104269180B - Quasi-clean speech construction method for objective speech quality assessment

Info

Publication number: CN104269180B
Application number: CN201410515374.7A
Authority: CN
Inventors: 贺前华; 周伟力; 李洪韬
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2014-09-29
Filing date: 2014-09-29
Publication date: 2018-04-13
Anticipated expiration: 2034-09-29
Also published as: CN104269180A

Abstract

The invention discloses a quasi-clean speech construction method for objective speech quality assessment. The method obtains the quasi-clean speech of distorted speech using an improved minimum-controlled recursive averaging algorithm together with multi-band spectral subtraction, and mainly comprises: (1) distinguishing the non-speech segments and speech segments of the distorted speech; (2) estimating the noise power spectrum of the non-speech segments and of the speech segments separately, according to that division; (3) calculating the quasi-clean speech power spectrum of the distorted speech from the noise spectrum estimates of the non-speech and speech segments. The advantage is that the quasi-clean speech and the distorted speech can be used as the input speech of the PESQ algorithm to obtain an objective evaluation score for the distorted speech.

Description

Quasi-clean voice construction method for voice quality objective evaluation

Technical Field

The invention relates to a voice quality objective evaluation technology, in particular to a quasi-clean voice construction method for voice quality objective evaluation, belonging to the field of reference-source-free (Non-intrusive) voice quality objective evaluation.

Background

Speech quality is one of the important criteria for evaluating a voice communication system. Speech quality assessment is generally divided into subjective methods and objective methods. Subjective methods rely on listeners' opinions of the speech quality and therefore directly reflect the user's view of system quality; MOS (Mean Opinion Score), proposed in ITU-T Recommendation P.830, is a widely used subjective method. However, subjective evaluation has poor repeatability, is difficult to organize and carry out flexibly, is easily influenced by human subjective factors, and is ill-suited to use in production processes and field experiments.

Objective evaluation avoids the possible influence of human factors and uses signal processing to evaluate speech quality based on specific characteristics of the speech signal. Objective methods are classified as methods with a reference source (intrusive) or methods without a reference source (non-intrusive), according to whether a reference signal (clean speech) is required. A method with a reference source judges speech quality from the error between the input signal and the output signal of the speech system, and is therefore an error measure. PESQ (Perceptual Evaluation of Speech Quality), specified in ITU-T Recommendation P.862, is currently the better-performing objective method with a reference source and copes well with communication delay, environmental noise and errors. However, PESQ and other methods with a reference source require the input speech (clean speech) as a reference and cannot be used in applications where only the distorted signal is available.

ITU-T Recommendation P.563 is the current standard for reference-free objective evaluation. It can be applied to VoIP without reference signals and to monitoring of telecommunication network performance, but it has high computational complexity, is ill-suited to real-time evaluation of speech quality, and performs worse than PESQ. Mainstream objective evaluation methods based on statistical models currently rely on Gaussian Mixture Models (GMM) and Vector Quantization (VQ): during training, clean speech is trained into a reference model and a reference codebook; during testing, the distortion between the distorted speech and the reference model and codebook is calculated, and the resulting error is mapped to a final objective quality score. Training such statistical models requires a large amount of clean speech data, and their evaluation performance still falls well short of PESQ.

Quasi-clean speech construction estimates the noise spectrum of the distorted speech with a noise tracking algorithm and removes the noise component, yielding the quasi-clean speech of the distorted speech. Unlike Voice Activity Detection (VAD), which updates the noise power spectrum only in non-speech sections, a noise tracking algorithm keeps estimating the noise during speech activity and is therefore better suited to noisy, non-stationary scenes. The minimum-controlled recursive averaging algorithm can estimate the noise power spectrum in non-stationary noise environments much faster than other noise tracking algorithms (Martin, 2001; Doblinger, 1995). However, when estimating and updating the noise spectrum, the minimum-controlled recursive averaging algorithm treats the distorted speech uniformly and does not distinguish speech segments from non-speech segments. As a result, the estimate deviates from the actual noise power spectrum, and the uniform estimation increases computational complexity, reduces the efficiency of the algorithm, and hinders real-time estimation.

Disclosure of Invention

The invention aims to overcome the defects of the reference-source-free objective evaluation method in the prior art and provide a quasi-clean voice construction method for the objective evaluation of voice quality.

The purpose of the invention is realized by the following technical scheme: a quasi-clean speech construction method for objective evaluation of speech quality comprises the following steps:

step 1, an improved minimum-controlled recursive averaging algorithm distinguishes non-speech segments from speech segments when estimating the noise spectrum of the distorted speech, and updates the noise spectrum estimate of the non-speech segments according to the characteristics of those segments;

step 2, when performing noise estimation on a speech frame, the improved minimum-controlled recursive averaging algorithm uses a new frequency-dependent threshold to determine the speech presence probability in each frequency band of the speech frame;

step 3, the improved minimum-controlled recursive averaging algorithm determines the final noise power spectrum estimate of the noisy speech from the noise power spectrum estimates of the non-speech segments and the speech segments;

step 4, the improved minimum-controlled recursive averaging algorithm divides non-speech segments and speech segments by voice activity detection: the speech segment of the distorted speech is determined using the zero-crossing rate and short-time energy time-domain features, and the non-speech segments between words within the speech segment are determined using the Sohn algorithm;

and step 5, the quasi-clean power spectra of the non-speech segments and the speech segments are calculated separately by multi-band spectral subtraction, according to the division into non-speech and speech segments and the corresponding noise spectrum estimates, thereby obtaining the quasi-clean speech power spectrum of the distorted speech.

In step 1, the improved minimum-controlled recursive averaging algorithm relies on the division into non-speech segments and speech segments. A non-speech segment is treated as noise, so its noise spectrum estimate is $D(\lambda_{uv},k)=|Y(\lambda_{uv},k)|^2$, where $|Y(\lambda_{uv},k)|^2$ is the short-time power spectrum of the non-speech frame, $\lambda_{uv}$ is the frame index within the non-speech segments, and $k$ is the frequency-band index.

The division into non-speech segments and speech segments is carried out by voice activity detection, as follows. The distorted speech is first roughly estimated using time-domain features such as the zero-crossing rate and short-time energy: the start time and end time of the speech segment are found and the background noise is excluded, which determines the whole speech segment of the distorted speech. The located speech segment is then finely estimated with the Sohn voice activity detection algorithm, which determines the speech parts and the non-speech parts between words within the speech segment.

In step 2, when the improved minimum-controlled recursive averaging algorithm performs noise estimation on a speech frame, it uses a frequency-dependent threshold $\delta(k)$. The defining formula is not reproduced in this text; it is expressed in terms of LF and MF, the frequency bins corresponding to 1 kHz and 3 kHz respectively, the sampling frequency $f_s$, and the frequency-band index $k$.

In step 3, the noise power spectrum estimate $D(\lambda,k)$ of the noisy speech determined by the improved minimum-controlled recursive averaging algorithm consists of a non-speech part and a speech part, and is defined as:

$$D(\lambda,k)=\begin{cases}|Y(\lambda_{uv},k)|^2, & \text{non-speech frames}\\ \alpha_s(\lambda_v,k)\,D(\lambda_v-1,k)+\bigl(1-\alpha_s(\lambda_v,k)\bigr)\,|Y(\lambda_v,k)|^2, & \text{speech frames}\end{cases}$$

where $\alpha_s(\lambda_v,k)$ is a time-frequency-dependent smoothing factor, $|Y(\lambda_v,k)|^2$ is the short-time power spectrum of the speech frame, and $D(\lambda_v-1,k)$ is the noise spectrum estimate of the frame preceding the current speech frame.

In step 5, the quasi-clean speech power spectrum $S(\lambda,k)$ calculated by multi-band spectral subtraction consists of a non-speech part and a speech part, and its estimate is defined as:

$$S(\lambda,k)=\bigl(Y(\lambda_v,k)-D(\lambda_v,k)\bigr)+\bigl(Y(\lambda_{uv},k)-D(\lambda_{uv},k)\bigr),$$

where $|Y(\lambda_v,k)|^2$ is the short-time power spectrum of the speech frame, $|Y(\lambda_{uv},k)|^2$ is the short-time power spectrum of the non-speech frame, $D(\lambda_v,k)$ is the noise power spectrum estimate of the speech frame, and $D(\lambda_{uv},k)$ is the noise power spectrum estimate of the non-speech frame.

The specific implementation process of the quasi-clean voice construction method of the invention is as follows:

1. Determine the speech frames and non-speech frames of the distorted speech; the process is shown in Fig. 2. First, the speech segment of the distorted speech is roughly estimated as follows: the distorted speech is windowed and framed, and the short-time energy and zero-crossing rate of each frame are calculated; thresholds on short-time energy and zero-crossing rate are set, and the start frame and end frame of the speech segment are determined from these time-domain features. The speech segment is then finely estimated with the Sohn algorithm to determine its inter-word non-speech parts. The background-noise section and the inter-word non-speech parts are marked as non-speech frames, and the speech parts of the speech segment are marked as speech frames.

2. Perform noise tracking on the distorted speech; Fig. 3 shows the noise tracking estimation process. First, a Fourier transform is applied to the short-time frames of the distorted speech from step 1, and the power spectrum of each frame is calculated. Noise tracking uses the improved minimum-controlled recursive averaging algorithm, which estimates and updates the non-speech frames and the speech frames of the distorted speech separately, improving both the accuracy and the execution efficiency of the algorithm. A non-speech frame is treated as a noise frame, so its noise spectrum estimate is its short-time power spectrum. For a speech frame, the speech presence probability in each frequency band is obtained by comparing the ratio of the smoothed power spectrum of the speech frame to its local minimum against a new frequency-dependent threshold; the speech presence probability is then smoothed, and the time-frequency-dependent smoothing factor is updated from the smoothed probability; the noise spectrum estimate of the speech part is updated using that smoothing factor. Finally, the noise spectrum estimates of the non-speech part and the speech part together form the noise spectrum estimate of the distorted speech.

3. Obtain the quasi-clean speech. Multi-band spectral subtraction is applied to the noisy power spectrum of the distorted speech and the noise power spectrum estimate obtained in step 2, giving the quasi-clean speech power spectrum. An inverse Fourier transform of the quasi-clean speech power spectrum then gives the quasi-clean speech time-domain signal.

4. Evaluating objective quality of distorted voice; the PESQ algorithm calculates distortion errors between distorted voice and quasi-clean voice through a perception model, and the distortion errors are finally mapped into objective quality scores of the distorted voice through a cognition model.

The principle of the invention is as follows: the invention uses an improved minimum-controlled recursive averaging algorithm and multi-band spectral subtraction to obtain the quasi-clean speech of the distorted speech, and takes the quasi-clean speech and the distorted speech as the input speech of the PESQ algorithm to obtain the objective evaluation score of the distorted speech.

Compared with the prior art, the invention has the following advantages and effects:

1. by constructing quasi-clean speech of distorted speech, the PESQ algorithm can be applied to an objective evaluation application scenario without input speech. Compared with other reference-source-free objective evaluation methods, the method disclosed by the invention has the advantage that higher subjective and objective evaluation correlation degrees are obtained.

2. Compared with mainstream reference-free objective evaluation methods based on statistical models, the method does not need a large clean corpus to train a statistical model, so the evaluation algorithm is suitable for reference-free objective evaluation applications that lack clean corpora.

3. The quasi-clean speech construction method can distinguish the non-speech section and the speech section of the distorted speech, estimate the noise power spectrum of the distorted speech more accurately, eliminate the noise part of the distorted speech to a greater extent, and improve the accuracy of objective quality scoring of the distorted speech.

Drawings

FIG. 1 is a process diagram of a quasi-clean speech construction method for objective assessment of speech quality.

Fig. 2 is a diagram of a process for labeling speech and non-speech frames.

Fig. 3 is a diagram of a noise tracking estimation process for distorted speech.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.

Examples

A quasi-clean speech construction method for objective evaluation of speech quality comprises the following steps:

1. Frame and window the distorted speech (frame length 30 ms, frame shift 15 ms, Hamming window), and calculate the short-time energy and zero-crossing rate of each frame. Then calculate the average energy, upper energy threshold, lower energy threshold, average zero-crossing count and zero-crossing threshold of the distorted speech. The upper energy threshold is 0.05 times the average energy; the lower energy threshold is 0.25 times the upper energy threshold; the zero-crossing threshold is 0.3 times the average zero-crossing count.
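The framing and threshold computation in this step can be sketched as follows. This is a minimal illustration that assumes the signal is a plain list of float samples; the function names (`frame_signal`, `short_time_energy`, `zero_crossings`, `thresholds`) are hypothetical and not taken from the patent. Only the frame length, frame shift, window type and the three multipliers come from the text.

```python
import math

def frame_signal(x, fs, frame_ms=30, shift_ms=15):
    """Split x into overlapping frames and apply a Hamming window."""
    n = int(fs * frame_ms / 1000)    # samples per frame
    hop = int(fs * shift_ms / 1000)  # samples per frame shift
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    return [[x[s + i] * window[i] for i in range(n)]
            for s in range(0, len(x) - n + 1, hop)]

def short_time_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(v * v for v in frame)

def zero_crossings(frame):
    """Number of sign changes between neighbouring samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def thresholds(energies, zcs):
    """Thresholds with the multipliers stated in the text."""
    avg_energy = sum(energies) / len(energies)
    avg_zc = sum(zcs) / len(zcs)
    e_high = 0.05 * avg_energy  # upper energy threshold
    e_low = 0.25 * e_high       # lower energy threshold
    z_thr = 0.3 * avg_zc        # zero-crossing threshold
    return e_high, e_low, z_thr
```

Because the Hamming window is strictly positive, windowing does not change the sign of any sample, so the zero-crossing count is the same whether it is taken before or after windowing.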

2. Determine the start frame and end frame of the speech segment of the distorted speech with a double-threshold method based on energy and zero-crossing rate; then use the determined speech segment as input to the Sohn voice activity detection algorithm to determine the inter-word non-speech parts of the speech segment.
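The text does not spell out the double-threshold rule, so the following is only one common variant, given as a sketch under stated assumptions: frames above the upper energy threshold anchor the segment, and the boundaries are then extended outwards while neighbouring frames still exceed the lower energy threshold or the zero-crossing threshold. The function name `detect_endpoints` is hypothetical, and the Sohn fine-estimation stage is not shown.

```python
def detect_endpoints(energies, zcs, e_high, e_low, z_thr):
    """Return (start, end) frame indices of the speech segment, or None.

    Assumed rule (not confirmed by the patent text): anchor on frames above
    e_high, then grow the segment while adjacent frames look like weak speech.
    """
    strong = [i for i, e in enumerate(energies) if e > e_high]
    if not strong:
        return None  # no frame is energetic enough to be speech
    start, end = strong[0], strong[-1]
    # Extend the start backwards over weak-speech frames.
    while start > 0 and (energies[start - 1] > e_low or zcs[start - 1] > z_thr):
        start -= 1
    # Extend the end forwards over weak-speech frames.
    while end < len(energies) - 1 and (energies[end + 1] > e_low or zcs[end + 1] > z_thr):
        end += 1
    return start, end
```

In strong background noise the zero-crossing count of noise-only frames can itself be high, so a practical implementation would probably limit how far the zero-crossing test may extend the boundaries.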

3. The audio frames outside the speech segment determined in step 2, together with the inter-word non-speech frames inside that segment, are defined as the non-speech part of the distorted speech; the remaining frames of the speech segment are defined as the speech part of the distorted speech. As shown in Fig. 2, each short-time frame $\lambda$ of the distorted speech is thereby labeled as either a non-speech frame or a speech frame.

4. As shown in Fig. 3, a fast Fourier transform is applied to each short-time frame of the distorted speech, yielding the non-speech frame power spectrum $|Y(\lambda_{uv},k)|^2$ and the speech frame power spectrum $|Y(\lambda_v,k)|^2$, where $k$ is the frequency-band index.
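As an illustration of this step, the power spectrum of one frame can be computed as below. The sketch uses a direct discrete Fourier transform from the standard library so that it stays self-contained; a real implementation would use an FFT routine, which gives the same values faster. The function name `power_spectrum` is hypothetical.

```python
import cmath

def power_spectrum(frame):
    """Return |Y(k)|^2 for bins k = 0 .. n/2 of a real-valued frame.

    Direct O(n^2) DFT for clarity; an FFT would produce the same result.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        y_k = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(y_k) ** 2)
    return spectrum
```

The same function applies to speech frames and non-speech frames; only the later noise update treats the two kinds of frame differently.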

5. Estimate the noise power spectrum of the non-speech frames. Non-speech segments are treated as noise, so the noise spectrum estimate is $D(\lambda_{uv},k)=|Y(\lambda_{uv},k)|^2$.

6. Smooth the speech frame power spectrum $|Y(\lambda_v,k)|^2$:

$$P(\lambda_v,k)=\eta\,P(\lambda_v-1,k)+(1-\eta)\,|Y(\lambda_v,k)|^2,$$

where $P(\lambda_v,k)$ is the smoothed power spectrum of the speech frame, $\lambda_v$ is the speech frame index, $k$ is the frequency-band index, and $\eta$ is the smoothing factor ($\eta=0.7$ here).
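The recursion in this step is a first-order smoother applied independently to every frequency band. A minimal sketch follows, with per-band spectra held as plain lists; the function name `smooth_power` is hypothetical, and how the very first frame is initialised is not stated in the text (using its own power spectrum is one plausible choice).

```python
def smooth_power(prev_smoothed, power, eta=0.7):
    """P(l, k) = eta * P(l-1, k) + (1 - eta) * |Y(l, k)|^2 for every band k."""
    return [eta * p + (1.0 - eta) * y for p, y in zip(prev_smoothed, power)]
```

For example, with a previous smoothed value of 1.0 and a current power of 3.0 in some band, the new smoothed value is 0.7 * 1.0 + 0.3 * 3.0 = 1.6.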

7. Track the local minimum of $P(\lambda_v,k)$ to obtain $P_{\min}(\lambda_v,k)$:

if $P_{\min}(\lambda_v-1,k)<P(\lambda_v,k)$

(the update formula for this branch is not reproduced in this text)

else

$P_{\min}(\lambda_v,k)=P(\lambda_v,k)$

end

where $\beta=0.8$ and $\gamma=0.998$.
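The update formula for the first branch is missing from this text; only the constants $\beta$ and $\gamma$ survive. The sketch below therefore ASSUMES the continuous minimum-tracking rule that is commonly used with these two constants in the noise-estimation literature, $P_{\min}(\lambda,k)=\gamma P_{\min}(\lambda-1,k)+\frac{1-\gamma}{1-\beta}\bigl(P(\lambda,k)-\beta P(\lambda-1,k)\bigr)$. Whether the patent uses exactly this rule cannot be confirmed from the text, and the function name `track_minimum` is hypothetical.

```python
def track_minimum(p_min_prev, p_prev, p_cur, beta=0.8, gamma=0.998):
    """Per-band local minimum tracking of the smoothed power spectrum.

    The 'rising' branch is an assumed rule, not taken from the patent text;
    the 'else' branch (reset to the current value) is the one given there.
    """
    result = []
    for m_prev, pp, pc in zip(p_min_prev, p_prev, p_cur):
        if m_prev < pc:
            # Assumed rule: let the minimum drift upwards slowly.
            result.append(gamma * m_prev + (1 - gamma) / (1 - beta) * (pc - beta * pp))
        else:
            # From the text: the minimum follows the smoothed power downwards.
            result.append(pc)
    return result
```

With $\gamma$ close to 1, the tracked minimum rises only slowly while the smoothed power is above it, and drops immediately when the smoothed power falls below it.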

8. Calculate the speech presence probability. First, compute the ratio $S_r(\lambda_v,k)$ of the smoothed power spectrum of the speech frame to its local minimum:

$$S_r(\lambda_v,k)=\frac{P(\lambda_v,k)}{P_{\min}(\lambda_v,k)}$$

Then determine the speech presence probability $I(\lambda_v,k)$ in each band of the speech frame from $S_r(\lambda_v,k)$:

if $S_r(\lambda_v,k)>\delta(k)$

$I(\lambda_v,k)=1$ (speech present)

else

$I(\lambda_v,k)=0$ (speech absent)

end

Here $\delta(k)$ is the band-dependent threshold; its defining formula is not reproduced in this text.
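The decision in this step can be sketched as follows. The threshold values of $\delta(k)$ are not available in this text, so the numbers used below (`low`, `mid`, `high`) are PLACEHOLDERS chosen only to make the example run; they are not the patent's values. The three-band split at LF (1 kHz) and MF (3 kHz) follows the description of the threshold elsewhere in the document, and all function names are hypothetical.

```python
def band_edges(n_fft, fs, lf_hz=1000.0, mf_hz=3000.0):
    """Map the 1 kHz and 3 kHz break points to FFT bin indices."""
    lf = int(round(lf_hz * n_fft / fs))
    mf = int(round(mf_hz * n_fft / fs))
    return lf, mf

def delta(k, lf, mf, low=2.0, mid=3.0, high=5.0):
    """Band-dependent threshold. The three values are placeholders."""
    if k <= lf:
        return low
    if k <= mf:
        return mid
    return high

def presence_indicator(p, p_min, lf, mf):
    """I(k) = 1 where P(k) / P_min(k) exceeds delta(k), else 0."""
    indicator = []
    for k, (pk, mk) in enumerate(zip(p, p_min)):
        ratio = pk / max(mk, 1e-12)  # guard against a zero minimum
        indicator.append(1 if ratio > delta(k, lf, mf) else 0)
    return indicator
```

For a 512-point transform at an 8 kHz sampling rate, `band_edges(512, 8000)` places the two break points at bins 64 and 192.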

9. Smooth the speech presence probability to obtain $p(\lambda_v,k)$:

$$p(\lambda_v,k)=\alpha_p\,p(\lambda_v-1,k)+(1-\alpha_p)\,I(\lambda_v,k),$$

where $\alpha_p$ is the smoothing factor ($\alpha_p=0.2$ here).

10. Use the smoothed speech presence probability $p(\lambda_v,k)$ to calculate the time-frequency-dependent smoothing factor $\alpha_s(\lambda_v,k)$:

$$\alpha_s(\lambda_v,k)=\alpha_d+(1-\alpha_d)\,p(\lambda_v,k),$$

where $\alpha_d$ is a constant ($\alpha_d=0.85$ here).

11. Use the time-frequency-dependent smoothing factor $\alpha_s(\lambda_v,k)$ to update the speech frame noise spectrum estimate $D(\lambda_v,k)$:

$$D(\lambda_v,k)=\alpha_s(\lambda_v,k)\,D(\lambda_v-1,k)+\bigl(1-\alpha_s(\lambda_v,k)\bigr)\,|Y(\lambda_v,k)|^2.$$
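Steps 9 to 11 chain together into a single per-band update for each speech frame, which can be sketched as below. The constants are the ones given in the text; the function name `update_noise` and the list-based data layout are assumptions made for illustration.

```python
def update_noise(d_prev, power, indicator, p_prev, alpha_p=0.2, alpha_d=0.85):
    """One speech-frame update of the noise estimate.

    d_prev:    D(l-1, k), previous noise power estimate per band
    power:     |Y(l, k)|^2, current frame power spectrum per band
    indicator: I(l, k), 1 if speech is judged present in band k, else 0
    p_prev:    p(l-1, k), previous smoothed presence probability per band
    Returns (D(l, k), p(l, k)).
    """
    d_new, p_new = [], []
    for d, y, i, p in zip(d_prev, power, indicator, p_prev):
        p_k = alpha_p * p + (1 - alpha_p) * i    # step 9: smooth the presence probability
        a_s = alpha_d + (1 - alpha_d) * p_k      # step 10: smoothing factor
        d_new.append(a_s * d + (1 - a_s) * y)    # step 11: noise update
        p_new.append(p_k)
    return d_new, p_new
```

When the smoothed presence probability in a band reaches 1, the smoothing factor also reaches 1 and the noise estimate in that band is frozen; when it is 0, the estimate moves towards the current power with weight $1-\alpha_d=0.15$.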

12. Apply multi-band spectral subtraction to obtain the quasi-clean power spectra of the speech segments and non-speech segments, then obtain the quasi-clean speech $s(t)$ by inverse Fourier transform:

$$s(t)=\mathrm{IFFT}\bigl[\,Y(\lambda_v,k)+Y(\lambda_{uv},k)-\bigl(D(\lambda_v,k)+D(\lambda_{uv},k)\bigr)\bigr].$$
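The following is a simplified, single-frame sketch of the subtraction and resynthesis, not the patent's multi-band procedure. It rests on several assumptions that the text does not state: $Y$ in the formula is read as the power spectrum $|Y|^2$, negative results are floored at zero, the phase of the noisy frame is reused, and no per-band subtraction weights are applied. The function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Direct DFT of a sequence (an FFT would give the same values)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Direct inverse DFT."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def quasi_clean_frame(frame, noise_power, floor=0.0):
    """Subtract a noise power estimate (one value per DFT bin) from one frame."""
    cleaned = []
    for y_k, d_k in zip(dft(frame), noise_power):
        clean_power = max(abs(y_k) ** 2 - d_k, floor)  # power subtraction, floored
        # Assumption: keep the noisy phase and rebuild the complex bin.
        cleaned.append(cmath.rect(math.sqrt(clean_power), cmath.phase(y_k)))
    return [v.real for v in idft(cleaned)]
```

With a zero noise estimate the frame is returned unchanged, and with a noise estimate larger than every bin's power the output is silence. Reassembling a full signal from overlapping windowed frames would additionally need an overlap-add stage, which is not shown.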

13. as shown in fig. 1, calculating a distorted speech objective quality score; and calculating a distortion error between the distorted voice and the quasi-clean voice by utilizing a PESQ algorithm, and mapping the distortion error into an objective quality score of the distorted voice through a cognitive model.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such modifications are intended to be included in the scope of the present invention.

Claims

1. A quasi-clean speech construction method for objectively evaluating speech quality is characterized by comprising the following steps:

step 1, distinguishing a non-speech section from a speech section in noise spectrum estimation of distorted speech by using an improved minimum control recursive average algorithm, and updating the noise spectrum estimation value of the non-speech section according to the characteristics of the non-speech section;

the improved minimum-controlled recursive averaging algorithm divides non-speech segments and speech segments by voice activity detection: the speech segment of the distorted speech is determined using the zero-crossing rate and short-time energy time-domain features, and the non-speech segments between words within the speech segment are determined using the Sohn algorithm;

step 2, when performing noise estimation on a speech frame, using the improved minimum-controlled recursive averaging algorithm with a new frequency-dependent threshold to determine the speech presence probability in each frequency band of the speech frame;

the frequency-dependent threshold $\delta(k)$ is defined by a formula that is not reproduced in this text, expressed in terms of LF and MF, the frequency bins corresponding to 1 kHz and 3 kHz respectively, the sampling frequency $f_s$, and the frequency-band index $k$;

step 3, determining a final noise power spectrum estimation value of the voice with noise according to the noise power spectrum estimation values of the non-voice section and the voice section by utilizing an improved minimum control recursive average algorithm;

and step 4, calculating the quasi-clean power spectra of the non-speech segments and the speech segments separately by multi-band spectral subtraction, according to the division into non-speech and speech segments and the corresponding noise spectrum estimates, so as to obtain the quasi-clean speech power spectrum of the distorted speech.

2. The method according to claim 1, wherein in step 1, the improved minimum-controlled recursive averaging algorithm relies on the division into non-speech segments and speech segments; a non-speech segment is treated as noise, with noise spectrum estimate $D(\lambda_{uv},k)=|Y(\lambda_{uv},k)|^2$, where $|Y(\lambda_{uv},k)|^2$ is the short-time power spectrum of the non-speech frame, $\lambda_{uv}$ is the frame index within the non-speech segments, and $k$ is the frequency-band index.

3. The method according to claim 1, wherein in step 3, the final noise power spectrum estimate $D(\lambda,k)$ of the noisy speech determined by the improved minimum-controlled recursive averaging algorithm consists of a non-speech part and a speech part, and is defined as:

$$D(\lambda,k)=\begin{cases}|Y(\lambda_{uv},k)|^2, & \text{non-speech frames}\\ \alpha_s(\lambda_v,k)\,D(\lambda_v-1,k)+\bigl(1-\alpha_s(\lambda_v,k)\bigr)\,|Y(\lambda_v,k)|^2, & \text{speech frames}\end{cases}$$

where $\alpha_s(\lambda_v,k)$ is a time-frequency-dependent smoothing factor, $|Y(\lambda_v,k)|^2$ is the short-time power spectrum of the speech frame, $D(\lambda_v-1,k)$ is the noise spectrum estimate of the frame preceding the current speech frame, and $k$ is the frequency-band index.

4. The method according to claim 2, wherein the non-speech segments and speech segments are divided by voice activity detection, that is: the distorted speech is roughly estimated using time-domain features such as the zero-crossing rate and short-time energy, the start time and end time of the speech segment of the distorted speech are found, the background noise is excluded, and the whole speech segment of the distorted speech is determined; the whole speech segment is then finely estimated with the Sohn voice activity detection algorithm to determine the speech parts and the inter-word non-speech parts within the speech segment.

5. The method according to claim 1, wherein in step 4, the quasi-clean speech power spectrum $S(\lambda,k)$ calculated by multi-band spectral subtraction consists of a non-speech part and a speech part, and its estimate is defined as:

$$S(\lambda,k)=\bigl(Y(\lambda_v,k)-D(\lambda_v,k)\bigr)+\bigl(Y(\lambda_{uv},k)-D(\lambda_{uv},k)\bigr),$$

where $|Y(\lambda_v,k)|^2$ is the short-time power spectrum of the speech frame, $|Y(\lambda_{uv},k)|^2$ is the short-time power spectrum of the non-speech frame, $D(\lambda_v,k)$ is the noise power spectrum estimate of the speech frame, $D(\lambda_{uv},k)$ is the noise power spectrum estimate of the non-speech frame, $\lambda_v$ is the frame index within the speech segments, $\lambda_{uv}$ is the frame index within the non-speech segments, and $k$ is the frequency-band index.