CN104269180A

CN104269180A - Quasi-clean voice construction method for voice quality objective evaluation

Info

Publication number: CN104269180A
Application number: CN201410515374.7A
Authority: CN
Inventors: 贺前华; 周伟力; 李洪韬
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2014-09-29
Filing date: 2014-09-29
Publication date: 2015-01-07
Anticipated expiration: 2034-09-29
Also published as: CN104269180B

Abstract

The invention discloses a quasi-clean speech construction method for objective speech quality evaluation. An improved minima-controlled recursive averaging (MCRA) algorithm and multi-band spectral subtraction are used to obtain the quasi-clean speech of a distorted speech signal. The method mainly comprises the steps of: (1) distinguishing the speech segments and the non-speech segments of the distorted speech; (2) estimating the noise power spectra of the speech segments and the non-speech segments separately, according to that division; and (3) calculating the quasi-clean speech power spectrum of the distorted speech from the noise spectrum estimates of the non-speech segments and the speech segments. The advantage of the method is that the quasi-clean speech and the distorted speech serve as the two inputs of the PESQ algorithm, from which an objective evaluation score for the distorted speech is obtained.

Description

A quasi-clean speech construction method for objective speech quality evaluation

Technical field

The present invention relates to objective speech quality evaluation technology, and in particular to a quasi-clean speech construction method for objective speech quality evaluation. The construction method belongs to the field of non-intrusive (reference-free) objective speech quality evaluation.

Background technology

Speech quality is one of the main criteria for judging a voice communication system. Speech quality evaluation is generally divided into subjective methods and objective methods. Subjective methods rely on listeners' opinions to judge speech quality and directly reflect the user's view of system quality; the MOS (Mean Opinion Score) proposed in ITU-T Recommendation P.830 is a widely used subjective method. However, subjective evaluation has poor repeatability, is difficult to organize and carry out, is inflexible, and is easily influenced by human subjective factors, which makes it ill-suited to production processes and field trials.

Objective evaluation methods eliminate the possible influence of human factors: they use signal processing, based on specific characteristics of the speech signal, to carry out the quality evaluation. According to whether a reference source signal (clean speech) is required, objective methods are divided into intrusive (with reference) methods and non-intrusive (reference-free) methods. Intrusive methods judge speech quality from the error between the input signal and the output signal of the speech system, that is, they are an error measure. Among them, PESQ (Perceptual Evaluation of Speech Quality), proposed in ITU-T Recommendation P.862, is currently one of the better-performing intrusive methods and copes well with communication delay, environmental noise, and errors. However, PESQ and other intrusive methods need the input speech (clean speech) as a reference and cannot be used in applications where only the distorted signal is available.

ITU-T Recommendation P.563 is the current standard for non-intrusive objective evaluation. It can be applied to VoIP and communication-network performance monitoring without a reference signal, but its computational complexity is high, which hinders real-time quality evaluation, and its evaluation performance is inferior to that of PESQ. The current mainstream objective methods based on statistical models mainly use Gaussian mixture models (GMM) and vector quantization (VQ). During model training these methods train clean speech into a reference model and a reference codebook; at test time they compute the distortion between the distorted speech and the reference model and codebook, and map the resulting error to a final objective quality score. Statistical-model-based methods not only need a large amount of clean speech data for training, but their evaluation performance also differs considerably from that of PESQ.

Quasi-clean speech construction estimates the noise spectrum of the distorted speech with a noise tracking algorithm and removes the noise component, yielding the quasi-clean speech of the distorted speech. Unlike voice activity detection (VAD) approaches, which update the noise power spectrum only in non-speech segments, a noise tracking algorithm continues to estimate the noise during speech activity and is therefore better suited to non-stationary noise. Compared with other noise tracking algorithms (Martin, 2001; Doblinger, 1995; Hirsch and Ehrlicher, 1995; Cohen, 2003), the minima-controlled recursive averaging (MCRA) algorithm can estimate the noise power spectrum more quickly in non-stationary noise environments. However, MCRA estimates and updates the noise spectrum of the distorted speech uniformly, without distinguishing speech segments from non-speech segments. As a result, the estimate deviates from the actual noise power spectrum, and the uniform estimation increases the computational complexity, lowers the efficiency of the algorithm, and hinders real-time estimation.

Summary of the invention

The object of the invention is to overcome the shortcomings and deficiencies of non-intrusive objective evaluation methods in the prior art by providing a quasi-clean speech construction method for objective speech quality evaluation. The method constructs the quasi-clean speech of the distorted speech by means of noise tracking and noise removal.

The object of the invention is achieved by the following technical solution: a quasi-clean speech construction method for objective speech quality evaluation, comprising the following steps:

Step 1: in estimating the noise spectrum of the distorted speech, the improved minima-controlled recursive averaging (MCRA) algorithm distinguishes non-speech segments from speech segments, and updates the noise spectrum estimate of the non-speech segments according to the characteristics of those segments;

Step 2: when estimating the noise of speech frames, the improved MCRA algorithm uses a new frequency-dependent threshold to determine speech presence in each frequency band of a speech frame;

Step 3: the improved MCRA algorithm determines the final noise power spectrum estimate of the noisy speech from the noise power spectrum estimates of the non-speech segments and the speech segments;

Step 4: the improved MCRA algorithm divides the signal into non-speech segments and speech segments by voice activity detection: the zero-crossing rate and short-time energy time-domain features locate the speech segment of the distorted speech, and the Sohn algorithm then determines the inter-word non-speech portions within that speech segment;

Step 5: multi-band spectral subtraction, using the division into non-speech and speech segments and the corresponding noise spectrum estimates, calculates the quasi-clean power spectra of the non-speech segments and the speech segments separately, thereby obtaining the quasi-clean speech power spectrum of the distorted speech.

In step 1, the improved MCRA algorithm is based on the division into non-speech segments and speech segments. Non-speech segments are treated as noise, so the noise spectrum estimate is D(λ_uv, k) = |Y(λ_uv, k)|², where |Y(λ_uv, k)|² is the short-time power spectrum of a non-speech frame, λ_uv is the frame index within the non-speech segments, and k is the frequency band index.

The division into non-speech segments and speech segments is realized by voice activity detection. First, time-domain features such as the zero-crossing rate and short-time energy give a coarse estimate: the start time and end time of the speech segment of the distorted speech are found, the background noise is excluded, and the whole utterance segment of the distorted speech is determined. The Sohn voice activity detection algorithm then refines the estimate within the located utterance segment, determining its speech portions and its inter-word non-speech portions.

In step 2, when the improved minima-controlled recursive averaging (MCRA) algorithm estimates the noise of speech frames, the frequency-dependent threshold δ(k) is defined as:

δ(k) = \begin{cases} 1.5, & 1 \leq k \leq LF \\ 2.5, & LF < k \leq MF \\ 6.5, & MF < k \leq Fs / 2 \end{cases}

where LF and MF are the band indices corresponding to 1 kHz and 3 kHz respectively, Fs is the sampling frequency, and k is the frequency band index.

In step 3, the noise power spectrum estimate D(λ, k) of the noisy speech determined by the improved minima-controlled recursive averaging (MCRA) algorithm consists of a non-speech part and a speech part, and is defined as:

D(λ, k) = |Y(λ_uv, k)|² + α_s(λ_v, k) D(λ_v - 1, k) + (1 - α_s(λ_v, k)) |Y(λ_v, k)|²,

where the first term is the non-speech frame estimate D(λ_uv, k), the remaining terms are the speech frame estimate D(λ_v, k), α_s(λ_v, k) is the time-frequency-dependent smoothing factor, |Y(λ_v, k)|² is the short-time power spectrum of a speech frame, and D(λ_v - 1, k) is the noise spectrum estimate of the frame preceding the current speech frame.

In step 5, the quasi-clean speech power spectrum S(λ, k) calculated by multi-band spectral subtraction consists of a non-speech part and a speech part, and its estimate is defined as:

S(λ, k) = (|Y(λ_v, k)|² - D(λ_v, k)) + (|Y(λ_uv, k)|² - D(λ_uv, k)),

where |Y(λ_v, k)|² is the short-time power spectrum of a speech frame, |Y(λ_uv, k)|² is the short-time power spectrum of a non-speech frame, D(λ_v, k) is the noise power spectrum estimate of a speech frame, and D(λ_uv, k) is the noise power spectrum estimate of a non-speech frame.

The quasi-clean speech construction method of the invention is implemented as follows:

1. Determine the speech frames and non-speech frames of the distorted speech; Fig. 2 shows this process. First, a coarse speech-segment estimate is made: the distorted speech is windowed and divided into frames, and the short-time energy and zero-crossing rate of each frame are calculated; thresholds on the short-time energy and zero-crossing rate are set, and these time-domain features determine the start frame and end frame of the speech segment of the distorted speech. The Sohn algorithm then refines the estimate within that speech segment, determining its inter-word non-speech portions. The background-noise segments and the inter-word non-speech portions are labeled as non-speech frames, and the speech portions of the speech segment are labeled as speech frames.

2. Track the noise of the distorted speech; Fig. 3 shows the noise tracking and estimation process. First, a Fourier transform is applied to the short-time frames from step 1 and the power spectrum of each frame is calculated. Noise tracking uses the improved minima-controlled recursive averaging (MCRA) algorithm, which estimates and updates the non-speech frames and the speech frames separately, improving both the accuracy and the execution efficiency of the algorithm. Non-speech frames are regarded as noise frames, so their noise spectrum estimate is simply their short-time power spectrum. For speech frames, speech presence in each band is decided by comparing the ratio of the smoothed power spectrum to its local minimum against the new frequency-dependent threshold; the speech presence probability is then smoothed and used to update the time-frequency-dependent smoothing factor; and that smoothing factor is used to update the noise spectrum estimate of the speech part. Finally, the noise spectrum estimates of the non-speech part and the speech part together form the noise spectrum estimate of the distorted speech.

3. Obtain the quasi-clean speech. Multi-band spectral subtraction is applied to the noisy power spectrum of the distorted speech and the noise power spectrum estimate obtained in step 2, giving the quasi-clean speech power spectrum. An inverse Fourier transform of the quasi-clean speech power spectrum gives the quasi-clean speech time-domain signal.

4. Evaluate the objective quality of the distorted speech. The PESQ algorithm calculates the distortion between the distorted speech and the quasi-clean speech through its perceptual model, and its cognitive model finally maps that distortion to an objective quality score for the distorted speech.

Principle of the invention: an improved minima-controlled recursive averaging (MCRA) algorithm and multi-band spectral subtraction are used to obtain the quasi-clean speech of the distorted speech; this quasi-clean speech and the distorted speech are then used as the inputs of the PESQ algorithm to obtain the objective evaluation score of the distorted speech.

Compared with the prior art, the invention has the following advantages and effects:

1. By constructing the quasi-clean speech of the distorted speech, the PESQ algorithm can be applied to objective evaluation scenarios in which no input (reference) speech is available. Compared with other non-intrusive objective evaluation methods, the invention achieves a higher correlation between subjective and objective evaluation.

2. Unlike the mainstream non-intrusive objective evaluation methods based on statistical models, the invention does not need a large clean corpus to train a statistical model, which makes the evaluation algorithm suitable for non-intrusive objective evaluation applications where clean speech material is scarce.

3. The quasi-clean speech construction method distinguishes the non-speech segments and speech segments of the distorted speech, estimates the noise power spectrum of the distorted speech more accurately, removes the noise component of the distorted speech to a large extent, and thereby improves the accuracy of the objective quality score of the distorted speech.

Brief description of the drawings

Fig. 1 is a flow chart of the quasi-clean speech construction method for objective speech quality evaluation.

Fig. 2 is a flow chart of the labeling of speech frames and non-speech frames.

Fig. 3 is a flow chart of the noise tracking and estimation for the distorted speech.

Detailed description of the embodiments

The invention is described in further detail below with reference to an embodiment and the accompanying drawings, but the embodiments of the invention are not limited thereto.

Embodiment

A quasi-clean speech construction method for objective speech quality evaluation comprises the following steps:

1. Frame and window the distorted speech (frame length 30 ms, frame shift 15 ms, Hamming window), and calculate the short-time energy and zero-crossing count of each frame. Then calculate the average energy of the distorted speech, the upper energy threshold, the lower energy threshold, the average zero-crossing count, and the zero-crossing threshold. The upper energy threshold is 0.05 times the average energy; the lower energy threshold is 0.25 times the upper energy threshold; the zero-crossing threshold is 0.3 times the average zero-crossing count.
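The framing and threshold computation in step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `frame_features` and `thresholds` are hypothetical, and counting zero crossings on the unwindowed frame is an assumption, since the text does not say which version of the frame is used.

```python
import math

def frame_features(signal, fs, frame_ms=30, shift_ms=15):
    """Return per-frame short-time energy and zero-crossing counts."""
    frame_len = fs * frame_ms // 1000
    hop = fs * shift_ms // 1000
    # Hamming window, as named in step 1.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    energies, zero_crossings = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(sum((w * x) ** 2 for w, x in zip(window, frame)))
        # Assumption: zero crossings are counted on the unwindowed frame.
        zero_crossings.append(sum(1 for a, b in zip(frame, frame[1:])
                                  if (a >= 0) != (b >= 0)))
    return energies, zero_crossings

def thresholds(energies, zero_crossings):
    """Thresholds using the multipliers stated in step 1."""
    mean_energy = sum(energies) / len(energies)
    mean_zc = sum(zero_crossings) / len(zero_crossings)
    energy_high = 0.05 * mean_energy   # upper energy threshold
    energy_low = 0.25 * energy_high    # lower energy threshold
    zc_threshold = 0.3 * mean_zc       # zero-crossing threshold
    return energy_high, energy_low, zc_threshold
```

At a sampling rate of 8 kHz, for example, this gives 240-sample frames with a 120-sample shift.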

2. Determine the start frame and end frame of the speech segment of the distorted speech with the double-threshold method based on energy and zero-crossing rate. Use the speech segment so determined as the input to the Sohn voice activity detection algorithm, which determines the inter-word non-speech portions of that segment.
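A coarse endpoint search in the spirit of the double-threshold method might look like the sketch below. The text names the method but does not spell out its rules, so the extension rule used here (grow the segment outward while a neighboring frame exceeds the lower energy threshold or the zero-crossing threshold) is an assumption, and `coarse_endpoints` is a hypothetical name. The Sohn refinement stage is not shown.

```python
def coarse_endpoints(energies, zero_crossings,
                     energy_high, energy_low, zc_threshold):
    """Return (start_frame, end_frame) of the speech segment, or None."""
    strong = [i for i, e in enumerate(energies) if e > energy_high]
    if not strong:
        return None  # no frame exceeds the upper energy threshold
    start, end = strong[0], strong[-1]

    def speech_like(i):
        # Assumed rule: weak onsets and offsets still count as speech.
        return energies[i] > energy_low or zero_crossings[i] > zc_threshold

    # Extend the segment outward from the high-energy core.
    while start > 0 and speech_like(start - 1):
        start -= 1
    while end < len(energies) - 1 and speech_like(end + 1):
        end += 1
    return start, end
```

Frames outside the returned range would then be treated as background noise, and the frames inside it passed on to the finer voice activity detection.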

3. The audio frames outside the speech segment determined in step 2, together with the inter-word non-speech frames inside it, are defined as the non-speech part of the distorted speech; the remaining frames of the speech segment are defined as the speech part of the distorted speech. As shown in Fig. 2, the short-time frames λ of the distorted speech are thus labeled as non-speech frames λ_uv and speech frames λ_v.

4. As shown in Fig. 3, apply the fast Fourier transform to the short-time frames of the distorted speech to obtain the non-speech frame power spectrum |Y(λ_uv, k)|² and the speech frame power spectrum |Y(λ_v, k)|², where k is the frequency band index.

5. Estimate the noise power spectrum of the non-speech frames. Non-speech segments are regarded as noise, so the noise spectrum estimate is D(λ_uv, k) = |Y(λ_uv, k)|².

6. Smooth the speech frame power spectrum |Y(λ_v, k)|²:

P(λ_v, k) = η P(λ_v - 1, k) + (1 - η) |Y(λ_v, k)|²,

where P(λ_v, k) is the smoothed speech frame power spectrum, λ_v is the speech frame index, k is the frequency band index, and η is the smoothing parameter (set to 0.7).

7. Track the local minimum of P(λ_v, k) to obtain P_min(λ_v, k):

if P_min(λ_v - 1, k) < P(λ_v, k)

P_{\min}(λ_v, k) = γ P_{\min}(λ_v - 1, k) + \frac{1 - γ}{1 - β} (P(λ_v, k) - β P(λ_v - 1, k))

else

P_{\min}(λ_v, k) = P(λ_v, k)

end

where β is set to 0.8 and γ is set to 0.998.
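Steps 6 and 7 can be sketched per frequency band as below. The function names are hypothetical, and the values used for the very first speech frame (the previous P and P_min) are not given in the text, so the caller is assumed to supply them.

```python
ETA, BETA, GAMMA = 0.7, 0.8, 0.998  # values stated in steps 6 and 7

def smooth_power(p_prev, y_power, eta=ETA):
    """Step 6: P(l, k) = eta * P(l-1, k) + (1 - eta) * |Y(l, k)|^2."""
    return [eta * pp + (1 - eta) * y for pp, y in zip(p_prev, y_power)]

def track_minimum(pmin_prev, p_prev, p_cur, beta=BETA, gamma=GAMMA):
    """Step 7: continuous local-minimum tracking for each band."""
    pmin = []
    for m_prev, pp, pc in zip(pmin_prev, p_prev, p_cur):
        if m_prev < pc:
            # Minimum rises slowly towards the smoothed power.
            pmin.append(gamma * m_prev
                        + (1 - gamma) / (1 - beta) * (pc - beta * pp))
        else:
            # Smoothed power dropped to or below the old minimum.
            pmin.append(pc)
    return pmin
```

Because γ is close to 1, the tracked minimum rises slowly while speech is present and drops immediately when the smoothed power falls below it.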

8. Calculate the speech presence probability. First calculate the ratio S_r(λ_v, k) of the smoothed speech frame power spectrum to its local minimum:

S_r(λ_v, k) = \frac{P(λ_v, k)}{P_{\min}(λ_v, k)},

Then use S_r(λ_v, k) to determine the band-wise speech presence indicator I(λ_v, k) of the speech frame:

if S_r(λ_v, k) > δ(k)

I(λ_v, k) = 1 (speech present)

else

I(λ_v, k) = 0 (speech absent)

end

where δ(k) is the frequency-dependent threshold:

δ(k) = \begin{cases} 1.5, & 1 \leq k \leq LF \\ 2.5, & LF < k \leq MF \\ 6.5, & MF < k \leq Fs / 2 \end{cases}

where LF and MF are the band indices corresponding to 1 kHz and 3 kHz respectively, Fs is the sampling frequency, and k is the frequency band index.
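The threshold and the presence decision in step 8 can be sketched as follows. The text does not say how LF and MF are derived from the FFT size, so the mapping below (band k lies at k·Fs/N Hz for an N-point FFT) is an assumption, as are the function names; a band that falls exactly on LF or MF is given the lower threshold.

```python
def delta(k, n_fft, fs):
    """Frequency-dependent threshold for band index k (1 <= k <= n_fft // 2)."""
    lf = 1000 * n_fft // fs  # assumed band index for 1 kHz
    mf = 3000 * n_fft // fs  # assumed band index for 3 kHz
    if k <= lf:
        return 1.5
    if k <= mf:
        return 2.5
    return 6.5

def speech_present(p, p_min, n_fft, fs):
    """Return I(k) in {0, 1} for bands k = 1 .. len(p).

    Assumes every p_min value is strictly positive.
    """
    return [1 if pk / mk > delta(k, n_fft, fs) else 0
            for k, (pk, mk) in enumerate(zip(p, p_min), start=1)]
```

With Fs = 8000 Hz and a 256-point FFT, for instance, this mapping puts LF at band 32 and MF at band 96.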

9. Smooth the speech presence probability p(λ_v, k):

p(λ_v, k) = α_p p(λ_v - 1, k) + (1 - α_p) I(λ_v, k),

where α_p is the smoothing parameter (set to 0.2).

10. Use the smoothed speech presence probability p(λ_v, k) to calculate the time-frequency-dependent smoothing factor α_s(λ_v, k):

α_s(λ_v, k) = α_d + (1 - α_d) p(λ_v, k),

where α_d is a constant (set to 0.85).

11. Use the time-frequency-dependent smoothing factor α_s(λ_v, k) to update the speech frame noise spectrum estimate D(λ_v, k):

D(λ_v, k) = α_s(λ_v, k) D(λ_v - 1, k) + (1 - α_s(λ_v, k)) |Y(λ_v, k)|².
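Steps 9 to 11 chain together naturally, so a single per-frame update is sketched here. The function name `update_noise` and the list-per-band data layout are illustrative assumptions; the constants are the ones stated in the text.

```python
ALPHA_P, ALPHA_D = 0.2, 0.85  # values stated in steps 9 and 10

def update_noise(d_prev, p_prev, indicator, y_power,
                 alpha_p=ALPHA_P, alpha_d=ALPHA_D):
    """Return (D, p) for the current speech frame, one value per band."""
    d_cur, p_cur = [], []
    for d, p, i, y in zip(d_prev, p_prev, indicator, y_power):
        p_new = alpha_p * p + (1 - alpha_p) * i        # step 9
        alpha_s = alpha_d + (1 - alpha_d) * p_new      # step 10
        d_cur.append(alpha_s * d + (1 - alpha_s) * y)  # step 11
        p_cur.append(p_new)
    return d_cur, p_cur
```

When speech is judged present, α_s moves towards 1 and the noise estimate is nearly frozen; when speech is judged absent, α_s falls back towards α_d and the estimate follows the current frame more closely.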

12. Apply multi-band spectral subtraction to obtain the quasi-clean power spectra of the speech segments and the non-speech segments, and obtain the quasi-clean speech s(t) by the inverse Fourier transform:

s(t) = IFFT[|Y(λ_v, k)|² + |Y(λ_uv, k)|² - (D(λ_v, k) + D(λ_uv, k))],
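One way to turn the subtracted power spectrum back into a time-domain frame is sketched below. The formula above does not say how phase or negative differences are handled, so this sketch makes three loudly labeled assumptions: negative power is floored at zero, the magnitude is the square root of the remaining power, and the phase of the noisy frame is reused. No per-band weighting is applied because the text gives none, and overlap-add of successive frames is omitted. The function names are hypothetical.

```python
import cmath
import math

def quasi_clean_bins(noisy_bins, noise_power):
    """Subtract noise power per band and rebuild complex spectrum bins."""
    clean = []
    for y, d in zip(noisy_bins, noise_power):
        power = max(abs(y) ** 2 - d, 0.0)  # assumption: floor at zero
        phase = cmath.phase(y)             # assumption: reuse noisy phase
        clean.append(cmath.rect(math.sqrt(power), phase))
    return clean

def idft(bins):
    """Naive inverse DFT; returns the real part of each time sample."""
    n = len(bins)
    return [sum(x * cmath.exp(2j * cmath.pi * k * t / n)
                for k, x in enumerate(bins)).real / n
            for t in range(n)]
```

A practical implementation would use a fast transform instead of the naive `idft`, which is kept here only so that the sketch has no dependencies.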

13. As shown in Fig. 1, calculate the objective quality score of the distorted speech: the PESQ algorithm calculates the distortion between the distorted speech and the quasi-clean speech, and its cognitive model maps that distortion to the objective quality score of the distorted speech.

The embodiment described above is a preferred embodiment of the invention, but the embodiments of the invention are not limited to it. Any other change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the invention shall be regarded as an equivalent replacement and falls within the scope of protection of the invention.

Claims

1. A quasi-clean speech construction method for objective speech quality evaluation, characterized in that it comprises the following steps:

Step 5: multi-band spectral subtraction, using the division into non-speech segments and speech segments and the corresponding noise spectrum estimates, calculates the quasi-clean power spectra of the non-speech segments and the speech segments separately, thereby obtaining the quasi-clean speech power spectrum of the distorted speech.

2. The quasi-clean speech construction method for objective speech quality evaluation according to claim 1, characterized in that, in step 1, the improved minima-controlled recursive averaging (MCRA) algorithm is based on the division into non-speech segments and speech segments; non-speech segments are treated as noise, and the noise spectrum estimate is D(λ_uv, k) = |Y(λ_uv, k)|², where |Y(λ_uv, k)|² is the short-time power spectrum of a non-speech frame, λ_uv is the frame index within the non-speech segments, and k is the frequency band index.

3. The quasi-clean speech construction method for objective speech quality evaluation according to claim 1, characterized in that, in step 2, when the improved minima-controlled recursive averaging (MCRA) algorithm estimates the noise of speech frames, the frequency-dependent threshold δ(k) is defined as:

δ(k) = \begin{cases} 1.5, & 1 \leq k \leq LF \\ 2.5, & LF < k \leq MF \\ 6.5, & MF < k \leq Fs / 2 \end{cases}

4. The quasi-clean speech construction method for objective speech quality evaluation according to claim 1, characterized in that, in step 3, the noise power spectrum estimate D(λ, k) of the noisy speech determined by the improved minima-controlled recursive averaging (MCRA) algorithm consists of a non-speech part and a speech part, and is defined as:

D(λ, k) = |Y(λ_uv, k)|² + α_s(λ_v, k) D(λ_v - 1, k) + (1 - α_s(λ_v, k)) |Y(λ_v, k)|².

5. The quasi-clean speech construction method for objective speech quality evaluation according to claim 2, characterized in that the division into non-speech segments and speech segments is realized by voice activity detection, namely: time-domain features such as the zero-crossing rate and short-time energy are used to make a coarse estimate of the distorted speech, finding the start time and end time of its speech segment, excluding the background noise, and determining the whole utterance segment of the distorted speech; the Sohn voice activity detection algorithm is then used to refine the estimate within the located utterance segment, determining the speech portions and the inter-word non-speech portions of the speech segment.

6. The quasi-clean speech construction method for objective speech quality evaluation according to claim 1, characterized in that, in step 5, the quasi-clean speech power spectrum S(λ, k) calculated by multi-band spectral subtraction consists of a non-speech part and a speech part, and its estimate is defined as:

S(λ, k) = (|Y(λ_v, k)|² - D(λ_v, k)) + (|Y(λ_uv, k)|² - D(λ_uv, k)).