CN100347988C

CN100347988C - Broad frequency band voice quality objective evaluation method

Info

Publication number: CN100347988C
Application number: CNB2003101112735A
Authority: CN
Inventors: 胡瑞敏; 艾浩军; 涂卫平
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2003-10-24
Filing date: 2003-10-24
Publication date: 2007-11-07
Anticipated expiration: 2023-10-24
Also published as: CN1538667A

Abstract

The present invention relates to an objective evaluation method for broadband voice quality. The amplitudes of the test speech and the reference speech are normalized into sequences with a mean of zero and a standard deviation of 1; the critical-band hearing threshold in the 50 to 7000 Hz frequency band is calculated; the quiet frame threshold is calculated from the energy of the windowed speech frames; the power spectrum of the normalized signal is calculated; a Bark spectrum is obtained by summation within each critical band; the loudness of the speech frame is calculated from the Bark spectrum; the loudness vector is normalized; a perceptible-distortion flag M(i) is determined from the loudness L_o of the original speech, the loudness L_t of the coded speech, and the noise masking threshold Th_n; the distortion of each frame is given; the steps are repeated, and the distortion WBSD of the whole speech segment is calculated. Because speech distortion in quiet segments does not affect hearing quality, only the distortions of the non-quiet frames are accumulated, and the sum is averaged over the number of non-quiet frames to obtain the WBSD of the whole speech segment. The method maintains good correlation with subjective quality measures and improves precision.

Description

Objective evaluation method for broadband voice quality

Technical Field

The invention belongs to the field of voice communication quality evaluation, and particularly relates to an objective quality evaluation method for broadband voice communication on a data network.

Background

When a data network carries voice services, quality of service must be considered. To use bandwidth effectively, speech coding and voice activity detection are used to implement Discontinuous Transmission (DTX), so the signal received by the listener and the signal sent by the speaker are not strictly synchronized in the time domain. Meanwhile, as demand for call quality rises, broadband (50-7000 Hz) voice communication is becoming more widely used because of its higher intelligibility, naturalness and clarity. The existing objective quality evaluation methods for telephone-bandwidth (300-3400 Hz) speech have the following defects: a. they cannot meet the objective quality evaluation requirements of broadband speech; b. they cannot meet the objective quality evaluation requirements after discontinuous transmission is adopted on a packet network.

Disclosure of Invention

The invention aims to provide a method for evaluating broadband voice transmission quality on a packet network, which overcomes the defects of the existing circuit switching network objective voice quality evaluation method.

In order to achieve the purpose, the invention provides a broadband voice quality objective evaluation method, which is characterized by comprising the following steps of:

(1) the voice section comprises a test voice and a reference voice, a voice frame is taken from the voice section for calculation, and the amplitudes of the test voice and the reference voice are normalized into a sequence with an average value of 0 and a standard deviation of 1;

(2) calculating a critical band hearing threshold in a frequency band of 50-7000 Hz;

(3) calculating a quiet frame speech energy threshold based on the energy of the windowed speech frames of the reference speech; if the energy of a speech frame is less than the quiet frame speech energy threshold, that frame does not participate in the quality assessment; the quiet frame speech energy threshold En_SilenceTh is 15 dB lower than the energy En_Max of the maximum-energy frame;

(4) calculating a power spectrum for the normalized signal;

(5) summing in a critical band to obtain a Bark spectrum;

(6) calculating the loudness of the current speech frame from the Bark spectrum, i.e. calculating the loudness L_t(i) of each critical band of the test speech and the loudness L_o(i) of each critical band of the reference speech, where 1 ≤ i ≤ K and K is the number of critical bands;

(7) calculating the normalized loudness \bar{L}_t(i) of the test speech; the normalization factor equals the ratio of the sum of the critical-band loudness L_o(i) of the reference speech to the sum of the critical-band loudness L_t(i) of the test speech:

\bar{L}_t(i) = \frac{\sum_{j=1}^{K} L_o(j)}{\sum_{j=1}^{K} L_t(j)} \, L_t(i), \quad 1 \leq i \leq K

(8) determining the perceptible-distortion flag M(i) from the critical-band loudness L_o(i) of the reference speech, the normalized loudness \bar{L}_t(i) of the test speech, and the noise masking threshold Th_n(i):

M(i) = \begin{cases} 1, & L_o(i) - \bar{L}_t(i) > Th_n(i) \\ 0, & \text{otherwise} \end{cases}, \quad 1 \leq i \leq K

(9) the distortion D(i) of each critical band is given by:

D(i) = M(i) \left| L_o(i) - \bar{L}_t(i) \right|

(10) repeating steps (1) to (9) to process the whole speech segment frame by frame, and then calculating the distortion WBSD of the whole speech segment: since speech distortion in a quiet frame, whether present or not, does not affect hearing quality, the distortions of the non-quiet frames are accumulated and summed, and the sum is averaged over the number of non-quiet frames to obtain the WBSD of the whole speech segment.

WBSD = \frac{1}{N} \sum_{j=1}^{N} \left[ \sum_{i=1}^{K} D^{(j)}(i) \right]

Wherein,

N: total number of processed non-silent frames

K: number of critical bands

D^{(j)}(i): distortion of the i-th critical band of the j-th frame of the reference speech
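
Steps (7) to (10) above can be illustrated with a minimal Python sketch. The function names are hypothetical, the loudness vectors are assumed to have been computed already by steps (1) to (6), and a single set of per-band thresholds Th_n(i) is assumed to be shared by all frames, which the text does not state.

```python
from typing import List, Sequence


def normalize_loudness(l_ref: Sequence[float], l_test: Sequence[float]) -> List[float]:
    """Step (7): scale the test loudness so its band sum matches the reference."""
    total_test = sum(l_test)
    if total_test == 0:
        # Assumption: an all-zero test frame is returned unscaled.
        return list(l_test)
    factor = sum(l_ref) / total_test
    return [factor * v for v in l_test]


def frame_distortion(l_ref: Sequence[float], l_test: Sequence[float],
                     th_n: Sequence[float]) -> float:
    """Steps (8)-(9): sum of D(i) = M(i) * |L_o(i) - normalized L_t(i)| over bands."""
    l_test_norm = normalize_loudness(l_ref, l_test)
    total = 0.0
    for lo, lt, th in zip(l_ref, l_test_norm, th_n):
        flag = 1 if (lo - lt) > th else 0  # perceptible-distortion flag M(i)
        total += flag * abs(lo - lt)       # band distortion D(i)
    return total


def wbsd(frames_ref: Sequence[Sequence[float]], frames_test: Sequence[Sequence[float]],
         th_n: Sequence[float], silent: Sequence[bool]) -> float:
    """Step (10): average the frame distortion over the non-silent frames only."""
    distortions = [frame_distortion(r, t, th_n)
                   for r, t, s in zip(frames_ref, frames_test, silent) if not s]
    if not distortions:
        return 0.0  # assumption: no non-silent frames means no measurable distortion
    return sum(distortions) / len(distortions)
```

The `silent` flags would come from the quiet-frame threshold of step (3); frames flagged as silent contribute neither to the sum nor to N.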

Further, in step (10), linear prediction coefficients (LPC) are calculated from the power spectrum of the reference speech, and the average is calculated after the Bark spectral distance of each critical band has been weighted by the LPC spectral envelope, where the weighting coefficient W^{(j)}(i) is the sum of the LPC filter frequency response values within the i-th critical band of the j-th frame:

WBSD = \frac{1}{N} \sum_{j=1}^{N} \left[ \sum_{i=1}^{K} W^{(j)}(i) \, D^{(j)}(i) \right]

The invention provides a method for calculating a weighted spectral distance: the critical bands whose spectral distance exceeds the masking value are weighted according to the amplitude of the LPC (linear predictive coding) spectrum, and the spectral distance of each frame is then calculated. After the FFT, the autocorrelation coefficients are calculated directly in the frequency domain, and the LPC spectrum is calculated with the Durbin algorithm.
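
As an illustration of the last two sentences, the following Python sketch obtains autocorrelation coefficients from a power spectrum by an inverse DFT (the Wiener-Khinchin relation) and then runs the Levinson-Durbin recursion. It is a sketch under assumptions rather than the patented implementation: the function names are hypothetical, the power spectrum is assumed to be the full-length (two-sided) spectrum of a zero-padded frame, and the prediction order is left to the caller because the text does not specify it.

```python
import cmath
from typing import List, Sequence


def autocorr_from_power_spectrum(power: Sequence[float], order: int) -> List[float]:
    """Inverse DFT of a full-length power spectrum; returns lags 0..order."""
    n = len(power)
    r = []
    for lag in range(order + 1):
        acc = sum(p * cmath.exp(2j * cmath.pi * k * lag / n)
                  for k, p in enumerate(power))
        r.append(acc.real / n)
    return r


def levinson_durbin(r: Sequence[float], order: int) -> List[float]:
    """Return [1, a1, ..., ap] for the prediction-error filter A(z)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0:
            break  # degenerate input (e.g. an all-zero frame)
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        refl = -acc / err  # reflection coefficient for this order
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + refl * prev[m - k]
        a[m] = refl
        err *= 1.0 - refl * refl
    return a
```

A direct O(n) sum per lag is used here for clarity; with a 1024-point FFT already available, an inverse FFT would normally replace it.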

Furthermore, in step (1) above, hierarchical time alignment based on voice activity detection is added, and the subsequent analysis is performed after the active speech segments have been time-aligned.

The invention has the following advantages and positive effects:

(1) calculating the speech Bark spectral distance in a wide frequency band as a measure basis, matching with the auditory characteristics of human ears, and keeping good correlation with subjective quality measure;

(2) by adopting a loudness linear interpolation algorithm, the precision is higher than that of a table lookup interpolation calculation method used for calculating the general loudness;

(3) the peaks of the LPC spectrum correspond to the formants of the speech signal, and the frequency bands containing the formants are directly related to speech intelligibility; increasing the weight of these bands improves the correlation between the method and subjective quality;

(4) due to the action of the voice activity detector, the problem that the reference speech and the detected speech are not synchronous due to the discontinuous transmission in the voice communication of the packet network can be overcome.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention;

FIG. 2 is a graph of weighting coefficients derived from the LPC filter frequency response for an embodiment of the present invention;

FIG. 3 is a schematic diagram of discontinuous transmission according to an embodiment of the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings 1 to 3.

The invention provides a broadband voice quality objective evaluation method, which comprises the following steps:

(1) the amplitudes of the test speech and the reference speech are normalized into sequences with a mean of 0 and a standard deviation of 1;

(2) the critical-band hearing threshold in the 50-7000 Hz frequency band is calculated;

(3) a quiet frame threshold is calculated based on the energy of the windowed speech frames, and if the energy of a speech frame is less than the quiet frame threshold, that frame does not participate in the quality assessment. The quiet frame speech energy threshold En_SilenceTh is 15 dB lower than the energy En_Max of the maximum-energy frame;

(4) calculating a power spectrum of the normalized signal;

(5) summing in a critical band to obtain a Bark spectrum;

(6) calculating the loudness of the voice frame according to the Bark spectrum;

(7) normalizing the loudness vector L_t(i): the normalization factor equals the ratio of the sum of the loudness L_o(j) of the reference speech frame to the sum of the loudness L_t(j) of the test speech frame, where K is the number of critical bands;

\bar{L}_t(i) = \frac{\sum_{j=1}^{K} L_o(j)}{\sum_{j=1}^{K} L_t(j)} \, L_t(i)

(8) determining the perceptible-distortion flag M(i) from the loudness \bar{L}_o of the reference speech, the loudness \bar{L}_t of the test speech, and the noise masking threshold Th_n:

M(i) = \begin{cases} 1, & \bar{L}_o(i) - \bar{L}_t(i) > Th_n \\ 0, & \text{otherwise} \end{cases}

(9) the distortion D(i) of each frame is given by:

D(i) = M(i) \left| \bar{L}_o(i) - \bar{L}_t(i) \right|

(10) repeating steps (1) to (9) and calculating the distortion WBSD of the whole speech segment: since speech distortion in a quiet segment does not affect hearing quality, the distortions of the non-quiet segments are accumulated and summed, and the sum is averaged over the number of non-quiet frames to obtain the WBSD of the whole speech segment.

WBSD = \frac{1}{N} \sum_{j=1}^{N} \left[ \sum_{i=1}^{K} M(i) \left| L_o^{(j)}(i) - L_t^{(j)}(i) \right| \right]

Wherein,

N: total number of frames processed

K: number of critical bands

L_o^{(j)}(i): Bark spectrum of the j-th frame of the reference speech

L_t^{(j)}(i): Bark spectrum of the j-th frame of the test speech
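
Steps (1) and (3) of the embodiment lend themselves to a short Python sketch. The function names are hypothetical. Two assumptions are made that the text does not spell out: the standard deviation is the population standard deviation, and the 15 dB offset is applied as an energy ratio, i.e. a factor of 10^(-15/10).

```python
import math
from typing import List, Sequence


def normalize_amplitude(samples: Sequence[float]) -> List[float]:
    """Step (1): shift to zero mean and scale to unit standard deviation."""
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / n)
    if std == 0:
        return [0.0] * n  # constant input carries no signal
    return [(s - mean) / std for s in samples]


def silence_threshold(frame_energies: Sequence[float], drop_db: float = 15.0) -> float:
    """Step (3): energy threshold 15 dB below the maximum-energy frame."""
    return max(frame_energies) * 10.0 ** (-drop_db / 10.0)


def non_silent_flags(frame_energies: Sequence[float]) -> List[bool]:
    """True for frames that take part in the quality assessment."""
    threshold = silence_threshold(frame_energies)
    return [e >= threshold for e in frame_energies]
```

Frames whose flag is False are excluded both from the distortion sum and from the frame count N.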

FIG. 1 is a flow chart of an embodiment of the method. The test speech y(n) and the reference speech x(n) are each input to the BSD preprocessor, which calculates the loudness L_y(j) of each critical band in a frame of test speech and the loudness L_x(j) of each critical band in a frame of reference speech. The speech bandwidth is limited to 50-7000 Hz, which covers the critical bands with Bark numbers 1 to 21, corresponding to 20-7700 Hz; throughout the calculation the loudness model is therefore a 21-dimensional feature vector. The noise threshold calculation module derives the noise masking threshold Th_n and the perceptible-distortion flag M(j) for each critical band. The outputs of the BSD preprocessor and the noise threshold calculation module are combined to give the distortion WBSD of each frame. The input speech signal consists of 16-bit signed integers sampled at 16 kHz. In the BSD preprocessor, the speech signal is first converted from the time domain to the frequency domain by an FFT with a window length of 1024 points; each speech frame is 20 ms long, corresponding to 640 voice sample points, and the frame shift is 10 ms.
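
The mapping from FFT bins to the 21 critical bands can be sketched as follows. The band edges are an assumption: they are the commonly tabulated Zwicker critical-band edges, which give 21 bands spanning 20-7700 Hz as described above, but the text itself does not list them. The function name is hypothetical, and the input is assumed to be a one-sided power spectrum of a 1024-point FFT at 16 kHz.

```python
from typing import List, Sequence

# Assumption: Zwicker critical-band edges in Hz (not listed in the text).
# 22 edges define the 21 bands covering 20-7700 Hz.
BAND_EDGES_HZ = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
                 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300,
                 6400, 7700]


def bark_spectrum(power: Sequence[float], sample_rate: int = 16000,
                  fft_size: int = 1024) -> List[float]:
    """Sum the one-sided power-spectrum bins that fall inside each critical band."""
    bands = [0.0] * (len(BAND_EDGES_HZ) - 1)
    for k, p in enumerate(power):
        freq = k * sample_rate / fft_size  # centre frequency of bin k
        for b in range(len(bands)):
            if BAND_EDGES_HZ[b] <= freq < BAND_EDGES_HZ[b + 1]:
                bands[b] += p
                break  # bins outside 20-7700 Hz match no band and are ignored
    return bands
```

With these parameters the bin spacing is 15.625 Hz, so the lowest band (20-100 Hz) receives only a handful of bins while the highest (6400-7700 Hz) receives more than eighty.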

As shown in FIG. 2, linear prediction coefficients (LPC) are calculated for the windowed speech signal and the frequency response of the resulting filter is computed; this response is the broken line in the figure. The peaks of the filter response correspond to the formants of the speech frame. The frequency response values within each critical band are summed and averaged to obtain the weighting coefficient W(i), and the speech distortion WBSD is calculated according to the following formula.

WBSD = \frac{1}{N} \sum_{j=1}^{N} \left[ \sum_{i=1}^{K} W(i) \, M(i) \left| L_o^{(j)}(i) - L_t^{(j)}(i) \right| \right]
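
A possible way to derive the weighting coefficients W(i) is sketched below in Python. This is one reading of "summed and averaged", taken here as the mean of the LPC magnitude response over the FFT bins of each critical band. The function names and the `band_bins` argument (a list of bin indices per band) are hypothetical, and the filter gain term is ignored.

```python
import cmath
from typing import List, Sequence


def lpc_magnitude_response(a: Sequence[float], fft_size: int = 1024) -> List[float]:
    """|1 / A(e^{jw})| at the one-sided FFT bin frequencies, with a = [1, a1, ..., ap]."""
    resp = []
    for k in range(fft_size // 2 + 1):
        w = 2.0 * cmath.pi * k / fft_size
        denom = sum(c * cmath.exp(-1j * w * m) for m, c in enumerate(a))
        resp.append(1.0 / abs(denom))  # assumes no pole lies exactly on the unit circle
    return resp


def band_weights(resp: Sequence[float],
                 band_bins: Sequence[Sequence[int]]) -> List[float]:
    """W(i): mean of the response values whose bins fall in critical band i."""
    return [sum(resp[k] for k in bins) / len(bins) if bins else 0.0
            for bins in band_bins]


def weighted_frame_distortion(weights: Sequence[float],
                              distortions: Sequence[float]) -> float:
    """Inner sum of the weighted WBSD formula for one frame."""
    return sum(w * d for w, d in zip(weights, distortions))
```

Bands that contain a formant peak receive a larger weight, so perceptible distortion in those bands contributes more to the frame score.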

As shown in FIG. 3, because discontinuous transmission is used in a data network, the speech received by the listener is not time-aligned with the speech sent by the speaker. A voice activity detection method may be used to time-align the active speech segments, which are then analyzed frame by frame to compute the WBSD.
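
The text does not specify the voice activity detector or the alignment procedure. Purely as an illustration, the Python sketch below uses a hypothetical energy-threshold detector to find runs of active frames and pairs the n-th active segment of the reference speech with the n-th active segment of the test speech, trimming both to the shorter length. All names and the pairing rule are assumptions.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # (start, end) frame indices, end exclusive


def active_segments(frame_energies: Sequence[float], threshold: float) -> List[Segment]:
    """Return the runs of consecutive frames whose energy is at least the threshold."""
    segments: List[Segment] = []
    start = None
    for idx, energy in enumerate(frame_energies):
        if energy >= threshold and start is None:
            start = idx
        elif energy < threshold and start is not None:
            segments.append((start, idx))
            start = None
    if start is not None:
        segments.append((start, len(frame_energies)))
    return segments


def pair_segments(ref_segs: Sequence[Segment],
                  test_segs: Sequence[Segment]) -> List[Tuple[Segment, Segment]]:
    """Pair segments in order of occurrence and trim each pair to equal length."""
    pairs = []
    for (rs, re_), (ts, te) in zip(ref_segs, test_segs):
        n = min(re_ - rs, te - ts)
        pairs.append(((rs, rs + n), (ts, ts + n)))
    return pairs
```

Each aligned pair can then be analyzed frame by frame, so that a silence gap shortened or lengthened by discontinuous transmission does not shift the frames being compared.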

Taking G.722.1 coding as an example, the voice quality was calculated under different packet loss rates; the correlation between the objective results and the subjective test results is not lower than 0.8.

Claims

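1. A broadband voice quality objective evaluation method is characterized by comprising the following steps:

(1) the voice section comprises a test voice and a reference voice, a voice frame is taken from the voice section for calculation, and the amplitudes of the test voice and the reference voice are normalized into a sequence with an average value of 0 and a standard deviation of 1;

(2) calculating a critical band hearing threshold in a frequency band of 50-7000 Hz;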

(3) calculating a quiet frame speech energy threshold based on the energy of the windowed speech frames of the reference speech; if the energy of a speech frame is less than the quiet frame speech energy threshold, that frame does not participate in the quality evaluation; the quiet frame speech energy threshold En_SilenceTh is 15 dB lower than the energy En_Max of the maximum-energy frame;

(4) calculating a power spectrum for the normalized signal;

(5) summing in a critical band to obtain a Bark spectrum;

(6) calculating the loudness of the current speech frame from the Bark spectrum, i.e. calculating the loudness L_t(i) of each critical band of the test speech and the loudness L_o(i) of each critical band of the reference speech, where 1 ≤ i ≤ K and K is the number of critical bands;

(7) calculating the normalized loudness \bar{L}_t(i) of the test speech; the normalization factor equals the ratio of the sum of the critical-band loudness L_o(i) of the reference speech to the sum of the critical-band loudness L_t(i) of the test speech:

\bar{L}_t(i) = \frac{\sum_{j=1}^{K} L_o(j)}{\sum_{j=1}^{K} L_t(j)} \, L_t(i), \quad 1 \leq i \leq K

(8) determining the perceptible-distortion flag M(i) from the critical-band loudness L_o(i) of the reference speech, the normalized loudness \bar{L}_t(i) of the test speech, and the noise masking threshold Th_n(i):

M(i) = \begin{cases} 1, & L_o(i) - \bar{L}_t(i) > Th_n(i) \\ 0, & \text{otherwise} \end{cases}, \quad 1 \leq i \leq K

(9) the distortion D(i) of each critical band is given by:

D(i) = M(i) \left| L_o(i) - \bar{L}_t(i) \right|

(10) repeating steps (1) to (9) to process the whole speech segment frame by frame, and then calculating the distortion WBSD of the whole speech segment: since speech distortion in a quiet frame does not affect hearing quality, the distortions of the non-quiet frames are accumulated and summed, and the sum is averaged over the number of non-quiet frames to obtain the WBSD of the whole speech segment;

WBSD = \frac{1}{N} \sum_{j=1}^{N} \left[ \sum_{i=1}^{K} D^{(j)}(i) \right]

wherein,

N: total number of processed non-silent frames

K: number of critical bands

2. The objective evaluation method for wideband speech quality according to claim 1, wherein: in step (10), linear prediction coefficients (LPC) are calculated from the power spectrum of the reference speech, and the average is calculated after the Bark spectral distance of each critical band has been weighted by the LPC spectral envelope, where the weighting coefficient W^{(j)}(i) is the sum of the LPC filter frequency response values within the i-th critical band of the j-th frame;

WBSD = \frac{1}{N} \sum_{j=1}^{N} \left[ \sum_{i=1}^{K} W^{(j)}(i) \, D^{(j)}(i) \right]

3. The objective evaluation method for wideband speech quality according to claim 1 or 2, characterized in that: in step (1) above, hierarchical time alignment based on voice activity detection is added, and the subsequent analysis is performed after the active speech segments have been time-aligned.