Frequency compensation for perceptual speech analysis

Info

Publication number
WO2006033570A1
Authority
WO
WIPO (PCT)
Prior art keywords
density function
loudness
frequency
input
output
Prior art date
Application number
PCT/NL2005/000683
Other languages
French (fr)
Inventor
John Gerard Beerends
Original Assignee
Nederlandse Organisatie Voor Toegepast- Natuurwetenschappelijk Onderzoek Tno
Priority date
Filing date
Publication date
Application filed by Nederlandse Organisatie Voor Toegepast- Natuurwetenschappelijk Onderzoek Tno filed Critical Nederlandse Organisatie Voor Toegepast- Natuurwetenschappelijk Onderzoek Tno
Priority to EP05787331A priority Critical patent/EP1792304B1/en
Priority to JP2007532270A priority patent/JP4879180B2/en
Priority to CA2580763A priority patent/CA2580763C/en
Priority to CN2005800377134A priority patent/CN101053016B/en
Priority to DE602005009221T priority patent/DE602005009221D1/en
Priority to US11/663,138 priority patent/US8014999B2/en
Priority to AU2005285694A priority patent/AU2005285694B2/en
Priority to DK05787331T priority patent/DK1792304T3/en
Publication of WO2006033570A1 publication Critical patent/WO2006033570A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/69 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M3/00 - Automatic or semi-automatic exchanges
    • H04M3/22 - Arrangements for supervision, monitoring or testing
    • H04M3/2236 - Quality of speech transmission monitoring
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08 - Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/09 - Long term prediction, i.e. removing periodical redundancies, e.g. by using adaptive codebook or pitch predictor

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Measuring Pulse, Heart Rate, Blood Pressure Or Blood Flow (AREA)
  • Measuring Instrument Details And Bridges, And Automatic Balancing Devices (AREA)
  • Measurement And Recording Of Electrical Phenomena And Electrical Characteristics Of The Living Body (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a softscaled frequency compensation function that allows the evaluation of a first quality measure indicating the global impact of all distortions in an audio transmission system, including linear frequency response distortions, and a second quality measure that only takes into account the impact of linear frequency response distortions. The softscaled frequency compensation function is derived from a softscaled ratio between a time integrated output and a time integrated input power density function. The first quality measure is derived from the difference loudness density function as a function of time and frequency, using the frequency compensated input loudness density function and the gain compensated output loudness density function, both as a function of time and frequency, in the same manner as carried out in ITU standard P.862. The second measure is established from a difference loudness density function as a function of frequency only, which is derived from a time integration of a second frequency compensated input loudness density function and the gain compensated output loudness density function, both as a function of time and frequency.

Description

FREQUENCY COMPENSATION FOR PERCEPTUAL SPEECH ANALYSIS
Field of the invention
The invention relates to a method for establishing a frequency compensated input pitch power density function of a time framed input signal for application to an audio transmission system having an input and an output, and the output of which yields a time framed output signal.
The invention also relates to a processing system for establishing a frequency compensated input pitch power density function. The invention also relates to a computer readable medium comprising computer executable software code.
Background
The method and system to which the invention relates may be used, for example, as part of a method or system for analysing the perceived quality of an audio transmission system. Such a method and system for analysing a perceptual quality measure for the impact of linear frequency distortion are known from previously published European patent application no. EP1343145 and are also disclosed in references [1] ... [8]. The disclosed system and method and their predecessors provide for perceptual speech evaluation as part of ITU-T recommendation P.862 (further referred to as P.862), whereby a single overall measure for the perceived quality of a degraded output signal with respect to an input signal is obtained.
The disclosed method and system are based on the insight that speech and audio quality measurement should be carried out in the perceptual domain (see fig. 1). This goal is achieved by comparing a reference speech signal Xn, which is applied to the system under test (1), with its degraded output signal Yn. By establishing the internal perceptual representations of these signals (0.1), (0.2) and comparing (0.3) them, an estimate can be made of the perceived quality by mapping (0.4) the result to a perceived quality scale, yielding a perceived quality measure PESQ. A perceived quality scale, also known as a mean opinion score (MOS) scale, is established in the prior art by empirical estimation. Persons are asked to judge the quality of degraded or distorted speech fragments. The scores are then matched to the actual distortion and laid down in a perceptual scale. This scale can be used to predict the perceptual score depending on the distortion present in a signal. Currently available processing systems for determining the perceived quality of an audio transmission system, including P.862, suffer from the fact that a single number is output that represents the overall quality. This makes it impossible to find underlying causes for the perceived degradations. Classical measurements like signal to noise ratio, frequency response distortion, total harmonic distortion, etc. presuppose a certain type of degradation and then quantify this by performing a certain type of quality measurement. This classical approach finds one or more underlying causes for bad performance of the system under test, but is not able to quantify the impact of the linear frequency response distortion in relation to the other types of distortion with regard to the overall perceived quality.
Furthermore, currently available methods and processing systems for determining the perceived quality of an audio transmission system, including P.862, give inadequate results, since the perceived linear frequency distortion is not treated properly in those systems.
The above methods utilise frequency compensation of an input power density function, derived from the input signal, in order to account for the fact that linear frequency response distortions have less impact on the perceived speech quality than non-linear distortions.
The known methods of frequency compensation fail because they use either a hard clipping function or a modified clipping function, neither of which allows the impact of linear frequency response distortions on the perceived speech quality to be quantified in a perceptually correct manner.
Summary of the invention
It is an object of the invention to provide a method and a system for frequency compensation of input pitch power density functions that allows the impact of linear frequency response distortions on the perceived speech quality to be quantified in a perceptually correct manner.
The object of the invention can be achieved in a first aspect of the invention, by a method for frequency compensating an input pitch power density function of an audio transmission system having an input and an output, and to which input a time framed input signal is applied and the output of which yields a time framed output signal, wherein the method may comprise the steps of:
- processing the input signal, to obtain an input pitch power density function;
- processing the output signal, to obtain an output pitch power density function;
- frequency compensating the input pitch power density function, with a first frequency compensating function, to obtain a first frequency compensated input pitch power density function;
- the step of frequency compensating the input pitch power density function comprising a softscaling function using a power compression exponent in the range of 0.5 and an offset in the range of 4*10^5. Pitch power density functions and soft-scaling per se are known from the prior art. With this compression function the overall impact of linear frequency response distortions can be quantified to obtain a global score for the overall quality that includes the correct quantification of the linear frequency response distortions. This single quality number may be calculated, for example, in the same manner as carried out in P.862 [3], i.e. for each time frame two different disturbances are calculated from a frequency integration of the loudness difference function. The final quality number is then derived from two different time integrations. The improvement provides a better correlation between objective speech quality measurements and subjective speech quality assessments, especially for speech transmission systems where linear frequency response distortions dominate the overall speech quality (e.g. systems that only carry out a bandwidth limitation).
Based on this improved method according to the invention, embodiments can provide for a method or system for determining the perceived quality of an audio transmission system which, like P.862, gives accurate results with respect to linear frequency distortion, and for a method or system that allows a single output value to be obtained that is representative of the perceived distortion, including linear frequency distortions.
In another embodiment according to the first aspect of the invention, the method further comprises the steps of:
- compensating the output pitch power density function for short term gain variations, to obtain a locally scaled output pitch power density function;
- transforming the frequency compensated output pitch power density function to a loudness perception scale to obtain an output loudness density function;
- frequency compensating the input pitch power density function, with a second frequency compensating function, yielding a second frequency compensated input pitch power density function based on a softscaling power function with a power in the range of 0.4 and an offset in the range of 5*10^6;
- transforming the second frequency compensated input pitch power density function to a loudness perception scale to obtain an input loudness density function;
- averaging over the time frames of the framed input loudness density function to obtain an averaged input loudness spectrum; averaging over the time frames of the framed output loudness density function to obtain an averaged output loudness spectrum;
- normalizing the averaged output loudness spectrum with respect to the averaged input loudness spectrum, thereby obtaining a normalized averaged output loudness spectrum;
- subtracting the input loudness spectrum from the normalized output loudness spectrum, thereby obtaining a difference averaged loudness spectrum;
- Lebesgue integrating the difference averaged loudness spectrum function, thereby establishing a linear spectral distortion measure for the audio transmission system.
This allows for the establishment of a single measure indicating the perceived linear frequency distortion.
The method as such obtains a single quality measure for the linear frequency distortion, based upon the difference in the loudness spectrum. This measure, however, still requires mapping to a perceptual quality measure, which is achieved in the following embodiment according to the first aspect of the invention, further comprising the steps of:
- establishing a roughness measure of the difference averaged loudness spectrum based on the absolute difference of consecutive frequency bin values;
- combining the roughness measure and the linear spectral distortion measure by multiplication and mapping the result to a MOS scale, thereby obtaining a frequency response impact quality measure.
Another embodiment according to the first aspect of the invention, wherein the step of processing the time framed input signal further comprises frequency compensating an input pitch power density function with respect to an ideal spectrum, has the advantage that it compensates errors in the recording technique, which often lead to unbalanced spectral power densities, in most cases an over-emphasis of the lower frequencies (below 500 Hz). This step is applied to the input pitch power densities as obtained by Hanning windowing, FFT and frequency warping of the input signal according to reference [1].
In another embodiment according to the first aspect of the invention, the first frequency compensation function is expressed in terms of Bark bin values and is derived from averaging over at least two neighboring Bark bin values of the input and output pitch power density functions.
In another embodiment according to the first aspect of the invention, the second frequency compensation function is also expressed in terms of Bark bin values and is derived from averaging over at least two neighboring Bark bin values of the input and output pitch power density functions.
The averaging in the frequency compensation function calculation smoothes local peaks in the frequency compensation function which are less audible than would have been predicted from a direct calculation, without the smoothing.
Another embodiment according to the first aspect of the invention, further comprising the steps of
- detecting time frames for which simultaneously the input and output pitch power density functions per frame are larger than a silence criterion value; gating the input loudness density function frames and output loudness density function frames under the control of this detection prevents instability in the results due to values in the range of 0, especially in frames where division by zero may occur. A further embodiment according to the first aspect of the invention, wherein the step of establishing a linear spectral distortion measure further comprises
- splitting the difference averaged loudness spectrum in a positive difference averaged loudness spectrum and a negative difference averaged loudness spectrum;
- integrating in the frequency domain positive values of the difference averaged loudness spectrum, thereby establishing a positive linear spectral distortion measure; - integrating in the frequency domain negative values of the difference averaged loudness spectrum, thereby establishing a negative linear spectral distortion measure;
- multiplying the roughness measure and the positive linear spectral distortion measure and mapping the result to a MOS (Mean Opinion Score) scale, thereby obtaining a positive frequency response distortion quality measure;
- combining the roughness measure and the negative linear spectral distortion measure by multiplication and mapping the result to a MOS (Mean Opinion Score) scale, thereby obtaining a negative frequency response distortion quality measure,
- weighing the positive frequency response distortion quality measure with a first weight factor greater than 0;
- weighing the negative frequency response distortion quality measure with a second weight factor greater than 0; - adding the weighed negative frequency response impact quality measure and the positive frequency response impact quality measure, thereby obtaining a single frequency response impact measure, allows for fine-tuning and optimizing the method for frequency bands that appear louder in the output signal and for frequency bands that appear softer in the output signal. Which frequency bands fall into the positive or negative parts of the averaged difference loudness spectrum can be controlled in the step of frequency compensating the input pitch power density function by adjusting the compensation factors in that step. These compensation factors are adjusted for optimal correlation between the outcome of the frequency response impact measure and subjective perception.
Another embodiment according to first aspect of the invention, further comprising the steps of
- compensating the output pitch power density function for short term gain variations, to obtain a locally scaled output pitch power density function ;
- transforming the first frequency compensated input pitch power density function to a loudness perception scale to obtain an input loudness density function; - transforming the frequency compensated output pitch power density function to a loudness perception scale to obtain an output loudness density function; - subtracting output and input loudness density functions to obtain a difference loudness density function; - asymmetrical processing between positive and negative bins in the difference loudness density function, thereby obtaining an asymmetrical difference loudness density function;
- frequency integrating of the asymmetrical difference loudness density function , and emphasizing silent parts thereby obtaining an asymmetrical disturbance measure;
- frequency integrating of the difference loudness density function, and emphasizing silent parts, thereby obtaining a symmetrical disturbance measure; time integrating the difference loudness density function and the asymmetrical difference loudness density function, thus obtaining a symmetrical and an asymmetrical disturbance measure, and then combining the symmetrical and asymmetrical disturbance measures, thereby obtaining a single perceptive quality estimate, allows for determining the perceived quality of an audio transmission system, which gives accurate results with respect to linear frequency distortion.
The object of the invention is further achieved in a second aspect according to the invention by a processing system for measuring the transmission quality of an audio transmission system, comprising: - a processor,
- means for inputting a framed input signal and a framed output signal,
- whereby the processor is arranged for executing the steps of the method according to the first aspect of the invention.
The object of the invention is further achieved in a third aspect according to the invention by a software program storage means comprising computer executable software code, which when loaded on a computer system, enables the computer system to execute the steps of the method according to the first aspect of the invention.
Brief description of the drawings
Fig 1 shows a general diagram of a method for determining the perceived quality of an audio transmission system according to the state of the art. Fig 2 shows a diagram representing a method for determining the perceived quality of an audio transmission system according to ITU-T recommendation P.862 according to the state of the art.
Fig 3 shows a diagram representing a method for determining the perceived quality of an audio transmission system according to a preferred embodiment of the invention. Fig 4 shows an improvement according to a first embodiment of the invention Fig 5 shows a further improvement according to a second embodiment of the invention.
Detailed description
Fig 2 discloses a diagram representing a method for determining the perceived quality of an audio transmission system according to ITU-T recommendation P.862, reference [6], which is hereby included by reference. This diagram is explained briefly since steps 2.1 .. 2.12 are also used in the system and method according to the invention. Step 1 represents the conversion of an input signal Xn to an output signal Yn by a system or a device under test 1, whereby the in- and output signals are represented by discrete time frames 1 .. n, wherein Xn represents a reference signal and Yn represents the distorted response of the system under test 1 on Xn. The frames may be 32 ms in duration, according to current PESQ embodiments. For the invention the frame duration may either be less than 32 ms or much longer. Durations covering a complete speech fragment, in the order of minutes, may also be feasible.
The device or system under test may be a telecom network, a telecom terminal, e.g. a telephone, or any device or system for processing audio. The input signal may be a speech fragment, but application of the embodiments of the invention is not limited to speech.
In order to establish a perceived quality measure for the output signal Yn with respect to the input signal Xn some preprocessing is necessary. According to the state of the art this is performed by the steps 2.1 ... 2.6. Steps 2.1 and 2.4 represent the time windowing of the input signal Xn frames and output signal Yn frames respectively, using a Hanning window. Steps 2.2 and 2.5 represent the discrete Fourier transforming, frame by frame, of the input and output signals respectively. Steps 2.3 and 2.6 represent the warping of the Fourier transformed in- and output signal into so-called Bark bands, thus obtaining the pitch power density functions in discrete frequency bands for the input signal and for the output signal, PPX(f)n and PPY(f)n respectively. Step 2.7 represents calculating a linear frequency compensation, which is used to weigh in step 2.8 the input pitch power density function PPX(f)n to obtain a frequency compensated input pitch power density function PPX'(f)n. The input pitch power density function PPX(f)n is to be frequency compensated for the filtering that takes place in the audio transmission system under test 1. In P.862, the amount of compensation determines the contribution of linear frequency distortion in the ultimate PESQ value.
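As an illustration of steps 2.1 - 2.6, the following is a minimal Python/NumPy sketch. It assumes the frames are already extracted and time aligned; the hz_to_bark() approximation, the number of Bark bands n_bark and the equal-width band edges are stand-ins for the exact warping tabulated in P.862, not values taken from this text.

```python
import numpy as np

def hz_to_bark(f_hz):
    # Traunmueller-style Hz-to-Bark approximation; P.862 defines its own
    # tabulated warping, so this mapping is only a stand-in.
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def pitch_power_density(frames, fs, n_bark=42):
    """Per-frame power in Bark-spaced bands (an analogue of PPX(f)_n / PPY(f)_n).

    frames: 2-D array, shape (n_frames, frame_len), already time aligned.
    Returns an array of shape (n_frames, n_bark).
    """
    n_frames, frame_len = frames.shape
    window = np.hanning(frame_len)                      # steps 2.1 / 2.4
    spectrum = np.fft.rfft(frames * window, axis=1)     # steps 2.2 / 2.5
    power = np.abs(spectrum) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    bark = hz_to_bark(freqs)
    edges = np.linspace(bark[0], bark[-1], n_bark + 1)  # equal-width Bark bands
    pp = np.zeros((n_frames, n_bark))
    for b in range(n_bark):                             # steps 2.3 / 2.6: warp to Bark bands
        mask = (bark >= edges[b]) & (bark < edges[b + 1])
        if mask.any():
            pp[:, b] = power[:, mask].sum(axis=1)
    return pp
```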
The frequency compensation as disclosed in the state of the art, i.e. P.862, uses an estimation of the linear frequency response of the system under test based on all frames for which the input reference signal is larger than a silence criterion value (speech active frames, PPX(f)n > 10^7, frames louder than about 70 dB SPL for P.862 when used with playback levels that are correctly set). The frequency response compensation in P.862 is carried out on the input pitch power density function PPX(f)n per frame. All power density functions and offsets in this description are scaled towards the ITU-T P.862 standard for power functions.
In 2.7 a frequency response compensation function H(f) is calculated by averaging PPX(f)n and PPY(f)n, the outputs of 2.3 and 2.6 respectively, over time index n (plain power averaging), resulting in averaged pitch power density functions APPX and APPY (used in 2.7), from which a first frequency compensated function PPX'(f)n at the output of 2.8 is calculated by multiplication. The aim is to fully compensate for small, inaudible frequency response distortions, i.e. all deviations less than a prefixed amount of decibels are fully compensated. Step 2.9 represents calculating a local scaling function for compensating the output pitch power density function for short-term gain variations, whereby the last local scaling function Sn-1 is stored in 2.10 for use in the next frame. The compensation is effected by multiplying in 2.11 the local scaling function Sn with the output pitch power density function PPY(f)n, resulting in a locally scaled output pitch power density function PPY(f)n. The input and output pitch power density functions PPX'(f)n and PPY(f)n are transformed to a loudness scale in steps 2.12 and 2.13 in accordance with the Sone loudness scale using Zwicker's algorithm, resulting in input and output loudness density functions LX(f)n and LY(f)n respectively. The input and output loudness density functions LX(f)n and LY(f)n are thus representations of the loudness of the input and output signals in a perceptual frequency domain. In step 2.14 the input and output loudness density functions LX(f)n and LY(f)n are subtracted, resulting in a difference loudness density function D(f)n from which a perceived quality measure can be derived. After asymmetrical processing in 2.15 between positive and negative bins in the difference loudness function D(f)n, frequency integration in 2.16 and emphasizing silent parts in 2.17, the difference loudness density function D(f)n is transformed into an asymmetric disturbance measure DA, which can be used as a perceived quality measure. The same applies for steps 2.18 and 2.19, where the difference loudness density function D(f)n is transformed into a disturbance measure D, by frequency integration and emphasizing silent parts respectively, but without asymmetry.
Then after aggregation over time frames in step 2.20 the disturbance measure D and the asymmetrical disturbance measure DA are combined to a single PESQ score denoting a perceptive quality estimate for the audio transmission system 1.
All steps 2.1 .. 2.20 are described in more detail in [6] which is included by reference herein.
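A minimal sketch of the kind of Zwicker-style loudness transform used in steps 2.12 and 2.13 is given below; the exponent gamma, the scaling factor sl and the per-band threshold p0 are illustrative assumptions and are not values specified by this text (P.862 defines its own constants).

```python
import numpy as np

def loudness_density(pp, p0, sl=1.0, gamma=0.23):
    """Zwicker-style power-law transform of a pitch power density (steps 2.12/2.13).

    pp: pitch power density per frame and Bark band, shape (n_frames, n_bark).
    p0: assumed absolute hearing threshold per Bark band, shape (n_bark,).
    sl and gamma are illustrative assumptions, not values from this text.
    """
    pp = np.asarray(pp, dtype=float)
    loud = sl * (p0 / 0.5) ** gamma * ((0.5 + 0.5 * pp / p0) ** gamma - 1.0)
    return np.maximum(loud, 0.0)   # clip to zero below the assumed threshold
```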
Fig 3 discloses a method for measuring the transmission quality of an audio transmission system according to the invention, wherein the known steps 2.1 .. 2.11 are used to establish a frequency compensated input pitch power density function PPX'(f)n and wherein step 2.13 is used to establish a loudness density function LY(f)n.
According to the invention, in step 2.7 a new first frequency compensating function H1(f) is calculated. H1(f) is a power-based softscaling function with offset [6], using the time-averaged input and output pitch power density functions APPX(f) and APPY(f):

H1(f) = ((APPY(f) + OFFSET) / (APPX(f) + OFFSET))^q(f)

with q(f) in the range of 0.0 - 1.0 (can be frequency dependent), wherein OFFSET is in the range of 10^4 - 10^6.
The smaller q and the higher the OFFSET, the smaller the amount of frequency compensation that is applied. The parameters q and OFFSET in this step 2.7 are to be tuned for optimum results. Preferably q(f) is in the range of 0.5 and OFFSET is in the range of 4*10^5. Like in P.862, a first frequency compensated input pitch power density function PPX'(f)n is calculated in 2.8 by multiplying the input pitch power density function PPX(f)n with the first frequency compensating function H1(f).
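As an illustration of the softscaled compensation of steps 2.7 and 2.8, the sketch below computes H1(f) from the time-averaged densities and applies it to the per-frame input density. The parameter defaults follow the preferred values above; the interpretation of the speech-activity gate (per-frame power summed over Bark bands) is an assumption.

```python
import numpy as np

def softscaled_compensation(ppx, ppy, q=0.5, offset=4e5, silence=1e7):
    """Softscaled frequency compensation of the input density (steps 2.7/2.8).

    ppx, ppy: input/output pitch power densities, shape (n_frames, n_bark).
    q may be a scalar or a per-band array (frequency dependent exponent).
    The speech-activity gate compares the per-frame power summed over bands
    against the threshold, which is one possible reading of the criterion.
    Returns (h, ppx_comp): the compensation function H(f) and PPX'(f)_n.
    """
    active = ppx.sum(axis=1) > silence        # speech active frames only
    appx = ppx[active].mean(axis=0)           # time-averaged input density APPX(f)
    appy = ppy[active].mean(axis=0)           # time-averaged output density APPY(f)
    h = ((appy + offset) / (appx + offset)) ** q
    return h, ppx * h                         # PPX'(f)_n = H1(f) * PPX(f)_n
```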
In step 3.10 a second frequency compensation function H2(f) is calculated, similarly to step 2.7, over the same set of speech active frames, using a power based softscaling function with offset, but now with a higher offset:

H2(f) = ((APPY(f) + OFFSETLARGE) / (APPX(f) + OFFSETLARGE))^q(f)

wherein q(f) is in the range of 0.0 - 1.0 (can be frequency dependent), and OFFSETLARGE is in the range of 10^5 - 10^8.
Preferably q(f) is in the range of 0.4 and OFFSETLARGE is in the range of 5*10^6. The secondary frequency compensation function H2(f) is used to multiply in step 3.11 the input pitch power density function PPX(f)n, resulting in a secondary compensated pitch power density function PPX''(f)n.
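Assuming the softscaled_compensation() sketch given above, the second compensation differs only in its parameters and can be obtained with the same routine; ppx and ppy stand for the per-frame pitch power density arrays.

```python
# Hypothetical reuse of the sketch above; preferred parameter values from the text.
h1, ppx_prime = softscaled_compensation(ppx, ppy, q=0.5, offset=4e5)    # steps 2.7 / 2.8
h2, ppx_second = softscaled_compensation(ppx, ppy, q=0.4, offset=5e6)   # steps 3.10 / 3.11
```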
In an embodiment according to the invention, the first and second frequency compensation functions H1(f) and H2(f) are not directly calculated from the APPX(f) and APPY(f) functions, but from a smoothed version of these functions. The smoothing is carried out by averaging over the Bark bin values (f), f = 0, ..., fMAX as specified in P.862 [3], where f = 0 and f = fMAX represent the first and last bin values. For the first and last bins the averaging is carried out over bins 0, 1 and fMAX, fMAX-1 respectively. For the second and second last bins (1 and fMAX-1) the averaging is carried out over bins 0, 1, 2 and fMAX, fMAX-1, fMAX-2 respectively. Next this averaging is repeated up to a lower index of 10 and down to a higher index of fMAX-4. Between the indices 10 and fMAX-4 the averaging is carried out over five bins, from two to the left up to two to the right of the index value.
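This edge-aware averaging can be sketched as a simple moving average whose window is truncated at the spectrum edges; the exact per-index window sizes described above are not reproduced, so the sketch is only an approximation of the smoothing.

```python
import numpy as np

def smooth_bark_bins(values, half_width=2):
    """Moving average over up to 2*half_width + 1 Bark bins, truncated at the edges.

    A generic stand-in for the edge-aware averaging described in the text;
    the exact window sizes per index are not reproduced here.
    """
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    n = len(values)
    for f in range(n):
        lo = max(0, f - half_width)
        hi = min(n, f + half_width + 1)
        out[f] = values[lo:hi].mean()
    return out

# The compensation functions would then be built from the smoothed densities, e.g.:
# h1 = ((smooth_bark_bins(appy) + offset) / (smooth_bark_bins(appx) + offset)) ** q
```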
In step 3.14, similar to step 2.12, the secondary compensated pitch power density function PPX''(f)n is transformed to an input loudness density function LX'(f)n, containing less linear frequency response distortion compensation than used within the loudness calculation according to the invention.
The parameters q(f) and OFFSETLARGE in steps 3.10 and 3.11 are to be tuned for optimum results in a linear frequency distortion quality measure.
The new input loudness density function LX'(f)n and the P.862-like output loudness density function LY(f)n are then used to calculate the averaged loudness density functions ALSX(f) and ALSY(f) by averaging in steps 3.4 and 3.5 the spectral loudness density functions LX'(f)n and LY(f)n. The averaging in time is according to Lebesgue:

ALSX(f) = ((1/N) * Σ_n LX'(f)_n^p)^(1/p) and ALSY(f) = ((1/N) * Σ_n LY(f)_n^p)^(1/p)

with p > 1, preferably p = 2.5, and N the number of time frames over which the average is taken.
Optionally, this averaging is performed only over the time frames for which both the input and output power per frame are larger than a silence criterion value, preferably PPX(f)n and PPY(f)n > 10^7, determined in step 3.1 and effected in steps 3.2 and 3.3.
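A sketch of the gated Lp time averaging of steps 3.1 - 3.5 is given below; the silence gate again compares the per-frame power summed over bands against the threshold, and the 1/p root follows the usual Lp mean, both of which are assumptions here.

```python
import numpy as np

def averaged_loudness(loudness, ppx, ppy, p=2.5, silence=1e7):
    """Lp (Lebesgue) time average of a loudness density (steps 3.1 - 3.5).

    loudness: loudness density per frame and Bark band, shape (n_frames, n_bark).
    ppx, ppy: input/output pitch power densities used for the silence gate.
    Returns the averaged loudness spectrum ALS(f), shape (n_bark,).
    """
    active = (ppx.sum(axis=1) > silence) & (ppy.sum(axis=1) > silence)  # steps 3.1 - 3.3
    gated = loudness[active]
    return np.mean(gated ** p, axis=0) ** (1.0 / p)                     # steps 3.4 / 3.5
```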
These averaged input and output loudness density functions, representing the loudness as a function of frequency, are then power integrated in step 3.6 over the frequency axis (Lp with p = 1), resulting in a single loudness number NX for the (idealized) reference and a loudness number NY for the adjusted distorted signal, according to:

NX = Σ_f ALSX(f) and NY = Σ_f ALSY(f).
These single loudness numbers NX, NY are then used to normalize the averaged loudness density function ALSY(f) in step 3.7 in such a way that the average of the averaged output loudness density function ALSY(f) in the frequency domain is the same for both the (idealized) input and adjusted output signal, resulting in a normalized averaged loudness density function NALSY(f).
In step 3.8 a difference averaged loudness function DALS(f) is defined between the averaged loudness densities ALSX(f) and NALSY(f). In step 3.9, this difference averaged loudness function is then integrated over the frequency axis, again according to Lebesgue, but now over the individual frequency band differences using a p < 1.0 (p preferably in the range of 0.2 to 0.4) for the loudness in each Bark frequency band. The result is a loudness frequency response distortion measure LSDM according to:

LSDM = Σ_f DALS(f)^p

wherein f denotes a frequency band in the difference averaged loudness spectrum.
A special roughness measure RM can be calculated in step 3.12 by taking the absolute value of the difference between consecutive loudness bins of the loudness difference function DALS(f) and summing over all consecutive bins:

RM = Σ_f |DALS(f)^p - DALS(f-1)^p|

(f being the band index number, with p in the range of 0.5 - 2.0, and preferably p in the range of 1.5).
The roughness number RM can be combined in step 3.13 with the loudness frequency response distortion measure LSDM by means of multiplication, the result of which is mapped to a Mean Opinion Score table, resulting in a single frequency response impact quality measure FRIQM.
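The remaining steps 3.6 - 3.13 can be sketched as below. The normalization in step 3.7 is implemented as a simple ratio of the integrated loudness numbers, absolute values are taken before the fractional powers to keep them real, and map_to_mos() is a placeholder for a mapping that would in practice be fitted against subjective scores; all three are assumptions of this sketch.

```python
import numpy as np

def map_to_mos(raw, scale=1.0):
    # Placeholder monotone mapping onto a MOS-like 1..5 scale (illustrative only).
    return 1.0 + 4.0 / (1.0 + scale * raw)

def frequency_response_impact(alsx, alsy, p_lsdm=0.3, p_rm=1.5):
    """Single frequency response impact quality measure FRIQM (steps 3.6 - 3.13).

    alsx, alsy: averaged input/output loudness spectra, shape (n_bark,).
    """
    nx, ny = alsx.sum(), alsy.sum()               # step 3.6: integrate over frequency
    nalsy = alsy * (nx / ny)                      # step 3.7: normalize the output spectrum
    dals = alsx - nalsy                           # step 3.8: difference averaged loudness
    lsdm = np.sum(np.abs(dals) ** p_lsdm)         # step 3.9: Lp with p < 1
    d = np.abs(dals) ** p_rm
    rm = np.sum(np.abs(np.diff(d)))               # step 3.12: roughness of the spectrum
    return map_to_mos(lsdm * rm)                  # step 3.13: combine and map to MOS
```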
Figure 4 shows an embodiment according to the invention wherein, in step 4.1, the difference function DALS(f) is split into a positive part (input > output) and a negative part DALS+(f) and DALS-(f). In steps 4.2 and 4.3, both parts, DALS(f)+ and DALS(f)- respectively, are then integrated according to Lebesque over the frequency axis using again the Lp norm but now over the individual frequency band differences using a p<1.0 with 0.1<p<0.5 for the loudness in each Bark frequency band. This results in a positive and negative frequency response distortion number LSDM+ and LSDM- .
In steps 4.4 and 4.5, the two linear frequency domain impact numbers FRIQM+ and FRIQM- are calculated from the positive and negative frequency response distortion numbers LSDM+ and LSDM-, by multiplying with the roughness number RM. These frequency response distortion numbers are then mapped in step 4.6 to a MOS (Mean Opinion Score) like scale for quantifying the impact of the linear frequency response distortion, yielding the two linear frequency domain impact numbers FRIQM+ and FRIQM- respectively. FRIQM+ and FRIQM- are weighed to obtain the single frequency response impact quality measure FRIQM:

FRIQM = α * FRIQM+ + β * FRIQM-

wherein preferably α + β = 1, and wherein the ratio between α and β is preferably more than 10. In a current implementation β = 0, so only LSDM+ values are taken into account.
The LSDM+ and LSDM- can of course also be combined in a fashion similar to the frequency response impact quality measures FRIQM+ and FRIQM-, after which a mapping to an MOS can occur to yield a single frequency response impact quality measure FRIQM. Furthermore the multiplication with the roughness measure can also be performed on LSDM alone in this embodiment.
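The split of Fig 4 then only changes how DALS(f) is integrated. The sketch below reuses the placeholder map_to_mos() from the previous sketch, and the default weights reproduce the β = 0 case mentioned above.

```python
import numpy as np

def split_frequency_response_impact(dals, rm, p=0.3, alpha=1.0, beta=0.0):
    """Positive/negative split of the difference loudness spectrum (steps 4.1 - 4.6).

    dals: difference averaged loudness spectrum DALS(f), shape (n_bark,).
    rm:   roughness measure as computed in step 3.12.
    """
    dals_pos = np.clip(dals, 0.0, None)           # step 4.1: input louder than output
    dals_neg = np.clip(-dals, 0.0, None)          # step 4.1: output louder than input
    lsdm_pos = np.sum(dals_pos ** p)              # step 4.2
    lsdm_neg = np.sum(dals_neg ** p)              # step 4.3
    friqm_pos = map_to_mos(lsdm_pos * rm)         # steps 4.4 / 4.6
    friqm_neg = map_to_mos(lsdm_neg * rm)         # steps 4.5 / 4.6
    return alpha * friqm_pos + beta * friqm_neg   # weighted combination FRIQM
```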
According to a preferred embodiment of the invention as shown in figure 5, the input pitch power density function is frequency compensated, using Lebesgue, on the basis of an Lp scaling with 0.3 < p < 0.6 towards an ideal spectral power density Ideal(f) of a speech signal. The input pitch power density function is calculated from the input reference speech signal by calculating the average power in each Bark frequency band over the complete speech fragment for which the quality of the distorted signal has to be calculated. The ideal spectral power density function Ideal(f) is defined on the basis of averaging the long-term average spectral power density of many male and female voices recorded with a flat frequency response microphone. In each Bark band as used in PESQ a density number is constructed on the basis of this ideal density function.
This partial scaling towards the ideal spectral power density function Ideal(f) compensates for errors in the recording technique. Recording techniques often lead to unbalanced spectral power densities, in most cases an over-emphasis of the lower frequencies (below 500 Hz). From the ideal and input spectra, smoothed versions of the ideal spectral power density function Ideal(f) and the input pitch power density function PPX(f)n are calculated in step 5.1 by averaging over a number of consecutive frequency bands. From these smoothed versions, compensation factors S(f) can be calculated for each Bark band, defined as the ratio of the powers "ideal/reference". These factors S(f) are then used in step 5.2 to rescale the input pitch power density function PPX(f)n with S(f)^p, with 0.3 < p < 0.8, to obtain an (idealized) input pitch power density function PPXI(f)n which can be used for further evaluation according to the invention instead of the input pitch power density function PPX(f)n.
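This compensation step can be sketched as follows (illustrative Python/NumPy only); the three-band moving average used for the smoothing, the default exponent p = 0.5 and the small floor value guarding the division are assumptions, as the text does not fix these details.

import numpy as np

def idealize_input_ppd(ppx_frames, ideal, p=0.5, smooth_bands=3):
    """Partial scaling of the input pitch power density towards Ideal(f) (sketch).

    ppx_frames   : array of shape (frames, Bark bands) with PPX(f)n.
    ideal        : ideal spectral power density Ideal(f) per Bark band.
    p            : partial-scaling exponent (the text uses values below 1).
    smooth_bands : width of the moving average over consecutive bands (assumed).
    """
    ppx_frames = np.asarray(ppx_frames, dtype=float)
    ideal = np.asarray(ideal, dtype=float)
    # Average input power per Bark band over the complete speech fragment.
    ppx_avg = ppx_frames.mean(axis=0)
    kernel = np.ones(smooth_bands) / smooth_bands
    ppx_smooth = np.convolve(ppx_avg, kernel, mode="same")   # smoothed input spectrum
    ideal_smooth = np.convolve(ideal, kernel, mode="same")   # smoothed ideal spectrum
    # Compensation factor S(f) = ideal / reference, per Bark band.
    s = ideal_smooth / np.maximum(ppx_smooth, 1e-12)
    # Rescale each frame with S(f)^p to obtain PPXI(f)n.
    return ppx_frames * s ** p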
Note that the invention can be combined with local time scaling by iteratively adjusting the frequency compensation and the local time scaling according to [7].
The invention can be embodied in a computer system comprising a processor, memory, an input and an output. The input may be a reading device such as an analog input capable of sampling a reference input signal and a degraded output signal coming from an audio transmission system under test. The sampled signals can be stored in a memory, for example a fixed disk, and put into frames by selecting rows of samples. The processor can then perform the steps described above. A result, for example the linear frequency impact quality measure, can be output to a display or a communication port, or stored in the memory for future reference.
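As an illustrative sketch of the framing step only, assuming a plain array of samples and caller-chosen frame length and hop size (neither value is prescribed here):

import numpy as np

def frame_signal(samples, frame_len, hop=None):
    """Split a sampled signal into rows of samples (time frames); sketch only."""
    samples = np.asarray(samples, dtype=float)
    if len(samples) < frame_len:
        raise ValueError("signal shorter than one frame")
    hop = hop if hop is not None else frame_len      # non-overlapping frames by default
    n_frames = 1 + (len(samples) - frame_len) // hop
    return np.stack([samples[i * hop:i * hop + frame_len] for i in range(n_frames)])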
Literature

[1] A. W. Rix, M. P. Hollier, A. P. Hekstra and J. G. Beerends, "PESQ, the new ITU standard for objective measurement of perceived speech quality, Part I - Time alignment," J. Audio Eng. Soc., vol. 50, pp. 755-764 (2002 Oct.).
[2] J. G. Beerends, A. P. Hekstra, A. W. Rix and M. P. Hollier, "PESQ, the new ITU standard for objective measurement of perceived speech quality, Part II - Perceptual model," J. Audio Eng. Soc., vol. 50, pp. 765-778 (2002 Oct.) (equivalent to KPN Research publication 00-32228).
[3] ITU-T Rec. P.862, "Perceptual Evaluation of Speech Quality (PESQ), An Objective Method for End-to-end Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs," International Telecommunication Union, Geneva, Switzerland (2001 Feb.).
[4] A. P. Hekstra, J. G. Beerends, "Output power decompensation," International patent application PCT EP02/02342; European patent application 01200945.2, March 2001; Koninklijke PTT Nederland N.V.
[5] J. G. Beerends, "Frequency dependent frequency compensation," International patent application PCT EP02/05556; European patent application 01203699.2, June 2001; Koninklijke PTT Nederland N.V.
[6] J. G. Beerends, "Method and system for measuring a system's transmission quality," Softscaling, International patent application PCT EP03/02058; European patent application 02075973.4-2218, April 2002; Koninklijke PTT Nederland N.V.
[7] J. G. Beerends, "Method and system for measuring a system's transmission quality"; European patent application 02075973, July 2003; Koninklijke PTT Nederland N.V.
[8] T. Goldstein, J. G. Beerends, H. Klaus and C. Schmidmer, "Draft ITU-T Recommendation P.AAM, An objective method for end-to-end speech quality assessment of narrow-band telephone networks including acoustic terminals," White contribution COM 12-64 to ITU-T Study Group 12, September 2003.

Claims
1. A method for establishing a first frequency compensated input pitch power density function (PPX'(f)n) of a time framed input signal (Xn), for application to an audio transmission system (1) having an input and an output, and the output of which yields a time framed output signal (Yn), the method comprising the steps of:
- processing (2.1..2.3) the input signal (Xn), to obtain an input pitch power density function (PPX(f)n);
- processing (2.4..2.6) the output signal (Yn), to obtain an output pitch power density function (PPY(f)n);
- frequency compensating (2.7, 2.8) the input pitch power density function (PPX(f)n), with a first frequency compensating function (H1(f)), to obtain the first frequency compensated input pitch power density function (PPX'(f)n);
the method being characterized by
- the step of frequency compensating (2.7, 2.8) the input pitch power density function (PPX(f)n) comprising a power based softscaling function with offset [6], using a power in the range of 0.5 and an offset in the range of 4*10^5.
2. The method of claim 1, further comprising the steps of
- compensating the output pitch power density function (PPY(f)n) for short term gain variations (2.9..2.11), to obtain a locally scaled output pitch power density function (PPY'(f)n);
- transforming (2.13) the frequency compensated output pitch power density function (PPY'(f)n) to a loudness perception scale to obtain an output loudness density function (LY(f)n);
- frequency compensating (3.10, 3.11) the input pitch power density function (PPX(f)n), with a second frequency compensating function (H2(f)), yielding a second frequency compensated input pitch power density function (PPX"(f)n);
- transforming (3.14) the second frequency compensated input pitch power density function (PPX"(f)n) to a loudness perception scale to obtain an input loudness density function (LX'(f)n);
- averaging (3.4) over the time frames of the framed input loudness density function (LX'(f)n) to obtain an averaged input loudness spectrum (ALSX(f));
- averaging (3.5) over the time frames of the framed output loudness density function (LY(f)n) to obtain an averaged output loudness spectrum (ALSY(f));
- normalizing (3.6, 3.7) the averaged output loudness spectrum (ALSY(f)) with respect to the averaged input loudness spectrum (ALSX(f)), thereby obtaining a normalized averaged output loudness spectrum (NALSY(f));
- subtracting (3.8) the input loudness spectrum (ALSX(f)) from the normalized output loudness spectrum (NALSY(f)), thereby obtaining a difference averaged loudness spectrum (DALS(f));
- Lebesgue integrating (3.9) the difference averaged loudness spectrum (DALS(f)), thereby establishing a linear spectral distortion measure (LSDM) for the audio transmission system (1).
3. The method according to claim 2, further comprising the steps of
- establishing (3.12) a roughness measure (RM) of the difference averaged loudness spectrum (DALS(f)) based on the absolute difference of consecutive frequency bin values;
- combining (3.13) the roughness measure (RM) and the linear spectral distortion measure (LSDM) by multiplication and mapping the result to a MOS (Mean Opinion Score) scale (FRIQM).

4. The method according to any of the preceding claims, wherein the step of frequency compensating (3.10, 3.11) the input pitch power density function (PPX(f)n), yielding a second frequency compensated input pitch power density function (PPX"(f)n), is based on a power based softscaling function with offset [6] with a power in the range of 0.4 and an offset in the range of 5*10^6.
5. The method according to any of the preceding claims, wherein the first frequency compensation function (H1(f)) is expressed in terms of Bark bin values and is derived from averaging over at least two neighboring Bark bin values of the input and output pitch power density functions.
6. The method according to any of the preceding claims, wherein the second frequency compensation function (H2(f)) is expressed in terms of Bark bin values and is derived from averaging over at least two neighboring Bark bin values of the input and output pitch power density functions.
7. The method according to any of the preceding claims, wherein the step of processing (2) the time framed input signal (Xn) further comprises the step of
- frequency compensating (5.1, 5.2) the input pitch power density function (PPX(f)n) with respect to an ideal spectrum Ideal(f).
8. The method according to any of the preceding claims, the method further comprising the steps of
- detecting time frames (3.1) for which simultaneously the input and output pitch power density functions (PPX(f)n, PPY(f)n) per frame are larger than a silence criterion value;
- gating (3.2 and 3.3) the input loudness density function frames (LX'(f)n) and output loudness density function frames (LY(f)n) under the control of the detecting of time frames (3.1).
9. The method according to any of the preceding claims, wherein the step of establishing a linear spectral distortion measure (LSDM) further comprises:
- splitting (4.1) the difference averaged loudness spectrum (DALS(f)) into a positive difference averaged loudness spectrum (DALS+(f)) and a negative difference averaged loudness spectrum (DALS-(f));
- integrating (4.2) in the frequency domain positive values of the difference averaged loudness spectrum (DALS+(f)), thereby establishing a positive linear spectral distortion measure (LSDM+);
- integrating (4.3) in the frequency domain negative values of the difference averaged loudness spectrum (DALS-(f)), thereby establishing a negative linear spectral distortion measure (LSDM-);
- multiplying (4.4) the roughness measure (RM) and the positive linear spectral distortion measure (LSDM+) and mapping the result to a MOS (Mean Opinion Score) scale, thereby obtaining a positive frequency response distortion quality measure (FRIQM+);
- combining (4.5) the roughness measure (RM) and the negative linear spectral distortion measure (LSDM-) by multiplication and mapping the result to a MOS (Mean Opinion Score) scale, thereby obtaining a negative frequency response distortion quality measure (FRIQM-);
- weighing (4.6) the positive frequency response distortion quality measure (FRIQM+) with a first weigh factor (α) greater than 0;
- weighing (4.6) the negative frequency response distortion quality measure (FRIQM-) with a second weigh factor (β) greater than 0;
- adding (4.6) the weighed negative frequency response impact quality measure (FRIQM-) and the positive frequency response impact quality measure (FRIQM+), thereby obtaining a single frequency response impact measure (FRIQM).
10. The method according to claim 1, further comprising the steps of
- compensating the output pitch power density function (PPY(f)n) for short term gain variations (2.9..2.11), to obtain a locally scaled output pitch power density function (PPY'(f)n);
- transforming (2.12) the first frequency compensated input pitch power density function (PPX'(f)n) to a loudness perception scale to obtain an input loudness density function (LX(f)n);
- transforming (2.13) the frequency compensated output pitch power density function (PPY'(f)n) to a loudness perception scale to obtain an output loudness density function (LY(f)n);
- subtracting (3.4) output and input loudness density functions (LY(f)n, LX(f)n) to obtain a difference loudness density function (D(f)n);
- asymmetrical processing (2.15) between positive and negative bins in the difference loudness density function (D(f)n), thereby obtaining an asymmetrical difference loudness density function (DA(f)n);
- frequency integrating (2.16) the asymmetrical difference loudness density function (DA(f)n), and emphasizing silent parts (2.17), thereby obtaining an asymmetrical disturbance measure (DAn);
- frequency integrating (2.18) the difference loudness density function (D(f)n), and emphasizing silent parts (2.19), thereby obtaining a symmetrical disturbance measure (Dn);
- time integrating (2.20) the difference loudness density function and the asymmetrical difference loudness density function (Dn, DAn), thus obtaining a symmetrical and an asymmetrical disturbance measure (D, DA), and then combining (2.20) the symmetrical and asymmetrical disturbance measures (D, DA), thereby obtaining a single perceptive quality estimate (PESQ).
11. A processing system for establishing a frequency compensated input pitch power density function of a time framed input signal, which is applied to an audio transmission system having an input and an output, and the output of which yields a time framed output signal, comprising
- a processor
- means for inputting a framed input signal (Xn) and a framed output signal (Yn),
- means for outputting a frequency response impact quality measure (LSDM, FRIQM, PESQ), whereby the processor is arranged for executing the steps of the method according to at least one of the claims 1 - 10.
12. Computer readable medium comprising computer executable software code which, when executed on a computer system, causes the computer system to execute the steps of the method according to at least one of the claims 1 - 10.
PCT/NL2005/000683 2004-09-20 2005-09-20 Frequency compensation for perceptual speech analysis WO2006033570A1 (en)

Priority Applications (8)

Application Number Priority Date Filing Date Title
EP05787331A EP1792304B1 (en) 2004-09-20 2005-09-20 Frequency compensation for perceptual speech analysis
JP2007532270A JP4879180B2 (en) 2004-09-20 2005-09-20 Frequency compensation for perceptual speech analysis
CA2580763A CA2580763C (en) 2004-09-20 2005-09-20 Frequency compensation for perceptual speech analysis
CN2005800377134A CN101053016B (en) 2004-09-20 2005-09-20 Method and system for constructing a first frequency compensation input spacing power density function
DE602005009221T DE602005009221D1 (en) 2004-09-20 2005-09-20 FREQUENCY COMPENSATION FOR PERCEPTIONAL LANGUAGE ANALYSIS
US11/663,138 US8014999B2 (en) 2004-09-20 2005-09-20 Frequency compensation for perceptual speech analysis
AU2005285694A AU2005285694B2 (en) 2004-09-20 2005-09-20 Frequency compensation for perceptual speech analysis
DK05787331T DK1792304T3 (en) 2004-09-20 2005-09-20 Frequency compensation for perceptual speech analysis

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP04077601.5 2004-09-20
EP04077601 2004-09-20

Publications (1)

Publication Number Publication Date
WO2006033570A1 true WO2006033570A1 (en) 2006-03-30

Family

ID=35355107

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/NL2005/000683 WO2006033570A1 (en) 2004-09-20 2005-09-20 Frequency compensation for perceptual speech analysis

Country Status (12)

Country Link
US (1) US8014999B2 (en)
EP (1) EP1792304B1 (en)
JP (1) JP4879180B2 (en)
CN (1) CN101053016B (en)
AT (1) ATE405922T1 (en)
AU (1) AU2005285694B2 (en)
CA (1) CA2580763C (en)
DE (1) DE602005009221D1 (en)
DK (1) DK1792304T3 (en)
ES (1) ES2313413T3 (en)
PT (1) PT1792304E (en)
WO (1) WO2006033570A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1975924A1 (en) * 2007-03-29 2008-10-01 Koninklijke KPN N.V. Method and system for speech quality prediction of the impact of time localized distortions of an audio transmission system
JP2010534030A (en) * 2007-07-13 2010-10-28 ドルビー・ラボラトリーズ・ライセンシング・コーポレーション Acoustic processing using auditory scene analysis and spectral distortion

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ES2403509T3 (en) * 2007-09-11 2013-05-20 Deutsche Telekom Ag Method and system for the integral and diagnostic evaluation of the quality of the listening voice
EP2048657B1 (en) * 2007-10-11 2010-06-09 Koninklijke KPN N.V. Method and system for speech intelligibility measurement of an audio transmission system
JP5157852B2 (en) * 2008-11-28 2013-03-06 富士通株式会社 Audio signal processing evaluation program and audio signal processing evaluation apparatus
WO2011010962A1 (en) * 2009-07-24 2011-01-27 Telefonaktiebolaget L M Ericsson (Publ) Method, computer, computer program and computer program product for speech quality estimation
DK2465112T3 (en) 2009-08-14 2015-01-12 Koninkl Kpn Nv PROCEDURE, COMPUTER PROGRAM PRODUCT, AND SYSTEM FOR DETERMINING AN EVALUATED QUALITY OF AN AUDIO SYSTEM
KR101430321B1 (en) * 2009-08-14 2014-08-13 코닌클리즈케 케이피엔 엔.브이. Method and system for determining a perceived quality of an audio system
JP5606764B2 (en) * 2010-03-31 2014-10-15 クラリオン株式会社 Sound quality evaluation device and program therefor
CN102456348B (en) * 2010-10-25 2015-07-08 松下电器产业株式会社 Method and device for calculating sound compensation parameters as well as sound compensation system
TWI759223B (en) * 2010-12-03 2022-03-21 美商杜比實驗室特許公司 Audio decoding device, audio decoding method, and audio encoding method
EP2733700A1 (en) * 2012-11-16 2014-05-21 Nederlandse Organisatie voor toegepast -natuurwetenschappelijk onderzoek TNO Method of and apparatus for evaluating intelligibility of a degraded speech signal
EP2922058A1 (en) * 2014-03-20 2015-09-23 Nederlandse Organisatie voor toegepast- natuurwetenschappelijk onderzoek TNO Method of and apparatus for evaluating quality of a degraded speech signal
CN104243723B (en) * 2014-09-28 2017-03-29 辽宁省建设科学研究院 Caller interphone system audio signal non-linear distortion detection method
JP6461064B2 (en) * 2016-09-28 2019-01-30 本田技研工業株式会社 Acoustic characteristics calibration method
CN112083807B (en) * 2020-09-20 2021-10-29 吉林大学 Foot terrain touch reproduction method and device based on sound-touch conversion

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1241663A1 (en) 2001-03-13 2002-09-18 Koninklijke KPN N.V. Method and device for determining the quality of speech signal
WO2003076889A1 (en) 2002-03-08 2003-09-18 Koninklijke Kpn N.V. Method and system for measuring a system's transmission quality

Family Cites Families (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB8701365D0 (en) * 1987-01-22 1987-02-25 Thomas L D Signal level control
US5321636A (en) * 1989-03-03 1994-06-14 U.S. Philips Corporation Method and arrangement for determining signal pitch
US5687281A (en) * 1990-10-23 1997-11-11 Koninklijke Ptt Nederland N.V. Bark amplitude component coder for a sampled analog signal and decoder for the coded signal
US5588089A (en) * 1990-10-23 1996-12-24 Koninklijke Ptt Nederland N.V. Bark amplitude component coder for a sampled analog signal and decoder for the coded signal
US5765127A (en) * 1992-03-18 1998-06-09 Sony Corp High efficiency encoding method
JP2953238B2 (en) * 1993-02-09 1999-09-27 日本電気株式会社 Sound quality subjective evaluation prediction method
US5632003A (en) * 1993-07-16 1997-05-20 Dolby Laboratories Licensing Corporation Computationally efficient adaptive bit allocation for coding method and apparatus
NL9500512A (en) * 1995-03-15 1996-10-01 Nederland Ptt Apparatus for determining the quality of an output signal to be generated by a signal processing circuit, and a method for determining the quality of an output signal to be generated by a signal processing circuit.
WO1997027578A1 (en) * 1996-01-26 1997-07-31 Motorola Inc. Very low bit rate time domain speech analyzer for voice messaging
ATE205009T1 (en) * 1996-05-21 2001-09-15 Koninkl Kpn Nv APPARATUS AND METHOD FOR DETERMINING THE QUALITY OF AN OUTPUT SIGNAL TO BE GENERATED BY A SIGNAL PROCESSING CIRCUIT
JP2000507788A (en) * 1996-12-13 2000-06-20 コニンクリジケ ケーピーエヌ エヌブィー Apparatus and method for signal characterization
DE19840548C2 (en) * 1998-08-27 2001-02-15 Deutsche Telekom Ag Procedures for instrumental language quality determination
JP3756686B2 (en) * 1999-01-19 2006-03-15 日本放送協会 Method and apparatus for obtaining evaluation value for evaluating degree of desired signal extraction, and parameter control method and apparatus for signal extraction apparatus
WO2001065543A1 (en) * 2000-02-29 2001-09-07 Telefonaktiebolaget Lm Ericsson (Publ) Compensation for linear filtering using frequency weighting factors
DE60029453T2 (en) * 2000-11-09 2007-04-12 Koninklijke Kpn N.V. Measuring the transmission quality of a telephone connection in a telecommunications network
DE10134471C2 (en) * 2001-02-28 2003-05-22 Fraunhofer Ges Forschung Method and device for characterizing a signal and method and device for generating an indexed signal
EP1298646B1 (en) * 2001-10-01 2006-01-11 Koninklijke KPN N.V. Improved method for determining the quality of a speech signal
US7146313B2 (en) * 2001-12-14 2006-12-05 Microsoft Corporation Techniques for measurement of perceptual audio quality
US7457757B1 (en) * 2002-05-30 2008-11-25 Plantronics, Inc. Intelligibility control for speech communications systems
US7308403B2 (en) * 2002-07-01 2007-12-11 Lucent Technologies Inc. Compensation for utterance dependent articulation for speech quality assessment
ATE333694T1 (en) * 2003-01-18 2006-08-15 Psytechnics Ltd TOOL FOR NON-INVASIVELY DETERMINING THE QUALITY OF A VOICE SIGNAL
EP1465156A1 (en) * 2003-03-31 2004-10-06 Koninklijke KPN N.V. Method and system for determining the quality of a speech signal
US7526093B2 (en) * 2003-08-04 2009-04-28 Harman International Industries, Incorporated System for configuring audio system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1241663A1 (en) 2001-03-13 2002-09-18 Koninklijke KPN N.V. Method and device for determining the quality of speech signal
WO2003076889A1 (en) 2002-03-08 2003-09-18 Koninklijke Kpn N.V. Method and system for measuring a system's transmission quality

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ITU-T RECOMMENDATION P 862: "Perceptual evaluation of speech quality (PESQ): An objective assessment of narrow-band telephone networks and speech codecs", ITU-T RECOMMENDATION P.862, XX, XX, 23 February 2001 (2001-02-23), pages 1 - 21, XP002327961 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1975924A1 (en) * 2007-03-29 2008-10-01 Koninklijke KPN N.V. Method and system for speech quality prediction of the impact of time localized distortions of an audio transmission system
WO2008119510A2 (en) * 2007-03-29 2008-10-09 Koninklijke Kpn N.V. Method and system for speech quality prediction of the impact of time localized distortions of an audio trasmission system
WO2008119510A3 (en) * 2007-03-29 2008-12-31 Koninkl Kpn Nv Method and system for speech quality prediction of the impact of time localized distortions of an audio trasmission system
JP2010534030A (en) * 2007-07-13 2010-10-28 ドルビー・ラボラトリーズ・ライセンシング・コーポレーション Acoustic processing using auditory scene analysis and spectral distortion

Also Published As

Publication number Publication date
CN101053016B (en) 2011-05-18
US8014999B2 (en) 2011-09-06
CN101053016A (en) 2007-10-10
JP2008513834A (en) 2008-05-01
JP4879180B2 (en) 2012-02-22
ATE405922T1 (en) 2008-09-15
US20080040102A1 (en) 2008-02-14
DE602005009221D1 (en) 2008-10-02
ES2313413T3 (en) 2009-03-01
AU2005285694B2 (en) 2010-09-16
PT1792304E (en) 2008-12-04
CA2580763C (en) 2014-07-29
CA2580763A1 (en) 2006-03-30
AU2005285694A1 (en) 2006-03-30
DK1792304T3 (en) 2009-01-05
EP1792304A1 (en) 2007-06-06
EP1792304B1 (en) 2008-08-20

Similar Documents

Publication Publication Date Title
US8014999B2 (en) Frequency compensation for perceptual speech analysis
US9025780B2 (en) Method and system for determining a perceived quality of an audio system
KR101430321B1 (en) Method and system for determining a perceived quality of an audio system
US9953663B2 (en) Method of and apparatus for evaluating quality of a degraded speech signal
US9659579B2 (en) Method of and apparatus for evaluating intelligibility of a degraded speech signal, through selecting a difference function for compensating for a disturbance type, and providing an output signal indicative of a derived quality parameter
CA2891453C (en) Method of and apparatus for evaluating intelligibility of a degraded speech signal
US20100211395A1 (en) Method and System for Speech Intelligibility Measurement of an Audio Transmission System
US20100106489A1 (en) Method and System for Speech Quality Prediction of the Impact of Time Localized Distortions of an Audio Transmission System
Côté et al. An intrusive super-wideband speech quality model: DIAL
Olatubosun et al. An Improved Logistic Function for Mapping Raw Scores of Perceptual Evaluation of Speech Quality (PESQ)
KR100275478B1 (en) Objective speech quality measure method highly correlated to subjective speech quality

Legal Events

AK Designated states; Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV LY MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW
AL Designated countries for regional patents; Kind code of ref document: A1; Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase; Ref document number: 2580763; Country of ref document: CA
WWE Wipo information: entry into national phase; Ref document number: 2007532270; Country of ref document: JP
NENP Non-entry into the national phase; Ref country code: DE
WWE Wipo information: entry into national phase; Ref document number: 2005285694; Country of ref document: AU
WWE Wipo information: entry into national phase; Ref document number: 2005787331; Country of ref document: EP
ENP Entry into the national phase; Ref document number: 2005285694; Country of ref document: AU; Date of ref document: 20050920; Kind code of ref document: A
WWP Wipo information: published in national office; Ref document number: 2005285694; Country of ref document: AU
WWE Wipo information: entry into national phase; Ref document number: 200580037713.4; Country of ref document: CN
WWE Wipo information: entry into national phase; Ref document number: 11663138; Country of ref document: US
WWP Wipo information: published in national office; Ref document number: 2005787331; Country of ref document: EP