WO2011129655A2 - Method, apparatus, and program-containing medium for assessment of audio quality - Google Patents


Info

Publication number
WO2011129655A2
WO2011129655A2 PCT/KR2011/002713
Authority
WO
WIPO (PCT)
Prior art keywords
eitddist
signal
under test
reference signal
audio
Prior art date
Application number
PCT/KR2011/002713
Other languages
French (fr)
Other versions
WO2011129655A3 (en)
Inventor
Jeong-Hun Seo
Koeng Mo Sung
Sang-Bae Chon
In-Yong Choi
Original Assignee
Jeong-Hun Seo
Koeng Mo Sung
Sang-Bae Chon
In-Yong Choi
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jeong-Hun Seo, Koeng Mo Sung, Sang-Bae Chon, In-Yong Choi
Publication of WO2011129655A2 publication Critical patent/WO2011129655A2/en
Publication of WO2011129655A3 publication Critical patent/WO2011129655A3/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/69Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis

Definitions

  • the present invention relates to an audio quality assessment method, an audio quality assessment apparatus, and a medium containing an audio quality assessment program, particularly for objective audio quality assessment.
  • The need for an objective sound quality assessment technique for multi-channel audio signals has been growing with the development of multi-channel compression techniques and the widespread use of multi-channel systems.
  • One purpose of the present invention is to develop an assessment feature for objectively evaluating a multi-channel audio compression codec, a method for evaluating an audio compression codec using the assessment feature, an apparatus for the same method, and a program-containing medium for conducting the method.
  • the quality prediction model in ITU-R Recommendation BS.1387-1 may be extended to multichannel audio coding systems, showing high performance in the prediction of perceived quality.
  • This extended model may use at least thirteen features: ten timbral features from ITU-R Rec. BS.1387-1 and three additional spatial features called ITDDist (Interaural Time Difference Distortion), ILDDist (Interaural Level Difference Distortion), and IACCDist (Interaural Cross-Correlation Distortion).
  • ITDDist can be used as an important feature for predicting errors in sound localization.
  • ITDDist may be calculated only for low frequency bands, based on the claim that ITDs (Interaural Time Differences) have greater salience in low frequency bands, where interaural phase differences are unambiguous.
  • Envelope ITDDist may be calculated in high frequency bands.
  • ITD is used for recognizing the location of a low frequency sound source.
  • the excitation pattern of a basilar membrane, which is generated by low frequency sound excitation, is delivered to an MSO (Medial Superior Olive).
  • Coincidence detection neurons may process the delivered signal to calculate the ITD.
  • Human brain can recognize a sound location by using ITD.
  • the excitation pattern of a basilar membrane is delivered to an LSO (Lateral Superior Olive). Due to this, different levels of electric signals are produced at the two LSOs (left and right).
  • Human brain can recognize a sound location using this interaural level difference of the electric signals.
  • human brain may also utilize the signal envelope information of high frequency sound for sound localization.
  • neurons located in LSOs are sensitive to high frequency transposed tones.
  • the neuron firing probabilities in auditory nerve fibers (ANFs) for high frequency transposed tones and low frequency tones are similar to each other. Sensitivity to ITDs of the high frequency envelope can be equivalent to that of ITDs in low frequency sound. Based on this phenomenon, it can be thought that EITDs of high frequency components have as much influence on sound localization of human listeners, as ITDs of low frequency sound and ILDs of high frequency sound do.
  • How the human brain processes EITDs of high frequency components is therefore an important issue.
  • the central mechanisms related to sensitivity to envelope-based ITDs are similar to those related to sensitivity to fine-structure-based ITDs. If the central mechanisms of the two cases are similar, EITDs of high frequency components can be computed as if derived by coincidence detection neurons in the MSO, although binaural cues for sound localization of high frequency sounds are extracted in the LSO. Therefore, perceived EITDs of high frequency components can be computed by the cognition model used in the computation of ITDs for low frequency bands.
  • An audio quality measurement method comprises a step for producing a number of model output variables (MOVs) including a variable representing envelope interaural time difference distortion (EITDDist) based on a comparison between a reference signal and a signal under test, and a step for mapping the number of model output variables to an audio quality value.
  • A computer readable medium is provided according to another aspect of the present invention.
  • the computer readable medium is for storing computer instructions executable by a processor for modifying an operation of a device having a processor.
  • the computer readable medium comprises computer code for producing a number of model output variables (MOVs) including a variable representing envelope interaural time difference distortion (EITDDist) based on a comparison between a reference signal and a signal under test, and mapping the number of model output variables to an audio quality value.
  • A computer readable medium is provided according to still another aspect of the present invention.
  • the computer readable medium is for storing a set of computer instructions executable by a processor for modifying another set of computer instructions.
  • the another set of computer instructions is for producing a number of model output variables (MOVs) based on comparisons between a reference signal and a signal under test and mapping the number of model output variables to an audio quality value.
  • the computer readable medium comprises computer code for modifying the another set of computer instructions to have the number of model output variables comprise a variable representing envelope interaural time difference distortion (EITDDist) based on a comparison between the reference signal and the signal under test.
  • An audio quality measurement apparatus comprises a producing means for producing a number of model output variables (MOVs) including a variable representing envelope interaural time difference distortion (EITDDist) based on a comparison between a reference signal and a signal under test, and a mapping means for mapping the number of model output variables to an audio quality value.
  • the producing means and the mapping means may be parts of a processing unit configured to execute a set of instructions for the producing and the mapping.
  • the EITDDist may be of any suitable form, for example EITDDist[k,n].
  • the EITDDist[k,n] may represent a value of envelope interaural time difference distortion obtained by comparing the reference signal and the signal under test at k-th frequency band of n-th time-frame.
  • the EITDDist[k,n] may be given by EITDDist[k,n] = C_test[k,n] · C_ref[k,n] · ΔEITD[k,n], where:
  • the ΔEITD[k,n] may represent a difference between envelope interaural time differences of the reference signal and the signal under test at the k-th frequency band of the n-th time frame,
  • the C_test[k,n] may represent a nonlinearly transformed value of an envelope interaural cross-correlation coefficient (EIACC) of the signal under test at the k-th frequency band of the n-th time frame, and
  • the C_ref[k,n] may represent a nonlinearly transformed value of an envelope interaural cross-correlation coefficient (EIACC) of the reference signal at the k-th frequency band of the n-th time frame.
  • the reference signal may be obtained from a multichannel audio signal
  • the signal under test may be obtained from an output of a device under test through which the multichannel audio signal is inputted.
  • At least one of the number of model output variables may be based on a comparison between excitation patterns of the reference signal and the signal under test.
  • the EITDDist may be obtained by applying the reference signal and the signal under test to a filter bank.
  • the reliability of an objective assessment model for multichannel audio codecs can be increased by use of the variable EITDDist.
  • FIG. 1 is a diagram illustrating a structure of a multi-channel audio reproduction system recommended by ITU-R, to which an embodiment of the present invention can be applied.
  • FIG. 2 is a diagram illustrating a structure of an apparatus for evaluating the audio quality of a multi-channel audio codec in accordance with an embodiment of the present invention.
  • FIG. 3 is a diagram describing an embodiment of sound transfer paths in accordance with an embodiment of the present invention.
  • FIG. 4 is a diagram describing the operation of one example of the preprocessing unit for binaural signal synthesis in accordance with an embodiment of the present invention.
  • FIG. 5 is a flowchart illustrating a method for evaluating an audio quality of a multi-channel audio codec in accordance with another embodiment of the present invention.
  • FIG. 6 is a flow chart for calculating an ILD distortion in accordance with one embodiment of the present invention.
  • FIG. 7 is a flow chart for calculating an EITD distortion in accordance with one embodiment of the present invention.
  • FIG. 8 is a sample envelope of an exemplary sound signal.
  • FIG. 9 shows a more detailed version of the flow chart of FIG. 7 calculating an EITD distortion.
  • ITU-R Recommendation BS.1116-1, "Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems"; ITU-R Recommendation BS.1387-1, "Method for objective measurements of perceived audio quality", International Telecommunication Union, Geneva, Switzerland, 1998.
  • a multi-channel audio signal has six channels (5.1 channels): front speakers (LF (left front) and RF (right front)), a center speaker (C), a low frequency effect (LFE) channel, and rear speakers (LS (left surround) and RS (right surround)).
  • FIG. 1 is a diagram illustrating a structure of a multi-channel audio reproduction system recommended by ITU-R, to which an embodiment of the present invention may apply.
  • the five channel speakers may be arranged on a single circle,
  • the distance between the center speaker C in the front and the listener 10 may be equal to the distance between the listener and the front left and right speakers L and R, and
  • the rear left and right speakers LS and RS may be placed on the concentric circle at 100 to 120 degrees with respect to the front, which is 0 degrees.
  • the reproduction system is arranged to conform to the standard arrangement recommended by the ITU-R because the intended (best) audio quality can be obtained this way, since most sources are edited/recorded based on the standard arrangement.
  • the listener 10 of the multi-channel audio reproduction system recommended by the ITU-R is substituted by an audio quality evaluation apparatus for a multi-channel audio codec, which evaluates the audio quality of the codec by measuring impulse responses of multi-channel audio signals from the five channel speakers L, R, C, LS and RS using a binaural microphone that simulates the body (the head and upper half).
  • FIG. 2 is a diagram illustrating a structure of an apparatus for evaluating an audio quality of a multi-channel audio codec in accordance with an embodiment of the present invention.
  • the audio quality evaluation apparatus 10 of a multi-channel audio codec may include a preprocessing unit 11 for synthesizing binaural signals,
  • an output variable calculator 12 for calculating MOVs (Model Output Variables) including IACCDist (InterAural Cross-correlation Coefficient Distortion), ILDDist (Interaural Level Difference Distortion), and EITDDist (Envelope Interaural Time Difference Distortion), and an artificial neural network circuit 13 for outputting a grade of the audio quality on the basis of the MOVs calculated by the output variable calculator 12.
  • IACC represents the maximum value of the normalized cross correlation function between the left ear input and the right ear input.
  • ILD denotes the ratio of intensity of signals between the left ear input and the right ear input.
  • EITD represents the time difference between the audio signal envelopes inputted through left and right ears, particularly for high frequency band audio signal.
  • a preprocessing unit 11 may convolve head-related impulse responses of the corresponding azimuth angles (which simulate the transfer function of the sound propagation path, including the body (head and torso) of a listener) with the five channel test signals and five channel reference signals, and sum up the results.
  • the total number of the sound transfer paths is ten, due to the five locations of loudspeakers and two ears of a listener, which may be represented by graphs as depicted in FIG. 3.
  • the output variable calculator 12 calculates MOVs including IACCDist, ILDDist, and EITDDist. The two variables IACCDist and ILDDist mirror degradations in the attributes of spatial quality.
  • the calculated MOVs may then be provided to artificial neural network circuit 13.
  • the artificial neural network circuit 13 may output a grade of the audio quality based on the MOVs provided from the output variable calculator 12.
  • the grade of the audio quality may be referred to as ODG (Objective Difference Grade).
  • the output variable calculator 12 may calculate ILDDist from the binaural signals.
  • the ILD of an uncompressed original audio signal may be denoted as ILD_ref, and
  • the ILD of the audio signal which is encoded and decoded by the multi-channel audio codec under test may be denoted as ILD_test.
  • the IACC may be named in a similar way.
  • the binaural signals may be converted into time-frequency segment signals using 75% overlapped time frames (of a length equivalent to 50 ms for IACC and 10 ms for ILD) and a filter bank of 24 auditory critical bands.
  • ILDDist for a k-th frequency band of an n-th time frame may be represented as ILDDist[k,n].
  • ILDDist[k,n] = w[k] · | ILD_test[k,n] − ILD_ref[k,n] |  (Equation 1)
  • ILDDist represents an interaural level difference distortion
  • w[k] represents a weighting function decided depending on the range of the critical band, which reflects the intensity level of a time-frequency segment and the auditory sensitivity to the ILD.
  • To acquire the ILDDist[n] of the entire auditory band in the n-th time frame, an average is taken over all frequency bands, as in Equation 2.
  • ILDDist[n] = (1/K) · Σ_k ILDDist[k,n]  (Equation 2), where K is the number of critical bands
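As an illustrative sketch only (not the patent's implementation), the per-frame computation of Equations 1 and 2 can be written directly: the per-band distortion is the weighted absolute ILD difference, and the per-frame value is its mean over the critical bands. The function name and list-based interface are assumptions.

```python
def ild_dist_frame(ild_test, ild_ref, w):
    """Sketch of Equations 1-2 for one time frame n.

    ild_test / ild_ref: length-K lists of ILD values per critical band k.
    w: length-K list of band weights (their exact values are not given here).
    """
    # Equation 1: weighted absolute ILD difference per critical band
    per_band = [w[k] * abs(ild_test[k] - ild_ref[k]) for k in range(len(w))]
    # Equation 2: average over the K critical bands for this frame
    return sum(per_band) / len(per_band)
```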
  • IACCDist may also be named ICCDist. Since ICCDist and ILDDist have a high correlation with the subjective audio quality evaluation results of the multi-channel audio codec, the output variable calculator 12 can use ICCDist and ILDDist as output variables. Variables including ICCDist and ILDDist may be inputted to the artificial neural network circuit 13, which thereby outputs a one-dimensional grade of audio quality with objectivity and consistency.
  • FIG. 4 is a diagram describing the operation of one example of the preprocessing unit of the audio quality evaluation apparatus in accordance with an embodiment of the present invention.
  • a preprocessing unit 11 of the audio quality evaluation apparatus 10 converts the impulse response of each sound transfer path, measured using a binaural microphone that simulates the body (the head and upper half) in the standard multichannel audio reproduction system recommended by the ITU-R, into a transfer function, and sums up the transfer functions to calculate the binaural input signals.
  • FIG. 5 illustrates a flowchart for a method of evaluating an audio quality of a multi-channel audio codec in accordance with an embodiment of the present invention.
  • a preprocessing unit 11 of the audio quality evaluation apparatus 10 for a multi-channel audio codec converts impulse responses of the sound sources encoded and decoded by the multi-channel audio codec, and of the original sound sources, into transfer functions, and sums up the transfer functions to calculate the binaural input signals (S501).
  • an output variable calculator 12 may calculate MOVs including IACCDist, ILDDist, and EITDDist from the time-frequency segmented binaural signals (S502).
  • the calculated MOVs may be then applied to an artificial neural network circuit 13 (S503).
  • the artificial neural network circuit 13 may output an objective audio quality grade based on the MOVs produced at the output variable calculator 12 (S504).
  • the output variable calculator 12 may further produce EITDs
  • the produced EITDs may be inputted to the artificial neural network circuit 13.
  • Audio quality degradation caused by a change of audio signal location is one of the important evaluation factors.
  • the location of an audio signal can be recognized by the ILD for high frequency components.
  • In addition to the ILD, the EITD of the high frequency components of an audio signal influences the mechanism of recognizing the location of an audio signal.
  • ILD and EITD may be respectively calculated both for a reference signal (i.e., original signal) and a test signal
  • ILDDist or EITDDist may be calculated using cognitive distance or difference between ILDs or EITDs obtained from the reference signal and the test signal, respectively.
  • multichannel audio signals may be synthesized into binaural signals.
  • a HRTF represents an audio signal transfer path from each speaker to left and right ears.
  • ILD and EITD of a high frequency audio signal may be calculated using the synthesized binaural signals.
  • FIG. 6 is a flow chart for calculating an ILD distortion in accordance with one embodiment of the present invention.
  • a binaural synthesis part 601 may produce binaural signals.
  • a peripheral ear model part 602 may produce excitation patterns of the reference signal and of the test signal.
  • an envelope extraction part 603 may produce envelopes of the excitation patterns of the reference signal and of the test signal, respectively.
  • a cognition model part 604 may calculate an ILDDist value of a high frequency band by using the envelopes from the envelope extraction part 603.
  • the binaural synthesis part 601 of FIG. 6 may correspond to the preprocessing unit 11 of FIG. 2.
  • the peripheral ear model part 602, envelope extraction part 603, and cognition model part 604 of FIG. 6 may be included in the output variable calculator 12 of FIG. 2.
  • The ILD may be defined as the energy difference between the signals inputted to the peripheral ear models of the left and right ears, each composed of multiple band-pass filters whose center frequencies are decided by the ERB (Equivalent Rectangular Bandwidth) scale, and may be represented by Equation 3.
  • a peripheral ear model is for calculating excitation patterns at basilar membrane, from audio signals inputted from both left and right ears.
  • Although the energy difference between the signals inputted from the left and right ears can be expressed as Equation 3, the human brain may process a given ILD in a different way. When a non-zero ILD is given, the higher-level signal of the two ear inputs causes more frequent neural spikes in the IC (Inferior Colliculus), which processes the ILD. Because a model for the number of neural spikes occurring in the IC follows a tangential sigmoid function, the calculated ILD value may further be nonlinearly transformed by a tangential sigmoid function, as represented in Equations 4 and 5.
  • the gradient of the tangential sigmoid function has a different sign (positive or negative) according to the energy difference of the ear input signals: if the signal from the left ear is larger than that from the right ear, the gradient may be positive; conversely, if the signal from the right ear is larger, the gradient may be negative.
  • the tangential sigmoid function may have different gradient according to each frequency band.
  • 'Tk' may represent the threshold of the tangential sigmoid function, and
  • 'Tk' may be zero (0) in the case of ILD.
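The nonlinear transform described around Equations 4 and 5 can be sketched with a tanh-shaped function. This is a sketch under assumptions: the text only states that the threshold Tk is zero for ILD and that the parameters vary per band, so the default steepness here is a placeholder, not the patent's value.

```python
import math

def tangential_sigmoid(x, steepness=1.0, threshold=0.0):
    """Tanh-shaped ("tangential sigmoid") nonlinearity with a steepness
    and threshold; maps x to the range (-1, 1), preserving the sign of
    (x - threshold) as described for the ILD case."""
    return math.tanh(steepness * (x - threshold))
```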
  • an ILDDist may be represented as Equation 6 for a time-frequency segmented signal.
  • a resulting ILDDist may be obtained by calculating the mean value of ILDDist[k,n] over all frequency bands and time frames, and may be represented as Equation 7.
  • the resulting ILDDist may be regarded as a cognitive distance due to the ILD between a test signal and a reference signal.
  • An EITDDist represents a cognitive distance of the audio signal location of a test signal and the audio signal location of a reference signal, which arises due to the difference of EITDs of the test and reference signals.
  • EITDDist, along with ILDDist, may be used as a feature for evaluating spatial impression that occurs due to the difference of high frequency audio signal source locations.
  • FIG. 7 is a flow chart for calculating an EITDDist in accordance with an embodiment of the present invention.
  • the binaural synthesis part 701 may produce binaural signals for the test and reference signals.
  • Binaural synthesis part 701 of FIG. 7 may correspond to the preprocessing unit 11 of FIG. 2.
  • multichannel sound sources may be synthesized into binaural signals for both the test and reference signals.
  • Equation 8 may be used to synthesize binaural signals from the five channel signals.
  • the subscripts 'test' and 'ref' represent a test signal and a reference signal, respectively.
  • H_CL, H_LL, H_RL, H_LSL, H_RSL, H_CR, H_LR, H_RR, H_LSR, H_RSR of Equation 8 represent a total of ten BRTFs (Binaural Room Transfer Functions), which represent the acoustic wave paths from each speaker to the left and right ears. Further, L and R of Equation 8 represent the acoustic wave inputs at the left ear and the right ear, respectively.
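The binaural synthesis of Equation 8 can be sketched as a sum of convolutions: each ear input is the sum, over the five loudspeaker channels, of the channel signal convolved with the BRTF from that loudspeaker to the ear. The dict-based interface and function name below are illustrative, not the patent's notation.

```python
import numpy as np

def binaural_synthesis(channels, brtf_left, brtf_right):
    """Sketch of Equation 8.

    channels: dict mapping channel name (e.g. 'C', 'L', 'R', 'LS', 'RS')
              to its signal array.
    brtf_left / brtf_right: dicts mapping the same names to the impulse
              responses from that loudspeaker to the left / right ear.
    """
    # Sum of per-channel convolutions gives each ear's input signal.
    y_left = sum(np.convolve(channels[ch], brtf_left[ch]) for ch in channels)
    y_right = sum(np.convolve(channels[ch], brtf_right[ch]) for ch in channels)
    return y_left, y_right
```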
  • the synthesized binaural signals can be processed by a peripheral ear model. Input signals from the two ears (left and right) are delivered to the middle ears and then processed in the cochleas, and this process can be reproduced by the peripheral ear model.
  • a cochlea simulator in the peripheral ear model may transform the binaural signals into signals which stimulate hair cells at basilar membrane.
  • the cochlea simulator may be regarded as a filter bank which is composed of a total of 24 pass-band filters with a center frequency decided by ERB (Equivalent Rectangular Bandwidth) scale.
  • the signals passed through the cochlea simulator may be transformed into excitation patterns of the signals filtered by respective pass-band filters.
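The center frequencies of such a 24-band ERB-scaled filter bank can be sketched by spacing the bands uniformly on the ERB-rate scale (the Glasberg and Moore formula, ERBrate = 21.4·log10(4.37·f/1000 + 1)). The 24-band count follows the text; the 50 Hz to 18 kHz span is an assumption, as the text does not state the frequency range.

```python
import math

def erb_center_freqs(n_bands=24, f_lo=50.0, f_hi=18000.0):
    """Center frequencies (Hz) equally spaced on the ERB-rate scale."""
    def hz_to_erb(f):
        # Glasberg & Moore ERB-rate scale
        return 21.4 * math.log10(4.37 * f / 1000.0 + 1.0)
    def erb_to_hz(e):
        # Inverse of the ERB-rate mapping
        return (10 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    e_lo, e_hi = hz_to_erb(f_lo), hz_to_erb(f_hi)
    return [erb_to_hz(e_lo + (e_hi - e_lo) * i / (n_bands - 1))
            for i in range(n_bands)]
```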
  • the peripheral ear model part 702 of FIG. 7 may produce excitation patterns of the reference signal and of the test signal.
  • the envelope extraction part 703 of FIG. 7 may produce envelopes of the excitation patterns of the reference signal and of the test signal, respectively.
  • Envelopes of the excitation patterns can be extracted by discrete Hilbert Transform.
  • the envelope is obtained as the square root of the sum of the squares of the excitation pattern and its Hilbert-transformed value.
  • FIG. 8 shows an example of an extracted envelope.
  • the solid line represents a full-wave rectified excitation pattern and the dashed line represents the extracted envelope.
  • An EITD can be obtained by calculating a binaural time difference of the extracted envelope.
  • the output signals from ERB-scale auditory filter-bank can be denoted as x[k,n], a time-frequency segmented signal.
  • 'k' and 'n' represent the index numbers of the frequency band and the time frame, respectively.
  • Envelope signal E[k,n] can be computed using the discrete Hilbert transformed signal H{x[k,n]}, as shown in Equation 9: E[k,n] = √(x[k,n]² + H{x[k,n]}²).
  • x[k,n] can also be denoted as r[k,n],
  • and H{x[k,n]} can also be denoted as i[k,n].
  • In Equation 9, 'k' represents a frequency band index segmented by the peripheral ear model, and 'n' represents the index of the time frame being processed.
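The envelope of Equation 9 is the magnitude of the analytic signal, which can be sketched with an FFT-based discrete Hilbert transform: zero the negative frequencies, double the positive ones, and take the magnitude of the inverse transform (equal to √(x² + H{x}²)). This is a standard construction, not code from the patent.

```python
import numpy as np

def envelope(x):
    """E[n] = sqrt(x[n]^2 + H{x}[n]^2) via an FFT-based discrete
    Hilbert transform (analytic-signal magnitude)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    X = np.fft.fft(x)
    # Build the analytic-signal spectrum weights: keep DC (and Nyquist),
    # double positive frequencies, zero negative frequencies.
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(X * h)   # x + j * H{x}
    return np.abs(analytic)        # sqrt(x^2 + H{x}^2)
```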
  • the cognition model part 704 of FIG. 7 may calculate an EITDDist value of the high frequency band by using the envelopes from the envelope extraction part 703.
  • high frequency EITDs can be computed using the time-segmented normalized cross-correlation function (NCF) as described in Equation 10.
  • the cross-correlation may be calculated with an approximately 10 ms rectangular window, overlapped by 7/8.
  • In Equations 11 and 12, 'N' represents the range of 'd' and corresponds to the maximum theoretically possible ITD value. EITDs and EIACCs (envelope interaural cross-correlation coefficients) are measured for the reference and test signals; the subscripts 'ref' and 'test' denote the corresponding signals, respectively.
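The EITD/EIACC extraction of Equations 10 to 12 can be sketched as follows: evaluate the normalized cross-correlation of the left and right envelope signals over lags d in [−max_lag, max_lag]; the EITD is the lag maximizing it and the EIACC is that maximum. Normalizing by the full-segment energies (rather than the exact windowed form of the text) is a simplifying assumption, as are the names.

```python
import numpy as np

def eitd_eiacc(env_left, env_right, max_lag):
    """Sketch of Equations 10-12: returns (EITD in samples, EIACC)."""
    env_left = np.asarray(env_left, dtype=float)
    env_right = np.asarray(env_right, dtype=float)
    denom = np.sqrt(np.sum(env_left ** 2) * np.sum(env_right ** 2))
    best_lag, best_val = 0, -np.inf
    for d in range(-max_lag, max_lag + 1):
        # Correlate env_left shifted by d samples against env_right.
        if d > 0:
            num = np.sum(env_left[d:] * env_right[:-d])
        elif d < 0:
            num = np.sum(env_left[:d] * env_right[-d:])
        else:
            num = np.sum(env_left * env_right)
        val = num / denom
        if val > best_val:
            best_lag, best_val = d, val
    return best_lag, best_val
```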
  • the EITD difference between a test signal and a reference signal can be computed as shown in Equation 13; that is, as the difference between two vectors whose phase angles correspond to the EITDs.
  • In Equation 13, f_s denotes the sampling rate and N is the maximum ITD in sample numbers.
  • After calculating the EITDs, the next process has to consider that EITD detection may fail in some cases: if the EIACC is too low, the perceived source location is ambiguous. Thus, a decision factor that considers the certainty of the computed EITDs is applied. This certainty can be modeled by a tangential sigmoid function that transforms the EIACC values nonlinearly, as shown in Equations 14 and 15, for both the reference signal and the test signal.
  • the tangential sigmoid function used in this model may have a steepness S of 50, and the threshold Tk may take a different value in each frequency band, since each frequency band has a different sensitivity to ITDs.
  • the EITDDist value can be obtained as shown in Equation 16; that is, the EITD distortion can be computed by applying the nonlinearly transformed EIACC values as certainty factors to the result of Equation 13.
  • the resulting EITDDist may be obtained by averaging the EITDDist[k,n] values over frequency bands and time frames, as expressed in Equation 17.
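The combination described around Equations 16 and 17 can be sketched as below: each time-frequency segment's EITD difference is weighted by the two certainty factors (the nonlinearly transformed EIACCs of the test and reference signals), and the results are averaged over bands and frames. The multiplicative form of the weighting is an assumption inferred from the surrounding description, not a formula quoted from the patent.

```python
def eitd_dist(delta_eitd, c_test, c_ref):
    """Sketch of Equations 16-17.

    delta_eitd, c_test, c_ref: K x N nested lists indexed [k][n],
    holding the per-segment EITD differences and certainty factors.
    """
    total, count = 0.0, 0
    for k in range(len(delta_eitd)):
        for n in range(len(delta_eitd[k])):
            # Eq. 16 (assumed form): certainty-weighted EITD difference
            total += c_test[k][n] * c_ref[k][n] * abs(delta_eitd[k][n])
            count += 1
    # Eq. 17: mean over all frequency bands and time frames
    return total / count
```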
  • peripheral ear model part 702, envelope extraction part 703, and cognition model part 704 of FIG. 7 may be included in the output variable calculator 12 of FIG. 2.
  • FIG. 9 shows a more detailed version of the flow chart of FIG. 7 calculating an EITD distortion.
  • the binaural synthesis part 901 may produce binaural signals from multichannel signals, using Equation 8.
  • the peripheral ear model part 902 may produce excitation patterns of a reference signal and a test signal by using binaural input signals.
  • the envelope extraction part 903 may produce envelopes of the excitation patterns of the reference signal and of the test signal by using Equation 9.
  • the NCF part 904 can calculate EITDs and EIACCs using the obtained envelopes.
  • the EITD Distortion Computation part 905 can calculate an EITDDist value using the EITDs and EIACCs of the test and reference signals.
  • the subscripts 'R', 'L', 'test', 'ref', 'k', and 'n' represent the right channel, left channel, test signal, reference signal, frequency band index, and time frame index, respectively.
  • The EITDDist-obtaining method that uses Equations 13 to 17 can be modified into a method using the following Equations 18 and 19.
  • Before obtaining an EITDDist value from the EITD values calculated by Equation 11, the EIACC values may be non-linearly transformed by applying a tangential sigmoid function, as shown in Equations 14 and 15.
  • the transformed EIACC value can be used as a weighting factor applied to the EITD values. Then, a cognitive EITD distance can be calculated from the weighted EITD values. Since the perceptual change of the source direction can be approximated as the Euclidean distance between two different positions on the unit circle, the EITD difference can be computed as in Equation 18.
  • In Equation 18, f_s denotes the sampling rate and N is the maximum ITD in sample numbers.
  • c_test[k,n] and c_ref[k,n] of Equation 18 may also be denoted as p_test[k,n] and p_ref[k,n].
  • the resulting EITDDist is averaged over frequency bands and time frames, as expressed in Equation 19.
  • the resulting EITDDist can show a mean value of EITD distances, which means a cognitive distance between reference and test signals due to EITD value difference.
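The unit-circle idea behind Equation 18 can be sketched as follows: map each EITD to a phase angle on the unit circle and take the Euclidean distance between the two unit vectors. The normalization theta = π · EITD · f_s / N (using the sampling rate and the maximum ITD in samples) is an assumed mapping chosen so that the full ITD range covers the circle; the patent's exact scaling is not reproduced in this text.

```python
import math

def eitd_unit_circle_distance(eitd_test, eitd_ref, fs, n_max):
    """Sketch of Equation 18's idea: Euclidean distance between two
    unit vectors whose phase angles correspond to the EITDs (seconds).

    fs: sampling rate in Hz; n_max: maximum ITD in samples (assumed mapping).
    """
    th_t = math.pi * eitd_test * fs / n_max
    th_r = math.pi * eitd_ref * fs / n_max
    # Distance between the two points (cos, sin) on the unit circle
    return math.hypot(math.cos(th_t) - math.cos(th_r),
                      math.sin(th_t) - math.sin(th_r))
```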
  • an audio signal evaluation apparatus may comprise a preprocessing means adapted to produce binaural input signals from the multichannel audio signals of each channel (L, R, C, LS, RS) of a multichannel audio reproducing system, an output variable calculating means adapted to output model output values including IACCDist, ILDDist, and EITDDist values, and a neural network circuit means adapted to output an audio quality level based on the model output variables.
  • a listening test database distributed by the ISO/MPEG audio group was used for the model. Subjective listening tests followed the procedures recommended in ITU-R Rec. BS.1534-1, "Multiple Stimulus with Hidden Reference and Anchor (MUSHRA)". 11 different test signals were used in the listening tests. Each test excerpt was encoded and decoded by 11 different multichannel audio coding systems. Consequently, the listening test database contains 121 items.
  • Table 1 shows the correlation coefficients between the subjective listening evaluation result and the 14 evaluation features (MOVs) used for the objective evaluation scheme.
  • Each correlation coefficient ρ(X, Y) can be calculated as in Equation 20.
  • In Equation 20, X and Y represent the MOS and the data of each feature, respectively. Correlation coefficients between the fourteen features (MOVs) and the subjective listening evaluation results were calculated for the 121 signals synthesized from binaural signals. Among the fourteen features, the last four represent the degree of degradation of spatial impression. The ten model output values and the four spatial features are listed in Tables 2 and 3, respectively.
  • Each of the above MOVs can be used as an input variable for a prediction model for an objective audio quality evaluation.
  • An objective audio quality prediction model for a multichannel audio coding system can show better prediction performance when an MOV representing spatial impression distortion, having a high correlation coefficient with the subjective listening evaluation results, is added to the model.
  • EITDDist can be used as a model output variable for evaluating spatial impression distortion in an objective audio quality prediction model. Particularly, because EITDDist has high correlation with a subjective listening evaluation result, one can improve the performance of an objective audio quality prediction model for a multichannel audio coding system by adding EITDDist to the objective audio quality prediction model as an input feature.
  • the performance of an objective audio quality evaluation model can be improved by providing spatial impression features.
  • An evaluation model reflecting cognitive differences can be provided by mathematically modeling the audio signal processing inside the human brain using the spatial features.
  • the present invention is different from a conventional method which simply provides a distortion level between an original signal and its coded/encoded signal at individual frequency bands.
  • the present invention is for obtaining a result that is similar to a statistically processed result of subjective audio quality evaluations in a multichannel audio reproduction environment. According to an embodiment of the present invention, the listening evaluation and statistical processing procedures can be omitted.
  • An embodiment of the present invention can be used for an audio compression codec performance evaluating method/apparatus in order to compare cognitive sound qualities of a reference signal and a test signal (i.e., signal under test) which is coded and decoded from the reference signal using the audio compression codec.
  • a test signal i.e., signal under test
  • the artificial neural network circuit may be substituted by a general digital signal processing unit. That is, the artificial neural network circuit in this document was introduced as an exemplary digital signal filter. Therefore, the scope of the present invention is not limited to the accompanying drawings and their related descriptions.
  • features that influence spatial impression recognition can be obtained based on the psycho-acoustical and physiological research results, and the performance of an objective evaluation model for a multichannel audio codec can be improved by implementing the features by appropriate mathematical models.
  • the method of an embodiment of the present invention as mentioned above may be implemented by a software program that is stored in a computer-readable storage medium such as a CD-ROM, RAM, ROM, floppy disk, hard disk, optical magnetic disk, or the like. This process may be readily carried out by those skilled in the art; therefore, details thereof are omitted here.
  • an embodiment of the present invention can be implemented by various means, such as hardware, firmware, software or the combination thereof.
  • an embodiment of the present invention can be implemented with one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable Logic Devices), FPGAs (Field Programmable Gate Arrays), processors, controllers, microcontrollers, and microprocessors.
  • an embodiment of the present invention can be implemented with modules, procedures, or functions that perform the above-described means or steps.
  • a software code can be saved in a memory unit and run by a processor.
  • the memory unit may be located in or outside the processor, and communicate data with the processor by various conventional communication means.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Stereophonic System (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

An audio quality measurement method is disclosed. The method comprises a step for producing a number of model output variables including a variable representing envelope interaural time difference distortion based on comparison between a reference signal and a signal under test, and a step for mapping the number of model output variables to an audio quality value.

Description

[DESCRIPTION]
[Invention Title]
METHOD, APPARATUS, AND PROGRAM-CONTAINING MEDIUM FOR ASSESSMENT OF AUDIO QUALITY
[Technical Field]
The present invention relates to an audio quality assessment method, an audio quality assessment apparatus, and a medium containing an audio quality assessment program, particularly for objective audio quality assessment.
[Background Art]
Prediction of perceived sound quality, or 'objective' quality assessment, is one of the popular applications in the field of psychoacoustics. Many researchers have introduced various methods to predict perceived quality. Some of those methods have been widely adopted and used for quality assessment of compression coding systems for monaural and stereo audio.
There is a proposal for audio quality assessment of audio signals processed by a single channel audio signal compression codec, which is recommended by the ITU Radiocommunication Sector (see ITU-R Recommendation BS.1387-1, "Method for objective measurements of perceived audio quality", International Telecommunication Union, Geneva, Switzerland, 1998). The proposal, however, has a limitation in that it cannot be used for an intermediate/low performance audio codec or a multi-channel audio codec. Further, the objective assessment of this proposal is mainly focused on applications which are subjectively assessed by applying ITU-R Recommendation BS.1116-1 (see ITU-R BS.1116-1, "Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems").
On the other hand, development of the multi-channel audio codecs that are the object of evaluation is under active discussion in the MPEG standard group, ISO/IEC JTC1/SC29/WG11, and there are publications developed by various institutions. Audio quality evaluation of these codecs has been made by a subjective listening evaluation method based on the 'MUSHRA' technique (ITU-R Recommendation BS.1534-1, "Method for the Subjective Assessment of Intermediate Sound Quality (MUSHRA)", International Telecommunication Union, Geneva, Switzerland, 2001). There are publications on listening evaluation results of diverse codecs employing the above method (see ISO/IEC JTC1/SC29/WG11 (MPEG), N7138, "Report on MPEG Spatial Audio Coding RM0 Listening Tests", and ISO/IEC JTC1/SC29/WG11 (MPEG), N7139, "Spatial Audio Coding RM0 Listening Test Data").
Such a method of evaluating the audio quality of a multi-channel audio codec, however, is highly subjective: a listener directly listens to an audio signal and evaluates its audio quality, and a statistical process is then conducted on the results. Therefore, there is a need for new methods that perform audio quality evaluation through consistent audio quality measurement, or that predict the result of an audio quality evaluation, without the listening evaluation by the listener and the statistical process required for the audio quality evaluation of a multi-channel audio codec.
[Summary of Invention]
[Technical Problem]
The need for an objective sound quality assessment technique for multi-channel audio signals has been growing with the development of multi-channel compression techniques and the common use of multi-channel systems.
One purpose of the present invention is to develop an assessment feature for objectively evaluating a multi-channel audio compression codec, along with a method for evaluating an audio compression codec using the assessment feature, an apparatus for the same, and a program-containing medium for conducting the method.
The scope of the present invention is not limited by the above described purpose of the present invention.
[Technical Solution]
For an objective audio quality assessment of a multi-channel audio signal, the quality prediction model in ITU-R Recommendation BS.1387-1 may be extended to multichannel audio coding systems, showing high performance in the prediction of the perceived quality. This extended model may use at least thirteen features: ten timbral features from ITU-R Rec. BS.1387-1 and three additional spatial features called ITDDist (Interaural Time Difference Distortion), ILDDist (Interaural Level Difference Distortion), and IACCDist (Interaural Cross-Correlation Distortion).
Especially, ITDDist can be used as an important feature for predicting errors in sound localization. ITDDist may be calculated only for low frequency bands, based on the claims that ITDs (Interaural Time Differences) have greater salience in low frequency bands, where interaural phase differences are unambiguous. However, based on many investigations saying that the ITDs in high frequency components are also important for sound localization, especially the ITDs in the temporal envelopes of high frequency signals, an Envelope ITDDist may be calculated in high frequency bands.
Generally, the human brain uses different processes to recognize the locations of low frequency sound and high frequency sound. The ITD is used for recognizing the location of a low frequency sound source. The excitation pattern of the basilar membrane, which is generated by low frequency sound excitation, is delivered to the MSO (Medial Superior Olive). Coincidence detection neurons may process the delivered signal to calculate the ITD. The human brain can then recognize a sound location by using the ITD.
On the other hand, for high frequency sound, the excitation pattern of the basilar membrane is delivered to the LSO (Lateral Superior Olive). Due to this, different levels of electric signals are produced at the two LSOs (left and right). The human brain can recognize a sound location using this interaural level difference of the electric signals. In addition to the interaural level difference, the human brain may also utilize the signal envelope information of high frequency sound for sound localization. Particularly, neurons located in the LSOs are sensitive to high frequency transposed tones. In addition, the neuron firing probabilities in auditory nerve fibers (ANFs) for high frequency transposed tones and low frequency tones are similar to each other. Sensitivity to the ITDs of the high frequency envelope can be equivalent to that to the ITDs in low frequency sound. Based on this phenomenon, it can be thought that the EITDs of high frequency components have as much influence on sound localization by human listeners as the ITDs of low frequency sound and the ILDs of high frequency sound do.
How the human brain processes the EITDs of high frequency components is an important issue. The central mechanisms related to the sensitivity to envelope-based ITDs are similar to those related to the sensitivity to fine-structure-based ITDs. If the central mechanisms of the two different cases share this similarity, the EITDs of high frequency components can be computed as derived by the coincidence detection neurons in the MSO, although the binaural cues for sound localization of high frequency sounds are extracted in the LSO. Therefore, the perceived EITDs of high frequency components can be computed by the cognition model used in the computation of the ITD for low frequency bands.
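As an illustrative sketch of this idea (an assumed procedure, not the patent's exact equations), the EITD and EIACC of one high frequency band can be estimated with a coincidence-style search over interaural lags of the left and right ear envelopes; the lag with the highest normalized cross-correlation gives the EITD, and the peak value itself is the EIACC:

```python
import numpy as np

def eitd_and_eiacc(env_left, env_right, fs, max_lag_samples):
    """Estimate EITD (seconds) and EIACC for one band from the two ear
    envelopes.  fs and max_lag_samples are hypothetical parameters."""
    norm = np.sqrt(np.sum(env_left**2) * np.sum(env_right**2))
    best_lag, best_corr = 0, -np.inf
    for lag in range(-max_lag_samples, max_lag_samples + 1):
        if lag >= 0:
            a, b = env_left[lag:], env_right[:len(env_right) - lag]
        else:
            a, b = env_left[:lag], env_right[-lag:]
        corr = np.dot(a, b) / norm   # normalized cross-correlation at this lag
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag / fs, best_corr  # EITD in seconds, EIACC
```

For identical left and right envelopes the search returns a zero lag and a unit correlation, as expected for a centered source.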
An audio quality measurement method according to one aspect of the present invention comprises a step for producing a number of model output variables (MOVs) including a variable representing envelope interaural time difference distortion (EITDDist) based on a comparison between a reference signal and a signal under test, and a step for mapping the number of model output variables to an audio quality value.
Computer readable medium is provided according to another aspect of the present invention. The computer readable medium is for storing computer instructions executable by a processor for modifying an operation of a device having a processor. The computer readable medium comprises computer code for producing a number of model output variables (MOVs) including a variable representing envelope interaural time difference distortion (EITDDist) based on a comparison between a reference signal and a signal under test, and mapping the number of model output variables to an audio quality value.
Computer readable medium is provided according to still another aspect of the present invention. The computer readable medium is for storing a set of computer instructions executable by a processor for modifying another set of computer instructions. The another set of computer instructions is for producing a number of model output variables (MOVs) based on comparisons between a reference signal and a signal under test and mapping the number of model output variables to an audio quality value. The computer readable medium comprises computer code for modifying the another set of computer instructions to have the number of model output variables comprise a variable representing envelope interaural time difference distortion (EITDDist) based on a comparison between the reference signal and the signal under test.
An audio quality measurement apparatus is provided according to still another aspect of the present invention. The apparatus comprises a producing means for producing a number of model output variables (MOVs) including a variable representing envelope interaural time difference distortion (EITDDist) based on a comparison between a reference signal and a signal under test, and a mapping means for mapping the number of model output variables to an audio quality value. The producing means and the mapping means may be parts of a processing unit configured to execute a set of instructions for the producing and the mapping.
In the above-described various aspects of the present invention, the EITDDist may be given by

EITDDist = (1 / (K · N)) · Σ_k Σ_n EITDDist[k, n]

, and the EITDDist[k, n] may represent a value of envelope interaural time difference distortion obtained by comparing the reference signal and the signal under test at the k-th frequency band of the n-th time frame, where K and N are the numbers of frequency bands and time frames, respectively. In this case, the EITDDist[k, n] may be given by

EITDDist[k, n] = (1/2) · (c_test[k, n] + c_ref[k, n]) · ΔEITD[k, n]

, and the ΔEITD[k, n] may represent the difference between the envelope interaural time differences of the reference signal and the signal under test at the k-th frequency band of the n-th time frame, the c_test[k, n] may represent a nonlinearly transformed value of the envelope interaural cross-correlation coefficient (EIACC) of the signal under test at the k-th frequency band of the n-th time frame, and the c_ref[k, n] may represent a nonlinearly transformed value of the envelope interaural cross-correlation coefficient (EIACC) of the reference signal at the k-th frequency band of the n-th time frame.
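The per-segment distortion and its averaging can be sketched numerically as follows; this is an illustrative reading, assuming ΔEITD is taken as an absolute difference (the text only calls it "a difference") and that the inputs are arrays of K bands by N frames:

```python
import numpy as np

def eitddist(eitd_test, eitd_ref, c_test, c_ref):
    """Sketch of the EITDDist computation: eitd_* are [K x N] EITD
    matrices, c_* the nonlinearly transformed EIACC weights."""
    delta = np.abs(eitd_test - eitd_ref)          # assumed |ΔEITD[k, n]|
    per_segment = 0.5 * (c_test + c_ref) * delta  # EITDDist[k, n]
    return per_segment.mean()                     # average over k and n
```

With unit weights the result is simply the mean EITD deviation, which matches the averaging over frequency bands and time frames described above.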
In the above-described various aspects of the present invention, the reference signal may be obtained from a multichannel audio signal, and the signal under test may be obtained from an output of a device under test to which the multichannel audio signal is inputted.
In the above-described various aspects of the present invention, at least one of the number of model output variables may be based on a comparison between excitation patterns of the reference signal and the signal under test.
In the above-described various aspects of the present invention, the EITDDist may be obtained by applying the reference signal and the signal under test to a filter bank.
[Advantageous Effects]
According to the present invention, the reliability of an objective assessment model for a multichannel audio codec can be increased by the use of the variable EITDDist.
The scope of the present invention is not restricted by the described advantageous effects.
[Description of Drawings]
FIG. 1 is a diagram illustrating a structure of a multi-channel audio reproduction system recommended by ITU-R, to which an embodiment of the present invention can be applied.
FIG. 2 is a diagram illustrating a structure of an apparatus for evaluating the audio quality of a multi-channel audio codec in accordance with an embodiment of the present invention.
FIG. 3 is a diagram describing an embodiment of sound transfer paths in accordance with an embodiment of the present invention.
FIG. 4 is a diagram describing the operation of one example of the preprocessing unit for binaural signal synthesis in accordance with an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a method for evaluating an audio quality of a multi-channel audio codec in accordance with another embodiment of the present invention.
FIG. 6 is a flow chart for calculating an ILD distortion in accordance with one embodiment of the present invention.
FIG. 7 is a flow chart for calculating an EITD distortion in accordance with one embodiment of the present invention.
FIG. 8 is a sample envelope of an exemplary sound signal.
FIG. 9 shows a more detailed version of the flow chart of FIG. 7 calculating an EITD distortion.
[Mode for Invention]
The advantages, features and aspects of the invention will become apparent from the following description of the embodiments with reference to the accompanying drawings, which is set forth hereinafter, so that a person skilled in the art will easily carry out the invention. Further, in the following description, well-known arts will not be described in detail if it seems that they could obscure the invention in unnecessary detail. Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.
ITU-R Recommendation BS.1116-1, "Methods for the Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems"; ITU-R Recommendation BS.1387-1, "Method for objective measurements of perceived audio quality", International Telecommunication Union, Geneva, Switzerland, 1998; ITU-R Recommendation BS.1534-1, "Method for the Subjective Assessment of Intermediate Sound Quality (MUSHRA)", International Telecommunication Union, Geneva, Switzerland, 2001; ISO/IEC JTC1/SC29/WG11 (MPEG), N7138, "Report on MPEG Spatial Audio Coding RM0 Listening Tests"; and ISO/IEC JTC1/SC29/WG11 (MPEG), N7139, "Spatial Audio Coding RM0 Listening Test Data" are incorporated herein by reference.
In general, a multi-channel audio has six channels (or 5.1 channels): front speakers (LF (left front) and RF (right front)), a center speaker (C), an intermediate and low sound channel (LFE: low frequency effect), and rear speakers (LS (left surround) and RS (right surround)). Among these, only the five channels of the front speakers (LF and RF), the center speaker (C), and the rear speakers (LS and RS) are used in some embodiments of the present invention, since the LFE is not actually used in many cases.
FIG. 1 is a diagram illustrating a structure of a multi-channel audio reproduction system recommended by ITU-R, to which an embodiment of the present invention may apply.
In the multi-channel audio reproduction system recommended by the ITU-R, as shown in FIG. 1, the five channel speakers may be arranged on the line of one circle centered around a listener 10, wherein the front left and right speakers L and R and the listener 10 form a regular triangle. The distance between the center speaker C in the front and the listener 10 may be equal to that between the front left and right speakers L and R. And the rear left and right speakers LS and RS may be placed on the concentric circle at 100 to 120 degrees with respect to the front, which is 0 degrees.
The reason why the reproduction system is arranged to conform to the standard arrangement recommended by the ITU-R is that the intended audio quality (the best audio quality) can be obtained by doing so, because most sources are edited/recorded based on the standard arrangement.
In some embodiments of the present invention, the listener 10 of the multi-channel audio reproduction system recommended by the ITU-R is substituted by an audio quality evaluation apparatus for a multi-channel audio codec, which evaluates the audio quality of the codec by measuring the impulse responses of multi-channel audio signals from the five channel speakers L, R, C, LS and RS by using a binaural microphone that simulates the body (the head and upper half).
FIG. 2 is a diagram illustrating a structure of an apparatus for evaluating an audio quality of a multi-channel audio codec in accordance with an embodiment of the present invention.
As shown in FIG. 2, the audio quality evaluation apparatus 10 of a multi-channel audio codec may include a preprocessing unit 11 for synthesizing binaural signals L_ref, R_ref, L_test, R_test based on the multi-channel audio signals transmitted through the channels L, R, C, LS and RS of a standard multi-channel audio reproduction system recommended by the ITU-R; an output variable calculator 12 for calculating MOVs (Model Output Variables) including IACCDist (Interaural Cross-correlation Coefficient Distortion), ILDDist (Interaural Level Difference Distortion), and EITDDist (Envelope Interaural Time Difference Distortion); and an artificial neural network circuit 13 for outputting a grade of the audio quality on the basis of the MOVs calculated by the output variable calculator 12.
Here, IACC represents the maximum value of the normalized cross-correlation function between the left ear input and the right ear input. ILD denotes the ratio of the intensities of the signals at the left ear input and the right ear input. And EITD represents the time difference between the audio signal envelopes inputted through the left and right ears, particularly for high frequency band audio signals.
The following is a brief explanation of the operation of each of the components of the audio quality evaluation apparatus of the multi-channel audio codec according to the invention. The five channel signals of sound sources which are encoded and decoded by a multi-channel audio codec to be evaluated are indicated by LF_test, RF_test, C_test, LS_test, RS_test, and the five channel signals of their original sound sources are denoted by LF_ref, RF_ref, C_ref, LS_ref, RS_ref. In this document, the above LF_test, RF_test, LF_ref, RF_ref may also be denoted as L_test, R_test, L_ref, R_ref.
Above all, a total of ten signals, LF_test, RF_test, C_test, LS_test, RS_test, LF_ref, RF_ref, C_ref, LS_ref, RS_ref, may be inputted to the preprocessing unit 11. The preprocessing unit 11 may convolve the head related impulse responses of the corresponding azimuth angles, which simulate the transfer function of the sound propagation path including the body (head and torso) of a listener, with the five channel test signals and the five channel reference signals, and sum up the convolutions, to thereby calculate the binaural signals L_ref, R_ref, L_test, R_test. The purpose of this process is to simulate the acoustical environment in the audio reproduction layouts, and the process is illustrated as a block diagram in FIG. 4.
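The convolve-and-sum operation described above can be sketched as follows; the channel ordering and HRIR arrays are placeholders, and a real system would use measured head related impulse responses for the five azimuths:

```python
import numpy as np

def synthesize_binaural(channels, hrirs_left, hrirs_right):
    """channels: five 1-D arrays (LF, RF, C, LS, RS); hrirs_left/right:
    the matching head-related impulse responses for each azimuth.
    Each channel is convolved with its HRIR and the results are summed
    into one signal per ear."""
    left = sum(np.convolve(x, h) for x, h in zip(channels, hrirs_left))
    right = sum(np.convolve(x, h) for x, h in zip(channels, hrirs_right))
    return left, right
```

Running this for the five test channels and again for the five reference channels yields the four binaural signals used by the later stages.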
In this configuration, the total number of the sound transfer paths is ten, due to the five locations of loudspeakers and two ears of a listener, which may be represented by graphs as depicted in FIG. 3.
The output variable calculator 12 calculates MOVs including IACCDist, ILDDist, and EITDDist. The variables IACCDist and ILDDist mirror degradations in the attributes of spatial quality. The calculated MOVs may then be provided to the artificial neural network circuit 13. The artificial neural network circuit 13 may output a grade of the audio quality based on the MOVs provided from the output variable calculator 12. The grade of the audio quality may be referred to as the ODG (Objective Difference Grade).
Here, the output variable calculator 12 may calculate ILDDist from the binaural signals L_ref, R_ref, L_test, R_test inputted from the preprocessing unit 11, by using the following Equations 1 and 2. The ILD of the uncompressed original audio signal may be denoted as ILD_ref, and the ILD of the audio signal which is encoded and decoded by the multi-channel audio codec under test may be denoted as ILD_test. Also, the IACC may be named in a similar way. For the calculation of IACC and ILD, the binaural signals may be converted into time-frequency segment signals with 75% overlapped time frames (of length equivalent to 50 ms for IACC, and to 10 ms for ILD) and 24 auditory critical band filter-banks. Among these, the ILDDist for the k-th frequency band of the n-th time frame may be represented as ILDDist[k,n].
[Equation 1]
ILDDist[k, n] = w[k] · |ILD_test[k, n] - ILD_ref[k, n]|
In Equation 1, ILDDist represents the interaural level difference distortion, and w[k] represents a weighting function that is decided depending on the range of the critical band, which reflects the intensity level of a time-frequency segment and the auditory sensitivity to the ILD.
Meanwhile, to acquire the ILDDist[n] of the entire auditory band in the n-th time frame, an average is taken over all frequency bands, as in the following Equation 2.
[Equation 2]
ILDDist[n] = (1/24) · Σ_{k=0}^{24-1} ILDDist[k, n]
By averaging the ILDDist[n] again over all time frames, the ILDDist of the multi-channel audio codec can be calculated, and the IACC can also be calculated in the same way. Here, IACCDist may be named ICCDist. Since ICCDist and ILDDist have high correlation with the audio quality evaluation (subjective evaluation) results of the multi-channel audio codec by listeners, the output variable calculator 12 can regard ICCDist and ILDDist as output variables. Variables including ICCDist and ILDDist may be inputted to the artificial neural network circuit 13, to thereby output a one-dimensional grade of the audio quality with objectivity and consistency.
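Equations 1 and 2 together with the final averaging can be sketched as follows; treating the per-band deviation as an absolute difference is an assumption, and the 24-band shape follows the critical band filter-bank described above:

```python
import numpy as np

def ilddist(ild_test, ild_ref, w):
    """ild_test/ild_ref: [24 bands x N frames] ILD matrices; w: per-band
    weighting function w[k].  Returns the codec-level ILDDist."""
    per_segment = w[:, None] * np.abs(ild_test - ild_ref)  # Equation 1
    per_frame = per_segment.mean(axis=0)                   # Equation 2
    return per_frame.mean()                                # average over frames
```

The same two-stage averaging applies to ICCDist, with the IACC deviation substituted for the ILD deviation.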
Details of the method of calculating EITDDist in the output variable calculator 12 will be explained with reference to FIG. 8.
FIG. 4 is a diagram describing the operation of one example of the preprocessing unit of the audio quality evaluation apparatus in accordance with an embodiment of the present invention.
As shown in FIG. 4, the preprocessing unit 11 of the audio quality evaluation apparatus 10 converts the impulse response of each sound transfer path, which is measured by using an interaural microphone that simulates the body (the head and upper half) in the standard multichannel audio reproduction system recommended by the ITU-R, into a transfer function, and sums up the transfer functions, to thereby calculate the interaural input signals L_ref, R_ref, L_test, R_test.
FIG. 5 illustrates a flowchart for a method of evaluating an audio quality of a multi-channel audio codec in accordance with an embodiment of the present invention.
First of all, the preprocessing unit 11 of the audio quality evaluation apparatus 10 for a multi-channel audio codec converts the impulse responses of each of the sound sources which are encoded and decoded by the multi-channel audio codec and of the original sound sources into transfer functions, and sums up the transfer functions, to thereby calculate the interaural input signals L_ref, R_ref, L_test, R_test (S501). Thereafter, the output variable calculator 12 may calculate MOVs including IACCDist, ILDDist, and EITDDist from the time-frequency segments of the binaural signals provided by the preprocessing unit 11 (S502). The calculated MOVs may then be applied to the artificial neural network circuit 13 (S503). Finally, the artificial neural network circuit 13 may output an objective audio quality grade based on the MOVs produced by the output variable calculator 12 (S504).
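Steps S503 and S504 can be sketched as a minimal one-hidden-layer network; the layer sizes and weights here are placeholders, since a usable model would be trained on subjective listening scores:

```python
import numpy as np

def ann_grade(movs, w_hidden, b_hidden, w_out, b_out):
    """Map a vector of MOVs (e.g. IACCDist, ILDDist, EITDDist, ...) to a
    single objective difference grade.  All weights are hypothetical."""
    hidden = np.tanh(movs @ w_hidden + b_hidden)  # sigmoidal hidden layer
    return float(hidden @ w_out + b_out)          # scalar quality grade
```

The choice of tanh activation mirrors the sigmoidal units commonly used in such perceptual mapping networks; any trained regression stage could stand in its place.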
Referring back to FIG. 2, the output variable calculator 12 may further produce EITDs from the binaural signals L_ref, R_ref, L_test, R_test produced by the preprocessing unit 11. The produced EITDs may be inputted to the artificial neural network circuit 13.
Audio quality degradation caused by a change of audio signal location is one of the important evaluation factors. According to the classical Duplex theory, the location of an audio signal can be recognized by the ILD for high frequency components. In addition to the ILD, the EITD of the high frequency components of an audio signal influences the mechanism of recognizing the location of an audio signal.
In one embodiment of the present invention, a method for calculating the ILD and/or EITD of a high frequency component is provided.
For evaluation of the objective performance of multichannel audio, quantitative analysis of the distortion of spatial impression is required as well as quantitative analysis of the distortion of sound tone. Distortion of sound location is one of the important factors for evaluating a distortion of spatial impression. Because the human brain uses the ILD and EITD to recognize the location of high frequency sound, a quantitative evaluation of audio signal quality may use parameters including the ILD and EITD. The ILD and EITD may each be calculated both for a reference signal (i.e., the original signal) and a test signal (i.e., a signal coded and decoded from the reference signal by a codec). ILDDist or EITDDist may be calculated using the cognitive distance or difference between the ILDs or EITDs obtained from the reference signal and the test signal, respectively. In order to calculate the ILD and EITD of a high frequency audio signal, the multichannel audio signals may be synthesized into binaural signals. HRTFs (Head-Related Transfer Functions) may be used for synthesizing the binaural signals; an HRTF represents the audio signal transfer path from each speaker to the left or right ear. The ILD and EITD of a high frequency audio signal may be calculated using the synthesized binaural signals.
FIG. 6 is a flow chart for calculating an ILD distortion in accordance with one embodiment of the present invention.
Referring to FIG. 6, a binaural synthesis part 601 may produce binaural signals L̂ref, R̂ref of a reference signal and binaural signals L̂test, R̂test of a test signal, using the above described input signals LFtest, RFtest, Ctest, LStest, RStest, LFref, RFref, Cref, LSref, RSref. A peripheral ear model part 602 may produce excitation patterns of the reference signal and the test signal by using the binaural signals L̂ref, R̂ref, L̂test, R̂test. An envelope extraction part 603 may produce envelopes of the excitation patterns of the reference signal and envelopes of the excitation patterns of the test signal, respectively. A cognition model part 604 may calculate an ILDDist value of a high frequency band by using the envelopes from the envelope extraction part 603.
The binaural synthesis part 601 of FIG. 6 may correspond to the preprocessing unit 11 of FIG. 2. The peripheral ear model part 602, envelope extraction part 603, and cognition model part 604 of FIG. 6 may be included in the output variable calculator 12 of FIG. 2.
ILD may be defined as an energy difference between the signals inputted to a peripheral ear model of the left and right ears, which is composed of a plurality of band-pass filters whose center frequencies are determined by the ERB (Equivalent Rectangular Bandwidth) scale, and may be represented by Equation 3. The peripheral ear model calculates the excitation patterns at the basilar membrane from the audio signals inputted to the left and right ears.
[Equation 3]

ILD[k, n] = 10·log10( Σ_m x_L²[k, m] / Σ_m x_R²[k, m] )

where x_L and x_R denote the left- and right-ear signals of the k-th frequency band within the n-th time frame.
Although the energy difference between the signals inputted from the left and right ears can be expressed as Equation 3, the human brain may process a given ILD in a different way. When a non-zero ILD is given, the higher level signal among the signals from the left and right ears may cause more frequent neural spikes in the IC (Inferior Colliculus), which processes the ILD. Because a model for the number of neural spikes occurring in the IC follows a tangential sigmoid function, the calculated ILD value may further be nonlinearly transformed by a tangential sigmoid function, as represented in Equations 4 and 5.
[Equation 4]

ILD′_test[k, n] = tansig( S_k · (ILD_test[k, n] − T_k) )

[Equation 5]

ILD′_ref[k, n] = tansig( S_k · (ILD_ref[k, n] − T_k) )

where ILD′ denotes the nonlinearly transformed ILD, S_k the per-band gradient (steepness), and T_k the threshold of the tangential sigmoid function.
The gradient of the tangential sigmoid function has a different sign (positive or negative) according to the energy difference of the ear input signals. If the signal from the left ear is larger than that of the right ear, the gradient may have a positive sign. To the contrary, if the signal from the right ear is larger than that of the left ear, the gradient may have a negative sign. In addition, in order to reflect the sensitivity of the neural spike occurring mechanism in the IC in each frequency band, the tangential sigmoid function may have a different gradient for each frequency band. In Equations 4 and 5, 'T_k' represents the threshold of the tangential sigmoid function, and may be zero (0) in the case of ILD. Then, an ILDDist may be represented as Equation 6 for a time-frequency segmented signal.
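The nonlinear transform described above can be sketched in Python. Note that `tansig` is the MATLAB name for the hyperbolic tangent sigmoid, and the steepness value used in the test below is an assumed placeholder — the text states only that the gradient differs per frequency band and that T_k is zero for ILD.

```python
import numpy as np

def tansig(x):
    # MATLAB-style tangential sigmoid; mathematically identical to tanh(x)
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def transform_ild(ild_db, steepness_k, threshold_k=0.0):
    # Equations 4 and 5: nonlinear transform of an ILD value.
    # threshold_k (T_k) is zero in the ILD case, as stated in the text;
    # steepness_k (S_k) is band-dependent and assumed here.
    return tansig(steepness_k * (ild_db - threshold_k))
```

Because the transform saturates toward ±1, large ILDs of either sign map to values near ±1, mirroring the saturating spike-rate model described above.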
[Equation 6]

ILDDist[k, n] = ILD′_test[k, n] − ILD′_ref[k, n]

where ILD′ denotes the nonlinearly transformed ILD of Equations 4 and 5.
A resulting ILDDist may be obtained by calculating a mean value of the ILDDist[k, n] values over the whole frequency bands and time frames, and may be represented as Equation 7. The resulting ILDDist may be regarded as a cognitive distance between a test signal and a reference signal due to the ILD.
[Equation 7]

ILDDist = (1/N)·Σ_{n=1}^{N} (1/Z)·Σ_{k=1}^{Z} ILDDist[k, n]

where 'N' and 'Z' represent the numbers of time frames and frequency bands, respectively.
An EITDDist represents a cognitive distance between the audio signal location of a test signal and that of a reference signal, which arises due to the difference between the EITDs of the test and reference signals. EITDDist, along with ILDDist, may be used as a feature for evaluating the spatial impression distortion that occurs due to the difference of high frequency audio signal source locations. FIG. 7 is a flow chart for calculating an EITDDist in accordance with an embodiment of the present invention.
Referring to FIG. 7, the binaural synthesis part 701 may produce binaural signals L̂ref, R̂ref of a reference signal and binaural signals L̂test, R̂test of a test signal, using the above described input signals LFtest, RFtest, Ctest, LStest, RStest, LFref, RFref, Cref, LSref, RSref. The binaural synthesis part 701 of FIG. 7 may correspond to the preprocessing unit 11 of FIG. 2.
At the binaural synthesis part 701, multichannel sound sources may be synthesized into binaural signals, which are represented as L̂ and R̂, using HRTFs recorded in a reference listening room as recommended in ITU-R Rec. BS.1116-1. In this case, the LFE channel may be adjusted to have a zero (0) value for every sound source. Equation 8 may be used to synthesize binaural signals from the five channel signals. Herein, the subscripts 'test' and 'ref' represent a test signal and a reference signal, respectively.
[Equation 8]

L̂ = H_C,L * C + H_LF,L * LF + H_RF,L * RF + H_LS,L * LS + H_RS,L * RS
R̂ = H_C,R * C + H_LF,R * LF + H_RF,R * RF + H_LS,R * LS + H_RS,R * RS

where '*' denotes convolution.
H_C,L, H_LF,L, H_RF,L, H_LS,L, H_RS,L, H_C,R, H_LF,R, H_RF,R, H_LS,R, H_RS,R of Equation 8 represent a total of ten BRTFs (Binaural Room Transfer Functions), which represent the acoustic wave paths from each speaker to the left and right ears. Further, L̂ and R̂ of Equation 8 represent the acoustic wave inputs at the left ear and the right ear, respectively.
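The downmix of Equation 8 can be sketched as a convolve-and-sum over the five loudspeaker channels. The impulse responses passed in below are caller-supplied placeholders, not measured BRTFs.

```python
import numpy as np

def binaural_synthesis(channels, brtf_left, brtf_right):
    # Sketch of Equation 8: each loudspeaker channel ('C', 'LF', 'RF',
    # 'LS', 'RS') is convolved with its BRTF toward the left (or right)
    # ear and the results are summed into one ear signal.
    n = len(next(iter(channels.values())))
    left = np.zeros(n)
    right = np.zeros(n)
    for name, sig in channels.items():
        left += np.convolve(sig, brtf_left[name])[:n]
        right += np.convolve(sig, brtf_right[name])[:n]
    return left, right
```

With unit-impulse "BRTFs" the left ear signal reduces to the plain sum of the five channels, which is a convenient sanity check.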
The synthesized binaural signals can be processed by a peripheral ear model. Input signals from the two ears (left and right) are delivered to the middle ears and then processed in the cochleas, and this process can be reproduced by the peripheral ear model. A cochlea simulator in the peripheral ear model may transform the binaural signals into signals which stimulate the hair cells at the basilar membrane. The cochlea simulator may be regarded as a filter bank which is composed of a total of 24 band-pass filters with center frequencies determined by the ERB (Equivalent Rectangular Bandwidth) scale. The signals passed through the cochlea simulator may be transformed into excitation patterns of the signals filtered by the respective band-pass filters. The peripheral ear model part 702 of FIG. 7 may produce excitation patterns of the reference signal and the test signal by using the binaural signals L̂ref, R̂ref, L̂test, R̂test.
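One way such a filter bank could lay out its 24 center frequencies is uniform spacing on the ERB-rate scale, using the well-known Glasberg-Moore formulas. The exact filter shapes and the 50 Hz to 18 kHz range chosen here are assumptions, not taken from this text.

```python
import numpy as np

def hz_to_erb_rate(f_hz):
    # Glasberg & Moore ERB-rate (ERB-number) scale
    return 21.4 * np.log10(4.37e-3 * f_hz + 1.0)

def erb_rate_to_hz(erb):
    # Inverse of the ERB-rate mapping
    return (10.0 ** (erb / 21.4) - 1.0) / 4.37e-3

def erb_center_frequencies(n_bands=24, f_lo=50.0, f_hi=18000.0):
    # Place n_bands center frequencies uniformly on the ERB-rate scale;
    # the frequency range is an assumed design choice.
    erbs = np.linspace(hz_to_erb_rate(f_lo), hz_to_erb_rate(f_hi), n_bands)
    return erb_rate_to_hz(erbs)
```

Uniform ERB-rate spacing concentrates bands at low frequencies, roughly matching cochlear frequency resolution.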
The envelope extraction part 703 of FIG. 7 may produce envelopes of the excitation patterns of the reference signal and envelopes of the excitation patterns of the test signal, respectively.
Envelopes of the excitation patterns can be extracted by the discrete Hilbert Transform. The envelope is obtained as the square root of the sum of the squares of the excitation pattern and its Hilbert-transformed value. FIG. 8 shows an example of an extracted envelope, in which the solid line represents a full-wave rectified excitation pattern and the dashed line represents the extracted envelope. An EITD can be obtained by calculating a binaural time difference of the extracted envelope.
As shown in Equation 9, the output signals from the ERB-scale auditory filter bank can be denoted as x[k, n], a time-frequency segmented signal. Here, 'k' and 'n' represent the index numbers of the frequency band and time frame, respectively. The envelope signal E[k, n] can be computed using the discrete Hilbert-transformed signal H{x[k, n]} as shown in Equation 9. In this document, x[k, n] can also be denoted as r[k, n], and H{x[k, n]} can also be denoted as i[k, n].
[Equation 9]
E[k, n] = sqrt( x²[k, n] + H{x[k, n]}² )
In Equation 9, 'k' represents a frequency band index which is segmented by a peripheral ear model, and 'n' represents a time frame index which is being processed.
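Equation 9 can be sketched directly in Python. To keep the sketch self-contained, the discrete Hilbert transform is built from the FFT-based analytic-signal construction rather than from a library call.

```python
import numpy as np

def discrete_hilbert(x):
    # Discrete Hilbert transform via the FFT analytic-signal weights:
    # double positive frequencies, zero negative ones, keep DC/Nyquist.
    n = len(x)
    spec = np.fft.fft(x)
    w = np.zeros(n)
    w[0] = 1.0
    if n % 2 == 0:
        w[1:n // 2] = 2.0
        w[n // 2] = 1.0
    else:
        w[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(spec * w)
    return analytic.imag  # H{x}

def envelope(x):
    # Equation 9: E[k, n] = sqrt(x^2 + H{x}^2)
    x = np.asarray(x, dtype=float)
    return np.sqrt(x ** 2 + discrete_hilbert(x) ** 2)
```

For a pure tone with an integer number of cycles in the frame, the extracted envelope is flat at the tone's amplitude, which matches the dashed line behavior described for FIG. 8.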
The cognition model part 704 of FIG. 7 may calculate an EITDDist value of a high frequency band by using the envelopes from the envelope extraction part 703.
For time-frequency segmented left and right ear input signals, high frequency EITDs can be computed using the time-segmented normalized cross-correlation function (NCF) as described in Equation 10.
[Equation 10]

NCF[d, k, n] = Σ_m E_LX,n[k, m]·E_RX,n[k, m + d] / sqrt( (Σ_m E_LX,n²[k, m])·(Σ_m E_RX,n²[k, m + d]) )

where the sums run over the samples m of the n-th time frame.
Here, 'E_LX,n' and 'E_RX,n' represent the envelope signals of the excitation patterns for the left and right ears, respectively. 'd', 'k' and 'n' represent the time lag in sample units, the frequency band index and the time frame index, respectively. In this document, E_LX,n and E_RX,n can also be denoted as X_LX,n and X_RX,n, respectively. The interaural cross-correlation coefficient (IACC) may be defined as the maximum value of the NCF over all d, and the interaural time difference (ITD) as the d value that yields that maximum NCF value. The cross-correlation may be calculated with an approximately 10 ms-length rectangular window, overlapped by 7/8. EITDs and EIACCs (Envelope InterAural Cross-Correlation) can be expressed by Equations 11 and 12 for time-frequency segmented signals.
[Equation 11]

EITD[k, n] = arg max_{|d| ≤ N} NCF[d, k, n]

[Equation 12]

EIACC[k, n] = max_{|d| ≤ N} NCF[d, k, n]
In Equations 11 and 12, 'N' represents the range of 'd', i.e., the maximum theoretically possible ITD value in samples. EITDs and EIACCs are measured for the reference and test signals; the subscripts 'ref' and 'test' represent the corresponding signals, respectively.
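The lag search of Equations 10 to 12 can be sketched as follows. The plain normalized product used here is an assumption, since the exact windowing inside the NCF image is not recoverable from this text.

```python
import numpy as np

def ncf(env_l, env_r, d):
    # Normalized cross-correlation of two envelope frames at lag d
    # (a sketch of Equation 10).
    if d >= 0:
        a, b = env_l[:len(env_l) - d], env_r[d:]
    else:
        a, b = env_l[-d:], env_r[:len(env_r) + d]
    denom = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0

def eitd_eiacc(env_l, env_r, max_lag):
    # Equations 11 and 12: EITD is the lag maximizing the NCF,
    # EIACC is that maximum NCF value.
    lags = list(range(-max_lag, max_lag + 1))
    values = [ncf(env_l, env_r, d) for d in lags]
    best = int(np.argmax(values))
    return lags[best], values[best]
```

Feeding in one envelope and a copy of it delayed by a few samples recovers that delay as the EITD, with an EIACC near 1.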
Since the perceptual change of the source direction can be approximated as a Euclidean distance between two different positions on the unit circle, the EITD difference between a test signal and a reference signal can be computed as shown in Equation 13. That is, the difference between the EITDs of a test signal and a reference signal can be computed as the distance between two unit vectors whose phase angles correspond to the EITDs. In Equation 13, f_s denotes the sampling rate and N is the maximum ITD in sample numbers.
[Equation 13]

ΔEITD[k, n] = sqrt( 2 − 2·cos( π·f_s·(EITD_test[k, n] − EITD_ref[k, n]) / N ) )
After the EITDs are calculated, the next process has to consider that EITD detection may fail in some cases: if the EIACC is too low, the perceived source location is ambiguous. Thus, a decision factor that considers the certainty of the computed EITDs is applied. This certainty can be modeled by a tangential sigmoid function that transforms the EIACCs nonlinearly, as shown in Equations 14 and 15. That is, the EIACCs may be transformed nonlinearly by a tangential sigmoid function in order to account for the case in which the detection of sound locations fails for a too low EIACC. The EIACC values can be nonlinearly transformed by Equations 14 and 15 for a reference signal and a test signal.
[Equation 14]

c_test[k, n] = tansig( S·(EIACC_test[k, n] − T_k) )

[Equation 15]

c_ref[k, n] = tansig( S·(EIACC_ref[k, n] − T_k) )
In Equations 14 and 15, the tangential sigmoid function used in this model may have a steepness S of 50, and the threshold T_k may have a different value in each frequency band, since each frequency band has a different sensitivity to ITDs.
When these certainty factors are applied to ΔEITD[k, n], an EITDDist value can be obtained as shown in Equation 16. That is, the EITD distortion can be computed by applying the nonlinearly transformed EIACC values as certainty factors to Equation 13.
[Equation 16]

EITDDist[k, n] = (1/2)·(c_test[k, n] + c_ref[k, n]) · ΔEITD[k, n]
The resulting EITDDist may be obtained by averaging the EITDDist[k, n] values over frequency bands and time frames, as expressed in Equation 17.
[Equation 17]

EITDDist = (1/N)·Σ_{n=1}^{N} sqrt( (1/Z)·Σ_{k=1}^{Z} EITDDist[k, n] )

where 'N' and 'Z' represent the numbers of time frames and frequency bands, respectively.
The peripheral ear model part 702, envelope extraction part 703, and cognition model part 704 of FIG. 7 may be included in the output variable calculator 12 of FIG. 2.
FIG. 9 shows a more detailed version of the flow chart of FIG. 7 for calculating an EITD distortion.
Referring to FIG. 9, the binaural synthesis part 901 may produce binaural signals from multichannel signals, using Equation 8. The peripheral ear model part 902 may produce excitation patterns of a reference signal and a test signal by using the binaural input signals. The envelope extraction part 903 may produce envelopes of the excitation patterns of the reference signal and envelopes of the excitation patterns of the test signal by using Equation 9, respectively. The NCF part 904 can calculate EITDs and EIACCs using the obtained envelopes. The EITD distortion computation part 905 can calculate an EITDDist value using the EITDs and EIACCs of the test and reference signals. In FIG. 9, the subscripts 'R', 'L', 'test', 'ref', 'k' and 'n' represent the right channel, left channel, test signal, reference signal, frequency band index, and time frame index, respectively.
The EITDDist-obtaining method that uses Equations 13 to 17 can be modified into a method using the following Equations 18 and 19. Before obtaining an EITDDist value from the EITD values calculated by Equation 11, the EIACC values may be nonlinearly transformed by applying a tangential sigmoid function as shown in Equations 14 and 15.
The transformed EIACC value can be used as a weighting factor which can be applied to the EITD values. Then, a cognitive EITD distance can be calculated from the weighted EITD values. Since the perceptual change of the source direction can be approximated as the Euclidean distance between two different positions on the unit circle, the EITD difference can be computed as in Equation 18. In Equation 18, f_s denotes the sampling rate and N is the maximum ITD in sample numbers. In this document, c_test[k, n] and c_ref[k, n] of Equation 18 may also be denoted as ρ_test[k, n] and ρ_ref[k, n].
[Equation 18]

ΔEITD[k, n] = sqrt( 2 − 2·cos( π·f_s·(c_test[k, n]·EITD_test[k, n] − c_ref[k, n]·EITD_ref[k, n]) / N ) )
The resulting EITDDist is averaged over frequency bands and time frames, as expressed in Equation 19. The resulting EITDDist represents a mean value of the EITD distances, i.e., a cognitive distance between the reference and test signals due to the EITD value difference.

[Equation 19]

EITDDist = (1/N)·Σ_{n=1}^{N} sqrt( (1/Z)·Σ_{k=1}^{Z} EITDDist[k, n] )
According to one embodiment of the present invention, an audio signal evaluation apparatus may comprise a preprocessing means adapted to produce binaural input signals from the multichannel audio signals of each channel (L, R, C, LS, RS) of a multichannel audio reproducing system, an output variable calculating means adapted to output model output values including IACCDist, ILDDist and EITDDist values, and a neural network circuit means adapted to output an audio quality level based on the model output variables.
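The three-stage apparatus can be sketched as a minimal pipeline stage. The single linear layer below is an arbitrary stand-in for the trained artificial neural network circuit, whose weights this text does not specify numerically.

```python
import numpy as np

class AudioQualityModel:
    # Sketch of the claimed mapping stage: MOVs in, quality value out.
    # A linear layer with a sigmoid squashing stands in for the
    # (unspecified) trained artificial neural network circuit.

    def __init__(self, weights, bias):
        self.weights = np.asarray(weights, dtype=float)
        self.bias = float(bias)

    def map_to_quality(self, movs):
        # Map model output variables (e.g. IACCDist, ILDDist, EITDDist,
        # plus tone-distortion MOVs) to one bounded audio quality value.
        z = float(np.dot(self.weights, np.asarray(movs, dtype=float))) + self.bias
        return 1.0 / (1.0 + np.exp(-z))
```

With negative weights, larger distortion-feature values push the score down, matching the negative correlations reported in Table 1.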
To train and verify a prediction model used in one embodiment of the present invention, a listening test database distributed by the ISO/MPEG audio group was used. The subjective listening tests followed the procedures recommended in ITU-R Rec. BS.1534-1, "Multiple Stimulus with Hidden Reference and Anchor" (MUSHRA). 11 different test signals were used in the listening tests, and each test excerpt was encoded and decoded by 11 different multichannel audio coding systems. Consequently, the listening test database contains 121 items.
Table 1 shows the correlation coefficients between the subjective listening evaluation result and the 14 evaluation features (MOVs) used for the objective evaluation scheme.
[Table 1]

MOV          correlation coefficient
ADB          -0.68
NMRtoB       -0.51
NLoudB       -0.51
AModDif1B    -0.45
WModDif1B    -0.44
RDF          -0.43
EHS          -0.43
AModDif2B    -0.36
AvgBwRef     -0.06
AvgBwTst     -0.00
ILDD         -0.78
IACCD        -0.62
ITDD         -0.61
EITDD        -0.72
Each correlation coefficient ρ_X,Y can be calculated as in Equation 20.
[Equation 20]

ρ_X,Y = Cov(X, Y) / (σ_X·σ_Y) = ( E[XY] − E[X]·E[Y] ) / (σ_X·σ_Y)
In Equation 20, X and Y represent the MOS data and the data of each feature, respectively. The correlation coefficients between the fourteen features (MOVs) and the subjective listening evaluation result were calculated for the 121 signals synthesized from binaural signals. Among the fourteen features, the last four represent the degree of degradation of spatial impression. The ten model output values and the four spatial features are listed in Tables 2 and 3, respectively.
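Equation 20 translates directly into a few lines of numpy, using population statistics so the covariance and standard deviations share the same normalization.

```python
import numpy as np

def pearson(x, y):
    # Equation 20: rho = (E[XY] - E[X]E[Y]) / (sigma_X * sigma_Y)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean(x * y) - np.mean(x) * np.mean(y)
    return cov / (np.std(x) * np.std(y))
```

A perfectly proportional feature gives +1, and a perfectly inversely proportional one gives -1, which is why the uniformly negative values in Table 1 are compared by absolute magnitude.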
[Table 2]

MOV          Description
AModDif1B    Averaged modulation difference
AModDif2B    Averaged modulation difference with emphasis on introduced modulations and modulation changes where the reference contains little or no modulations
WinModDifB   Windowed averaged difference in modulation (envelopes) between Reference Signal and Signal Under Test
RDF          Relative fraction of frames for which at least one frequency band contains a significant noise component
NLoudB       Rms value of the averaged noise loudness with emphasis on introduced components
[Table 3]

MOV        Description
ITDDist    Cognitive distance for the audio signal direction difference between a test signal and a reference signal, arising due to interaural time difference
ILDDist    Cognitive distance for the audio signal direction difference between a test signal and a reference signal, arising due to interaural level difference
IACCDist   Cognitive distance for the audio signal width and ambience between a test signal and a reference signal, arising due to interaural cross-correlation coefficient difference
EITDDist   Cognitive distance for the audio signal direction difference, arising due to high frequency envelope interaural time difference
Because every MOV has a negative correlation value with respect to the subjective listening evaluation result, a larger absolute value of a correlation coefficient in Table 1 means a better performance for audio quality prediction. As shown in Table 1, EITDDist shows a considerably high absolute coefficient value of 0.72, which is larger than the 0.62 of IACCDist, the 0.61 of ITDDist, and those of the ten audio tone distortion features. From this result, it can be understood that the high frequency envelope information plays an important role in spatial impression and overall audio quality for multichannel audio signals. In addition, compared to the MOVs used in ITU-R Rec. BS.1387-1, the four features in Table 3 have larger or similar correlation coefficient values. Based on this result, it can be understood that spatial impression, as well as audio tone, is very important for multichannel audio quality assessment.
Each of the above MOVs can be used as an input variable for a prediction model for an objective audio quality evaluation. An objective audio quality prediction model for a multichannel audio coding system can show better prediction performance when an MOV representing spatial impression distortion that has a high correlation coefficient with respect to a subjective listening evaluation result is added to the model. EITDDist can be used as a model output variable for evaluating spatial impression distortion in an objective audio quality prediction model. Particularly, because EITDDist has a high correlation with the subjective listening evaluation result, one can improve the performance of an objective audio quality prediction model for a multichannel audio coding system by adding EITDDist to the model as an input feature.
According to one embodiment of the present invention, the performance of an objective audio quality evaluation model can be improved by providing spatial impression features. An evaluation model reflecting cognitive differences can be provided by mathematically modeling the audio signal processing inside the human brain using the spatial features.
The present invention differs from a conventional method that simply provides a distortion level between an original signal and its coded/decoded signal at individual frequency bands. The present invention obtains a result that is similar to a statistically processed result of subjective audio quality evaluations in a multichannel audio reproduction environment. According to an embodiment of the present invention, the listening evaluation and statistical processing procedures can be omitted.
An embodiment of the present invention can be used for an audio compression codec performance evaluating method/apparatus in order to compare cognitive sound qualities of a reference signal and a test signal (i.e., signal under test) which is coded and decoded from the reference signal using the audio compression codec.
In some embodiments of the present invention, the artificial neural network circuit may be substituted by a general digital signal processing unit. That is, the artificial neural network circuit in this document was introduced as an exemplary digital signal filter. Therefore, the scope of the present invention is not limited to the accompanying drawings and their related descriptions.
According to one embodiment of the present invention, features that influence spatial impression recognition can be obtained based on psycho-acoustical and physiological research results, and the performance of an objective evaluation model for a multichannel audio codec can be improved by implementing the features with appropriate mathematical models. The method of an embodiment of the present invention as mentioned above may be implemented by a software program that is stored in a computer-readable storage medium such as a CD-ROM, RAM, ROM, floppy disk, hard disk, optical magnetic disk, or the like. This process may be readily carried out by those skilled in the art; therefore, details thereof are omitted here.
Each of the above described embodiments can be obtained by combining elements and features of the present invention. Each element or feature can be regarded as selectable unless explicitly stated otherwise. Each element or feature can be omitted in some embodiments. The order of steps described above can be interchanged without departing from the spirit and scope of the invention. An element comprised in one embodiment can be comprised in another embodiment, or can be substituted with an element or a feature of another embodiment. At least two of the accompanying claims can be merged to constitute an embodiment.
Some embodiments of the present invention can be implemented by various means, such as hardware, firmware, software or a combination thereof. In the case of a hardware implementation, an embodiment of the present invention can be implemented with one or more ASICs (Application Specific Integrated Circuits), DSPs (Digital Signal Processors), DSPDs (Digital Signal Processing Devices), PLDs (Programmable Logic Devices), FPGAs (Field Programmable Gate Arrays), processors, controllers, microcontrollers, and microprocessors.
In the case of a firmware or software implementation, an embodiment of the present invention can be implemented with modules, procedures, or functions performing the above described means or steps. A software code can be stored in a memory unit and run by a processor. The memory unit may be located inside or outside the processor, and may communicate data with the processor by various conventional communication means.
While the present invention has been described with respect to the particular embodiment, it will be apparent to those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the following claims.

Claims

[CLAIMS]
[Claim 1 ]
An audio quality measurement method, comprising:
producing a number of model output variables (MOVs) including a variable representing envelope interaural time difference distortion (EITDDist) based on comparison between a reference signal and a signal under test; and
mapping the number of model output variables to an audio quality value.
[Claim 2]
The method of claim 1, wherein the EITDDist is given by

EITDDist = (1/N)·Σ_{n=1}^{N} sqrt( (1/Z)·Σ_{k=1}^{Z} EITDDist[k, n] )

, and wherein the EITDDist[k, n] represents a value of envelope interaural time difference distortion obtained by comparing the reference signal and the signal under test at the k-th frequency band of the n-th time-frame.
[Claim 3]
The method of claim 2,

wherein the EITDDist[k, n] is given by

EITDDist[k, n] = (1/2)·(c_test[k, n] + c_ref[k, n]) · ΔEITD[k, n]

wherein the ΔEITD[k, n] represents a difference between envelope interaural time differences of the reference signal and the signal under test at the k-th frequency band of the n-th time-frame,

wherein the c_test[k, n] represents a nonlinearly transformed value of an envelope interaural cross-correlation coefficient (EIACC) of the signal under test at the k-th frequency band of the n-th time-frame, and

wherein the c_ref[k, n] represents a nonlinearly transformed value of an envelope interaural cross-correlation coefficient (EIACC) of the reference signal at the k-th frequency band of the n-th time-frame.
[Claim 4]
The method of claim 1, wherein the reference signal is obtained from a multichannel audio signal, and the signal under test is obtained from an output of a device under test to which the multichannel audio signal is inputted.
[Claim 5]
The method of claim 1, wherein at least one of the number of model output variables is based on comparison between excitation patterns of the reference signal and the signal under test.
[Claim 6]
The method of claim 1, wherein the EITDDist is obtained by applying the reference signal and the signal under test to a filter bank.
[Claim 7]
Computer readable medium for storing computer instructions executable by a processor for modifying an operation of a device having a processor, the computer readable medium comprising:
computer code for producing a number of model output variables (MOVs) including a variable representing envelope interaural time difference distortion (EITDDist) based on a comparison between a reference signal and a signal under test, and mapping the number of model output variables to an audio quality value.
[Claim 8]
The computer readable medium of claim 7, wherein the EITDDist is given by

EITDDist = (1/N)·Σ_{n=1}^{N} sqrt( (1/Z)·Σ_{k=1}^{Z} EITDDist[k, n] )

, and wherein the EITDDist[k, n] represents a value of envelope interaural time difference distortion obtained by comparing the reference signal and the signal under test at the k-th frequency band of the n-th time-frame.

[Claim 9]
Computer readable medium for storing a set of computer instructions executable by a processor for modifying another set of computer instructions for producing a number of model output variables (MOVs) based on comparison between a reference signal and a signal under test and mapping the number of model output variables to an audio quality value, the computer readable medium comprising:
computer code for modifying the another set of computer instructions to have the number of model output variables comprise a variable representing envelope interaural time difference distortion (EITDDist) based on comparison between the reference signal and the signal under test.

[Claim 10]
The computer readable medium of claim 9, wherein the EITDDist is given by

EITDDist = (1/N)·Σ_{n=1}^{N} sqrt( (1/Z)·Σ_{k=1}^{Z} EITDDist[k, n] )

, and wherein the EITDDist[k, n] represents a value of envelope interaural time difference distortion obtained by comparing the reference signal and the signal under test at the k-th frequency band of the n-th time-frame.
[Claim 1 1 ]
An audio quality measurement apparatus, comprising:
a producing means for producing a number of model output variables (MOVs) including a variable representing envelope interaural time difference distortion (EITDDist) based on comparison between a reference signal and a signal under test; and

a mapping means for mapping the number of model output variables to an audio quality value.

[Claim 12]
The audio quality measurement apparatus of claim 11, wherein the producing means and the mapping means are parts of a processing unit configured to execute a set of instructions for the producing and the mapping.

[Claim 13]
The audio quality measurement apparatus of claim 11, wherein the EITDDist is given by

EITDDist = (1/N)·Σ_{n=1}^{N} sqrt( (1/Z)·Σ_{k=1}^{Z} EITDDist[k, n] )

, and wherein the EITDDist[k, n] represents a value of envelope interaural time difference distortion obtained by comparing the reference signal and the signal under test at the k-th frequency band of the n-th time-frame.
PCT/KR2011/002713 2010-04-16 2011-04-15 Method, apparatus, and program-containing medium for assessment of audio quality WO2011129655A2 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR20100035182 2010-04-16
KR10-2010-0035182 2010-04-16
KR10-2010-0035579 2010-04-17
KR20100035579 2010-04-17

Publications (2)

Publication Number Publication Date
WO2011129655A2 true WO2011129655A2 (en) 2011-10-20
WO2011129655A3 WO2011129655A3 (en) 2012-03-15

Family

ID=44799206

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2011/002713 WO2011129655A2 (en) 2010-04-16 2011-04-15 Method, apparatus, and program-containing medium for assessment of audio quality

Country Status (2)

Country Link
KR (2) KR101170524B1 (en)
WO (1) WO2011129655A2 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108259893B (en) * 2018-03-22 2020-08-18 天津大学 Virtual reality video quality evaluation method based on double-current convolutional neural network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20070079899A (en) * 2006-02-03 2007-08-08 한국전자통신연구원 Apparatus and method for measurement of auditory quality of multichannel audio codec

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8612237B2 (en) 2007-04-04 2013-12-17 Apple Inc. Method and apparatus for determining audio spatial quality
US8233629B2 (en) * 2008-09-04 2012-07-31 Dts, Inc. Interaural time delay restoration system and method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
INYOUNG CHOI ET AL.: 'Objective measurement of perceived auditory quality in multi-channel audio compression coding systems' J. AUDIO ENG. SOC. vol. 5, 2008, page 6 *
INYOUNG CHOI ET AL.: 'Objective measurement of spatial auditory quality for multi-channel audio codecs' IEEK vol. 28, no. 2, 2005, pages 431 - 434 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102857852A (en) * 2012-09-12 2013-01-02 清华大学 Sound-field quantitative regeneration control system and method thereof
CN102857851A (en) * 2012-09-12 2013-01-02 清华大学 Sound and image synchronizing system for sound quality evaluation
CN102857852B (en) * 2012-09-12 2014-10-22 清华大学 Method for processing playback array control signal of loudspeaker of sound-field quantitative regeneration control system
CN102857851B (en) * 2012-09-12 2015-04-15 清华大学 Sound and image synchronizing system for sound quality evaluation
US10362427B2 (en) 2014-09-04 2019-07-23 Dolby Laboratories Licensing Corporation Generating metadata for audio object
CN110211610A (en) * 2019-06-20 2019-09-06 平安科技(深圳)有限公司 Assess the method, apparatus and storage medium of audio signal loss
CN111935624A (en) * 2020-09-27 2020-11-13 广州汽车集团股份有限公司 Objective evaluation method, system, equipment and storage medium for in-vehicle sound space sense
CN111935624B (en) * 2020-09-27 2021-04-06 广州汽车集团股份有限公司 Objective evaluation method, system, equipment and storage medium for in-vehicle sound space sense
WO2022112594A3 (en) * 2020-11-30 2022-07-28 Dolby International Ab Robust intrusive perceptual audio quality assessment based on convolutional neural networks
WO2023018889A1 (en) * 2021-08-13 2023-02-16 Dolby Laboratories Licensing Corporation Management of professionally generated and user-generated audio content
CN115798518A (en) * 2023-01-05 2023-03-14 腾讯科技(深圳)有限公司 Model training method, device, equipment and medium
CN115798518B (en) * 2023-01-05 2023-04-07 腾讯科技(深圳)有限公司 Model training method, device, equipment and medium

Also Published As

Publication number Publication date
KR20110115984A (en) 2011-10-24
KR20120053996A (en) 2012-05-29
KR101170524B1 (en) 2012-08-01
WO2011129655A3 (en) 2012-03-15

Similar Documents

Publication Publication Date Title
WO2011129655A2 (en) Method, apparatus, and program-containing medium for assessment of audio quality
US8238563B2 (en) System, devices and methods for predicting the perceived spatial quality of sound processing and reproducing equipment
EP1979900B1 (en) Apparatus for estimating sound quality of audio codec in multi-channel and method therefor
TWI555011B (en) Method for processing an audio signal, signal processing unit, binaural renderer, audio encoder and audio decoder
Breebaart et al. Spatial audio processing: MPEG surround and other applications
US8612237B2 (en) Method and apparatus for determining audio spatial quality
JP7526173B2 (en) Directional Loudness Map Based Audio Processing
US20090238371A1 (en) System, devices and methods for predicting the perceived spatial quality of sound processing and reproducing equipment
KR20110002491A (en) Decoding of binaural audio signals
BR112013014173B1 APPARATUS AND METHOD FOR DECOMPOSING AN INPUT SIGNAL USING A PRE-CALCULATED REFERENCE CURVE
Narbutt et al. AMBIQUAL: a full reference objective quality metric for ambisonic spatial audio
BRPI0516405B1 (en) INDIVIDUAL CHANNEL CONFORMATION FOR BCC AND SIMILAR SCHEMES
Seo et al. Perceptual objective quality evaluation method for high quality multichannel audio codecs
Choi et al. Objective measurement of perceived auditory quality in multichannel audio compression coding systems
Fleßner et al. Subjective and objective assessment of monaural and binaural aspects of audio quality
Winter et al. Colouration in local wave field synthesis
US9311925B2 (en) Method, apparatus and computer program for processing multi-channel signals
JP2006325162A Device for performing multi-channel spatial audio coding using binaural cues
RU2793703C2 (en) Audio data processing based on a directional volume map
RU2771833C1 (en) Processing of audio data based on a directional loudness map
RU2798019C2 (en) Audio data processing based on a directional volume map
Delgado et al. Energy aware modeling of interchannel level difference distortion impact on spatial audio perception
Cheng Spatial squeezing techniques for low bit-rate multichannel audio coding
Seo et al. An improved method for objective quality assessment of multichannel audio codecs
Jackson et al. Estimates of Perceived Spatial Quality across the Listening Area

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11769121

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 04/02/2013)

122 Ep: pct application non-entry in european phase

Ref document number: 11769121

Country of ref document: EP

Kind code of ref document: A2