EP1468416A1

EP1468416A1 - Method for qualitative evaluation of a digital audio signal

Info

Publication number: EP1468416A1
Application number: EP03715043A
Authority: EP
Inventors: Alexandre Joly
Original assignee: Telediffusion de France ets Public de Diffusion
Current assignee: Telediffusion de France ets Public de Diffusion
Priority date: 2002-01-24
Filing date: 2003-01-23
Publication date: 2004-10-20
Anticipated expiration: 2023-01-23
Also published as: FR2835125A1; US20050143974A1; CA2474067C; FR2835125B1; EP1468416B1; US8606385B2; US8036765B2; US20120099734A1; WO2003063134A1; CA2474067A1

Abstract

The invention relates to a method for qualitative evaluation of a digital audio signal, characterized in that a quality indicator, consisting of a vector associated with each time window, is calculated in real time and in continuous time over successive time windows. The generation of said quality indicator vector involves, for example, one of the following operations for a reference audio signal and an audio signal to be evaluated: calculation of the power spectral density of the audio signal, calculation of the coefficients of a prediction filter by means of an autoregressive method, calculation of the time activity of the signal, or calculation of the minimum of the spectrum in successive blocks of the signal. The method can involve calculation of a distance between the vectors of the reference audio signal and of the audio signal to be evaluated which are associated with each time window, in order to evaluate the degradation of the audio signal.

Description

METHOD FOR QUALITATIVE EVALUATION OF A DIGITAL AUDIO SIGNAL.

The subject of the present invention is a method for evaluating a digital audio signal, in particular a digitally transmitted signal and / or a digital signal to which digital coding has been applied, in particular with rate reduction and / or decoding. A digitally transmitted signal can be a standalone audio signal (broadcasting) or an audio signal that accompanies a program such as an audiovisual program.

The field of digital radiocommunications and broadcasting is in full expansion, in particular with the appearance of digital television and radiotelephones. New instruments must therefore be developed to measure the quality of all the systems necessary for the implementation of this technology, and thus be able to ensure quality of service.

It is for this purpose that subjective tests are used. These tests make it possible to judge the quality of sound signals by having listeners, experts or novices, listen to them. This method is long and expensive because the conditions to be observed during these tests are numerous and strict (choice of panelists, listening conditions, sequences, test chronology, etc.). It nevertheless makes it possible to build databases of reference signals together with the scores assigned to them. These tests are used to obtain MOS (Mean Opinion Score) values, which are recognized as the benchmark for quality estimation.

To try to minimize the number of these subjective tests, many studies have been done on the human hearing system. From there, models of the ear and psychoacoustic phenomena were developed, which made it possible to analyze and then estimate the quality of the sound signals by objective methods. Since the quality measured is that perceived by the human ear, it is called objective perceptual quality.

It is possible to distinguish three classes of objective qualification methods. The first ("full reference") directly compares the original signal to the degraded signal (after coding, broadcasting, multiplexing, ...). The second ("reduced reference") compares only parameters extracted from the two signals. In the third ("without reference"), the faults generated by the broadcast chain are detected using their main known characteristics. This last class overcomes the constraints linked to the use of the reference signal. Indeed, in all other cases, the reference must be sent to the place of comparison and then perfectly synchronized with the degraded signal. This makes the system complex and more expensive.

Degradations due to transmission errors significantly reduce the signal quality. They appear during broadcasting, for example of an MPEG digital stream, or during streaming, notably of radio, over the Internet.

In such a context, it is desirable to have a method which makes it possible to objectively measure the quality of an audio signal after broadcasting, without using a reference signal and/or by using a reduced reference. Indeed, only these techniques are suitable for monitoring a broadcasting network, for example, where several measurement points distant from each other may be necessary. It is also advantageous to exploit the relative simplicity of such a method to measure the quality of a digital audio signal, transmitted or not, which has been subjected to digital coding, in particular with bit rate reduction, and/or to decoding.

The number of audio quality measurement methods developed varies greatly depending on the class considered. A large number of full-reference methods have been developed, whereas only a few methods without reference or with reduced reference exist.

The full-reference methods, in which the signal to be evaluated is compared to the reference signal, correspond to the conventional techniques used, for example, to estimate the quality of audio coders. Their general principle is based on the calculation, via a perceptual hearing model, of an internal representation of the original signal and of the degraded signal, then on a comparison of these two internal representations. Such a method is described in the article by John G. Beerends and Jan A. Stemerdink entitled "A Perceptual Audio Quality Measure Based on a Psychoacoustic Sound Representation", published in the "Journal of the Audio Engineering Society", vol. 40, no. 12, December 1992, pages 963 to 978.

These hearing models are established from masking experiments, in order to obtain a representation which is as faithful as possible, and must make it possible to predict whether a degradation will be audible or not. Not all degradations of a signal are audible or annoying. These perceptual models with reference are based on the diagram in Figure 1. Many methods, more or less complete and elaborate, are based on this principle. Recently, the PEAQ algorithm (Method for objective measurements of PErceived Audio Quality) has been standardized by the ITU-R (Recommendation ITU-R BS.1387). This algorithm is based on the classical principles and associates with them a quality prediction model using a neural network.

The major interest of these techniques is that they can detect very small degradations, but it must be borne in mind that they are intended to study the influence of a coding. The measurements obtained are relative: only the difference between the two signals is taken into account. In the case of a very good quality coder, a signal that already comprises significant degradations will be coded and then decoded in an almost transparent manner, and the score assigned will therefore be very high. Conversely, for a signal that has been modified (equalized, colored, ...) between the calculation of the reference and the comparison, the score could be low even if the two signals are of very good perceptual quality.

As for the methods without reference, these remain very few. The OBQ (Output-Based Objective Speech Quality) measurement is the most advanced of the techniques without reference. This method of estimating the quality of a speech signal only, without a reference signal, is based on the calculation of perceptual parameters representing the content of the signal, gathered in a vector. These vectors, calculated on non-degraded signals, will constitute a reference base. The quality will be estimated by comparing the same parameters, extracted from the degraded signals, with the vectors of the reference base. The main method using neural networks is the OSSQAR (Objective Scaling of Sound Quality And Reproduction) measurement. The general principle of this method is to use a hearing model in conjunction with a neural network. The network is trained to predict the subjective quality of a signal from its perceptual representation calculated by the hearing model, to simulate the phenomena of psychoacoustics. It should be noted that the results obtained by these methods are much better when the signals are part of the learning base or at least when they have close characteristics.

Such methods are therefore not suitable for evaluating the quality of arbitrary signals, for example the audio signals of a radio or TV broadcast.

As indicated above, most of the objective perceptual measurement algorithms with full reference operate on an identical principle: the degraded sound signal is compared to the original signal (the signal before transmission and/or coding and/or decoding, called the reference signal). These algorithms therefore require a reference signal, which must moreover be very precisely synchronized with the signal to be tested. These conditions can only be met in simulation or during tests of coders and other "compact" systems that are not geographically distributed. The situation is very different when a signal is broadcast from transmit antennas A₁ to receive antennas A₂ (Figure 2), because the reference signal must be available at the different comparison points. To be able to use a full-reference method, the only possibility is therefore to transmit the reference, without error, to the comparison points, then to synchronize it perfectly. For reasons of spectrum congestion, and therefore of cost, these full-reference techniques are not applicable in practice, because they would require the use of a second, transparent transmission channel.

The proposed methods without reference make it possible to obtain good results but only in the case of signals with known characteristics and modeled during the learning phase. Methods without reference therefore work badly on any signal.

It has been suggested to use a so-called "reduced" reference in which the reference audio signal is characterized by one or more numbers. Such a process has been described in French Patent Application FR 2 769 777 filed on October 13, 1997. However, this process does not allow all the samples to be processed, due in particular to the fact that the bit rate of the proposed reference signal is too high (at least 36 kbit/s for windows of 1024 signal samples) to satisfy the practical conditions of installation and production in a television broadcasting network.

The present invention proposes a method according to which the indicators are simpler and can be calculated in real time and in continuous time, and require a significantly lower bit rate. Since the degradations can only modify a few samples, while degrading the quality significantly, the proposed method allows the entire audio stream to be analyzed.

The method according to the invention allows a reliable estimate of the quality of an audio signal having passed through a digital type transmission or coding. Indeed, the disturbances undergone by the transmission channels can induce the appearance of errors on the transmitted data; these errors result in degradations in the final audio signal.

The technological approach proposed consists in making one measurement on the audio signal at the input, and another at the output, of the chain or of any other system to be studied. A comparison between these measurements makes it possible to ensure the "transparency" of the transmission channel and to assess the extent of the degradations introduced. Whether or not it is used in conjunction with methods without reference, which detect degradations from the signature of the most important defects to be sought, the proposed approach allows a reliable estimate of the degradations introduced. It also makes it possible to compensate for the lack of a reference signal. This method reduces the reference bit rate necessary for estimating the quality in the case of measurements with reduced reference, and the number of parameters to be used in the case of measurements without reference.

The invention thus relates to a method for evaluating a digital audio signal, characterized in that it implements, in real time and in continuous time, in successive time windows, the calculation of a quality indicator constituted, for each time window, by a vector whose size is advantageously at least one hundred times smaller than the number of audio samples in a time window. This dimension is for example between 1 and 10, and preferably between 1 and 5.

The digital audio signal to be evaluated can be a signal which has been transmitted digitally and / or which has been subjected to digital coding, in particular with reduction in bit rate, from a digital reference signal.

According to a first variant, implementing a perceptual count difference, the method is characterized in that the generation of a said quality indicator vector implements, for a reference audio signal and for the audio signal to be evaluated, the following steps:

a) calculate the power spectral density of the audio signal for each time window and apply to it a filter representative of the attenuation of the inner and middle ear, to obtain a filtered spectral density,

b) calculate individual excitations from this filtered spectral density, using the frequency spreading function in the basilar scale,

c) determine from said individual excitations the compressed loudness, using a function modeling the nonlinear frequency sensitivity of the ear, to obtain basilar components,

d) separate the basilar components into classes, preferably three classes, and calculate for each class a number C representing the sum of the component values of this class, said vector consisting of said numbers C,

e) calculate a distance between the vectors of the reference audio signal and of the audio signal to be evaluated associated with each time window, in order to carry out a said evaluation of the degradation of the audio signal.

According to a second variant, implementing an autoregressive modeling of the audio signal, the method is characterized in that the generation of a said quality indicator vector implements, for the reference audio signal and for the audio signal to be evaluated, the following steps:

a) calculate N coefficients of a prediction filter by autoregressive modeling,

b) determine in each time window the maximum of the residual obtained by difference between the signal predicted using the prediction filter and the audio signal, said maximum of the prediction residual constituting said quality indicator vector,

c) calculate a distance between said vectors of the reference audio signal and of the audio signal to be evaluated associated with each time window, in order to carry out a said evaluation of the degradation of the audio signal.

According to a third variant, implementing an autoregressive modeling of the basilar excitation, the method is characterized in that the generation of a said quality indicator vector implements, for the reference audio signal and for the audio signal to be evaluated, the following steps:

a) calculate for each time window the power spectral density of the audio signal and apply to it a filter representative of the attenuation of the inner and middle ear, to obtain a frequency spreading function in the basilar scale,

b) calculate individual excitations from the frequency spreading function in the basilar scale,

c) obtain from said individual excitations the compressed loudness, using a function modeling the nonlinear frequency sensitivity of the ear, to obtain basilar components,

d) calculate from said basilar components N′ prediction coefficients of a prediction filter by autoregressive modeling,

e) generate for each time window a said quality indicator vector from only some of the N′ prediction coefficients.

Preferably, the quality indicator vector comprises between 5 and 10 of said prediction coefficients.

According to a fourth variant, implementing a detection of flat segments (plateaus) in the signal activity, the method is characterized in that the generation of a said quality indicator vector implements, at least for the audio signal to be evaluated, the following steps:

a) calculate a temporal activity of the signal in each time window,

b) calculate a sliding average over N₁ successive values of the temporal activity,

c) keep the minimum value among M₁ successive values of the sliding average.

The quality indicator vector can be constituted by said minimum value, or alternatively by a binary value resulting from the comparison of said minimum value with a given threshold.

The method can also be characterized in that it implements the calculation of a quality score by determining a cumulative time interval during which said minimum value is less than a given threshold and/or by determining the number of times per second that said minimum value is less than a given threshold. Alternatively, said minimum values are generated both for the reference audio signal and for the audio signal to be evaluated, and a quality vector is generated by comparison between the corresponding minimum values of the reference audio signal and of the audio signal to be evaluated, for example by calculating the difference or the ratio between said minimum values.

According to a fifth variant, implementing a detection of peaks in the activity of the audio signal, the method is characterized in that the generation of a said quality indicator vector implements, at least for the audio signal to be evaluated, the following steps: a) calculate a temporal activity of the signal in each time window, b) calculate a sliding average over N₂ successive values of the temporal activity, c) keep the maximum value among M₂ successive values of the sliding average.

The quality indicator vector can be constituted by said maximum value or by a binary value resulting from the comparison of said maximum value with a given threshold.

The method can be characterized in that a degradation indicator is generated by comparison between the maximum value obtained on the reference audio signal and its corresponding maximum value obtained on the audio signal to be evaluated, for example by calculating the difference or the ratio between these maximum values.
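The fourth and fifth variants share the same three steps and differ only in whether the minimum or the maximum of the sliding average is kept. The following is a minimal sketch under stated assumptions: the text does not define "temporal activity" at this point, so the mean absolute difference between neighbouring samples is assumed, and the sliding averages are assumed to be grouped into non-overlapping runs. The function names, window size, averaging lengths and the 0.1 threshold are illustrative only.

```python
def temporal_activity(window):
    """Assumed activity measure: mean absolute difference between neighbouring samples."""
    steps = [abs(b - a) for a, b in zip(window, window[1:])]
    return sum(steps) / len(steps) if steps else 0.0


def activity_indicator(signal, window_size, n_avg, m_keep, keep=min):
    """a) activity per window, b) sliding average over n_avg values,
    c) keep the min (plateaus) or max (peaks) of each group of m_keep averages."""
    activity = [temporal_activity(signal[i:i + window_size])
                for i in range(0, len(signal) - window_size + 1, window_size)]
    averaged = [sum(activity[i:i + n_avg]) / n_avg
                for i in range(len(activity) - n_avg + 1)]
    # Non-overlapping groups of m_keep averages (an assumption of this sketch).
    return [keep(averaged[i:i + m_keep])
            for i in range(0, len(averaged) - m_keep + 1, m_keep)]


# Alternating samples, then a frozen stretch (a plateau), then alternating again.
signal = [0.0, 1.0] * 8 + [0.5] * 16 + [0.0, 1.0] * 8
minima = activity_indicator(signal, window_size=4, n_avg=2, m_keep=2, keep=min)
maxima = activity_indicator(signal, window_size=4, n_avg=2, m_keep=2, keep=max)
plateau_flags = [value < 0.1 for value in minima]  # binary form of the indicator
print(minima)         # [1.0, 0.5, 0.0, 0.0, 1.0]
print(plateau_flags)  # [False, False, True, True, False]
```

With a reference available, the same function could be run on both signals and the resulting values compared by difference or ratio, as the text suggests.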

According to a sixth variant, implementing a calculation of the minimum of the spectrum of the audio signal, the method is characterized in that the generation of a said quality indicator vector implements, at least for the audio signal to be evaluated, the calculation of the Fourier transform in successive blocks of N₃ samples constituting said time windows, and the calculation of the minimum of the spectrum in M₃ successive blocks, which constitutes a quality indicator vector.

The method can be characterized in that it includes a step of evaluating the introduction of noise into the audio signal to be evaluated by comparing the value of said minimum of the spectrum in M₃ successive blocks associated with the audio signal to be evaluated with the maximum value of the M₃ minima obtained in the same M₃ successive blocks associated with the reference audio signal.
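The comparison of spectral minima can be sketched as follows. This is an illustration only: a naive DFT stands in for the Fourier transform, and the decision rule with its `margin` factor is a hypothetical choice, since the text does not state how the two values are compared. Block size and block count are kept small for the demonstration.

```python
import cmath
import math


def power_spectrum(block):
    """Naive DFT power spectrum (bins 0 to N/2); adequate for a small demo."""
    n = len(block)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(block))) ** 2 / n
            for k in range(n // 2 + 1)]


def spectrum_minima(signal, n3):
    """Minimum of the power spectrum in each successive block of n3 samples."""
    return [min(power_spectrum(signal[i:i + n3]))
            for i in range(0, len(signal) - n3 + 1, n3)]


def noise_introduced(reference, degraded, n3, m3, margin=2.0):
    """Hypothetical rule: the minimum over m3 blocks of the signal under test is
    compared with the maximum of the m3 reference minima, scaled by `margin`."""
    ref_minima = spectrum_minima(reference, n3)[:m3]
    deg_minima = spectrum_minima(degraded, n3)[:m3]
    return min(deg_minima) > margin * max(ref_minima)


N3, M3 = 16, 4
reference = [math.sin(2 * math.pi * 2 * t / N3) for t in range(N3 * M3)]
# A small click at the start of every block adds a flat (broadband) floor.
degraded = [x + (0.1 if t % N3 == 0 else 0.0) for t, x in enumerate(reference)]
print(noise_introduced(reference, degraded, N3, M3))   # True
print(noise_introduced(reference, reference, N3, M3))  # False
```

The spectral minimum is a convenient indicator because a pure tone leaves most bins nearly empty, so any broadband disturbance raises the minimum immediately.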

It can also be characterized in that it comprises a step of evaluating the introduction of noise into the audio signal to be evaluated by comparing the value of said minimum of the spectrum in M₃ successive blocks with an average value of the spectrum minima obtained in blocks prior to the M₃ successive blocks, for example by calculating the difference or the ratio between these values.

According to a seventh variant, implementing an estimation of the flattening of the spectrum of the audio signal, the method is characterized in that the generation of a said quality indicator vector implements, at least for the audio signal to be evaluated, the calculation of a spectrum flattening parameter which is the ratio between an arithmetic mean and a geometric mean of the components of the signal spectrum.

The method can then be characterized in that it implements an indicator for detecting a degradation of the audio signal by the introduction of broadband noise by comparing said spectrum flattening parameter between the reference audio signal and the audio signal to be evaluated, for example by calculating the difference or the ratio between these two parameters.
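As a small numerical illustration of the flattening parameter, the sketch below computes the ratio of the arithmetic mean to the geometric mean of a set of spectral components. The function name, the `floor` guard against empty bins and the two example spectra are assumptions made for the example.

```python
import math


def flattening(power_spectrum, floor=1e-12):
    """Ratio of the arithmetic mean to the geometric mean of the components.
    Equals 1 for a perfectly flat spectrum and grows as the spectrum gets peakier.
    `floor` (an assumption) avoids log(0) on empty bins."""
    values = [max(p, floor) for p in power_spectrum]
    arithmetic = sum(values) / len(values)
    geometric = math.exp(sum(math.log(v) for v in values) / len(values))
    return arithmetic / geometric


tonal = [100.0, 1.0, 1.0, 1.0]      # one dominant component
noisy = [100.0, 50.0, 50.0, 50.0]   # same peak with a raised broadband floor
ratio = flattening(noisy) / flattening(tonal)
print(flattening(tonal) > flattening(noisy), ratio < 1.0)  # True True
```

Adding broadband noise pulls the parameter towards 1, so a ratio below 1 between the signal under test and the reference points to this kind of degradation.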

Other characteristics and advantages of the invention will appear better on reading the description below in conjunction with the drawings in which:

- Figure 1 is a flowchart illustrating a quality assessment with full reference,

- Figure 2 illustrates an audio transmission with loss of quality,

- Figures 3 to 10 illustrate evaluation methods according to the present invention,

- and Figures 11 and 12 illustrate an audio quality system implementing the present invention.

The management and recovery of decoding errors is not standardized. The influence of these errors on the perceived quality therefore depends on the decoder used.

The audibility of these faults is also linked to the type of element affected in the frame, for example MPEG, and to its audio content.

In the case of significant errors due to transmission, the quality of the signal decreases sharply. These degradations appear during the broadcast, of an MPEG digital stream for example, and are, most of the time, of impulse type. They can also appear during the broadcast of an audio stream on the Internet, or during coding or decoding.

For this type of fault, the quality can be estimated in binary fashion: either the signal has not been degraded and the quality will depend on the initial coding used, or errors have been introduced and significant degradations appear.

The estimation of the quality can then be done by methods without reference, by counting the degradations detected over regular time intervals, for example of the order of one second. Subjective tests have in fact made it possible to obtain a reliable estimate of the perceived quality from the number and the length of the interruptions linked to impulse-type degradations in a signal.

For measurements obtained with a reduced reference, the proposed method makes it possible to reduce the bit rate required for transporting the reference. This allows the use of reserved channels with a relatively limited bit rate. These measurements make it possible to detect degradations other than those due to transmission errors. Thus, the present invention allows a reduction in the bit rate in the case of measurements with reduced reference and, by adding simple measurements without reference, makes it possible to keep measuring significant degradations if, for example, the reference is lost. This is done by locally generating a vector which simply characterizes the degradations, and which can therefore be easily processed and transmitted to a control installation, in particular a centralized one.

The measurements taken along the chain and at various points on the network inform the monitoring and management system for digital television broadcasting about its overall performance. Measurements of signal degradations inform the broadcasting operator about the quality of service delivered. The process is characterized by two complementary operating modes.

With reduced reference. The technological approach proposed consists in making one measurement on the audio signal at the input, and another at the output, of the transmission chain or of any other system to be studied (encoder, decoder, etc.). A comparison between these measurements makes it possible to ensure the "transparency" of the chain or system and to assess the extent of the degradations introduced. Unlike the prior art:

- the method performs an evaluation in real time and in continuous time,

- the reference measurements at the input of the chain represent a very small amount of data compared to the audio signal data, hence its classification as “reduced reference”,

- the reference data or measurements used are both a reduced representation of the content of the signal and a measure of the importance of a type of degradation.

Without reference. The invention makes it possible to compensate for the lack of a reference signal.

For this, the method defines measures for the characteristic digital faults to be sought. Unlike the prior art, the proposed approach allows a reliable estimation of the degradations introduced on any signal. This approach can be implemented both on the scale of a transmission network and locally on a piece of equipment. In addition, the computational complexity of the method is low, and the indicator obtained represents a small quantity of data compared to the digital audio stream.

Finally, the method can be applied indifferently to purely digital signals or to signals having undergone after transmission a digital to analog conversion then analog to digital.

The first three methods described below are of the so-called "with reduced reference" type.

To obtain greater precision in the estimation of quality, some of the parameters developed use perceptual modeling. The principle of objective perceptual measurements is based on the transformation of the physical representation (sound pressure, level, time and frequency) into the psychoacoustic representation (loudness, masking level, time and critical bands or Barks) of two signals (the reference signal and the signal to be evaluated) in order to compare them. This transformation relies on a model of the human auditory system (generally, a spectral analysis in the Bark domain followed by the modeling of spreading phenomena). A distance can then be calculated between the psychoacoustic representations of the two signals, and this distance can be linked to the quality of the signal to be evaluated: the smaller the distance, the closer the signal to be evaluated is to the original signal and the better its quality.

The first method implements a parameter called "Perceptual Count Difference".

The calculation of this parameter breaks down into several stages, necessary to take into account psychoacoustics. These are applied to the reference signal and to the degraded signal. These steps are as follows:

Windowing of the temporal signal into blocks then, for each of the blocks, calculation of the excitation induced by the signal using a hearing model. This representation of the signals takes into account the phenomena of psychoacoustics, and provides a histogram whose counts are the values of the basilar components.

This makes it possible to take into consideration only the audible components of the signal and therefore to be limited to useful information. To obtain this excitation, classical modeling can be used: attenuation of the outer and middle ear, integration according to critical bands and frequency masking. The time windows chosen are approximately 42 ms (2048 points at 48 kHz) with an overlap of 50%. This makes it possible to obtain a temporal resolution of the order of 21 ms.

Several steps are necessary for this modeling. For the first step, the attenuation filter for the outer and middle ear is applied to the power spectral density, obtained from the signal spectrum. This filter also takes into account the absolute hearing threshold. The notion of critical bands is modeled by a transformation of the frequency scale into a basilar scale. The following stage corresponds to the calculation of the individual excitations to take account of the masking phenomena, thanks to the frequency spreading function in the basilar scale and to a nonlinear addition. The last step yields the compressed loudness, through a power function that models the non-linear frequency sensitivity of the ear, in the form of a histogram comprising the 109 basilar components.
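The window figures quoted above can be checked with a short framing sketch. Only the sample rate, window length and 50% overlap come from the text; the helper name `frames` and the one-second test signal are illustrative.

```python
SAMPLE_RATE = 48_000   # Hz, from the text
WINDOW = 2048          # samples per window, from the text
HOP = WINDOW // 2      # 50 % overlap


def frames(signal, window=WINDOW, hop=HOP):
    """Split a signal into overlapping windows (incomplete tail is dropped)."""
    return [signal[i:i + window] for i in range(0, len(signal) - window + 1, hop)]


window_ms = 1000 * WINDOW / SAMPLE_RATE   # about 42.7 ms
hop_ms = 1000 * HOP / SAMPLE_RATE         # about 21.3 ms
n_frames = len(frames([0.0] * SAMPLE_RATE))
print(round(window_ms, 1), round(hop_ms, 1), n_frames)  # 42.7 21.3 45
```

Each window would then be passed through the hearing model described above to obtain the 109 basilar components.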

The histogram counts obtained are then grouped into three classes. This vectorization makes it possible to obtain a visual representation of the evolution of the structure of the signals. It also provides a simple and concise characterization of the signal, and therefore a particularly useful reference parameter. Several strategies exist to fix the limits of these three classes. The simplest is to separate the histogram into three zones of equal size. Thus, the 109 basilar components (or the 24 components which constitute the excitation and form a simplified representation of it) represent 24 Barks and can be separated at the following indices:

$$S_1 = 36, \quad \text{i.e. } z = \frac{24}{109} \times 36 = 7.927 \text{ Barks} \qquad (1)$$

$$S_2 = 73, \quad \text{i.e. } z = \frac{24}{109} \times 73 = 16.073 \text{ Barks} \qquad (2)$$

The second strategy takes into account the Beerends scaling zones. In fact, gain compensation between the excitation of the reference signal and that of the signal to be tested is carried out by the ear; the fixed limits are then the following:

$$S_1 = 9, \quad \text{i.e. } z = \frac{24}{109} \times 9 = 1.982 \text{ Barks} \qquad (3)$$

$$S_2 = 100, \quad \text{i.e. } z = \frac{24}{109} \times 100 = 22.018 \text{ Barks} \qquad (4)$$
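The boundary conversions in equations (1) to (4) can be verified numerically. The sketch assumes the scale factor is 24 Barks spread over 109 components, which is what the printed results imply; the function name is illustrative.

```python
N_COMPONENTS = 109   # basilar components, from the text
N_BARKS = 24         # span of the basilar scale, from the text


def index_to_bark(index):
    """Convert a component index to a position on the Bark scale."""
    return N_BARKS / N_COMPONENTS * index


equal_thirds = [round(index_to_bark(s), 3) for s in (36, 73)]
beerends_zones = [round(index_to_bark(s), 3) for s in (9, 100)]
print(equal_thirds)    # [7.927, 16.073]
print(beerends_zones)  # [1.982, 22.018]
```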

The trajectory is then represented in a triangle, called the frequency triangle. For each block, three counts C₁, C₂ and C₃ are obtained, and therefore two Cartesian coordinates (X, Y) according to the following formulas:

with C₁: sum of the basilar excitations for the high frequencies (components above S₂), C₂: count associated with the medium frequencies (components between S₁ and S₂), and N = C₁ + C₂ + C₃: total sum of the values of the components.
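The grouping into three counts can be sketched as follows. Several points are assumptions of this example: C₃ is taken to be the low-frequency class (the text defines only the high and medium classes explicitly), the boundary indices are treated as half-open slices, and the component values are invented. The sketch stops at the normalized shares and does not attempt the Cartesian mapping.

```python
def class_counts(components, s1=36, s2=73):
    """Sum the basilar components in three classes split at indices s1 and s2.
    Returns (c1, c2, c3) = (high, medium, low); treating c3 as the
    low-frequency class is an assumption of this sketch."""
    low = sum(components[:s1])
    medium = sum(components[s1:s2])
    high = sum(components[s2:])
    return high, medium, low


# Invented example: 109 components with more energy in the medium band.
components = [1.0] * 36 + [2.0] * 37 + [0.5] * 36
c1, c2, c3 = class_counts(components)
n = c1 + c2 + c3
shares = [c / n for c in (c1, c2, c3)]
print(c1, c2, c3, n)  # 18.0 74.0 36.0 128.0
print(shares)         # [0.140625, 0.578125, 0.28125]
```

Because the three shares sum to 1, only two of them are independent, which is why each window can be represented by a single point in a triangle.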

A point (X, Y) constituting a vector is therefore obtained for each time window of the signal, which corresponds to the transmission of two values per window of, for example, 1024 samples, i.e. a bit rate of 3 kbit/s for an audio signal sampled at 48 kHz. For a complete sequence, the associated representation is thus a trajectory parameterized by time, as shown in Figure 3.
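The quoted figure of 3 kbit/s can be reproduced with simple arithmetic, provided each of the two values is coded on 32 bits. That word length is an assumption of this sketch; the text gives only the resulting rate.

```python
SAMPLE_RATE = 48_000        # Hz, from the text
SAMPLES_PER_VECTOR = 1024   # one (X, Y) point per 1024 samples, from the text
VALUES_PER_VECTOR = 2       # X and Y
BITS_PER_VALUE = 32         # assumption: not stated in the text


def reference_bit_rate(sample_rate, samples_per_vector, values, bits):
    """Bit rate in bit/s needed to carry the reduced reference."""
    return sample_rate / samples_per_vector * values * bits


rate = reference_bit_rate(SAMPLE_RATE, SAMPLES_PER_VECTOR,
                          VALUES_PER_VECTOR, BITS_PER_VALUE)
print(rate)  # 3000.0
```

For comparison, the 36 kbit/s cited earlier for the prior reduced-reference approach is twelve times higher.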

The distance (Euclidean) between the reference signal and the degraded signal is then calculated. In the case of a continuous quality estimate, the distance between the points makes it possible to estimate the extent of the degradations introduced between the reference signal and the degraded signal. This distance can be considered as a perceptual distance due to the use of models of psychoacoustics.

To estimate a quality score for a signal of several seconds, it is possible to calculate an overall measure of the difference between the two signals. Several metrics can be used for this. These can be of diffuse type (average of the distances between the vertices, intercepted area, ...) or of local type (maximum, minimum distances between vertices, ...), and can depend on the position in the triangle. It is also possible to take account of barely perceptible differences ("Just Noticeable Difference"). These thresholds make it possible to determine the audibility of the differences which have appeared. They can be modeled by tolerance zones as a function of the position in the triangle, to take account of the variability of the masking phenomena. In all cases, the two trajectories must be synchronized beforehand.
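A sketch of the distance calculation and of a few of the global metrics mentioned above is given below. The trajectories are invented, the two sequences are assumed to be already synchronized, and the single `jnd` threshold is a hypothetical simplification of the position-dependent tolerance zones.

```python
import math


def window_distances(reference, degraded):
    """Euclidean distance between corresponding (X, Y) points of two trajectories."""
    if len(reference) != len(degraded):
        raise ValueError("trajectories must be synchronized and of equal length")
    return [math.dist(r, d) for r, d in zip(reference, degraded)]


def summarize(distances, jnd=0.05):
    """Diffuse (mean) and local (max, min) metrics, plus a count of windows whose
    distance exceeds a hypothetical just-noticeable-difference threshold."""
    return {
        "mean": sum(distances) / len(distances),
        "max": max(distances),
        "min": min(distances),
        "audible_windows": sum(1 for d in distances if d > jnd),
    }


reference = [(0.20, 0.10), (0.30, 0.20), (0.40, 0.30)]
degraded = [(0.20, 0.10), (0.60, 0.60), (0.40, 0.30)]   # one disturbed window
summary = summarize(window_distances(reference, degraded))
print(summary["audible_windows"])  # 1
```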

The principle of the calculation of this comparative parameter can thus be summarized by the diagram of Figure 4.

The main advantage of the parameter comes from the fact that psychoacoustic phenomena are taken into account without increasing the bit rate necessary for transferring the reference. This makes it possible to reduce the reference to 2 values for 1024 signal samples (3 kbits / s).

The second method implements an autoregressive modeling of the signal. The general principle of linear prediction is to model the signal as a combination of its past values. The idea is to calculate the N coefficients of a prediction filter by autoregressive (all-pole) modeling.

With this adaptive filter, it is possible to obtain a predicted signal from the actual signal.

The prediction errors, or residuals, are calculated as the difference between these two signals. The presence and amount of noise in a signal can be determined by analyzing these residuals. Comparing the residuals obtained on the reference signal with those calculated from the degraded signal, and therefore the noise levels, makes it possible to estimate the extent of the modifications and defects introduced.

The reference to be transmitted corresponds to the maximum of the residuals over a time window of given size. There is no point in transmitting all the residuals if the bit rate of the reference is to be reduced.

To adapt the coefficients of the prediction filter, two methods are given below by way of example:

- The Levinson-Durbin algorithm, which is described for example in the work of M. BELLANGER - Digital signal processing - Theory and practice (MASSON ed. 1987) p. 393 to 395. To use it, an estimate of the autocorrelation of the signal over a set of N₀ samples is needed. This autocorrelation is used to solve the Yule-Walker system of equations and thus obtain the coefficients of the prediction filter. Only the first N values of the autocorrelation function need to be used, where N denotes the order of the algorithm, that is to say the number of coefficients of the filter. Over a window of 1024 samples, the maximum of the prediction error is kept.

- The gradient algorithm, which is described for example in the aforementioned work of M. BELLANGER, p. 371 and following. The main drawback of the previous parameter is the need, in the case of an implementation on a DSP, to store the N₀ samples in order to estimate the autocorrelation, obtain the coefficients of the filter and then calculate the residuals. This second parameter avoids this by using another algorithm to calculate the coefficients of the filter: the gradient algorithm. It uses the error made to update the coefficients. The filter coefficients are modified in the direction of the gradient of the instantaneous squared error, with the opposite sign.
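The gradient update described in the second item can be sketched as follows. This is a minimal illustration, not the implementation of the cited work: the function name, the filter order, the step size `mu` and the use of a normalised step (added here only for numerical stability) are all assumptions.

```python
import numpy as np

def lms_residual_maxima(x, order=8, mu=0.1, window=1024):
    """Adaptive linear prediction with a gradient (LMS-type) update.

    Returns one value per window: the maximum absolute prediction error.
    `order`, `mu` and `window` are illustrative choices, not values from the text.
    """
    x = np.asarray(x, dtype=float)
    w = np.zeros(order)                      # predictor coefficients
    err = np.zeros(len(x))
    for n in range(order, len(x)):
        past = x[n - order:n][::-1]          # x[n-1], x[n-2], ..., x[n-order]
        e = x[n] - w @ past                  # instantaneous prediction error
        # The gradient of e**2 with respect to w is -2*e*past, so the update
        # moves w against it; the step is normalised by the input energy (assumption).
        w += mu * e * past / (past @ past + 1e-12)
        err[n] = e
    n_win = len(x) // window
    return np.abs(err[:n_win * window]).reshape(n_win, window).max(axis=1)
```

Only the current coefficients and the last `order` samples are held in memory, which is the storage saving the text attributes to this variant.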

Once the residuals have been obtained by difference between the predicted signal and the actual signal, only the maximum of their absolute values over a time window of given size T is kept. The reference vector to be transmitted can thus be reduced to a single number.

After transmission then synchronization, the comparison consists of a simple calculation of the distance between the maxima of the reference and of the degraded signal, for example by difference.
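The block-based variant (autocorrelation, Yule-Walker equations, maximum residual per window, comparison by difference) might look like the sketch below. The function names, the order of 10 and the choice to ignore the first samples of each window when forming the residual are assumptions made for illustration.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for an all-pole model.

    Returns a (with a[0] == 1) such that e[n] = sum_k a[k] * x[n - k].
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = float(r[0])
    for i in range(1, order + 1):
        if err <= 0.0:                       # degenerate (e.g. silent) window
            break
        acc = r[i] + a[1:i] @ r[i - 1:0:-1]
        k = -acc / err                       # reflection coefficient
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def residual_maxima(x, order=10, window=1024):
    """One value per window: max |prediction error| (the reduced reference)."""
    x = np.asarray(x, dtype=float)
    out = []
    for start in range(0, len(x) - window + 1, window):
        block = x[start:start + window]
        # Autocorrelation estimate for lags 0..order.
        r = np.array([block[:window - k] @ block[k:] for k in range(order + 1)])
        a = levinson_durbin(r, order)
        # 'valid' keeps only samples whose full past lies inside the window.
        e = np.convolve(block, a, mode="valid")
        out.append(np.abs(e).max())
    return np.array(out)

def residual_distance(ref_max, deg_max):
    """Comparison by simple difference, as suggested in the text."""
    return np.abs(np.asarray(deg_max) - np.asarray(ref_max))
```

An isolated click in the degraded signal cannot be predicted from past samples, so it shows up directly as a larger residual maximum in the affected window.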

Figure 5 summarizes the principle of the parameter calculation.

The main advantage of the two parameters is the low bit rate required to transfer the reference, which is reduced to 1 real number for 1024 signal samples.

However, no model of psychoacoustics is taken into account.

The third method implements an autoregressive modeling of the basilar excitation.

Compared with classical linear prediction, this method takes the phenomena of psychoacoustics into account in order to obtain an evaluation of the perceived quality. To this end, the calculation of the parameter involves modeling various principles of hearing. A linear prediction models the signal as a combination of its past values. Analysis of the residuals (or prediction errors) makes it possible to detect and estimate the presence of noise in a signal. The major drawback of these techniques is that they take no account of the principles of psychoacoustics. Thus, it is not possible to estimate the amount of noise actually perceived.

The method takes up the general principle of classical linear prediction. It additionally incorporates the phenomena of psychoacoustics in order to adapt to the non-linear sensitivity of the human ear in frequency (pitch) and in intensity (loudness).

The spectrum of the signal is modified by means of a hearing model before the linear prediction coefficients are calculated by autoregressive (all-pole) modeling. The coefficients thus obtained make it possible to model the signal in a simple way while taking account of psychoacoustics. It is these prediction coefficients that are transmitted and serve as the reference for the comparison with the degraded signal.

The first part of the calculation of this parameter corresponds to the modeling of the principles of psychoacoustics using classical hearing models. The second part is the calculation of the linear prediction coefficients. The last part corresponds to the comparison of the prediction coefficients calculated for the reference signal and those obtained for the degraded signal. The different steps of this method are therefore as follows:

- Temporal windowing of the signal, then calculation of an internal representation of the signal by modeling the phenomena of psychoacoustics. This step corresponds to the calculation of the compressed loudness, which is in fact the excitation induced by the signal at the level of the inner ear. This representation of the signal takes account of the phenomena of psychoacoustics and is obtained from the signal spectrum using conventional models: attenuation of the outer and middle ear, integration according to critical bands, and frequency masking. This calculation step is identical to that of the parameter described above;

- Autoregressive modeling of this compressed loudness in order to obtain the coefficients of an FIR prediction filter, just as in a classical linear prediction. The method used is the autocorrelation method, which solves the Yule-Walker equations. The first step in obtaining the prediction coefficients is therefore the calculation of the signal autocorrelation. By considering the compressed loudness as a filtered power spectrum, the autocorrelation of the perceived signal can be calculated by inverse Fourier transform.

One of the methods for solving this system of Yule-Walker equations and thus obtaining the coefficients of a predictor filter is the use of the Levinson-Durbin algorithm.

It is these prediction coefficients that constitute the reference vector to be transmitted to the point of comparison. The transformations used during the final calculation on the degraded signal are the same as for the initial phase on the reference signal.

- Estimation of the degradations by calculating a distance between the vectors obtained from the reference and from the degraded signal. The vectors of coefficients obtained for the reference and for the transmitted audio signal are compared, which makes it possible to estimate the degradations introduced during transmission. This must be done on a suitable number of coefficients: the larger the number, the more precise the calculations can be, but the higher the bit rate required for transmitting the reference. Several distances can be used to compare the vectors of coefficients. The relative importance of the coefficients can, for example, be taken into account.

The principle of the method can be summarized according to the following diagram (Figure 6).
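The last two steps can be sketched as follows, taking the compressed loudness as given (the hearing model itself is not reproduced here). Several points are assumptions: the loudness vector is treated as a one-sided power spectrum and zero-padded to obtain enough autocorrelation lags, the Yule-Walker system is solved by least squares rather than by Levinson-Durbin, and the function names and the weighted Euclidean distance are illustrative.

```python
import numpy as np

def perceptual_lpc(loudness, order=32, keep=8):
    """Prediction coefficients derived from a compressed-loudness vector.

    `loudness` is treated as a one-sided power spectrum (assumption); its
    inverse real FFT then gives an autocorrelation sequence.
    """
    loudness = np.asarray(loudness, dtype=float)
    n_lags = 2 * (order + 1)                       # enough lags for the chosen order
    r = np.fft.irfft(loudness, n=n_lags)           # zero-pads the spectrum (assumption)
    # Yule-Walker system R a = -r[1:], with R[i, j] = r[|i - j|].
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a, *_ = np.linalg.lstsq(r[idx], -r[1:order + 1], rcond=None)
    return a[:keep]                                # reduced reference vector

def coefficient_distance(a_ref, a_deg, weights=None):
    """Weighted Euclidean distance between two coefficient vectors."""
    a_ref = np.asarray(a_ref, dtype=float)
    a_deg = np.asarray(a_deg, dtype=float)
    w = np.ones(len(a_ref)) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(w * (a_ref - a_deg) ** 2)))
```

The `weights` argument is one way of taking the relative importance of the coefficients into account; how the weights would be chosen is not specified in the text.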

The modeling of the phenomena of psychoacoustics yields 24 basilar components. The order N of the prediction filter is 32. From these components, 32 autocorrelation coefficients are estimated, which gives 32 prediction coefficients, of which only 5 to 10 are kept as the quality indicator vector, for example the first 5 to 10. The main advantage of the parameter is that it takes the phenomena of psychoacoustics into account. The price is a higher bit rate for transferring the reference: 5 or 10 values for 1024 signal samples (21 ms for an audio signal sampled at 48 kHz), i.e. a bit rate of 7.5 to 15 kbit/s.
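The bit rates quoted for the reduced reference follow from simple arithmetic. The sketch below assumes that each transmitted value is coded on 32 bits; this word length is not stated in the text, but it reproduces the quoted figures. The function name is illustrative.

```python
def reference_bit_rate(values_per_window, window=1024, fs=48_000, bits_per_value=32):
    """Bit rate (bit/s) needed to transmit the reference vector.

    bits_per_value=32 is an assumption consistent with the figures in the text.
    """
    windows_per_second = fs / window          # 46.875 windows/s at 48 kHz
    return values_per_window * bits_per_value * windows_per_second

print(reference_bit_rate(5), reference_bit_rate(10))   # 7500.0 15000.0
```

With the same assumption, 2 values per window give 3 kbit/s and a single value gives 1.5 kbit/s, while a single bit per window gives about 47 bit/s.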

The following methods can be used with or without reference. This makes it possible to keep the most significant degradation detection measurements even when no reference parameter is available at the monitoring point at the time the comparison should be carried out.

The first of these methods implements the detection of flat zones ("flats") in the signal activity.

The notion of activity, which can be approximated by differentiating the audio signal, is used to identify breaks and interruptions in the time signal. Faults of this type are characteristic of decoding errors after transmission of the digital audio stream, or during the broadcasting of sound sequences on the Internet. They occur, for example, when the network bit rate becomes insufficient to ensure that all the necessary frames arrive by the time of decoding. These degradations, which introduce zones of very low activity, are perceived by the listener as different sensations: mute, blur, impulse noise, ...

The first step in calculating the parameter corresponds to estimating the temporal activity of the signal. To do this, the second derivative operator is used. It makes it possible to have a sufficiently precise estimate of the activity and requires very few calculations.

To approximate this second derivative operation in a simple way, the following formula is used:

$$f''(x_0) = f(x_0 + 2) - 2\,f(x_0) + f(x_0 - 2) \tag{7}$$

or

$$f''(x_0) = f(x_0 + 1) - 2\,f(x_0) + f(x_0 - 1) \tag{8}$$

where f(t) corresponds to the value of the sample at time t. A sliding average over N values (for example N = 21, which corresponds to about 0.5 ms for a sampling frequency of 48 kHz) then smooths the variations of the curve obtained and thus avoids false detections. Only one result is kept per block of M results (M corresponds for example to 2048 audio samples). It is the minimum of the M averages which is kept and then transmitted. The parameter is thus obtained at time t by the following formula:

$$\mathrm{Plats}(t) = \min_{0 \le k < M} \left[ \frac{1}{N} \sum_{j=0}^{N-1} y(t - k - j) \right] \tag{9}$$

where y(t) corresponds to the activity.

If the parameter is used with reference, then, after synchronization of the data, the comparison step consists of a simple difference which makes it possible to identify the zones where the signal has been replaced by flats due to decoding.

Only the moments when the activity is greatly reduced on the degraded signal are of interest. The comparison formula is therefore as follows:

$$d(t) = \max\left(0,\ \mathrm{Plats}_r(t) - \mathrm{Plats}_d(t)\right) \tag{10}$$

where Plats_r(t) and Plats_d(t) are respectively the parameter calculated on the reference and on the degraded signal.
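The activity, smoothing, block minimum and comparison steps above can be sketched as follows. Taking the absolute value of the second difference as the activity is an assumption (the text does not say how the sign is handled), and the function names are illustrative.

```python
import numpy as np

def activity(x, step=1):
    """Absolute second difference f(t+step) - 2 f(t) + f(t-step) (abs is an assumption)."""
    x = np.asarray(x, dtype=float)
    return np.abs(x[2 * step:] - 2.0 * x[step:-step] + x[:-2 * step])

def flats(x, n_avg=21, block=2048):
    """Minimum, per block of `block` values, of the sliding mean of the activity."""
    y = activity(x)
    smooth = np.convolve(y, np.ones(n_avg) / n_avg, mode="valid")   # sliding average
    n_blocks = len(smooth) // block
    return smooth[:n_blocks * block].reshape(n_blocks, block).min(axis=1)

def flats_distance(p_ref, p_deg):
    """Keeps only the blocks where activity dropped on the degraded signal."""
    return np.maximum(0.0, np.asarray(p_ref) - np.asarray(p_deg))
```

A muted gap longer than the averaging length drives the block minimum of the degraded signal to zero, so the distance is positive only for the block containing the gap.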

To further reduce the bit rate required for transporting the reference, it is also possible to compare the Plats(t) parameter, calculated on the signal, with a threshold S and thus obtain a binary parameter. When degradations appear, the drop in activity is indeed significant enough to be detected in this way. In this case, the comparison only serves to confirm the presence of the degradations. Confusion between zones of silence and zones of weak signal activity is then no longer possible. Used without reference, the parameter nevertheless still makes it possible to identify the degradations.

To move from a degradation detection parameter to an estimate of a perceptual quality score, the psychoacoustic importance of the detected degradations must be analyzed. Depending on their length and number, the perceived degradation will be very different.

The next step therefore consists in using correspondence curves from the binary parameter. These curves provide a quality score from the cumulative length and the number of impulse degradations detected per second. These curves are established from subjective tests. Different curves can be established depending on the type of audio signals (mainly speech or music). Once the estimate has been obtained, it is also possible to use a filter simulating the response of a panelist. This makes it possible to take into account the dynamic effect of the votes and the reaction times when faced with degradations.
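The two inputs of the correspondence curves (cumulative length and number of degradations per second) can be derived from the binary parameter as sketched below. The curves themselves come from subjective tests and are not reproduced; the function name and the `flags_per_second` argument are assumptions.

```python
import numpy as np

def degradation_statistics(flags, flags_per_second):
    """Per-second cumulative length and count of detected degradations.

    `flags` is the binary parameter (True where a degradation is detected).
    """
    flags = np.asarray(flags, dtype=bool)
    duration_s = len(flags) / flags_per_second
    padded = np.concatenate(([False], flags))
    events = int(np.count_nonzero(padded[1:] & ~padded[:-1]))    # rising edges
    cumulative_s = np.count_nonzero(flags) / flags_per_second
    return cumulative_s / duration_s, events / duration_s

print(degradation_statistics([0, 1, 1, 0, 0, 1, 0, 0], flags_per_second=4))  # (0.375, 1.0)
```

In the usage line, 3 of the 8 flags are set over 2 seconds, in 2 separate events.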

The parameter can be summarized by the diagram of Figure 7.

The main advantage of the parameter is the possibility of making measurements without reference. Another point of interest is the bit rate needed to transfer the reference: the reference is reduced to 1 real number for 1024 signal samples, i.e. a bit rate of 1.5 kbit/s (or even 1 bit in the event of thresholding, i.e. a bit rate of 47 bit/s). It should also be noted that the algorithm is very simple and of low complexity, which allows it to be implemented in parallel with other parameters.

The second of these methods implements activity peak detection.

This parameter, like the previous one, is based on signal activity. It makes it possible to detect dropouts, breaks, interruptions of part of the audio signal and outliers by looking for peaks in signal activity.

Thus, this time, only the maxima over blocks of M samples are kept. There is little point in transmitting and then comparing all the activity values, especially if the objective is a method requiring only a reduced reference.

The parameter is thus obtained at time t by the following formula:

$$\mathrm{ActTemp}(t) = \max_{0 \le k < M} \, y(t - k) \tag{11}$$

where y(t) is the signal activity calculated by the filter.

In the case of use with reference, this same calculation is carried out on the reference signal and on the degraded signal.

After synchronization of the two streams, the comparison of these activity maxima makes it possible to detect the zones where the signal has been disturbed. The comparison uses the ratio between the value measured on the reference and that obtained on the degraded signal. By choosing the maximum of the ratio and its inverse, the zones where the activity has been greatly reduced are detected as well as those where it has increased. The following formula is used:

$$d(t) = \max\left( \frac{\mathrm{ActTemp}_d(t)}{\mathrm{ActTemp}_r(t)},\ \frac{\mathrm{ActTemp}_r(t)}{\mathrm{ActTemp}_d(t)} \right) \tag{12}$$

where ActTemp_r(t) and ActTemp_d(t) are respectively the parameter calculated on the reference and on the degraded signal.
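A minimal sketch of the block maximum of the activity and of the ratio-based comparison is given below. As before, the activity is taken as the absolute second difference (an assumption), and the small constant `eps` is added only to avoid division by zero; the function names are illustrative.

```python
import numpy as np

def act_temp(x, block=1024):
    """Maximum of the activity over each block of `block` values."""
    x = np.asarray(x, dtype=float)
    y = np.abs(x[2:] - 2.0 * x[1:-1] + x[:-2])     # activity (abs is an assumption)
    n_blocks = len(y) // block
    return y[:n_blocks * block].reshape(n_blocks, block).max(axis=1)

def act_temp_distance(a_ref, a_deg, eps=1e-12):
    """Maximum of the ratio degraded/reference and of its inverse."""
    a_ref = np.asarray(a_ref, dtype=float) + eps
    a_deg = np.asarray(a_deg, dtype=float) + eps
    return np.maximum(a_deg / a_ref, a_ref / a_deg)
```

The distance equals 1 where the two signals have the same activity maximum and grows whenever activity is either increased (click) or reduced (dropout).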

If the reference is not available, thresholding can be used to detect whether the parameter is greater than a threshold S', which indicates the presence of degradations. To avoid false detections due to signals of an impulsive nature (attacks, percussion, ...), the threshold must have a fairly large value, which can lead to missed detections.

As in the previous case, correspondence curves can be used to estimate a perceptual quality. The method consists in integrating the degradations detected by this parameter with the others, found for example by the previous parameter, and thus obtaining a global perceptual estimate.

The principle of the parameter is presented in the diagram of Figure 8.

As with the previous parameter, the advantage of this parameter lies in the possibility of making detections without reference.

The reduced complexity and the low bit rate necessary for transporting the reference are also points of interest: the reference is limited to 1 value, i.e. a bit rate of 1.5 kbit/s (or even 1 bit in the case of thresholding, i.e. a bit rate of 47 bit/s) for 1024 signal samples sampled at 48 kHz.

The following method studies the minimum of the signal spectrum in order to locate the degradations.

It is mainly useful for the detection of so-called "impulse" degradations. It is indeed important to note that the majority of the degradations introduced, during the transmission of an audio signal, are of this type. These are very localized in time and very spread in frequency. Thus, by assimilating them to white broadband noise of very short duration in the signal, it is possible to detect them by analyzing the characteristics of the spectrum.

The first step in calculating these parameters is to estimate the spectrum of the signal. For this, the signal is windowed in blocks of N samples (N = 1024 or 2048 for example), with an overlap of N/2 samples. This gives sufficient temporal resolution and allows the whole signal to be analyzed, given that windowing strongly attenuates the influence of the edges of these time windows.

This also avoids penalizing the calculation time too heavily during implementation. A fast Fourier transform is then used to pass into the frequency domain.

The appearance of a degradation increases the minimum of the spectrum, due to the introduction of broadband white noise in all the frequency components of the spectrum. It is this principle which made it possible to develop this parameter, calculated simply according to the formula:

$$\mathrm{MinSpe} = \min_{1 \le i \le N} (X_i) \tag{13}$$

with X_i the N components of the X spectrum in dB (by distance calculation).

In the case of use with reference, a simple comparison, after synchronization, of the values obtained on the reference and on the degraded signal is generally not sufficient for the detection of degradations. Indeed, the variability of the minima obtained with an undegraded signal is large. It is thus necessary to make comparisons by blocks of M values according to the following principle. For each block, only the maximum of the M minima obtained on the reference is kept. This provides a reference value for the initial noise level of the block. This value is compared with the M minima obtained on the degraded signal. By keeping only the instants when the minima are increased, it is possible to detect the moments when noise has been added to the signal.

The distance obtained is thus, for each instant t:

where x_{r,i} is the i-th of the N components of the spectrum obtained on the reference, x_{d,i} is the i-th of the N components of the spectrum obtained on the degraded signal, and min_k is the k-th of the M minima of the block considered.

If the reference is not available, it is possible to use an average of the spectrum minima previously obtained by the algorithm. The rest of the comparison is then done in the same way.

As in the previous cases, the correspondence curves can be used by integrating the degradations detected by this parameter with the others, thus obtaining a perceptual measurement.
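The block-wise comparison described in words above can be sketched as follows. The exact distance formula is not legible in the source, so the expression used here (positive part of the difference between each degraded minimum and the largest reference minimum of the block) is an assumption based on the description, as are the Hann window, the function names and the small floor added before taking the logarithm.

```python
import numpy as np

def min_spectrum(x, n=1024):
    """Minimum (in dB) of the power spectrum of each N-sample block, 50 % overlap."""
    x = np.asarray(x, dtype=float)
    win = np.hanning(n)                            # window choice is an assumption
    out = []
    for start in range(0, len(x) - n + 1, n // 2):
        spec = np.abs(np.fft.rfft(x[start:start + n] * win)) ** 2
        out.append(10.0 * np.log10(spec + 1e-12).min())
    return np.array(out)

def min_spectrum_distance(ms_ref, ms_deg, m=4):
    """Per block of m values: how far each degraded minimum exceeds the
    largest reference minimum (taken as the block's initial noise level)."""
    n_blocks = min(len(ms_ref), len(ms_deg)) // m
    ref = np.asarray(ms_ref[:n_blocks * m]).reshape(n_blocks, m).max(axis=1)
    deg = np.asarray(ms_deg[:n_blocks * m]).reshape(n_blocks, m)
    return np.maximum(0.0, deg - ref[:, None])
```

A short burst of broadband noise raises the spectral floor of the frames it touches, which is what the positive values of the distance pick up.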

The method can be summarized by the two diagrams of Figure 9.

Again, the main advantage of these parameters is the possibility of making measurements without reference. Another point of interest is the bit rate needed to transfer the reference: the reference is reduced to 1 real number, or even 1 integer, i.e. a bit rate of at most 1.5 kbit/s for N (for example 1024) signal samples. The reduced complexity of the algorithm is also an asset.

In the following method, which analyzes the spectral flattening, two parameters, SF₁ and SF₂, make it possible to estimate the "flattening" of the spectrum, hence the term "statistical flattening" that is sometimes used. They correspond to the study of the shape of the spectrum and of its evolution along the sequence studied. When broadband noise appears in the signal, the continuous white noise component causes the spectrum to flatten.

SF₁ parameter

When a degradation appears, the components which had values close to zero take on non-negligible values. The product of the components of the spectrum thus increases strongly, while their sum varies only very little. To exploit this, the parameter SF₁ for estimating the flattening of the spectrum is calculated according to the following formula:

$$SF_1 = 10 \log_{10}\left( \frac{\text{ArithmeticMean}(X)}{\text{GeometricMean}(X)} \right) = 10 \log_{10}\left( \frac{\frac{1}{N}\sum_{j=1}^{N} X_j}{\left(\prod_{j=1}^{N} X_j\right)^{1/N}} \right) \tag{15}$$

with X the signal spectrum and X_j the components of the spectrum. This parameter is calculated in the same way on the reference and on the degraded signal. By comparison it is then possible to estimate the level of white noise inserted, and consequently the degradations.

SF₂ parameter

To calculate this parameter, the statistical flattening coefficient, called "kurtosis" or "concentration" was used. The estimation is made from the centered moments of order 2 and 4. They allow the shape of the spectrum to be estimated with respect to a normal distribution in the statistical sense of the term.

The calculation corresponds to the ratio between the centered moment of order 4 and the square of the centered moment of order 2 (variance) of the coefficients of the spectrum. The formula used is as follows:

$$SF_2 = \frac{m_4}{m_2^{2}}$$

with the centered moments m_k defined by:

$$m_k = \frac{1}{N} \sum_{j=1}^{N} \left( X_j - \bar{X} \right)^{k}$$

where $\bar{X}$ is the arithmetic mean of the N components X_j of the X spectrum in dB.

As with the parameter SF₁, the higher the value obtained, the more concentrated the signal is and the less noise it contains. This parameter is calculated on the reference and on the degraded signal. By comparison, the level of white noise inserted is estimated.
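Both flattening parameters reduce to a few lines. In this sketch, SF₁ is computed on a linear power spectrum (an assumption, since the geometric mean requires positive components) and SF₂ on the spectrum in dB, as the text indicates; the function names and the small floor `eps` are illustrative.

```python
import numpy as np

def sf1(power_spectrum, eps=1e-12):
    """10*log10 of the ratio arithmetic mean / geometric mean of the spectrum."""
    p = np.asarray(power_spectrum, dtype=float) + eps
    arithmetic = p.mean()
    geometric = np.exp(np.log(p).mean())           # geometric mean via the mean of logs
    return 10.0 * np.log10(arithmetic / geometric)

def sf2(spectrum_db):
    """Kurtosis: centered moment of order 4 over the squared variance."""
    x = np.asarray(spectrum_db, dtype=float)
    c = x - x.mean()
    return np.mean(c ** 4) / np.mean(c ** 2) ** 2

peaky = np.full(512, 1e-6)
peaky[40] = 1.0                                    # one strong component
noisy = peaky + 1e-2                               # same spectrum with a white noise floor
print(sf1(peaky) > sf1(noisy))                     # True: noise flattens the spectrum
```

Computing the geometric mean through the mean of the logarithms avoids the underflow that a direct product of several hundred small components would cause.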

The diagram in Figure 10 presents the principle (valid for the two parameters above).

In the case of a comparison with the reference, a simple distance, such as a difference, is sufficient to detect the degradations. If no reference is available, peaks in the variation of the parameters must be detected in order to find degradations. This can be done using gray-level mathematical morphology (erosions and dilations), a classic image processing technique.
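One way of applying gray-level morphology to a one-dimensional sequence of parameter values is a top-hat transform: the sequence minus its opening (erosion followed by dilation). The text only names erosions and dilations, so the choice of the top-hat, the structuring element size, the threshold and the function names are assumptions of this sketch.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _sliding(values, size, reducer, pad_value):
    """Apply `reducer` over a centered window of odd length `size`."""
    half = size // 2
    padded = np.pad(values, half, mode="constant", constant_values=pad_value)
    return reducer(sliding_window_view(padded, size), axis=1)

def erosion(values, size):
    return _sliding(values, size, np.min, np.inf)

def dilation(values, size):
    return _sliding(values, size, np.max, -np.inf)

def detect_peaks(values, size=5, threshold=1.0):
    """Flag samples that stand out from the opened (peak-free) sequence."""
    v = np.asarray(values, dtype=float)
    opening = dilation(erosion(v, size), size)     # removes peaks narrower than `size`
    return (v - opening) > threshold               # top-hat, then thresholding

v = np.linspace(0.0, 1.0, 50)
v[20] += 5.0
print(np.flatnonzero(detect_peaks(v)))             # [20]
```

Because the opening follows slow variations of the parameter, only narrow excursions survive in the top-hat, which limits false detections when the signal content itself changes gradually.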

The advantages and limitations of these parameters are identical to those of the previous parameters: limited bit rate requirement, possible use without reference, and use of the correspondence curves to estimate the perceptual importance of the degradations.

In the context of monitoring a digital television broadcasting network, the reference audio signal corresponds to the signal at the input of the broadcasting network. The reference parameters are calculated on this signal, then transmitted via a specific data channel to the desired measurement point. It is at this point that the same parameters, needed for the comparison, are calculated in order to establish the reduced-reference measurements. Measurements without reference are also calculated. If the reference parameters are not available (absent, incorrect, ...), these measurements are sufficient to detect the most significant errors; the subsystems shown dotted in Figure 11 are then not used. The measurements obtained without reference and those obtained with reduced reference (where they could be calculated) are used by a model to estimate the extent of the degradation introduced during broadcasting. The diagram in Figure 11 summarizes this embodiment.

Several measurement points can thus be established. Once these degradation estimates have been obtained, they can easily be transmitted to a network monitoring center, which gives an overview of network performance.

The same scheme as above can be used to monitor (with or without reference) the performance of radio broadcasting on the Internet. In this case, the data channel used to transport the reference parameters can be the network itself, which can also be used to return the estimated scores to the monitoring center. The reference signal corresponds to the signal sent by the server, and the degraded signal is the one decoded at the chosen measurement point. This can for example be used to choose the most appropriate server according to the connection location, by accessing data from a monitoring center. The diagram below (Figure 12) illustrates this embodiment in the case where the reference parameters are sent over the network and where the scores obtained use a specific transmission channel.

A method according to the invention is applicable whenever it is necessary to identify faults in an audio signal which has been transmitted by any broadcasting network (cable, satellite, wireless, Internet, DVB, DAB, etc.). The proposed method exploits two classes of techniques: those with reduced reference and those without reference. It is particularly advantageous when the bit rate available for transmitting the reference is limited.

Thus, this invention is applicable for operational purposes for metrology equipment and for supervision systems of audio signal distribution networks. One of its advantageous characteristics lies in the combination of the measurements carried out with and without reference. Finally, this invention corresponds to the requirements imposed in service quality management systems.

Claims

1. A method of qualitative evaluation of a digital audio signal, characterized in that it implements, in real time and in continuous time in successive time windows, the calculation of a quality indicator obtained solely from said digital audio signal and which consists of a vector associated with each time window.

2. Method according to claim 1, characterized in that said vector has a dimension at least one hundred times less than the number of audio samples of a time window, this dimension being for example between 1 and 10 and preferably between 1 and 5, and more particularly between 2 and 5.

3. Method according to one of claims 1 or 2, characterized in that the generation of a said quality indicator vector implements, for a reference audio signal and for the audio signal to be evaluated, the following steps:
a) calculate for each time window the power spectral density of the audio signal and apply to it a filter representative of the attenuation of the inner and middle ear to obtain a filtered spectral density,
b) calculate from the filtered spectral density the individual excitations using the frequency spreading function in the basilar scale,
c) determine from said individual excitations the compressed loudness using a function modeling the non-linear frequency sensitivity of the ear, to obtain basilar components,
d) separate the basilar components into classes, preferably into three classes, and calculate for each class a number C representing the sum of the frequencies of this class, said vector consisting of said numbers C,
e) calculate a distance between the vectors of the reference audio signal and of the audio signal to be evaluated associated with each time window to carry out an evaluation of the degradation of the audio signal.

4. Method according to one of claims 1 or 2, characterized in that the generation of a said quality indicator vector implements, for the reference audio signal and for the audio signal to be evaluated, the following steps:
a) calculate N coefficients of a prediction filter by autoregressive modeling,
b) determine in each time window the maximum of the residual obtained by difference between the signal predicted using the prediction filter and the audio signal, said maximum of the prediction residual constituting said quality indicator vector,
c) calculate a distance between said vectors of the reference audio signal and of the audio signal to be evaluated associated with each time window to carry out an evaluation of the degradation of the audio signal.

5. Method according to claim 1, characterized in that the generation of a said quality indicator vector implements, for the reference audio signal and for the audio signal to be evaluated, the following steps:
a) calculate for each time window the power spectral density of the audio signal and apply to it a filter representative of the attenuation of the inner and middle ear, to obtain a frequency spreading function in the basilar scale,
b) calculate individual excitations from the frequency spreading function in the basilar scale,
c) obtain from said individual excitations the compressed loudness using a function modeling the non-linear frequency sensitivity of the ear, to obtain basilar components,
d) calculate from said basilar components N' prediction coefficients of a prediction filter by autoregressive modeling,
e) generate for each time window a said quality indicator vector from only some of the N' prediction coefficients.

6. Method according to claim 5, characterized in that the quality indicator vector comprises between 5 and 10 of said prediction coefficients.

7. Method according to claim 1, characterized in that the generation of a said quality indicator vector implements, at least for the audio signal to be evaluated, the following steps:
a) calculate a temporal activity of the signal in each time window,
b) calculate a sliding average over N₁ successive values of the temporal activity,
c) keep the minimum value among M₁ successive values of the sliding average.

8. Method according to claim 7, characterized in that said quality indicator vector consists of said minimum value.

9. Method according to claim 7, characterized in that said quality indicator vector consists of a binary value resulting from the comparison of said minimum value with a given threshold.

10. Method according to one of claims 7 to 9, characterized in that it implements the calculation of a quality score by determining a cumulative time interval during which said minimum value is less than a given threshold S₁ and/or by determining the number of times per second where said minimum value is less than a given threshold S'₁.

11. Method according to one of claims 7 to 10, characterized in that said minimum values are generated both for the reference audio signal and for the audio signal to be evaluated and in that a quality vector is generated by comparison between the corresponding minimum values of the reference audio signal and the audio signal to be evaluated.

12. Method according to claim 1, characterized in that the generation of a said quality indicator vector implements, at least for the audio signal to be evaluated, the following steps:
f) calculate a temporal activity of the signal in each time window,
g) calculate a sliding average over N₂ successive values of the temporal activity,
h) keep the maximum value among M₂ successive values of the sliding average.

13. Method according to claim 12, characterized in that said quality indicator vector consists of said maximum value.

14. Method according to claim 12, characterized in that said quality indicator vector consists of a binary value resulting from the comparison of said maximum value with a given threshold S₂.

15. The method of claim 12, characterized in that a degradation indicator vector is generated by comparison between the maximum value obtained on the reference audio signal and the corresponding maximum value obtained on the audio signal to be evaluated.

16. Method according to claim 1, characterized in that the generation of a said quality indicator vector implements, at least for the audio signal to be evaluated, the calculation of the Fourier transform in successive blocks of N₃ samples constituting said time windows and the calculation of the minimum value of the spectrum in M₃ successive blocks, said minimum value of the spectrum constituting a quality indicator vector.

17. The method of claim 16, characterized in that it comprises a step of evaluating the introduction of noise into the audio signal to be evaluated by comparing the value of said minimum of the spectrum in M₃ successive blocks associated with the transmitted audio signal with the maximum value of the M₃ minima obtained in the same M₃ successive blocks associated with the reference audio signal.

18. The method of claim 16, characterized in that it comprises a step of evaluating the introduction of noise into the audio signal to be evaluated by comparing the value of said minimum of the spectrum in M₃ successive blocks with an average value of minima of the spectrum obtained in blocks prior to said M₃ successive blocks.

19. The method of claim 1, characterized in that it implements at least for the audio signal to be evaluated the calculation of a said quality indicator vector consisting of a spectrum flattening parameter which is the ratio between an arithmetic mean and a geometric mean of the components of the signal spectrum.

20. The method of claim 19, characterized in that it implements an indicator for detecting a degradation of the audio signal by the introduction of broadband noise, by comparing said spectrum flattening parameter between the reference audio signal and the audio signal to be evaluated.

21. Method according to one of the preceding claims, characterized in that the audio signal to be evaluated is a digitally transmitted audio signal.

22. Method according to one of the preceding claims, characterized in that the audio signal to be evaluated is a digital audio signal to which digital coding has been applied.

23. The method of claim 22, characterized in that said digital coding is a bit rate reduction coding.