EP1600042B1

EP1600042B1 - Method for the treatment of compressed sound data for spatialization

Info

Publication number: EP1600042B1
Application number: EP04712070A
Authority: EP
Inventors: Abdellatif Benjelloun Touimi; Marc Emerit; Jean-Marie Pernaux
Original assignee: France Telecom SA
Current assignee: Orange SA
Priority date: 2003-02-27
Filing date: 2004-02-18
Publication date: 2006-08-09
Anticipated expiration: 2024-02-18
Also published as: US20060198542A1; ATE336151T1; FR2851879A1; WO2004080124A1; EP1600042A1; DE602004001868D1; DE602004001868T2; ES2271847T3

Abstract

The invention relates to the treatment of sound data for spatialized restitution of acoustic signals. At least one first and one second series of weighting terms are obtained for each acoustic signal, said terms representing a direction of perception of said acoustic signal by a listener. The acoustic signals are then applied to at least two sets of filtering units, which are disposed in parallel, in order to provide at least one first and one second output signal (L,R), corresponding to a linear combination of signals provided by said filtering units, which are respectively weighted by the weighting terms of the first and second series. According to the invention, each acoustic signal to be treated is at least partially compression coded and is expressed in the form of a vector of sub-signals associated with respective frequency sub-bands. Matrix filtering applied to each vector is carried out by each filtering unit in the space of the frequential sub-bands.

Description

L'invention concerne un traitement de données sonores pour une restitution spatialisée de signaux acoustiques.The invention relates to a sound data processing for spatialized reproduction of acoustic signals.

L'apparition de nouveaux formats de codage de données sur les réseaux de télécommunications permet la transmission de scènes sonores complexes et structurées comprenant des sources sonores multiples. En général, ces sources sonores sont spatialisées, c'est-à-dire qu'elles sont traitées de manière à apporter un rendu final réaliste en terme de position des sources et d'effet de salle (réverbération). C'est le cas par exemple du codage selon la norme MPEG-4 qui permet de transmettre des scènes sonores complexes comprenant des sons compressés ou non, et des sons de synthèse, auxquels sont associés des paramètres de spatialisation (position, effet de la salle environnante). Cette transmission se fait sur des réseaux avec contraintes, et le rendu sonore dépend du type de terminal utilisé. Sur un terminal mobile de type PDA par exemple (pour "Personal Digital Assistant"), on utilisera de manière préférentielle un casque d'écoute. Les contraintes de ce type de terminaux (puissance de calcul, taille mémoire) rendent difficile l'implémentation de techniques de spatialisation du son.The advent of new data encoding formats over telecommunications networks allows the transmission of complex and structured sound scenes that include multiple sound sources. In general, these sound sources are spatialized, that is, they are processed so as to provide a realistic final rendering in terms of source position and room effect (reverberation). This is the case, for example, of coding according to the MPEG-4 standard which makes it possible to transmit complex sound scenes comprising compressed or uncompressed sounds, and synthetic sounds, which are associated with spatialization parameters (position, effect of the room surrounding). This transmission is done on networks with constraints, and the sound reproduction depends on the type of terminal used. For example, on a PDA-type mobile terminal (for "Personal Digital Assistant"), a headset will preferably be used. The constraints of this type of terminals (computing power, memory size) make it difficult to implement sound spatialization techniques.

La spatialisation sonore recouvre deux types de traitements différents. A partir d'un signal audio monophonique, on cherche à donner l'illusion à un auditeur que la ou les sources sonores sont à des positions bien précises de l'espace (que l'on souhaite pouvoir modifier en temps réel), et plongées dans un espace ayant des propriétés acoustiques particulières (réverbération, ou autres phénomènes acoustiques tels que l'occlusion). A titre d'exemple, sur des terminaux de télécommunication de type mobile, il est naturel d'envisager un rendu sonore avec un casque d'écoute stéréophonique. La technique de positionnement des sources sonores la plus efficace est alors la synthèse binaurale.Sound spatialization covers two different types of treatment. From a monophonic audio signal, we try to give the illusion to a listener that the sound source (s) are in good positions. accurate space (that we want to be able to change in real time), and immersed in a space with particular acoustic properties (reverberation, or other acoustic phenomena such as occlusion). For example, on mobile type telecommunication terminals, it is natural to consider a sound reproduction with a stereo headset. The most effective sound source positioning technique is then binaural synthesis.

Elle consiste, pour chaque source sonore, à filtrer le signal monophonique par des fonctions de transfert acoustiques, appelées HRTFs (de l'anglais "Head Related Transfer Functions"), qui modélisent les transformations engendrées par le torse, la tête et le pavillon de l'oreille de l'auditeur sur un signal provenant d'une source sonore. Pour chaque position de l'espace, on peut mesurer une paire de ces fonctions (une pour l'oreille droite, une pour l'oreille gauche). Les HRTFs sont donc des fonctions d'une position spatiale, plus particulièrement d'un angle d'azimut θ et d'un angle d'élévation (ϕ, et de la fréquence sonore f. On obtient alors, pour un sujet donné, une base de données de fonctions de transfert acoustiques de N positions de l'espace pour chaque oreille, dans lesquelles un son peut être "placé" (ou "spatialisé" selon la terminologie utilisée ci-après).It consists, for each sound source, in filtering the monophonic signal by acoustic transfer functions, called HRTFs (of the "Head Related Transfer Functions"), which model the transformations generated by the torso, the head and the horn. the listener's ear on a signal from a sound source. For each position of space, one can measure a pair of these functions (one for the right ear, one for the left ear). The HRTFs are thus functions of a spatial position, more particularly of an azimuth angle θ and of an elevation angle (φ, and of the sound frequency f, which gives a given subject a database of acoustic transfer functions of N space positions for each ear, in which a sound can be "placed" (or "spatialized" according to the terminology used hereinafter).

On indique qu'un traitement de spatialisation similaire consiste en une synthèse dite "transaurale", dans laquelle on prévoit simplement plus de deux haut-parleurs dans un dispositif de restitution (qui se présente alors sous une forme différente d'un casque à deux oreillettes gauche et droite).It is indicated that a similar spatialization treatment consists of a so-called "transaural" synthesis , in which more than two loudspeakers are simply provided in one playback device (which is then in a different form of a helmet with two left and right atria).

De manière classique, la mise en oeuvre de cette technique se fait sous forme dite "bicanale" (traitement représenté schématiquement sur la figure 1 relative à l'art antérieur). Pour chaque source sonore à positionner selon le couple d'angles azimutal et d'élévation [θ,ϕ], on filtre le signal de la source par la fonction HRTF de l'oreille gauche et par la fonction HRTF de l'oreille droite. Les deux canaux gauche et droit délivrent des signaux acoustiques qui sont alors diffusés aux oreilles de l'auditeur avec un casque d'écoute stéréophonique. Cette synthèse binaurale bicanale est de type dit ci-après "statique", car dans ce cas, les positions des sources sonores n'évoluent pas dans le temps.In a conventional manner, the implementation of this technique is in the so-called "two-channel" form (treatment shown schematically in FIG. 1 relating to the prior art). For each sound source to be positioned according to the pair of azimuthal and elevation angles [θ, φ], the signal of the source is filtered by the HRTF function of the left ear and by the HRTF function of the right ear. The two left and right channels deliver acoustic signals that are then broadcast to the listener's ears with a stereo headset. This two-channel binaural synthesis is of type hereinafter "static", because in this case, the positions of the sound sources do not evolve in time.

Si l'on souhaite, au contraire, faire varier les positions des sources sonores dans l'espace au cours du temps (synthèse "dynamique"), les filtres utilisés pour modéliser les HRTFs (oreille gauche et oreille droite) doivent être modifiés. Toutefois, ces filtres étant pour la plupart du type à réponse impulsionnelle finie (FIR) ou à réponse impulsionnelle infinie (IIR), des problèmes de discontinuités des signaux de sortie gauche et droit apparaissent, entraînant des "clicks" audibles. La solution technique classiquement employée pour pallier ce problème est de faire tourner deux jeux de filtres binauraux en parallèle. Le premier jeu simule une position [θ1,ϕ1] à l'instant t1, le second une position [θ2, ϕ2] à l'instant t2. Le signal donnant l'illusion d'un déplacement entre les positions aux instants t1 et t2 est alors obtenu par un fondu enchaîné des signaux gauche et droit résultant des processus de filtrage pour la position [θ1,ϕ1] et pour la position [θ2, ϕ2] . Ainsi, la complexité du système de positionnement des sources sonores est alors multipliée par deux (deux positions à deux instants) par rapport au cas statique.If, on the contrary, it is desired to vary the positions of the sound sources in space over time ("dynamic" synthesis), the filters used to model the HRTFs (left ear and right ear) must be modified. However, since these filters are mostly of the finite impulse response (FIR) or infinite impulse response (IIR) type, discontinuity problems in the left and right output signals appear, causing audible clicks . The technical solution conventionally used to overcome this problem is to run two sets of binaural filters in parallel. The first game simulates a position [θ1, φ1] at time t1, the second a position [θ2, φ2] at the instant t2. The signal giving the illusion of a displacement between the positions at the instants t1 and t2 is then obtained by a fade-out of the left and right signals resulting from the filtering processes for the position [θ1, φ1] and for the position [θ2, φ2]. Thus, the complexity of the positioning system of sound sources is then multiplied by two (two positions at two instants) compared to the static case.

Afin de pallier ce problème, des techniques de décomposition linéaire des HRTFs ont été proposées (traitement représenté schématiquement sur la figure 2 relative à l'art antérieur). L'un des avantages de ces techniques est qu'elles permettent une mise en oeuvre dont la complexité dépend beaucoup moins du nombre total de sources à positionner dans l'espace. En effet, ces techniques permettent de décomposer les HRTFs sur une base de fonctions communes à toutes les positions de l'espace, et ne dépendant donc que de la fréquence, ce qui permet de réduire le nombre de filtres nécessaires. Ainsi, ce nombre de filtres est fixe, indépendamment du nombre de sources et/ou du nombre de positions de sources à prévoir. L'ajout d'une source sonore supplémentaire n'ajoute alors que des opérations de multiplication par un jeu de coefficients de pondération et par un retard τ_i, ces coefficients et ce retard ne dépendant que de la position [θ, ϕ] . Aucun filtre supplémentaire n'est donc nécessaire.In order to overcome this problem, linear decomposition techniques for HRTFs have been proposed (treatment shown schematically in FIG. 2 relating to the prior art). One of the advantages of these techniques is that they allow an implementation whose complexity depends much less on the total number of sources to be positioned in space. Indeed, these techniques make it possible to break down the HRTFs on the basis of functions common to all the positions of the space, and thus only dependent on the frequency, which makes it possible to reduce the number of necessary filters. Thus, this number of filters is fixed, regardless of the number of sources and / or the number of source positions to be provided. The addition of an additional sound source then adds only multiplication operations by a set of weighting coefficients and by a delay τ _i , these coefficients and this delay being dependent only on the position [θ, φ]. No additional filter is needed.

Ces techniques de décomposition linéaire ont aussi un intérêt dans le cas de la synthèse binaurale dynamique (i.e. lorsque la position des sources sonores varie au cours du temps). En effet, dans cette configuration, on ne fait plus varier les coefficients des filtres, mais les valeurs des coefficients de pondération et des retards en fonction uniquement de la position. Le principe décrit ci-dessus de décomposition linéaire des filtres de rendu sonore se généralise à d'autres approches, comme on le verra ci-après.These linear decomposition techniques are also of interest in the case of dynamic binaural synthesis (ie when the position of sound sources varies course of time). In fact, in this configuration, the coefficients of the filters are no longer varied, but the values of the weighting coefficients and delays as a function solely of the position. The principle described above of linear decomposition of the sound rendering filters is generalized to other approaches, as will be seen below.

Par ailleurs, dans les différents services de communication de groupe (téléconférence, audioconférence, visioconférence, ou autre) ou de communication "en flux continu" (de l'anglais "STREAMING"), pour adapter un débit binaire à la largeur de la bande passante fournie par un réseau, les flux audio et/ou de parole sont transmis sous un format codé compressé. On ne considère ci-après que des flux initialement compressés par des codeurs de type fréquentiel (ou par transformée en fréquence) tels que ceux opérant selon la norme MPEG-1 (Layer I-II-III), la norme MPEG-2/4 AAC, la norme MPEG-4 TwinVQ, la norme Dolby AC-2, la norme Dolby AC-3, ou encore une norme UIT-T G.722.1 en codage de parole, ou encore le procédé de codage TDAC de la Demanderesse. L'utilisation de tels codeurs revient à effectuer d'abord une transformation temps/fréquence sur des blocs du signal temporel. Les paramètres obtenus sont ensuite quantifiés et codés pour être transmis dans une trame avec d'autres informations complémentaires nécessaires pour le décodage. Cette transformation temps/fréquence peut prendre la forme d'un banc de filtres en sous-bandes de fréquences ou encore une transformée de type MDCT (pour "Modified Discrete Cosinus Transform"). Ci-après, on désignera par les mêmes termes "domaine des sous-bandes" un domaine défini dans un espace de sous-bandes fréquentielles, un domaine d'un espace temporel transformé en fréquence ou un domaine fréquentiel.Moreover, in the various group communication services (teleconferencing, audio conferencing, videoconferencing, or other) or "streaming" communication (of the English "STREAMING"), to adapt a bit rate to the bandwidth provided by a network, the audio and / or speech streams are transmitted in a compressed coded format. Only flows initially compressed by frequency-domain (or frequency-converted) encoders such as those operating according to the MPEG-1 standard (Layer I-II-III), the MPEG-2/4 standard, are considered below. AAC, the MPEG-4 TwinVQ standard, the Dolby AC-2 standard, the Dolby AC-3 standard, or a ITU-T G.722.1 standard in speech coding, or the Applicant's TDAC coding method. The use of such encoders amounts to first performing a time / frequency transformation on blocks of the time signal. The resulting parameters are then quantized and encoded to be transmitted in a frame with other complementary information needed for decoding. This time / frequency transformation can take the form of a bank of filters in frequency sub-bands or a transform of the MDCT type (for "Modified Discrete Cosine Transform"). Hereinafter, we will designate by the same terms "subband domain" means a domain defined in a frequency subband space, a frequency-transformed time domain domain, or a frequency domain.

Pour effectuer la spatialisation sonore sur de tels flux, la méthode classique consiste à faire d'abord un décodage, réaliser le traitement de spatialisation sonore sur les signaux temporels, puis recoder les signaux qui en résultent, pour une transmission vers un terminal de restitution. Cette succession d'étapes, fastidieuse, est souvent très coûteuse en terme de puissance de calcul, de la mémoire nécessaire pour le traitement et du délai algorithmique introduit. Elle est donc souvent inadaptée aux contraintes imposées par les machines où s'effectue le traitement et aux contraintes de communication.To perform sound spatialization on such streams, the conventional method consists in first decoding, performing the sound spatialization processing on the time signals, and then recoding the resulting signals, for transmission to a rendering terminal. This sequence of steps, tedious, is often very expensive in terms of computing power, the memory required for processing and algorithmic delay introduced. It is therefore often unsuited to the constraints imposed by the machines where the processing and the communication constraints are carried out.

Par exemple, le document US-6,470,087 décrit un dispositif pour la restitution d'un signal acoustique multicanal comprimé sur deux haut-parleurs. Tous les calculs sont faits dans toute la bande de fréquence du signal d'entrée, qui de ce fait doit être complètement décodé.For example, document US Pat. No. 6,470,087 describes a device for rendering a multichannel acoustic signal compressed on two loudspeakers. All calculations are done in the entire frequency band of the input signal, which therefore must be completely decoded.

La présente invention vient améliorer la situation.The present invention improves the situation.

L'un des buts de la présente invention est de proposer un procédé de traitement de données sonores regroupant les opérations de codage/décodage en compression des flux audio et de spatialisation desdits flux.One of the aims of the present invention is to propose a sound data processing method comprising the coding / decoding operations in compression of the audio streams and the spatialization of said streams.

Un autre but de la présente invention est de proposer un procédé de traitement de données sonores, par spatialisation, qui s'adapte à un nombre variable (dynamiquement) de sources sonores à positionner.Another object of the present invention is to propose a sound data processing method, by spatialization, which adapts to a variable number (dynamically) of sound sources to be positioned.

Un but général de la présente invention est de proposer un procédé de traitement de données sonores, par spatialisation, permettant une large diffusion des données sonores spatialisées, en particulier une diffusion pour le grand public, les dispositifs de restitution étant simplement équipés d'un décodeur des signaux reçus et de haut-parleurs de restitution.A general object of the present invention is to propose a method for processing sound data, for example spatialization, allowing a wide dissemination of spatialized sound data, in particular a broadcast for the general public, the playback devices being simply equipped with a decoder of received signals and playback speakers.

Elle propose à cet effet un procédé de traitement de données sonores, pour une restitution spatialisée de signaux acoustiques, dans lequel :

a) on obtient, pour chaque signal acoustique, au moins un premier jeu et un second jeu de termes pondérateurs, représentatifs d'une direction de perception dudit signal acoustique par un auditeur ;
b) et on applique à au moins deux jeux d'unités de filtrage, disposées en parallèle, lesdits signaux acoustiques, pour délivrer au moins un premier signal de sortie et un second signal de sortie correspondant chacun à une combinaison linéaire des signaux acoustiques pondérés par l'ensemble des termes pondérateurs respectivement du premier jeu et du second jeu et filtrés par lesdites unités de filtrage.

To this end, it proposes a method for processing sound data, for a spatialized reproduction of acoustic signals, in which:

a) there is obtained, for each acoustic signal, at least a first set and a second set of weighting terms, representative of a direction of perception of said acoustic signal by a listener;
b) and applying to at least two sets of filter units, arranged in parallel, said acoustic signals, for outputting at least a first output signal and a second output signal each corresponding to a linear combination of the weighted acoustic signals; the set of weighting terms respectively of the first set and the second set and filtered by said filtering units.

Chaque signal acoustique à l'étape a) du procédé au sens de l'invention est au moins partiellement codé en compression et s'exprime sous la forme d'un vecteur de sous-signaux associés à des sous-bandes de fréquences respectives, et chaque unité de filtrage est agencée pour effectuer un filtrage matriciel appliqué à chaque vecteur, dans l'espace des sous-bandes fréquentielles.Each acoustic signal in step a) of the method according to the invention is at least partially coded in compression and is expressed in the form of a subsignal vector associated with respective frequency sub-bands, and each filtering unit is arranged to perform a matrix filtering applied to each vector, in the frequency subband space.

Avantageusement, chaque filtrage matriciel est obtenu par conversion, dans l'espace des sous-bandes fréquentielles, d'un filtre à réponse impulsionnelle (finie ou infinie) défini dans l'espace temporel. Un tel filtre à réponse impulsionnelle est préférentiellement obtenu par détermination d'une fonction de transfert acoustique dépendant d'une direction de perception d'un son et de la fréquence de ce son.Advantageously, each matrix filtering is obtained by converting, in the space of the frequency sub-bands, an impulse response filter (finite or infinite) defined in the time space. Such an impulse response filter is preferably obtained by determining an acoustic transfer function depending on a direction of perception of a sound and the frequency of this sound.

Selon une caractéristique avantageuse de l'invention, ces fonctions de transfert s'expriment par une combinaison linéaire de termes dépendant de la fréquence et pondérés par des termes dépendant de la direction, ce qui permet, comme indiqué ci-avant, d'une part, de traiter un nombre variable de signaux acoustiques à l'étape a) et, d'autre part, de faire varier dynamiquement la position de chaque source dans le temps. En outre, une telle expression des fonctions de transfert "intègre" le retard interaural qui est classiquement appliqué à l'un des signaux de sortie, par rapport à l'autre, avant la restitution, dans les traitements binauraux. A cet effet, on prévoit des matrices de filtres de gains associés à chaque signal.According to an advantageous characteristic of the invention, these transfer functions are expressed by a linear combination of terms dependent on the frequency and weighted by terms dependent on the direction, which allows, as indicated above, on the one hand , process a variable number of acoustic signals in step a) and, secondly, dynamically vary the position of each source in time. In addition, such an expression of the transfer functions "integrates" the interaural delay which is conventionally applied to one of the output signals, with respect to the other, before the restitution, in binaural processing. For this purpose, gain filter matrices associated with each signal are provided.

Ainsi, lesdits premier et second signaux de sortie étant préférentiellement destinés à être décodés en des premier et second signaux de restitution, la combinaison linéaire précitée tient déjà compte d'un décalage temporel entre ces premier et second signaux de restitution, de façon avantageuse.Thus, said first and second output signals being preferentially intended to be decoded into first and second reproduction signals, the aforementioned linear combination already takes into account a time shift between these first and second reproduction signals, advantageously.

Finalement, entre l'étape de réception/décodage des signaux reçus par un dispositif de restitution et l'étape de restitution elle-même, on peut ne prévoir aucune étape supplémentaire de spatialisation sonore, ce traitement de spatialisation étant complètement effectué en amont et directement sur des signaux codés.Finally, between the reception / decoding step of the signals received by a rendering device and the rendering step itself, no additional step of sound spatialization can be envisaged, this spatialization processing being performed completely upstream and directly on coded signals.

Selon l'un des avantages que procure la présente invention, l'association des techniques de décomposition linéaire des HRTFs aux techniques de filtrage dans le domaine des sous-bandes permet de profiter des avantages des deux techniques pour arriver à des systèmes de spatialisation sonore à faible complexité et à mémoire réduite pour des signaux audio codés multiples.According to one of the advantages afforded by the present invention, the combination of linear decomposition techniques of the HRTFs with filtering techniques in the field of the sub-bands makes it possible to take advantage of the advantages of the two techniques to arrive at sound spatialization systems. Low complexity and reduced memory for multiple coded audio signals.

En effet, dans une architecture "bicanale" classique, le nombre de filtres à utiliser est fonction du nombre de sources à positionner. Comme indiqué ci-avant, ce problème ne se retrouve pas dans une architecture basée sur la décomposition linéaire des HRTFs. Cette technique est donc préférable en termes de puissance de calcul, mais aussi d'espace mémoire nécessaire au stockage des filtres binauraux. Enfin, cette architecture permet de gérer de manière optimale la synthèse binaurale dynamique, car elle permet d'effectuer le "fading" entre deux instants t1 et t2 sur des coefficients qui ne dépendent que de la position, et ne nécessite donc pas deux jeux de filtres en parallèle.Indeed, in a conventional "dual channel " architecture, the number of filters to use is a function of the number of sources to be positioned. As mentioned above, this problem is not found in an architecture based on the linear decomposition of HRTFs. This technique is therefore preferable in terms of computing power, but also memory space required for binaural filters storage. Finally, this architecture makes it possible to optimally manage the dynamic binaural synthesis, because it makes it possible to "fading" between two instants t1 and t2 on coefficients that depend only on the position, and therefore does not require two sets of filters in parallel.

Selon un autre avantage que procure la présente invention, le filtrage direct des signaux dans le domaine codé permet l'économie d'un décodage complet par flux audio avant de procéder à la spatialisation des sources, ce qui implique un gain considérable en complexité.According to another advantage provided by the present invention, the direct filtering of the signals in the coded domain allows the economy of a complete decoding by audio stream before proceeding to the spatialization of the sources, which implies a considerable gain in complexity.

Selon un autre avantage que procure la présente invention, la spatialisation sonore de flux audio peut intervenir à différents points d'une chaîne de transmission (serveurs, noeuds du réseau ou terminaux). La nature de l'application et l'architecture de la communication utilisée peuvent favoriser un cas ou un autre. Ainsi, dans un contexte de téléconférence, le traitement de spatialisation est préférentiellement effectué au niveau des terminaux dans une architecture décentralisée et, au contraire, au niveau du pont audio (ou MCU pour "Multipoint Control Unit") dans une architecture centralisée. Pour des applications de "streaming" audio, notamment sur des terminaux mobiles, la spatialisation peut être réalisée soit dans le serveur, soit dans le terminal, ou encore lors de la création de contenu. Dans ces différents cas, une diminution de la complexité de traitement et aussi de la mémoire nécessaire pour le stockage des filtres HRTF est toujours appréciée. Par exemple, pour des terminaux mobiles (téléphones portables de seconde et troisième générations, PDA, ou micro-ordinateurs de poche) ayant des contraintes fortes en terme de capacité de calcul et de taille mémoire, on prévoit préférentiellement un traitement de spatialisation directement au niveau d'un serveur de contenus.According to another advantage provided by the present invention, the sound spatialization of audio streams can occur at different points of a transmission chain (servers, network nodes or terminals). The nature of the application and the architecture of the communication used may favor one case or another. Thus, in a teleconference context, the spatialization processing is preferably performed at the terminals in a decentralized architecture and, conversely, at the audio bridge (or MCU for "Multipoint Control Unit") in a centralized architecture. For "streaming" audio applications, especially on mobile terminals, the spatialization can be performed either in the server or in the terminal, or during the creation of content. In these different cases, a reduction in the processing complexity and also the memory required for storing the HRTF filters is still appreciated. For example, for mobile terminals (second- and third-generation mobile phones, PDAs, or hand-held microcomputers) having strong constraints in terms of computing capacity and memory size, preferably spatialization processing is provided directly at the level of the device. a content server.

La présente invention peut trouver aussi des applications dans le domaine de la transmission de flux audio multiples inclus dans des scènes sonores structurées, comme le prévoit la norme MPEG-4.The present invention can also find applications in the field of the transmission of multiple audio streams included in structured sound scenes, as provided by the MPEG-4 standard.

D'autres caractéristiques, avantages et applications de l'invention apparaîtront à l'examen de la description détaillée ci-après, et des dessins annexés sur lesquels :

la figure 1 illustre schématiquement un traitement correspondant à une synthèse binaurale "bicanale" statique pour des signaux audionumériques temporels S_i, de l'art antérieur ;
la figure 2 représente schématiquement une mise en oeuvre de la synthèse binaurale basée sur la décomposition linéaire des HRTFs pour des signaux audionumériques temporels non codés, de l'art antérieur ;
la figure 3 représente schématiquement un système, au sens de l'art antérieur, de spatialisation binaurale de N sources audio initialement codées, puis complètement décodées pour le traitement de spatialisation dans le domaine temporel et ensuite recodées pour une transmission à un ou plusieurs dispositifs de restitution, ici à partir d'un serveur ;
la figure 4 représente schématiquement un système, au sens de la présente invention, de spatialisation binaurale de N sources audio partiellement décodées pour le traitement de spatialisation dans le domaine des sous-bandes et ensuite recodées complètement pour la transmission à un ou plusieurs dispositifs de restitution, ici à partir d'un serveur ;
la figure 5 représente schématiquement un traitement de spatialisation sonore dans le domaine des sous-bandes, au sens de l'invention, basé sur la décomposition linéaire des HRTFs dans le contexte binaural ;
la figure 6 représente schématiquement un traitement d'encodage/décodage pour spatialisation, mené dans le domaine des sous-bandes et basé sur une décomposition linéaire de fonctions de transfert dans le contexte ambisonique, dans une variante de réalisation de l'invention ;
la figure 7 représente schématiquement un traitement de spatialisation binaurale de N sources audio codées, au sens de la présente invention, effectué auprès d'un terminal de communication, selon une variante du système de la figure 4 ;
la figure 8 représente schématiquement une architecture d'un système de téléconférence centralisée, avec un pont audio entre une pluralité de terminaux ; et
la figure 9 représente schématiquement un traitement, au sens de la présente invention, de spatialisation de (N-1) sources audio codées parmi N sources en entrée d'un pont audio d'un système selon la figure 8, effectué auprès de ce pont audio, selon une variante du système de la figure 4.

Other characteristics, advantages and applications of the invention will appear on examining the detailed description below, and the attached drawings in which:

FIG. 1 diagrammatically illustrates a processing corresponding to a binaural "two- channel " static synthesis for temporal digital audio signals S _i , of the prior art;
FIG. 2 diagrammatically represents an implementation of the binaural synthesis based on the linear decomposition of the HRTFs for uncoded temporal digital audio signals of the prior art;
FIG. 3 diagrammatically represents a system, in the sense of the prior art, of binaural spatialisation of N initially coded audio sources, then completely decoded for spatialization processing in the time domain, and then recoded for transmission to one or more transmission devices. restitution, here from a server;
FIG. 4 diagrammatically represents a system, within the meaning of the present invention, of binaural spatialisation of N partially decoded audio sources for spatialization processing in the subband domain and then completely recoded for transmission to one or more rendering devices. , here from a server;
FIG. 5 schematically represents a sound spatialization processing in the field of the subbands, at the sense of the invention, based on the linear decomposition of HRTFs in the binaural context;
FIG. 6 schematically represents an encoding / decoding process for spatialization, carried out in the subband domain and based on a linear decomposition of transfer functions in the ambisonic context, in an alternative embodiment of the invention;
FIG. 7 diagrammatically represents a binaural spatialization processing of N coded audio sources, within the meaning of the present invention, carried out with a communication terminal, according to a variant of the system of FIG. 4;
Figure 8 schematically shows an architecture of a centralized teleconferencing system, with an audio bridge between a plurality of terminals; and
FIG. 9 schematically represents a processing, within the meaning of the present invention, of spatialization of (N-1) audio sources encoded among N input sources of an audio bridge of a system according to FIG. 8, carried out with this bridge according to a variant of the system of FIG. 4.

On se réfère tout d'abord à la figure 1 pour décrire un traitement classique de synthèse binaurale "bicanale". Ce traitement consiste à filtrer le signal des sources (Si) que l'on souhaite positionner à une position choisie dans l'espace par les fonctions de transfert acoustiques gauche (HRTF_1) et droite (HRTF_r) correspondant à la direction <θi,ϕi) appropriée. On obtient deux signaux qui sont alors additionnés aux signaux gauches et droits résultant de la spatialisation des autres sources, pour donner les signaux globaux L et R diffusés aux oreilles gauche et droite d'un auditeur. Le nombre de filtres nécessaires est alors de 2.N pour une synthèse binaurale statique et de 4.N pour une synthèse binaurale dynamique, N étant le nombre de flux audio à spatialiser.Firstly, reference is made to FIG. 1 to describe a "binational" binaural synthesis treatment. This treatment consists in filtering the signal of the sources (Si) that one wishes to position at a position chosen in space by the acoustic transfer functions left (HRTF_1) and right (HRTF_r) corresponding to the direction <θi, φi) appropriate. Two signals are obtained which are then added to the left and right signals resulting from the spatialization of the other sources, to give the global signals L and R diffused to the left and right ears of a listener. The number of necessary filters is then 2.N for static binaural synthesis and 4.N for dynamic binaural synthesis, where N is the number of audio streams to spatialize.

On se réfère maintenant à la figure 2 pour décrire un traitement classique de synthèse binaurale basée sur la décomposition linéaire des HRTFs. Ici, chaque filtre HRTF est d'abord décomposé en un filtre à phase minimale, caractérisé par son module, et en un retard pur τ_i. Les dépendances spatiales et fréquentielles des modules des HRTFs sont séparées grâce à une décomposition linéaire. Ces modules dés fonctions de transfert HRTFs s'écrivent alors comme une somme de fonctions spatiales C _n (θ, ϕ) et de filtres de reconstruction L_n(f), comme exprimé ci-après : $| HRTF (θ, ϕ, f) | = \sum_{n = 1}^{P} C_{n} (θ, ϕ) . L_{n} (f)$

Reference is now made to FIG. 2 to describe a conventional binaural synthesis treatment based on the linear decomposition of HRTFs. Here, each HRTF filter is first decomposed into a minimum phase filter, characterized by its modulus, and a pure delay τ _i . The spatial and frequency dependencies of the HRTFs modules are separated by a linear decomposition. These modules of transfer functions HRTFs are then written as a sum of spatial functions C _n (θ, φ) and of reconstruction filters L _n (f), as expressed hereafter:

| HRTF (θ, φ, f) | = Σ_{not = 1}^{P} {VS}_{not} (θ, φ) . {The}_{not} (f)

Chaque signal d'une source S_i à spatialiser (i=1,...,N) est pondéré par des coefficients C_ni(θ, ϕ) (n=1,..., P) issus de la décomposition linéaire des HRTFs. Ces coefficients ont pour particularité de ne dépendre que de la position [θ, ϕ] où l'on souhaite placer la source, et non de la fréquence f. Le nombre de ces coefficients dépend du nombre P de vecteurs de base que l'on a conservé pour la reconstruction. Les N signaux de toutes les sources pondérés par le coefficient "directionnel" C_ni sont alors additionnés (pour le canal droit et le canal gauche, séparément), puis filtrés par le filtre correspondant au nième vecteur de base. Ainsi, contrairement à la synthèse binaurale "bicanale", l'ajout d'une source supplémentaire ne nécessite pas l'ajout de deux filtres additionnels (souvent de type FIR ou IIR). Les P filtres de base sont en effet partagés par toutes les sources présentes. Cette mise en oeuvre est dite "multicanale". De plus, dans le cas de la synthèse binaurale dynamique, il est possible de faire varier les coefficients C_ni (θ, ϕ) sans apparition de clicks en sortie du dispositif. Dans ce cas, seulement 2.P filtres sont nécessaires, alors que 4.N filtres étaient nécessaires pour la synthèse bicanale.Each signal of a source S _i to be spatialised (i = 1, ..., N) is weighted by coefficients C _ni (θ, φ) (n = 1, ..., P) resulting from the linear decomposition of HRTFs. These coefficients have the particularity of being dependent only on the position [ θ, φ ] where it is desired to place the source, and not on the frequency f. The number of these coefficients depends on the number P of basic vectors that has been preserved for the reconstruction. The N signals of all the sources weighted by the "directional" coefficient C _ni are then summed (for the right channel and the left channel, separately), then filtered by the filter corresponding to the nth base vector. Thus, unlike binaural synthesis "two-channel", the addition of an additional source does not require the addition of two additional filters (often type FIR or IIR). The basic P filters are indeed shared by all the sources present. This implementation is called "multichannel". Moreover, in the case of dynamic binaural synthesis, it is possible to vary the coefficients C _ni (θ, φ) without appearance of clicks at the device output. In this case, only 2.P filters are needed, whereas 4.N filters were needed for two-channel synthesis.

Sur la figure 2, les coefficients C_ni correspondent aux coefficients directionnels pour la source i à la position (θi,ϕi) et pour le filtre de reconstruction n. On les note C pour la voie gauche (L) et D pour la voie droite (R). On indique que le principe de traitement de la voie droite R est le même que celui de la voie gauche L. Toutefois, les flèches en traits pointillés pour le traitement de la voie droite n'ont pas été représentées par souci de clarté du dessin. Entre les deux lignes verticales en trait discontinu de la figure 2, on définit alors un système noté I, du type représenté sur la figure 3.In FIG. 2, the coefficients C _ni correspond to the directional coefficients for the source i at the position (θi, φi) and for the reconstruction filter n. They are noted C for the left channel (L) and D for the right channel (R). It is indicated that the processing principle of the right channel R is the same as that of the left channel L. However, the dotted line arrows for the treatment of the right channel have not been shown for the sake of clarity of the drawing. Between the two vertical lines in dashed line of Figure 2, then defines a system noted I, of the type shown in Figure 3.

Toutefois, avant de se reporter à la figure 3, on indique que différentes méthodes ont été proposées pour déterminer les fonctions spatiales et les filtres de reconstruction. Une première méthode est basée sur une décomposition dite de Karhunen-Loeve et est décrite notamment dans le document WO94/10816. Une autre méthode repose sur l'analyse en composantes principales des HRTFs et est décrite dans WO96/13962. Le document FR-2782228 plus récent décrit aussi une telle mise en oeuvre.However, before referring to FIG. 3, it is indicated that different methods have been proposed for determining the spatial functions and the reconstruction filters. A first method is based on a so-called Karhunen-Loeve decomposition and is described in particular in WO94 / 10816. Another method is based on principal component analysis of HRTFs and is described in WO96 / 13962. The document FR-2782228 more recent also describes such an implementation.

Dans le cas où un traitement de spatialisation de ce type se fait au niveau du terminal de communication, une étape de décodage des N signaux est nécessaire avant le traitement de spatialisation proprement dit. Cette étape demande des ressources de calcul considérables (ce qui est problématique sur les terminaux de communication actuels notamment de type portable). Par ailleurs, cette étape entraîne un délai sur les signaux traités, ce qui nuit à l'interactivité de la communication. Si la scène sonore transmise comprend un grand nombre de sources (N), l'étape de décodage peut en fait devenir plus coûteuse en ressources de calcul que l'étape de spatialisation sonore proprement dite. En effet, comme indiqué ci-avant, le coût de calcul de la synthèse binaurale "multicanale" ne dépend que très peu du nombre de sources sonores à spatialiser.In the case where a spatialization processing of this type is done at the communication terminal, a step of decoding the N signals is necessary before the actual spatialization processing. This step requires considerable computing resources (which is problematic on current communication terminals, particularly of the portable type). Moreover, this step causes a delay on the processed signals, which interferes with the interactivity of the communication. If the transmitted sound scene comprises a large number of sources (N), the decoding step may in fact become more expensive in computing resources than the sound spatialization step itself. Indeed, as indicated above, the calculation cost of binaural synthesis "multichannel" depends very little on the number of sound sources to spatialize.

Le coût de calcul de l'opération de spatialisation des N flux audio codés (dans la synthèse multicanale de la figure 2) peut donc se déduire des étapes suivantes (pour la synthèse de l'un des deux canaux de rendu gauche ou droit) :

décodage (pour N signaux),
application du retard interaural τ_i,
multiplication par les gains positionnels C_ni (PxN gains pour l'ensemble des N signaux),
sommation des N signaux pour chaque filtre de base d'indice n,
filtrage des P signaux par les filtres de base,
et sommation des P signaux de sortie des filtres de base.

The cost of calculating the spatialization operation of the N coded audio streams (in the multichannel synthesis of FIG. 2) can thus be deduced from the following steps (for the synthesis of one of the two left or right rendering channels):

decoding (for N signals),
application of the interaural delay τ _i ,
multiplication by positional gains C _ni (PxN gains for all N signals),
summation of the N signals for each basic filter of index n,
filtering the P signals by the basic filters,
and summing the P output signals of the basic filters.

Dans le cas où la spatialisation ne se fait pas au niveau d'un terminal mais au niveau d'un serveur (cas de la figure 3), ou encore dans un noeud d'un réseau de communication (cas d'un pont audio en téléconférence), il faut en plus rajouter une opération de codage complet du signal de sortie.In the case where the spatialization is not done at the level of a terminal but at the level of a server (case of Figure 3), or in a node of a communication network (case of an audio bridge in teleconference), it is necessary to add a complete coding operation of the output signal.

En se référant à la figure 3, la spatialisation de N sources sonores (faisant par exemple partie d'une scène sonore complexe de type MPEG4) nécessite donc :

un décodage complet des N sources audio S₁, ..., S_i, ..., S_N codées en entrée du système représenté (noté "Système I") pour obtenir N flux audio décodés, correspondant par exemple à des signaux PCM (pour "Pulse Code Modulation"),
un traitement de spatialisation dans le domaine temporel ("Système I") pour obtenir deux signaux spatialisés L et R,
et ensuite un recodage complet sous forme de canaux gauche et droit L et R, véhiculés dans le réseau de communication pour être reçus par un ou plusieurs dispositifs de.restitution.

Referring to FIG. 3, the spatialization of N sound sources (for example part of a complex sound scene of the MPEG4 type) therefore requires:

a complete decoding of the N audio sources S ₁ ,..., S _i ,..., S _N encoded at the input of the system represented (denoted "System I") to obtain N decoded audio streams, corresponding, for example, to PCM signals (for " Pulse Code Modulation" ),
temporal spatialization processing ("System I") to obtain two spatialized signals L and R,
and then a complete recoding in the form of left and right channels L and R, conveyed in the communication network to be received by one or more restitution devices.

Ainsi, le décodage des N flux codés est nécessaire avant l'étape de spatialisation des sources sonores, ce qui entraîne une augmentation du coût de calcul et l'ajout d'un délai dû au traitement du décodeur. On indique que les sources audio initiales sont généralement stockées directement sous format codé, dans les serveurs de contenus actuels.Thus, the decoding of N coded streams is necessary before the step of spatialization of the sound sources, which leads to an increase in the calculation cost and the addition of a delay due to the processing of the decoder. It indicates that the original audio sources are usually stored directly in coded format, in the current content servers.

On indique en outre que pour une restitution sur plus de deux haut-parleurs (synthèse transaurale ou encore en contexte "ambisonique" que l'on décrit ci-après), le nombre de signaux résultant du traitement de spatialisation est généralement supérieur à deux, ce qui augmente encore le coût de calcul pour recoder complètement ces signaux avant leur transmission par le réseau de communication.It is furthermore indicated that for a restitution on more than two loudspeakers (transaural synthesis or in "ambisonic" context that is described hereinafter), the number of signals resulting from the spatialization processing is generally greater than two, which further increases the cost of calculation to completely recode these signals before their transmission through the communication network.

On se réfère maintenant à la figure 4 pour décrire une mise en oeuvre du procédé au sens de la présente invention.Referring now to Figure 4 to describe an implementation of the method within the meaning of the present invention.

Elle consiste à associer l'implémentation "multicanale" de la synthèse binaurale (figure 2) aux techniques de filtrage dans le domaine transformé (domaine dit "des sous-bandes") afin de ne pas avoir à réaliser N opérations de décodage complètes avant l'étape de spatialisation. On réduit ainsi le coût de calcul global de l'opération. Cette "intégration" des opérations de codage et de spatialisation peut être effectuée dans le cas d'un traitement au niveau d'un terminal de communication ou d'un traitement au niveau d'un serveur comme représenté sur la figure 4.It consists of associating the "multichannel" implementation of binaural synthesis (FIG. 2) with filtering techniques in the transformed domain (so-called "sub-band" domain ) so as not to have to carry out N complete decoding operations before 'stage of spatialization. This reduces the overall calculation cost of the operation. This "integration" of the coding and spatialization operations can be performed in the case of a processing at a communication terminal or a processing at a server as shown in FIG. 4.

Les différentes étapes de traitement des données ainsi que l'architecture du système sont décrites en détail ci-après.The different stages of data processing as well as the architecture of the system are described in detail below.

Dans le cas d'une spatialisation de signaux audio codés multiples, au niveau du serveur comme dans l'exemple représenté sur la figure 4, une opération de décodage partiel est encore nécessaire. Toutefois, cette opération est beaucoup moins coûteuse que l'opération de décodage dans un système conventionnel tel que représenté sur la figure 3. Ici, cette opération consiste principalement à récupérer les paramètres des sous-bandes à partir du flux audio binaire, codé. Cette opération dépend du codeur initial utilisé. Elle peut consister par exemple en un décodage entropique suivi d'une quantification inverse comme dans un codeur MPEG-1 Layer III. Une fois ces paramètres des sous-bandes retrouvés, le traitement est effectué dans le domaine des sous-bandes, comme on le verra ci-après.In the case of a spatialization of multiple coded audio signals, at the server as in the example shown in Figure 4, a partial decoding operation is still necessary. However, this operation is much less expensive than the decoding operation in a conventional system as shown in FIG. 3. Here, this operation consists mainly in recovering the parameters of the subbands from the encoded binary audio stream. This operation depends on the initial encoder used. It may for example consist of entropy decoding followed by inverse quantization as in an MPEG-1 Layer III coder. Once these parameters of the subbands found, the treatment is performed in the field of sub-bands, as will be seen below.

Le coût de calcul global de l'opération de spatialisation des flux audio codés est alors considérablement réduit. En effet, l'opération initiale de décodage dans un système conventionnel est remplacée par une opération de décodage partiel de complexité bien moindre. La charge de calcul dans un système au sens de l'invention devient sensiblement constante en fonction du nombre de flux audio que l'on souhaite spatialiser. Par rapport aux systèmes conventionnels, on obtient un gain en terme de coût de calcul qui devient alors proportionnel au nombre de flux audio que l'on souhaite spatialiser. De plus, l'opération de décodage partiel entraîne un délai de traitement inférieur à l'opération de décodage complet, ce qui est particulièrement intéressant dans un contexte de communication interactive.The overall calculation cost of the spatialization operation of the coded audio streams is then considerably reduced. Indeed, the initial decoding operation in a conventional system is replaced by a partial decoding operation of much less complexity. The computing load in a system according to the invention becomes substantially constant as a function of the number of audio streams that it is desired to spatialize. Compared to conventional systems, we obtain a gain in terms of calculation cost which then becomes proportional to the number of audio streams that we wish to spatialize. In addition, the partial decoding operation results in a lower processing delay than the full decoding operation, which is particularly interesting in a context of interactive communication.

Le système pour la mise en oeuvre du procédé selon l'invention, effectuant la spatialisation dans le domaine des sous-bandes, est noté "Système II" sur la figure 4.The system for implementing the method according to the invention, performing spatialization in the field of sub-bands, is denoted "System II" in Figure 4.

On décrit ci-après l'obtention des paramètres dans le domaine des sous-bandes à partir de réponses impulsionnelles binaurales.The following describes obtaining the parameters in the subband domain from binaural impulse responses.

De manière classique, les fonctions de transfert binaurales ou HRTFs sont accessibles sous la forme de réponses impulsionnelles temporelles. Ces fonctions sont constituées en général de 256 échantillons temporels, à une fréquence d'échantillonnage de 44,1 kHz (typique dans le domaine de l'audio). Ces réponses impulsionnelles peuvent être issues de mesures ou de simulations acoustiques.Conventionally, the binaural transfer functions or HRTFs are accessible in the form of temporal impulse responses. These functions generally consist of 256 time samples, at a sampling frequency of 44.1 kHz (typical in the audio field). These impulse responses can be derived from measurements or acoustic simulations.

Les étapes de pré-traitement pour l'obtention des paramètres dans le domaine des sous-bandes sont préférentiellement les suivantes :

extraction du retard interaural à partir de réponses impulsionnelles binaurales h _l (n) et h _r (n) (si l'on dispose de D directions de l'espace mesurées, on obtient un vecteur de D valeurs de retard interaural ITD (exprimé en secondes)) ;
modélisation des réponses impulsionnelles binaurales sous forme de filtres à phase minimale ;
choix du nombre de vecteurs de base (P) que l'on souhaite conserver pour la décomposition linéaire des HRTFs ;
décomposition linéaire des réponses à phase minimale selon la relation Eq[1] ci-avant (on obtient ainsi les D coefficients directionnels C_ni et D_ni qui ne dépendent que de la position de la source sonore à spatialiser et les P vecteurs de base qui ne dépendent que de la fréquence) ;
modélisation des filtres de base L_n et R_n sous forme de filtres IIR ou FIR ;
calcul de matrices de filtres de gains G _i dans le domaine des sous-bandes à partir des D valeurs d'ITD (ces retards ITD sont alors considérés comme des filtres FIR destinés à être transposés dans le domaine des sous-bandes, comme on le verra ci-après. Dans le cas général, G _i est une matrice de filtres. Les D coefficients directionnels C _ni, D _ni à appliquer dans le domaine des sous-bandes sont des scalaires de mêmes valeurs que les C_ni et D_ni respectivement dans le domaine temporel) ;
transposition des filtres de base L_n et R_n, initialement sous forme IIR ou FIR, dans le domaine des sous-bandes (cette opération donne des matrices de filtres, notées ci-après L _n et R_n, à appliquer dans le domaine des sous-bandes. La méthode pour effectuer cette transposition est indiquée ci-après).

The pre-treatment steps for obtaining the parameters in the subband domain are preferably the following:

extraction of the interaural delay from binaural impulse responses h _l (n) and h _r (n) (if we have D measured space directions, we obtain a vector of D interaural delay values ITD (expressed in seconds));
modeling of binaural impulse responses in the form of minimal phase filters;
choice of the number of basic vectors (P) that one wishes to keep for the linear decomposition of HRTFs;
linear decomposition of the minimal phase responses according to the equation Eq [1] above (we thus obtain the D directional coefficients C _ni and D _ni which depend only on the position of the sound source to be spatialised and the P basic vectors which depend only on the frequency);
modeling of the basic filters L _n and R _n in the form of IIR or FIR filters;
calculation of gain filter matrices G _i in the subband domain from the D values of ITD (these ITD delays are then considered as FIR filters to be transposed in the subband domain, as it is see below: In the general case, G _i is a matrix of filters The D directional coefficients C _ni , D _ni to be applied in the subband domain are scalars of the same values as the C _ni and D _ni respectively in the time domain);
transposition of the basic filters L _n and R _n , initially in IIR or FIR form, in the domain of the sub-bands (this operation gives filter matrices, hereinafter denoted L _n and R _n , to be applied in the field of subbands The method for performing this transposition is shown below).

On remarquera que les matrices de filtres Gi appliqués de manière indépendante à chaque source "intègrent" une opération classique de calcul de retard pour l'ajout du retard interaural entre un signal L _i et un signal R _i à restituer. En effet, dans le domaine temporel, on prévoit classiquement des lignes à retard τ_i (figure 2) à appliquer à un signal "oreille gauche" par rapport à un signal "oreille droite". Dans le domaine des sous-bandes, on prévoit plutôt une telle matrice de filtres G_i, lesquels permettent en outre d'ajuster des gains (par exemple en énergie) de certaines sources par rapport à d'autres.It will be noted that the filter matrices Gi applied independently to each source "integrate" a conventional delay calculation operation for adding the interaural delay between a signal L _i and a signal R _i to return. Indeed, in the time domain, delay lines τ _i are conventionally provided (FIG. 2) to be applied to a "left ear" signal with respect to a "right ear" signal. In the field of sub-bands, rather, such a matrix of filters G _{i is provided} , which in addition makes it possible to adjust gains (for example in energy) of certain sources with respect to others.

Dans le cas d'une transmission à partir d'un serveur vers des terminaux de restitution, toutes ces étapes sont effectuées avantageusement hors ligne. Les matrices de filtres ci-avant sont donc calculées une fois puis stockées définitivement en mémoire du serveur. On notera en particulier que le jeu des coefficients de pondération C _ni , D _ni reste avantageusement inchangé du domaine temporel au domaine des sous-bandes.In the case of a transmission from a server to playback terminals, all these steps are performed advantageously offline. The filter matrices above are thus calculated once and then permanently stored in the server memory. In particular, it will be noted that the set of weighting coefficients C _n , D _{ni and} advantageously remains unchanged from the time domain to the domain of the subbands.

Pour des techniques de spatialisation basées sur du filtrage par des filtres HRTFs et ajout du retard ITD (pour "Interaural Time Delay") tel que la synthèse binaurale et transaurale, ou encore des filtres de fonctions de transfert dans le contexte ambisonique, une difficulté s'est présentée pour trouver des filtres équivalents à appliquer sur des échantillons dans le domaine des sous-bandes. En effet, ces filtres issus du banc de filtres d'analyse doivent préférentiellement être construits de manière à ce que les signaux temporels gauche et droite restitués par le banc de filtres de synthèse présentent le même rendu sonore, et sans aucun artefact, que celui obtenu par une spatialisation directe sur un signal temporel. La conception de filtres permettant d'aboutir à un tel résultat n'est pas immédiate. En effet, la modification du spectre du signal apporté par un filtrage dans le domaine temporel ne peut être réalisée directement sur les signaux des sous-bandes sans tenir compte du phénomène de recouvrement de spectre ("aliasing") introduit par le banc de filtres d'analyse. La relation de dépendance entre les composantes d'aliasing des différentes sous-bandes est préférentiellement conservée lors de l'opération du filtrage pour que leur suppression soit assurée par le banc de filtres de synthèse.For spatialization techniques based on filtering by HRTFs filters and adding the ITD delay (for "Interaural Time Delay") such as binaural and transaural synthesis, or else transfer function filters in the ambisonic context, a difficulty is presented to find equivalent filters to be applied to samples in the subband field. Indeed, these filters from the analysis filter bank must preferably be constructed in such a way that the left and right temporal signals reproduced by the synthesis filter bank have the same sound reproduction, and without any artifact, as that obtained by direct spatialization on a time signal. The design of filters to achieve such a result is not immediate. Indeed, the modification of the spectrum of the signal provided by a filtering in the time domain can not be carried out directly on the signals of the subbands without taking into account the phenomenon of spectrum overlap ("aliasing") introduced by the filter bank d 'analysis. The dependency relationship between the aliasing components of the different sub-bands is preferentially retained during the filtering operation so that their deletion is ensured by the synthesis filter bank.

On décrit ci-après un procédé pour transposer un filtre S(z), de type FIR ou IIR, rationnel (sa transformée en z étant un quotient de deux polynômes) dans le cas d'une décomposition linéaire de HRTFs ou de fonctions de transfert de ce type, dans le domaine des sous-bandes, pour un banc de filtres à M sous-bandes et à échantillonnage critique, défini respectivement par ses filtres d'analyse et de synthèse H _k(z) et F _k (z) , où 0≤k≤M-1. On entend par "échantillonnage critique" le fait que le nombre de l'ensemble des échantillons de sorties des sous-bandes correspond au nombre d'échantillons en entrées. Ce banc de filtres est supposé aussi satisfaire à la condition de reconstruction parfaite.A method for transposing a rational-type filter S (z) of type FIR or IIR is described below (its z-transform being a quotient of two polynomials) in the case of linear decomposition of HRTFs or transfer functions. of this type, in the area of sub-bands, for a filterbank M subband and critical sampling, respectively defined by its filter analysis and synthesis of H _k (z) and F _k (z), where 0≤k≤M-1. The term "critical sampling" means that the number of all the output samples of the subbands corresponds to the number of samples in inputs. This filter bank is also supposed to satisfy the condition of perfect reconstruction.

On considère tout d'abord une matrice de transfert S(z) correspondant au filtre scalaire S(z), qui s'exprime comme suit : $S (z) = [\begin{matrix} S_{0} (z) & S_{1} (z) & \dots & S_{M - 1} (z) \\ z^{- 1} S_{M - 1} (z) & S_{0} (z) & S_{1} (z) & \dots & S_{M - 2} (z) \\ z^{- 1} S_{M - 2} (z) & z^{- 1} S_{M - 1} (z) & S_{0} (z) & S_{1} (z) & \dots & S_{M - 3} (z) \\ ⋮ & ⋱ & ⋱ & ⋱ & ⋮ \\ S_{1} (z) \\ z^{- 1} S_{1} (z) & \dots & z^{- 1} S_{M - 1} (z) & S_{0} (z) \end{matrix}],$

où S _k(z) (0≤k≤M-1) sont les composantes polyphasées du filtre S(z).We first consider a transfer matrix S (z) corresponding to the scalar filter S (z), which is expressed as follows:

S (z) = [\begin{matrix} S_{0} (z) & S_{1} (z) & ... & S_{M - 1} (z) \\ z^{- 1} S_{M - 1} (z) & S_{0} (z) & S_{1} (z) & ... & S_{M - two} (z) \\ z^{- 1} S_{M - two} (z) & z^{- 1} S_{M - 1} (z) & S_{0} (z) & S_{1} (z) & ... & S_{M - 3} (z) \\ ⋮ & ⋱ & ⋱ & ⋱ & ⋮ \\ S_{1} (z) \\ z^{- 1} S_{1} (z) & ... & z^{- 1} S_{M - 1} (z) & S_{0} (z) \end{matrix}],

where S _k (z) (0≤k≤M-1) are the polyphase components of the filter S (z).

Ces composantes sont obtenues de manière directe pour un filtre FIR. Pour les filtres IIR, une méthode de calcul est indiquée dans :

[1] A. Benjelloun Touimi, "Traitement du signal audio dans le domaine codé : techniques et applications" thèse de doctorat de l'Ecole Nationale Supérieure des Télécommunications de Paris, (Annexe A, p.141), Mai 2001.

These components are obtained directly for a FIR filter. For IIR filters, a calculation method is indicated in:

[1] A. Benjelloun Touimi, "Processing of the audio signal in the coded domain: techniques and applications" Doctoral thesis of the National School of Telecommunications of Paris, (Annex A, p.141), May 2001.

On détermine ensuite des matrices polyphasées, E(z) et R(z), correspondant respectivement aux bancs de filtres d'analyse et de synthèse. Ces matrices sont déterminées définitivement pour le banc de filtres considéré.Polyphase matrices E (z) and R (z) corresponding to the analysis and synthesis filter banks are then determined. These matrices are definitively determined for the filter bank under consideration.

On calcule alors la matrice de filtrage complète en sous-bandes par la formule suivante:
S_sb(z)=z^kE(z)S(z)R(z), où z^K correspond à une avance avec K=(L/M)-1 (caractérisant le banc de filtres utilisé), L étant la longueur des filtres d'analyse et de synthèse des bancs de filtres utilisés.The complete filter matrix in sub-bands is then calculated by the following formula:
S _sb (z) = z ^k E (z) S (z) R (z), where z ^K corresponds to an advance with K = (L / M) -1 (characterizing the bank of filters used), L being the length of filters for analysis and synthesis of filter banks used.

On construit ensuite la matrice S̃ _sb(z) dont les lignes sont obtenues à partir de celles de S_sb(z) comme suit :
[0 ... S^sb _il (z) ... S^sb _ii (z) ... S^sb _in (z) ... 0] (0≤n≤M-1), où :

i est l'indice de la (i+1)ième ligne et compris entre 0 et M-1,
1 = i-δ mod[M], où δ correspond à un nombre choisi de sous-diagonales adjacentes, la notation mod[M] correspondant à une opération de soustraction modulo M,
n = i+δ mod[M], la notation mod[M] correspondant à une opération d'addition modulo M.

The matrix S _sb (z) whose lines are obtained from those of S _sb (z) are then constructed as follows:
[0 ... S ^sb _il (z) ... S ^sb _ii (z) ... S ^sb _in (z) ... 0] (0≤n≤M-1), where:

i is the index of the (i + 1) th line and between 0 and M-1,
1 = i-δ mod [M], where δ corresponds to a chosen number of adjacent subdiagals, the mod notation [M] corresponding to a modulo M subtraction operation,
n = i + δ mod [M], the mod notation [M] corresponding to a modulo M addition operation.

On indique que le nombre choisi δ correspond au nombre de bandes qui se recouvrent suffisamment d'un côté avec la bande passante d'un filtre du banc de filtres. Il dépend donc du type de bancs de filtres utilisés dans le codage choisi. A titre d'exemple, pour le banc de filtres MDCT, δ peut être pris égal à 2 ou 3. Pour le banc de filtres Pseudo-QMF du codage MPEG-1, δ est pris égal à 1.It is indicated that the chosen number δ corresponds to the number of bands which overlap sufficiently on one side with the bandwidth of a filter of the filterbank. It therefore depends on the type of filter banks used in the chosen coding. By way of example, for the MDCT filter bank, δ can be taken as 2 or 3. For the Pseudo-QMF filter bank of the MPEG-1 coding, δ is taken as 1.

On notera que le résultat de cette transposition d'un filtre à réponse impulsionnelle finie ou infinie au domaine des sous-bandes est une matrice de filtres de taille MxM. Cependant, tous les filtres de cette matrice ne sont pas considérés lors du filtrage en sous-bandes. Avantageusement, seuls les filtres de la diagonale principale et de quelques sous-diagonales adjacentes peuvent être utilisés pour obtenir un résultat similaire à celui obtenu par un filtrage dans le domaine temporel (sans altérer pour autant la qualité de la restitution).It will be noted that the result of this transposition of a finite or infinite impulse response filter to the domain of the sub-bands is a matrix of filters of size MxM. However, not all filters in this matrix are considered during subband filtering. Advantageously, only the filters of the main diagonal and some adjacent sub-diagonals can be used to obtain a result similar to that obtained by filtering in the time domain (without altering the quality of the restitution).

La matrice S̃ _sb (z) résultant de cette transposition, puis réduite, est celle utilisée pour le filtrage en sous-bandes.The matrix S _sb (z) resulting from this transposition, then reduced, is that used for the subband filtering.

A titre d'exemple, on indique ci-après les expression des matrices polyphasées E(z) et R(z) pour un banc de filtres MDCT, largement utilisé dans des codeurs par transformée actuels tels que ceux opérant selon les standards MPEG-2/4 AAC, ou Dolby AC-2 & AC-3, ou TDAC de la Demanderesse. Le traitement ci-après peut aussi bien s'adapter à un banc de filtres de type Pseudo-QMF du codeur MPEG-1/2 Layer I-II.By way of example, the expression of the polyphase matrices E (z) and R (z) for an MDCT filter bank, widely used in current transform coders such as those operating according to the MPEG-2 standards, is indicated below. / 4 AAC, or Dolby AC-2 & AC-3, or TDAC of the Applicant. The following treatment can also be adapted to a Pseudo-QMF filter bank of the MPEG-1/2 Layer I-II coder.

Un banc de filtres MDCT est généralement défini par une matrice T=[t_k,l], de taille M×2M, dont les éléments s'expriment comme suit : $t_{k, l} = \sqrt{\frac{2}{M}} h [l] \cos [\frac{π}{M} (k + \frac{1}{2}) (l + \frac{M + 1}{2})], 0 \leq k \leq M - 1 et 0 \leq l \leq 2 M - 1,$

où h[l] correspond à la fenêtre de pondération dont un choix possible est la fenêtre sinusoïdale qui s'exprime sous la forme suivante :

h [l] = \sin [(l + \frac{1}{2}) \frac{π}{2 M}], 0 \leq l \leq 2 M - 1.

Les matrices polyphasées d'analyse et de synthèse sont alors données respectivement par les formules suivantes :

E (z) = T_{1} J_{M} + T_{0} J_{M} z^{- 1},

R (z) = J_{M} T_{0}^{T} + J_{M} T_{1}^{T} z^{- 1},

où

J_{M} = (\begin{matrix} 0 & \dots & 1 \\ ⋮ & ⋰ & ⋮ \\ 1 & \dots & 0 \end{matrix})

correspond à la matrice anti-identité de taille MxM et T₀ et T₁ sont des matrices de taille MxM résultant de la partition suivante:

T = [\begin{matrix} T_{0} & T_{1} \end{matrix}] .

On indique que pour ce banc de filtres L=2M et K =1.An MDCT filter bank is generally defined by a matrix T = [t _{k, l} ], of size M × 2M, whose elements are expressed as follows:

t_{k, l} = \sqrt{\frac{two}{M}} h [l] \cos [\frac{π}{M} (k + \frac{1}{two}) (l + \frac{M + 1}{two})], 0 \leq k \leq M - 1 and 0 \leq l \leq two M - 1,

where h [ l ] corresponds to the weighting window whose possible choice is the sinusoidal window which is expressed in the following form:

h [l] = \sin [(l + \frac{1}{two}) \frac{π}{two M}], 0 \leq l \leq two M - 1.

The polyphase analysis and synthesis matrices are then respectively given by the following formulas:

E (z) = T_{1} J_{M} + T_{0} J_{M} z^{- 1},

R (z) = J_{M} T_{0}^{T} + J_{M} T_{1}^{T} z^{- 1},

or

J_{M} = (\begin{matrix} 0 & ... & 1 \\ ⋮ & ⋰ & ⋮ \\ 1 & ... & 0 \end{matrix})

corresponds to the anti-identity matrix of size MxM and T ₀ and T ₁ are matrices of size MxM resulting from the following partition:

T = [\begin{matrix} T_{0} & T_{1} \end{matrix}] .

It is indicated that for this bank of filters L = 2M and K = 1.

Pour des bancs de filtres de type Pseudo-QMF de MPEG-1/2 Layer I-II, on définit une fenêtre de pondération h[i], i=0...L-1 , et une matrice de modulation en cosinus Ĉ=[c_kl], de taille M×2M, dont les coefficients sont donnés par : $c_{k l} = \cos [\frac{π}{M} (k + \frac{1}{2}) (l - \frac{M}{2})], 0 \leq l \leq 2 M - 1 et 0 \leq k \leq M - 1,$

avec les relations suivantes : L=2mM et K=2m-1 où m est un entier. Plus particulièrement dans le cas du codeur MPEG-1/2 Layer I-II, ces paramètres prennent les valeurs suivantes : M=32, L=512, m=8 et K=15. For Pseudo-QMF filter banks of MPEG-1/2 Layer I-II, we define a weighting window h [i], i = 0 ... L-1, and a cosine modulation matrix Ĉ = [c _kl ], of size M × 2M, whose coefficients are given by:

{vs}_{k l} = \cos [\frac{π}{M} (k + \frac{1}{two}) (l - \frac{M}{two})], 0 \leq l \leq two M - 1 and 0 \leq k \leq M - 1,

with the following relations: L = 2mM and K = 2m- 1 where m is an integer. More particularly in the case of the MPEG-1/2 Layer I-II encoder, these parameters take the following values: M = 32, L = 512, m = 8 and K = 15.

La matrice polyphasée d'analyse s'exprime alors comme suit : $E (z) = \hat{C} [\begin{matrix} g_{0} (- z^{2}) \\ z^{- 1} g_{1} (- z^{2}) \end{matrix}],$

où g₀ (z) et g₁ (z) sont des matrices diagonales définies par :

{\begin{cases} g_{0} (z) = diag [\begin{matrix} G_{0} (z) & G_{1} (z) & \dots & G_{M - 1} (z) \end{matrix}], \\ g_{1} (z) = diag [\begin{matrix} G_{M} (z) & G_{M + 1} (z) & \dots & G_{2 M - 1} (z) \end{matrix}], \end{cases}

avec

G_{k} (- z^{2}) = \sum_{l = 0}^{m - 1} {(- 1)}^{l} h (2 l M + k) z^{- 2 l}, 0 \leq k \leq 2 M - 1.

The polyphase analysis matrix is then expressed as follows:

E (z) = \hat{VS} [\begin{matrix} {boy Wut}_{0} (- z^{two}) \\ z^{- 1} {boy Wut}_{1} (- z^{two}) \end{matrix}],

where g ₀ (z) and g ₁ (z) are diagonal matrices defined by:

{\begin{cases} {boy Wut}_{0} (z) = diag [\begin{matrix} {BOY WUT}_{0} (z) & {BOY WUT}_{1} (z) & \dots & {BOY WUT}_{M - 1} (z) \end{matrix}], \\ {boy Wut}_{1} (z) = diag [\begin{matrix} {BOY WUT}_{M} (z) & {BOY WUT}_{M + 1} (z) & \dots & {BOY WUT}_{two M - 1} (z) \end{matrix}], \end{cases}

with

{BOY WUT}_{k} (- z^{two}) = Σ_{l = 0}^{m - 1} {(- 1)}^{l} h (two l M + k) z^{- two l}, 0 \leq k \leq two M - 1.

Dans la norme MPEG-1 Audio Layer I-II, on fournit typiquement les valeurs de la fenêtre (-1)¹ h(2lM+k), avec 0≤k≤2M-1, 0≤l≤m-1. In the MPEG-1 Audio Layer I-II standard, the values of the window (-1) ¹ h (2 lM + k ) are typically provided, with 0≤k≤2M-1, 0≤l≤m-1.

La matrice polyphasée de synthèse peut alors se déduire simplement par la formule suivante : $R (z) = z^{- (2 m - 1)} E^{T} (z^{- 1})$

The polyphase synthesis matrix can then be deduced simply by the following formula:

R (z) = z^{- (two m - 1)} E^{T} (z^{- 1})

Ainsi, en se référant maintenant à la figure 4 au sens de la présente invention, on procède à un décodage partiel de N sources audio S₁,..., S_i,..., S_N codées en compression, pour obtenir des signaux S₁,..., S _i ,..., S _N correspondant préférentiellement à des vecteurs signaux dont les coefficients sont des valeurs affectées chacune à une sous-bande. On entend par "décodage partiel" un traitement permettant d'obtenir à partir des signaux codés en compression de tels vecteurs signaux dans le domaine des sous-bandes. On peut obtenir en outre des informations de position desquelles sont déduites des valeurs respectives de gains G ₁ ,..., G _i ,..., G _N (pour la synthèse binaurale) et des coefficients C _ni (pour l'oreille gauche) et D _ni (pour l'oreille droite) pour le traitement de spatialisation conformément à l'équation Eq [1] donnée ci-avant, comme le montre la figure 5. Toutefois, le traitement de spatialisation est mené directement dans le domaine des sous-bandes et l'on applique les 2P matrices L _n et R _n de filtres de base, obtenues comme indiqué ci-avant, aux vecteurs signaux Si pondérés par les coefficients scalaires C_ni et D_ni, respectivement.Thus, with reference now to FIG. 4 in the sense of the present invention, partial _N- coded audio sources S ₁ ,..., S _i ,..., S _{N are} decoded in compression to obtain S ₁ , ..., S _i , ..., S _N signals preferably corresponding to signal vectors whose coefficients are values each assigned to a sub-band. "Partial decoding" is understood to mean a processing that makes it possible to obtain, from the compression-coded signals, such signal vectors in the field of the sub-bands. Position information from which respective values of gains G ₁ ,..., G _i ,..., G _N (for binaural synthesis) and coefficients C _ni (for the left ear) can be obtained. ) and D _ni (for the right ear) for the spatialization processing according to equation Eq [1] given above, as shown in Figure 5. However, the spatialization processing is conducted directly in the field of sub-bands and apply the 2P matrices L _n and R _n of basic filters, obtained as indicated above, to the signal vectors S 1 weighted by the scalar coefficients C _ni and D _ni , respectively.

En se référant à la figure 5, les vecteurs signaux L et R, résultant du traitement de spatialisation dans le domaine des sous-bandes (par exemple dans un système de traitement noté "Système II" sur la figure 4) s'expriment alors par les relations suivantes, dans une représentation par leur transformée en z : $L (z) = \sum_{n = 1}^{P} L_{n} (z) . [\sum_{i = 1}^{N} C_{n i} . S_{i} (z)]$

R (z) = \sum_{n = 1}^{P} R_{n} (z) . [\sum_{i = 1}^{N} D_{n i} . S_{i} (z)]

Referring to FIG. 5, the signal vectors L and R, resulting from the spatialization processing in the subband domain (for example in a processing system denoted "System II" in FIG. 4) are then expressed by the following relations, in a representation by their transformed into z:

The (z) = Σ_{not = 1}^{P} {The}_{not} (z) . [Σ_{i = 1}^{NOT} {VS}_{not i} . S_{i} (z)]

R (z) = Σ_{not = 1}^{P} R_{not} (z) . [Σ_{i = 1}^{NOT} D_{not i} . S_{i} (z)]

Dans l'exemple représenté sur la figure 4, le traitement de spatialisation est effectué dans un serveur relié à un réseau de communication. Ainsi, ces vecteurs signaux L et R peuvent être recodés complètement en compression pour diffuser les signaux compressés L et R (canaux gauche et droit) dans le réseau de communication et à destination des terminaux de restitution.In the example shown in FIG. 4, the spatialization processing is carried out in a server connected to a communication network. Thus, these L and R signal vectors can be recoded completely in compression to broadcast the compressed signals L and R (left and right channels) in the communication network and to the playback terminals.

Ainsi, une étape initiale de décodage partiel des signaux codés S_i est prévue, avant le traitement de spatialisation. Toutefois, cette étape est beaucoup moins coûteuse et plus rapide que l'opération de décodage complet qui était nécessaire dans l'art antérieur (figure 3). De plus, les vecteurs signaux L et R sont déjà exprimés dans le domaine des sous-bandes et le recodage partiel de la figure 4 pour obtenir les signaux codés en compression L et R est plus rapide et moins coûteux qu'un codage complet tel que représenté sur la figure 3.Thus, an initial step of partially decoding the coded signals S _i is provided before the spatialization processing. However, this step is much less expensive and faster than the complete decoding operation that was necessary in the prior art (FIG. 3). In addition, the signal vectors L and R are already expressed in the subband domain and the partial recoding of FIG. 4 to obtain the coded signals in L and R compression is faster and less expensive than a complete coding such as shown in Figure 3.

On indique que les deux traits discontinus verticaux de la figure 5 délimitent le traitement de spatialisation effectué dans le "Système II" de la figure 4. A ce titre, la présente invention vise aussi un tel système comportant des moyens de traitement des signaux partiellement codés Si, pour la mise en oeuvre du procédé selon l'invention.It is indicated that the two vertical discontinuous lines of FIG. 5 delimit the spatialization processing carried out in "System II" of FIG. 4. As such, the present invention also aims at such a system comprising partially coded signal processing means. If, for the implementation of the method according to the invention.

On indique que le document :

[2] "A Generic Framework for Filtering in Subband Domain" A. Benjelloun Touimi, IEEE 9^th Workshop on Digital Signal Processing, Hunt, Texas, USA, Octobre 2000,

ainsi que le document [1] cité ci-avant, concernent une méthode générale de calcul d'une transposition dans le domaine des sous-bandes d'un filtre de réponse impulsionnelle finie ou infinie.It indicates that the document:

[2] "A Generic Framework for Filtering in Subband Domain" A. Benjelloun Touimi, IEEE 9 ^th Workshop on Digital Signal Processing, Hunt, Texas, USA, October 2000,

as well as the document [1] cited above, concern a general method for calculating a transposition in the subband domain of a finite or infinite impulse response filter.

On indique en outre que des techniques de spatialisation sonore dans le domaine des sous-bandes ont été proposées récemment, notamment dans un autre document :

[3] "Subband-Domain Filtering of MPEG Audio Signals", C.A. Lanciani and R. W. Schafer, IEEE Int. Conf. on Acoust., Speech, Signal Proc., 1999.

It is further indicated that sound spatialization techniques in the field of subbands have been proposed recently, in particular in another document:

[3] "Subband-Domain Filtering of MPEG Audio Signals", CA Lanciani and RW Schafer, IEEE Int. Conf. Acoust., Speech, Signal Proc., 1999.

Ce dernier document présente une méthode permettant de transposer un filtre à réponse impulsionnelle finie (FIR) dans le domaine des sous-bandes des bancs de filtres pseudo-QMF du codeur MPEG-1 Layer I-II et MDCT du codeur MPEG-2/4 AAC. L'opération de filtrage équivalente dans le domaine des sous-bandes est représentée par une matrice de filtres FIR. En particulier, cette proposition s'inscrit dans le contexte d'une transposition de filtres HRTFs, directement sous leur forme classique et non pas sous la forme d'une décomposition linéaire telle qu'exprimée par l'équation Eq[1] ci-avant et sur une base de filtres au sens de l'invention. Ainsi, un inconvénient de la méthode au sens de ce dernier document consiste en ce que le traitement de spatialisation ne peut pas s'adapter à un nombre quelconque de sources ou de flux audio encodés à spatialiser.The latter document presents a method for transposing a finite impulse response (FIR) filter in the sub-band domain of the pseudo-QMF filterbanks of the MPEG-1 Layer I-II and MDCT encoder of the MPEG-2/4 encoder AAC. The equivalent filtering operation in the subband field is represented by an array of FIR filters. In particular, this proposal is in the context of a transposition of HRTFs filters, directly in their classical form and not in the form of a linear decomposition as expressed by equation Eq [1] above and on a filter basis in the sense of the invention. Thus, a disadvantage of the method in the sense of the latter document is that the spatialization processing can not adapt to any number of sources or audio streams encoded spatialize.

On indique que, pour une position donnée, chaque filtre HRTF (d'ordre 200 pour un FIR et d'ordre 12 pour un IIR) donne lieu à une matrice de filtres (carrée) de dimension égale au nombre de sous-bandes du banc de filtres utilisé. Dans le document [3] cité ci-avant, on doit prévoir un nombre de HRTFs suffisant pour représenter les différentes positions dans l'espace, ce qui pose un problème de taille mémoire si l'on souhaite spatialiser une source à une position quelconque dans l'espace.It is indicated that, for a given position, each HRTF filter (of order 200 for a FIR and order 12 for an IIR) gives rise to a matrix of filters (square) of dimension equal to the number of subbands of the bench of filters used. In document [3] cited above, there must be a sufficient number of HRTFs to represent the different positions in space, which poses a problem of memory size if it is desired to spatialize a source at any position in space.

En revanche, une adaptation d'une décomposition linéaire des HRTFs dans le domaine des sous-bandes, au sens de la présente invention, ne présente pas ce problème puisque le nombre (P) de matrices de filtres de base L_n et R_n est beaucoup plus réduit. Ces matrices sont alors stockées définitivement dans une mémoire (du serveur de contenu ou du terminal de restitution) et permettent un traitement simultané de spatialisation d'un nombre quelconque de sources, comme représenté sur la figure 5.On the other hand, an adaptation of a linear decomposition of the HRTFs in the field of the subbands, within the meaning of the present invention, does not present this problem since the number (P) of basic filter matrices L _n and R _n is much smaller. These matrices are then permanently stored in a memory (of the content server or of the rendering terminal) and allow simultaneous spatialization processing of any number of sources, as shown in FIG.

On décrit ci-après une généralisation du traitement de spatialisation au sens de la figure 5 à d'autres traitements de rendu sonore, tels qu'un traitement dit d'"encodage ambisonique". En effet, un système de rendu sonore peut se présenter de manière générale sous la forme d'un système de prise de son réel ou virtuel (pour une simulation) consistant en un encodage du champ sonore. Cette phase consiste à enregistrer p signaux sonores de manière réelle ou à simuler de tels signaux (encodage virtuel) correspondant à l'ensemble d'une scène sonore comprenant tous les sons, ainsi qu'un effet de salle.Hereinafter, a generalization of the spatialization processing in the sense of FIG. 5 will be described with other sound rendering processes, such as a so-called " ambisonic encoding " process. Indeed, a sound rendering system can be generally in the form of a real or virtual sound recording system (for a simulation) consisting of an encoding of the sound field. This phase consists in recording p sound signals in a real way or in simulating such signals (virtual encoding) corresponding to the whole of a sound scene including all the sounds, as well as a room effect.

Le système précité peut aussi se présenter sous la forme d'un système de rendu sonore consistant à décoder les signaux issus de la prise de son pour les adapter aux dispositifs de traducteurs de rendu sonore (tels qu'une pluralité de haut-parleurs ou un casque de type stéréophonique). On transforme les p signaux en n signaux qui alimentent les n haut-parleurs.The aforementioned system may also be in the form of a sound rendering system of decoding the sound output signals to suit the sound rendering translators (such as a plurality of loudspeakers or a loudspeaker). stereophonic headphones). The p signals are transformed into n signals which supply the n loudspeakers.

A titre d'exemple, la synthèse binaurale consiste à réaliser une prise de son réel, à l'aide d'une paire de microphones introduit dans les oreilles d'une tête humaine (artificielle ou réelle). On peut aussi simuler l'enregistrement en réalisant la convolution d'un son monophonique avec la paire de HRTFs correspondant à une direction souhaitée de la source sonore virtuelle. A partir d'un ou plusieurs signaux monophoniques provenant de sources prédéterminées, on obtient deux signaux (oreille gauche et oreille droite) correspondant à une phase dite "d'encodage binaural", ces deux signaux étant simplement appliqués ensuite à un casque à deux oreillettes (tel qu'un casque stéréophonique).By way of example, binaural synthesis consists of making a real sound recording, using a pair of microphones introduced into the ears of a human head (artificial or real). The recording can also be simulated by convolving a monophonic sound with the pair of HRTFs corresponding to a desired direction of the virtual sound source. From one or more monophonic signals coming from predetermined sources, two signals (left ear and right ear) are obtained corresponding to a so-called " binaural encoding" phase, these two signals being then simply applied to a two-ear headset (such as a stereo headset).

Toutefois, d'autres encodages et décodages sont possibles à partir de la décomposition de filtre correspondant à des fonctions de transfert sur une base de filtres. Comme indiqué ci-avant, les dépendances spatiales et fréquentielles des fonctions de transfert, de type HRTFs, sont séparées grâce à une décomposition linéaire et s'écrivent comme une somme de fonctions spatiales C _i (θ,ϕ) et de filtres de reconstruction L _i(f) qui dépendent de la fréquence : $HRTF (θ, ϕ, f) = \sum_{i = 1}^{P} C_{i} (θ, ϕ) . L_{i} (f)$

However, other encodings and decodings are possible from the filter decomposition corresponding to transfer functions on a filter basis. As indicated above, the spatial and frequency dependencies of the transfer functions, of the HRTFs type, are separated by a linear decomposition and are written as a sum of spatial functions C _i (θ, φ) and reconstruction filters L _i ( f ) which depend on the frequency:

HRTF (θ, φ, f) = Σ_{i = 1}^{P} {VS}_{i} (θ, φ) . {The}_{i} (f)

Toutefois, on indique que cette expression peut être généralisée à tout type d'encodage, pour n sources sonores S _j (f) et un format d'encodage comprenant p signaux en sortie, à : $E_{i} (f) = \sum_{j = 1}^{n} X_{i j} (θ, ϕ) . S_{j} (f), 1 \leq i \leq p$

où, par exemple dans le cas d'une synthèse binaurale, X_ij peut s'exprimer sous la forme d'un produit des filtres de gains G _j et des coefficients C_ij,D_ij.However, it is indicated that this expression can be generalized to any type of encoding, for n sound sources S _j ( f ) and an encoding format comprising p output signals, to:

E_{i} (f) = Σ_{j = 1}^{not} X_{i j} (θ, φ) . S_{j} (f), 1 \leq i \leq p

where, for example in the case of a binaural synthesis, X _ij can be expressed in the form of a product of the gain filters G _j and coefficients C _ij , D _ij .

On se réfère à la figure 6 sur laquelle N flux audio S _j représentés dans le domaine des sous-bandes après décodage partiel, subissent un traitement de spatialisation, par exemple un encodage ambisonique, pour délivrer p signaux E _i encodés dans le domaine des sous-bandes. Un tel traitement de spatialisation respecte donc le cas général régi par l'équation Eq[2] ci-avant. On remarquera d'ailleurs sur la figure 6 que l'application aux signaux S _j de la matrice des filtres G _j (pour définir le retard interaural ITD) n'est plus nécessaire ici, dans le contexte ambisonique.Referring to FIG. 6, in which N audio streams S _j represented in the field of the sub-bands after partial decoding, undergo a spatialization processing, for example an ambisonic encoding, to deliver p signals E _i encoded in the sub domain. -bands. Such Spatialization processing therefore respects the general case governed by equation Eq [2] above. It will also be noted in FIG. 6 that the application to the signals S _j of the filter matrix G _j (to define the interaural delay ITD) is no longer necessary here, in the ambisonic context.

De même, une relation générale, pour un format de décodage comprenant p signaux E _i (f) et un format de rendu sonore comprenant m signaux, est donnée par : $D_{j} (f) = \sum_{i = 1}^{p} K_{j i} (f) E_{i} (f), 1 \leq j \leq m$

Similarly, a general relationship for a decoding format comprising p signals E _i (f) and a sound rendering format comprising m signals is given by:

D_{j} (f) = Σ_{i = 1}^{p} K_{j i} (f) E_{i} (f), 1 \leq j \leq m

Pour un système de rendu sonore donné, les filtres K _ji (f) sont fixes et ne dépendent, à fréquence constante, que du système de rendu sonore et de sa disposition par rapport à un auditeur. Cette situation est représentée sur la figure 6 (à droite du trait vertical en pointillés), dans l'exemple du contexte ambisonique. Par exemple, les signaux E _i encodés spatialement dans le domaine des sous-bandes sont recodés complètement en compression, transmis dans un réseau de communication, récupérés dans un terminal de restitution, décodés partiellement en compression pour en obtenir une représentation dans le domaine des sous-bandes. Finalement, on retrouve, après ces étapes, sensiblement les même signaux E _i décrits ci-avant, dans le terminal. Un traitement dans le domaine des sous-bandes du type exprimé par l'équation Eq[3] permet alors de récupérer m signaux D _j, spatialement décodés et prêts à être restitués après décodage en compression.For a given sound rendering system, the filters K _ji (f) are fixed and depend, at constant frequency, only on the sound rendering system and its arrangement with respect to a listener. This situation is shown in Figure 6 (to the right of the dashed vertical line), in the example of the ambisonic context. For example, the signals E _i spatially encoded in the field of the subbands are recoded completely in compression, transmitted in a communication network, recovered in a rendering terminal, partially decoded in compression to obtain a representation in the sub domain. -bands. Finally, after these steps, we find substantially the same signals E _i described above, in the terminal. A processing in the field of the subbands of the type expressed by the equation Eq [3] then makes it possible to recover m signals D _j , spatially decoded and ready to be restored after decoding in compression.

Bien entendu, plusieurs systèmes de décodage peuvent être agencés en série, selon l'application visée.Of course, several decoding systems can be arranged in series, depending on the intended application.

Par exemple, dans le contexte ambisonique bidimensionnel d'ordre 1, un format d'encodage avec trois signaux W, X, Y pour p sources sonores s'exprime, pour l'encodage, par : $E_{1} = W = \sum_{j = 1}^{n} S_{j}$

E_{2} = X = \sum_{j = 1}^{n} \cos (θ_{j}) S_{j}

E_{3} = Y = \sum_{j = 1}^{n} \sin (θ_{j}) S_{j}

For example, in the two-dimensional ambisonic context of order 1, an encoding format with three signals W, X, Y for p sound sources is expressed, for encoding, by:

E_{1} = W = Σ_{j = 1}^{not} S_{j}

E_{two} = X = Σ_{j = 1}^{not} \cos (θ_{j}) S_{j}

E_{3} = Y = Σ_{j = 1}^{not} \sin (θ_{j}) S_{j}

Pour le décodage "ambisonique" auprès d'un dispositif de restitution à cinq haut-parleurs sur deux bandes de fréquences [0 , f ₁] et [f ₁ , f ₂] avec f ₁ = 400Hz et f ₂ correspondant à une bande passante des signaux considérés, les filtres K _ji (f) prennent les valeurs numériques constantes sur ces deux bandes de fréquences, données dans les tableaux I et II ci-après. Tableau I : valeurs des coefficients définissant les filtres K _ji (f) pour 0< f ≤ f ₁ W X Y 0.342 0.233 0.000 0.268 0.382 0.505 0.268 0.382 -0.505 0.561 -0.499 0.457 0.561 -0.499 -0.457 Tableau II : valeurs des coefficients définissant les filtres K _ji (f) pour f ₁ < f ≤f ₂ W X Y 0.383 0.372 0.000 0.440 0.234 0.541 0.440 0.234 -0.541 0.782 -0.553 0.424 0.782 -0.553 -0.424 For "ambisonic" decoding with a five-speaker reproduction device in two frequency bands [0, f ₁ ] and [ f ₁ , f ₂ ] with f ₁ = 400 Hz and f ₂ corresponding to a bandwidth of the considered signals, the filters K _ji (f) take the constant numerical values on these two frequency bands, given in Tables I and II below. Table I: values of coefficients defining filters K ji (f) for 0 < f ≤ f 1 W X Y 0342 0233 0.000 0268 0382 0505 0268 0382 -0505 0561 -0499 0457 0561 -0499 -0457 W X Y 0383 0372 0.000 0440 0234 0541 0440 0234 -0541 0782 -0553 0424 0782 -0553 -0424

Bien entendu, des procédés de spatialisation différents (contexte ambisonique et synthèse binaurale et/ou transaurale) peuvent être combinés auprès d'un serveur et/ou auprès d'un terminal de restitution, de tels procédés de spatialisation respectant l'expression générale d'une décomposition linéaire de fonctions de transfert dans l'espace des fréquences, comme indiqué ci-avant.Of course, different spatialization methods (ambisonic context and binaural and / or transaural synthesis) can be combined with a server and / or with a restitution terminal, such spatialization methods respecting the general expression of a linear decomposition of transfer functions in the frequency space, as indicated above.

On décrit ci-après une mise en oeuvre du procédé au sens de l'invention dans une application liée à une téléconférence entre terminaux distants.An implementation of the method in the sense of the invention is described below in an application related to a teleconference between remote terminals.

En se référant à nouveau à la figure 4, des signaux codés (S _i) émanent de N terminaux distants. Ils sont spatialisés au niveau du serveur de téléconférence (par exemple au niveau d'un pont audio pour une architecture en étoile telle que représentée sur la figure 8), pour chaque participant. Cette étape, effectuée dans le domaine des sous-bandes après une phase de décodage partiel, est suivie d'un recodage partiel. Les signaux ainsi codés en compression sont ensuite transmis via le réseau et, dès réception par un terminal de restitution, sont décodés complètement en compression et appliqués aux deux voies gauche et droite 1 et r, respectivement, du terminal de restitution, dans le cas d'une spatialisation binaurale. Au niveau des terminaux, le traitement de décodage en compression permet ainsi de délivrer deux signaux temporels gauche et droit qui contiennent l'information de positions de N locuteurs distants et qui alimentent deux haut-parleurs respectifs (casque à deux oreillettes). Bien entendu, pour une spatialisation générale, par exemple dans le contexte ambisonique, m voies peuvent être récupérées en sortie du serveur de communication, si l'encodage/décodage en spatialisation sont effectués par le serveur. Toutefois, il est avantageux, en variante, de prévoir l'encodage en spatialisation auprès du serveur et le décodage en spatialisation auprès du terminal à partir des p signaux codés en compression, d'une part, pour limiter le nombre de signaux à véhiculer via le réseau (en général p<m) et, d'autre part, pour adapter le décodage spatial aux caractéristiques de rendu sonore de chaque terminal (par exemple le nombre de haut-parleurs qu'il comporte, ou autres).Referring again to FIG. 4, coded signals ( S _i ) emanate from N remote terminals. They are spatialised at the teleconference server (for example at an audio bridge for a star architecture as shown in FIG. 8) for each participant. This step, performed in the field of sub-bands after a partial decoding phase, is followed by partial recoding. The signals thus coded in compression are then transmitted via the network and, upon receipt by a rendering terminal, are decoded completely in compression and applied to the two left and right lanes 1 and r, respectively, of the rendering terminal, in the case of a binaural spatialization. At the terminals, the compression decoding process thus makes it possible to deliver two left and right temporal signals which contain the information of positions of N distant speakers and which feed two respective loudspeakers (two-ear headphones). Of course, for a general spatialization, for example in the ambisonic context, m channels can be retrieved at the output of the communication server, if the encoding / decoding in spatialization are performed by the server. However, it is advantageous, in a variant, to provide the spatialization encoding with the server and the spatialization decoding with the terminal from the p coded signals in compression, on the one hand, to limit the number of signals to be conveyed via the network (in general p <m) and, secondly, to adapt the spatial decoding to the sound reproduction characteristics of each terminal (for example the number of speakers it contains, or other).

Cette spatialisation peut être statique ou dynamique et, en outre, interactive. Ainsi, la position des locuteurs est fixe ou peut varier au cours du temps. Si la spatialisation n'est pas interactive, la position des différents locuteurs est fixe : l'auditeur ne peut pas la modifier. En revanche, si la spatialisation est interactive, chaque auditeur peut configurer son terminal pour positionner la voix des N autres locuteurs où il le souhaite, sensiblement en temps réel.This spatialization can be static or dynamic and, in addition, interactive. Thus, the position of the speakers is fixed or may vary over time. If the spatialization is not interactive, the position of the different speakers is fixed: the listener can not modify it. On the other hand, if the spatialization is interactive, each listener can configure his terminal to position the voice of the other N speakers where it wishes, substantially in real time.

En se référant maintenant à la figure 7, le terminal de restitution reçoit N flux audio (Si) codés en compression (MPEG, AAC, ou autres) d'un réseau de communication. Après un décodage partiel pour obtenir les vecteurs signaux (Si), le terminal ("Système II") traite ces vecteurs signaux pour spatialiser les sources audio, ici en synthèse binaurale, dans deux vecteurs signaux L et R qui sont ensuite appliqués à des bancs de filtres de synthèse en vue d'un décodage en compression. Les signaux PCM gauche et droit, respectivement 1 et r, résultant de ce décodage sont ensuite destinés à alimenter directement des haut-parleurs. Ce type de traitement s'adapte avantageusement à un système de téléconférence décentralisé (plusieurs terminaux connectés en mode point à point).Referring now to FIG. 7, the rendering terminal receives N audio streams (Si) encoded in compression (MPEG, AAC, or others) of a communication network. After partial decoding to obtain the signal vectors (Si), the terminal (" System II") processes these signal vectors to spatialize the audio sources, here in binaural synthesis, in two signal vectors L and R which are then applied to banks synthesis filters for compression decoding. The left and right PCM signals, respectively 1 and r, resulting from this decoding are then intended to directly supply loudspeakers. This type of processing advantageously adapts to a decentralized teleconferencing system (several terminals connected in point-to-point mode).

On décrit ci-après le cas d'un "streaming" ou d'un téléchargement d'une scène sonore, notamment dans le contexte de codage en compression selon la norme MPEG-4.The following is the case of a " streaming " or a download of a sound scene, particularly in the context of compression coding according to the MPEG-4 standard.

Cette scène peut être simple, ou encore complexe comme souvent dans le cadre de transmissions MPEG-4 où la scène sonore est transmise sous un format structuré. Dans le contexte MPEG-4, le terminal client reçoit, à partir d'un serveur multimédia, un flux binaire multiplexé correspondant à chacun des objets audio primitifs codés, ainsi que des instructions quant à leur composition pour reconstruire la scène sonore. On entend par "objet audio" un flux binaire élémentaire obtenu par un codeur MPEG-4 Audio. La norme MPEG-4 Système fournit un format spécial, appelé "AudioBIFS" (pour "BInary Format for Scene description"), afin de transmettre ces instructions. Le rôle de ce format est de décrire la composition spatio-temporelle des objets audio. Pour construire la scène sonore et assurer un certain rendu, ces différents flux décodés peuvent subir un traitement ultérieur. Particulièrement, une étape de traitement de spatialisation sonore peut être effectuée.This scene can be simple, or complex as often in the context of MPEG-4 transmissions where the sound scene is transmitted in a structured format. In the MPEG-4 context, the client terminal receives, from a multimedia server, a multiplexed bitstream corresponding to each of the coded primitive audio objects, as well as instructions as to their composition for reconstructing the sound scene. " Audio object " an elementary bit stream obtained by an MPEG-4 Audio encoder. The MPEG-4 System standard provides a special format, called "AudioBIFS" (for "BInary Format for Scene Description"), to convey these instructions. The role of this format is to describe the spatio-temporal composition of audio objects. To build the sound stage and ensure a certain rendering, these different decoded streams can undergo further processing. In particular, a sound spatialization processing step can be performed.

Dans le format "AudioBIFS", les manipulations à effectuer sont représentées par un graphe. On prévoit les signaux audio décodés en entrée du graphe. Chaque noeud du graphe représente un type de traitement à réaliser sur un signal audio. On prévoit en sortie du graphe les différents signaux sonores à restituer ou à associer à d'autres objets média (images ou autre).In the "AudioBIFS" format , the operations to be performed are represented by a graph. The decoded audio signals are provided at the input of the graph. Each node of the graph represents a type of processing to be performed on an audio signal. At the output of the graph are provided the different sound signals to be restored or associated with other media objects (images or other).

Les algorithmes utilisés sont mis à jour dynamiquement et sont transmis avec le graphe de la scène. Ils sont décrits sous forme de routines écrites dans un langage spécifique tel que "SAOL" (pour "Structured Audio Score Language"). Ce langage possède des fonctions prédéfinies qui incluent notamment et de façon particulièrement avantageuse des filtres FIR et IIR (qui peuvent alors correspondre à des HRTFs, comme indiqué ci-avant).The algorithms used are updated dynamically and are transmitted with the graph of the scene. They are described as routines written in a specific language such as "SAOL" (for " Structured Audio Score Language"). This language has predefined functions which notably include, and particularly advantageously, FIR and IIR filters (which can then correspond to HRTFs, as indicated above).

En outre, dans les outils de compression audio fournis par la norme MPEG-4, on trouve des codeurs par transformée utilisés surtout pour la transmission audio haute qualité (monophonique et multivoies). C'est le cas des codeurs AAC et TwinVQ basés sur la transformée MDCT.In addition, in the audio compression tools provided by the MPEG-4 standard, there are transform coders used mainly for high quality audio transmission. (monophonic and multichannel). This is the case of AAC and TwinVQ encoders based on the MDCT transform.

Ainsi, dans le contexte MPEG-4, les outils permettant de mettre en oeuvre le procédé au sens de l'invention sont déjà présents.Thus, in the MPEG-4 context, the tools making it possible to implement the method within the meaning of the invention are already present.

Dans un terminal MPEG-4 récepteur, il suffit alors d'intégrer la couche basse de décodage aux noeuds de la couche supérieure qui assure des traitements particuliers, telle que la spatialisation binaurale par des filtres HRTFs. Ainsi, après décodage partiel des flux binaires audio élémentaires démultiplexés et issus d'un même type de codeur (MPEG-4 AAC par exemple), les noeuds du graphe "AudioBIFS" qui font intervenir une spatialisation binaurale peuvent être traités directement dans le domaine des sous-bandes (MDCT par exemple). L'opération de synthèse par banc de filtres n'est effectuée qu'après cette étape.In an MPEG-4 receiver terminal, it is then sufficient to integrate the low decoding layer at the nodes of the upper layer which provides particular processing, such as binaural spatialization by HRTFs filters. Thus, after partial decoding of the demultiplexed elementary audio bit streams and coming from the same type of coder (MPEG-4 AAC for example), the nodes of the "AudioBIFS" graph which involve a binaural spatialization can be processed directly in the field of subbands (MDCT for example). The filter bank synthesis operation is performed only after this step.

Dans une architecture de téléconférence multipoint centralisée telle que représentée sur la figure 8, entre quatre terminaux dans l'exemple représenté, le traitement des signaux pour la spatialisation ne peut s'effectuer qu'au niveau du pont audio. En effet, les terminaux TER1, TER2, TER3 et TER4 reçoivent des flux déjà mixés et donc aucun traitement ne peut être réalisé à leur niveau pour la spatialisation.In a centralized multipoint teleconferencing architecture as shown in FIG. 8, between four terminals in the example shown, signal processing for spatialization can only be performed at the audio bridge level. Indeed, terminals TER1, TER2, TER3 and TER4 receive streams already mixed and therefore no treatment can be achieved at their level for spatialization.

On comprend qu'une réduction de la complexité de traitement est particulièrement souhaitée dans ce cas. En effet, pour une conférence à N terminaux (N ≥3) , le pont audio doit réaliser une spatialisation des locuteurs issus des terminaux pour chacun des N sous-ensembles constitués de (N-1) locuteurs parmi les N participant à la conférence. Un traitement dans le domaine codé apporte bien entendu plus de bénéfice.It is understood that a reduction of the processing complexity is particularly desired in this case. In Indeed, for a conference with N terminals ( N ≥3), the audio bridge must realize a spatialization of the speakers from the terminals for each of the N subsets consisting of (N- 1) speakers among the N participating in the conference. A treatment in the coded domain of course brings more benefit.

La figure 9 représente schématiquement le système de traitement prévu dans le pont audio. Ce traitement est ainsi effectué sur un sous-ensemble de (N-1) signaux audio codés parmi les N en entrée du pont. Les trames audio codés gauche et droit dans le cas d'une spatialisation binaurale, ou les m trames audio codés dans le cas d'une spatialisation générale (par exemple en encodage ambisonique) tel que représenté sur la figure 9, qui résultent de ce traitement sont ainsi transmises au terminal restant qui participe à la téléconférence mais qui ne figure pas parmi ce sous-ensemble (correspondant à un "terminal auditeur"). Au total, N traitements du type décrit ci-avant sont réalisés dans le pont audio (N sous-ensembles de (N-1) signaux codés). On indique que le codage partiel de la figure 9 désigne l'opération de construction de la trame audio codée après le traitement de spatialisation et à transmettre sur une voie (gauche ou droit). A titre d'exemple, il peut s'agir d'une quantification des vecteurs signaux L et R qui résultent du traitement de spatialisation, en se basant sur un nombre de bits alloué et calculé suivant un critère psychoacoustique choisi. Les traitements classiques de codage en compression après l'application du banc de filtres d'analyse peuvent donc être maintenus et effectués avec la spatialisation dans le domaine des sous-bandes.Figure 9 shows schematically the processing system provided in the audio bridge. This processing is thus performed on a subset of ( N -1) coded audio signals among the N at the input of the bridge. The left and right coded audio frames in the case of a binaural spatialization, or the m coded audio frames in the case of a general spatialization (for example in ambisonic encoding) as represented in FIG. 9, which result from this processing are thus transmitted to the remaining terminal which participates in the teleconference but which is not included in this subset (corresponding to a "listener terminal ") . In total, N treatments of the type described above are performed in the audio bridge ( N subsets of ( N -1) coded signals). It is indicated that the partial coding of FIG. 9 designates the operation of constructing the coded audio frame after the spatialization processing and to transmit on a channel (left or right). By way of example, it may be a quantization of the L and R signal vectors that result from the spatialization processing, based on a number of bits allocated and calculated according to a selected psychoacoustic criterion. Conventional compression coding treatments after the application of the Analysis filters can therefore be maintained and performed with spatialization in the subband field.

Par ailleurs, comme indiqué ci-avant, la position de la source sonore à spatialiser peut varier au cours du temps, ce qui revient à faire varier au cours du temps les coefficients directionnels du domaine des sous-bandes C _ni et D _ni. La variation de la valeur de ces coefficients se fait préférentiellement de manière discrète.Moreover, as indicated above, the position of the sound source to be spatialised may vary over time, which amounts to varying over time the directional coefficients of the domain of the subbands C _ni and D _ni . The variation of the value of these coefficients is preferentially done in a discrete manner.

Bien entendu, la présente invention ne se limite pas aux formes de réalisation décrites ci-avant à titre d'exemples mais elle s'étend à d'autres variantes définies dans le cadre des revendications ci-après.Of course, the present invention is not limited to the embodiments described above by way of examples but extends to other variants defined in the context of the claims below.

Claims

Method of processing sound data, for spatialized restitution of acoustic signals, in which:
a) at least one first set C_ni and one second set D_ni of weighting terms, representative of a direction of perception of said acoustic signal by a listener, are obtained for each acoustic signal (Si); and

b) said acoustic signals are applied to at least two sets of filtering units, disposed in parallel, so as to deliver at least a first output signal (L) and a second output signal (R) each corresponding to a linear combination of the acoustic signals weighted by the collection of weighting terms respectively of the first set (C_ni) and of the second set (D_ni) and filtered by said filtering units,
characterized in that each acoustic signal in step a) is at least partially compression-coded and is expressed in the form of a vector of subsignals associated with respective frequency subbands,
and in that each filtering unit is devised so as to perform a matrix filtering applied to each vector, in the frequency subband space.
Method according to Claim 1, characterized in that each matrix filtering is obtained by conversion, in the frequency subbands space, of a filter represented by an impulse response in the time space.
Method according to Claim 2, characterized in that each impulse response filter is obtained by determination of an acoustic transfer function dependent on a direction of perception of a sound and the frequency of this sound.
Method according to Claim 3, characterized in that said transfer functions are expressed by a linear combination of frequency dependent terms weighted by direction dependent terms (Eq[1]).
Method according to one of the preceding claims, characterized in that said weighting terms of the first and of the second set depend on the direction of the sound.
Method according to Claim 5, characterized in that the direction is defined by an azimuth angle (θ) and an elevation angle (ϕ).
Method according to one of Claims 2 and 3, characterized in that the matrix filtering is expressed on the basis of a product matrix involving polyphase matrices (E(z), R(z)) corresponding to analysis and synthesis filter banks and a transfer matrix (S(z)) whose elements are dependent on the impulse response filter.
Method according to one of the preceding claims, characterized in that the matrix of the matrix filtering is of reduced form and comprises a diagonal and a predetermined number (δ) of adjacent subdiagonals below and above, whose elements are not all zero.
Method according to Claim 8, taken in combination with claim 7, characterized in that the rows of the matrix of the matrix filtering are expressed by:
[0...S^sb _il (z) ... S^sb _ii (z) ... S^sb _in (z) ... 0] , where:
- i is the index of the (i+1)th row and lies between 0 and M-1, M corresponding to a total number of subbands,

- 1 = i-δ mod[M], where δ corresponds to said number of adjacent subdiagonals, the notation mod[M] corresponding to an operation of subtraction modulo M,

- n = i+δ mod[M], the notation mod[M] corresponding to an operation of addition modulo M,

- and S^sb _ij(z) are the coefficients of said product matrix involving the polyphase matrices of the analysis and synthesis filter banks and said transfer matrix.
Method according to one of Claims 7 to 9, characterized in that said product matrix is expressed by S ^sb(z) = Z^k E (z) S (z) R (z), where
- Z^k is an advance defined by the term K= (L/M) -1 where L is the length of the impulse response of the analysis and synthesis filters of the filter banks and M the total number of subbands,

- E(z) is the polyphase matrix corresponding to the analysis filter bank,

- R(z) is the polyphase matrix corresponding to the synthesis filter bank, and

- S(z) corresponds to said transfer matrix.
Method according to one of claims 7 to 10, characterized in that said transfer matrix is expressed by: $S (z) = [\begin{matrix} S_{0} (z) & S_{1} (z) & \dots & S_{M - 1} (z) \\ z^{- 1} S_{M - 1} (z) & S_{0} (z) & S_{1} (z) & \dots & S_{M - 2} (z) \\ z^{- 1} S_{M - 2} (z) & z^{- 1} S_{M - 1} (z) & S_{0} (z) & S_{1} (z) & \dots & S_{M - 3} (z) \\ ⋮ & ⋱ & ⋱ & ⋱ & ⋮ \\ S_{1} (z) \\ z^{- 1} S_{1} (z) & \dots & z^{- 1} S_{M - 1} (z) & S_{0} (z) \end{matrix}],$

where S_k(z) are the polyphase components of the impulse response filter S(z), with k lying between 0 and M-1 and M corresponding to a total number of subbands.
Method according to one of Claims 7 to 11, characterized in that said filter banks operate in critical sampling.
Method according to one of Claims 7 to 12, characterized in that said filter banks satisfy a perfect reconstruction property.
Method according to one of Claims 2 to 13, characterized in that the filter with impulse response is a rational filter, expressed in the form of a fraction of two polynomials.
Method according to Claim 14, characterized in that said impulse response is infinite.
Method according to one of Claims 8 to 15, characterized in that said predetermined number (δ) of adjacent subdiagonals is dependent on a type of filter bank used in the compression coding chosen.
Method according to Claim 16, characterized in that said predetermined number (δ) is between 1 and 5.
Method according to one of Claims 7 to 17, characterized in that the matrix elements ( L _n, R _n) resulting from said product matrix are stored in a memory and reused for all partially coded acoustic signals to be spatialized.
Method according to one of the preceding claims, characterized in that it furthermore comprises a step d) consisting in applying a synthesis filter bank to said first (L) and second output signals (R), before their rendering.
Method according to Claim 19, characterized in that it furthermore comprises a step c) prior to step d) consisting in conveying the first and second signals into a communication network, from a remote server and to a rendering device, in coded and spatialized form, and in that step b) is performed at said remote server.
Method according to Claim 19, characterized in that it furthermore comprises a step c) prior to step d) consisting in conveying the first and second signals into a communication network, from an audio bridge of a multipoint teleconferencing system, of centralized architecture, and to a rendering device of said teleconferencing system, in coded and spatialized form, and in that step b) is performed at said audio bridge.
Method according to Claim 19, characterized in that it furthermore comprises a step subsequent to step a) consisting in conveying said acoustic signals in compression-coded form into a communication network, from a remote server and to a rendering terminal, and in that steps b) and d) are performed at said rendering terminal.
Method according to one of the preceding claims, characterized in that a sound spatialization by binaural synthesis based on a linear decomposition of acoustic transfer functions is applied in step b).
Method according to Claim 23, characterized in that a matrix of filters of gains (G _i) is furthermore applied, in step b), to each partially coded acoustic signal (Si) ,
in that said first and second output signals are intended to be decoded into first and second rendered signals (1, r),
and in that the application of said matrix of gain filters amounts to applying a chosen time delay (ITD) between said first and second restitution signals.
Method according to one of Claims 1 to 22, characterized in that, in step a), more than two sets of weighting terms are obtained, and in that, in step b), more than two sets of filtering units are applied to the acoustic signals so as to deliver more than two output signals comprising encoded ambisonic signals.
System for processing sound data, characterized in that it comprises means for the implementation of the method according to one of the preceding claims.