FR2851352A1

FR2851352A1 - Continuous audio signal converting system for use in linguistic converter, has vocal synthesizer that synthesizes translated textual portions into synthesized portions that are mixed with respective residual portions by mixer

Info

Publication number: FR2851352A1
Application number: FR0301979A
Authority: FR
Inventors: Ghislain Moncomble
Original assignee: France Telecom SA
Current assignee: Orange SA
Priority date: 2003-02-18
Filing date: 2003-02-18
Publication date: 2004-08-20
Anticipated expiration: 2023-02-18
Also published as: FR2851352B1

Abstract

The system has a vocal recognition module (6) that converts vocal portions into textual portions in one language. A translation module (7) translates the textual portions into translated textual portions in another language. A vocal synthesizer (8) synthesizes the translated textual portions into synthesized portions, which a mixer (9) mixes with the respective residual portions to obtain a converted audio signal. An audio analyzer segments the received audio signal (SA) into audio portions (PAC), each of which yields a vocal portion (PT) and a residual portion (PR) of the same duration.

Description

System for converting a continuous audio signal into a translated and synthesized audio signal

The present invention relates to a system for converting a continuous audio signal of indefinite duration, containing vocal portions in a first language, into an audio signal translated and synthesized in a second language different from the first language.

Some current systems translate audio signals in real time. According to European patent application EP 1093059, a translation system is composed of a voice recognition unit, a translation engine, and a voice synthesis unit. The voice recognition unit continuously analyzes an incoming audio signal in a first language and progressively submits a text result to the translation engine. The translation engine continuously generates a text translation into a second language from the result of the voice recognition. The translation is supplied to the voice synthesis unit. Based on a comparison between the current translation and previous translations, the voice synthesis unit vocally synthesizes the result of the translation in the second language.

Recent research has improved the quality of translations. However, current translation systems distinguish neither between the actual voice signal and a residual signal, or background noise, in the incoming audio signal, nor the portions of the incoming audio signal that are not to be translated. These distinctions raise a timing problem for an incoming audio signal of indefinite duration, related to the processing time of the system and, more precisely, to translation times that differ according to the language used and according to the portions of the incoming audio signal to be translated. Such systems therefore cannot incorporate new techniques such as taking into account, in the translation, the context in which the words of the audio signal to be processed occur.

The objective of the present invention is to translate and synthesize a continuous audio signal of indefinite duration while overcoming any timing constraint due to the processing time of the conversion system and to the variations in the durations of the translated portions of the outgoing continuous signal compared with the durations of the corresponding untranslated portions of the incoming audio signal.

To achieve this objective, a system for converting a continuously incoming audio signal, including a voice signal in a first language, into an audio signal converted into a second language comprises means for segmenting the received audio signal into audio portions, means for filtering each audio portion into a voice portion and a residual portion of the same duration, voice recognition means for converting the voice portions into text portions in the first language, means for translating the text portions into translated text portions in the second language, and means for vocally synthesizing the translated text portions into synthesized portions whose durations differ from those of the respective audio portions. The system is characterized in that it comprises mixing means for transforming each residual portion and the respective synthesized portion into portions that have a duration less than the greater of the durations of the residual portion and of the respective synthesized portion and that are in phase, in order to mix them into a mixed portion making up the converted audio signal.
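The per-portion chain and the characterizing mixing step can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Stages` callables are hypothetical placeholders for the filtering, recognition (6), translation (7) and synthesis (8) modules; "in phase" is interpreted simply as both portions brought to the same length and started together; the common target length (the mean of the two lengths) is an assumption chosen because it never exceeds the greater duration; and time-stretching is done by plain linear-interpolation resampling, which also shifts pitch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Samples = List[float]  # mono PCM samples, assumed normalized to [-1.0, 1.0]


def stretch(samples: Samples, target_len: int) -> Samples:
    """Resample `samples` to exactly `target_len` samples by linear interpolation.

    Simplification: plain resampling also shifts pitch; a production system
    would use a pitch-preserving time-stretch algorithm instead.
    """
    if target_len <= 0:
        return []
    if not samples:
        return [0.0] * target_len
    if len(samples) == 1 or target_len == 1:
        return [samples[0]] * target_len
    scale = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def mix_portion(residual: Samples, synthesized: Samples,
                residual_gain: float = 0.5, voice_gain: float = 1.0) -> Samples:
    """Bring both portions to a common length, start them together (in phase),
    and sum them sample by sample."""
    # Assumed target length: the mean of the two lengths. It never exceeds the
    # longer portion and is strictly shorter whenever the lengths differ.
    target = (len(residual) + len(synthesized)) // 2
    r = stretch(residual, target)
    s = stretch(synthesized, target)
    return [residual_gain * a + voice_gain * b for a, b in zip(r, s)]


@dataclass
class Stages:
    """Hypothetical plug-in points for the filtering, recognition (6),
    translation (7) and synthesis (8) modules."""
    split: Callable[[Samples], Tuple[Samples, Samples]]  # -> (voice, residual)
    recognize: Callable[[Samples], str]                  # voice -> text, language 1
    translate: Callable[[str], str]                      # language 1 -> language 2
    synthesize: Callable[[str], Samples]                 # text -> speech samples


def convert(portions: List[Samples], st: Stages) -> List[Samples]:
    """Convert each audio portion of SA into a mixed portion of SAC."""
    converted = []
    for portion in portions:
        voice, residual = st.split(portion)  # both as long as `portion`
        speech = st.synthesize(st.translate(st.recognize(voice)))
        converted.append(mix_portion(residual, speech))
    return converted
```

With toy stages (for instance a `split` that returns the portion as voice and silence as residual, and a `synthesize` that emits two samples per character), a 10-sample portion whose synthesized speech is 14 samples long is mixed into a 12-sample output portion.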

The invention also remedies any excessive delay or advance of the converted audio signal relative to the incoming audio signal, or vice versa, in order to smooth out the time differences between these two signals. To this end, the system may comprise means for detecting a time excess of the sum of the durations of the portions of the converted signal over the sum of the durations of the corresponding portions of the incoming audio signal, so that the segmenting means deletes a portion of the incoming audio signal whose duration is substantially equal to said time excess, and/or means for detecting a time deficit of the sum of the durations of the portions of the converted signal relative to the sum of the durations of the corresponding portions of the incoming audio signal, so that the detecting means adds to the converted signal an audio portion whose duration is substantially equal to said time deficit.

Other characteristics and advantages of the present invention will appear more clearly on reading the following description of several preferred embodiments of the invention with reference to the corresponding appended drawings, in which:
- Figure 1 is a schematic block diagram of a translation and speech synthesis system according to a preferred embodiment of the invention, in the environment of a user terminal installation comprising several receiving devices and of several processing servers according to the invention;
- Figure 2 is an algorithm of steps executed by the system according to the preferred embodiment for converting a continuous audio signal into a translated and synthesized audio signal; and
- Figure 3 is a schematic block diagram of a preferred embodiment of a linguistic converter included in the conversion system according to the invention.
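The time excess/deficit compensation described at the start of this passage amounts to keeping two running sums and acting when they drift apart. The sketch below assumes durations in seconds and a hypothetical `tolerance` threshold (the text only speaks of a delay or advance that is too substantial); the returned action names are illustrative, not part of the patent.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class DriftTracker:
    """Running comparison of converted (SAC) and incoming (SA) durations, in seconds.

    `tolerance` is an assumed threshold: the patent only speaks of a delay or
    advance that is "too substantial", without giving a value.
    """
    tolerance: float = 2.0
    incoming_total: float = 0.0
    converted_total: float = 0.0

    def record(self, incoming_duration: float, converted_duration: float) -> None:
        """Add one incoming portion and its corresponding converted portion."""
        self.incoming_total += incoming_duration
        self.converted_total += converted_duration

    def correction(self) -> Tuple[Optional[str], float]:
        """Return ("skip_input", d), ("pad_output", d) or (None, 0.0)."""
        drift = self.converted_total - self.incoming_total
        if drift > self.tolerance:
            # Time excess: the segmenter drops about `drift` seconds of SA.
            # Counting the dropped audio as consumed input keeps the totals
            # level, so the same excess is not corrected twice.
            self.incoming_total += drift
            return ("skip_input", drift)
        if -drift > self.tolerance:
            # Time deficit: an audio portion of about `-drift` seconds
            # (e.g. silence) is appended to SAC.
            self.converted_total += -drift
            return ("pad_output", -drift)
        return (None, 0.0)
```

For example, with `tolerance=1.0`, a 10 s incoming portion converted into 12.5 s yields `("skip_input", 2.5)`; a later 5 s portion converted into 3 s then yields `("pad_output", 2.0)`.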

In the following, the term "channel" denotes either a channel or transmission path used to broadcast a radio program, or the program company broadcasting said program.

The term "program" designates a succession of radio broadcasts, also called magazines, broadcast by a given channel.

With reference to Figure 1, the system for converting a continuous audio signal including a voice signal into an audio signal by translation and speech synthesis according to a first embodiment of the invention essentially comprises a user terminal installation IT and an audio conversion server SE, or more generally several translation servers.

The user terminal installation IT comprises M receiving devices EQ1, ..., EQm, ..., EQM, with 1 ≤ m ≤ M. For example, one of the devices, EQ1, is a radio receiver capable of selectively receiving the broadcasts of several radio channels (stations). Another device, EQM, is a personal computer (PC), for example connected to a packet network such as the Internet, or connected to a cable network distributing radio programs.

The devices EQ1 to EQM are controlled through a distributed bus BU by a central processing unit UCit in the installation IT. As a variant, all or part of the bus BU can be replaced by a short-range radio link of the Bluetooth type or according to the 802.11b standard.

The central unit UCit essentially comprises a microcontroller connected to a communication interface IC and optionally to a keyboard and a screen. The central unit and the communication interface are physically housed in an enclosure independent of the devices. As a variant, the central unit UCit with its peripherals is integrated into the computer, the radio receiver, or the television receiver EQm.

In another variant, the terminal installation is reduced to a simple receiver. The central unit UCit constitutes a base module that can serve various home automation devices, such as those illustrated in Figure 1, as well as one or more telephones and mobile radiotelephones, at least one television receiver, an alarm control unit, etc. The communication interface IC is adapted to a telecommunications link LT connected to an access network RA of the installation IT. The link LT and the network RA can conventionally be a telephone line and the public switched telephone network (PSTN), itself connected to a high-speed packet transmission network RP such as the Internet. According to other variants, the telecommunications link LT is an xDSL (Digital Subscriber Line) line or an ISDN (Integrated Services Digital Network) line connected to the corresponding access network. The link LT can also coincide with one of the links serving one of the devices EQm through one of the distribution networks RD defined below.

Figure 1 also schematically shows the telecommunications system surrounding the user terminal installation IT. In particular, the references RD and TR respectively denote one or more distribution networks for scheduled radio and television broadcasts, and one or more headends that broadcast programs and are managed by various radio and television program companies. The set of distribution networks RD notably includes analog and/or digital broadcasting networks for broadcasting programs that can be received at least by the radio receiver EQ1. The set of distribution networks RD also includes the Internet, through which the computer EQm can receive radio broadcasts transmitted by certain program companies.

Each audio conversion server SE is connected to the broadcast distribution network RD and to the user terminal installation IT via the packet network RP and the access network RA. According to another variant, the functionalities of the conversion server SE are located in a headend TR or, more generally, the server SE is connected to the broadcast distribution networks RD. In this case, the translation and speech synthesis are carried out before broadcasting.

Scheduled programs, except live ones, are translated and synthesized slightly in advance, at least a few minutes before they are broadcast, which provides translation and speech synthesis with practically no time offset. Indeed, as explained below, the conversion by the conversion server of an audio signal SA including a voice signal in a first language, such as English, takes a certain time, which introduces a delay between the audio signal SA entering the system and the converted audio signal SAC leaving the system with a voice signal in a second language, such as French. When the conversion starts while a continuous audio signal is being listened to, the variable delay d due to the conversion can be filled by the continuous audio signal itself, which is then duplicated at the start of processing while being translated, or by an audible message such as "translation in progress", or simply by silence, or by any other predetermined audio sequence.
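The choice of filler for the initial delay d can be sketched as a small function that always returns exactly d seconds of audio. The sample rate, the strategy names, the interpretation of "duplicate" as replaying the incoming signal SA as received, and the looping of a predetermined sequence are all assumptions made for illustration.

```python
from typing import List, Optional

SAMPLE_RATE = 16_000  # assumed sample rate in Hz


def delay_filler(delay_s: float, strategy: str,
                 incoming: Optional[List[float]] = None,
                 sequence: Optional[List[float]] = None) -> List[float]:
    """Return exactly delay_s * SAMPLE_RATE samples to play while SAC is not ready.

    strategy: "duplicate" (replay SA as received), "sequence" (a predetermined
    message such as "translation in progress", looped) or "silence".
    """
    n = round(delay_s * SAMPLE_RATE)
    if strategy == "duplicate":
        src = (incoming or [])[:n]
        return src + [0.0] * (n - len(src))  # pad with silence if SA is short
    if strategy == "sequence":
        if not sequence:
            return [0.0] * n
        return [sequence[i % len(sequence)] for i in range(n)]
    if strategy == "silence":
        return [0.0] * n
    raise ValueError(f"unknown strategy: {strategy}")
```

Returning a fixed number of samples keeps the output timeline predictable regardless of which strategy the user or operator selects.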

The server SE comprises a central processing unit UCs and a set of peripherals including at least one database BD and a linguistic converter CL, described in detail below.

Many variants of the hardware distribution of the components of the user terminal installation IT and of the processing server ST can be derived from the embodiment of the invention illustrated in the figures.

According to an architecture variant called "thick client/thin server", as opposed to the preferred embodiment illustrated in Figure 1, called "thin client/thick server", the conversion system is partially included in the user terminal installation IT. The linguistic converter CL and the database BD are installed in the user installation IT, and the processing previously carried out by the central unit UCs is then executed in the processing unit UCit.

Other variants intermediate between the thin client/thick server architecture and the thick client/thin server architecture are conceivable. For example, the modules making up the linguistic converter CL are distributed between the user terminal installation IT and the server SE.

According to another embodiment, all of the processing described below is executed upstream of the broadcasting of the programs, in a headend TR. In this case, the user's terminal installation is reduced practically to the devices EQ1 to EQM.

The term "conversion parameters" refers to activation parameters PAC, a language identifier IL, and general parameters PAG. The activation parameters define an activation period of the conversion system according to the invention as a function of start and end dates and times and/or of the program type.

The language identifier IL defines a translation language. The general parameters define a list of user preferences, such as whether or not to translate advertisements and/or songs contained in the audio signal SA.
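The conversion parameters can be pictured as a simple record. In this sketch the field names, the use of an ISO 639-1 code for IL, the segment labels `"ad"`/`"song"`, and the OR combination of date windows and program types (the text says "and/or") are all assumptions; the optional `voice` field stands in for the linguistic parameters PL of a later variant.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Set, Tuple


@dataclass
class ConversionParameters:
    # PAC: activation windows (start, end) and/or program types that trigger conversion.
    activation_windows: List[Tuple[datetime, datetime]] = field(default_factory=list)
    activation_program_types: Set[str] = field(default_factory=set)
    # IL: target translation language, assumed here to be an ISO 639-1 code.
    language: str = "fr"
    # PAG: general preferences (illustrative subset).
    translate_ads: bool = False
    translate_songs: bool = False
    # PL (optional variant): hypothetical name of a predetermined synthetic voice.
    voice: Optional[str] = None

    def is_active(self, now: datetime, program_type: Optional[str] = None) -> bool:
        """Active inside any date window OR for a listed program type (assumed OR)."""
        in_window = any(start <= now < end for start, end in self.activation_windows)
        by_type = program_type is not None and program_type in self.activation_program_types
        return in_window or by_type

    def should_translate(self, segment_kind: str) -> bool:
        """segment_kind: "speech", "ad" or "song" (labels are assumptions)."""
        if segment_kind == "ad":
            return self.translate_ads
        if segment_kind == "song":
            return self.translate_songs
        return True
```

A record like this can be stored per user and per device in the database BD and consulted before each portion is sent to the linguistic converter CL.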

In another embodiment of the invention, a preference program is used to set the user's desired translation and speech synthesis preferences and store them in the database BD, in order to establish and store the conversion parameters and modify them if desired.

The preference program is executed by the server SE via the packet network RP, or directly by the central unit UCit of the terminal installation IT when the database BD is included in the installation IT.

For example, when the identifiers of several of the user's devices were registered at subscription, the preference program presents a complete list of the user's devices EQ1 to EQM on a display in the installation IT, for example the screen of the personal computer EQM, so that the user can select the device whose audio conversion parameters he wishes to modify. Default audio conversion parameters can be proposed to the user, or the current parameters if the user has already selected or modified them. A first page invites the user to enter activation parameters PAC, programmable according to dates and times or directly according to broadcasts chosen from a program schedule. Each time the user validates an entry page, the entered parameter values are sent to the server ST for storage in the database BD or, in the "thick client/thin server" architecture, directly to the database BD of the terminal installation. The same applies to the language identifiers IL.
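The default-versus-current behaviour of the preference program can be sketched with an in-memory store keyed by user and device. `PreferenceStore`, `Params` and `validate_page` are hypothetical names; a real deployment would persist rows in the database BD rather than in a dictionary.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class Params:
    """Simplified, hypothetical stand-in for the conversion parameters."""
    language: str = "fr"        # IL
    translate_ads: bool = False  # one PAG preference


class PreferenceStore:
    """Hypothetical in-memory stand-in for database BD, keyed by (IU, IEQm)."""

    def __init__(self, defaults: Params = Params()):
        self._defaults = defaults
        self._rows: Dict[Tuple[str, str], Params] = {}

    def current(self, user_id: str, device_id: str) -> Params:
        # Current parameters if the user already set them, otherwise the defaults.
        return self._rows.get((user_id, device_id), self._defaults)

    def validate_page(self, user_id: str, device_id: str, **changes) -> Params:
        # Called each time the user validates an entry page: merge and store.
        updated = replace(self.current(user_id, device_id), **changes)
        self._rows[(user_id, device_id)] = updated
        return updated
```

Keying by device as well as user reflects the step where the user first selects which of the registered devices EQ1 to EQM to configure.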

As a variant, linguistic parameters PL are added to the conversion parameters and characterize the sound of the one or more voices synthesized in the outgoing signal SAC. Each voice depends on a gender, an age group, prosodic features, an accent, a rhythm, etc., chosen by the user. As a variant, the linguistic parameters are those of at least one voice chosen from several predetermined voices. According to a simpler variant, the synthesized voice substantially reproduces the voice or voices of the speakers in the incoming audio signal.

If the terminal installation IT has no human-machine interface such as a mouse or keyboard, the parameters corresponding to the user's preferences are selected by default.

If the audio conversion according to the invention is carried out in a headend TR and the terminal installation IT is reduced essentially to the devices EQ1 to EQM, the user modifies the parameters by any other means, for example from a telephone or radiotelephone terminal, or through an operator when subscribing to the audio conversion service according to the invention.

Figure 2 shows an algorithm of steps E1 to E8 executed by the conversion system according to the preferred embodiment to convert a continuous audio signal SA, transmitted by the distribution network RD to one EQm of the receiving devices of the installation IT, into a converted audio signal SAC.

In step E1, the user U of the installation IT switches it on and selects an equipment EQm in order to activate the conversion system of the invention as a whole. For example, the central unit UCit is powered up by pressing a predetermined key on the remote control of the selected equipment EQm, when that equipment contains the central unit UCit, or by switching a button on the housing incorporating the central unit UCit to its on position. The unit UCit then reads from memory, and automatically transmits to the server SE, an identifier IU of the user U and an identifier IEQm of the equipment EQm selected by the user U.

In step E2, the server SE identifies the user U who has subscribed to the audio conversion service by comparing the received identifier IU with the identifiers of the subscribed users in the database BD. In a variant, the server SE asks the user to enter, on the installation IT, the identifier IU and the password assigned to him when subscribing to the service; both are then transmitted to the server SE for verification. Still in step E2, the central unit UCs reads the conversion parameters PAC, IL, PAG and possibly PL from the database BD in correspondence with the user identifier IU, in order to analyze them in the following steps and produce the conversion in the selected equipment EQm for the selected channel. The central unit UCs takes the activation parameters PAC into account so that the system is active only during the activation period they define.
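A minimal sketch of step E2, assuming the database BD can be modelled as an in-memory dictionary keyed by IU. The `Subscriber` record, the PBKDF2 password hashing and the half-open activation window are assumptions made for illustration; the description does not specify how passwords are stored or how the PAC period is encoded.

```python
import hashlib
import hmac
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class Subscriber:
    """Hypothetical record of the database BD, keyed by the user identifier IU."""
    salt: bytes
    password_hash: bytes
    language_id: str            # IL: target language
    activation_start: datetime  # PAC: start of the activation period
    activation_end: datetime    # PAC: end of the activation period


def hash_password(password: str, salt: bytes) -> bytes:
    # PBKDF2 is an assumption; the description does not specify a scheme.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


def identify(db: Dict[str, Subscriber], user_id: str,
             password: Optional[str], now: datetime) -> Optional[Subscriber]:
    """Step E2: return the subscriber's conversion parameters, or None."""
    sub = db.get(user_id)
    if sub is None:
        return None  # IU is not among the subscribers
    # Password variant: compare hashes in constant time.
    if password is not None and not hmac.compare_digest(
            hash_password(password, sub.salt), sub.password_hash):
        return None
    # The system is only active during the period defined by PAC.
    if not (sub.activation_start <= now < sub.activation_end):
        return None
    return sub
```

Passing `password=None` corresponds to the main embodiment, where the identifier IU sent automatically by UCit is enough.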

After the user has been identified in step E2, the central unit UCs in the server SE invites the user U to select a channel on the equipment EQm, which then transmits an identifier ICH of the selected channel to the server SE via the unit UCit, in step E3.

In a variant, the equipment EQm and the channel of the audio signal to be converted have been preselected by the user U, in particular when subscribing to the audio conversion service, and the identifiers IEQm and ICH have been registered in correspondence with the identifier IU of the user U in the database BD. In this variant, the equipment EQm is simply powered up and waits for an audio conversion.

In a variant where the receiving equipment is a personal computer connected to the Internet, selecting a sound broadcasting channel amounts to calling up a website managed by the channel and designated by a specific URL. In this case, once identified, the user enters his parameters through an HTML form.

In step E4, the central unit UCs checks whether the incoming audio signal SA identified by the channel identifier ICH is already being converted by the server SE, and whether one of the conversion parameters of that ongoing conversion matches one of the parameters PAC, PAG and IL selected by the user. When the conversion parameters match, processing continues at step E7. Otherwise, in step E5, the central unit UCs of the server SE selects the channel designated by the received identifier ICH among all the channels available at the server SE, in order to determine the incoming audio signal SA to be converted.
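Steps E4 and E5 amount to sharing one conversion among all users who request the same channel with compatible parameters. The sketch below assumes, for simplicity, that a match means the same channel ICH and the same target language IL; the `ConversionRegistry` class is hypothetical, and the description also mentions PAC and PAG as matching criteria.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class Conversion:
    channel_id: str                                   # ICH
    language_id: str                                  # IL
    listeners: Set[str] = field(default_factory=set)  # subscribed IU values


class ConversionRegistry:
    """Hypothetical registry of the conversions running on the server SE."""

    def __init__(self) -> None:
        self._active: Dict[Tuple[str, str], Conversion] = {}

    def attach(self, user_id: str, channel_id: str,
               language_id: str) -> Tuple[Conversion, bool]:
        """Return the conversion serving this user and whether it was just started."""
        key = (channel_id, language_id)
        conv = self._active.get(key)
        started = conv is None
        if conv is None:
            # Step E5: select the channel ICH and start a new conversion.
            conv = Conversion(channel_id, language_id)
            self._active[key] = conv
        # Otherwise step E4 found a match: reuse it and go on to step E7.
        conv.listeners.add(user_id)
        return conv, started
```

Keying the registry on the matching criteria makes the E4 check a single dictionary lookup, whatever the number of active conversions.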

In step E6, the linguistic converter CL filters the signal SA into voice portions PV and residual portions PB, transforms the voice portions PV of the audio signal SA into text portions PT, translates the text portions PT into translated text portions PTT according to the language identifier IL, synthesizes the translated text portions into synthesized portions PS, and mixes the residual portions PB with the respective synthesized portions PS into mixed portions PM. In doing so it adapts the durations of the mixed portions to the durations of the original residual portions in the signal SA, so as finally to compose an outgoing converted audio signal SAC. The operation of the converter is detailed later with reference to FIG. 3.
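The last operation of step E6, mixing a synthesized portion PS with its residual portion PB while keeping the residual portion's duration, can be sketched as follows. This is an illustration under stated assumptions: portions are plain lists of samples at one sampling rate, and duration adaptation is done by linear-interpolation resampling, which also shifts pitch. A practical system would more likely use a pitch-preserving time-scale modification such as WSOLA or PSOLA; the function names and gains are hypothetical.

```python
from typing import List, Sequence


def fit_length(samples: Sequence[float], target_len: int) -> List[float]:
    """Stretch or compress samples to target_len by linear interpolation.

    Plain resampling also changes pitch; it stands in here for a proper
    time-scale modification algorithm.
    """
    n = len(samples)
    if target_len <= 0:
        return []
    if n == 0:
        return [0.0] * target_len
    if n == 1 or target_len == 1:
        return [float(samples[0])] * target_len
    scale = (n - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, n - 1)]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out


def mix_portion(synthesized: Sequence[float], residual: Sequence[float],
                voice_gain: float = 1.0, residual_gain: float = 1.0) -> List[float]:
    """Mix PS with PB so that the mixed portion PM lasts as long as PB."""
    voice = fit_length(synthesized, len(residual))
    return [voice_gain * v + residual_gain * r for v, r in zip(voice, residual)]
```

Fitting the synthesized speech to the residual portion, rather than the reverse, keeps background music and sound effects at their original timing in SAC.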

In step E7, the server SE sends the converted signal SAC continuously to the terminal installation IT while the signal SA is being progressively processed. All the conversion steps up to step E7 introduce the delay needed to perform the audio conversion in the server SE.

In step E8, the terminal installation IT receives the converted audio signal SAC and broadcasts it over the bus BU to the selected equipment EQm, so that the user U listens to the converted audio signal SAC instead of the original audio signal SA of the channel identified by ICH.

The conversion by the system according to the invention ends when the activation period defined by the activation parameters PAC, which the central unit UCs monitors, expires. The central unit UCs also monitors any modification of the conversion parameters so as to process it in real time. Likewise, when the user's terminal installation is switched off, the central unit UCit transmits an end-of-session signal to the central unit UCs.
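The three session events described above (expiry of the PAC period, a parameter change, and the end-of-session signal from UCit) can be sketched as methods on a hypothetical `Session` object supervised by UCs. The polling style and field names are assumptions; the description does not say how UCs schedules its checks.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Session:
    """Hypothetical view of one conversion session supervised by UCs."""
    user_id: str              # IU
    channel_id: str           # ICH
    language_id: str          # IL
    activation_end: datetime  # end of the period defined by PAC
    active: bool = True
    stop_reason: Optional[str] = None

    def check_expiry(self, now: datetime) -> None:
        """Polled by UCs: stop once the activation period has expired."""
        if self.active and now >= self.activation_end:
            self._stop("activation period expired")

    def end_of_session(self) -> None:
        """Handles the end-of-session signal sent by UCit at switch-off."""
        if self.active:
            self._stop("terminal switched off")

    def update_parameters(self, language_id: Optional[str] = None,
                          activation_end: Optional[datetime] = None) -> None:
        """Applies a parameter change to the running session immediately."""
        if language_id is not None:
            self.language_id = language_id
        if activation_end is not None:
            self.activation_end = activation_end

    def _stop(self, reason: str) -> None:
        self.active = False
        self.stop_reason = reason
```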

The conversion server SE comprises a linguistic converter CL, whose operation is described below with reference to FIG. 3.

The linguistic converter according to the invention comprises an audio analyzer 1, an audio database 2, an audio filter 3, a language determination unit 4, a voice analyzer 5, a voice recognition module 6, a translation module 7, a speech synthesizer 8, a mixer 9, a contextual database 10, a segment context determination unit 11 and a time adaptation unit 12.

In what follows, the term "context" designates a list of key words or expressions and their equivalents. Each key word or expression characterizes a context that may be addressed in any multimedia document. Some contexts are combinations of contexts or, for news or regional contexts, combinations of contexts specified by a proper name, for example "Brittany Weather" or "Afghanistan War". The incoming audio signal SA received by the server SE is assumed to be digital; otherwise, the received analog audio signal is converted by an analog-to-digital converter included in the audio analyzer 1.

The audio database 2 stores pieces of audio data such as all or part of music tracks, songs, jingles or commercials, news flashes and sound effects. More generally, the database 2 holds previously recorded pieces of audio data, each preferably qualified by a type T, audio parameters PAS, a duration, contexts CA whose time boundaries are positioned relative to a fixed reference point of the audio data, such as the start of a song or jingle, and possibly translations or equivalents in given languages, in particular in the second language corresponding to the identifier IL selected by the user U. The type T defines the nature of the piece of audio data so as to distinguish, for example, a song, an advertisement, a dialogue, a report, a particular theme or a sound effect.

The audio signal SA of the selected channel received by the selected equipment is continuously and temporarily stored in a buffer memory contained in the audio analyzer 1. Like any audio signal, SA includes periodic time marks such as frame alignment words, packet synchronization words, and frame or line synchronization signals. These time marks are counted modulo a predetermined number and stored in the buffer memory of the audio analyzer 1 when the user selects the channel identifier ICH. They are refreshed periodically while the system processes the audio signal SA, in order to synchronize the processing of audio signal portions by the various modules of the translation system.
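Because the time marks are counted modulo a predetermined number, modules that compare positions need to undo the wrap-around. A minimal sketch, assuming marks arrive in order and at most one wrap occurs between two consecutive readings; the `TimeMarkCounter` class is hypothetical.

```python
class TimeMarkCounter:
    """Hypothetical unwrapping of time marks counted modulo `modulus`."""

    def __init__(self, modulus: int) -> None:
        self.modulus = modulus
        self.last = None  # last raw (modulo) count seen
        self.wraps = 0    # number of wrap-arounds detected so far

    def absolute(self, count: int) -> int:
        """Convert a modulo count into a monotonically increasing index.

        Assumes at most one wrap-around between two consecutive calls.
        """
        if self.last is not None and count < self.last:
            self.wraps += 1
        self.last = count
        return self.wraps * self.modulus + count
```

For example, with a modulus of 256 the raw marks 250, 254, 2, 10 become 250, 254, 258, 266.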

The audio analyzer 1 compares samples of the continuous incoming audio signal SA with samples of the audio data in the audio database 2. Sequences of samples substantially similar to those in the audio database allow the audio analyzer 1 to delimit portions of the incoming audio signal SA, called "known portions", that correspond to complete pieces or parts of pieces of audio data in the audio database 2. The other portions of the audio signal SA, which correspond to no piece of audio data or part of one in the audio database 2, are called "portions to be converted" PAC. For each known portion, the audio analyzer 1 decides whether to replace it in the audio signal by a replacement portion, according to the type T and duration of the current portion, the conversion parameters chosen by the user, the presence in the audio database 2 of a replacement portion associated with the known portion, and the duration of that replacement portion. The replacement portion is, for example, the translation of the known portion into the second language chosen by the user. Known portions that have been replaced by their replacement portions are called "replaced portions" PRT, as opposed to the other known portions, called "non-replaced portions" PNT. For example, song portions may be left as non-replaced portions, while advertisement portions may be replaced portions.
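Both operations of the audio analyzer 1 can be sketched under simple assumptions: detecting a known portion by sliding a database item over the signal and thresholding their normalized correlation, then deciding whether to replace it. The correlation measure, the 0.9 threshold and the 10 % duration tolerance are assumptions chosen for illustration; the description only requires "substantially similar" samples and a decision based on type, duration and the replacement available.

```python
import math
from typing import List, Optional, Sequence, Set


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Normalized correlation of two equal-length sample sequences, in [-1, 1]."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0


def find_known(signal: Sequence[float], reference: Sequence[float],
               threshold: float = 0.9) -> List[int]:
    """Start offsets where `reference` (a database item) occurs in `signal`."""
    n, hits, i = len(reference), [], 0
    while i + n <= len(signal):
        if similarity(signal[i:i + n], reference) >= threshold:
            hits.append(i)  # known portion [i, i + n)
            i += n
        else:
            i += 1
    return hits


def should_replace(portion_type: str, duration: float,
                   replacement_duration: Optional[float],
                   replace_types: Set[str], tolerance: float = 0.1) -> bool:
    """Decide whether a known portion becomes a replaced portion PRT."""
    if replacement_duration is None or portion_type not in replace_types:
        return False  # no replacement in database 2, or type excluded by the user
    # Assumption: the replacement must fit the original duration within `tolerance`.
    return abs(replacement_duration - duration) <= tolerance * duration
```

A brute-force scan like this is quadratic; a production analyzer would more plausibly rely on audio fingerprints, but the sketch keeps the decision logic visible.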

Each portion is stored in the buffer memory together with its characteristics, such as its type T and duration.

The type T is applied to the speech synthesizer 8 so that it chooses the voice synthesis for the current portion to be converted according to the type T of that portion.

The non-replaced portions PNT and the replaced portions PRT are applied directly to the time adaptation unit 12, and also to the audio filter 3 followed by the voice recognition module 6 and the segment context determination unit 11, so that the contexts of these portions feed the contextual database 10. The portions to be converted PAC are transmitted only to the audio filter 3.

In a variant, if parameters PAS and a context CA are read for the current known portion PNT or PRT from the audio database 2, the parameters PAS and the context CA are applied directly to the unit 11, in particular so that they feed the contextual database 10, and the corresponding portion is not transmitted to the audio filter 3. The audio analyzer 1 thus helps improve the quality of context determination, explained below, by directing the parameters PAS and CA associated with the audio data in the audio database 2 to the contextual database 10.
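The routing of portions out of the audio analyzer 1 can be summarized as a small dispatch function. Module numbers follow FIG. 3; the string labels and the `route` helper are hypothetical and only make the two cases (main embodiment and the PAS/CA variant) explicit.

```python
from enum import Enum
from typing import List


class PortionKind(Enum):
    TO_CONVERT = "PAC"    # portion to be converted
    NOT_REPLACED = "PNT"  # known portion kept as is
    REPLACED = "PRT"      # known portion replaced from database 2


def route(kind: PortionKind, has_db_metadata: bool = False) -> List[str]:
    """Return the FIG. 3 modules a portion is sent to (labels are illustrative)."""
    if kind is PortionKind.TO_CONVERT:
        return ["audio_filter_3"]
    targets = ["time_adaptation_12"]
    if has_db_metadata:
        # Variant: PAS and CA read from database 2 go straight to unit 11,
        # and the portion itself skips the audio filter 3.
        targets.append("context_unit_11")
    else:
        # Main case: filter 3, then recognition module 6, then unit 11.
        targets.append("audio_filter_3")
    return targets
```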

The filter 3 uses spectral subtraction or adaptive filtering to split the portion PAC, PNT or PRT of the audio signal SA supplied by the audio analyzer 1 into a part containing only voice, called the "voice portion" PV, and a residual part containing background noise, called the "residual portion" PR. The filter 3 is, for example, based on linear predictive coding (LPC) analysis and isolates different acoustic components of an audio signal, such as voice, vocal noise and pure music.
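Spectral subtraction, one of the two techniques named for the filter 3, can be sketched on a single frame with NumPy. The sketch assumes a noise estimate of the same length taken from a stretch believed to contain no speech; the spectral floor value is an arbitrary choice. Defining the residual as the original minus the voice estimate guarantees that PV and PR add back up to the input frame. A real filter would work frame by frame with overlapping windows, and the LPC-based variant mentioned above is not shown.

```python
import numpy as np


def spectral_subtraction(portion: np.ndarray, noise_estimate: np.ndarray,
                         floor: float = 0.02):
    """Split one frame into a voice part PV and a residual part PR.

    portion, noise_estimate: 1-D real arrays of equal length.
    """
    spectrum = np.fft.rfft(portion)
    noise_mag = np.abs(np.fft.rfft(noise_estimate))
    mag = np.abs(spectrum)
    # Subtract the noise magnitude, keeping a small floor to limit "musical noise".
    voice_mag = np.maximum(mag - noise_mag, floor * mag)
    # Reuse the phase of the noisy input for the voice estimate.
    voice = np.fft.irfft(voice_mag * np.exp(1j * np.angle(spectrum)), n=len(portion))
    residual = portion - voice
    return voice, residual
```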

To improve context determination, detailed below, the filter 3 returns to the audio analyzer 1 the current residual portion PR, which contains the residual non-voice part of the current portion to be converted PAC. The analyzer compares the residual portion with the portions of audio data in the audio database 2 and qualifies the residual portion PR with the parameters PAS and context CA of a portion of audio data similar or substantially identical to it. To build up the audio data in the database 2 quickly, the machines hosting the management means of the audio database 2 can be shared. In another variant, the management means is associated with the audio analyzer 1.

The current voice portion PV is processed in parallel by the voice analyzer 5 and the voice recognition module 6.

The voice analyzer 5 analyzes the current voice portion PV to determine voice parameters PVS characterizing the sections of the voice portion PV attributed to the respective speakers. The list of voice parameters is not fixed but includes, among others, acoustic and particularly prosodic parameters such as the vibration (fundamental) frequency, intensity, speech rate and timbre, as well as other parameters such as the relative age of the speaker.
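Two of the prosodic parameters listed above, intensity and vibration (fundamental) frequency, can be estimated per frame with textbook methods. The sketch below uses RMS level for intensity and the peak of the autocorrelation within a plausible pitch range for F0; these method choices and the 60 to 400 Hz search range are assumptions, not the voice analyzer 5's documented algorithm.

```python
import math
from typing import Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Intensity of a frame as its root-mean-square level."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_f0(frame: Sequence[float], sample_rate: float,
                f_min: float = 60.0, f_max: float = 400.0) -> float:
    """Fundamental frequency from the autocorrelation peak (0.0 if none found)."""
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_r = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return sample_rate / best_lag if best_lag else 0.0
```

On a 200 Hz sine sampled at 8 kHz, the autocorrelation peaks at a lag of 40 samples, giving 8000 / 40 = 200 Hz.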

In parallel with the voice analysis, the current voice portion PV is submitted to the voice recognition module 6. When the language of the audio signal is considered unknown, the language determination unit 4 is inserted between the filter 3 and the voice recognition module 6. The unit 4 dynamically determines the language of the current voice portion PV of the audio signal SA when it is not known beforehand. For multilingual news, for example, the language of the voice portions is thus recognized continuously. If the language of the audio signal, and therefore that of the voice portion, is predetermined, this language is taken as the default language and the language determination unit 4 is not needed. The voice recognition module 6 transforms the voice portion PV into a text portion PT.

Several voice recognition modules can be used to optimize the audio conversion. In a variant, the text portion PT is passed through a spelling corrector placed at the output of module 6.

In another variant, the module 6 takes into account the results of a context study carried out beforehand by the segment context determination unit 11 to refine the voice recognition and translation of the current voice portion PV. The context is expressed as syntactic elements, that is, key words and expressions, that have a high probability of appearing in a portion of the audio signal. For example, the context of a fairly periodic or frequent commercial or news item in an audio signal broadcast by a radio station can be predicted from the station's detailed program schedule, or deduced from previous commercials or news items.
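One common way for a recognizer to use such context is to rescore its candidate transcriptions, boosting those that contain the expected key words. The sketch assumes the recognizer exposes an n-best list of `(text, score)` pairs with higher scores being better; the `rescore` function and the fixed boost per key-word hit are hypothetical.

```python
from typing import Iterable, List, Tuple


def rescore(hypotheses: List[Tuple[str, float]],
            context_keywords: Iterable[str], boost: float = 0.5) -> str:
    """Return the recognition hypothesis with the best context-biased score.

    hypotheses: (text, score) pairs from the recognizer, higher score is better.
    context_keywords: lower-case key words or expressions of the current context.
    """
    keywords = list(context_keywords)

    def biased_score(item: Tuple[str, float]) -> float:
        text, score = item
        padded = f" {' '.join(text.lower().split())} "
        # Whole-word match: pad with spaces so "rain" does not match "brain".
        hits = sum(1 for kw in keywords if f" {kw} " in padded)
        return score + boost * hits

    return max(hypotheses, key=biased_score)[0]
```

With the context "Brittany Weather", "the weather in brittany" can thus overtake an acoustically preferred "the whether in brittany".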

The context determination unit 11 further comprises a segmentation unit and a buffer memory. The segmentation unit receives the text portions PT produced by the module 6. The buffer memory continuously stores the current text portion PT for a duration DS shorter than the duration of a portion and longer than the duration needed to obtain the minimum number of words for contextual analysis, for example at least about thirty words. The segmentation unit divides the current text portion PT into periodic time segments ..., Sn, ... as the current text portion is received. The predetermined segment duration is a trade-off between the quality of analysis of the system, that is, the relevance of the voice recognition with respect to the meaning of the words in the audio signal, and the processing time of the system. For example, a segment duration of 20 seconds, compared with 1 minute, reduces the processing time at the expense of voice recognition quality. A minimum duration of 15 seconds is typically enough for the system to ensure a minimum level of quality.
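The time-based segmentation can be sketched on timestamped words. The sketch assumes the recognizer supplies each word with its start time in seconds, and it closes a segment only once the segment both spans the chosen duration and holds the minimum word count; `segment_words` and its defaults (20 seconds, 30 words, taken from the figures above) are illustrative rather than prescribed.

```python
from typing import List, Sequence, Tuple


def segment_words(words: Sequence[Tuple[str, float]],
                  segment_duration: float = 20.0,
                  min_words: int = 30) -> List[List[str]]:
    """Group timestamped words into consecutive segments S1, S2, ...

    words: (word, start_time_seconds) pairs in time order.
    A segment is closed once it spans at least segment_duration seconds
    and holds at least min_words words.
    """
    segments: List[List[str]] = []
    current: List[str] = []
    seg_start = None
    for word, t in words:
        if seg_start is None:
            seg_start = t
        current.append(word)
        if t - seg_start >= segment_duration and len(current) >= min_words:
            segments.append(current)
            current, seg_start = [], None
    if current:
        segments.append(current)  # trailing, possibly short, segment
    return segments
```

Raising `segment_duration` gives the context analysis more words per segment at the cost of a longer wait before each segment is available, which is the trade-off described above.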

In another preferred embodiment of the invention, the segmentation is not based on a time criterion but on a syntactic element such as a word, a group of words or a sentence. A syntactic element is, for example, defined by a sound level above a predetermined threshold, framed by intervals of the audio signal whose sound level is below that threshold and which are treated as silences.
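The threshold rule above translates directly into a frame-energy detector. In this sketch the sound level is measured as the RMS of fixed-length frames, an assumption since the description does not define how the level is computed; a practical detector would also require a minimum silence length before closing an element.

```python
import math
from typing import List, Optional, Sequence, Tuple


def split_on_silence(samples: Sequence[float], frame_len: int,
                     threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of the syntactic elements.

    Consecutive frames whose RMS level exceeds `threshold` form one element;
    frames at or below the threshold are treated as silence. A trailing
    partial frame is ignored.
    """
    elements: List[Tuple[int, int]] = []
    start: Optional[int] = None
    last_end = (len(samples) // frame_len) * frame_len
    for i in range(0, last_end, frame_len):
        frame = samples[i:i + frame_len]
        level = math.sqrt(sum(x * x for x in frame) / frame_len)
        if level > threshold and start is None:
            start = i                    # an element begins
        elif level <= threshold and start is not None:
            elements.append((start, i))  # silence closes the element
            start = None
    if start is not None:
        elements.append((start, last_end))
    return elements
```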

Various contexts in the form of key words and expressions, as defined above, are deduced from segments preceding the current segment Sn and/or from the context study. They constitute contexts CSn that are prestored and managed in the contextual database 10. This database is linked to the voice recognition module 6, to the translation module 7 and to the segment context determination unit 11. The contexts in the contextual database 10 are also completed and refined by automatically consulting external contextual databases according to the contexts previously detected. The lists of contexts contained in these external databases are updated manually and/or automatically. Each context is characterized by additional information, such as the tone of the context (for example serious or cheerful) and the population segment concerned (for example children, managers or workers).

Two sets of contexts are involved in the conversion of the voice portions PV into text portions PT by the voice recognition module 6: the contexts written into database 10 as unit 11 determines them, and the contexts contained in the external contextual databases. Both sets are also involved in the translation of text portions by the translation module 7 described below. The contexts written into the contextual database 10 are thus progressively improved during the processing of the text portions PT, which facilitates voice recognition in module 6 and translation in module 7. Module 6 may rely on natural language understanding (NLU) software.

Unit 11 determines one or more contexts CSn of the current text segment Sn as a function of two inputs: the current speech parameters PVS supplied by analyzer 5, and the current text segment. In a preferred variant, unit 11 also uses previously established and stored contexts to determine the context. These contexts help increase the relevance of new segment contexts, which in turn contribute to determining the contexts of subsequent segments.

In another variant, a general context is determined initially, before any processing of the text portion PT. It depends on parameters external to the system, linked among other things to the source of the audio signal SA. Program schedules or channel information enrich the contextual database 10, as does any information likely to inform the context of the first text segments of a text portion. Unit 11 bases this general context on the context of a given number of segments preceding the current segment Sn. It does so when two conditions hold: the context of the immediately preceding segment has not been determined, and the preceding segment is the first segment of the current portion PT to be processed by the audio conversion system.
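Unit 11's behavior can be illustrated with a minimal keyword-based sketch. It scores each known context by key-word hits in the current segment and adds a bonus for contexts found in earlier segments. When nothing scores, it falls back to a general context derived from external data such as a program schedule. The sketch ignores the speech parameters PVS. The database content, the scoring rule, `carry_weight` and `determine_contexts` are all hypothetical; the source does not specify how contexts are scored.

```python
from collections import Counter

# Hypothetical context database: context name -> key words and expressions.
CONTEXT_DB = {
    "sport": {"match", "goal", "team", "score"},
    "finance": {"market", "shares", "rate", "bank"},
    "weather": {"rain", "storm", "temperature", "forecast"},
}


def determine_contexts(segment_words, previous_contexts=None, general_context=None,
                       db=CONTEXT_DB, carry_weight=0.5, top_k=1):
    """Return the top_k context names for the current segment Sn."""
    words = {w.lower() for w in segment_words}
    scores = Counter()
    for name, keywords in db.items():
        hits = len(words & keywords)  # key words of this context found in Sn
        if hits:
            scores[name] += hits
    for name in previous_contexts or []:
        scores[name] += carry_weight  # contexts of earlier segments carry over
    if not scores:
        # No evidence at all: use the general context (e.g. from a program schedule).
        return [general_context] if general_context else []
    return [name for name, _ in scores.most_common(top_k)]
```

Weighting previous contexts lets a stable topic persist across segments that contain few key words.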

As a variant, a voice portion PV may be processed several times by the functional means 4, 5, 6, 7, 10 and 11. This optimizes two steps: the conversion of a voice portion PV into a text portion PT by the voice recognition module 6, and the translation of a text portion PT into a translated text portion PTT by the translation module 7. For example, passing a voice portion two to K times through means 4, 5, 6, 7 and 11 refines the relevance of the contexts of the text segments derived from this voice portion. The number K of processing cycles for a voice portion depends on three factors: the time constraints, the quality of each processing step in means 4, 5, 6, 7, 10 and 11, and the capacity of the buffer memory in unit 11.

The language converter CL comprises at least one translation module 7 and at least one speech synthesizer 8. The translation module 7 is activated when unit 4 finds that the second language differs from the language of the current voice portion PV determined by unit 4. The second language is designated by the language identifier IL, read from the database BD in correspondence with the user identifier IU and the equipment EQm. The translation module 7 translates the text portions PT into translated text portions PTT in that designated language. These translated portions are applied to unit 11, to enrich the contextual database 10 (which is preferably multilingual), and to the speech synthesizer 8. Preferably, the voice recognition module 6 and the translation module 7 share a common context analysis to reduce processing times.

The translated text portions PTT in the second language can be handled like the text portions PT in the first language. They can be applied to unit 11 to be segmented into periodic temporal text segments. Unit 11 then determines the contexts of these translated text segments as a function of the respective speech parameters PVS and of the segments themselves, and these contexts enrich the contextual database 10.

The translated text portion PTT is then submitted to the speech synthesizer 8, which transforms it into a synthesized portion PS. This happens only when the type T of the portion indicates that it must be converted. Such a portion is therefore not one that analyzer 1 recognizes in database 2, such as a translated replaced portion PRT or an untranslated portion PNT. Accordingly, analyzer 1 inhibits the operation of the speech synthesizer 8, and as a variant the chain of modules 4, 6 and 7, in response to each portion PNT or PRT that it detects in the incoming audio signal SA and that is known to database 2.
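The routing decisions described above can be summarized in a minimal sketch:

- portions known to database 2 (PNT, PRT) bypass the chain;
- a portion to convert is translated only when the target language differs from the detected language;
- the result is then synthesized.

The `Portion` type and the `translate`/`synthesize` callables are hypothetical stand-ins for modules 7 and 8. Passing untranslated text to the synthesizer when no translation is needed is also an assumption, since the text does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class Portion:
    kind: str       # "PAC" (to convert), "PNT" or "PRT" (known to database 2)
    language: str   # language detected for the voice portion
    text: str = ""  # recognized text PT (meaningful only for PAC)


def route(portion, target_language, translate, synthesize):
    """Return ("bypass", None) or ("synthesized", audio) for one portion."""
    if portion.kind in ("PNT", "PRT"):
        return ("bypass", None)  # synthesizer 8 is inhibited for known portions
    text = portion.text
    if portion.language != target_language:
        # Translation module 7 is activated only when the languages differ.
        text = translate(text, portion.language, target_language)
    return ("synthesized", synthesize(text, target_language))
```

Injecting `translate` and `synthesize` as callables keeps the routing logic independent of any particular translation or speech synthesis engine.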

For speech synthesis, the speech synthesizer 8 comprises in particular a buffer memory for storing the translated text portion PTT to be synthesized. Synthesis relies on the speech parameters PVS, which result from the continuous analysis of the voice portion PV by the voice analyzer 5 and are supplied continuously. It also preferably relies on linguistic parameters PL read from the database BD, when the user has chosen them.

The voice analyzer 5 may determine speech parameters characterizing sections of a voice portion PV that are attributed to different speakers. In that case, the synthesizer 8 continuously synthesizes the respective translated text portion PTT according to these speech parameters PVS. The speech synthesizer 8 may also decrease or increase the speech rate of the synthesized portion PS by 10% in order to reduce its difference in duration with respect to the translated text portion.

Because of speech synthesis and translation, the synthesized portion PS has a different duration from the corresponding original voice portion PV in the audio signal SA. This difference in duration depends mainly on the target language chosen by the user.

The residual portion PR has the same duration as the corresponding voice portion PV. Mixer 9 starts from the residual portion PR and the synthesized portion PS, which have different durations. It transforms them into two portions with a common duration that is less than the greater of the two durations, and preferably lies between them. For this synchronization, mixer 9 determines mixing parameters characterizing an audio signal, such as the rate, the fundamental frequency and/or the harmonics of the audio signal. By acting on these mixing parameters, the mixer increases the duration of one portion (generally the residual portion) and decreases the duration of the other (generally the synthesized portion). The two portions then have the same duration and are in phase. The mixer then mixes them into a mixed portion PM of the converted audio signal SAC.

To bring a residual portion PR and the corresponding synthesized portion PS to the same duration, the mixer may, for example, compute the average of their durations. It then acts on the mixing parameters of both portions so that each lasts exactly that average duration: one portion is compressed in time and the other is stretched. The mixer's action on the mixing parameters is limited, however, because both portions must remain as intelligible to the human ear as the corresponding original portions. To this end, the change in the duration of these portions does not exceed a predetermined percentage, for example 10%.
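The duration arithmetic of the mixer can be sketched as follows, using the averaging rule and the 10% limit from the text. It returns a target duration and one time-scale factor per portion (target divided by original duration). The actual time-stretching algorithm is not specified by the source; a WSOLA-style or phase-vocoder time-scale modification would be one option, and it is not shown here. `equalize_durations` and its parameters are hypothetical. The `synth_only` flag covers the variant in which only PS is scaled to the duration of PR.

```python
def equalize_durations(d_residual, d_synth, max_change=0.10, synth_only=False):
    """Return (target, factor_residual, factor_synth), or None if infeasible.

    A factor above 1 stretches a portion and below 1 compresses it. None means
    that at least one portion would change by more than max_change, so the
    caller must fall back to segment-by-segment processing.
    """
    if d_residual <= 0 or d_synth <= 0:
        raise ValueError("durations must be positive")
    # Default: meet at the mean duration. Variant: scale only PS to match PR.
    target = d_residual if synth_only else (d_residual + d_synth) / 2
    f_res = target / d_residual
    f_syn = target / d_synth
    eps = 1e-9  # tolerance for floating-point rounding exactly at the limit
    if abs(f_res - 1) > max_change + eps or abs(f_syn - 1) > max_change + eps:
        return None
    return target, f_res, f_syn
```

For example, a 10 s residual portion and an 11 s synthesized portion meet at 10.5 s: PR is stretched by 5% and PS is compressed by about 4.5%. A 10 s / 15 s pair would need a 25% stretch, so the function returns None.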

As a variant, the mixer acts only on the mixing parameters of the synthesized portion PS, so that its duration becomes equal to that of the residual signal portion. Many other variants fall within the scope of the invention for giving corresponding residual and synthesized portions the same duration when their initial durations differ.

In a first case, the voice portion PV exceeds a predetermined duration of about 2 minutes, and the system's translation time is too long. In a second case, the predetermined percentage is not sufficient to equalize the durations of the synthesized portion PS and the residual portion PR, so the mixer cannot process PS and the corresponding PR as they stand. In both cases, the mixer splits the synthesized portion and the corresponding residual portion into the same number of segments. It then processes each synthesized segment together with the corresponding residual segment, rather than the whole portions.

To keep the synthesized voice from sounding monotonous, mixer 9 has several means available. The first is to mix an additional noise signal into the synthesized portion PS. This additional noise signal is a portion of a piece contained in the audio database 2, preferably chosen according to the contexts of the synthesized portion PS. Its duration is substantially equal to that of the corresponding synthesized portion. The mixer combines the synthesized portion PS and the additional noise signal into a mixed portion in the same way as described above for the synthesized portion PS and the residual portion PR. A second means is to modify the speech parameters of the synthesized portion when these have not varied for a predetermined time, on the order of 90 seconds.
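The first anti-monotony means can be sketched as additive mixing of a background clip into the synthesized samples, under simplifying assumptions:

- samples are floats in [-1, 1];
- the clip from database 2 is simply cut or zero-padded to the length of PS, whereas the text time-scales it like PR;
- the background is attenuated by a fixed gain.

`mix_with_background` and `gain` are hypothetical names.

```python
def mix_with_background(synth, background, gain=0.2):
    """Add an attenuated background clip to the synthesized samples.

    The background is cut or zero-padded to len(synth), and each mixed
    sample is clipped to [-1.0, 1.0] to avoid overflow.
    """
    padding = [0.0] * max(0, len(synth) - len(background))
    bg = list(background[:len(synth)]) + padding
    return [max(-1.0, min(1.0, s + gain * b)) for s, b in zip(synth, bg)]
```

A low gain keeps the translated speech dominant while adding some acoustic variety.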

The temporal adaptation unit 12 receives the mixed portions PM, together with the non-replaced portions PNT and the replaced portions PRT. The PNT and PRT portions are written, in parallel with the audio signal SA, into a buffer memory in unit 12. The system takes longer to convert the portions PAC than to pass on the non-replaced or replaced portions. The adaptation unit therefore delays the portions PNT and PRT so that they are joined to the portions PM in the same order as the initial portions PNT, PRT and PAC had at audio analyzer 1. In this way, the converted audio signal SAC is composed continuously at the output of unit 12.
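The reordering performed by unit 12 can be realized with a reorder buffer keyed by the position each portion had at audio analyzer 1. The sketch below assumes that analyzer 1 tags portions with unique, consecutive sequence numbers starting at 0. The class name and this numbering scheme are illustrative assumptions.

```python
import heapq


class TemporalAdaptationBuffer:
    """Emit portions in their original order, whatever order they finish in.

    Bypassed portions (PNT, PRT) are ready almost immediately, while converted
    portions (PM) arrive later. Early arrivals are held until every portion
    preceding them has been emitted.
    """

    def __init__(self):
        self._heap = []  # (sequence number, portion) waiting to be emitted
        self._next = 0   # next sequence number expected on the output

    def push(self, seq, portion):
        """Store one finished portion; return the portions now ready, in order."""
        heapq.heappush(self._heap, (seq, portion))
        out = []
        while self._heap and self._heap[0][0] == self._next:
            out.append(heapq.heappop(self._heap)[1])
            self._next += 1
        return out
```

For instance, if PNT#1 and PRT#2 finish before PM#0, nothing is emitted until PM#0 arrives. All three are then released in order 0, 1, 2.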

Unit 12 compares two sums: a first sum of the durations of the portions of the converted signal SAC, and a second sum of the durations of the corresponding portions of the incoming audio signal SA.

When the first sum exceeds the second, the adaptation unit 12 shortens the converted output signal SAC. It sends a suppression signal SS to audio analyzer 1, which orders the deletion of portions of the audio signal SA whose duration is substantially equal to the difference between the two sums. Analyzer 1 then transmits the audio signal SA as a restricted audio signal SAR. Analyzer 1 preferably selects the portions to delete according to their type T. These may be, for example, commercials that follow the signal SS and whose duration is substantially equal to the difference between the two sums.

Conversely, when the first sum is less than the second, the adaptation unit 12 lengthens the converted signal SAC by adding audio portions to it. Their duration is substantially equal to the difference between the second and first sums. This makes up the time deficit of the converted signal relative to the incoming audio signal and removes the silences that unit 12 detects in the converted signal. These audio portions are extracted from pieces contained in the audio database 2, selected for example according to their duration and to the context of the portions of the audio signal SA.
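The surplus and deficit rule can be sketched as follows, assuming per-portion durations in seconds and a hypothetical tolerance `tol` below which no correction is made. `balance_action` and `pick_deletion` are illustrative names. Choosing the commercial whose length is closest to the surplus is only one reading of "substantially equal".

```python
def balance_action(converted_durations, incoming_durations, tol=0.5):
    """Compare the first sum (SAC portions) with the second sum (SA portions).

    Returns ("delete", surplus) when SAC runs longer, ("insert", deficit)
    when it runs shorter, and ("none", 0.0) when the gap is within tol.
    """
    first_sum = sum(converted_durations)
    second_sum = sum(incoming_durations)
    diff = first_sum - second_sum
    if diff > tol:
        return ("delete", diff)   # ask analyzer 1 (signal SS) to drop this much of SA
    if diff < -tol:
        return ("insert", -diff)  # add filler audio from database 2
    return ("none", 0.0)


def pick_deletion(upcoming, surplus):
    """Among upcoming (type, duration) portions of SA, pick the commercial
    whose duration is closest to the surplus, or None if there is none."""
    ads = [p for p in upcoming if p[0] == "commercial"]
    return min(ads, key=lambda p: abs(p[1] - surplus), default=None)
```

The tolerance keeps small timing jitter from triggering deletions or insertions.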

The converted audio signal SAC is then sent continuously to the user's terminal installation IT.

In another embodiment, the system processes the audio signal SA into different languages before it is even broadcast. From the user's point of view, the system then provides real-time translation.

Claims

1 - System for converting a continuous incoming audio signal (SA), including a voice signal in a first language, into an audio signal converted into a second language. Said system comprises means (1) for segmenting the received audio signal (SA) into audio portions (PAC), means (3) for filtering each audio portion (PAC) into a vocal portion (PV) and a residual portion (PR) of the same duration, voice recognition means (6) converting the vocal portions (PV) into text portions (PT) in the first language, means (7) for translating the text portions (PT) into translated text portions (PTT) in the second language, and means (8) for synthesizing the translated text portions (PTT) by voice into synthesized portions (PS) whose durations differ respectively from those of the audio portions. The system is characterized in that it comprises mixing means (9) for transforming each residual portion (PR) and the respective synthesized portion (PS) into portions which are in phase and have a duration shorter than the greater of the durations of the residual portion (PR) and of the respective synthesized portion (PS), in order to mix them into a mixed portion (PM) composing the converted audio signal (SAC).

2 - System according to claim 1, comprising means (12) for detecting a temporal excess of a sum of the durations of the portions of the converted signal (SAC) relative to a sum of the durations of the corresponding portions of the incoming audio signal (SA), so that the means (1) for segmenting deletes a portion of the incoming audio signal whose duration is substantially equal to said temporal excess.

3 - System according to claim 1 or 2, comprising means (12) for detecting a time deficit of a sum of the durations of the portions of the converted signal (SAC) relative to a sum of the durations of the corresponding portions of the incoming audio signal (SA), so that the detecting means (12) adds to the converted signal (SAC) a portion whose duration is substantially equal to said time deficit.

4 - System according to any one of claims 1 to 3, wherein the means (1) for segmenting delimits, in the incoming audio signal (SA), portions (PNT, PRT) which are contained in the database (2) and other audio portions (PAC) which are applied to the filtering means (3).

5 - System according to claim 4, wherein portions contained in the database (2) and delimited in the incoming audio signal (SA) are replaced respectively by replacement portions (PRT).

6 - System according to claim 4 or 5, wherein the means (1) for segmenting inhibits the operation of the synthesizing means (8) in response to each portion (PNT, PRT) of the incoming audio signal (SA) that is known from the database (2).

7 - System according to any one of claims 1 to 6, comprising means (5) for analyzing the vocal portions (PV) to determine speech parameters (PVS), means (11) for segmenting the text portions (PT, PTT) in the first language, or in the first and second languages, into periodic temporal text segments, and means (11) for determining contexts (CSn) of the text segments as a function of the respective speech parameters (PVS) and of the text segments.

8 - System according to claim 7, wherein the analyzing means (5) determines speech parameters characterizing sections respectively attributed to speakers and included in a vocal portion (PV), so that the synthesizing means (8) continuously synthesizes the respective translated text portion (PTT) according to the speech parameters (PVS).

9 - System according to any one of claims 1 to 8, included in a server (SE).

10 - System according to any one of claims 1 to 8, included at least partially in a user terminal installation (IT).