EP1952388B1

EP1952388B1 - System and method for synthesizing speech by concatenating acoustic units

Info

Publication number: EP1952388B1
Application number: EP06808137A
Authority: EP
Inventors: Edouard Hinard; Cédric BOIDIN; Laurent Roussarie
Original assignee: France Telecom SA
Current assignee: Orange SA
Priority date: 2005-10-24
Filing date: 2006-09-14
Publication date: 2009-04-01
Anticipated expiration: 2026-09-14
Also published as: ES2325132T3; FR2892555A1; DE602006006094D1; ATE427545T1; WO2007048891A1; EP1952388A1

Abstract

This system for synthesizing speech by concatenating acoustic units comprises: phonetic transcription means (6) capable of generating a series of target acoustic units representative of the text to synthesize; means (7) for storing candidate acoustic units, each candidate acoustic unit comprising a pre-recorded fragment of speech; preselecting means (8) capable of producing a number of flows of candidate acoustic units, each flow being preselected based on a minimization of its overall cost, said overall call being the sum of cost functions that determine the cost between each target acoustic unit and the candidate acoustic units and functions of costs of the transitions between two candidate acoustic units, and; interface means (9) that enable an operator to compare the auditory quality of each preselected flow of candidate acoustic units for selecting the flow whose auditory quality seems the best to him.

Description

La présente invention concerne un système et un procédé de synthèse vocale par concaténation d'unités acoustiques.The present invention relates to a system and method for voice synthesis by concatenating acoustic units.

La synthèse vocale par concaténation d'unités acoustiques utilise un certain nombre de principes connus.Concatenated vocal synthesis of acoustic units uses a number of known principles.

Typiquement, une chaîne de synthèse vocale à partir du texte comprend les étapes de

traitement linguistique permettant d'extraire du texte des informations pertinentes pour la synthèse,
transcription phonétique transformant les informations linguistiques en une chaîne phonétique comportant une suite d'unités acoustiques cibles,
sélection des unités acoustiques candidates, c'est-à-dire sélection des fragments de paroles préenregistrées qui seront utilisées pour la synthèse, et
synthèse du signal consistant à concaténer les unités acoustiques candidates sélectionnées pour former le signal sonore demandé.

Typically, a text-to-speech string from the text includes the steps of

linguistic processing to extract from the text information relevant for the synthesis,
phonetic transcription transforming the linguistic information into a phonetic string comprising a series of target acoustic units,
selection of the candidate acoustic units, that is to say selection of the fragments of pre-recorded words that will be used for the synthesis, and
signal synthesis of concatenating the selected candidate acoustic units to form the requested sound signal.

La qualité du signal sonore dépend essentiellement du choix des unités acoustiques candidates : il s'agit d'utiliser les fragments de paroles les plus appropriés pour obtenir un signal sonore « naturel ».The quality of the sound signal depends essentially on the choice of candidate acoustic units: it is a question of using the most appropriate fragments of words to obtain a "natural" sound signal.

Traditionnellement, la sélection des unités acoustiques candidates est basée sur un algorithme de Viterbi. Celui-ci détermine la séquence optimale des unités acoustiques à utiliser en calculant le chemin optimal dans un graphe, graphe dont les noeuds sont les unités acoustiques candidates et les arcs les transitions entre les unités acoustiques candidates.Traditionally, the selection of candidate acoustic units is based on a Viterbi algorithm. This determines the optimal sequence of the acoustic units to be used by calculating the optimal path in a graph, whose nodes are the candidate acoustic units and the arcs the transitions between the candidate acoustic units.

Le chemin est optimal dans le sens d'une minimisation de la somme des coûts associés aux noeuds et aux arcs qui constituent le chemin. Le coût associé à une unité acoustique candidate, noeud du graphe, est appelé coût-cible et mesure l'adéquation entre l'unité acoustique candidate et l'unité acoustique cible. Le coût associé à une transition, arc du graphe, est appelé coût de concaténation et mesure la qualité de la concaténation entre les deux unités candidates qu'elle relie.The path is optimal in the sense of minimizing the sum of the costs associated with the nodes and arcs that make up the path. The cost associated with a candidate acoustic unit, node of the graph, is called target cost and measures the adequacy between the candidate acoustic unit and the target acoustic unit. The cost associated with a transition, the arc of the graph, is called the concatenation cost and measures the quality of the concatenation between the two candidate units that it links.

Ces différents coûts sont déterminés par des fonctions de coût permettant de les calculer pour chacun des arcs et noeuds du graphe. On conçoit aisément que, ces fonctions de coût étant censées représenter la qualité de la synthèse, leurs choix et leurs paramétrages ont une forte influence sur le résultat final.These different costs are determined by cost functions that make it possible to calculate them for each of the arcs and nodes of the graph. It is easy to see that these cost functions are supposed to represent the quality of the synthesis, their choices and their settings have a strong influence on the final result.

Pour synthétiser la « meilleure » phrase, perceptuellement parlant, la demande de brevet US 2003/0229494 de RUTTEN et AL. propose de faire intervenir un opérateur qui, par itération successive, ajuste la qualité de la phrase produite. Le procédé proposé pour cette demande, consiste donc à sélectionner de façon classique, une suite d'unités acoustiques candidates, à faire écouter par l'opérateur la phrase ainsi produite par le module de sélection, puis à ajuster les paramètres de la sélection avant de relancer une sélection,...To summarize the "best" sentence, perceptually speaking, the patent application US 2003/0229494 to RUTTEN and AL. proposes to involve an operator who, by successive iteration, adjusts the quality of the sentence produced. The method proposed for this request therefore consists of selecting, in a conventional manner, a series of candidate acoustic units, to make the operator thus listen to the sentence thus produced by the selection module, and then to adjust the parameters of the selection beforehand. relaunch a selection, ...

Le procédé est réitéré jusqu'à ce que l'opérateur obtienne une solution qui lui convienne.The process is repeated until the operator obtains a suitable solution.

Le procédé et le système de synthèse vocale proposés par cette demande présentent l'inconvénient d'obliger l'opérateur à intervenir sur les paramètres de la sélection pour obtenir une solution. Or ces paramètres, comme, par exemple, les paramètres des fonctions de coût, n'ont pas toujours de liens directs et intuitifs avec le résultat obtenu. Cela nécessite donc, de la part de l'opérateur, un long apprentissage avant d'être capable d'utiliser un tel système efficacement.The method and voice synthesis system proposed by this application have the disadvantage of requiring the operator to intervene on the parameters of the selection to obtain a solution. However, these parameters, such as, for example, the parameters of the cost functions, do not always have direct and intuitive links with the result obtained. This requires, therefore, on the part of the operator, a long learning before being able to use such a system effectively.

De plus, à chaque changement de paramètres, il est nécessaire de lancer une nouvelle étape de sélection qui est très consommatrice en ressources de calcul.Moreover, with each change of parameters, it is necessary to launch a new selection step which is very consuming in computing resources.

Le but de l'invention est donc de remédier à ces inconvénients en proposant un système et un procédé de synthèse vocale aisés à mettre en oeuvre.The object of the invention is therefore to remedy these drawbacks by proposing a system and a method of voice synthesis that are easy to implement.

L'objet de l'invention est un système de synthèse vocale par concaténation d'unités acoustiques comportant :

des moyens de transcription phonétique aptes à générer une suite d'unités acoustiques cibles, représentative du texte à synthétiser,
des moyens de stockage d'unités acoustiques candidates, chaque unité acoustique candidate comportant un fragment de parole préenregistrée,
des moyens de présélection aptes à produire au moins un flux d'unités acoustiques candidates, chaque flux étant présélectionné sur la base d'une minimisation de son coût global, ledit coût global étant la somme de fonctions de coûts qui déterminent le coût entre chaque unité acoustique cible et les unités acoustiques candidates et de fonctions de coûts des transitions entre deux unités acoustiques candidates, et
des moyens d'interface aptes à permettre à un opérateur d'évaluer la qualité auditive de chaque flux d'unités acoustiques candidates présélectionné,

caractérisé en ce que les moyens de présélection sont aptes à produire une pluralité de flux d'unités acoustiques candidates ayant les meilleurs coûts globaux, et en ce que les moyens d'interface sont aptes à permettre à un opérateur de comparer les flux d'unités acoustiques présélectionnés et de choisir le flux dont la qualité auditive lui paraît la meilleure.The object of the invention is a voice synthesis system by concatenation of acoustic units comprising:

phonetic transcription means capable of generating a series of target acoustic units, representative of the text to be synthesized,
means for storing candidate acoustic units, each candidate acoustic unit comprising a prerecorded speech fragment,
preselection means capable of producing at least one stream of candidate acoustic units, each stream being preselected on the basis of a minimization of its overall cost, said overall cost being the sum of cost functions which determine the cost between each unit target acoustics and the candidate acoustic units and cost functions of the transitions between two candidate acoustic units, and
interface means adapted to allow an operator to evaluate the auditory quality of each preselected candidate acoustic unit stream,

characterized in that the preselection means are capable of producing a plurality of candidate acoustic unit streams having the best overall costs, and in that the interface means are capable of allowing an operator to compare the flow of units preselected acoustics and to choose the flow whose auditory quality seems the best.

D'autres caractéristiques de l'invention sont

les moyens de présélection utilisent un algorithme N-best pour présélectionner la pluralité de flux d'unités acoustiques candidates ;
les moyens d'interface comportent des moyens de filtrage aptes à éliminer, à partir de critères phonétiques, un sous-ensemble de flux d'unités acoustiques candidates de la pluralité des flux d'unités acoustiques candidates présélectionnés ;
les critères phonétiques comportent, seuls ou en combinaison, des critères d'interdiction de présence d'une unité acoustique, des critères d'interdiction de présence d'une concaténation entre deux unités acoustiques, et des critères d'interdiction d'une concaténation sur une transition.

Other features of the invention are

the preselection means uses an N-best algorithm to preselect the plurality of candidate acoustic unit streams;
the interface means comprise filtering means able to eliminate, from phonetic criteria, a subset of candidate acoustic unit streams of the plurality of preselected candidate acoustic unit streams;
the phonetic criteria include, alone or in combination, criteria for the prohibition of the presence of an acoustic unit, the criteria for prohibiting the presence of a concatenation between two acoustic units, and criteria for prohibiting concatenation on a transition.

Un autre objet de l'invention est un procédé de synthèse vocale par concaténation d'unités acoustiques comportant une étape préalable de stockage d'unités acoustiques candidates, chaque unité acoustique candidate comportant un fragment de parole préenregistrée, et ledit procédé comportant en outre les étapes de :

transcription phonétique apte à générer une suite d'unités acoustiques cibles représentative du texte à synthétiser,
présélection d'au moins un flux d'unités acoustiques candidates, chaque flux étant présélectionné sur la base d'une minimisation de son coût global, ledit coût global étant la somme de fonctions de coûts qui déterminent le coût entre chaque unité acoustique cible et les unités acoustiques candidates et de fonctions de coûts des transitions entre deux unités acoustiques candidates, et
évaluation par un opérateur de la qualité auditive de chaque flux,

et ledit procédé est caractérisé en ce que

l'étape de présélection est apte à produire une pluralité de flux d'unités acoustiques candidates présélectionnées ayant les meilleurs coûts globaux, et
l'étape d'évaluation consiste, pour l'opérateur, à comparer les flux d'unités acoustiques présélectionnés et à choisir le flux dont la qualité auditive lui paraît la meilleure.

Another object of the invention is a method of voice synthesis by concatenation of acoustic units comprising a prior step of storing candidate acoustic units, each candidate acoustic unit comprising a prerecorded speech fragment, and said method further comprising the steps from:

phonetic transcription capable of generating a series of target acoustic units representative of the text to be synthesized,
preselecting at least one stream of candidate acoustic units, each stream being preselected on the basis of minimizing its overall cost, said overall cost being the sum of cost functions which determine the cost between each target acoustic unit and the candidate acoustic units and cost functions of the transitions between two candidate acoustic units, and
evaluation by an operator of the auditory quality of each flow,

and said method is characterized in that

the preselection step is capable of producing a plurality of preselected candidate acoustic unit streams having the best overall costs, and
the evaluation step consists, for the operator, in comparing the flow of preselected acoustic units and choosing the flow whose auditory quality seems to him the best.

D'autres caractéristiques de cet objet sont

l'étape de présélection utilise un algorithme N-best pour présélectionner la pluralité de flux d'unités acoustiques candidates ;
l'étape d'évaluation comporte une étape de filtrage, à partir de critères phonétiques, apte à éliminer un sous-ensemble de flux d'unités acoustiques candidates de la pluralité des flux d'unités acoustiques candidates présélectionnés ;
les critères phonétiques comportent, seuls ou en combinaison, des critères d'interdiction de présence d'une unité acoustique, des critères d'interdiction de présence d'une concaténation entre deux unités acoustiques, et des critères d'interdiction d'une concaténation sur une transition.

Other features of this object are

the preselection step uses an N-best algorithm to preselect the plurality of candidate acoustic unit streams;
the evaluation step comprises a filtering step, based on phonetic criteria, able to eliminate a subset of candidate acoustic unit streams from the plurality of preselected candidate acoustic unit streams;
the phonetic criteria include, alone or in combination, criteria for the prohibition of the presence of an acoustic unit, the criteria for prohibiting the presence of a concatenation between two acoustic units, and criteria for prohibiting concatenation on a transition.

Un autre objet est un produit programme d'ordinateur comprenant des instructions de code de programme enregistré sur un support lisible par un ordinateur, pour mettre en oeuvre le procédé de synthèse vocale lorsque ledit programme fonctionne sur un ordinateur.Another object is a computer program product comprising program code instructions recorded on a computer readable medium, for implementing the speech synthesis method when said program is running on a computer.

Un autre objet est un support d'enregistrement lisible par un ordinateur sur lequel est enregistré un programme d'ordinateur.Another object is a computer-readable recording medium on which a computer program is recorded.

L'invention sera mieux comprise à la lecture de la description qui va suivre faite uniquement à titre d'exemple et en relation avec les dessins en annexe dans lesquels :

la figure 1 est un schéma simplifié d'un système de synthèse vocale selon l'invention ;
la figure 2 est un ordinogramme du procédé selon un mode de réalisation préféré de l'invention ;
la figure 3 est un schéma de présélection des unités acoustiques candidates ; et
la figure 4 est un schéma d'un écran d'interface avec l'opérateur du système de synthèse vocale selon un mode de réalisation préféré de l'invention.

The invention will be better understood on reading the description which follows, given solely by way of example and in relation to the appended drawings in which:

the figure 1 is a simplified diagram of a speech synthesis system according to the invention;
the figure 2 is a flow chart of the method according to a preferred embodiment of the invention;
the figure 3 is a preselection scheme of the candidate acoustic units; and
the figure 4 is a diagram of an interface screen with the operator of the speech synthesis system according to a preferred embodiment of the invention.

En référence à la figure 1, un système 1 de synthèse vocale est destiné à transformer un texte 2 en un flux sonore 3.With reference to the figure 1 , a voice synthesis system 1 is intended to transform a text 2 into a sound stream 3.

Le texte 2 est entré dans le système 1 par l'intermédiaire de moyens de saisie 4 qui le transforme en un fichier, typiquement au standard UNICODE.The text 2 is entered in the system 1 via input means 4 which transforms it into a file, typically the UNICODE standard.

Ce fichier est traité par des moyens 5 de traitements linguistiques permettant d'extraire du texte des informations pertinentes pour la synthèse par une analyse linguistique du texte.This file is processed by language processing means 5 for extracting text relevant information for synthesis by a linguistic analysis of the text.

Ces informations linguistiques sont utilisées par les moyens 6 de transcription phonétique. Cette transcription, non nécessairement unique, se présente sous la forme d'une suite d'unités acoustiques cibles, éventuellement augmentée d'informations supplémentaires telles que des consignes prosodiques ou des catégories grammaticales.This linguistic information is used by the phonetic transcription means 6. This transcription, not necessarily unique, is in the form of a series of target acoustic units, possibly augmented with additional information such as prosodic instructions or grammatical categories.

Ces moyens 4, 5 et 6 permettant d'obtenir une suite d'unités acoustiques cibles sont bien connus de l'homme du métier et ne seront pas décrits plus en détail. Des informations complémentaires sur ces moyens peuvent être trouvées, par exemple, dans la demande de brevet US 2003/0229494 précitée.These means 4, 5 and 6 for obtaining a series of target acoustic units are well known to those skilled in the art and will not be described in more detail. Further information on these means can be found, for example, in the patent application US 2003/0229494 supra.

Le système 1 de synthèse vocale comporte également des moyens 7 de stockage d'unités acoustiques candidates typiquement sous forme d'une base de données. Ces unités acoustiques candidates comportent principalement des fragments de paroles préenregistrées. Ces fragments peuvent correspondre à des phonèmes, des diphones, des syllabes, ... Chaque unité acoustique candidate représente une variation sonore d'une unité acoustique de base, par exemple des variations de longueur, de timbre, ... Typiquement, les moyens 7 de stockage peuvent contenir plus de 100 000 unités acoustiques candidates.The voice synthesis system 1 also comprises means 7 for storing candidate acoustic units, typically in the form of a database. These candidate acoustic units mainly comprise prerecorded speech fragments. These fragments can correspond to phonemes, diphones, syllables, ... Each candidate acoustic unit represents a sound variation of a basic acoustic unit, for example variations of length, of timbre, ... Typically, the means 7 storage can contain more than 100,000 candidate acoustic units.

Dans la description qui suit, et à titre purement illustratif, les unités acoustiques seront supposées être des diphones.In the following description, and purely illustrative, the acoustic units will be assumed to be diphones.

Les moyens 7 de stockage sont reliés à des moyens 8 de présélection dont l'objet est de produire au moins un flux d'unités acoustiques candidates. Chaque flux d'unités acoustiques candidates est représentatif de la suite d'unités acoustiques cibles.The storage means 7 are connected to preselection means 8 whose object is to produce at least one stream of candidate acoustic units. Each stream of candidate acoustic units is representative of the sequence of target acoustic units.

Habituellement, un système de synthèse vocale ne produit qu'un seul flux d'unités acoustiques. Un algorithme communément utilisé pour produire cet unique flux est l'algorithme de Viterbi qui minimise le coût global, somme des coûts-cible et des coûts de transition pour les unités acoustiques candidates et les transitions de ce flux.Usually, a speech synthesis system produces only a single stream of acoustic units. An algorithm commonly used to produce this single stream is the Viterbi algorithm which minimizes the overall cost, the sum of the target costs and transition costs for the candidate acoustic units and the transitions of this stream.

Des exemples de fonctions de coût utilisables dans le cadre de cet algorithme de Viterbi sont décrits dans « Perceptual and Objective Detection of discontinuities in concatenative Speech synthesis », Yannis Stylianou and Ann K. Syrdal, ICASSP 2001 .Examples of cost functions that can be used in the context of this Viterbi algorithm are described in "Perceptual and Objective Detection of Discontinuities in Concatenative Speech Synthesis", Yannis Stylianou and Ann K. Syrdal, ICASSP 2001 .

Pour cela, les moyens 8 de présélection n'utilisent pas que l'algorithme de Viterbi puisque celui-ci ne fournit qu'un seul flux, celui ayant le meilleur coût global. A titre purement illustratif, la suite de flux produite par les moyens 8 de présélection est le résultat d'un algorithme de type N-best qui fournit une suite ordonnée de N flux dont le premier flux correspond à la solution de l'algorithme de Viterbi.For this, the preselection means 8 do not use the Viterbi algorithm since it only provides a single stream, the one having the best overall cost. For purely illustrative purposes, the stream sequence produced by the preselection means 8 is the result of an N-best type algorithm which provides an ordered sequence of N streams whose first stream corresponds to the solution of the Viterbi algorithm. .

Deux exemples de ce type d'algorithme sont décrits dans « A comparison of two Exact Algorithms for finding the N-Best Sentence Hypothese in Continuous Speech Recognition », V.M. Jimenez, A. Marzal, J. Monné, Eurospeech 1995 .Two examples of this type of algorithm are described in "A Comparison of Two Exact Algorithms for Finding the N-Best Sentence Hypothesis in Continuous Speech Recognition", VM Jimenez, Marzal A., J. Monné, Eurospeech 1995 .

Les moyens 8 de présélection sont connectés à des moyens 9 d'interface. Ceux-ci sont connectés à des moyens 10 de restitution sonore permettant ainsi à un opérateur d'écouter, à la demande, un des flux d'unités acoustiques présélectionnés, et de déterminer ainsi celui qui a la meilleure qualité auditive.The preselection means 8 are connected to interface means 9. These are connected to sound reproduction means 10 thus allowing an operator to listen, on demand, one of the flow of preselected acoustic units, and thus determine the one with the best hearing quality.

Les moyens 9 d'interface sont également connectés à des moyens 11 de visualisation et de saisie permettant à l'opérateur de visualiser et de sélectionner les différents flux présélectionnés.The interface means 9 are also connected to viewing and input means 11 enabling the operator to view and select the different preselected flows.

De manière préférentielle, ces moyens 9 d'interface comportent des moyens 12 de filtrage. Ceux-ci sont adaptés pour que l'opérateur, par utilisation de critères phonétiques, puisse éliminer des sous-ensembles de flux parmi les flux présélectionnés de façon à limiter le nombre d'écoutes et de comparaisons à effectuer pour choisir le meilleur flux.Preferably, these interface means 9 comprise means 12 for filtering. These are adapted so that the operator, by using phonetic criteria, can eliminate subsets of flows among the preselected flows so as to limit the number of plays and comparisons to be made to choose the best flow.

Le fonctionnement de ce système va maintenant être explicité en référence à la figure 2.The operation of this system will now be explained with reference to the figure 2 .

Le procédé démarre à l'étape 20.The process starts in step 20.

La saisie d'un texte est effectuée à l'étape 21.Entering text is done in step 21.

Celui-ci est traité en 22 pour en extraire des informations linguistiques.This is processed in 22 to extract language information.

Ces informations linguistiques sont utilisées en 23 pour produire classiquement une suite d'unités acoustiques cibles.This linguistic information is used at 23 to classically produce a series of target acoustic units.

Par utilisation de l'algorithme de présélection, on sélectionne en 24 un nombre N de flux d'unités acoustiques candidates.By using the preselection algorithm, a number N of streams of candidate acoustic units is selected at 24.

Par exemple, figure 3, pour la suite 30 de quatre unités acoustiques cibles, on a représenté en 31 l'ensemble des graphes possibles dont les unités acoustiques candidates sont les noeuds 10-1, 10-2, 11-1,...For example, figure 3 for the sequence of four target acoustic units, there is shown at 31 all the possible graphs whose candidate acoustic units are the nodes 10-1, 10-2, 11-1, ...

Le flux 32, représenté en trait plein épais, correspond à la première solution. Il correspond au flux des unités acoustiques candidates 10-1, 11-2, 12-1, 13-1.The stream 32, shown as a thick solid line, corresponds to the first solution. It corresponds to the flow of candidate acoustic units 10-1, 11-2, 12-1, 13-1.

Le flux 33, représenté en traits pointillés épais, correspond à la deuxième solution. II est composé des unités acoustiques candidates 10-2, 11-1, 12-3, 13-3.The stream 33, shown in thick dashed lines, corresponds to the second solution. It is composed of candidate acoustic units 10-2, 11-1, 12-3, 13-3.

L'ensemble des N flux ainsi présélectionnés est stocké en mémoire et rendu disponible à l'utilisateur.All N preselected streams is stored in memory and made available to the user.

Celui-ci écoute en 25, figure 2, un des flux présélectionné.He listens in 25, figure 2 , one of the preselected flows.

S'il est satisfait par la qualité de ce flux en 26, alors le procédé est terminé en 27.If it is satisfied by the quality of this flux at 26, then the process is finished at 27.

Par contre, si le flux écouté n'est pas satisfaisant, un autre flux est écouté en 25 jusqu'à l'écoute d'un flux de bonne qualité.On the other hand, if the stream being listened to is not satisfactory, another stream is listened to until listening to a stream of good quality.

On conçoit que cette écoute successive peut être longue et fastidieuse. Aussi, il est avantageux d'offrir à l'utilisateur une interface permettant de filtrer l'ensemble des flux selon des critères phonétiques modifiables par l'utilisateur.It is understandable that this successive listening can be long and tedious. Also, it is advantageous to offer the user an interface for filtering all flows according to phonetic criteria modifiable by the user.

Ainsi, une étape 28 d'édition des filtres est, de manière facultative, insérée dans la boucle d'écoute / sélection.Thus, a filter editing step 28 is optionally inserted into the listen / select loop.

A titre d'exemple, un schéma simplifié d'un écran d'interface est représenté en figure 4.For example, a simplified diagram of an interface screen is represented in figure 4 .

Le flux actuellement traité et écouté par l'opérateur est représenté en 40 avec la suite des unités acoustiques candidates sélectionnées.The stream currently processed and listened to by the operator is represented at 40 with the following of the selected candidate acoustic units.

Par l'utilisation des boutons 41 et 42, l'opérateur passe au flux précédent ou au flux suivant. Il peut également choisir un des flux qu'il a déjà écouté et retenu dans la fenêtre 43.By the use of the buttons 41 and 42, the operator switches to the previous flow or the next flow. He can also choose one of the streams he has already listened to and retained in the window 43.

Il dispose d'opérations de filtrage pour contraindre les propriétés des flux qu'il veut visionner ou écouter.It has filtering operations to constrain the properties of the streams it wants to view or listen to.

Parmi les opérations de filtrage à sa disposition, il peut

interdire en 44 la présence d'une unité dans les flux filtrés. Par exemple, il peut interdire la présence de l'unité acoustique 10-4,
interdire en 45 la présence d'une concaténation entre deux unités acoustiques dans les flux filtrés. Par exemple, il peut interdire la transition entre les unités 11-2 et 12-1,
interdire en 46 toute concaténation sur une transition. Par exemple, il peut interdire toute concaténation entre les états acoustiques 12 et 13. Les seuls flux autorisés auront alors nécessairement, pour cette transition, deux unités adjacentes dans la base.

Among the filtering operations at his disposal, he can

prohibit at 44 the presence of a unit in the filtered flows. For example, it can prohibit the presence of the acoustic unit 10-4,
prohibit in 45 the presence of a concatenation between two acoustic units in the filtered streams. For example, it can prohibit the transition between units 11-2 and 12-1,
prohibit at 46 any concatenation on a transition. For example, it can prohibit any concatenation between the acoustic states 12 and 13. The only authorized flows will then necessarily have, for this transition, two adjacent units in the base.

La ligne 47 résume l'ensemble des filtres utilisés.Line 47 summarizes all the filters used.

On conçoit qu'il est possible de combiner plusieurs filtres selon une logique booléenne.It is conceivable that it is possible to combine several filters according to a Boolean logic.

On a ainsi décrit un système et un procédé de synthèse vocale par concaténation d'unités acoustiques aisées à manipuler puisque l'opérateur n'a pas à attendre que des calculs d'optimisation soient faits pour comparer deux flux. En effet, tous les calculs sont faits lors de l'étape de présélection et sont donc effectués sans que l'opérateur n'intervienne.Thus, a system and method of voice synthesis by concatenation of acoustic units that are easy to manipulate has been described since the operator does not have to wait for optimization calculations to be made to compare two streams. Indeed, all the calculations are made during the preselection step and are therefore performed without the operator intervenes.

De plus, les opérations de filtrage telles que la suppression d'une concaténation correspondent à une analyse auditive directe des flux. Il suffit en effet d'écouter un flux comportant une telle concaténation, de s'apercevoir qu'elle est mal sonnante, et donc de décider d'éliminer tous les flux comportant cette concaténation.In addition, filtering operations such as removing a concatenation correspond to a direct auditory analysis of the flows. It suffices to listen to a stream with such a concatenation, to notice that it is wrong, and thus to decide to eliminate all flows with this concatenation.

Ce procédé de synthèse vocale peut être mis en oeuvre par un programme d'ordinateur fonctionnant sur un ordinateur de type station de travail. Ce programme d'ordinateur est enregistré sur un support de données lisible par cet ordinateur.This speech synthesis method can be implemented by a computer program running on a workstation type computer. This computer program is saved on a data medium readable by this computer.

Claims

System of synthesizing voice by concatenation of acoustic units comprising:
- phonetic transcription means (6) capable of generating a sequence of target acoustic units, representative of the text to be synthesized,

- means (7) for storing candidate acoustic units, each candidate acoustic unit comprising a fragment of prerecorded speech,

- preselection means (8) capable of producing at least one stream of candidate acoustic units, each stream being preselected on the basis of a minimization of its overall cost, the said overall cost being the sum of cost functions which determine the cost between each target acoustic unit and the candidate acoustic units and of cost functions of transitions between two candidate acoustic units, and

- interfacing means (9) capable of allowing an operator to evaluate the hearing quality of each preselected stream of candidate acoustic units, characterized in that the preselection means (8) are capable of producing a plurality of streams of candidate acoustic units having the best overall costs, and in that the interface means (9) are capable of allowing an operator to compare the preselected streams of acoustic units and of choosing the stream of which the hearing quality seems best to him.
Voice synthesis system according to Claim 1, characterized in that the preselection means use an N-best algorithm for preselecting the plurality of streams of candidate acoustic units.
Voice synthesis system according to Claim 1 or 2, characterized in that the interface means (9) comprise filtering means (12) capable of removing, based on phonetic criteria, a subset of streams of candidate acoustic units from the plurality of preselected streams of candidate acoustic units.
Voice synthesis system according to Claim 3, characterized in that the phonetic criteria comprise, alone or in combination, criteria for prohibiting the presence of an acoustic unit, criteria for prohibiting the presence of a concatenation between two acoustic units, and criteria for prohibiting a concatenation on a transition.
Method of synthesizing voice by concatenation of acoustic units comprising a prior step of storing candidate acoustic units, each candidate acoustic unit comprising a fragment of prerecorded speech, and the said method also comprising the steps of:
- phonetic transcription (23) capable of generating a series of target acoustic units representative of the text to be synthesized,

- preselection (24) of at least one stream of candidate acoustic units, each stream being preselected on the basis of a minimization of its overall cost, the said overall cost being the sum of cost functions which determine the cost between each target acoustic unit and the candidate acoustic units and of cost functions of the transitions between two candidate acoustic units, and

- evaluation (25, 26) by an operator of the hearing quality of each stream,
and the said method is characterized in that
- the preselection step is capable of producing a plurality of preselected streams of candidate acoustic units having the best overall costs, and

- the evaluation step consists, for the operator, in comparing the preselected streams of acoustic units, and in choosing the stream of which the hearing quality seems best to him.
Voice synthesis method according to Claim 5, characterized in that the preselection step uses an N-best algorithm for preselecting the plurality of streams of candidate acoustic units.
Voice synthesis method according to Claim 5 or 6, characterized in that the evaluation step (25, 26) comprises a filtering step (28), based on phonetic criteria, capable of removing a subset of candidate acoustic unit streams from the plurality of preselected candidate acoustic unit streams.
Voice synthesis method according to Claim 7, characterized in that the phonetic criteria comprise, alone or in combination, criteria of prohibiting the presence of an acoustic unit, criteria of prohibiting the presence of a concatenation between two acoustic units, and criteria of prohibiting a concatenation on a transition.
Computer program product comprising program code instructions recorded on a computer-readable medium, these instructions being suitable for implementing the voice synthesis method according to one of Claims 6 to 8 when the said program runs on a computer.
Computer-readable recording medium on which a computer program according to Claim 9 is recorded.