FR2642882A1

FR2642882A1 - SPEECH PROCESSING APPARATUS

Info

Publication number: FR2642882A1
Application number: FR8901542A
Authority: FR
Original assignee: RIPOLL JEAN LOUIS
Current assignee: RIPOLL JEAN LOUIS
Priority date: 1989-02-07
Filing date: 1989-02-07
Publication date: 1990-08-10
Anticipated expiration: 2009-02-07
Also published as: FR2642882B1; WO1990009656A1; EP0456742A1

Abstract

The invention concerns the analysis and synthesis of speech and more generally, speech coding and decoding. Since recognizing the speech of several different speakers is very difficult due to differences in pronunciation of the same phonemes by different speakers, the invention discloses a recognition system using portable cards, and, in particular, chip cards, in which the characteristic voice parameters of the card holder are recorded. These parameters are read by a reader (16), transmitted to a voice recognition machine which adapts its algorithms or processing circuits according to the content of the card in order to optimize recognition according to a given speaker. The recognition machine (10) can then operate, with the greatest reliability, a machine (12), according to a speech signal transmitted by a microphone (14).

Description

SPEECH PROCESSING APPARATUS
The invention relates to speech analysis and synthesis and, more generally, to speech coding and decoding.

There is a growing number of applications in which it is envisaged to process human voice signals electronically. First, there is speech recognition and synthesis intended to facilitate man-machine communication, which until now has taken place mainly through an input keyboard and a display screen, or through control buttons and levers. There is also speech recognition for the purpose of identifying a person by his vocal characteristics. And there are applications in which the processing serves to compress orally transmitted information so that it can be sent at a higher speed or over a narrower bandwidth, etc.

But speech processing is a very difficult operation because of the complexity of the physiological mechanisms by which speech is produced and by which it is heard and understood.

The medium that carries the information is an acoustic vibration of the air. This vibration consists of a succession of acoustic waves of complex shape. When these waveforms are recorded, it proves practically impossible, by simple visual inspection, to link a given part of the trace to the sound that was uttered.

As a result, it is very difficult to design electronic circuits or data processing programs capable of recognizing anything other than very simple isolated sounds. The problems are equally difficult in speech synthesis if the aim is to reproduce sounds that resemble human speech sufficiently faithfully.

To give a more precise idea of the difficulties encountered, some notions relating to speech analysis, recognition and synthesis are recalled below.

Speech sounds can be produced in several ways. There is first a distinction between voiced and unvoiced sounds. Voiced sounds originate from a vibration of the vocal cords and are modulated through the pharynx and the oral cavity (in particular by the tongue and the lips); some sounds also use the nasal cavity. Unvoiced sounds do not originate from the vocal cords; they are produced directly inside the oral cavity.

In addition, among both voiced and unvoiced sounds, a distinction can be made between sounds produced by air turbulence (in a narrow opening) and those that correspond rather to a steady flow. Consonants are generally produced by turbulence. Vowels correspond rather to steady flows.

The fricative consonants (s, f, z, v) are produced by a flow of air through the narrow gap between the teeth (s, z) or between the lips (f, v), respectively. The consonants s and f are unvoiced, whereas the consonants z and v are voiced.

Plosive consonants involve a complete occlusion of the vocal tract at one point or another, followed by an abrupt release of the pressure built up in the tract. The point of closure determines the sound produced. Here again, the sound can be voiced or unvoiced. The consonants p (unvoiced) and b (voiced) correspond to a closure of the lips; t (unvoiced) and d (voiced) correspond to an occlusion by the tongue at the front of the palate. The consonants k (unvoiced) and g (voiced) correspond to an occlusion by the tongue toward the back of the palate.

It is thus possible to describe how most of the phonemes of a given language are produced.

The phoneme is the smallest unit of sound that distinguishes one word from another or, more precisely, changes its meaning. There are only a few dozen different phonemes in a given language; French is considered to have about forty.

But this is a theoretical figure. In practice, phonemes are pronounced differently depending on the phonemes that precede or follow them. This is the phenomenon of coarticulation between phonemes, which seriously complicates recognition and synthesis because it multiplies by 4 or 5 the number of phonemes actually produced. Moreover, it is often simpler to base speech recognition or synthesis not on phonemes but either on "diphonemes", which are pairs of associated phonemes including the transition between them, or on "diphones", which are sound segments starting in the middle of one phoneme and ending in the middle of the next (thus including the transition between two phonemes but not the whole of either).

The human ear distinguishes them very well from one another, but the acoustic waveforms that set them apart do not seem characteristic enough for a machine to recognize them easily, especially in continuous speech.
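As a rough illustration of the figures quoted above (about forty French phonemes, multiplied by 4 or 5 through coarticulation) and of working on pairs of adjacent phonemes, a minimal Python sketch follows. The function names and the sample phoneme sequence are hypothetical and not part of the patent, and the pairing deliberately ignores where within each phoneme a diphone starts and ends.

```python
def coarticulated_variants(n_phonemes=40, factors=(4, 5)):
    """Range of phoneme variants actually produced, using the factor of 4 to 5 from the text."""
    return tuple(n_phonemes * k for k in factors)


def adjacent_pairs(phonemes):
    """List each pair of consecutive phonemes, the unit underlying diphonemes and diphones."""
    return list(zip(phonemes, phonemes[1:]))


# Hypothetical phoneme sequence, for illustration only.
print(coarticulated_variants())
print(adjacent_pairs(["p", "a", "r", "o", "l"]))
```

A sequence of n phonemes yields n - 1 adjacent pairs, so a single isolated phoneme yields none.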

The acoustic waves corresponding to vowels have a simpler and narrower frequency spectrum than those of consonants. Vowels represent a rather stable part of the speech signal, whereas consonants tend to represent transitions. Plosives, for example, are abrupt transitions with a very wide frequency spectrum during the transition.

This is why speech processing methods based essentially on frequency analysis of the acoustic signals have been proposed.

These frequency analyses make it easier to discern parameters corresponding to the various phonemes or diphones produced.

By way of example, one frequency analysis method that has already proved effective in both speech recognition and speech synthesis is the formant method. The following paragraphs recall what formants are, to make the invention easier to understand, although the invention is not limited to systems using formant analysis or synthesis.

Formants are the frequencies corresponding to energy peaks of the speech signal: the frequency spectrum obtained by analyzing the acoustic signal of a vowel clearly exhibits troughs and peaks. The peaks are the formants, and several successive formants can generally be distinguished in the spectrum of a given phoneme.

Formants are identified by their position in the frequency spectrum. The lowest-frequency peak is called the first formant, the next peak the second formant, and so on.

These peaks correspond physically to resonances of the oral cavity, and human speech consists precisely in modulating the shape of the oral cavity so as to change its various resonance frequencies.
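The numbering of formants by increasing frequency can be sketched as follows. This is a simplified illustration that assumes a smooth magnitude spectrum is already available and treats every strict local maximum as a formant; practical analyzers usually smooth the spectrum first. The function name and the sample spectrum are hypothetical.

```python
def first_formants(freqs_hz, magnitudes, count=3):
    """Return the frequencies of the first `count` strict local maxima of a magnitude spectrum."""
    peaks = []
    for i in range(1, len(magnitudes) - 1):
        if magnitudes[i - 1] < magnitudes[i] > magnitudes[i + 1]:
            peaks.append(freqs_hz[i])
            if len(peaks) == count:
                break
    return peaks


# Made-up spectrum sampled every 250 Hz from 250 Hz to 3000 Hz.
freqs = [250 * k for k in range(1, 13)]
mags = [2, 5, 9, 6, 7, 4, 3, 2, 3, 6, 4, 1]
print(first_formants(freqs, mags))  # first, second and third formant, in that order
```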

There is a direct link between the pronunciation of a phoneme and the shape of the vocal tract: producing the phoneme requires very precise positions of the various movable parts of the oral cavity (position of the lips, the tongue, the soft palate, etc.). There is also a link between the formant frequencies and the shape of the vocal tract. It follows that there is a direct link between a phoneme and the formant frequencies detected in the frequency spectrum of the corresponding acoustic signal.

Formant analysis and synthesis are based on this notion. Indeed, the presence of certain formants is found to be entirely characteristic of the production of a particular phoneme. For vowels, whose frequency spectrum is relatively stable, a given vowel can be characterized very well by the position (on the frequency axis) of the first three formants, that is, of the first three peaks of the spectrum of the corresponding acoustic signal.

As an indication, the following example can be given: the vowel A is an acoustic signal whose first formant lies between 500 and 800 hertz, whose second formant lies between 1000 and 1600 hertz but is not separated from the first by more than 600 to 900 hertz, and whose third formant lies between 2300 and 3200 hertz.

Another example: the vowel I would have a first formant between 200 and 400 hertz and a second formant between 2100 and 2400 hertz, spaced at least 2000 hertz from the first. The third formant lies at a still higher frequency.

A mathematical vector made up of three numbers, the frequencies of the first three formants, characterizes all the vowels and certain consonants fairly well. For other consonants the use of formants is less straightforward, but other methods can be used, in particular an evaluation of the direction and rate of variation of the formant frequencies in diphones containing a consonant transition.
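The two vowel examples given above can be turned into a small range test on a formant vector. This is only a sketch built on the indicative ranges quoted in the text; the function name is hypothetical, the maximum spacing for A is taken as 900 hertz (the upper end of the quoted "600 to 900 hertz", an assumption), and the third formant of I is only required to lie above the second because no numerical range is given for it.

```python
def classify_vowel(f1, f2, f3=None):
    """Return "A", "I" or None for formant frequencies given in hertz."""
    spacing = f2 - f1
    # Vowel A: F1 500-800 Hz, F2 1000-1600 Hz, F2 - F1 at most 900 Hz (assumed bound).
    if 500 <= f1 <= 800 and 1000 <= f2 <= 1600 and spacing <= 900:
        if f3 is None or 2300 <= f3 <= 3200:
            return "A"
    # Vowel I: F1 200-400 Hz, F2 2100-2400 Hz, F2 - F1 at least 2000 Hz.
    if 200 <= f1 <= 400 and 2100 <= f2 <= 2400 and spacing >= 2000:
        if f3 is None or f3 > f2:
            return "I"
    return None


print(classify_vowel(700, 1200, 2500))
print(classify_vowel(300, 2300))
```

A vector that falls outside both sets of ranges, or that violates the spacing condition, is left unclassified rather than forced into the nearest vowel.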

However, a further problem arises from the variety of ways in which different people pronounce the same phonemes. The human ear automatically recovers the meaning of a phoneme, even when it is pronounced by several different people. But a speech recognition machine presented with several formant vectors will have great difficulty recognizing them as one and the same phoneme if the vectors differ appreciably because they come from different people. This is all the more true given that machines for identifying persons on the basis of voice recognition have already been envisaged, which shows that, to some extent, there can be very significant differences in the way different people produce the same phonemes.

By way of example, Figure 1 is a schematic chart of the pronunciation zones of various phonetic vowels. The letters in square brackets represent phonemes common in French, written in the phonetic code of the International Phonetic Association. The chart is a frequency diagram showing the ranges of values of the first formant (on the ordinate) and of the second formant (on the abscissa). It can be seen in particular that some zones overlap, which means that the same sound produced by two different people may correspond to two phonemes with different meanings. More generally, the zones are close enough to one another that it can be difficult for a machine to recognize the phonemes present in human speech.

The speech recognition machines proposed to date are usually capable of recognizing only a small number of isolated words, spoken by one specific speaker who has recorded in the machine the words to be recognized (by speaking them himself).

It has been proposed to make these machines capable of recognizing the same words spoken by several different speakers. In that case, however, switching from one speaker to another first requires a learning phase for the machine: the second speaker must speak to the machine, in succession, all the words it must be able to recognize, so that the machine stores in memory the way these words are pronounced and can subsequently recognize them. This learning phase is very burdensome, and all the more so as the number of words to be recognized grows. If the machine must recognize 1000 words, all of them will have to be spoken; each may even have to be spoken several times to establish an average pronunciation (because a person's pronunciation of a word is not fixed and invariable). During the learning phase, the machine is unavailable for its recognition function, and the operator must set time aside for the operation. Yet this operation is in principle indispensable, because the probability that the machine will reliably recognize words spoken by a speaker other than the one who recorded the reference words is very low.

It goes without saying that if the machine is intended, for example, for use by the general public in a public place, carrying out a learning phase for every user who comes to the machine is out of the question. One may think, for example, of a telephone booth in which the called number is dialed by voice. For such machines, the number of words to be recognized currently has to be kept to a minimum, in order to increase the certainty of recognizing the spoken word whoever pronounces it.

One aim of the present invention is to propose a simple means of making a recognition machine easier to use by several different speakers, without unduly reducing the capabilities of the machine.

Another aim of the invention is to propose a simple means of improving speech synthesis by matching the synthesized voice as closely as possible to the voice of a specific speaker, so that, for example, if a speaker's voice is coded, transmitted over a telephone line and then resynthesized before being played back to a listener, the synthesized voice comes as close as possible to the voice of the original speaker.

To achieve these aims, the present invention proposes a speech processing system comprising a speech coding or decoding apparatus suitable for multi-speaker coding or decoding, characterized in that parameters specific to a given speaker are contained in a personal portable card that the speaker keeps with him, the system comprising a card reader adapted to read the content of the card and to communicate that content to the coding or decoding apparatus, so as to adapt the apparatus to that speaker instantly, without a learning phase.

It will be understood that, with this system, complex machines using speech recognition or synthesis can even be installed in public places, and that anyone holding a personal card containing the parameters of his own voice will be able to communicate with or through such a machine, which he could not otherwise do.

The card could contain, in the form of coded data, the card holder's pronunciation of a certain number of words (for example, as many words as the machine must be able to recognize or synthesize).

However, it is more advantageous for the card to contain parameters of the voice that are independent of the words to be recognized or synthesized, since this widens the possibilities for recognition or synthesis.

The parameters recorded in the card can then be coded electrical signals representing the time waveforms or the frequency spectra of phonemes, diphonemes or diphones spoken by the card holder. It is preferable, however, to use as parameters vectors corresponding to these phonemes, diphonemes or diphones, for example vectors of three or four formants; each such vector will therefore comprise three or four frequency values (or more probably three or four frequency ranges) representing a given phoneme, diphoneme or diphone. These vectors are stored in the card and transferred to the machine at the time of use, replacing the vectors that the machine may have received previously when used by another speaker holding another personal card.
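The transfer of vectors from card to machine, replacing those of the previous speaker, can be sketched as follows. All names (`VoiceCard`, `Recognizer`, `load_card`, `recognize`) and all numerical values are hypothetical, and the nearest-vector matching by squared Euclidean distance is an assumption made for illustration; the text does not specify how the machine compares a measured vector with the stored ones.

```python
from dataclasses import dataclass


@dataclass
class VoiceCard:
    holder: str
    vectors: dict  # phoneme label -> (F1, F2, F3) in hertz


class Recognizer:
    def __init__(self):
        self.references = {}

    def load_card(self, card):
        # Replace, rather than merge with, the previous speaker's vectors.
        self.references = dict(card.vectors)

    def recognize(self, measured):
        # Assumption: pick the reference vector closest to the measured one.
        if not self.references:
            return None

        def distance(label):
            ref = self.references[label]
            return sum((m - r) ** 2 for m, r in zip(measured, ref))

        return min(self.references, key=distance)


# Made-up cards for two speakers.
card_a = VoiceCard("speaker A", {"a": (700, 1200, 2500), "i": (300, 2300, 3000)})
card_b = VoiceCard("speaker B", {"a": (780, 1400, 2700), "i": (350, 2380, 3100)})

machine = Recognizer()
machine.load_card(card_a)
print(machine.recognize((690, 1250, 2450)))
machine.load_card(card_b)  # speaker B inserts a card; speaker A's vectors are discarded
print(machine.recognize((340, 2350, 3050)))
```

Copying the vectors on load keeps the machine's working set independent of the card object, which mirrors the card being withdrawn after reading.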

It will be understood that, while formants appear to be the most convenient vectors for representing vowels, other parameters exist and can be stored for other phonemes, diphonemes or diphones.

In particular, consonants, or diphones that include consonants, are more readily expressed by parameters describing how the formants vary: a more or less rapid fall of the first formant accompanied by a more or less rapid rise of the second, and so on.
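One simple way to turn "how the formants vary" into storable numbers is the average slope of each formant track across the diphone. The sketch below assumes the formant frequencies have already been measured frame by frame; the function name `transition_rates`, the 10 ms frame spacing and the sample tracks are illustrative assumptions, not details from the patent.

```python
from typing import Sequence, Tuple


def transition_rates(f1_hz: Sequence[float], f2_hz: Sequence[float],
                     frame_ms: float) -> Tuple[float, float]:
    """Average slope (Hz per ms) of the F1 and F2 tracks across a diphone."""
    if len(f1_hz) != len(f2_hz) or len(f1_hz) < 2:
        raise ValueError("need two equally long tracks of at least two frames")
    span_ms = (len(f1_hz) - 1) * frame_ms
    return ((f1_hz[-1] - f1_hz[0]) / span_ms,
            (f2_hz[-1] - f2_hz[0]) / span_ms)


# Illustrative tracks, one value every 10 ms: F1 falls while F2 rises.
f1_rate, f2_rate = transition_rates(
    [700, 600, 500, 400], [1200, 1400, 1600, 1800], frame_ms=10.0
)
```

A negative first value and a positive second value correspond to the "fall of the first formant, rise of the second" case mentioned in the text.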

Coefficients of sampled transfer functions (transfer functions in z) could also be stored as voice parameters in a personal portable card.
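To illustrate what such coefficients might look like, the sketch below uses a second-order all-pole section, H(z) = 1 / (1 + a1 z^-1 + a2 z^-2), whose complex-conjugate pole pair models one resonance. This choice of filter form is an assumption; the patent does not say which transfer functions would be stored. The two functions convert between a resonance (frequency, bandwidth) and the coefficient pair (a1, a2) using the standard pole radius and pole angle relations.

```python
import math


def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate_hz):
    """Denominator coefficients (a1, a2) of H(z) = 1 / (1 + a1 z^-1 + a2 z^-2)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)   # pole radius
    theta = 2.0 * math.pi * freq_hz / sample_rate_hz         # pole angle
    return -2.0 * r * math.cos(theta), r * r


def resonance_from_coeffs(a1, a2, sample_rate_hz):
    """Recover (frequency, bandwidth) in Hz from a complex-conjugate pole pair."""
    r = math.sqrt(a2)
    theta = math.acos(-a1 / (2.0 * r))
    freq_hz = theta * sample_rate_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(r) * sample_rate_hz / math.pi
    return freq_hz, bandwidth_hz


# Illustrative resonance: 700 Hz centre, 100 Hz bandwidth, 8 kHz sampling.
a1, a2 = resonator_coeffs(700.0, 100.0, 8000.0)
freq, bw = resonance_from_coeffs(a1, a2, 8000.0)
```

Storing (a1, a2) per resonance is therefore equivalent to storing a frequency and a bandwidth, which is why either representation could serve as a card parameter.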

The card could be a magnetic stripe card or an optical card, but it will preferably be a chip card incorporating an integrated circuit chip with, in particular, a non-volatile memory containing the personal voice parameters. The card may also be another portable information medium, for example: a high-density magnetic card whose magnetic surface covers all or almost all of one face; an EPROM, EEPROM or non-volatile RAM storage memory housed in a very compact, easily carried package; a chip key that does not specifically have the shape of a flat card; etc.

Other characteristics and advantages of the invention will become apparent on reading the following description, given with reference to the appended drawings, in which:
- Figure 1, already described, is a diagram showing the position of various phonemes in formant space (first two formants);
- Figure 2 schematically shows an application of the invention to the voice control of a machine;
- Figure 3 schematically shows an application of the invention to telephone communications.

A first application of the invention is speech recognition, as may be used for example to control a robot, an industrial machine, a vehicle, etc., or, in a more sophisticated application, for a dictation machine or a translation machine.

Figure 2 illustrates this application in the case of robot control. A recognition apparatus 10 is connected to an industrial robot 12 to supply it with commands such as start, stop and rotate. The recognition apparatus is coupled to a microphone 14 so that commands can be given orally as simple words such as "start", "stop", "right", "left", etc. The apparatus is also coupled to a chip card reader 16 into which can be inserted a chip card 18 containing, in a non-volatile memory (EPROM or EEPROM), personalized data relating to the voice of the speaker who holds the card.

In operation, the card data are first loaded into the recognition apparatus; these data serve to modify either the configurations of electronic circuits in the apparatus or the recognition algorithms it uses.

The modified configurations or algorithms are such that the apparatus is then optimally adapted to recognizing the words or sentences spoken by the speaker who holds the card.

For example, the algorithm modifications may consist of changes to the mean values and limit values of the formant frequencies for each phoneme, diphoneme or diphone likely to be pronounced, or of changes to polynomial coefficients in computation algorithms based on the z-transform of the sampled acoustic signals.
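The first kind of modification can be sketched as a lookup table of mean and limit values that the recognizer consults for each measured pair of formants. In the hedged example below, the table layout, the nearest-mean decision rule, the names `FormantStats` and `classify`, and every numeric value are assumptions chosen for illustration; the patent only states that mean and limit values are changed per speaker.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class FormantStats:
    mean_hz: float
    low_hz: float    # lower limit value
    high_hz: float   # upper limit value

    def contains(self, f_hz: float) -> bool:
        return self.low_hz <= f_hz <= self.high_hz


# One (F1, F2) pair of statistics per phoneme.
Table = Dict[str, Tuple[FormantStats, FormantStats]]


def classify(table: Table, f1_hz: float, f2_hz: float) -> Optional[str]:
    """Return the phoneme whose limits contain (F1, F2) and whose means are nearest."""
    best, best_dist = None, float("inf")
    for phoneme, (s1, s2) in table.items():
        if not (s1.contains(f1_hz) and s2.contains(f2_hz)):
            continue
        dist = (f1_hz - s1.mean_hz) ** 2 + (f2_hz - s2.mean_hz) ** 2
        if dist < best_dist:
            best, best_dist = phoneme, dist
    return best


# Illustrative values only: the same sound falls in different regions for two speakers.
speaker_1: Table = {
    "a": (FormantStats(750, 650, 850), FormantStats(1250, 1100, 1400)),
    "o": (FormantStats(450, 350, 550), FormantStats(850, 700, 1000)),
}
speaker_2: Table = {
    "a": (FormantStats(900, 800, 1000), FormantStats(1500, 1350, 1650)),
    "o": (FormantStats(600, 500, 700), FormantStats(1050, 900, 1200)),
}
```

With these made-up tables, the measurement (660 Hz, 1120 Hz) is read as "a" under the first speaker's table and as "o" under the second's, which is the kind of ambiguity that loading the right card is meant to remove.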

Modifications of electronic circuit configurations could, for example, consist of changes to capacitance values (by operating switches) in the switched-capacitor filters used to determine formant frequencies.

Depending on the sophistication of the recognition apparatus 10, more or less complex words or sentences can be recognized. If the apparatus 10 is highly capable (and its performance with multiple speakers will be considerably improved by the invention), the controlled machine 12 may be a word processing machine, or even an automatic translation machine. This of course assumes that the recognition apparatus can recognize not only isolated words but also continuous sentences.

To choose the parameters that can be written to the card to represent the card holder's voice in a personalized manner, use may generally be made of the theories of speech recognition and synthesis as formulated to date. An indication of the mathematical methods for making these choices can be found in the treatise by René Boite and Murat Kunt, "Traitement de la parole", a supplement to the Traité d'Electricité published by Presses Polytechniques Romandes, and in the works listed in the bibliography of that treatise.

Another application of the invention is shown in Figure 3. In this application, the aim is to code the speech signal sent over a telephone line so as to compress it and thereby limit the information rate needed for a call.

To this end, the signal received by the microphone of the telephone handset is coded; the coding is phonetic rather than a digital coding of the speech waveforms: speech is coded by breaking it down into successive phonemes or diphones, which is therefore a speech recognition operation. Successive data vectors are then sent over the telephone line, each vector comprising several data items relating to the phoneme just spoken into the handset. On reception, the data vectors are converted back into phonemes; this is a speech synthesis operation. The compression achieved can be very substantial: the amount of data needed to transmit a normal conversation could be limited to 2 kilobits per second. Indeed, no more than about ten phonemes are uttered per second. Some 200 bits are therefore available to code each phoneme or diphone together with the prosody (that is, the melody produced by the variation of the fundamental frequency of the vocal cords over the course of the sentence).
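The bit budget quoted above follows from dividing the line rate by the phoneme rate. The short sketch below reproduces that arithmetic and checks one possible vector layout against it. The two constants come from the text; the field names and widths in `VECTOR_LAYOUT_BITS` are purely hypothetical, since the patent does not define the content of a transmitted vector.

```python
LINE_RATE_BITS_PER_S = 2000      # 2 kilobits per second, from the text
MAX_PHONEMES_PER_S = 10          # about ten phonemes per second, from the text


def bits_per_phoneme(rate_bits_per_s: int, phonemes_per_s: int) -> int:
    """Bits available for each phoneme (or diphone) and its prosody."""
    return rate_bits_per_s // phonemes_per_s


# Hypothetical field layout for one transmitted vector (not from the patent).
VECTOR_LAYOUT_BITS = {
    "phoneme_code": 8,
    "duration": 8,
    "formant_values": 4 * 12,
    "prosody_pitch_samples": 4 * 8,
}

budget = bits_per_phoneme(LINE_RATE_BITS_PER_S, MAX_PHONEMES_PER_S)
used = sum(VECTOR_LAYOUT_BITS.values())
fits = used <= budget
```

Under these assumed field widths the vector uses well under the available budget, which suggests the 2 kilobit per second figure leaves room for fairly detailed prosody data.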

In this application, according to the invention, a first coder/decoder 20 is interposed between a first telephone set 22 and a digital telephone line 24. This first coder/decoder codes the transmitted speech and decodes the received speech. It is coupled to a first chip card reader 26 into which a card 28 can be inserted containing personalized data on the voice of the person making the call. A second coder/decoder 30, similar to the first, is connected to the other end of the line 24, interposed between the line and a second telephone set 32. The second coder/decoder is likewise coupled to a second card reader 36 into which a card 38 can be inserted containing personalized data relating to the voice of the correspondent at the other end of the line.

The coder/decoders, which are in fact complete speech recognition and synthesis devices, receive the data contained in both cards, so that the coding part is adapted to recognizing the voice of the person at the same end of the line as the coder/decoder, while the decoding part is adapted to synthesizing the voice of the person at the other end of the line.

A data exchange protocol is therefore provided at the start of the telephone call to send the appropriate data to the coder/decoders. The conversation can then take place: one of the persons speaks; his voice is converted into coded phonemes by the coder, which has been specially adapted to the speaker's voice; it is sent over the line and received by the decoder at the other end. The decoder has also been adapted to the voice of the same speaker; it will therefore synthesize that speaker's voice optimally before passing it to the earpiece of the telephone set. Likewise, for the other speaker, coding and decoding are specially adapted to his voice, so that the correspondent at the other end of the line receives a voice synthesized in a personalized manner.
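The exchange at the start of the call can be sketched as each end sending its own card contents and receiving the other's. The patent does not define the protocol itself, so the class `CoderDecoder`, its method names and the dictionary standing in for card contents are all assumptions; the sketch only shows which set of parameters ends up adapting which function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoderDecoder:
    """One end of the line; `card` holds the local caller's voice parameters."""
    card: dict
    coding_params: Optional[dict] = None     # adapts recognition (local voice)
    decoding_params: Optional[dict] = None   # adapts synthesis (remote voice)

    def send_card(self) -> dict:
        return dict(self.card)               # copy sent over the line

    def receive_card(self, remote_card: dict) -> None:
        self.coding_params = self.card
        self.decoding_params = remote_card


def start_call(near: CoderDecoder, far: CoderDecoder) -> None:
    """Exchange card contents once, before any speech is transmitted."""
    near_card, far_card = near.send_card(), far.send_card()
    near.receive_card(far_card)
    far.receive_card(near_card)


# Placeholder card contents; reference numerals follow Figure 3.
codec_20 = CoderDecoder(card={"holder": "card 28"})
codec_30 = CoderDecoder(card={"holder": "card 38"})
start_call(codec_20, codec_30)
```

After `start_call`, each coder/decoder codes with its own caller's parameters and decodes with the correspondent's, matching the division of roles described above.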

In yet another application, a database is to be queried by telephone.

The query is made by speech rather than through a keyboard. One example is telephone booking of air travel.

As in the previous application, the user has a telephone set with an associated card reader; the card contains the voice parameters of its holder. The parameters can be used in two ways. First, they can be sent over the line as identification of an authorized holder; if the parameters are not those of an authorized holder, the database is not made accessible. Second, once the voice parameters have been transmitted to the database, a speech recognition system uses them to adapt as closely as possible to the voice of the person who is about to speak on the telephone line. The user can then speak; his voice is transmitted normally over the line (unlike the previous application, where it is coded to reduce the bit rate); speech analysis adapted to the speaker's voice is performed at the other end of the line to determine the transmitted message by machine and establish the human-machine dialogue over the telephone line.
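The two uses of the transmitted parameters can be sketched as a gate followed by an adaptation step. Everything concrete below is assumed for illustration: the class `DatabaseFrontEnd`, the registry of authorized parameter sets, the exact-match identification rule, and the toy recognizer that simply compares a measurement with the caller's stored references. The patent states only that access is refused to unauthorized holders and that recognition adapts to the received parameters.

```python
from typing import Callable, Dict, Optional

VoiceParams = Dict[str, tuple]   # phoneme label -> reference measurement


class DatabaseFrontEnd:
    def __init__(self, authorized: Dict[str, VoiceParams]):
        self._authorized = authorized   # holder name -> registered parameters

    def open_session(
        self, params: VoiceParams
    ) -> Optional[Callable[[tuple], Optional[str]]]:
        """Return a recognizer adapted to `params`, or None if access is refused."""
        # Use 1: identification. Assumed here to be an exact match against a registry.
        if params not in self._authorized.values():
            return None

        # Use 2: adaptation. The recognizer closes over the caller's own parameters.
        def recognize(measured: tuple) -> Optional[str]:
            for phoneme, reference in params.items():
                if reference == measured:
                    return phoneme
            return None

        return recognize


# Illustrative registry with one authorized holder and made-up values.
front_end = DatabaseFrontEnd({"holder_1": {"a": (750, 1250)}})
session = front_end.open_session({"a": (750, 1250)})
refused = front_end.open_session({"a": (900, 1500)})
```

Here `session` is a callable adapted to the authorized caller, while `refused` is `None`, so no dialogue with the database can begin.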

In all applications, the personal voice parameters will preferably be written to a holder's card by a specialized machine whose main function is to determine and record these parameters. To this end, the card holder will have to pronounce in front of the machine a number of characteristic words that are used to make this determination.

Claims

1. Speech processing system, comprising a speech coding or decoding apparatus suitable for multi-speaker coding or decoding, characterized in that specific parameters of the voice of a determined speaker are contained in a personal portable card that the speaker keeps with him, the system comprising a card reader adapted to read the content of the card and to communicate this content to the coding or decoding apparatus in order to adapt it instantly, without a learning phase, to this speaker.

2. Speech processing system according to claim 1, characterized in that the specific parameters of the speaker include acoustic data vectors corresponding to phonemes or diphonemes or diphones, as they are pronounced by the speaker holding the card.

3. Speech processing system according to claim 2, characterized in that each vector is constituted by a set of acoustic data, among which there are formant frequency values corresponding to a phoneme or diphoneme or diphone as pronounced by the speaker holding the card.

4. Speech processing system according to one of claims 1 to 3, characterized in that the specific parameters contained in the card include data relating to the frequency variations of formants corresponding to determined phonemes or diphonemes or diphones.

5. Speech processing system according to one of claims 1 to 4, characterized in that the parameters contained in the card include coefficients of sampled transfer functions (transfer function in z) of acoustic signals corresponding to phonemes or diphonemes or diphones pronounced by the card holder.

6. Speech processing system according to one of claims 1 to 5, characterized in that the card is a magnetic stripe or optical card, or preferably a chip card incorporating an integrated circuit chip with in particular a non-volatile memory containing personal voice parameters.

7. Speech processing system according to one of claims 1 to 5, characterized in that the card is a magnetic card with high storage density, the magnetic surface of which covers all or almost all of one face, or an integrated circuit key not specifically shaped like a flat card.

8. Speech processing system according to one of claims 1 to 7, characterized in that it comprises a phonetic speech coding and decoding apparatus interposed between a telephone apparatus and a telephone line, and capable of transmitting successively on the line data vectors corresponding to a succession of phonemes or diphonemes or diphones, and a card reader, the coding and decoding apparatus being able to adapt its coding function as a function of personal voice parameters contained in a card inserted in the reader, and the apparatus also being able to adapt its decoding function as a function of personal voice parameters received from the telephone line.

9. Speech processing system according to one of claims 1 to 7, characterized in that it comprises a telephone apparatus coupled to a telephone line, a card reader associated with the apparatus, means for transmitting on the line the voice parameters contained in the card, and a speech recognition system at the other end of the line to receive, firstly, the said parameters from the line and, secondly, a speech signal from the telephone apparatus, the speech recognition system being able to adapt its operation as a function of the voice parameters received.