BE1002379A4

BE1002379A4 - Voice recognition and in particular transcription process

Info

Publication number: BE1002379A4
Application number: BE8800993A
Authority: BE
Original assignee: Halleux Benoit De; Dutilleul Eric; Joffroy Jean Marc
Priority date: 1988-08-31
Filing date: 1988-08-31
Publication date: 1991-01-22

Abstract

Voice recognition process, and in particular written transcription of the voice, comprising a frequency analysis of the phonemes emitted by a speaker, characterised in that the phonemes undergo a specific frequency analysis that generates a static image for each phoneme, presented as immobile for a time that is sufficient for the phoneme concerned to be recognised.

Description

       

METHOD FOR THE RECOGNITION AND IN PARTICULAR THE WRITING OF SPEECH

Subject of the invention

The present invention aims to provide a method that allows the recognition of speech, and in particular the writing of speech, using a natural phonetic alphabet. It also relates to the means and devices for implementing this method.

Applications of the invention

Speech recognition has two types of application: spoken human-machine dialogue and aids for the hearing impaired.

This description, which is based on actual experiments in writing speech for the hearing impaired, will essentially describe that particular aspect by way of illustration. It is of course understood, however, that the invention is not limited to it.

Summary of the state of the art

  
 <EMI ID=1.1> 

  
is capable of matching a human being in speech recognition.

As an example of what can be done today in the field of human-machine dialogue, one may cite the applications developed within the ARPA Speech Understanding System project (US military applications). These techniques allow a rudimentary dialogue, based on a limited vocabulary (1000 words) and on a simple, fixed syntax.

Despite recent developments in computing and electronics, in particular in signal processing and artificial intelligence, these systems remain prohibitively expensive, and we are still very far from a speech recognition model that performs as well as a human being.

As regards aids for the hearing impaired, essentially two means are envisaged, apart from surgically implanted prostheses, to help deaf people.

The first relies on the sense of touch and takes the form, for example, of a vibrating bracelet that conveys rudimentary information such as a doorbell ringing or a call on a graphic telephone (Minitel).

The second, more developed means aims to replace the auditory channel, which has become inactive, with the visual channel. It thus covers all the techniques for transposing sound into images, such as lip reading, sign language, the graphic telephone (or Minitel) and speech training systems.

Lip reading is a first, dynamic way of understanding a spoken message, but it nevertheless presents troublesome ambiguities. Indeed, speech contains many

  
 <EMI ID=2.1> 

  
cannot be distinguished by the shape of the lips alone.

One solution, proposed by Dr. Cornett (Gallaudet College, Washington, 1967) to remove these ambiguities, was to add a complementary phonetic code conveyed by the shape and position of the speaker's hand around the face. The drawback of this method, called "cued speech", is that both the speaker and the reader must learn the code for mutual understanding. Another drawback of lip reading is that the speaker must be seen from the front.

Sign language and the graphic telephone share these major drawbacks: the two people in dialogue must both know the sign code, or must both be equipped with the appropriate apparatus in the case of the Minitel.

These recognition techniques for the deaf (apart from lip reading, which nevertheless remains very unreliable) tend to marginalize their restricted group of users (the hearing impaired, their families, teachers) with respect to people who do not feel concerned in one way or another by the problem of hearing disability and who will therefore make no effort to learn these techniques.

At present there is no truly effective translator of sound capable of giving deaf people the autonomy of use that is essential to their integration into the world of the hearing impaired.

The first attempt of this kind was undertaken by Bell Laboratories during the Second World War and resulted in a device called the "sonagraph" ("visible speech"), which for the first time transposed speech into images called "sonagrams".

Sonagrams are two-dimensional diagrams in which the abscissa is time and the ordinate represents frequency, the sound energy at the different frequencies being rendered as an intensity of blackness.
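The time-frequency grid described above can be sketched in a few lines. This is only an illustration, not the Bell Laboratories apparatus: the frame length, hop size, sampling rate and test tone are assumptions chosen for the example, and the function names `dft_magnitudes` and `sonagram` are hypothetical.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of the discrete Fourier transform of one frame (bins 0..N/2)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def sonagram(signal, frame_len, hop):
    """Rows are successive time frames, columns are frequency bins.

    Each cell holds the spectral magnitude that a sonagram would render
    as a darker or lighter mark.
    """
    return [
        dft_magnitudes(signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


# Assumed test signal: a 1 kHz tone sampled at 8 kHz.
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(256)]
grid = sonagram(tone, frame_len=64, hop=32)
# With 64-sample frames the bin spacing is fs / 64 = 125 Hz,
# so the tone falls in bin 8 of every frame.
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in grid]
```

Reading such a grid means following how the dark cells move from row to row, which is the continuously evolving image that the text goes on to criticize.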

  
For 40 years, experiments have been multiplying around the world to test the qualities of this representation, but the overall result has proved rather disappointing.

It appears that the best performance in reading sonagrams belongs to Victor Zue, a speech recognition expert at MIT (Boston). After a learning period of about 2500 hours (a few hours a day for

  
 <EMI ID=3.1> 

  
uttered by a large number of people, and did so in a time 20 times longer than the real time taken to produce the spoken message.

The poor performance in reading sonagrams results largely from the mediocre, or even unsuitable, quality of the representation used: using a grey level to approximate the amount of energy at each frequency gives a coarse image of the acoustic reality and thus bypasses an important part of the phonetic information. The fact that the image never stops evolving is the main cause of the reading difficulty; moreover, the redundancy of the information prevents the reader from concentrating on the important part of the visual representation.

Aim of the invention

The invention aims to provide a visual representation of speech that largely surpasses the sonagram, both in the greater richness of the phonetic information it presents and in the relative ease with which it can be learned.

Characteristic elements of the invention

The invention is based on a method for the recognition of speech, and in particular the writing of speech, comprising a frequency analysis of the phonemes uttered by a speaker, characterized in that the phonemes undergo a specific frequency analysis processing which, for each phoneme uttered, generates a static image that appears motionless (frozen) for a time sufficient to allow its recognition. Advantageously, the static image representing the phoneme, which we shall call the image character, must appear on the screen in a manner that seems instantaneous rather than progressive, and must remain motionless for the time needed to identify the phoneme.

It has been found that, after a reasonable learning period, a user can recognize the different static images thus associated with each phoneme.

According to a preferred embodiment of the above technique, it is advantageous to display several successive phonemes simultaneously, which allows the user to recognize them better by association with the phoneme or phonemes uttered before the one being read and, where applicable, the phoneme or phonemes already uttered after it.

As in conventional writing, in an advantageous application of the technique a sufficient number of phonemes will be visible simultaneously to allow the preceding verbal elements to be remembered. By verbal element we mean a phoneme or a group of successive phonemes. Where the word is broken down into verbal elements, the term image character refers to the image representing the verbal element. It is indeed desirable that at least the image characters corresponding to the verbal elements of a word appear simultaneously on the screen at the moment the word is interpreted. In fact, the same applies as with conventional writing: one need only think of the limits of legibility of subtitles on television and in the cinema.

It is desirable that the image characters remain motionless while they are displayed on the screen. When writing is concerned, the word

  
 <EMI ID=4.1> 

  
phonetic writing and is an abbreviation of the term "image character".

Unlike known techniques, the static nature of the image associated with each phoneme avoids redundancy of information, notably for long vowels. The proposed technique leads to a natural alphabet. Each verbal element, an essentially dynamic phenomenon, is associated with the static (frozen) image that it generates through a particular algorithm. This alphabet may optionally be reversible, that is, the images can in turn give rise to the types of verbal element that generated them.

A "natural" or "physical" alphabet is thus created, in the sense that, on the one hand, a verbal element by its very nature gives rise to an image (the image character) belonging to a family of images that one learns to associate with the type of verbal element in question and, on the other hand, the image character corresponding to a verbal element can by its very nature generate the type of verbal element in question.

This is an alphabet, and the technique of speech recognition by transposition into images is in fact a form of writing using the images that constitute the characters. It follows that the image characters are usefully grouped in succession within a word, and that image characters belonging to different words are usefully separated by a space. The words in question are those resulting from speech, not the words of conventional writing. Most of the time the words of phonetic writing do not coincide with the words of conventional writing; often a word of phonetic writing contains several words of conventional writing. "Word" means the word of phonetic writing unless it is specified that the word of conventional writing is meant.

A particular punctuation is usefully and advantageously used to distinguish the spaces between words from the spaces corresponding to a comma, semicolon or full stop. The evolution of the relevant characteristics of the acoustic signal during speech indicates whether a sentence is normal, interrogative or exclamatory, which naturally leads to the use of the equivalent of full stops, question marks and exclamation marks.

An advantageous variant of the technique further comprises displaying the duration of the phoneme's emission, the evolution of its amplitude over time, the regularity of the acoustic emission of the phoneme within the word and the sentence, and optionally information on the fundamental and the harmonics.

It has been found that particularly interesting results are obtained with a technique comprising a specific analysis of the phonemes according to whether they belong to a first group consisting of the phonemes other than fricative or plosive consonants, a second group consisting of the fricative consonants, or a third group consisting of the plosive consonants.

It has been found that each of these groups must undergo its own frequency analysis, namely:
- a frequency analysis of the first group at least up to a value of the order of 4 to 5 kHz over a duration of the order of at least one quasi-period,
- a frequency analysis of the second group at least up to a value of the order of 12 kHz over a duration greater than 2 milliseconds and not exceeding the duration of the phonemes,
- a frequency analysis of the third group at least up to a value of the order of 5 to 10 kHz over a duration sufficient to take account of the temporal evolution of the frequency characteristics.
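The three analysis regimes listed above can be summarized as a small lookup table. The sketch below is illustrative only: the class and field names are hypothetical, the upper value of each quoted range (5 kHz and 10 kHz) is an assumed choice, and the minimum sampling rate is simply the usual factor of two over the highest analysed frequency, which the text itself does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupAnalysis:
    phonemes: str        # which phonemes belong to the group
    max_freq_hz: float   # upper limit of the analysed band (assumed: top of the quoted range)
    window_rule: str     # duration rule paraphrased from the text


GROUPS = {
    1: GroupAnalysis("phonemes other than fricatives and plosives", 5_000.0,
                     "of the order of at least one quasi-period"),
    2: GroupAnalysis("fricative consonants", 12_000.0,
                     "more than 2 ms, at most the phoneme duration"),
    3: GroupAnalysis("plosive consonants", 10_000.0,
                     "long enough to capture the temporal evolution"),
}


def min_sampling_rate_hz(group):
    """Lowest sampling rate that can represent the group's band (twice the top frequency)."""
    return 2.0 * GROUPS[group].max_freq_hz


# A single digitizer serving all three groups must satisfy the most demanding one.
common_rate_hz = max(min_sampling_rate_hz(g) for g in GROUPS)
```

Under these assumptions the fricatives set the requirement: a common sampling rate of at least 24 kHz.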

  
The frequency analysis for the first two groups may use only part of the acoustic information emitted during production of the phoneme, because this information is redundant. The frequency analysis for the third group uses all or almost all of the acoustic information emitted during production of the plosive, or at the very least a substantial part of it.

Advantageously, analog signals picked up by a microphone are digitized and subjected, by means of an algorithm, to a frequency analysis specific to the group, so as to obtain a writing of speech based on its frequency analysis, which provides a natural phonetic alphabet.

To make this writing in a natural phonetic alphabet easier to read, the invention provides that the writing corresponding to each of the above groups is given a representation specific to that group, in particular a colour for group 1 different from the colour chosen for group 2, while the phonemes of group 3, which are given a representation of the sonagram type or equivalent (in the sense that the acoustic information present throughout almost the whole duration of the phoneme is advantageously used), are advantageously represented in a colour different from those of groups 1 and 2.

The groups can also be distinguished by conventions other than colour, for example different geometric shapes for each group. For instance, image characters of different sizes or shapes can be used.

An advantageous way of presenting the information compactly is to use a polar representation of the frequency transforms, or an equivalent.
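The text does not say how the polar representation is constructed. One plausible reading, shown here purely as an assumption, maps each frequency bin to an angle and its normalized magnitude to a radius, so that the spectrum becomes a closed outline. The function name `polar_points` is hypothetical.

```python
import math


def polar_points(magnitudes):
    """Map bin k of a spectrum to angle 2*pi*k/N and its magnitude to the radius.

    Returns (x, y) points with radii normalized to the largest magnitude,
    so the outline always fits inside the unit circle.
    """
    n = len(magnitudes)
    peak = max(magnitudes) or 1.0  # avoid dividing by zero for an all-zero spectrum
    points = []
    for k, m in enumerate(magnitudes):
        theta = 2 * math.pi * k / n
        radius = m / peak
        points.append((radius * math.cos(theta), radius * math.sin(theta)))
    return points


# Assumed toy spectrum with a single dominant bin.
outline = polar_points([0.0, 1.0, 0.0, 0.0])
```

A layout of this kind keeps every character within the same square footprint whatever the frequency range, which is one way to obtain the compactness the text mentions.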

  
By frequency analysis processing, or frequency analysis, is meant a mathematical processing of all or part of the time signal associated with the phoneme, in which time no longer appears in the result of the transformation, and which brings out characteristic information specific to each phoneme that will be used to recognize it.

By quasi-period is meant the apparent period of the quasi-periodic time signal of the phonemes of class 1. The quasi-period is measured by the duration of the smallest pattern of the time signal that, by successive juxtaposition, approximately reproduces the time signal. The concept of quasi-period is illustrated in figure 1.
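The description does not specify how the quasi-period is measured in practice. A common technique, used here as an assumption rather than as the inventors' method, is to look for the lag at which the signal best matches a shifted copy of itself (autocorrelation). The function name, lag bounds and synthetic signal are hypothetical.

```python
import math


def quasi_period(signal, min_lag, max_lag):
    """Lag in samples, within [min_lag, max_lag], that maximizes the autocorrelation.

    The unnormalized sum is used on purpose: for a repeating pattern it slightly
    favours the shortest repeating lag over its multiples, because shorter lags
    overlap on more samples.
    """
    def autocorrelation(lag):
        return sum(x * y for x, y in zip(signal[:-lag], signal[lag:]))

    return max(range(min_lag, max_lag + 1), key=autocorrelation)


# Assumed voiced-like signal: 100 Hz fundamental plus one harmonic, sampled at 8 kHz.
fs = 8000
voiced = [
    math.sin(2 * math.pi * 100 * t / fs) + 0.5 * math.sin(2 * math.pi * 200 * t / fs)
    for t in range(800)
]
lag = quasi_period(voiced, min_lag=20, max_lag=200)
period_ms = 1000.0 * lag / fs
```

For this synthetic signal the smallest repeating pattern lasts 80 samples, that is 10 ms, which would be the minimum analysis duration for a group 1 phoneme under the rule stated earlier.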

  
Other details and features of the invention will become apparent from the following description of a preferred embodiment of the invention.

Preferred embodiment of the invention

The approach followed in the study that led to the present patent application is essentially experimental.

The criterion used to assess the various representations studied was the percentage of phonemes correctly identified by the person reading the written speech. This criterion is strict, since it is generally not necessary to identify every phoneme correctly in order to understand the message. Indeed, the meaning, the coherence and the very syntactic structure of spoken language are often decisive for correct understanding of the message.

The phonetic content of the message is the only external information available to the listener, and it is also the only information the hearing impaired person lacks. This is why a visual aid system need not "understand" the message but simply provide an adequate phonetic representation of it.

The chosen representation is based on the notions of phoneme and phonetic alphabet. Although phonetic information varies continuously over time, it is most of the time locally redundant, so that the spoken message can be broken down into a succession of phonemes and parts of phonemes. The acoustic emission of a part of a

  
 <EMI ID=5.1> 

  
the phoneme.

The research was therefore organized around the representation of the different phonemes.

Thus, if each phoneme can be adequately represented, the set of these representations constitutes a phonetic alphabet which, like writing, makes it possible to describe spoken language phoneme by phoneme.

The invention is therefore essentially based on a

  
 <EMI ID=6.1> 

  
on a symbolic representation, so as to preserve the richness and generality of the natural information contained in the various phonetic nuances (age, sex, pronunciation, emotions, fatigue, etc.), which are absent from a form of writing in which each phoneme has its own stereotyped conventional symbol (printed character).

The classification of the sounds of the French language adopted during the study is fundamentally linked to the ways in which the phonemes are produced. This classification gathers all 36 phonemes of the French language into three groups. Group 3 contains the plosives, group 2 the fricatives and group 1 the remaining phonemes, namely the oral and nasal vowels,

  
 <EMI ID=7.1> 

  
This classification brings together phonemes whose frequency and/or temporal characteristics are similar. It is therefore closer to the aim of the invention than the linguistic classification.

Within each group, certain phonemes can be merged as follows:

  
 <EMI ID=8.1> 

  
e, ê, eu, written e (e.g. le, peur, mieux);

in, un, written ain (e.g. brin, brun);

i, written i (e.g. lit, gîte).

  
 <EMI ID=9.1> 

  
can, from the point of view of acoustic content (the one that concerns us here), be broken down into several phonemes:

  

 <EMI ID=10.1> 


  
Group 1 thus sees its membership reduced from 24 phonemes

  
 <EMI ID=11.1> 

  
extent that the merged or decomposed sounds are phonetically similar to their substitute. Thus, analysis of the phonetic content of the word "champignon", for example, shows that the phoneme "gn" breaks down into two parts, one almost identical to "n" and the other to

  
 <EMI ID=12.1> 

  
basic phonetic entities, whose characteristics are given in the following table.

  

 <EMI ID=13.1> 
 

  
It is important to note that the above approximations (mergers and decompositions) were introduced for convenience. They are not necessarily an essential condition for implementing the invention. Other variants are possible, in particular depending on the languages concerned.

As already indicated, the principle on which the invention is based is that a specific frequency analysis

  
 <EMI ID=14.1> 

  
Thus for group 1 it is not necessary to examine the spectrum beyond 5 kHz, since the energy is concentrated below that frequency. For group 2, a frequency analysis up to 12 kHz is necessary. For these two classes, in which the sounds can be sustained, the temporal evolution of the frequency parameters is practically nil: if one computes, for example, the amplitude spectra of a phoneme sustained for one second, the frequency analysis of a 20 ms, 50 ms or 500 ms portion, or of the whole phoneme, gives identical or similar results. The information is redundant in time, and a frequency analysis performed on a small window taken anywhere outside the zones where the neighbourhood effect is too strong is sufficient to characterize these sounds.
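The redundancy argument can be checked numerically on an idealized sustained sound. The sketch below is an illustration under stated assumptions, not measured speech: the signal is a synthetic sum of two steady sinusoids, the 16 kHz sampling rate and the 100 Hz frequency grid are arbitrary choices, and the helper name `amplitude_spectrum` is hypothetical.

```python
import cmath
import math

FS = 16000  # assumed sampling rate in Hz


def amplitude_spectrum(window, freqs_hz):
    """Amplitude of the window at each requested frequency (time no longer appears)."""
    n = len(window)
    return [
        2.0 / n * abs(sum(x * cmath.exp(-2j * math.pi * f * t / FS)
                          for t, x in enumerate(window)))
        for f in freqs_hz
    ]


# Assumed steady sound: components at 700 Hz and 1200 Hz held for 100 ms.
steady = [
    math.sin(2 * math.pi * 700 * t / FS) + 0.5 * math.sin(2 * math.pi * 1200 * t / FS)
    for t in range(1600)
]
freqs = list(range(100, 5001, 100))  # 100 Hz steps up to 5 kHz

short_window = amplitude_spectrum(steady[0:320], freqs)     # 20 ms taken at the start
long_window = amplitude_spectrum(steady[123:923], freqs)    # 50 ms taken elsewhere
max_difference = max(abs(a - b) for a, b in zip(short_window, long_window))
```

For this perfectly stationary signal the two spectra agree to within rounding error, whatever the position of the window. Real sustained phonemes are only approximately stationary, which is why the text speaks of identical or similar results.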

For the plosives (group 3), on the other hand, temporal information is very important, and the information presented by the image character must therefore take account of the temporal evolution of the frequency content. In the case of plosives the information is practically not redundant. It is therefore advantageous to use all or a substantial part of the signal to identify the plosive. The information presented will be the same as, of the same nature as, or equivalent to the information presented by a sonagram, in particular in that it uses a substantial part of the time signal.

The "heterogeneous" nature of this natural phonetic alphabet is justified by the existence of the three groups of sounds that give rise to it.

With the principle accepted, one can advantageously use an "alphabet" in which the sounds of group 1 are represented by the envelope curve of the energy spectrum up to 5 kHz, in red for example, those of group 2 in green, and those of group 3 by a presentation, advantageously in a colour different from those of groups 1 and 2, of the sonagram type or of another type of image character containing the same kind of information as the sonagram representation. The groups can of course be distinguished by criteria other than colour, in particular geometric criteria of the image characters.

The frequency analysis of almost the whole time signal of the plosive, or at least of a substantial part of it, or even of the whole time signal of the plosive plus part of the time signal of the preceding and/or following phoneme, can advantageously be regarded as such an image character.

A necessary preliminary processing must fulfil two functions: first, to cut the speech into phonemes and, second, to decide to which group (1, 2 or 3) these phonemes belong.

The first function, segmentation, is present in every analytical recognition system; in general it presents no difficulty. As an example, a segmenter using simple information such as the number of passages through

  
 <EMI ID=15.1> 

  
The second step poses no problem, since the frequency and/or temporal characteristics differ from one class to another.

EXAMPLES OF APPLICATION OF THE INVENTION

Figure 2 shows the pressure as a function of time (3 seconds) for the sentence "Alain est mon ami".

This sentence consists of 10 phonemes, which can be seen very distinctly in the time signal of figure 2, where they are highlighted by the black windows.

In this first example, the image character chosen to represent the phoneme is the linear prediction curve (envelope of the Fourier transform), plotted with a linear frequency scale from 0 to 5 kHz on the abscissa and a logarithmic scale (decibels) on the ordinate; the resulting writing of the example sentence is shown in figure 3.
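A linear prediction envelope of this kind can be computed with the classical autocorrelation method and the Levinson-Durbin recursion. The text does not give the prediction order, the window or the sampling rate used for figure 3, so everything below is an assumed sketch: the order of 2, the 10 kHz sampling rate (which yields a 0 to 5 kHz axis) and the synthetic one-resonance signal are chosen only to keep the example small, and the function names are hypothetical.

```python
import cmath
import math


def lpc(signal, order):
    """Linear prediction coefficients [1, a1, ..., ap] and residual energy (Levinson-Durbin)."""
    n = len(signal)
    r = [sum(signal[t] * signal[t + k] for t in range(n - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err


def envelope_db(a, gain, freqs_hz, fs):
    """Spectral envelope gain / |A(f)|^2 in decibels on a linear frequency grid."""
    values = []
    for f in freqs_hz:
        w = 2 * math.pi * f / fs
        denom = abs(sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(a))) ** 2
        values.append(10 * math.log10(gain / denom))
    return values


# Assumed test signal: impulse response of a single resonance near 1 kHz, fs = 10 kHz.
fs = 10000
theta = 2 * math.pi * 1000 / fs
c1, c2 = 2 * 0.95 * math.cos(theta), -0.95 ** 2
y = [1.0, c1]
for _ in range(398):
    y.append(c1 * y[-1] + c2 * y[-2])

a, gain = lpc(y, order=2)
freqs = list(range(0, 5001, 100))
env = envelope_db(a, gain, freqs, fs)
peak_hz = freqs[max(range(len(env)), key=env.__getitem__)]
```

On this toy signal the envelope peaks at the resonance frequency. For real vowels a higher order would be needed so that the curve can follow several resonances at once.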

  
It can be seen that each phoneme is represented by one and only one image character. The image character is presented motionless, preceded and/or followed by a certain number of other image characters. All the image characters presented are revealed in a manner that appears instantaneous.

A trained reader can read each of the phonemes represented by the image characters of figure 3. The windows numbered 1 to 10 in figure 2 correspond to 200 ms zooms, which are shown in turn in figures 4 and 6 to 14.

They illustrate the redundant nature of the information in the time signal and the way in which the image characters representing the phonemes were chosen.

Figure 4 shows at the top the first zoom of the time signal, which corresponds to the first phoneme "A" of the sentence, and below it the evolution of its characteristic amplitude spectrum computed over 20 ms windows designated by the lower-case letters a to f. A first observation stands out: the shape of the spectrum is sufficiently stable over all 6 computation windows and remains consistent with the model of the isolated phoneme A, as can be seen in figure 5, which shows the image characters corresponding to group 1 phonemes uttered in isolation by the same speaker as the sentence in question.

The time signal of figure 6 contains the whole of the sound "L", shown by the spectra for windows a-b-c, and the "L" - "AIN" transition, characterized by windows d-e-f.

In the latter windows one can indeed see the general shape of the "AIN" appearing, although it remains influenced by the marked rise of the "L" at 1.5 kHz. This is a neighbourhood effect.

Figure 7, for its part, shows in its windows a-b-c-d-e a more typical model of the "AIN".

Window a of figure 8 closes the word ALAIN and window b represents the beginning of the phoneme "è".

In figure 9, the sound "è" can be recognized in windows a-b-c-d, and the tentative appearance of the "M" can be made out in windows e-f.

  
 <EMI ID=16.1> 

  
much more revealing of the presence of the "M", as are curves e and f for the "ON".

Figure 10 shows in windows b-c-d-e-f examples of spectra of the sound "N". Figure 11 relates to the phoneme "A". Figure 12 relates to the phoneme "M". <EMI ID=17.1> 

  
With training, it is possible to learn to read image characters such as those shown in figure 3. Tests have shown that isolated phonemes of group 1 uttered by a speaker to whom the reader has become accustomed (that is, the reader has practised recognizing the image characters corresponding to the isolated group 1 phonemes uttered by that speaker) are recognized at a rate of 99 %. As for coarticulated phonemes in speech, they are identified at a rate of 96 %; the meaning of the word and of the sentence subsequently makes it possible to remove the residual ambiguity, as when reading handwriting. A phoneme is said to be coarticulated when it is not produced in isolation but within a word of spoken language.

Figures 15, 16, 17, 18, 19, 20, 21 and 22 follow the same approach for the word "sosie", phonetically sozie.

Figure 17 shows, for the same speaker, the image characters corresponding to group 2 phonemes, the fricatives. For the image characters corresponding to the fricatives, the frequency scale was chosen to extend from 0 <EMI ID=18.1>  

CLAIMS

1. Method for the recognition of speech, and in particular the writing of speech, comprising a frequency analysis of the phonemes uttered by a speaker, characterized in that the phonemes undergo a specific frequency analysis processing that generates, for each phoneme, a static image presented motionless for a time sufficient to allow the phoneme concerned to be recognized.



  METHOD FOR THE RECOGNITION AND IN PARTICULAR THE WRITING OF SPEECH

Subject of the invention

The present invention aims to provide a method that allows the recognition of speech, and in particular the writing of speech, using a natural phonetic alphabet. It also relates to the means and devices for implementing this method.

Applications of the invention

Speech recognition has two types of application: spoken human-machine dialogue and aids for the hearing impaired.

This description, which is based on actual experiments in writing speech for the hearing impaired, will essentially describe that particular aspect by way of illustration. It is understood, however, that the invention is not limited to it.

Summary of the state of the art

  
 <EMI ID = 1.1>

  
is not able to match man in speech recognition.

  
As an example of what can be done today in the field of human-machine dialogue, one may cite the applications developed within the ARPA Speech Understanding System project (US military applications). These techniques allow a rudimentary dialogue, based on a limited vocabulary (1000 words) and on a simple, fixed syntax.

Despite recent developments in computing and electronics, particularly in signal processing and artificial intelligence, these systems remain prohibitively expensive, and we are still far from a speech recognition model that performs as well as a human being.

As regards aids for the hearing impaired, essentially two means are envisaged, apart from surgically implanted prostheses, to help deaf people.

The first relies on the sense of touch and takes the form, for example, of a vibrating bracelet that conveys rudimentary information such as a doorbell ringing or a call on a graphic telephone (Minitel).

The second, more developed means aims to replace the auditory channel, which has become inactive, with the visual channel. It thus covers all the techniques for transposing sound into images, such as lip reading, sign language, the graphic telephone (or Minitel) and speech training systems.

Lip reading is a first, dynamic way of understanding a spoken message, but it still presents troublesome ambiguities. Indeed, speech contains many

  
 <EMI ID = 2.1>

  
cannot be distinguished by the shape of the lips alone.

One solution, proposed by Dr. Cornett (Gallaudet College, Washington, 1967) to resolve these ambiguities, was to add a complementary phonetic code conveyed by the shape and position of the speaker's hand around the face. The disadvantage of this method, called "cued speech", is that both the speaker and the reader must learn the code for mutual understanding. Another disadvantage of lip reading is that the speaker must be seen from the front.

Sign language and the graphic telephone share these major drawbacks: the two people in dialogue must both know the sign code, or must both be equipped with the appropriate apparatus in the case of the Minitel.

These recognition techniques for the deaf (apart from lip reading, which nevertheless remains very unreliable) tend to marginalize their restricted group of users (the hearing impaired, their families, teachers) with respect to people who do not feel concerned in one way or another by the problem of hearing disability and who will therefore make no effort to learn these techniques.

At present there is no truly effective translator of sound capable of giving deaf people the autonomy of use that is essential to their integration into the world of the hearing impaired.

The first attempt of this kind was undertaken by Bell Laboratories during the Second World War and resulted in a device called the "sonagraph" ("visible speech"), which for the first time transposed speech into images called "sonagrams".

Sonagrams are two-dimensional diagrams in which the abscissa is time and the ordinate represents frequency, the sound energy at the different frequencies being rendered as an intensity of blackness.

For 40 years, experiments have been multiplying all over the world to test the qualities of this representation, but the results are ultimately rather disappointing.

It appears that the best performance in reading sonagrams belongs to Victor Zue, a speech recognition expert at MIT (Boston). After a learning period of about 2500 hours (a few hours a day for

  
 <EMI ID = 3.1>

  
emitted by a large number of people and this, in a time 20 times greater than the real time of production of the verbal message.

  
The poor performance of sonogram reading results in large part from a poor or even unsuitable quality of the representation used; the use of a gray intensity approximating the amount of energy in frequency gives a rough image of the sound reality and thus shortens an important part of the phonetic information. The lack of stopping the evolution of the image is the main cause of the reading difficulty; moreover, the redundancy of information does not allow to focus on the important part of the visual representation. Purpose of the invention

  
The invention aims to provide a visual representation of speech which is likely to greatly surpass the sonogram, both in the greater richness of the phonetic information it presents and in the relative ease with which it can be learned.

  
Characteristic elements of the invention

  
The invention is based on a method of speech recognition and in particular of writing speech comprising a frequency analysis of the phonemes emitted by a speaker, characterized in that the phonemes undergo a specific frequency analysis processing which for each phoneme expressed generates a static image, appearing immobile (frozen) for a sufficient time to allow its recognition. Advantageously, the static image representing the phoneme which we will call the image character must appear on the screen in a way which appears instantaneous and not progressive and remain stationary during the time of identification of the phoneme.

  
It has been found that a user can, after a reasonable learning period, recognize the different static images thus associated with each phoneme.

  
According to a preferred embodiment of the aforementioned technique, it is advantageous to provide a simultaneous display of several successive phonemes, which allows the user to better recognize them by association with the phoneme(s) expressed previously to that which is read and possibly the phoneme(s) already expressed after the one being read.

  
As in classical writing, in an advantageous application of the technique a sufficient number of phonemes will be visible simultaneously to support the memory of the preceding verbal elements. By verbal element we mean a phoneme or a group of successive phonemes. Where the word is broken down into verbal elements, the term character-image refers to the image representing the verbal element. It is indeed desirable that at least the image characters corresponding to the verbal elements of one word appear on the screen simultaneously while the word is being interpreted. The same holds for classical writing: just think of the borderline legibility of subtitles on television and in the cinema.

   It is desirable that the image characters be stationary during their appearance on the screen. When it comes to writing, the word

  
 <EMI ID = 4.1>

  
phonetic writing and is an abbreviation of the word "character-image".

  
Contrary to known techniques, the static character of the image associated with each phoneme makes it possible to avoid redundancy of the information, in particular for long vowels. The proposed technique leads to the development of a natural alphabet. Each verbal element, an essentially dynamic phenomenon, is associated with the static (frozen) image that it generates thanks to a particular algorithm. This alphabet can possibly be reversible, that is to say that the images can in turn give rise to the types of verbal elements which generated them.

   We thus create a "natural" or "physical" alphabet, in the sense that, on the one hand, a verbal element, by what it is, gives rise to an image (the character-image) belonging to a family of images which one learns to associate with the type of verbal element in question and, on the other hand, the character-image corresponding to a verbal element can, by what it is, generate the type of verbal element in question.

  
It is an alphabet, and the technique of speech recognition by transposition into images is in fact a writing using the images which constitute its characters. It is therefore understandable that the image characters will usefully be grouped together in successive sequence in a word and that the image characters relating to different words will usefully be separated by spacing. These are words resulting from speech and not words from classical writing. The words of phonetic writing most of the time do not coincide with the words of classical writing. Often the word of phonetic writing contains several words of classical writing. "Word" means the word of phonetic writing unless it is specified that it is the word of classical writing.

   A particular punctuation will usefully and advantageously be used to distinguish the spaces between words from the spaces corresponding to the comma, the semicolon and the period. The evolution of the appropriate characteristics of the acoustic signal during speech will provide information on the declarative, interrogative or exclamatory character of a sentence, which will naturally lead to the use of the equivalents of periods, question marks and exclamation marks.

  
An advantageous variant of the technique furthermore displays the length of the phoneme emission, the evolution over time of its amplitude, the regularity of the acoustic emission of the phoneme in the word and in the sentence, and possibly information on the fundamental and harmonics.

  
It appeared that particularly interesting results are obtained by a technique comprising an operation of specific analysis of phonemes according to whether they belong to a first group made up of phonemes other than fricative consonants or plosive consonants, to a second group made up by the fricative consonants and a third group constituted by the plosive consonants.

  
It appeared that each of these groups must undergo its own frequency analysis, namely:
- a frequency analysis of the first group at least up to a value of the order of 4 to 5 kHz over a duration of the order of at least a quasi-period,
- a frequency analysis of the second group at least up to a value of the order of 12 kHz for a duration greater than 2 milliseconds and not exceeding the duration of the phonemes,
- a frequency analysis of the third group at least up to a value of the order of 5 to 10 kHz for a period sufficient to take account of the temporal evolution of the frequency characteristics.

  
Frequency analysis for the first two groups may only use part of the acoustic information emitted during the production of the phoneme because this information is redundant. Frequency analysis for the third group uses all or almost all of the acoustic information emitted during the production of the plosive or at least a significant part of this information.
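The three group-specific analyses can be summarised as a small table of parameters. The following is a minimal sketch under stated assumptions: the names `GroupAnalysis`, `GROUPS` and `band_limited_spectrum` are hypothetical, the upper limits of 5 kHz and 10 kHz are picked from the ranges "4 to 5 kHz" and "5 to 10 kHz" given above, and a plain Fourier amplitude spectrum stands in for whatever transform is actually used.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class GroupAnalysis:
    description: str
    max_freq_hz: float        # upper limit of the analysed band
    uses_whole_phoneme: bool  # True when (almost) the whole signal is exploited

# Assumed parameter choices within the ranges stated in the description.
GROUPS = {
    1: GroupAnalysis("phonemes other than fricatives and plosives", 5_000.0, False),
    2: GroupAnalysis("fricative consonants", 12_000.0, False),
    3: GroupAnalysis("plosive consonants", 10_000.0, True),
}

def band_limited_spectrum(segment, sample_rate, group):
    """Amplitude spectrum of `segment`, truncated at the group's upper limit."""
    segment = np.asarray(segment, dtype=float)
    params = GROUPS[group]
    spectrum = np.abs(np.fft.rfft(segment * np.hanning(len(segment))))
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sample_rate)
    keep = freqs <= params.max_freq_hz
    return freqs[keep], spectrum[keep]
```

For groups 1 and 2 the segment passed in may be a short excerpt of the phoneme; for group 3 the caller would pass all or almost all of the plosive, since the flag `uses_whole_phoneme` only records that requirement and does not enforce it.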

  
Advantageously, analog signals picked up by a microphone undergo digitization and are subjected, using an algorithm, to a frequency analysis specific to the group, in order to obtain a writing of speech based on this frequency analysis, which provides a natural phonetic alphabet.

  
In order to facilitate the reading of said writing according to a natural phonetic alphabet, the invention provides that the writing corresponding to each of the aforementioned groups is given a representation particular to that group; in particular, group 1 is represented in a color different from the color chosen for group 2, while the phonemes of group 3, which are given a sonogram or equivalent type of representation (in the sense that advantageous use is made of the acoustic information present over almost the entire duration of the phoneme), are advantageously represented in a color other than those of groups 1 and 2.

  
We can also discriminate between groups by adopting conventions other than colors, for example different geometric shapes for each group. By way of example, "image characters" of different dimensions or shapes can be used.

  
An advantageous way of presenting information in a compact manner is to use a polar representation of the frequency or equivalent transforms.
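One possible reading of such a polar representation is sketched below. The convention used here (frequency mapped to the angle over one full turn, normalised amplitude mapped to the radius) and the name `polar_outline` are assumptions; the text does not specify the mapping.

```python
import numpy as np

def polar_outline(freqs_hz, amplitudes, max_freq_hz):
    """Map a spectrum to x/y points: angle follows frequency, radius follows amplitude."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    amplitudes = np.asarray(amplitudes, dtype=float)
    angle = 2.0 * np.pi * freqs_hz / max_freq_hz  # 0..max_freq_hz covers one turn
    radius = amplitudes / amplitudes.max()        # largest amplitude lies on the unit circle
    return radius * np.cos(angle), radius * np.sin(angle)
```

The resulting closed outline occupies a fixed square area whatever the width of the analysed band, which is one way of obtaining a compact image character.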

  
By frequency analysis processing or frequency analysis is meant a mathematical processing of all or part of the time signal linked to the phoneme, where time no longer appears in the resultant obtained by the transformation and which makes it possible to highlight characteristic information specific to each phoneme, which will be used to recognize the phoneme.

  
By quasi-period is meant the apparent period of the quasi-periodic time signal relating to the phonemes of class 1. The quasi-period is measured by the duration of the smallest pattern of the time signal which, by successive juxtaposition, almost regenerates the time signal. The concept of quasi-period is illustrated in Figure 1.
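As an illustration, the quasi-period of a class 1 phoneme can be estimated as the lag at which the signal best matches a shifted copy of itself. Autocorrelation is a standard technique swapped in here for the sketch; the text does not say how the quasi-period is measured in practice. The name `estimate_quasi_period` and the 60 to 400 Hz search range are assumptions.

```python
import numpy as np

def estimate_quasi_period(segment, sample_rate, min_hz=60.0, max_hz=400.0):
    """Return the estimated quasi-period in seconds."""
    x = np.asarray(segment, dtype=float)
    x = x - x.mean()
    # Autocorrelation for non-negative lags only.
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    # Search only lags compatible with the assumed range of voice pitch.
    lo = int(sample_rate / max_hz)
    hi = min(int(sample_rate / min_hz), len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return lag / sample_rate
```

A frequency analysis "over a duration of the order of at least a quasi-period" would then use a window at least `estimate_quasi_period(...) * sample_rate` samples long.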

  
Other details and characteristics of the invention will appear on reading the description which follows, relating to a preferred embodiment of the invention.

  
Preferred embodiment of the invention

  
The approach followed during the study which led to this patent application is essentially experimental.

  
The criterion used to assess the various representations studied was the percentage of correct identification of phonemes by the person who reads the writing of speech. This criterion is severe since it is generally not necessary to correctly identify all of the phonemes to understand the message. Indeed, the significance, the coherence and the syntactic structure itself of the oral language are often determining for the good comprehension of the message.

  
The phonetic content of the message is the only external information available to the listener and it is also the only one that the hearing impaired lack. This is why a visual aid system must not "understand" the message, but simply provide sufficient phonetic representation.

  
The representation chosen is based on the notions of phoneme and phonetic alphabet. Indeed, although the phonetic information varies continuously over time, it is most of the time locally redundant, so that the oral message can be decomposed into a succession of phonemes and parts of phonemes. The acoustic emission of part of a

  
 <EMI ID = 5.1>

  
the phoneme.

  
The research therefore revolved around the representation of the different phonemes.

  
Thus, if each phoneme can be adequately represented, all of these representations then constitute a phonetic alphabet allowing, like writing, to describe oral language phoneme by phoneme.

  
The invention is therefore essentially based on a

  
 <EMI ID = 6.1>

  
on a symbolic representation so as to keep the richness and the general character of the natural information, which contains various phonetic nuances (age, sex, pronunciation, emotions, fatigue, ...) absent from a writing in which each phoneme has its conventional, stereotyped symbol (printed character).

  
The classification of sounds of the French language adopted during the study is fundamentally linked to the modes of production of phonemes. This classification groups all 36 phonemes of the French language into three groups. Group 3 contains the plosives, group 2 the fricatives and group 1 the rest of the phonemes, namely the oral and nasal vowels,

  
 <EMI ID = 7.1>

  
This classification brings together phonemes with similar frequency and / or time characteristics. It is therefore closer to the aim of the invention than linguistic classification.

  
In each of the groups, certain phonemes can be grouped as follows:

  
 <EMI ID = 8.1>

  
e, ê, eu, denoted e (e.g. peur, mieux);

  
in, un, denoted ain (e.g. brin, brun);

  
i, denoted i (e.g. lit, gîte).

  
 <EMI ID = 9.1>

  
can be from the point of view of acoustic content (the one that interests us) broken down into several phonemes:

  

 <EMI ID = 10.1>


  
Group 1 therefore sees its membership reduced from 24 phonemes

  
 <EMI ID = 11.1>

  
measure that the grouped or decomposed sounds are phonetically similar to their substitute. Thus, the analysis of the phonetic content of the word "champignon" (mushroom), for example, shows that the phoneme "gn" breaks down into two parts, one almost identical to "n" and the other to

  
 <EMI ID = 12.1>

  
basic phonetic entities whose characteristics are shown in the following table.

  

 <EMI ID = 13.1>
 

  
It is important to note that the above approximations (groupings and decompositions) have been introduced for ease of reference. They do not necessarily constitute an essential condition for the implementation of the invention. Other variants depending on the languages in question in particular are possible.

  
As indicated, the principle on which the invention is based is that a specific frequency analysis

  
 <EMI ID = 14.1>

  
Thus for group 1, it is not necessary to study the spectrum beyond 5 kHz since the energy is concentrated below that frequency. For group 2, a frequency analysis is necessary up to 12 kHz. For these two classes, in which the sounds can be prolonged, the temporal evolution of the frequency parameters is practically zero: if, for example, we calculate the amplitude spectra of a phoneme whose duration is extended to one second, we see that the frequency analysis of a portion of 20 ms, 50 ms or 500 ms, or of the entire phoneme, gives identical or similar results. The information is redundant over time, and a frequency analysis performed on a small window taken anywhere outside the areas where the neighborhood effect is too large is enough to characterize these sounds.
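The redundancy over time can be checked on a synthetic example. The sketch below uses a stationary sum of two sinusoids as a stand-in for a sustained group 1 sound (an assumption; it is not real speech) and compares windows of 20 ms, 50 ms and 500 ms. The name `dominant_frequency` and the test frequencies are hypothetical.

```python
import numpy as np

def dominant_frequency(segment, sample_rate):
    """Frequency (Hz) of the largest peak of the amplitude spectrum."""
    windowed = segment * np.hanning(len(segment))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate  # one second of a sustained sound
sustained = np.sin(2 * np.pi * 700 * t) + 0.6 * np.sin(2 * np.pi * 1200 * t)

# Windows of different lengths, all starting 250 ms into the sound.
start = 4_000
peaks = {
    ms: dominant_frequency(sustained[start:start + sample_rate * ms // 1000], sample_rate)
    for ms in (20, 50, 500)
}
```

For such a stationary signal the three windows locate the main spectral peak at the same frequency; only the frequency resolution improves with the longer windows.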

   With regard to plosives (group 3) on the other hand, the temporal information is very important and it is understood that the information presented by the image character must therefore take into account the temporal evolution of the frequency content. The information is practically not redundant in the case of plosives. It is therefore advantageous to exploit all or a significant part of the signal for the identification of the plosive. The information presented will be the same or of the same nature or equivalent to the information presented by a particular sonogram in the sense that it uses a significant part of the time signal.

  
The "heterogeneous" nature of this natural phonetic alphabet is justified by the existence of three groups of sounds which give rise to it.

  
The principle being admitted, one can advantageously use an "alphabet" in which the sounds of group 1 are represented by the envelope curve of the energy spectrum up to 5 kHz in one color, for example red, those of group 2 in green, and those of group 3 by a presentation, advantageously in a color different from those of groups 1 and 2, of the sonogram type or of another type of representation by character-image containing information of the same kind as in the sonogram representation. It is of course possible to discriminate according to criteria other than color, in particular geometric criteria of the image characters.

  
Frequency analysis of almost all of the time signal of the plosive or at least a significant part of this signal or even all of the time signal of the plosive plus a part of the time signal of the preceding and / or next phoneme can advantageously be considered as such an image character.

  
A necessary preliminary processing must fulfill two functions: on the one hand, to cut the speech into phonemes and on the other hand, to decide to which group (1, 2 or 3) these phonemes belong.

  
The first function, segmentation, is present in any analytical recognition system; it generally presents no difficulty. For example, a segmenter using simple information such as the number of passes to

  
 <EMI ID = 15.1>

  
The second step poses no problems since the frequency and / or time characteristics are different from one class to another.

EXAMPLES OF APPLICATION OF THE INVENTION

  
FIG. 2 represents the evolution of the pressure over time (3 seconds) for the sentence "Alain est mon ami" ("Alain is my friend").

  
This sentence is made up of 10 phonemes which can be seen very clearly in the time signal in Figure 2, these being highlighted by the black windows.

  
In this first example, the image character chosen to represent the phoneme is the linear prediction curve (envelope of the Fourier transform), drawn with a linear frequency scale from 0 to 5 kHz on the abscissa and a logarithmic scale (decibels) on the ordinate; the writing of the example sentence is shown in FIG. 3.
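A linear prediction envelope of this kind can be sketched as follows, using the autocorrelation method. The prediction order of 12, the Hamming window, the number of plotted points and the name `lpc_envelope_db` are assumptions made for the sketch; the text only states that a linear prediction curve is plotted from 0 to 5 kHz in decibels.

```python
import numpy as np

def lpc_envelope_db(segment, sample_rate, order=12, n_points=256, max_freq_hz=5_000.0):
    """Return (freqs, envelope_db): the all-pole spectral envelope of `segment`."""
    x = np.asarray(segment, dtype=float) * np.hamming(len(segment))
    # Autocorrelation of the windowed signal for lags 0..order.
    r = np.array([np.dot(x[:len(x) - lag], x[lag:]) for lag in range(order + 1)])
    # Normal (Yule-Walker) equations R a = r[1:], with R a Toeplitz matrix.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)  # slight regularisation for numerical safety
    a = np.linalg.solve(R, r[1:])
    gain = np.sqrt(max(r[0] - np.dot(a, r[1:]), 1e-12))
    # Evaluate gain / |1 - sum_k a_k exp(-j w k)| on a linear frequency axis.
    freqs = np.linspace(0.0, max_freq_hz, n_points)
    omega = 2.0 * np.pi * freqs / sample_rate
    k = np.arange(1, order + 1)
    denom = 1.0 - np.exp(-1j * np.outer(omega, k)) @ a
    return freqs, 20.0 * np.log10(gain / np.abs(denom))
```

Plotting `envelope_db` against `freqs` gives one smooth curve per analysed window, which is the kind of static image that could serve as an image character for a group 1 phoneme.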

  
We see that each phoneme is represented by one and only one image character. The image character is presented motionless, preceded and/or followed by a certain number of other image characters. All of the image characters presented are discovered in a way that seems instantaneous.

  
A reader who has undergone training can read each of the phonemes represented by the image characters of FIG. 3. The windows identified by the numbers 1 to 10 in FIG. 2 correspond to zooms of 200 ms which are reproduced successively in Figures 4 and 6 to 14.

  
They illustrate the redundant nature of the information in the time signal and illustrate the way in which the image characters representing the phonemes were chosen.

  
Figure 4 shows at the top the first zoom of the time signal, which corresponds to the first phoneme "A" of the sentence, and below it the evolution of its characteristic amplitude spectrum calculated on windows of 20 ms designated by the lowercase letters a to f. A first observation stands out: the shape of the spectrum is sufficiently stable over all 6 calculation windows and remains consistent with the model of the isolated phoneme A, as can be seen in FIG. 5, which represents the image characters corresponding to the phonemes of group 1 emitted in isolation by the same speaker as that of the sentence concerned.

  
The time signal of FIG. 6 contains the entire sound "L", highlighted by the spectra relating to windows a-b-c, and the transition "L" - "AIN", characterized by windows d-e-f.

  
We note in fact in the latter, the appearance of the general shape of the "AIN" which remains however influenced by the significant increase in the "L" at 1.5 kHz. It is a neighborhood effect.

  
Figure 7 shows, through its windows a-b-c-d, a more typical "AIN" pattern.

  
Window a of FIG. 8 closes the word ALAIN and window b represents the beginning of the phoneme "è".

  
In Figure 9, we recognize in windows a-b-c-d the sound "è" and we guess in windows e-f the timid appearance of "M".

  
 <EMI ID = 16.1>

  
much more revealing as to the presence of the "M" as are the curves e and f for the "ON".

  
Figure 10 shows in windows b-c-d-e-f examples of "N" sound spectra. FIG. 11 relates to the phoneme "A". FIG. 12 relates to the phoneme "M". <EMI ID = 17.1>

  
With training, it is possible to learn to read image characters such as those represented in FIG. 3. Tests have shown that the isolated phonemes of group 1 pronounced by a speaker to whom one has become accustomed (that is to say, a speaker for whom one has been trained to recognize the image characters corresponding to the isolated phonemes of group 1) are recognized in a proportion of 99%. As for the phonemes coarticulated in language, they are identified in a proportion of 96%; the meaning of the word and of the sentence then makes it possible to remove the residual ambiguity, as in reading handwriting. We speak of a coarticulated phoneme when the phoneme is produced not in isolation but within a word of spoken language.

  
Figures 15 to 22 show the same approach for the word "sosie" (double), phonetically "sozie".

  
FIG. 17 represents, for the same speaker, the image characters corresponding to group 2 of phonemes, the fricatives. For the image characters corresponding to the fricatives, the frequency scale has been chosen ranging from 0 <EMI ID = 18.1>

CLAIMS

  
1. A method of speech recognition and in particular of writing speech comprising a frequency analysis of the phonemes emitted by a speaker, characterized in that the phonemes undergo a specific frequency analysis processing which generates a static image for each phoneme presented motionless for a time sufficient to allow recognition of the phoneme concerned.


2. Method according to claim 1 characterized in that the static image representing the phoneme appears in an instantaneous manner on the screen.

3. Method according to claims 1 or 2 characterized in that it comprises a simultaneous display of several successive phonemes.

4. Method according to any one of claims 1 to 3 characterized in that the images form characters or are represented by characters which are grouped together in succession into a word, the characters relating to different words being usefully separated by spacing.

5. Method according to any one of claims 1 to 4 characterized in that it comprises the display of the length of the emission of the verbal element, the evolution over time of its amplitude and the regularity of the acoustic emission in <EMI ID = 19.1>

specific frequency of phonemes according to whether they belong to a first group consisting of phonemes other than fricative consonants or plosive consonants, to a second

group made up of plosive consonants.

7. Method according to any one of claims 1 to 6 characterized in that one proceeds to a frequency analysis of

over a period of the order of at least a quasi-period.

8. Method according to claim 7 characterized in that one proceeds to a frequency analysis of the second group at least up to a value of the order of 12 kHz for a duration greater than 1 millisecond and not exceeding the duration of the phoneme.

9. Method according to claim 7 or 8 characterized in that one proceeds to a frequency analysis of the third group

account of the temporal evolution of the frequency characteristics.

10. Method according to any one of claims 1 to 9 characterized in that analog signals picked up by a microphone undergo digitization and are subjected, using an algorithm, to a frequency analysis specific to the group in order to obtain a writing of speech based on this frequency analysis, which provides a natural phonetic alphabet.

11. Method according to any one of claims 6 to 10 characterized in that the writing corresponding to each of the 3 aforementioned groups is given a particular representation for each group, in particular group 1 being represented in a color different from the color retained for the representation of group 2, while the phonemes of group 3 are given a sonogram representation.

12. Method according to any one of claims 6 to 10 characterized in that, in the writing corresponding to each of the 3 aforementioned groups, group 1 is represented in a color different from the color used for group 2, while the phonemes of group 3 are represented in a color different from those of the phonemes of groups 1 and 2.

13. Method according to any one of claims 6 to 12 characterized in that the writing corresponding to each of the 3 aforementioned groups presents, as the image character corresponding to each phoneme: for group 1, the frequency transform obtained by exploiting only part or all of the time signal of the phoneme, at least up to around 4 to 5 kHz; for group 2, the corresponding transform obtained by exploiting only part or all of the time signal of the phoneme, at least up to around 12 kHz; for group 3, the frequency transform obtained by exploiting either all of the acoustic information of the time signal of the phoneme, or almost all of the acoustic information of the time signal of the phoneme, or all of the acoustic information of the time signal of the phoneme together with a part of that of one or two neighboring phonemes, or a part of the acoustic information of the time signal of the phoneme together with a part of that of a neighboring phoneme.

14. Method according to any one of claims 6 to 13 characterized in that the writing corresponding to each of the above groups undergoes a particular representation for each group according to different geometric shapes for each group.