FR2647249A1

FR2647249A1 - SPEECH RECOGNITION METHOD

Info

Publication number: FR2647249A1
Application number: FR9005864A
Authority: FR
Inventors: Ian Bickerton
Original assignee: Smiths Group PLC
Current assignee: Smiths Group PLC
Priority date: 1989-05-18
Filing date: 1990-05-04
Publication date: 1990-11-23
Anticipated expiration: 2010-05-04
Also published as: GB8911461D0; JPH0315898A; FR2647249B1; GB2231698B; DE4012337A1; GB9010291D0; GB2231698A

Abstract

The invention relates to a speech recognition method offering improved recognition of spoken sounds. A reference vocabulary is stored in a memory by speaking multiple specimens of known words. The specimens of each word are time-aligned and delivered to a neural network, which identifies the characteristics of each word that distinguish it from the other words in the reference vocabulary. These characteristics are incorporated into the parameters of a hidden semi-Markov model and stored in memory. Subsequently, signals representing words to be recognized are compared with the stored information, after a syntax restriction. Applications: speech recognition apparatus.

Description


SPEECH RECOGNITION METHOD

The present invention relates to a speech recognition method. In a complex installation having multiple functions, it can be useful to control the installation by voice commands. This is also useful when the user's hands are occupied with other tasks, or when the user is prevented from or incapable of using his or her hands to operate the installation in the conventional way by means of levers or buttons.

A speech recognition apparatus is programmed by reading out a list of words or terms (including compound expressions) to be recorded in a reference vocabulary. The spoken sounds are broken down into spectral components and stored as spectral templates, or patterns, of the words as a function of time.

Subsequently, when an unknown word is spoken, it is likewise broken down into its spectral components, and these are compared with the reference vocabulary by means of a suitable algorithm such as a hidden semi-Markov model. Preferably, the reference vocabulary is built up by having the same word repeated many times, in different circumstances and by different people. This introduces variety and broadens the word models, so that the same word is more likely to be matched to its model when it is spoken later. It can, however, cause the models of similar-sounding words to overlap, and hence increase the probability of incorrect identification.

The use of neural networks has also been proposed, but these do not allow continuous speech to be identified.

Furthermore, the ability to identify spoken words accurately is reduced in difficult circumstances, for example when there is high background noise or when the speaker is under stress.

Accordingly, it is an object of the present invention to provide a speech recognition method that offers improved recognition of spoken sounds.

To this end, the invention provides a speech recognition method characterized by steps in which speech signals representing a series of known words or terms are delivered to a neural network; the characteristics of each word or term that distinguish it from the others are identified in the neural network; information relating to these distinctive characteristics is delivered to a memory, together with information identifying the word or term with which the characteristics are associated, so as to store a reference vocabulary; and speech signals representing a word or term to be recognized are subsequently compared with distinctive characteristics drawn from the vocabulary contained in said memory, so as to identify that word or term.

Preferably, the method comprises steps in which each known word or term is spoken several times to form specimens, and the specimens of each word are time-aligned to produce the speech signals that are delivered to the neural network. Said distinctive characteristics of each word or term may be, for example, spectral characteristics or linear prediction coefficients.

Preferably, the comparison between the speech signals relating to a word or term to be recognized and the distinctive characteristics drawn from the reference vocabulary is carried out by means of a technique using a hidden semi-Markov model.
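As an illustration of what a hidden semi-Markov comparison involves, the following minimal sketch scores a sequence of frames against a left-to-right model in which each state carries an explicit duration distribution. The patent does not disclose its model parameters or implementation; the names `hsmm_log_score` and `symbol_emitter`, the use of discrete symbols as frames, and the duration tables are all assumptions made for the example.

```python
import math

NEG_INF = float("-inf")

def symbol_emitter(symbol, p_match=0.9):
    """Hypothetical emission model: log-probability of a frame given a state."""
    return lambda frame: math.log(p_match if frame == symbol else 1.0 - p_match)

def hsmm_log_score(frames, states):
    """Best-path log score of a left-to-right explicit-duration model.

    states is a list of (emit_logprob, durations) pairs, where durations
    maps a duration in frames to its log-probability. Every state must be
    visited once, in order, and the states must consume all frames.
    """
    total = len(frames)
    # best[j][t]: best score with the first j states consuming frames[:t]
    best = [[NEG_INF] * (total + 1) for _ in range(len(states) + 1)]
    best[0][0] = 0.0
    for j, (emit, durations) in enumerate(states):
        for t in range(total + 1):
            if best[j][t] == NEG_INF:
                continue
            acc = 0.0  # summed emission log-probabilities for this segment
            for d in range(1, total - t + 1):
                acc += emit(frames[t + d - 1])
                if d in durations:
                    cand = best[j][t] + durations[d] + acc
                    if cand > best[j + 1][t + d]:
                        best[j + 1][t + d] = cand
    return best[len(states)][total]
```

In a recognizer, one such model per vocabulary word would be scored against the incoming frames and the highest-scoring word reported. The explicit duration table is what distinguishes a semi-Markov model from an ordinary hidden Markov model, where state durations are implicitly geometric.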

The reference vocabulary contained in the memory may comprise dynamic time warping templates of the distinctive characteristics. Preferably, a syntax restriction is applied to the reference vocabulary according to the syntax of previously identified words.

A speech recognition apparatus and its method of operation will now be described, by way of example, with reference to the accompanying drawings, in which Fig. 1 is a block diagram of the apparatus, Fig. 2 shows steps of the method, and Fig. 3 illustrates one of the steps of the method.

The speech recognition apparatus is indicated generally by the reference numeral 1. It receives input speech signals from a microphone 2, which may, for example, be mounted in the breathing mask of an airplane or helicopter pilot. Output signals representing identified words are supplied by the apparatus 1 to a feedback device 3 and to a utilization device 4. The feedback device 3 may be a visual display or an audible device, arranged to inform the speaker of the words that have been identified by the apparatus 1. The utilization device 4 may be arranged to control a function of the aircraft equipment in response to a spoken command that the utilization device recognizes in the output signals of the apparatus.

Signals from the microphone 2 are supplied to a preamplifier 10, which includes a pre-emphasis stage 11 that produces a flat long-term average speech spectrum, so as to ensure that all the frequency channel outputs occupy a similar dynamic range, the characteristic being nominally flat up to 1 kHz. A switch 12 can be set to give either a 3 dB/octave or a 6 dB/octave lift at higher frequencies. The preamplifier 10 also includes an anti-aliasing filter 21 in the form of an eighth-order Butterworth low-pass filter with its -3 dB cut-off frequency set at 4 kHz.

The output signal of the preamplifier 10 is passed, via an analog-to-digital converter 13, to a digital filter bank 14. The filter bank 14 has nineteen channels implemented in assembly software on a TMS32010 microprocessor and is based on the "JSRU Channel Vocoder" described by Holmes, J.N., in IEE Proc., vol. 127, Pt. F, No. 1, Feb. 1980. The filter bank 14 has unequal channel widths, corresponding approximately to the critical bands of auditory perception in the range 250 to 4000 Hz. The responses of adjacent channels cross at approximately 3 dB below their peak. At the center of a channel, the attenuation of a neighboring channel is about 1 dB.
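To give a feel for what an eighth-order Butterworth low-pass filter with a 4 kHz cut-off does, the sketch below evaluates the textbook magnitude response of an ideal Butterworth filter, |H(f)|^2 = 1 / (1 + (f/fc)^(2n)). It is not taken from the patent, which gives no circuit details; the function name `butterworth_gain_db` and its defaults are chosen for the example.

```python
import math

def butterworth_gain_db(f_hz, cutoff_hz=4000.0, order=8):
    """Gain in dB of an ideal Butterworth low-pass filter at frequency f_hz."""
    ratio = f_hz / cutoff_hz
    return -10.0 * math.log10(1.0 + ratio ** (2 * order))

if __name__ == "__main__":
    # Print the response at a few frequencies around the cut-off.
    for f in (1000, 2000, 4000, 6000, 8000):
        print(f"{f:5d} Hz: {butterworth_gain_db(f):8.2f} dB")
```

The response is essentially flat well below the cut-off, is 3 dB down at 4 kHz, and then falls at roughly 48 dB per octave (6 dB per octave per filter order), so the filter leaves the 250 to 4000 Hz analysis band largely intact while strongly attenuating components an octave above it.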

Signals from the filter bank 14 are supplied to an integration and noise recognition unit 15, which implements a noise recognition algorithm of the kind described by J.S. Bridle et al. in "A noise compensating spectrum distance measure applied to automatic speech recognition", Proc. Inst. Acoust., Windermere, Nov. 1984. Adaptive noise cancellation techniques may be implemented by the unit 15 to reduce periodic noise, which can be used, for example, to reduce the periodic noise of a helicopter.

The output of the unit 15 is passed to a pattern matching unit 16, which carries out the various pattern matching algorithms.

The unit 16 is connected to a vocabulary memory 17, which contains Markov models based on distinctive characteristics of each word or term in the vocabulary concerned.

These distinctive characteristics are entered into the vocabulary in the manner illustrated in Figures 2 and 3.

In a first step 30, isolated examples of each of the words or terms to be entered in the reference vocabulary are recorded. This is repeated so that multiple specimens of each word or term are available. In the next step 31, the recorded specimens are time-aligned by dynamic programming so that they are adjusted to the average duration of the spoken specimens. This removes the temporal variations of natural speech, in which the same word may be spoken at different rates. The word taken as the average is the one that has the average duration, or it is selected using some other metric that places the word at the middle of the group of specimens. For example, if the reference vocabulary comprises the digits zero to nine, all the specimens of each digit will have the same duration after dynamic programming.
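The time alignment of step 31 can be sketched with classic dynamic time warping. This is a minimal illustration under stated assumptions, not the patented implementation: each frame is a single number rather than a nineteen-channel spectral vector, the reference specimen is assumed to have been chosen already, and the names `dtw_path` and `time_adjust` are hypothetical.

```python
def dtw_path(ref, spec, dist):
    """Return the minimum-cost alignment path between ref and spec.

    The path is a list of (ref_index, spec_index) pairs, monotone in both
    indices, found by dynamic programming.
    """
    n, m = len(ref), len(spec)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], spec[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    # Backtrack from the end of both sequences to their start.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = []
        if i > 1 and j > 1:
            steps.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 1:
            steps.append((cost[i - 1][j], i - 1, j))
        if j > 1:
            steps.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    path.reverse()
    return path

def time_adjust(ref, spec, dist=lambda a, b: abs(a - b)):
    """Warp spec onto the time axis of ref, so the result has len(ref) frames."""
    buckets = [[] for _ in ref]
    for i, j in dtw_path(ref, spec, dist):
        buckets[i].append(spec[j])
    # Average the specimen frames aligned with each reference frame.
    return [sum(b) / len(b) for b in buckets]
```

For example, `time_adjust([0, 1, 2, 1, 0], [0, 0, 1, 1, 2, 2, 1, 1, 0, 0])` maps a specimen spoken at half speed back onto the five-frame reference, so that every specimen of a word ends up with the same number of frames before it is presented to the neural network.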

In a third step 32, the time-aligned words are delivered to a neural network 20. The neural network may have a single-layer or multilayer structure and may use any known learning strategy based on back-propagation of errors. The neural network is arranged to learn the distinctive spectral characteristics of the vocabulary, that is, the characteristics of each word that distinguish it from the other words in the vocabulary. An example is shown in Fig. 3 for the spoken English word "one", the sound frequency F being represented as a spectrum in successive slices of time T. The left-hand part of the figure shows the spectral/temporal analysis of the word "one". The right-hand part of the figure shows the spectral/temporal characteristics that distinguish the word "one" from the other digits "zero", "two", "three", and so on.

In a fourth step 33, these distinctive characteristics are applied to a known algorithm that removes the influence of the temporal variations of natural speech. In this example, a hidden semi-Markov model (HSMM) is used. The distinctive characteristics identified by the neural network are combined with the parameters of the HSMM for storage in the memory 17.
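How a network can pick out the characteristics that distinguish one word from the others can be illustrated with a toy single-layer network trained by gradient descent on the output error (the single-layer case of error back-propagation). Everything below is an assumption made for illustration: the four-element binary "spectral" patterns, the softmax output, the learning rate, and the function names are not taken from the patent.

```python
import math

# Toy patterns: feature 0 is shared by all words, features 1 to 3 are each
# present in only one word (hypothetical data, not real spectra).
PATTERNS = [[1, 1, 0, 0],   # "one"
            [1, 0, 1, 0],   # "two"
            [1, 0, 0, 1]]   # "three"
LABELS = [0, 1, 2]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_single_layer(patterns, labels, n_classes, epochs=200, lr=0.5):
    """Batch gradient descent on the cross-entropy error of a softmax layer."""
    n_feat = len(patterns[0])
    w = [[0.0] * n_feat for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * n_feat for _ in range(n_classes)]
        for x, y in zip(patterns, labels):
            p = softmax([sum(wi * xi for wi, xi in zip(wk, x)) for wk in w])
            for k in range(n_classes):
                err = p[k] - (1.0 if k == y else 0.0)  # output error
                for i in range(n_feat):
                    grad[k][i] += err * x[i]
        for k in range(n_classes):
            for i in range(n_feat):
                w[k][i] -= lr * grad[k][i]
    return w

def classify(w, x):
    scores = [sum(wi * xi for wi, xi in zip(wk, x)) for wk in w]
    return scores.index(max(scores))

def most_distinctive(w, k):
    """Index of the feature with the largest weight for class k."""
    return max(range(len(w[k])), key=lambda i: w[k][i])
```

After training, the largest weight for each word sits on the feature that only that word has, while the shared feature receives little weight. This mirrors the right-hand part of Fig. 3, where only the parts of the spectrum that set "one" apart from the other digits are retained.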

In this way, the memory 17 contains a model of each word or term in the reference vocabulary, and this model takes into account how easily the word could be confused with other words in the vocabulary. The result is an improved enrolment procedure for subsequent pattern matching.

The distinctive characteristics used to identify each word need not be spectral characteristics; they could be linear prediction coefficients or other characteristics of the speech signal.

The word models contained in the memory may be dynamic time warping templates, so as to take account of temporal variability and of the metric considered by the neural network 20 over the whole word. A syntax unit 18, interposed between the vocabulary memory 17 and the pattern matching unit 16, may be used to apply a syntax restriction, in a known manner, to the stored vocabulary used for the comparison, according to the syntax of previously identified words.
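The effect of a syntax restriction can be sketched as a lookup that narrows the active vocabulary according to the previously identified word. The grammar table, the command words, and the function names below are hypothetical; the patent only states that the restriction is applied in a known manner.

```python
# Hypothetical grammar: previous word -> words allowed to follow it.
# None stands for the start of an utterance.
GRAMMAR = {
    None: {"select", "radio"},
    "select": {"radio", "channel"},
    "channel": {"one", "two", "three"},
}

def restrict_vocabulary(scores, previous_word, grammar):
    """Keep only the candidate words that the grammar allows next."""
    allowed = grammar.get(previous_word)
    if allowed is None:
        return dict(scores)  # no rule for this context: keep everything
    return {word: s for word, s in scores.items() if word in allowed}

def recognize(scores, previous_word, grammar):
    """Pick the best-scoring word among those the syntax permits."""
    active = restrict_vocabulary(scores, previous_word, grammar)
    return max(active, key=active.get)
```

With match scores of `{"one": -10.0, "radio": -9.0, "two": -12.0}` and "channel" as the previous word, `recognize` returns "one" even though "radio" has the higher score, because "radio" cannot follow "channel" in this grammar. Restricting the comparison in this way reduces both the matching effort and the chance of confusing similar-sounding words.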

The method according to the invention allows continuous speech to be recognized by means of an enrolment process in a neural network, with the improved performance that such a network can provide, but without requiring excessive information processing capacity.


Claims


1. A speech recognition method, characterized by steps in which speech signals representing a series of known words or terms are delivered to a neural network (20), the characteristics of each word or term that distinguish it from the others of these words or terms are identified in the neural network (20), information relating to these distinctive characteristics is delivered to a memory (17) together with information identifying the word or term with which these characteristics are associated, so as to store a reference vocabulary, and speech signals representing a word or term to be recognized are subsequently compared with distinctive characteristics drawn from the vocabulary contained in said memory (17), so as to identify that word or term.

2. Method according to claim 1, characterized in that each known word or term is spoken several times to form specimens, and in that the specimens of each word are time-aligned to produce the speech signals that are delivered to the neural network.

3. Method according to claim 1 or 2, characterized in that said distinctive characteristics of each word or term are spectral parameters.

4. Method according to claim 1 or 2, characterized in that said distinctive characteristics are linear prediction coefficients.

5. Method according to claim 1, characterized in that the comparison between the speech signals relating to a word or term to be recognized and the distinctive characteristics drawn from the reference vocabulary is carried out by means of a technique using a hidden semi-Markov model.

6. Method according to claim 1, characterized in that the reference vocabulary contained in the memory (17) comprises dynamic time warping templates of the distinctive characteristics.

7. Method according to claim 1, characterized in that a syntax restriction is applied to the reference vocabulary according to the syntax of previously identified words.