EP0456742A1

EP0456742A1 - Speech processing machine

Info

Publication number: EP0456742A1
Application number: EP90903181A
Authority: EP
Inventors: Jean-Louis RIPOLL
Original assignee: ALCEPT
Current assignee: ALCEPT
Priority date: 1989-02-07
Filing date: 1990-02-06
Publication date: 1991-11-21
Also published as: FR2642882B1; WO1990009656A1; FR2642882A1

Abstract

The invention relates to the analysis and synthesis of speech, and more generally to the coding and decoding of speech. Because multi-speaker speech recognition is very difficult, owing to differences in the way different speakers pronounce the same phonemes, the invention proposes a recognition system using portable cards, in particular smart cards, in which characteristic parameters of the voice of the cardholder are recorded. These parameters are read by a reader (16) and transmitted to a speech recognition apparatus (10), which adapts its algorithms or processing circuits according to the content of the card in order to optimize recognition for a given speaker. The recognition apparatus (10) can then control a machine (12) with maximum reliability, on the basis of a speech signal transmitted by a microphone (14).

Description

SPEECH PROCESSING APPARATUS

The invention relates to the analysis and synthesis of speech, and more generally to the coding and decoding of speech.

Applications in which human voice signals are processed electronically are becoming more and more numerous. First of all, there is speech recognition and synthesis, intended to facilitate human-machine communication, which until now has taken place mainly through an input keyboard and a display screen, or through control buttons and joysticks. There is also speech recognition for the purpose of identifying a person by his or her vocal characteristics. And there are applications in which the processing is used to compress information transmitted orally, in order to transmit it at a higher speed or with a lower bandwidth, etc.

But speech processing is a very difficult operation, because of the complexity of the physiological mechanisms by which speech is produced and by which it is heard and understood.

The medium for transmitting information is an acoustic vibration of the air. This vibration consists of a succession of acoustic waves of complex shape. When these waveforms are recorded, it turns out to be practically impossible, by simple visual observation, to link a given part of the recording to the sound that was pronounced.

As a result, it is very difficult to design electronic circuits or data processing programs able to recognize anything other than very simple isolated sounds. Speech synthesis raises difficult problems too, if the aim is to reproduce sounds that closely resemble human speech.

To give a more precise idea of the difficulties encountered, we will recall below some notions relating to the analysis, recognition and synthesis of speech.

Speech sounds can be emitted in several ways. First, there is a distinction between voiced sounds and unvoiced sounds. Voiced sounds originate in a vibration of the vocal cords and are modulated by the pharynx and the oral cavity (in particular by the tongue and the lips); some sounds also use the nasal cavity. Unvoiced sounds do not originate in the vocal cords; they are produced directly inside the oral cavity.

In addition, among both voiced and unvoiced sounds, one can distinguish between sounds produced by air turbulence (in a narrow opening) and sounds that correspond rather to a regular flow. Consonants are generally produced by turbulence. Vowels correspond rather to regular flows.

The fricative consonants (s, f, z, v) are produced by a flow of air through the narrow gap between the teeth (s, z) or between the lips (f, v). The consonants s and f are unvoiced, whereas the consonants z and v are voiced.

Plosive consonants involve a complete occlusion of the vocal tract at some point, followed by an abrupt release of the pressure accumulated in the tract. The point of closure determines the sound produced. This sound can, again, be voiced or unvoiced. The consonants p (unvoiced) and b (voiced) correspond to a closure of the lips; t (unvoiced) and d (voiced) correspond to an occlusion by the tongue at the front of the palate. The consonants k (unvoiced) and g (voiced) correspond to an occlusion by the tongue towards the back of the palate.

We can thus describe how most of the phonemes of a given language are produced. The phoneme is the smallest sound element that makes it possible to distinguish one word from another or, more precisely, to change its meaning. There are only a few dozen different phonemes in a given language. The French language is considered to have about forty.

But this is a theoretical figure. In practice, phonemes are pronounced differently depending on the phonemes that precede or follow them. This is the phenomenon of coarticulation between phonemes, which seriously complicates recognition and synthesis because it multiplies by 4 or 5 the number of phonemes actually emitted. It is moreover often easier to base speech recognition or synthesis not on phonemes but either on "diphonemes", which are pairs of associated phonemes including the transition between them, or on "diphones", which are sound segments starting in the middle of one phoneme and ending in the middle of the next (thus including the transition between two phonemes but not the whole of either phoneme).

The human ear distinguishes phonemes from one another very well, but the acoustic waveforms that distinguish them do not seem sufficiently characteristic for a machine to recognize them easily, especially in continuous speech.

The acoustic waves corresponding to vowels have a simpler and narrower frequency spectrum than consonants. The vowels actually represent rather a stable part of the vocal signal, while the consonants represent rather transitions. Plosives for example represent sudden transitions, with a very wide frequency spectrum during the transition.

This is why we have tried to propose methods of speech processing based essentially on the frequency analysis of acoustic signals.

Through these frequency analyses we can better discern parameters corresponding to the different phonemes or diphones emitted.

For example, one method of frequency analysis that has already proven effective in both speech recognition and speech synthesis is the formant method. We recall in a few paragraphs what formants are, for a better understanding of the invention, although the invention is not limited to systems using formant-based analysis or synthesis.

The formants are the frequencies corresponding to energy peaks of the voice signal: the frequency spectrum resulting from the analysis of the acoustic signal corresponding to a vowel clearly shows troughs and peaks. The peaks are the formants, and several successive formants can generally be distinguished in the spectrum corresponding to a given phoneme.

The formants are identified by their position in the frequency spectrum. We speak of the first formant for the lowest frequency peak, of the second formant for the next peak, etc.

These peaks physically correspond to resonances of the oral cavity, and human speech consists precisely in modulating the shape of the oral cavity so as to modify the different resonance frequencies of this cavity.

There is a direct link between the pronunciation of a phoneme and the shape of the vocal tract: the emission of the phoneme is linked to very precise positions of the various mobile elements of the oral cavity (position of the lips, of the tongue, of the soft palate, etc.); and there is a link between the formant frequencies and the shape of the vocal tract. There is therefore also a direct link between an emitted phoneme and the formant frequencies detected in the frequency spectrum of the acoustic signal corresponding to this phoneme.

Analysis and synthesis by formants are based on this notion. The presence of certain formants is entirely characteristic of the emission of a particular phoneme. For vowels, whose frequency spectrum is relatively stable, a given vowel can be characterized very well by the position (on the frequency axis) of the first three formants, i.e. the first three peaks of the spectrum of the corresponding acoustic signal.

As an indication, we can give the following example: the vowel A is an acoustic signal whose first formant lies between 500 and 800 hertz, whose second formant lies between 1000 and 1600 hertz but is separated from the first by no more than 600 to 900 hertz, and whose third formant lies between 2300 and 3200 hertz.

Another example: the vowel I has a first formant between 200 and 400 hertz and a second formant between 2100 and 2400 hertz, spaced at least 2000 hertz from the first. The third formant is at an even higher frequency.

With a mathematical vector made up of three numbers, namely the frequencies of the first three formants, one can characterize all the vowels and certain consonants fairly well. For other consonants the use of formants is more difficult, but other methods can be used, in particular an evaluation of the direction and speed of variation of the formant frequencies in diphones containing a transition through a consonant.
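
As a minimal sketch of how such a three-number vector could be tested against the two vowel examples given above, one might write the following. The table layout and the function names are hypothetical, and the reading of the spacing rules (second formant at most 900 hertz above the first for A, at least 2000 hertz above the first for I) is an assumption made for illustration.

```python
from typing import Optional

# Formant zones in hertz, taken from the two examples in the text.
# "max_gap" / "min_gap" bound the spacing F2 - F1; this reading of the
# spacing rules is an assumption.
VOWEL_ZONES = {
    "A": {"f1": (500, 800), "f2": (1000, 1600), "f3": (2300, 3200),
          "max_gap": 900},
    "I": {"f1": (200, 400), "f2": (2100, 2400), "min_gap": 2000},
}


def in_range(value, bounds):
    """True if value lies inside the closed interval bounds = (low, high)."""
    low, high = bounds
    return low <= value <= high


def classify_vowel(f1, f2, f3) -> Optional[str]:
    """Return the first vowel whose zone contains (f1, f2, f3), else None."""
    for vowel, zone in VOWEL_ZONES.items():
        if not in_range(f1, zone["f1"]) or not in_range(f2, zone["f2"]):
            continue
        gap = f2 - f1
        if "max_gap" in zone and gap > zone["max_gap"]:
            continue
        if "min_gap" in zone and gap < zone["min_gap"]:
            continue
        if "f3" in zone:
            if not in_range(f3, zone["f3"]):
                continue
        elif f3 <= f2:
            # The text only says the third formant of I is higher still.
            continue
        return vowel
    return None
```

A vector that falls in no zone returns None, which is the situation a speaker-independent table runs into when a speaker's formants lie outside the average ranges.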

However, an additional problem arises from the diversity with which different people pronounce the same phonemes. The human ear automatically recovers the meaning of a phoneme even when it is pronounced by several different people. But a speech recognition machine confronted with several formant vectors will have great difficulty recognizing them as representing a single phoneme if the vectors differ appreciably because they come from different people. This is all the more true since machines have already been envisaged that identify people by voice recognition, which shows that there can be very significant differences in the way different people utter the same phonemes.

As an example, FIG. 1 represents a schematic table of the pronunciation zones of different vowels. The letters in square brackets represent usual phonemes of French, according to the phonetic code of the Association Internationale de Phonétique. The table is a frequency diagram representing the zones of values of the first formant (on the ordinate) and of the second formant (on the abscissa). We see in particular that certain zones overlap, which means that the same sound emitted by two different people can correspond to two phonemes of different meaning. More generally, the zones are close enough to one another for it to be difficult for a machine to recognize the phonemes present in human speech.

The speech recognition machines proposed so far are usually capable of recognizing only a small number of isolated words, spoken by a well-defined speaker who has himself recorded in the machine the words to be recognized.

It has been proposed to make these machines capable of recognizing the same words spoken by several different speakers. But then the passage from one speaker to another first requires a learning phase: the second speaker must pronounce in front of the machine the succession of words which it must be able to recognize, so that the machine stores in memory how these words are pronounced and can then recognize them. This learning phase is very burdensome, and all the more so as the number of words the machine must recognize grows. If it has to recognize 1000 words, they must all be pronounced; it may even be necessary to pronounce each of them several times to establish an average pronunciation (because the pronunciation of a word by a given person is not fixed and unchanging). During the learning phase, the machine is unavailable for its recognition function, and the operator must also set aside time for the operation. Yet this operation is a priori essential, because the probability that the machine will reliably recognize words spoken by a speaker other than the one who recorded the reference words is very low. Needless to say, if the machine is intended for use by the public in a public place, there is no question of carrying out a learning phase for each user who comes before the machine. One can think, for example, of a telephone booth in which the called number is dialed orally. For such machines, we are currently obliged to limit the number of words to be recognized as much as possible, in order to increase the certainty of recognizing the spoken word regardless of who pronounces it.

An object of the present invention is, among other things, to propose a simple means that makes it easier for several different speakers to use a recognition machine, without excessively reducing the capabilities of the machine.

Another object of the invention is to propose a simple means of improving speech synthesis by matching the synthesized voice as closely as possible to the voice of a well-defined speaker, so that, for example, if the voice of a speaker is coded, transmitted over a telephone line, and then re-synthesized before being delivered to a listener, the synthesized voice comes as close as possible to the voice of the original speaker.

To achieve these aims, the present invention provides a speech processing system comprising a speech coding or decoding apparatus suitable for multi-speaker coding or decoding, characterized in that specific parameters of a given speaker are contained in a personal portable card that the speaker keeps with him, the system comprising a card reader adapted to read the content of the card and to communicate this content to the coding or decoding apparatus, in order to adapt it instantly, without a learning phase, to this speaker.

It will be understood that with this system it becomes possible to install, in public places, complex machines using speech recognition or synthesis, and that anyone holding a personal card containing his or her own voice parameters will be able to communicate with such a machine, or through it, when this would not otherwise be possible.

The card could contain, in the form of coded data, a pronunciation of a certain number of words by the card holder (for example, as many words as the machine must be able to recognize or synthesize). But it is more advantageous for the card to contain voice parameters that are independent of the words to be recognized or synthesized, because this widens the possibilities of recognition or synthesis.

The parameters recorded in the card can then be coded electrical signals representing the temporal or spectral waveforms of phonemes or diphonemes or diphones as uttered by the cardholder. However, it will be preferable to use as parameters vectors corresponding to these phonemes or diphonemes or diphones, for example vectors of three or four formants; each such vector will therefore include three or four frequency values (or, more probably, three or four frequency ranges) representing a given phoneme or diphoneme or diphone. These vectors are stored in the card and transferred to the machine at the time of use, replacing the vectors that the machine may have received previously during use by another speaker with another personal card.

It will be understood that although formants seem to be the most convenient vectors for representing vowels, other parameters exist and can be stored for other phonemes, diphonemes or diphones. In particular, consonants, or diphones including consonants, are expressed more easily by parameters relating to the way in which the formants vary: a more or less rapid fall of the first formant and, simultaneously, a more or less rapid rise of the second, etc.
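
To illustrate what a parameter "relating to the way in which the formants vary" might look like, here is a minimal sketch that reduces two formant tracks to a direction and an average rate of change. The function names, the 10 ms frame spacing and the dictionary layout are hypothetical; the text does not specify how such variations would be measured or coded.

```python
def formant_slope(track_hz, frame_ms):
    """Average rate of change of a formant track, in hertz per millisecond.

    track_hz holds one formant frequency per analysis frame; frame_ms is
    the assumed spacing between frames.
    """
    if len(track_hz) < 2:
        raise ValueError("need at least two frames")
    duration_ms = frame_ms * (len(track_hz) - 1)
    return (track_hz[-1] - track_hz[0]) / duration_ms


def direction(slope):
    """Name the direction of variation for a given slope."""
    if slope < 0:
        return "falling"
    if slope > 0:
        return "rising"
    return "flat"


def transition_descriptor(f1_track, f2_track, frame_ms=10.0):
    """Summarize a transition by the direction and speed of F1 and F2."""
    s1 = formant_slope(f1_track, frame_ms)
    s2 = formant_slope(f2_track, frame_ms)
    return {
        "f1": (direction(s1), abs(s1)),
        "f2": (direction(s2), abs(s2)),
    }
```

With a first formant falling from 700 to 400 hertz and a second formant rising from 1200 to 1800 hertz over four frames, the descriptor records a fall of 10 hertz per millisecond and a rise of 20 hertz per millisecond.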

Coefficients of sampled transfer functions (z-transfer function) could also be stored as voice parameters in a portable personal card.
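
As a hedged illustration of how coefficients of a z-transfer function relate to resonance frequencies, consider a single second-order section H(z) = 1 / (1 + a1 z^-1 + a2 z^-2) with a complex-conjugate pole pair. The choice of a second-order all-pole section, the function names and the 8000 hertz sampling rate used in the usage note are assumptions; the text does not specify the form of the transfer functions stored in the card.

```python
import math


def resonance_from_section(a1, a2, sample_rate_hz):
    """Pole frequency (Hz) and pole radius of 1 / (1 + a1 z^-1 + a2 z^-2).

    For poles r * exp(+/- j*theta), a1 = -2 r cos(theta) and a2 = r**2.
    """
    if a2 <= 0 or a1 * a1 >= 4 * a2:
        raise ValueError("section has no complex-conjugate pole pair")
    radius = math.sqrt(a2)
    theta = math.acos(-a1 / (2 * radius))
    return theta * sample_rate_hz / (2 * math.pi), radius


def section_from_resonance(freq_hz, radius, sample_rate_hz):
    """Inverse mapping: coefficients (a1, a2) for a given pole pair."""
    theta = 2 * math.pi * freq_hz / sample_rate_hz
    return -2 * radius * math.cos(theta), radius * radius
```

For example, section_from_resonance(700, 0.95, 8000) yields two coefficients from which resonance_from_section recovers the 700 hertz pole frequency, so a card could in principle store either representation.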

The card could be a magnetic stripe card or an optical stripe card; but it will preferably be a smart card incorporating an integrated circuit chip with, in particular, a non-volatile memory containing the personal voice parameters. The card can also be another portable information medium, for example: a magnetic card with high storage density, whose magnetic surface covers all or almost all of one face; an EPROM, EEPROM or RAM non-volatile storage memory housed in a very compact and easily transportable case; a chip key not specifically in the form of a flat card; etc.

Other characteristics and advantages of the invention will appear on reading the following description, which is given with reference to the appended drawings, in which:

- FIG. 1, already described, represents a diagram of the positions of various phonemes in formant space (first two formants);

- FIG. 2 schematically represents an application of the invention to the voice control of a machine;

- FIG. 3 schematically represents an application of the invention to telephone communications.

A first application of the invention is speech recognition, as can be used for example for controlling a robot, an industrial machine, a vehicle, etc., or, in a more sophisticated application, for a dictation machine or a translator.

FIG. 2 schematically shows this application in the case of controlling a robot. A recognition device 10 is connected to an industrial robot 12 to supply it with control commands such as start, stop, rotate, etc. The recognition device is coupled to a microphone 14 so that commands can be given orally in the form of simple words such as "on", "stop", "right", "left", etc. The device is also coupled to a chip card reader 16 into which a chip card 18 can be inserted; the card contains, in non-volatile memory (EPROM or EEPROM), personalized data relating to the voice of the speaker who holds the card.

During operation, the card data is first loaded into the recognition device; this data is used to modify either the configuration of electronic circuits in the device or the recognition algorithms used in the device. The modified configurations or algorithms are such that the device is then optimally adapted to recognizing the words or sentences spoken by the speaker holding the card. For example, the modifications of an algorithm can consist of modifications of the mean values and limit values of the formant frequencies for each phoneme or diphoneme or diphone likely to be pronounced, or of modifications of polynomial coefficients in calculation algorithms based on the z-transform of the sampled acoustic signals. Modifications of electronic circuit configurations could, for example, consist of modifications of capacitance values (by operating switches) in the switched-capacitor filters used to determine formant frequencies.

Depending on the sophistication of the recognition device 10, it will be possible to recognize more or less complex words or sentences. If the device 10 performs very well (and its performance with multiple speakers will be considerably improved by the invention), the controlled machine 12 can be envisaged as a word processor, or even an automatic speech translation machine. This of course presupposes that the recognition device is capable of recognizing not only individual words but continuous sentences.
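
The loading step described above can be sketched as follows, under stated assumptions: the recognizer is reduced to a table of first and second formant limits per phoneme, and loading a card replaces that table outright. The class names, the method names and the limit values of the second card are hypothetical; only the limits of the first card for the vowel A come from the examples given earlier.

```python
from dataclasses import dataclass, field


@dataclass
class FormantLimits:
    """Limit values (Hz) of the first two formants for one phoneme."""
    f1: tuple
    f2: tuple


@dataclass
class Recognizer:
    """Hypothetical recognizer whose only state is a table of limits."""
    limits: dict = field(default_factory=dict)

    def load_card(self, card_limits):
        # Replace the table left by a previous speaker instead of merging.
        self.limits = dict(card_limits)

    def recognize(self, f1, f2):
        """Return the first phoneme whose limits contain (f1, f2)."""
        for phoneme, lim in self.limits.items():
            if lim.f1[0] <= f1 <= lim.f1[1] and lim.f2[0] <= f2 <= lim.f2[1]:
                return phoneme
        return None


# Card contents: the first uses the ranges quoted for the vowel A,
# the second uses invented ranges for a speaker with higher formants.
CARD_SPEAKER_1 = {"a": FormantLimits(f1=(500, 800), f2=(1000, 1600))}
CARD_SPEAKER_2 = {"a": FormantLimits(f1=(650, 950), f2=(1200, 1800))}
```

With the first card loaded, a measurement of (900, 1700) hertz is rejected; after the second card is loaded, the same measurement is recognized as "a", with no learning phase in between.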

For the choice of parameters that can be written in the card to represent the voice of the card holder in a personalized way, we can generally use the theories of voice recognition and synthesis as they have been formulated so far. An indication of the mathematical methods for making these choices can be found in the treatise by René Boite and Murat Kunt, "Traitement de la parole" (Speech Processing), a supplement to the Traité d'Électricité published by the Presses Polytechniques Romandes, as well as in the works referenced in the bibliography of that treatise.

Another application of the invention is represented in FIG. 3. In this application, the aim is to code the speech signal sent on a telephone line, in order to compress the signal and thus limit the bit rate needed for a communication. To do this, the signal received by the microphone of the telephone handset is coded; the coding is a phonetic coding rather than a digital coding of the waveforms of the speech signal: the speech is coded by decomposing it into successive phonemes or diphones, which is therefore a speech recognition operation. Successive data vectors are then sent on the telephone line, each vector comprising several data items relating to the phoneme that has just been pronounced into the handset. On reception, the data vectors are converted back into phonemes, which is a speech synthesis operation. The compression achieved can be very substantial: we can consider limiting the amount of data needed to transmit a normal conversation to 2 kilobits per second. Since the number of phonemes emitted does not exceed ten per second, 200 bits are available to code each phoneme or diphone together with the prosody (that is to say, the melody generated by the variation of the fundamental frequency of the vocal cords during the sentence).

In this application, a first coder/decoder 20, interposed between a first telephone apparatus 22 and a digital telephone line 24, is used according to the invention. The function of this first coder/decoder is to code the speech transmitted and to decode the speech received. It is coupled to a first smart card reader 26 into which a card 28 can be inserted containing personalized data on the voice of the person telephoning. A second coder/decoder 30, similar to the first, is connected to the other end of the line 24, interposed between the line and a second telephone apparatus 32. The second coder/decoder is likewise coupled to a second card reader 36 into which a card 38 can be inserted containing personalized data relating to the voice of the correspondent at the other end of the line.
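
The bit budget quoted for this application can be checked with a few lines of arithmetic. The sketch below uses only figures given in the description (2 kilobits per second, at most ten phonemes per second, about forty French phonemes, multiplied by up to 5 through coarticulation); the helper name index_bits and the idea of coding a phoneme as a fixed-width index are assumptions made for illustration.

```python
LINE_RATE_BPS = 2000       # 2 kilobits per second, from the text
PHONEMES_PER_SECOND = 10   # upper bound quoted in the text

# Bits available for each phoneme or diphone, prosody included.
bits_per_phoneme = LINE_RATE_BPS // PHONEMES_PER_SECOND


def index_bits(symbol_count):
    """Smallest fixed width, in bits, able to number symbol_count symbols."""
    return max(1, (symbol_count - 1).bit_length())


# About forty phonemes in French; coarticulation multiplies this by up to 5.
bits_for_phonemes = index_bits(40)
bits_for_variants = index_bits(40 * 5)

# What remains of the budget for formant data and prosody.
remaining_bits = bits_per_phoneme - bits_for_variants
```

Under these assumptions an index takes 6 to 8 bits out of the 200 available, which leaves most of the budget for the other data in each vector.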

The coders/decoders, which are in fact complete speech recognition and synthesis devices, receive the data contained in the two cards, so that the coding part is adapted to recognizing the voice of the person located at the same end of the line as the coder/decoder, while the decoding part is adapted to synthesizing the voice of the person located at the other end of the line.

A data exchange protocol is therefore provided at the start of the telephone conversation to send the appropriate data to the coders/decoders. The conversation can then take place: one of the people speaks; his voice is converted into coded phonemes by the coder, which has been specially adapted to the speaker's voice; the coded data is sent over the line and received by the decoder at the other end. The decoder has also been adapted to the voice of the same speaker; it will therefore optimally synthesize the voice of this speaker before passing it to the earpiece of the telephone set. Similarly for the other speaker, coding and decoding are specially adapted to his voice, so that the correspondent at the other end of the line receives a voice synthesized in a personalized manner.
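
The start-of-call exchange can be sketched as follows. The text does not describe the protocol itself, so the class, its attributes and the function exchange_parameters are hypothetical; the sketch only captures the stated outcome, namely that each coding part uses the local card and each decoding part uses the card from the other end of the line.

```python
class CoderDecoder:
    """Hypothetical model of one coder/decoder (20 or 30 in FIG. 3)."""

    def __init__(self, label):
        self.label = label
        self.coding_params = None    # voice of the local speaker
        self.decoding_params = None  # voice of the remote speaker

    def read_card(self, card_params):
        # The locally inserted card adapts the coding (recognition) part.
        self.coding_params = card_params


def exchange_parameters(near, far):
    """Start-of-call exchange: each end sends its card data to the other."""
    if near.coding_params is None or far.coding_params is None:
        raise RuntimeError("both cards must be read before the call starts")
    # Each decoding (synthesis) part is adapted to the remote speaker.
    near.decoding_params = far.coding_params
    far.decoding_params = near.coding_params
```

After the exchange, speech coded at one end with the caller's parameters is decoded at the other end with those same parameters, which is what allows the synthesized voice to resemble the original speaker.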

In yet another application, the aim is to query a database over the telephone. The query is made by speech and not by means of a keyboard. An example is the reservation of air travel by telephone. As in the previous application, the user has a telephone apparatus with which a card reader is associated; the card contains the holder's voice parameters. The parameters can be used in two ways: on the one hand, they can be sent on the line as elements identifying an authorized holder, and if the parameters are not those of an authorized holder, access to the database is refused; on the other hand, once the voice parameters have been transmitted to the database, a speech recognition system uses them to adapt as well as possible to the voice of the person who is going to speak on the telephone line. The user can then speak; his voice is transmitted normally on the line (unlike in the previous application, where it is coded to reduce the bit rate); a speech analysis adapted to the speaker's voice is carried out at the other end of the line, so that the machine can determine the message transmitted and establish a human-machine dialogue via the telephone line.
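
The two uses of the transmitted parameters, identification and adaptation, can be sketched as below. How the database end would compare received parameters with those of authorized holders is not specified in the text; matching on a SHA-256 fingerprint of a JSON serialization is purely an assumption for illustration, as are the class and method names.

```python
import hashlib
import json


def fingerprint(params):
    """Digest of a canonical JSON serialization of the voice parameters."""
    canonical = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


class ReservationServer:
    """Hypothetical database end of the line."""

    def __init__(self, authorized_params):
        self._authorized = {fingerprint(p) for p in authorized_params}
        self.recognizer_params = None

    def open_session(self, params):
        # First use: the parameters identify an authorized holder.
        if fingerprint(params) not in self._authorized:
            return False
        # Second use: the same parameters adapt the speech recognizer.
        self.recognizer_params = params
        return True
```

A caller whose card parameters are unknown is refused before any speech is analyzed, while an authorized caller's parameters are kept to configure recognition for the rest of the call.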

In all applications, the personal voice parameters will preferably be written into the holder's card by a specialized machine whose main function is to determine and record these parameters. To this end, the card holder will have to pronounce in front of the machine a number of characteristic words, which are used to make this determination.

Claims

1. Speech processing system, comprising a speech coding or decoding apparatus suitable for multi-speaker coding or decoding, specific parameters of the voice of a given speaker being contained in a personal portable card which the speaker keeps with him, the system comprising a card reader adapted to read the content of the card and to communicate this content to the coding or decoding apparatus in order to adapt it instantly, without a learning phase, to this speaker, characterized in that it comprises a phonetic speech coding and decoding apparatus interposed between a telephone apparatus and a telephone line and capable of transmitting successively on the line data vectors corresponding to a succession of phonemes or diphonemes or diphones, and a card reader, the coding and decoding apparatus being able to adapt its coding function according to personal voice parameters contained in the card inserted in the reader, and the apparatus also being able to adapt its decoding function according to personal voice parameters received from the telephone line.

2. Speech processing system, comprising a speech coding or decoding apparatus suitable for multi-speaker coding or decoding, specific parameters of the voice of a given speaker being contained in a personal portable card which the speaker keeps with him, the system comprising a card reader adapted to read the content of the card and to communicate this content to the coding or decoding apparatus in order to adapt it instantly, without a learning phase, to this speaker, characterized in that it comprises a telephone apparatus coupled to a telephone line, a card reader associated with the apparatus, means for transmitting on the line the voice parameters contained in the card, and a speech recognition system at the other end of the line for receiving, on the one hand, said parameters from the line and, on the other hand, a speech signal from the telephone apparatus, the speech recognition system being able to adapt its operation according to the voice parameters received.

3. Speech processing system according to claim 1 or claim 2, characterized in that the specific parameters of the speaker include vectors of acoustic data corresponding to phonemes or diphonemes or diphones, as pronounced by the speaker holding the card.

4. Speech processing system according to claim 3, characterized in that each vector is constituted by a set of acoustic data, among which there are values of frequency of formants corresponding to a phoneme or diphoneme or diphone as pronounced by the speaker holding the card.

5. Speech processing system according to one of claims 1 to 4, characterized in that the specific parameters contained in the card comprise data relating to the frequency variations of formants corresponding to specific phonemes or diphonemes or diphones.

6. Speech processing system according to one of claims 1 to 5, characterized in that the parameters contained in the card include coefficients of sampled transfer functions (z transfer function) of acoustic signals corresponding to phonemes or diphonemes or diphones pronounced by the card holder.

7. Speech processing system according to one of claims 1 to 6, characterized in that the card is a magnetic stripe card, or an optical stripe card, or preferably a chip card incorporating an integrated circuit chip with in particular a non-volatile memory containing personal voice parameters.

8. Speech processing system according to one of claims 1 to 6, characterized in that the card is a magnetic card with high storage density, the magnetic surface of which covers all or almost all of one face, or an integrated circuit key not specifically shaped like a flat card.