WO1990009656A1 - Appareil de traitement de la parole - Google Patents
Appareil de traitement de la parole (Speech processing apparatus)
- Publication number
- WO1990009656A1 (PCT/FR1990/000091)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- card
- speech
- speaker
- parameters
- voice
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/0018—Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
-
- G—PHYSICS
- G07—CHECKING-DEVICES
- G07C—TIME OR ATTENDANCE REGISTERS; REGISTERING OR INDICATING THE WORKING OF MACHINES; GENERATING RANDOM NUMBERS; VOTING OR LOTTERY APPARATUS; ARRANGEMENTS, SYSTEMS OR APPARATUS FOR CHECKING NOT PROVIDED FOR ELSEWHERE
- G07C9/00—Individual registration on entry or exit
- G07C9/20—Individual registration on entry or exit involving the use of a pass
- G07C9/22—Individual registration on entry or exit involving the use of a pass in combination with an identity check of the pass holder
- G07C9/25—Individual registration on entry or exit involving the use of a pass in combination with an identity check of the pass holder using biometric data, e.g. fingerprints, iris scans or voice recognition
- G07C9/257—Individual registration on entry or exit involving the use of a pass in combination with an identity check of the pass holder using biometric data, e.g. fingerprints, iris scans or voice recognition electronically
Definitions
- The invention relates to speech analysis and synthesis, and more generally to speech coding and decoding.
- the medium for transmitting information is an acoustic vibration of the air.
- This vibration is constituted by a succession of acoustic waves of complex shapes.
- Language sounds can be emitted in several ways: first there is a distinction between voiced sounds and unvoiced sounds.
- Voiced sounds are produced by a vibration of the vocal cords and are modulated through the pharynx and the oral cavity (in particular by the tongue and the lips); some sounds also use the nasal cavity.
- Unvoiced sounds are not output from the vocal cords; they are produced directly inside the oral cavity.
- The fricative consonants (s, f, z, v) are produced by a flow of air through a narrow gap, between the teeth (s, z) or between the lips (f, v).
- The consonants s and f are unvoiced, whereas the consonants z and v are voiced.
- Plosive consonants involve a complete occlusion of the vocal tract at one point or another, followed by an abrupt release of pressure accumulated in the conduit. The closing point determines the sound produced. This sound can be, again, voiced or unvoiced.
- the consonants p (unvoiced) and b (voiced) correspond to a closure of the lips; t (unvoiced) and d (voiced) correspond to an occlusion by the tongue in the anterior part of the palate.
- The consonants k (unvoiced) and g (voiced) correspond to an occlusion by the tongue towards the back of the palate.
- The human ear distinguishes them very well from each other, but the acoustic waveforms that distinguish them do not seem to be sufficiently characteristic for a machine to recognize them easily, especially in continuous speech.
- the acoustic waves corresponding to vowels have a simpler and narrower frequency spectrum than consonants.
- the vowels actually represent rather a stable part of the vocal signal, while the consonants represent rather transitions.
- Plosives for example represent sudden transitions, with a very wide frequency spectrum during the transition.
- A method of frequency analysis that has already proven effective in both speech recognition and speech synthesis is the formant method.
- The formants are the frequencies corresponding to energy peaks of the voice signal: the frequency spectrum resulting from the analysis of the acoustic signal corresponding to a vowel clearly comprises troughs and peaks.
- The peaks are the formants, and several successive formants are generally distinguished in the spectrum corresponding to a given phoneme.
- The formants are identified by their position in the frequency spectrum: one speaks of the first formant for the lowest-frequency peak, of the second formant for the next peak, and so on.
- The emission of a phoneme is indeed linked to very precise positions of the various mobile elements of the oral cavity (position of the lips, of the tongue, of the soft palate, etc.); and there is a link between the formant frequencies and the shape of the vocal tract; it is therefore understood that there is also a direct link between an emitted phoneme and the formant frequencies detected in the frequency spectrum of the acoustic signal corresponding to this phoneme.
- The vowel A is an acoustic signal whose first formant is located between 500 and 800 hertz, whose second formant is located between 1000 and 1600 hertz but is separated from the first by no more than 600 to 900 hertz, and whose third formant is located between 2300 and 3200 hertz.
- The vowel I has a first formant between 200 and 400 hertz and a second formant located between 2100 and 2400 hertz, spaced at least 2000 hertz from the first.
- The third formant is at an even higher frequency.
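As an illustration only (not part of the patent), the formant ranges quoted above can be turned into a toy classifier; the ranges for A and I are the ones given in the text, while the function and table names are invented for this sketch.

```python
# Toy vowel classifier based on the formant ranges quoted above.
# The ranges come from the text; everything else is illustrative.
VOWEL_RANGES = {
    # vowel: ((F1 min, F1 max), (F2 min, F2 max)) in hertz
    "A": ((500, 800), (1000, 1600)),
    "I": ((200, 400), (2100, 2400)),
}

def classify_vowel(f1, f2):
    """Return the vowel whose first two formant ranges contain (f1, f2), or None."""
    for vowel, ((f1_lo, f1_hi), (f2_lo, f2_hi)) in VOWEL_RANGES.items():
        if f1_lo <= f1 <= f1_hi and f2_lo <= f2 <= f2_hi:
            return vowel
    return None

print(classify_vowel(650, 1300))  # a typical A
print(classify_vowel(300, 2200))  # a typical I
```

A real recognizer would of course use per-speaker ranges (this is exactly what the card described below provides) and would also consider the third formant.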
- FIG. 1 represents a schematic table of the pronunciation zones of different phonetic vowels.
- The letters in square brackets represent usual phonemes in French, according to the phonetic code of the International Phonetic Association.
- The table is a frequency diagram representing the ranges of values of the first formant (on the ordinate) and of the second formant (on the abscissa).
- Certain zones overlap, which means that the same sound emitted by two different people can correspond to two phonemes of different meaning.
- the zones are close enough to each other so that it can be difficult for a machine to recognize the phonemes present in human speech.
- The speech recognition machines proposed so far are usually capable of recognizing only a small number of isolated words, spoken by a well-defined speaker who has himself recorded the words to be recognized into the machine.
- During this learning phase, the machine is unavailable for its recognition function, and the operator must also set aside time for the operation. Yet the operation is a priori essential, because the probability that the machine will reliably recognize words spoken by a speaker other than the one who recorded the reference words is very low. Needless to say, if the machine is intended, for example, for use by the public in a public place, a learning phase for each user who comes before the machine is out of the question. One can think, for example, of a telephone booth in which the called number is dialed orally. For such machines, the number of words to be recognized must currently be limited as much as possible, in order to increase the certainty of recognizing the spoken word regardless of who pronounces it.
- the object of the present invention is, among other things, to propose a simple means making it easier to use a recognition machine by several different speakers, without excessively reducing the possibilities of the machine.
- Another object of the invention is to propose a simple means of improving speech synthesis by adapting the synthesized voice as closely as possible to the voice of a well-defined speaker, so that, for example, if a speaker's voice is coded, then transmitted over a telephone line, then re-synthesized before being delivered to a listener, the synthesized voice can come as close as possible to the voice of the initial speaker.
- The present invention provides a speech processing system comprising a speech coding or decoding apparatus suitable for multi-speaker coding or decoding, characterized in that specific parameters of a determined speaker are contained in a personal portable card that the speaker keeps with him, the system comprising a card reader adapted to read the content of the card and to communicate this content to the coding or decoding apparatus, so as to adapt it instantly, without a learning phase, to this speaker.
- The card could contain, in the form of coded data, a pronunciation of a certain number of words by the cardholder (as many words as the machine must be able to recognize or synthesize, for example). But it is more advantageous for the card to contain parameters of the voice independent of the words to be recognized or synthesized, because this widens the possibilities of recognition or synthesis.
- The parameters recorded in the card can then be coded electrical signals representing the temporal or spectral waveforms of phonemes, diphonemes or diphones pronounced by the cardholder.
- More simply, one can store vectors corresponding to these phonemes, diphonemes or diphones, for example vectors of three or four formants; each vector of three or four formants will therefore comprise three or four frequency values (or, more probably, three or four frequency ranges) representing a given phoneme, diphoneme or diphone.
- These vectors will be stored in the card and transferred to the machine at the time of use, replacing the vectors that the machine may previously have received from another speaker using another personal card.
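The replace-on-insertion behaviour described above can be sketched in a few lines; the `Recognizer` class, its method names and the example range values are all assumptions made for illustration, not taken from the patent.

```python
# Sketch of a recognizer whose phoneme -> formant-range table is replaced
# wholesale when a new speaker's card is read (all names are illustrative).
class Recognizer:
    def __init__(self):
        self.formant_vectors = {}  # phoneme -> tuple of (lo, hi) frequency ranges

    def load_card(self, card_vectors):
        """Replace any previous speaker's vectors with the new card's content."""
        self.formant_vectors = dict(card_vectors)

# Card of speaker 1: each phoneme maps to three formant ranges (hertz).
card_1 = {"A": ((500, 800), (1000, 1600), (2300, 3200))}
# Card of speaker 2, with slightly shifted ranges.
card_2 = {"A": ((550, 850), (1100, 1700), (2400, 3300))}

r = Recognizer()
r.load_card(card_1)
r.load_card(card_2)               # speaker 2 inserts their card
print(r.formant_vectors["A"][0])  # speaker 1's ranges are gone
```

The key property, stated in the text, is that no learning phase is needed: loading the card is an instantaneous table swap.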
- Consonants, or diphones including consonants, will be expressed more easily by parameters relating to the way in which the formants vary: a more or less rapid fall of the first formant and, simultaneously, a more or less rapid rise of the second, etc.
- Coefficients of sampled transfer functions could also be stored as voice parameters in a portable personal card.
- the card could be a magnetic stripe or optical strip card; but it will preferably be a smart card incorporating an integrated circuit chip with in particular a non-volatile memory containing the personal parameters of the voice.
- The card can also be another portable information medium, such as, for example: magnetic cards with high storage density, whose magnetic surface covers all or almost all of one of the faces; non-volatile EPROM, EEPROM or RAM storage memory housed in a very compact and easily transportable case; chip keys not necessarily in the form of a flat card; etc.
- FIG. 1, already described, represents a position diagram of various phonemes in the space of formants (first two formants);
- FIG. 2 schematically represents an application of the invention to the voice control of a machine;
- FIG. 3 schematically shows an application of the invention to telephone communications.
- a first application of the invention is speech recognition, such as it can be used for example for controlling a robot, an industrial machine, a vehicle, etc., or, in a more sophisticated application, for a dictation machine or a translator.
- FIG. 2 shows schematically this application in the case of controlling a robot.
- A recognition device 10 is connected to an industrial robot 12 to supply it with on, off, rotation and other control commands.
- The recognition apparatus is coupled to a microphone 14 so that control commands can be given orally in the form of simple words such as "on", "stop", "right", "left", etc.
- The apparatus is also coupled to a chip card reader 16 into which a chip card 18 can be inserted, which contains in non-volatile memory (EPROM or EEPROM) personalized data relating to the voice of the speaker who holds this card.
- the card data is first loaded into the recognition device; this data is used to modify either configurations of electronic circuits in the device, or recognition algorithms used in the device.
- the modified configurations or the modified algorithms are such that the device is then optimally adapted to the recognition of the words or sentences spoken by the speaker holding the card.
- The algorithm modifications can consist of modifications of the mean values and limit values of the formant frequencies for each phoneme, diphoneme or diphone likely to be pronounced; or of modifications of polynomial coefficients in calculation algorithms based on the z-transform of the sampled acoustic signals.
- Modifications of electronic circuit configurations could for example consist of modifications of capacitance values (by switching switches) in filters with switched capacitors used to determine formant frequencies.
- Depending on the sophistication of the recognition device 10, more or less complex words or sentences can be recognized. If the apparatus 10 is very efficient (and its performance vis-à-vis multiple speakers will be considerably improved by the invention), it can be envisaged that the controlled machine 12 is a word processor, or even an automatic translation machine. This of course presupposes that the recognition device is capable of recognizing not only individual words but continuous sentences.
- In the telephone application (FIG. 3), the signal received by the microphone of the telephone handset is coded; the coding is a phonetic coding instead of a digital coding of the waveforms of the speech signal: the speech is coded by decomposing it into successive phonemes or diphones; it is therefore a speech recognition operation. Successive data vectors are then sent over the telephone line, each vector comprising several data relating to the phoneme which has just been pronounced into the handset. Upon reception, the data vectors are reconverted into phonemes; this is a speech synthesis operation.
- The compression achieved can be very substantial: the amount of data necessary to transmit a normal conversation can be limited to about 2 kilobits per second. Indeed, the number of phonemes emitted does not exceed about ten per second.
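The 2 kbit/s figure can be sanity-checked with simple arithmetic; the patent only states the rate of about ten phonemes per second, so the per-vector bit budget below is an assumption chosen to match the stated total.

```python
# Back-of-the-envelope check of the 2 kilobits-per-second figure.
# The phoneme rate is from the text; bits_per_vector is an assumption.
phonemes_per_second = 10
bits_per_vector = 200   # assumed size of one phoneme descriptor vector

bit_rate = phonemes_per_second * bits_per_vector
print(bit_rate)  # 2000 bits/s, i.e. 2 kilobits per second
```

By comparison, 64 kbit/s PCM telephony of the same era needed over thirty times as much data, which is why the text calls the compression very substantial.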
- a first coder / decoder 20 interposed between a first telephone apparatus 22 and a digital telephone line 24 will be used according to the invention.
- the function of this first coder is to encode the speech transmitted and to decode the speech received. It is coupled with a first reader of smart cards 26 into which a card 28 can be inserted containing personalized data on the voice of the person who telephones.
- A second coder/decoder 30, similar to the first, is connected to the other end of the line 24, interposed between the line and a second telephone device 32.
- the second coder / decoder is also coupled to a second card reader 36 in which one can insert a card 38 containing personalized data relating to the voice of the correspondent at the other end of the line.
- The coders/decoders, which are in fact complete speech recognition and synthesis devices, receive the data contained in the two cards, so that the coding part is adapted to the recognition of the voice of the person located at the same end of the line as the coder/decoder, while the decoding part is adapted to the synthesis of the voice of the person located at the other end of the line.
- A data exchange protocol is therefore provided at the start of the telephone conversation to send the appropriate data to the coders/decoders. Then the conversation can take place.
- One of the people speaks; his voice is converted into coded phonemes by the coder, which has been specially adapted to the speaker's voice; the coded stream is sent over the line and received by the decoder at the other end.
- The decoder has also been adapted to the voice of the same speaker; it will therefore optimally synthesize the voice of this speaker before transmitting it to the listener of the telephone set.
- In this way, coding and decoding are specially adapted to the speaker's voice, so that at the other end of the line the correspondent receives a voice synthesized in a personalized manner.
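The round trip described above — only phoneme labels cross the line, and both ends consult the same speaker's card data — can be sketched as follows; the card contents, distance metric and function names are all assumptions for illustration.

```python
# Minimal sketch of the phonetic-vocoder idea: only phoneme labels travel
# over the "line"; both ends use the same speaker's card. All illustrative.
SPEAKER_CARD = {"A": (650, 1300, 2700), "I": (300, 2200, 3100)}  # phoneme -> formants

def encode(formant_frames, card):
    """Recognition side: map each measured formant triple to the nearest phoneme."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(card, key=lambda p: distance(card[p], frame)) for frame in formant_frames]

def decode(phonemes, card):
    """Synthesis side: map each received phoneme label back to the speaker's formants."""
    return [card[p] for p in phonemes]

frames = [(640, 1320, 2690), (310, 2190, 3120)]   # measured at the microphone
labels = encode(frames, SPEAKER_CARD)             # what travels over the line
print(labels)
print(decode(labels, SPEAKER_CARD))               # speaker-specific formants restored
```

This makes the role of the data exchange protocol concrete: `SPEAKER_CARD` must be delivered to both ends before the first phoneme label is sent.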
- Another application is the interrogation of a database, in which the interrogation is done by speech and not by means of a keyboard.
- An example is the telephone reservation of air transport.
- The user has, as in the previous application, a telephone device with which a card reader is associated; the card contains the holder's voice parameters.
- The parameters can be used in two ways: on the one hand, they can be sent over the line as identification elements of an authorized holder; if the parameters are not those of an authorized holder, the database is not made accessible. On the other hand, after the voice parameters have been transmitted to the database, a speech recognition system uses these parameters to adapt as well as possible to the voice of the person who is going to speak on the telephone line.
- The user can then speak; his voice is transmitted normally over the line (unlike the previous application, where it is coded to reduce the bit rate); a speech analysis adapted to the speaker's voice is performed at the other end of the line, to determine the message transmitted to the machine and to establish a human-machine dialogue via the telephone line.
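The first of the two uses just described — checking the transmitted parameters against those of an authorized holder before opening the database — might look like the following sketch; the reference table, tolerance value and names are assumptions, since the patent does not specify a matching rule.

```python
# Sketch of the authorization check: card parameters are compared against
# stored reference parameters. Threshold and names are assumptions.
AUTHORIZED = {"alice": (650, 1300, 2700)}  # holder -> reference voice parameters

def is_authorized(holder, card_params, tolerance=50):
    """Accept if every parameter is within `tolerance` hertz of the reference."""
    ref = AUTHORIZED.get(holder)
    if ref is None:
        return False
    return all(abs(c - r) <= tolerance for c, r in zip(card_params, ref))

print(is_authorized("alice", (655, 1290, 2710)))    # database made accessible
print(is_authorized("mallory", (655, 1290, 2710)))  # access refused
```

Once the check passes, the same parameter tuple is handed to the recognition system for the second use, speaker adaptation.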
- the personal parameters of the voice are entered in the card of a holder by a specialized machine whose main function is to determine and save these parameters.
- The card holder will have to pronounce a certain number of characteristic words in front of the machine, which will be used to make this determination.
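One plausible way such an enrollment machine could turn the pronounced words into card data is to average the formants measured for each phoneme; the patent does not specify the method, so the averaging choice and all names below are assumptions.

```python
# Sketch of the enrollment machine: the holder pronounces characteristic
# words, and the per-phoneme formant measurements are averaged before
# being written to the card. Averaging is an assumed, not stated, method.
def enroll(utterances):
    """utterances: list of (phoneme, (f1, f2, f3)) measurements."""
    sums, counts = {}, {}
    for phoneme, formants in utterances:
        acc = sums.setdefault(phoneme, [0, 0, 0])
        for i, f in enumerate(formants):
            acc[i] += f
        counts[phoneme] = counts.get(phoneme, 0) + 1
    return {p: tuple(s / counts[p] for s in acc) for p, acc in sums.items()}

card_data = enroll([("A", (640, 1280, 2680)), ("A", (660, 1320, 2720))])
print(card_data["A"])  # (650.0, 1300.0, 2700.0)
```

A production system would more likely store ranges (mean plus limit values, as the description suggests for the recognition algorithms) rather than a single mean vector.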
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR8901542A FR2642882B1 (fr) | 1989-02-07 | 1989-02-07 | Appareil de traitement de la parole |
FR89/01542 | 1989-02-07 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO1990009656A1 true WO1990009656A1 (fr) | 1990-08-23 |
Family
ID=9378539
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/FR1990/000091 WO1990009656A1 (fr) | 1989-02-07 | 1990-02-06 | Appareil de traitement de la parole |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP0456742A1 (fr) |
FR (1) | FR2642882B1 (fr) |
WO (1) | WO1990009656A1 (fr) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20010013546A1 (en) | 1996-01-09 | 2001-08-16 | Ross William Leslie | Identification system |
GB2309110B (en) * | 1996-01-09 | 1999-12-08 | Personal Biometric Encoders | Identification system |
US6496099B2 (en) | 1996-06-24 | 2002-12-17 | Computer Motion, Inc. | General purpose distributed operating room control system |
DE19726265C2 (de) * | 1997-06-20 | 2001-08-02 | Deutsche Telekom Ag | Verfahren zum Betreiben einer Anlage zur Nutzung einer Chipkarte |
DE69736014T2 (de) * | 1997-10-20 | 2006-11-23 | Computer Motion, Inc., Goleta | Verteiltes allzweck-steuerungssystem für operationssäle |
EP1120752A1 (fr) * | 2000-01-24 | 2001-08-01 | Franke & Co. Verwaltungs KG | Système pour le contrôle des droits d'entrée ou d'accès |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE3129282A1 (de) * | 1981-07-24 | 1983-02-10 | Siemens AG, 1000 Berlin und 8000 München | Verfahren zur sprecherabhaengigen erkennung von einzelnen gesprochenen worten in fernmeldesystemen |
EP0071716A2 (fr) * | 1981-08-03 | 1983-02-16 | Texas Instruments Incorporated | Vocodeur allophonique |
FR2533513A1 (fr) * | 1982-09-23 | 1984-03-30 | Renault | Procede et systeme pour communiquer a bord d'un vehicule automobile des informations complexes relatives au vehicule et a son environnement |
GB2139389A (en) * | 1983-04-29 | 1984-11-07 | Voice Electronic Technology Li | Identification apparatus |
DE3416238A1 (de) * | 1983-05-02 | 1984-12-20 | Motorola, Inc., Schaumburg, Ill. | Extremschmalband-uebertragungssystem |
WO1986006197A1 (fr) * | 1985-04-09 | 1986-10-23 | Drexler Technology Corporation | Systeme de cartes de donnees pour initialiser des unites de reconnaissance de mots parles |
WO1987004292A1 (fr) * | 1986-01-03 | 1987-07-16 | Motorola, Inc. | Procede et appareil pour synthetiser la parole a partir de modeles de reconnaissance de la parole |
US4799261A (en) * | 1983-11-03 | 1989-01-17 | Texas Instruments Incorporated | Low data rate speech encoding employing syllable duration patterns |
-
1989
- 1989-02-07 FR FR8901542A patent/FR2642882B1/fr not_active Expired - Fee Related
-
1990
- 1990-02-06 EP EP90903181A patent/EP0456742A1/fr not_active Withdrawn
- 1990-02-06 WO PCT/FR1990/000091 patent/WO1990009656A1/fr not_active Application Discontinuation
Non-Patent Citations (3)
Title |
---|
Electrical Communication, Vol. 59, No. 3, 1985, (Harlow, GB), H. Mulla et al.: "Application of Speech Recognition and Synthesis to PABX Services", pages 273-280 * |
ICASSP 80, IEEE International Conference on Acoustics, Speech and Signal Processing, Denver, Colorado, 9-11 Avril 1980, Vol. 1, IEEE (New York, US), R. SCHWARTZ et al.: "A Preliminary Design of a Phonetic Vocoder Based on a Diphone Model", pages 32-35 * |
ICASSP 85, IEEE International Conference on Acoustics, Speech, and Signal Processing, Tampa, Florida, 26-29 Mars 1985, Vol. 1, IEEE, (New York, US), S. ROUCOS et al.: "The Waveform Segment Vocoder: A New Approach for Very-Low-Rate Speech Coding", pages 236-239 * |
Also Published As
Publication number | Publication date |
---|---|
EP0456742A1 (fr) | 1991-11-21 |
FR2642882B1 (fr) | 1991-08-02 |
FR2642882A1 (fr) | 1990-08-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Greenberg | On the origins of speech intelligibility in the real world | |
EP0974221B1 (fr) | Dispositif de commande vocale pour radiotelephone, notamment pour utilisation dans un vehicule automobile | |
Ainsworth | Mechanisms of Speech Recognition: International Series in Natural Philosophy | |
McLoughlin | Applied speech and audio processing: with Matlab examples | |
US5943648A (en) | Speech signal distribution system providing supplemental parameter associated data | |
Syrdal et al. | Applied speech technology | |
US20120016674A1 (en) | Modification of Speech Quality in Conversations Over Voice Channels | |
EP0867856A1 (fr) | "Méthode et dispositif de detection d'activité vocale" | |
CA2602633A1 (fr) | Dispositif pour la communication par des personnes handicapees de la parole et/ou de l'ouie | |
EP2215626A1 (fr) | Systeme d'interpretation simultanee automatique | |
WO2018146305A1 (fr) | Methode et appareil de modification dynamique du timbre de la voix par decalage en fréquence des formants d'une enveloppe spectrale | |
CN113724718A (zh) | 目标音频的输出方法及装置、系统 | |
US6502073B1 (en) | Low data transmission rate and intelligible speech communication | |
CN115171731A (zh) | 一种情绪类别确定方法、装置、设备及可读存储介质 | |
CN113724683A (zh) | 音频生成方法、计算机设备及计算机可读存储介质 | |
CA2343701A1 (fr) | Methode de communication en langage naturel a l'aide d'un langage de balisage | |
WO1990009656A1 (fr) | Appareil de traitement de la parole | |
Hermansky | Auditory modeling in automatic recognition of speech | |
CN113724690B (zh) | Ppg特征的输出方法、目标音频的输出方法及装置 | |
EP1271469A1 (fr) | Procédé de génération de caractéristiques de personnalité et procédé de synthèse de la parole | |
Westall et al. | Speech technology for telecommunications | |
FR2859566A1 (fr) | Procede de transmission d'un flux d'information par insertion a l'interieur d'un flux de donnees de parole, et codec parametrique pour sa mise en oeuvre | |
JP2002297199A (ja) | 合成音声判別方法と装置及び音声合成装置 | |
Gao | Audio deepfake detection based on differences in human and machine generated speech | |
JP2009271315A (ja) | 音声二次元コードから音声を再生可能な携帯電話機および音声二次元コードを含む二次元コードが表示された印刷物 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): CA JP KR US |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): AT BE CH DE DK ES FR GB IT LU NL SE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1990903181 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 1990903181 Country of ref document: EP |
|
NENP | Non-entry into the national phase in: |
Ref country code: CA |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 1990903181 Country of ref document: EP |