EP0157903A1 - Method and apparatus for speech synthesizing

Info

Publication number: EP0157903A1
Application number: EP84109492A
Authority: EP
Inventors: Christian Deforeit
Original assignee: Matth Hohner AG
Current assignee: Matth Hohner AG
Priority date: 1984-02-23
Filing date: 1984-08-09
Publication date: 1985-10-16
Also published as: DE3406540C1; JPS60211499A; EP0157903B1

Abstract

1. Process for the synthesis of speech in which combinations of a consonant followed by a vowel are stored and read out as required, characterised in that - every consonant occurring in the language is stored together with a weak uniform vowel "&", - all the vowels occurring in the language are stored individually, - the desired consonant/vowel combination is formed by staggered reading out of consonant and vowel in two channels, with the uniform vowel "&" being masked by the vowel that is read out.

Description

The invention relates to a method for speech synthesis and to an arrangement for carrying it out.

It is known to store the phonemes occurring in a language (consonants and vowels) individually and then to read them out sequentially as required. Since the number of phonemes is relatively small, this makes it possible to work with relatively little storage capacity. The disadvantage, however, is that playback produces a sound that deviates strongly from natural speech, because a "click" noise becomes audible between successive consonants and vowels, so that this principle is practically never used.

In order to produce synthetic speech that closely approximates human speech, the usual procedure is therefore to store and read out combinations formed from the occurring consonants and the occurring vowels, so-called di-phonemes. It goes without saying that this requires a very considerable storage capacity, which for an Indo-European language is of the order of about 500 or more di-phonemes. Use of the method is accordingly limited to cases in which considerable expenditure is justified.

Proceeding from the latter method, the object of the invention is to reduce the storage requirement considerably, so that the method also becomes applicable to mass consumer goods, e.g. toys (dolls' voices and the like).

Claim 1 defines the method according to the invention. It can be seen that each consonant is stored only in combination with a single unit vowel, which is designated here and below as "&" and which corresponds approximately to the "e" in the German word "alle", the English word "the", or the French word "le". It has been found that, with suitably time-staggered readout, the & is masked by the actual vowel then reproduced, either completely or at least to such an extent that the resulting syllable comes very close to natural pronunciation.
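The storage saving can be illustrated with a little arithmetic. The following is a minimal sketch, assuming a hypothetical split of the sound inventory into 24 consonants and 12 vowels; the description gives no such split, and the function name `stored_units` is introduced here purely for illustration.

```python
def stored_units(n_consonants: int, n_vowels: int) -> dict:
    """Compare how many sound units each storage scheme has to hold."""
    return {
        # conventional approach: one stored entry per consonant-vowel pair
        "di_phoneme_table": n_consonants * n_vowels,
        # approach of claim 1: each consonant once (with "&") plus each vowel once
        "unit_vowel_scheme": n_consonants + n_vowels,
    }


# Hypothetical inventory; the 24/12 split is an assumption, not a figure from the text.
counts = stored_units(n_consonants=24, n_vowels=12)
print(counts)
```

The first count grows with the product of the two inventories, the second only with their sum, which is the reason the storage requirement drops so sharply.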

With the usual digital storage of the phonemes, it is expedient to store, at the appropriate point in the consonant data, a command that initiates the start of readout from the vowel memory. It can also be expedient to use suitably designed filters to amplify or attenuate the frequencies of the consonants on the one hand and of the vowels on the other to different degrees, in order to improve the masking of the &.

The invention is explained in detail below with reference to the accompanying drawings.

Fig. 1 shows the principle of the method by way of an example,
Fig. 2 shows frequency responses of filters for vowels and consonants,
Fig. 3 schematically shows a possible storage format for consonants and vowels,
Fig. 4 shows a preferred envelope for consonant generation,
Fig. 5 is a block diagram of an arrangement for carrying out the method, and
Fig. 6 is a diagram showing the timing during synthesis of a simple word.

Since readout is time-staggered, that is, readout of a vowel already begins while readout of the consonant di-phoneme (namely its & part) is still in progress, two readout channels are used. In Fig. 1 the upper diagram shows the envelope of the consonant channel and the lower diagram that of the vowel channel, the simple word "DATO" being chosen as an example. It can be seen that the & parts of the consonant channel and the vowels are reproduced simultaneously, and this alone already masks the weak & sounds strongly. This masking can, however, be supported by further measures.
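The staggered readout on two channels can be pictured with a toy example. This is a minimal sketch under stated assumptions: the amplitude values are invented, each list stands for an envelope rather than real audio, and the helper `mix_staggered` is a hypothetical name, not part of the patent.

```python
def mix_staggered(consonant, unit_vowel, vowel):
    """Sum two channels so that the vowel starts where the '&' tail begins."""
    channel_1 = consonant + unit_vowel            # stored unit: consonant followed by weak "&"
    channel_2 = [0.0] * len(consonant) + vowel    # vowel delayed until the "&" onset
    length = max(len(channel_1), len(channel_2))
    channel_1 += [0.0] * (length - len(channel_1))  # pad the shorter channel with silence
    channel_2 += [0.0] * (length - len(channel_2))
    return [a + b for a, b in zip(channel_1, channel_2)]


# Invented envelope levels: a strong consonant, a weak "&", a full-level vowel.
mixed = mix_staggered(
    consonant=[0.75, 0.75],
    unit_vowel=[0.125, 0.125],
    vowel=[1.0, 1.0, 1.0],
)
print(mixed)  # [0.75, 0.75, 1.125, 1.125, 1.0]
```

During the overlap the vowel contributes eight times the level of the "&" in this toy setup, which is the sense in which the weak unit vowel is covered by the actual vowel.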

Fig. 2 shows a first means of doing so. It is known that the frequency spectra of consonants and vowels differ; for a male voice, for example, the maxima of the consonants lie in the range of about 600 to 3000 Hz and those of the vowels in the range of about 200 to 1000 Hz. Accordingly, filters with the passbands shown in Fig. 2 are assigned to the two channels, and the filtering can take place either during recording or during playback.
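The effect of giving each channel its own passband can be sketched in a few lines. The code below is an illustration only: it assumes simple first-order high-pass and low-pass sections, takes the band edges from the ranges quoted above rather than from Fig. 2 (which is not reproduced here), and assumes a 10 kHz sample rate. All function names are hypothetical.

```python
import math


def high_pass(samples, cutoff_hz, rate_hz):
    """First-order RC high-pass, discretized sample by sample."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + 1.0 / rate_hz)
    out, prev_in, prev_out = [], 0.0, 0.0
    for x in samples:
        prev_out = alpha * (prev_out + x - prev_in)
        prev_in = x
        out.append(prev_out)
    return out


def low_pass(samples, cutoff_hz, rate_hz):
    """First-order RC low-pass, discretized sample by sample."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = (1.0 / rate_hz) / (rc + 1.0 / rate_hz)
    out, prev_out = [], 0.0
    for x in samples:
        prev_out += alpha * (x - prev_out)
        out.append(prev_out)
    return out


def band_pass(samples, low_hz, high_hz, rate_hz=10_000):
    """Crude band-pass: high-pass at the lower edge, then low-pass at the upper edge."""
    return low_pass(high_pass(samples, low_hz, rate_hz), high_hz, rate_hz)


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def tone(freq_hz, rate_hz=10_000, n=2000):
    return [math.sin(2.0 * math.pi * freq_hz * i / rate_hz) for i in range(n)]


VOWEL_BAND = (200.0, 1000.0)       # assumed from the quoted vowel range
CONSONANT_BAND = (600.0, 3000.0)   # assumed from the quoted consonant range

levels = {
    (name, freq): rms(band_pass(tone(freq), *band))
    for name, band in (("vowel", VOWEL_BAND), ("consonant", CONSONANT_BAND))
    for freq in (300.0, 2500.0)
}
for key, level in sorted(levels.items()):
    print(key, round(level, 3))
```

A 300 Hz test tone comes through the vowel filter more strongly than through the consonant filter, and a 2500 Hz tone does the opposite. A practical design would probably use steeper filters than these first-order sections.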

Fig. 3 schematically shows the storage format. During recording, the sounds are digitized, that is, amplitude-sampled at a clock rate of, for example, 10 kHz or more, and the resulting data are stored in successive memory locations for serial readout. However, two memory locations are reserved for command data, namely a "continue" command and an "end" command. The "continue" command marks the point in time at which the other channel is to proceed with readout; in the consonant data this command lies at the transition from the actual consonant sound to the & part, whereas in the vowel data it lies near the end of the data string. The "end" command is self-explanatory, but it is required because the individual phonemes differ in duration. The "continue" command can also be used to strengthen the masking effect further: when it occurs, the envelope of the channel currently being read out is damped, as indicated in Fig. 4, for which a conventional analog attenuator made up of a diode, a resistor and a capacitor can be used.
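One way to picture this storage format is as a flat list of sample codes with two reserved command codes embedded in it. The sketch below is an assumption-laden illustration: it uses 8-bit sample codes, reserves the codes 0 and 1 (the example values mentioned further on in the description), and the helpers `encode_unit` and `decode_unit` are hypothetical names.

```python
CONTINUE, END = 0, 1  # reserved command codes (example values, not mandated)


def encode_unit(samples, continue_at):
    """Build a stored record: samples, a 'continue' marker, more samples, then 'end'."""
    # Clamp to 2..255 so no sample can be mistaken for a command code
    # (8-bit range is an assumption of this sketch).
    data = [max(2, min(255, s)) for s in samples]
    return data[:continue_at] + [CONTINUE] + data[continue_at:] + [END]


def decode_unit(record):
    """Recover the samples and the sample index at which 'continue' was stored."""
    samples, continue_at = [], None
    for code in record:
        if code == END:
            break
        if code == CONTINUE:
            continue_at = len(samples)
        else:
            samples.append(code)
    return samples, continue_at


# Invented sample values: three consonant samples followed by two "&" samples.
record = encode_unit([90, 200, 40, 130, 128], continue_at=3)
print(record)               # [90, 200, 40, 0, 130, 128, 1]
print(decode_unit(record))  # ([90, 200, 40, 130, 128], 3)
```

For a consonant unit the marker would sit at the consonant-to-"&" transition, as here; for a vowel it would sit near the end of the record.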

Fig. 5 shows in block form an embodiment of a speech synthesizer which, as can be seen, is extremely simple in design. The phonemes to be reproduced are selected by external means, for example a microprocessor, and this selection does not form part of the present invention; an external control circuit is therefore indicated here only as block 1.

The arrangement comprises two mutually identical channels, only one of which is described below.

A memory address counter 2 is set by the control circuit 1 to a specific phoneme start address. A phoneme memory 3 contains all the phonemes required for a given language; thirty-six phonemes are sufficient for many languages. The phonemes are filtered during recording (as explained above), digitized and stored in the format shown in Fig. 3; the codes "0" and "1", for example, can be reserved for the "continue" and "end" commands respectively. A clock generator 5 generates the readout clock of, for example, 10 kHz for both channels. The data read out pass to a decoder 4, which determines whether they are data or one of the commands "continue" or "end". Data pass via a digital-to-analog converter 6 and a multiplier 7 to a summing element 8, and from there to an amplifier and loudspeaker unit 9.
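The interplay of address counter, phoneme memory and decoder in one channel can be sketched as a small state machine. This is a simplified software model, not the circuit itself: the class `Channel`, its method names and the memory contents are hypothetical, and the digital-to-analog conversion is left out.

```python
CONTINUE, END = 0, 1  # example command codes from the description


class Channel:
    """One readout channel: address counter plus decoder over a shared memory list."""

    def __init__(self, memory):
        self.memory = memory
        self.address = 0
        self.running = False

    def start(self, start_address):
        """The control circuit loads a phoneme start address."""
        self.address = start_address
        self.running = True

    def tick(self):
        """One clock period; returns (sample or None, command name or None)."""
        if not self.running:
            return None, None
        code = self.memory[self.address]
        if code == END:
            self.running = False  # counter is no longer incremented
            return None, "end"
        self.address += 1
        if code == CONTINUE:
            return None, "continue"
        return code, None


# Invented memory image: two short units stored back to back.
memory = [50, 60, CONTINUE, 70, END, 80, 90, CONTINUE, END]
channel = Channel(memory)
channel.start(0)
trace = [channel.tick() for _ in range(6)]
print(trace)
```

Sample codes go on toward the converter, while command codes are diverted by the decoder; after "end" the channel stays idle until a new start address is loaded.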

When the "end" command is decoded, incrementing of the address counter 2 is blocked via an AND gate 10.

If the "continue" command is decoded, a phoneme request flip-flop 11 for the other channel is set; it is reset by the external control circuit when the next start address is entered. In addition, the "continue" command toggles a damping flip-flop 12, which switches in an envelope generator 13 for one channel via its output F and for the other channel via its complementary output. The envelope generator acts on the multiplier 7, so that the output of the channel concerned is damped with a gentle decay without producing the "click" noise. The outputs of both channels are combined in the summing element 8.
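The action of the envelope generator on the multiplier can be illustrated as a gain that starts to fall once "continue" has been decoded. This is a minimal sketch, assuming an exponential decay with an arbitrary per-sample factor; the description only calls for a gently decaying damping, and the function name `apply_envelope` is hypothetical.

```python
def apply_envelope(samples, continue_at, decay=0.5):
    """Multiply samples by a gain that decays from the 'continue' position onward."""
    gain, out = 1.0, []
    for index, sample in enumerate(samples):
        if index >= continue_at:
            gain *= decay  # assumed exponential decay; the factor is illustrative
        out.append(sample * gain)
    return out


# Invented constant-level input so the gain curve is easy to read off.
damped = apply_envelope([8.0, 8.0, 8.0, 8.0, 8.0], continue_at=2)
print(damped)  # [8.0, 8.0, 4.0, 2.0, 1.0]
```

Because the gain falls off gradually rather than dropping to zero at once, the channel fades out instead of being cut off abruptly.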

The current state of the flip-flop 12 is also transmitted to the external control circuit in order to signal to it which of the two channels can be loaded, for instance at the start of a readout cycle after the circuit has been switched on.

Before a synthesis process is explained in detail with reference to Fig. 6, possible modifications of the block circuit shown in Fig. 5 should be pointed out.

The memory requirement can be halved if only one phoneme memory 3 is provided for both channels and readout is time-multiplexed. The multiplier 7 is already included in certain commercially available digital-to-analog converters, so that the output of the envelope generators 13 need only be connected to the corresponding input of the converter. The circuit can also be implemented largely in a microprocessor, in which case either the two envelope generators and the two converters remain external, or only a single shared converter does, while all other operations are carried out digitally by the microprocessor.

Fig. 6 shows

- in line (a) the clock of the clock generator 5; this clock can be fixed, but it can also be varied by the control circuit 1 in order to achieve phrasing that is even closer to natural speech,
- in line (b) formats from the first channel, here the phonemes "D&" and "T&",
- in line (c) the logic level at the output of the flip-flop 12,
- in line (d) the logic level at the output of the flip-flop 11 of the second channel,
- in line (e) formats from the second channel, here the phonemes "a" and "o",
- in line (f) the logic level at the output of the flip-flop 11 of the first channel,
- in lines (g) and (h) the envelopes generated by the envelope generators 13 of the first and second channels respectively, and
- in lines (i) and (k) the analog output signals of the first and second channels respectively; the envelopes are not to be taken as representative of the sounds "D", "A", "T" or "O" actually produced, since the diagram serves only to explain the timing.
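The handover between the two channels for the word "DATO" can be traced with a small simulation. It is a sketch under several assumptions: unit lengths and marker positions are invented, samples are replaced by letters, a command slot is taken to occupy one clock period, and the function `synthesize` stands in for both the channels and the external control circuit.

```python
CONTINUE, END = "continue", "end"

# Invented unit layouts: letters stand for samples; marker positions are illustrative.
UNITS = {
    "D&": ["D", "D", CONTINUE, "&", "&", END],
    "a": ["a", "a", "a", CONTINUE, "a", END],
    "T&": ["T", "T", CONTINUE, "&", "&", END],
    "o": ["o", "o", "o", CONTINUE, "o", END],
}


def synthesize(sequence):
    """Return one (channel 1, channel 2) pair per clock period; '-' means no data."""
    queue = list(sequence)
    channels = [[queue.pop(0), 0], None]  # each entry: [unit name, read position]
    timeline = []
    while any(channels):
        row, requests = ["-", "-"], []
        for index, state in enumerate(channels):
            if state is None:
                continue
            code = UNITS[state[0]][state[1]]
            if code == END:
                channels[index] = None      # this channel stops reading
                continue
            state[1] += 1
            if code == CONTINUE:
                requests.append(1 - index)  # request a phoneme for the other channel
            else:
                row[index] = code
        for index in requests:
            # The control side supplies the next unit if the other channel is free.
            if queue and channels[index] is None:
                channels[index] = [queue.pop(0), 0]
        timeline.append(tuple(row))
    return timeline


timeline = synthesize(["D&", "a", "T&", "o"])
print("channel 1:", "".join(row[0] for row in timeline))
print("channel 2:", "".join(row[1] for row in timeline))
```

In the printed rows the "&" samples of the first channel line up with vowel samples of the second channel, and each vowel is still running when the next consonant starts. The sketch simply drops a request if the other channel is still busy, which a real control circuit would have to handle.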

Claims

1. A method for speech synthesis in which combinations of a consonant and a following vowel are stored and read out as required, characterized in that

- every consonant occurring in the language is stored together with a weak unit vowel "&",

- all vowels occurring in the language are stored individually,

- the desired consonant-vowel combination is formed by staggered readout of the consonant and the vowel in two channels, the unit vowel "&" being masked by the vowel that is read out.

2. The method according to claim 1, characterized in that the frequencies typical of vowels are attenuated when the combinations of consonant and unit vowel are stored, and the frequencies typical of consonants are attenuated when the vowels are stored.

3. The method according to claim 1 or 2, characterized in that the amplitudes of the unit vowels "&" are damped when reading.

4. The method according to claim 1, 2 or 3, characterized in that the reading is carried out with a variable clock frequency.

5. Arrangement for performing the method according to claim 1, characterized by two alternately activatable readout channels and a switching circuit which can be controlled by a command read out in one channel to activate the other channel.

6. Arrangement according to claim 5, characterized in that each channel comprises a memory for all the sounds required (phonemes and di-phonemes).

7. Arrangement according to claim 5 or 6, in which the sounds (phonemes and di-phonemes) are stored digitally in each memory and can be read out sequentially, characterized in that, at least in the di-phonemes, the reversal command is stored in the readout sequence between the consonant interval and the unit vowel interval.

8. Arrangement according to claim 7 for carrying out the method according to claim 3, characterized in that for each channel an envelope generator is provided, by means of which the amplitude damping is effected and which can be activated by the reversal command.

9. Arrangement according to claim 5, in which the sounds (phonemes and di-phonemes) are stored digitally in each memory and can be read out sequentially, characterized in that at the end of each readout sequence an "end" command can be read out, by means of which a next readout process can be initiated.