DE3935308C1

DE3935308C1 - Speech recognition method by digitising microphone signal - using delta modulator to produce a continuous stream of equal-value bits for data reduction

Info

Publication number: DE3935308C1
Application number: DE19893935308
Authority: DE
Inventors: Prof. Dr. Gebhard Radi, 7743 Furtwangen, DE
Original assignee: Individual
Current assignee: Individual
Priority date: 1989-10-24
Filing date: 1989-10-24
Publication date: 1991-01-10
Anticipated expiration: 2009-10-25

Abstract

Individual spoken words are converted into analogue electrical speech signals and these, after amplification and normalisation, are digitalised. The digitalised signals are then coded and compared with all the other reference sounds, coded in the same form, in a memory. There results a test word, as a sequence of sounds, and this is compared with the words in a phonological-orthographic schedule so that a word similar to the test word can be identified and reproduced. The amplified analogue electrical speech signals are differentiated w.r.t. time and these are digitalised by a delta modulator to produce a stream of bits of the same value. Such a stream can be used to enhance the comparison process. USE/ADVANTAGE - Improvement in speed and certainty of recognition by using bar pattern.

Description

The invention relates to a method for speech recognition according to the preamble of claim 1, and to a circuit arrangement for carrying out the method according to the preamble of claim 10.

1. Typical speech recognition system and state of the art.

Speech recognition, as it has become established over the past few decades, is a three-stage process with several built-in feedback and iteration mechanisms.

Typical speech recognition system (see Figure 1):

In the first stage, acoustic signal processing, the time signal from the microphone is fed to an analog-to-digital conversion (AD conversion) and then Fourier transformed. Segmentation information is extracted in the time domain: the waveform is divided into words, energy content is measured, zero crossings are counted, and voiced and unvoiced sounds are classified with the help of autocorrelation functions (in English terminology: formants, frication, voicing and pitch determination).

With the help of this information obtained in the time domain, suitable time slices of about 10 ms duration, so-called "frames", are cut out of the time signal and subjected to a Fast Fourier Transform (FFT). That is, typically 256 Fourier coefficients are calculated using special signal processors.

Another important method of parameter extraction is Linear Predictive Coding (LPC). It is based on a physical and electrical model of the human vocal tract, consisting of filters and resonant circuits. Typically 20 parameters, intended to represent the characteristic parameters of the LPC model, are calculated from the time signal. This calculation, too, can only be carried out with very fast and costly special signal processors.

Parameter extraction, whether by FFT or by LPC, sharply reduces the bit stream from about 240,000 bits/s (12-bit AD conversion at a 20 kHz sampling rate) to about 1000 bits/s. Only this concentrated data stream (the parameters) can sensibly be passed on to further processing.

The second stage of speech recognition is called decoder processing. The parameters obtained in the first stage are treated as components of a vector and compared with stored vectors of known acoustic symbols. This comparison yields sound or word candidates that come closest to the sound or word to be recognized. The background for this is a phonological dictionary in which all words, together with all their variants arising from declension and conjugation, are stored individually.

The final selection among the candidates is then made in the third stage, the linguistic processor. It statistically evaluates the probability of a given candidate in the given context.

At the end of this three-stage processing, the written text is obtained.

State of the art

Most known speech recognition systems use acoustic signal preprocessing with the following stages:

1. Low-noise microphone.
2. Preamplifier with 5 kHz low-pass filter.
3. 8- to 12-bit analog-digital converter with a sampling rate of 15 to 20 kHz.
4. Fast Fourier Transform (FFT) or Linear Predictive Coding (LPC) parameter generation over a predefined time grid.
5. Conditioning and filtering of the spectral lines (parameters).
6. Normalization with respect to pitch and speaker dependency.

See "Maschinelles Erkennen der deutschen Sprache" (Machine recognition of the German language), IBM Wissenschaftliches Zentrum Heidelberg, April 21, 1989, and Teuvo Kohonen: "The Neural Phonetic Typewriter", IEEE Computer, March 1988.

The most important speech recognition systems on the market to date are speaker-dependent isolated-word recognition systems (German abbreviation: EWE), in which the recognition phase is preceded by a training phase. This usually lengthy learning or training phase is necessary because of the method: the parameter sequence (FFT, LPC or other parameters) of every word in a given vocabulary is recorded several times during training and stored in memory as a word template. In the recognition phase, the parameter sequence of the word to be recognized is then compared with all stored word templates by a method known as "dynamic programming", and the closest word is taken as recognized.
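The template comparison by dynamic programming described for the prior art is usually understood to be dynamic time warping. The following is a minimal sketch of that idea, not taken from the patent: it assumes one scalar parameter per frame (real systems compare parameter vectors), and the names `dtw_distance`, `recognise` and the example templates are hypothetical.

```python
from typing import Dict, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Accumulated cost of the best time alignment of two parameter sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost for aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognise(unknown: Sequence[float],
              templates: Dict[str, Sequence[float]]) -> str:
    """Return the word whose stored template is closest to the unknown sequence."""
    return min(templates, key=lambda word: dtw_distance(unknown, templates[word]))
```

Because every vocabulary word needs its own template and its own alignment pass, the effort grows with vocabulary size, which is the cost the patent later contrasts with its small set of reference sounds.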

Such EWE systems have vocabularies of about 1000 words or more, whereas speaker-independent systems usually have a small vocabulary of about 20 to 100 words.

Somewhat outside this general direction, there have also been attempts at speech recognition without FFT or LPC transformation. GB-PS 12 16 756 specifies a method of measuring the effective sound pressure in several frequency channels and recognizing a sound sequence from the diagrams of sound pressure versus frequency and time using methods of optical image evaluation. DE-OS 29 18 533 describes a speech recognition system in which two CVSD (Continuously Variable Slope) delta modulators, i.e. modulators with a variable quantization step size, derive digital information about the speech signal frequencies and digital information about the amplitude envelope, which is then processed further in a microcomputer.

From US 47 00 360 it is known to differentiate the speech signals before they are applied to a delta modulator, in order to map the extrema of the speech signals onto zero crossings.

Delta modulation, and in particular its adaptive variant CVSD, also plays a role in speech digitization for the purposes of voice input, storage, transmission and playback, as well as in synthetic speech output (see e.g. Klaus Sickert: "Automatische Spracheingabe und Sprachausgabe", Verlag Markt & Technik, 1983, pp. 161-169).

2. Conception of a new type of speech recognition system (method and circuit arrangement)

2.1 Motivation and task

For their Fourier, LPC and/or other spectral analyses, the common speech recognition systems require either long computing times or special signal processors. The subsequent analysis of the three-dimensional spectrograms (amplitudes plotted over frequency and time axes) again requires high computing power and complex algorithms.

Considering further that, with these methods, the speech signal is first recorded at 240 kb/s or more and then analysed in order to obtain a letter sequence of about 12 letters/s, corresponding to about 60 b/s, it is natural to start with a lower data rate straight away (e.g. 50 kb/s) and to reduce the bit rate further by immediate recoding, with the aim of simplifying the analysis.
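The data rates quoted here can be checked with a few lines of arithmetic. The sketch below uses only figures from the text. The value of 5 bits per letter is not stated in the patent; it is simply what 60 b/s divided by 12 letters/s implies.

```python
# Figures quoted in the text.
bits_per_sample = 12         # 12-bit AD conversion
sampling_rate_hz = 20_000    # 20 kHz sampling rate
letters_per_second = 12      # approximate rate of the recognised letter sequence
text_bit_rate = 60           # b/s, as quoted for the letter sequence
delta_start_rate = 50_000    # b/s, the lower starting rate suggested in the text

pcm_bit_rate = bits_per_sample * sampling_rate_hz        # conventional PCM input rate
bits_per_letter = text_bit_rate / letters_per_second     # implied coding of one letter
overall_reduction = pcm_bit_rate / text_bit_rate         # PCM input to text output
delta_reduction = delta_start_rate / text_bit_rate       # 1-bit input to text output

print(pcm_bit_rate, bits_per_letter, overall_reduction, round(delta_reduction))
```

The conventional chain has to compress the input by a factor of 4000, whereas a 50 kb/s starting point leaves a factor of roughly 830.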

Based on these observations, the present invention takes the approach of performing speech recognition entirely in the time domain, i.e. without Fourier or LPC transformation.

Task

The object of the present invention is to specify a method and a circuit arrangement for isolated-word recognition in which a phonetic transcription of the spoken word is determined from the digitized speech signal by computer analysis. Speaker independence is aimed for.

In a further step, the spoken word is then to be determined from the phonetic transcription, but this is not the subject of this invention.

With regard to the method, the object is achieved by the features of claim 1, and with regard to the circuit arrangement by the features of claim 10.

2.2 Principles of conception

The system presented here (method and circuit arrangement) recognizes spoken language solely by processing the microphone signal in the time domain. (Transformation algorithms with complex special signal processors are not needed.)

The system is based on the following principles:

a) Time differentiation

The microphone signal (= speech signal) is amplified and differentiated with respect to time. As is well known, this time differentiation emphasizes changes in the signal and removes the DC component.

b) Delta modulation

The differentiated speech signal is digitized with a delta modulator. Delta modulation is a method of analog-to-digital conversion (ADC).

The delta modulator is a 1-bit digitizer (see e.g. Sickert).

This operation corresponds to a second time differentiation. The result is a continuous bit stream of bits of equal weight.

The delta modulation could also be carried out computationally (= numerically) on a stream of n-bit words coming from an n-bit analog-digital converter.

In both cases the result is a continuous bit stream of bits of equal weight, in contrast to the word stream from the n-bit ADC, where bit no. j (j = 0, 1, 2, ..., n-1) of the n-bit word has the weight 2^j. Pattern recognition is probably not possible with such n-bit words.
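The numerical variant mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the patent's circuit: the first difference stands in for the analogue differentiator, the step size and the initial estimate of 0 are arbitrary choices, and the function names are hypothetical.

```python
from typing import List, Sequence


def differentiate(samples: Sequence[float]) -> List[float]:
    """First difference as a discrete stand-in for time differentiation."""
    return [b - a for a, b in zip(samples, samples[1:])]


def linear_delta_modulate(signal: Sequence[float], step: float) -> List[int]:
    """1-bit linear delta modulation with a fixed quantization step.

    Each output bit only says whether the running estimate must go up (1)
    or down (0) to follow the input, so all bits carry the same weight.
    """
    estimate = 0.0
    bits = []
    for x in signal:
        bit = 1 if x >= estimate else 0
        estimate += step if bit else -step
        bits.append(bit)
    return bits


if __name__ == "__main__":
    samples = [0, 1, 3, 6, 10, 15]            # word stream from an n-bit ADC
    slope = differentiate(samples)            # [1, 2, 3, 4, 5]
    print(linear_delta_modulate(slope, 1.0))  # rising slope gives a run of ones
```

A constant input produces the alternating idle pattern 1, 0, 1, 0, ..., while a steadily rising input produces runs of ones. The density of ones in the bit stream is what the later recoding step counts.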

c) Recoding

The continuous bit stream is subjected to recoding. Each byte B is assigned a main code number S = number of ones in byte B.

The sequence of main code numbers S can be displayed as a graphical bar pattern. This is the real significance of the recoding.

At the same time the recoding is a data reduction, since a byte B with 256 possible bit patterns is mapped to the main code number S, which lies in the interval 0 ≤ S ≤ 8.
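The size of this reduction can be made concrete by enumerating all 256 byte values and counting how many of them map to each value of S. A short sketch (the variable names are illustrative):

```python
from collections import Counter

# For every possible byte value, S is the number of one bits it contains.
patterns_per_s = Counter(bin(byte).count("1") for byte in range(256))

for s in range(9):
    print(s, patterns_per_s[s])
```

The 256 bit patterns collapse onto only 9 values, and the middle values absorb most of them (70 patterns share S = 4), so the position of the ones within the byte is deliberately discarded.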

d) Pattern recognition

The bar patterns can be recognized by any method of pattern recognition.

Several methods are available:

1. Key figure vector method
Classification of the patterns with the help of suitably defined key figures with strong invariance properties. Recognition is then based on calculating the distances from an unknown phoneme to the known reference phonemes in the key figure vector space.
2. Neural network method
The bar patterns are fed into a neural network that has previously learned the reference phonemes in a learning phase.
3. Pattern recognition by "classical" and other methods of image evaluation and recognition.

Summary

The functionality of the new system is based on:

- twofold time differentiation of the speech signal, which emphasizes the speech dynamics.
- generation of a continuous bit stream of bits of equal weight.
- recoding of the bit stream into a stream of main code numbers S, which can be represented graphically as a bar pattern.

The advantages of the new system are as follows:

a) The acoustic signal processing is simpler than in the prior art and the computing power required for the analysis is lower; alternatively, for the same computing power, word recognition is faster.
b) Only about 26 reference sounds have to be stored (e.g. the sounds of Fig. 6), in contrast to the word templates for every word of the vocabulary in the prior art. Usually only the reference vowel sounds change from speaker to speaker.
c) Less speaker dependency than with known systems, for the reasons mentioned under b).

An embodiment of the invention is explained below with reference to Figs. 2-7. The figures show:

Fig. 1 Typical speech recognition system

Fig. 2 New speech recognition system

Fig. 3 Block diagram of the hardware for acoustic signal conditioning

Fig. 4 Generation of a bar pattern

Fig. 5 Recoding

Fig. 6 Bar patterns of all letters and sounds

Fig. 7 Printout of the title

2.3 System components

The new system is shown in summary in Fig. 2. The elements and components are presented individually below.

1. Preamplifier with compressor and low-pass filter

The preamplifier with compressor and low-pass filter serves to normalize the volume and to remove interfering noise. The components are specially designed and adapted to the particular conditions of the present system. The design follows the principles of electronic circuit engineering. For the block diagram see Fig. 3.

2. 1-bit analog-digital converter (delta modulator)

From the conditioned analog microphone signal, the delta modulator generates a digital bit sequence for storage and further digital processing of the speech signal. An LDM (linear delta modulator), i.e. one with fixed and equal quantization steps, is used so that the outgoing bit sequence consists of bits of equal weight.

3. Recoding the bit stream

Recoding the bit stream from the delta modulator is the data processing step that prepares the bit stream for the generation of graphically displayable bar patterns that differ as markedly as possible and can be assigned to the individual letters of the spoken word. For a more detailed description see section 2.4.

4. Recognition of the bar pattern

Machine recognition of the bar patterns with the help of specially developed algorithms on a personal computer. The serial bar pattern thus yields a sequence of sounds, also called the test word. For a more detailed description see section 2.5.

5. Linguistic processing

The test word is compared with the words in a phonological-orthographic dictionary; the dictionary word most similar to the test word is identified and then output on a printer or screen, or by speech output.
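The patent does not say how similarity between the test word and the dictionary entries is measured. As one possible illustration, the sketch below uses the sequence-matching ratio from Python's standard `difflib` module. The dictionary layout (phonetic spelling mapped to orthographic spelling), the example entries and the name `closest_word` are assumptions made for this example.

```python
import difflib
from typing import Dict, Optional


def closest_word(test_word: str, lexicon: Dict[str, str]) -> Optional[str]:
    """Return the orthographic form whose phonetic key is most similar to test_word.

    lexicon maps a phonetic spelling to its orthographic spelling.
    """
    # cutoff=0.0 means the best candidate is returned however poor the match.
    matches = difflib.get_close_matches(test_word, list(lexicon), n=1, cutoff=0.0)
    return lexicon[matches[0]] if matches else None


if __name__ == "__main__":
    lexicon = {"haus": "Haus", "maus": "Maus", "baum": "Baum"}
    print(closest_word("hauz", lexicon))  # one misrecognised sound in the test word
```

A practical system would likely weight confusions between acoustically similar sounds differently from arbitrary substitutions; the plain string ratio used here ignores that.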

2.4 Recoding and bar pattern

If one tries to display the 48 kbit/s bit stream from the delta modulator graphically, there is not much more to see than a rather irregular pulse train, not unlike a digital noise signal and in any case just as easy or difficult to analyse as the microphone signal itself. This bit stream is now recoded so that it can be displayed as a bar pattern that is easy to recognize visually.

Recoding

The bit stream from the delta modulator is first divided into bytes (8-bit packets).

From each byte B with hexadecimal representation H, two new code numbers are now formed, S and F.

Definition of the code number S:
S is the number (sum) of "ones" in byte B.

Definition of the code number F:
F is the number of edges in byte B, i.e. the number of 0→1 transitions plus the number of 1→0 transitions.

The edge at the very beginning of byte B, i.e. at the junction with the previous byte, is counted if it exists; the edge at the very end of byte B, which likewise may or may not exist, is not counted.

By this definition, the code number F depends not only on the current byte but also on the previous byte.
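The two definitions can be sketched in a few lines. The function name `recode` and the representation of the bit stream as a list of 0/1 integers are assumptions for illustration. Two details are not specified in the text and are chosen here: the first byte has no predecessor, so no leading edge is counted for it, and trailing bits that do not fill a whole byte are dropped.

```python
from typing import List, Optional, Sequence, Tuple


def recode(bits: Sequence[int]) -> List[Tuple[int, int]]:
    """Map a bit stream to one (S, F) pair per byte.

    S: number of ones in the byte.
    F: number of 0->1 and 1->0 edges, including the edge at the junction with
       the previous byte but excluding the edge at the end of the byte.
    """
    codes = []
    previous_last: Optional[int] = None       # last bit of the preceding byte
    usable = len(bits) - len(bits) % 8        # ignore an incomplete final byte
    for start in range(0, usable, 8):
        byte = list(bits[start:start + 8])
        s = sum(byte)
        f = sum(1 for a, b in zip(byte, byte[1:]) if a != b)  # edges inside the byte
        if previous_last is not None and previous_last != byte[0]:
            f += 1                            # edge at the junction with the previous byte
        codes.append((s, f))
        previous_last = byte[-1]
    return codes


if __name__ == "__main__":
    stream = [1, 1, 1, 1, 0, 0, 0, 0,   # smooth byte: one internal edge
              1, 0, 1, 0, 1, 0, 1, 0]   # rough byte: seven internal edges plus the junction
    print(recode(stream))
```

Both example bytes contain four ones, so S alone cannot tell them apart, while F differs strongly. Counting the leading edge but not the trailing one ensures that every edge in the stream is counted exactly once.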

Code number intervals

As is immediately apparent, the code numbers S and F defined in this way can range between 0 and 8.

0 ≤ S ≤ 8
0 ≤ F ≤ 8

A large S value means many ones; a large F value means high ripple or "roughness" of byte B. For an example see Fig. 5.

The code numbers S and F defined here represent a generalization or summary of byte B.

While B itself has 2⁸ = 256 possible realizations, the code numbers S and F are restricted to the interval from zero to eight. Forming S and F thus recodes byte B.

The code number F complements the code number S to a certain extent, since it carries the additional information "roughness", which is lost when S is formed. When S is formed, the position of the ones in byte B is disregarded; the ones are merely counted.

Bar pattern display

The sum numbers S of the bit stream are now displayed graphically as a bar pattern. See Fig. 6.

It is precisely this type of representation that yields, for every sound and often even for every individual letter of the speech signal, a visually distinctive and clearly recognizable bar pattern.
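A bar pattern of this kind can be imitated in plain text by drawing one column per code number S. The rendering below is only an illustration of the idea; the function name `render_bars` and the use of `#` characters are choices made for this example, not the format of Fig. 6.

```python
from typing import Sequence


def render_bars(s_values: Sequence[int], max_height: int = 8) -> str:
    """Draw one vertical bar per S value, tallest row first."""
    rows = []
    for level in range(max_height, 0, -1):
        # A column is filled at this level if its bar reaches at least this high.
        rows.append("".join("#" if s >= level else " " for s in s_values))
    return "\n".join(rows)


if __name__ == "__main__":
    print(render_bars([1, 3, 5, 8, 5, 3, 1, 0, 2, 4, 6, 4, 2]))
```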

The code number F plays a much smaller role than S. F is also strongly speaker-dependent. In the subsequent bar pattern recognition phase, F is used as a possible additional key figure.

2.5 Pattern recognition

The bar patterns formed from the code numbers S are now to be recognized by methods of pattern recognition. That is, a newly entered bar pattern must be assigned to one of the known, stored letters or sounds, called reference sounds.

The bar patterns are relatively easy to tell apart by eye, since the human eye and brain seem to be virtually made for this task. In this special field, no machine achieves results that are even remotely comparable (a child finds one gummy bear in a pile of toys).

The bar patterns to be recognized here for the various letters or sounds depend to a greater or lesser extent on volume, pitch, speaker and so on. Nevertheless, they can be identified by eye.

For pattern recognition, therefore, suitable key figures must now be derived from the pattern which, stripped of such "incidentals", contain the pure information about the bar pattern, so to speak.

Out of a large number of conceivable key figures, the following definitions have proven to be meaningful and sufficiently invariant to the disturbances mentioned:

Definitions of usable key figures

Height: mean bar height
Skyline: outline length / blackening
Symmetry: determination of a phase shift (= period length), greater than 10, at which a given phoneme has maximum coverage with itself. This maximum coverage is then called the symmetry.
Period: the period length that results from the calculation of the symmetry
Irregularity: mean deviation of the bars in a cluster from their center
Jaggedness: measures the occurrence of tall stripes, each with a gap of width 1 to the next stripe
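As a rough illustration of the definitions above, the following Python sketch computes height, skyline, symmetry, and period for a bar pattern given as a list of code numbers. It is not taken from the patent. The function names, the staircase interpretation of "outline length", the element-wise minimum and maximum used for the overlap, and the search range for the phase shift are all assumptions. Irregularity and jaggedness are omitted because the text does not define clusters or stripe widths precisely enough.

```python
import math
from typing import List, Tuple


def height(bars: List[int]) -> float:
    """Height: mean bar height of the pattern."""
    return sum(bars) / len(bars)


def skyline(bars: List[int]) -> float:
    """Skyline: outline length divided by blackening (sum of all bars).

    Assumption: the outline is the staircase contour over the bars, i.e. one
    unit of width per bar plus every vertical step, including the rise from
    the baseline before the first bar and the drop to it after the last.
    """
    blackening = sum(bars)
    if blackening == 0:
        return 0.0
    padded = [0] + list(bars) + [0]
    vertical = sum(abs(b - a) for a, b in zip(padded, padded[1:]))
    return (len(bars) + vertical) / blackening


def overlap(p: List[int], r: List[int]) -> float:
    """Intersection divided by (union without intersection).

    Assumption: intersection and union are the summed element-wise minimum
    and maximum of the bar heights. Identical patterns yield math.inf.
    """
    inter = sum(min(x, y) for x, y in zip(p, r))
    union = sum(max(x, y) for x, y in zip(p, r))
    return math.inf if union == inter else inter / (union - inter)


def symmetry_and_period(bars: List[int], min_shift: int = 11) -> Tuple[float, int]:
    """Best self-overlap over all shifts greater than 10, and that shift.

    Assumption: shifts up to half the pattern length are searched, and the
    smallest shift wins on ties.
    """
    best, period = -1.0, min_shift
    for shift in range(min_shift, len(bars) // 2 + 1):
        score = overlap(bars[:-shift], bars[shift:])
        if score > best:
            best, period = score, shift
    return best, period


# Hypothetical test pattern: a triangular ramp repeating every 16 bars.
ramp = [0, 1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1]
pattern = ramp * 10  # 160 code numbers
symmetry, period = symmetry_and_period(pattern)
```

For this perfectly periodic pattern the overlap at a shift of 16 is unbounded, so a practical version would probably need to cap or rescale the measure.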

Key figure vector V

For pattern recognition it has proven useful to take bar patterns of length 160, that is, 160 consecutive code numbers S.

A key figure vector V is now calculated from each bar pattern of length 160. V is the set of the k_z key figures, a k_z-dimensional vector. Typically k_z = 20. The key figures defined above are used, plus a number of other key figures that are not explained in more detail here.

Initialization phase

In the initialization phase of the system, such a vector is calculated for each letter of the language. The vectors of these phoneme prototypes are called reference vectors and are stored in the computer's memory.

Recognition phase

In the recognition phase, the following criteria are calculated and used for decision making:
- proximity
- coverage

From these criteria the acceptance is calculated, a measure of the significance of the decision.

a) Proximity

In the recognition phase, the key figure vector V is first determined for the new and still unknown phoneme. This V represents a point in k_z-dimensional space. The nearest reference point, or in cases of doubt (that is, when the minimum distance is larger) the 2 nearest reference points, yield the 1 or 2 candidates for phoneme identification.

Proximity is the most important decision criterion. If there are 2 candidates, the further decision is made by considering the next criterion, coverage.
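The candidate selection described above can be sketched as a nearest-neighbor search. The sketch below is only an illustration: the Euclidean distance, the function name `nearest_candidates`, and the threshold `doubt_distance` that marks a "case of doubt" are assumptions, since the text names neither a distance measure nor a threshold value.

```python
import math
from typing import Dict, List, Sequence


def nearest_candidates(
    v: Sequence[float],
    references: Dict[str, Sequence[float]],
    doubt_distance: float,
) -> List[str]:
    """Return 1 candidate, or 2 when the nearest reference is far away.

    v              key figure vector of the unknown phoneme
    references     reference vectors keyed by phoneme label
    doubt_distance hypothetical threshold above which the match is doubtful
    """
    # Rank all reference phonemes by their distance to v (assumed Euclidean).
    ranked = sorted(references, key=lambda name: math.dist(v, references[name]))
    nearest = ranked[0]
    if math.dist(v, references[nearest]) > doubt_distance:
        return ranked[:2]  # doubtful: keep the two nearest reference points
    return ranked[:1]


# Hypothetical 2-dimensional reference vectors (the text uses k_z = 20).
refs = {"a": [0.0, 0.0], "o": [3.0, 4.0], "e": [10.0, 10.0]}
clear_case = nearest_candidates([0.0, 1.0], refs, doubt_distance=2.0)
doubtful_case = nearest_candidates([2.0, 2.0], refs, doubt_distance=2.0)
```

In the first call the nearest reference lies within the threshold, so a single candidate is returned. In the second call it does not, so the two nearest references are passed on to the coverage criterion.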

b) Coverage

Coverage measures the overlap of the bar pattern of the unknown phoneme P with the reference phonemes R.
Definition of coverage

Intersection of P with R, divided by the union of P with R excluding the intersection of P with R.
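Read literally, this definition divides the shared area of the two bar patterns by the area that belongs to only one of them. The following Python sketch shows one possible reading. Treating the intersection as the element-wise minimum and the union as the element-wise maximum of the bar heights is an assumption, as is the function name `coverage`.

```python
import math
from typing import Sequence


def coverage(p: Sequence[int], r: Sequence[int]) -> float:
    """Intersection of P and R divided by (union minus intersection).

    Assumption: both patterns have the same length, the intersection is the
    summed element-wise minimum, and the union the summed element-wise
    maximum of the bar heights.
    """
    intersection = sum(min(x, y) for x, y in zip(p, r))
    union = sum(max(x, y) for x, y in zip(p, r))
    difference = union - intersection
    if difference == 0:
        # Identical patterns: the literal definition has no finite value.
        return math.inf
    return intersection / difference


unknown = [4, 4, 0, 2]    # hypothetical bar pattern P
reference = [2, 4, 2, 2]  # hypothetical reference pattern R
score = coverage(unknown, reference)
```

Here the shared area is 8 and the non-shared area is 4, so the score is 2.0. Larger values mean a better match, and the measure grows without bound as the patterns become identical.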

Acceptance is the quantity that measures the significance of a decision.

Acceptance A = α · proximity + β · coverage

The weights α, β of the criteria are to be chosen appropriately.
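A minimal sketch of how the acceptance could be used to decide between two candidates follows. The numeric scores, the weights, and the function names are hypothetical. In particular, the text does not say how proximity and coverage are scaled, so the sketch simply assumes both are already available as scores for which larger means better.

```python
from typing import Dict, Tuple


def acceptance(proximity: float, coverage: float, alpha: float, beta: float) -> float:
    """Acceptance A = alpha * proximity + beta * coverage."""
    return alpha * proximity + beta * coverage


def decide(candidates: Dict[str, Tuple[float, float]], alpha: float, beta: float) -> str:
    """Pick the candidate phoneme with the highest acceptance.

    candidates maps a phoneme label to its (proximity, coverage) scores.
    """
    def score(name: str) -> float:
        proximity, cov = candidates[name]
        return acceptance(proximity, cov, alpha, beta)

    return max(candidates, key=score)


# Hypothetical scores for two candidates and hypothetical weights.
scores = {"a": (0.8, 0.4), "o": (0.7, 0.9)}
winner = decide(scores, alpha=0.6, beta=0.4)
```

With these numbers candidate "a" is slightly closer, but "o" has the better coverage and ends up with the higher acceptance (0.78 against 0.64).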

With this method of pattern recognition, the decoding stage of the new speech recognition system delivers a phonetic transcription of the spoken word which in many cases already comes very close to the orthographic plain text.

The third stage of every speech recognition system, the linguistic processor, is not the subject of this document. It then performs the final transcription into orthographically correct text.

2.6 Sub-problems and solutions

2.6.1 Volume

Differences in volume have a large influence on the delta-modulated bit stream and considerably impair the distinguishability of the bar patterns.

Solution: A special compressor maps the entire dynamic range of the microphone signal onto a suitable volume "band". This requires special automatically regulating amplifiers and limiters.

2.6.2 Pitch Fluctuations

An "A" can be spoken in a very deep bass voice or in a high voice.

Solution: This problem has largely been solved by the pattern-recognition method, which is made particularly insensitive to pitch fluctuations by defining pitch-invariant features.

2.6.3 Speaker dependency

The problem of speaker dependency is essentially the same as that of pitch fluctuations and is treated in the same way.

2.6.4 Speech speed

Fortunately, natural language contains so much redundancy that the problem of speech speed hardly interferes with the present speech recognition method. This follows from the fact that the method actually determines individual sounds, and from them a sequence of sounds, rather than working with whole-word templates. In this way, redundancies at the sound level can be taken into account. See, for example, the redundancy in the vowels in Fig. 7.

2.7 System specifications

The sampling frequency of the delta modulator is

f_sample = 48 kHz, corresponding to 48,000 bits/s

A 1-second speech signal contains approximately 10 letters. The bit stream therefore delivers approximately 4800 bits per letter = 600 bytes per letter. This gives 3.75 bar patterns of length 160 bytes per letter.

These figures are rough approximations, since letters can of course be pronounced at different speeds. Vowels, especially stressed vowels, take up most of the time.
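The figures above can be checked with a few lines of Python. The sketch also shows, as an illustration only, how a bit stream could be recoded into code numbers S (the number of ones per byte, as in claim 1) and cut into bar patterns of length 160. The function names and the alternating test bit stream are assumptions.

```python
from typing import List

SAMPLE_RATE_BITS_PER_S = 48_000  # delta modulator at 48 kHz, 1 bit per sample
LETTERS_PER_S = 10               # rough figure from the text
BITS_PER_BYTE = 8
PATTERN_LENGTH = 160             # code numbers S per bar pattern

bits_per_letter = SAMPLE_RATE_BITS_PER_S // LETTERS_PER_S
bytes_per_letter = bits_per_letter // BITS_PER_BYTE
patterns_per_letter = bytes_per_letter / PATTERN_LENGTH


def code_numbers(bits: List[int], n: int = BITS_PER_BYTE) -> List[int]:
    """Code number S per byte: the count of ones in n consecutive bits."""
    usable = len(bits) - len(bits) % n  # drop an incomplete trailing byte
    return [sum(bits[i:i + n]) for i in range(0, usable, n)]


def bar_patterns(codes: List[int], length: int = PATTERN_LENGTH) -> List[List[int]]:
    """Cut the code numbers into consecutive bar patterns of equal length."""
    usable = len(codes) - len(codes) % length  # drop an incomplete pattern
    return [codes[i:i + length] for i in range(0, usable, length)]


# Hypothetical input: one second of an alternating bit stream.
one_second = [1, 0] * (SAMPLE_RATE_BITS_PER_S // 2)
codes = code_numbers(one_second)
patterns = bar_patterns(codes)
```

One second of input yields 6000 code numbers and therefore 37 complete bar patterns, which matches roughly 3.75 patterns for each of about 10 letters.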

Claims

1. A method for speech recognition, in which spoken individual words are converted by means of a microphone into analog electrical speech signals, which are digitized after amplification and normalization, whereupon the digitized speech signals are recoded and compared with all reference sounds stored in a memory and coded in the same form, resulting in a sequence of sounds, also called a sample word, which is compared with the words of a phonological-orthographic dictionary, after which the word of the dictionary most similar to the sample word is identified and then reproduced,
characterized by the following process steps:

a) The amplified analog electrical voice signals are differentiated according to time.
b) The voice signals differentiated in this way are digitized by means of a delta modulator to generate a continuous bit stream of equivalent bits.
c) The bit stream generated in this way is subjected to the following recoding for data reduction and for the purpose of recognizing:
- c1) Bytes are formed from n, preferably eight, consecutive bits.
- c2) The number of "ones" in the bytes is determined to form a main code number S, the sum of which lies between 0 and n, in particular between 0 and 8.
d) A graphically representable bar pattern over the time axis is generated from the main code numbers S of the bit stream.
e) The bar pattern generated in this way is compared for sound recognition with reference patterns stored in the memory, resulting in a sound sequence corresponding to the spoken word, which represents the sample word.

2. The method according to claim 1, characterized in that, to form an additional code number F, the number of bit edges in the bytes, i.e. the transitions from zero to one and vice versa, is determined, the sum of which lies between zero and n, in particular between zero and eight, and that F bar patterns are formed from the additional code numbers F determined in this way, which additionally serve for sound recognition.

3. The method according to claim 1 or 2, characterized in that characteristic key figures are derived from the bar pattern and are compared with the key figures of known reference sounds stored in the memory.

4. The method according to claim 3, characterized in that the following criteria serve as key figures:

a) average bar height of a bar pattern assigned to a sound,
b) outline length divided by the sum of the bars (skyline) of a bar pattern,
c) symmetry of a bar pattern, calculated as the maximum coverage of a bar pattern with itself after a predetermined phase shift,
d) period length within a bar pattern, that is the phase shift that results when determining the symmetry,
e) average deviation of the bars in a cluster from their center (irregularity),
f) frequency of high bars with a gap of width 1 to the next bar (jaggedness).

5. The method according to any one of claims 1 to 4, characterized in that a key figure vector is calculated from a bar pattern of a certain length and is compared with reference vectors of the known reference sounds stored in the memory.

6. The method according to claim 5, characterized in that the bar pattern has length 160.

7. The method according to claim 1, characterized in that a neural network, into which reference sounds were previously fed in a learning phase, is used to decode the bar pattern.

8. The method according to claim 1, characterized in that the bar pattern is decoded using methods of optical pattern recognition.

9. The method according to any one of claims 1 to 5, characterized in that, for the recognition of sounds, the proximity of the determined key figures to the key figures of the reference sounds and the degree of agreement (coverage) of the unknown sound with the reference sounds are determined.

10. Circuit arrangement for carrying out the method according to one or more of claims 1 to 8, consisting of a microphone, a downstream amplifier with compressor and low-pass filter, a digital converter, and a computer for comparing the determined digitized signals with known reference signals, characterized in that a differentiating circuit is connected downstream of the amplifier and a delta modulator, serving as the digital converter, is connected downstream of the differentiating circuit.

11. Circuit arrangement according to claim 10, characterized in that the delta modulator is operated at a sampling frequency of 48 kHz.