RU2136059C1 - Device for identifying isolated words

Info

Publication number: RU2136059C1
Application number: RU98100939A
Authority: RU
Inventors: И.С. Брайнина; М.В. Кузнецов
Original assignee: Поволжский институт информатики, радиотехники и связи
Priority date: 1998-01-05
Filing date: 1998-01-05
Publication date: 1999-08-27

Abstract

FIELD: digital speech processing for human-machine dialog, voice-controlled switching in telephone networks, medicine, etc. SUBSTANCE: the device has a word beginning and end analyzer, a word segmentation unit, a processing unit incorporating a mean zero-crossing meter and a mean signal energy meter, and a series-connected preliminary classification unit and final classification unit whose output serves as the device output. The novelty is that the device additionally has a random-access memory, a word duration meter, an analog-to-digital level converter, and an analog-to-digital zero-count converter. The first output of the word beginning and end analyzer is connected to the data input of the random-access memory, whose first and second outputs are connected, respectively, to the data inputs of the level and zero-count converters; the converter outputs are connected, respectively, to the first and second inputs of the preliminary classification unit, whose third input is combined with the reset inputs of both analog-to-digital converters and connected to the output of the word segmentation unit. The reference inputs of the two converters are connected, respectively, to the outputs of the mean signal energy meter and the mean zero-crossing meter, whose inputs are connected to the first output of the word beginning and end analyzer; the second output of the analyzer is connected to the combined write/read inputs of the random-access memory and to the input of the word duration meter, whose output is connected to the input of the word segmentation unit. EFFECT: improved accuracy of isolated-word recognition irrespective of the rate, loudness, and fundamental frequency of the speaker's speech. 1 dwg

Description

The present invention relates to digital speech processing and can be used in various applications, such as human-computer voice communication systems, automatic voice-controlled switching in telephone networks, medical studies of pathologies of patients' vocal tracts, and others.

A known system implemented on a computer /1/ recognizes 200 isolated words (separated by sufficiently long pauses) spoken by the individual speakers whose speech was used for training.

The drawback of this system is its technical complexity, which prevents real-time operation. Processing a spoken word takes approximately 22 times longer than real time, which slows speech recognition and narrows the range of applications of the system.

Also known is a speaker-independent system for recognizing isolated digits, containing a series-connected word beginning and end analyzer, a word segmentation unit, and a processing unit, as well as preliminary and final classification units /2/.

The drawback of the described prototype is its low accuracy in recognizing isolated words, because the accuracy depends on the speaking rate, the loudness, and the fundamental (pitch) frequency of the speaker's voice. It is known that the duration and loudness of a given word spoken by the same speaker at different times are not the same. This is all the more true for a set of different voices, which also differ in fundamental frequency. This makes it difficult to identify a word when comparing it with the reference patterns stored in the memory of the recognition device.

Other drawbacks of the prototype include its relative circuit complexity, associated with the implementation of linear prediction of speech, and some redundancy in the number of speech parameters measured in the processing unit, which are mutually correlated and duplicate one another.

The technical result of the proposed invention is improved accuracy of isolated-word recognition regardless of the speaking rate, loudness, and fundamental frequency of an arbitrary speaker.

The essence of the invention is as follows. The device for recognizing isolated words contains a word beginning and end analyzer, a word segmentation unit, a processing unit including a mean zero-crossing meter and a mean signal energy meter, and a series-connected preliminary classification unit and final classification unit, the output of which is the output of the device. Into this device are additionally introduced a random-access memory (RAM), a word duration meter, an analog-to-digital level converter (level ADC), and an analog-to-digital zero-count converter (zero-count ADC). The first output of the word beginning and end analyzer is connected to the data input of the RAM, the first and second outputs of which are connected respectively to the data inputs of the level ADC and the zero-count ADC. The outputs of these converters are fed respectively to the first and second inputs of the preliminary classification unit, the third input of which is combined with the reset inputs of the level ADC and the zero-count ADC and is connected to the output of the word segmentation unit. The reference inputs of the level ADC and the zero-count ADC are connected respectively to the outputs of the mean signal energy meter and the mean zero-crossing meter, the inputs of which are connected to the first output of the word beginning and end analyzer. The second output of the analyzer is fed to the combined write/read inputs of the RAM and to the input of the word duration meter, the output of which is connected to the input of the word segmentation unit.

The drawing shows a block diagram of the device for recognizing isolated words.

The device contains, connected in series, a word beginning and end analyzer 1, a random-access memory (RAM) 2, a processing unit 3 comprising a mean signal energy meter 4 and a mean zero-crossing meter 5, a level ADC 6, a zero-count ADC 7, a preliminary classification unit 8, a final classification unit 9, a word duration meter 10, and a word segmentation unit 11.

The device operates as follows. Speech signal samples arrive at the input of the word beginning and end analyzer 1 at a sampling rate of F = 8 kHz. Because each isolated word is preceded by a pause, the beginning and end of the word can be determined reliably by setting an adaptive level threshold. To this end, the analyzer 1 measures the mean noise level during the pause between words and, from these measurements, sets an adaptive threshold that exceeds the maximum noise value. The moment the signal exceeds this threshold is taken as the beginning of the word, and the moment after which the signal stays below the threshold for a specified time is taken as the end of the word. The adaptive threshold ensures reliable detection of the beginning and end of the word over a wide range of signal-to-noise ratios at the device input, provided that (P_s / P_n) > 10, where P_s and P_n are the signal and noise powers.
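The endpoint detection described above can be sketched in software as follows. This is only an illustration under stated assumptions: the function name `find_word_bounds`, the parameters `noise_len` (number of leading pause samples), `hold` (the "specified time" in samples), and `margin` are hypothetical, and the threshold is taken simply as the largest noise magnitude in the pause times `margin`.

```python
from typing import List, Optional, Tuple


def find_word_bounds(samples: List[int], noise_len: int, hold: int,
                     margin: float = 1.0) -> Optional[Tuple[int, int]]:
    """Return (start, end) so that samples[start:end] is the word, or None."""
    # Adaptive threshold (assumption): largest noise magnitude in the leading pause.
    threshold = margin * max(abs(s) for s in samples[:noise_len])
    start = None
    quiet = 0  # length of the current run of samples not above the threshold
    for i in range(noise_len, len(samples)):
        if abs(samples[i]) > threshold:
            if start is None:
                start = i          # first sample above the threshold: word begins
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hold:      # signal stayed low for the hold time: word ended
                return start, i - hold + 1
    if start is None:
        return None                # threshold never exceeded
    return start, len(samples) - quiet


signal = [1, -1, 1, 0, 10, -12, 0, 9, 0, 1, 0, 0, 0]
bounds = find_word_bounds(signal, noise_len=4, hold=3)
```

A single low sample inside the word (index 6 in the example) does not end the word, because the quiet run must last `hold` samples.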

Once the beginning of the word has been detected, the speech signal samples are written to RAM 2 until the word ends. At the same time, the word duration meter 10 measures the duration of the word so that the word can then be divided into segments of optimal duration. This normalizes the speaking rate: regardless of how long the word takes to pronounce, the word segmentation unit 11 divides it into a fixed number n of segments. Slow speech yields longer segments, and fast speech yields shorter ones. In the prototype /2/, segments of fixed duration were used, so the same word spoken at different rates contained a different number of intervals. This amounted to a change of scale along the time axis, which made it difficult to compare the word objectively with the reference stored in the preliminary classification unit 8.
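A minimal sketch of this rate normalization, assuming the word is already held as a buffer of `word_len` samples: the word is cut into a fixed number of segments whose length scales with the word duration. The helper name `segment_bounds` is hypothetical.

```python
def segment_bounds(word_len: int, n: int = 16):
    """Split word_len samples into n consecutive segments of near-equal length."""
    # Integer arithmetic keeps the segments contiguous and covers every sample.
    return [(k * word_len // n, (k + 1) * word_len // n) for k in range(n)]


fast = segment_bounds(4000)   # a 0.5 s word at 8 kHz
slow = segment_bounds(4800)   # the same word spoken in 0.6 s
```

Both words yield 16 segments; only the segment length differs (250 versus 300 samples), so segment k of one utterance lines up with segment k of the other.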

While the word is being spoken, the processing unit 3 measures the mean signal energy and the mean number of zero crossings. As in /2/, the energy parameter is the mean absolute sample value, computed as the arithmetic mean of the absolute values of the signal samples over the whole word. The mean absolute value makes it possible to normalize the speech signal by level and to remove the dependence of recognition accuracy on loudness. Choosing in the level ADC 6 an adaptive quantization step proportional to the mean absolute sample value provides automatic speech level control. Loud voices get a large quantization step and quiet voices a small one, so the number of significant bits of the binary code at the output of the level ADC 6 is the same in both cases.
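The loudness normalization can be illustrated with a short sketch. It is an assumption-laden stand-in: per-segment levels are divided by the whole-word mean absolute value rather than passed through a hardware converter, and the names `mean_abs` and `normalized_levels` are hypothetical. The word must contain at least n samples and must not be all zeros.

```python
def mean_abs(samples):
    """Arithmetic mean of the absolute sample values."""
    return sum(abs(s) for s in samples) / len(samples)


def normalized_levels(word, n=16):
    """Per-segment mean absolute value, relative to the whole-word mean."""
    reference = mean_abs(word)          # whole-word reference level
    size = len(word)
    levels = []
    for k in range(n):
        segment = word[k * size // n:(k + 1) * size // n]
        levels.append(mean_abs(segment) / reference)
    return levels


quiet_word = [1, -1, 3, -3] * 8
loud_word = [4 * s for s in quiet_word]   # same word, four times louder
```

Because the reference scales with the signal, `normalized_levels(quiet_word)` and `normalized_levels(loud_word)` coincide, which is the effect the adaptive step is meant to achieve.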

Similarly, the output of the zero-count ADC 7 makes it possible to normalize voices by fundamental frequency. The mean number of zero crossings per unit time over the duration of the word is proportional to the fundamental frequency of the speech. For male voices the zero-crossing rate is low, because low frequencies dominate the speech. For female and children's voices the fundamental frequency is on average 1.5 to 2 times higher, higher frequencies dominate, and the zero-crossing rate rises accordingly.
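The relation between zero-crossing count and pitch can be checked on a toy signal. The sketch below is illustrative only: it uses square waves instead of speech, treats a zero sample as non-negative (an assumption), and the names `zero_crossings` and `square_wave` are hypothetical.

```python
def zero_crossings(samples):
    """Count sign changes between consecutive samples (zero counts as non-negative)."""
    signs = [s >= 0 for s in samples]
    return sum(a != b for a, b in zip(signs, signs[1:]))


def square_wave(half_period, cycles):
    """Toy periodic signal: half_period samples at +1, then half_period at -1."""
    return ([1] * half_period + [-1] * half_period) * cycles


F = 8000                                  # sampling rate from the text, Hz
low_pitch = square_wave(40, 100)          # 100 Hz for one second
high_pitch = square_wave(20, 200)         # 200 Hz for one second
low_count = zero_crossings(low_pitch)
high_count = zero_crossings(high_pitch)
```

Doubling the fundamental frequency roughly doubles the count over the same interval, which is why the count can serve as a pitch reference.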

Choosing in the zero-count ADC 7 an adaptive quantization step proportional to the rate of sign changes of the speech signal keeps the bit width m of the binary code at the second input of the preliminary classification unit 8 constant.

Thus, each spoken word can be associated with two normalized sound images. The first represents, in digital form, the dependence of the normalized signal level on the segment number (from the first to the n-th, where n is the fixed number of segments into which each word is divided).

The second sound image represents the dependence of the normalized instantaneous signal frequency on discrete time, i.e., on the segment number within the word.

Each of these two sound images is represented by a sequence of n binary codes. The bit width m of the codes must be chosen as a compromise.

On the one hand, increasing the number of bits m improves the accuracy of the digital representation of the sound image and conveys more information about how the level and frequency of the speech signal change over the duration of the word. On the other hand, the requirement that recognition accuracy be independent of the individual characteristics of different speakers' voices calls for a smaller m. In that case the sound image retains only the basic information common to all voices pronouncing the word, and individual differences are lost. Numerous experiments on a personal computer showed that the optimal bit width of the codes at the outputs of the level ADC 6 and the zero-count ADC 7 is m = 2 to 3, which corresponds to N = 4 to 8 gradations of normalized level and frequency. The adaptive quantization step in level and in frequency is then N/2 times smaller than the mean level and the mean zero-crossing rate, respectively, over the word.
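A minimal sketch of such an adaptive quantizer follows. The step of reference / (N/2) comes from the text; truncation toward zero and clipping at the top code N - 1 are assumptions, as is the function name `quantize`. The reference (whole-word mean level or mean zero-crossing rate) must be positive.

```python
def quantize(values, reference, m=2):
    """Map non-negative values to m-bit codes with a step of reference / (N / 2)."""
    n_levels = 1 << m                    # N = 2**m gradations
    step = reference / (n_levels / 2)    # step is N/2 times smaller than the mean
    # Assumption: truncate, then clip values above the top code N - 1.
    return [min(n_levels - 1, int(v / step)) for v in values]


levels = [1.0, 3.0, 2.0, 9.0]
codes_2bit = quantize(levels, reference=2.0, m=2)
codes_3bit = quantize(levels, reference=2.0, m=3)
codes_loud = quantize([4 * v for v in levels], reference=8.0, m=2)
```

A value equal to the reference lands on code N/2, the middle of the range, and scaling both the values and the reference leaves the codes unchanged.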

The fixed number of segments n into which each word is divided is also chosen as a compromise. On the one hand, a larger n gives a more detailed digital representation of the continuous changes in loudness and frequency of the voice pronouncing the word. On the other hand, a larger n shortens each segment and increases the error in averaging the level and the number of sign changes over a short interval. It is known that the shortest unvoiced speech sounds last about 30 to 40 ms. This interval is taken as the stationarity interval of speech, over which it is reasonable to average the parameters of the speech signal. Since one letter of a word lasts about 0.1 s on average and a word lasts 0.5 to 0.6 s on average, the optimal number of segments is n = 12 to 20. It is convenient to choose n = 16: the segment duration code is then obtained simply by shifting the binary code of the sample count at the output of the word duration meter 10 by four bits to the left.
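The arithmetic behind n = 16 can be checked directly. In this sketch the division by 16 is written as a 4-bit right shift, following Python's convention; the text names the shift direction as left, which may reflect how the register bits are numbered. The names `segment_length` and `segment_ms` are hypothetical.

```python
F = 8000  # sampling rate from the text, Hz


def segment_length(word_samples: int) -> int:
    """Samples per segment for n = 16: a 4-bit shift of the word sample count."""
    return word_samples >> 4


def segment_ms(word_samples: int) -> float:
    """Segment duration in milliseconds."""
    return 1000 * segment_length(word_samples) / F


short_word = int(0.5 * F)   # 4000 samples
long_word = int(0.6 * F)    # 4800 samples
```

For words of 0.5 to 0.6 s the segments last 31.25 to 37.5 ms, inside the 30 to 40 ms stationarity interval quoted above.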

As a result, each word corresponds to two sets of 16 codes of two to three bits each, representing the change in normalized loudness and in voice frequency, respectively, over the spoken word divided into a fixed number of segments.

The read-only memory (ROM) of the preliminary classification unit 8 stores two reference code sets for each word, obtained by averaging the pronunciations of many different speakers. Each reference set likewise contains n = 16 binary codes of 2 to 3 bits. During recognition, the preliminary classification unit 8 writes the information arriving at its first and second inputs to RAM and compares it with the reference code sets stored in ROM that characterize each word. The code distances d_i between the current and reference codes of each i-th segment (i = 1, 2, ..., 16) are determined for each of the K words stored in the ROM of the preliminary classification unit 8. The process ends with the computation of the mean squared distance L_j² between the set of n codes of the received word and each j-th reference set representing the change in loudness or frequency of the j-th word.
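The comparison with a reference set can be sketched as below. The text does not define the code distance d_i; taking it as the difference of the code values in segment i is an assumption, and the function name `mean_square_distance` is hypothetical.

```python
def mean_square_distance(codes, reference):
    """Mean of squared per-segment code distances (L_j squared in the text)."""
    if len(codes) != len(reference):
        raise ValueError("both code sets must cover the same number of segments")
    # Assumption: d_i is the difference of the code values in segment i.
    return sum((c - r) ** 2 for c, r in zip(codes, reference)) / len(codes)


received = [1, 3, 2, 0]
stored = [1, 1, 2, 2]
distance = mean_square_distance(received, stored)
```

A word that matches its reference exactly gives a distance of zero, and every mismatched segment raises the distance by the square of the code difference divided by n.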

The final classification unit 9 must decide which of the K words was spoken by comparing the mean squared distances L_j² with one another and selecting the word number r for which L_r² is minimal.

To simplify and speed up the decision, it is desirable to avoid exhaustively enumerating and comparing all K pairs of values L_j², two for each of the K words. To this end, the preliminary classification unit 8 divides all K possible words into several groups according to a number of features. For example, the Russian spoken digits 0 to 9 can be divided into two groups: the first containing six monosyllabic words (0, 2, 3, 5, 6, 7) and the second containing four two-syllable words (1, 4, 8, 9). Within each group, in turn, one can distinguish a subgroup of words with the stressed vowel at the beginning of the word, (0, 5, 6, 7), or at its end, (2, 3), for the first group, and (8, 9) or (1, 4), respectively, for the second group. Voiced vowels can be isolated and stressed vowels located by analyzing the sound images that represent the changes in loudness and voice frequency over time. Stressed vowels correspond to stretches of maximum loudness and maximum duration spanning several consecutive segments of the speech signal. Unstressed vowels also show some rise in signal level compared with unvoiced sounds. In addition, vowels are stationary (their level and zero-crossing rate stay approximately constant) over several speech segments, which is not true of unvoiced consonants, especially voiceless ones (such as "t", "ch", "p", "sh", "s"), which also have a much lower loudness than vowels and a higher zero-crossing rate.
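One of the grouping features, the position of the stressed vowel, might be located along the following lines. This is a rough sketch rather than the method of the device: it assumes the stressed vowel is the longest run of segments at the maximum level code and splits the word at its midpoint. The names `stressed_run` and `stress_position` are hypothetical.

```python
def stressed_run(level_codes):
    """Return (start, length) of the longest run of segments at the maximum code."""
    peak = max(level_codes)
    best = (0, 0)
    start = None
    # A trailing None acts as a sentinel that closes a run ending at the last segment.
    for i, code in enumerate(list(level_codes) + [None]):
        if code == peak:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best[1]:
                best = (start, i - start)
            start = None
    return best


def stress_position(level_codes):
    """Classify the stressed vowel as lying in the first or second half of the word."""
    start, length = stressed_run(level_codes)
    centre = start + length / 2
    return "start" if centre < len(level_codes) / 2 else "end"


early = [0, 1, 3, 3, 3, 2, 1, 1, 2, 3, 1, 0, 0, 0, 0, 0]
late = [0, 0, 1, 1, 1, 1, 2, 2, 1, 1, 3, 3, 3, 3, 1, 0]
```

A practical classifier would combine this with the zero-crossing image and the stationarity cues mentioned above; the sketch uses the level image alone.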

Based on a number of features, the preliminary classification unit 8 assigns the spoken word, with the highest probability, to one of the groups, which contains a much smaller number of words Q (Q << K). The number of this group is passed to the final classification unit 9, which enumerates and compares the values L_j², j = 1, 2, ..., K.

The final decision is made in favor of the j-th word among the Q candidates for which the values L_j² are minimal simultaneously for both the sound image representing loudness and the image characterizing the zero-crossing rate.

If these conditions are not met for any of the Q words, the final classification unit 9 outputs a repeat-request signal asking for the word to be spoken again.
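The decision rule, including the repeat request, can be sketched as follows. The dictionaries mapping each candidate word to its L_j² value, the function name `decide`, and the use of `None` for the repeat-request signal are all illustrative assumptions.

```python
from typing import Dict, Optional


def decide(level_dist: Dict[str, float], zc_dist: Dict[str, float]) -> Optional[str]:
    """Return the word that is closest in both sound images, or None to ask again."""
    best_by_level = min(level_dist, key=level_dist.get)   # minimum for the loudness image
    best_by_zc = min(zc_dist, key=zc_dist.get)            # minimum for the zero-crossing image
    # Accept only if the same word wins in both images; otherwise request a repeat.
    return best_by_level if best_by_level == best_by_zc else None


agreed = decide({"5": 0.2, "6": 0.9}, {"5": 0.1, "6": 0.4})
conflict = decide({"5": 0.2, "6": 0.9}, {"5": 0.5, "6": 0.4})
```

Only the Q candidates of the selected group need to appear in the dictionaries, which is what keeps the search short.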

Interfacing the proposed isolated-word recognition device with a personal computer provides voice input of information into a computer whose memory holds pre-recorded reference patterns of the spoken words. In the simplest case, recognizing the isolated digits 0 to 9 together with the words "point", "enter", and "backspace" frees the operator from using the keyboard when entering numeric data into the computer's memory.

Given the capabilities of modern components, the described device is implemented digitally either with discrete hard-wired logic chips combined with LSI RAM and ROM, or with medium-speed, low-power CMOS microprocessors.

In both versions recognition takes place in real time: the preceding word is recognized during the pause between words.

References:
1. J. D. Markel, A. H. Gray. Linear Prediction of Speech. Moscow: Svyaz, 1980, pp. 282-283.
2. L. R. Rabiner, R. W. Schafer. Digital Processing of Speech Signals. Moscow: Radio i Svyaz, 1981, pp. 456-464.

Claims

A device for recognizing isolated words, comprising a word beginning and end analyzer, a word segmentation unit, a processing unit including a mean zero-crossing meter and a mean signal energy meter, and a series-connected preliminary classification unit and final classification unit, the output of which is the output of the device, characterized in that it additionally comprises a random-access memory, a word duration meter, an analog-to-digital level converter, and an analog-to-digital zero-count converter, wherein the first output of the word beginning and end analyzer is connected to the data input of the random-access memory, the first and second outputs of which are connected respectively to the data inputs of the analog-to-digital level converter and the analog-to-digital zero-count converter, whose outputs are fed respectively to the first and second inputs of the preliminary classification unit, the third input of which is combined with the reset inputs of the analog-to-digital level converter and the analog-to-digital zero-count converter and is connected to the output of the word segmentation unit, the reference inputs of the analog-to-digital level converter and the analog-to-digital zero-count converter are connected respectively to the outputs of the mean signal energy meter and the mean zero-crossing meter, the inputs of which are connected to the first output of the word beginning and end analyzer, the second output of which is fed to the combined write/read inputs of the random-access memory and to the input of the word duration meter, the output of which is connected to the input of the word segmentation unit.