RU2297676C2

RU2297676C2 - Method for recognizing words in continuous speech

Info

Publication number: RU2297676C2
Application number: RU2005108961/09A
Authority: RU
Inventors: Александр Владимирович Аграновский (RU); Дмитрий Анатольевич Леднов (RU); Михаил Юрьевич Зулкарнеев (RU); Роман Эрнстович Арутюнян (RU)
Priority date: 2005-03-30
Filing date: 2005-03-30
Publication date: 2007-04-20
Also published as: RU2005108961A

Abstract

FIELD: automation and computer engineering; the method can be used in control systems for industrial, household and other equipment, in automatic reference systems, automatic translation systems, and speech understanding systems.

SUBSTANCE: as a speech utterance is pronounced, samples of its acoustic signal, digitized at a given quantization frequency, are taken periodically at fixed time intervals, and a functional that determines the current acoustic state is computed from each set of samples. The resulting sequence of current acoustic states is used to restore the sequence of words (the working hypothesis) pronounced in the original utterance. For this purpose a lexical decoding network is used, which specifies the order in which reference acoustic states may follow one another in the language. The working hypothesis that is found is optimal in the sense of maximum match with the original speech signal, which is ensured by the moving-marker algorithm; the working hypothesis is restored from the marker located at that moment at the final vertex of the lexical decoding network.

EFFECT: increased accuracy of word recognition in continuous speech.

4 cl, 12 dwg

Description

The invention relates to automation and computer engineering and can be used in control systems for industrial, household and other equipment, in automatic reference systems, automatic translation systems, speech understanding systems, and so on.

A method for recognizing words in continuous speech is known that is implemented in the automatic speech understanding system of LDC [1].

The essence of the method is that, as a speech utterance is pronounced, samples of the acoustic signal of the utterance, digitized at a given quantization frequency, are taken periodically at fixed time intervals, and from the set of these samples a functional is computed that determines the current acoustic state. The resulting sequence of current acoustic states is used to restore the sequence of words (the working hypothesis) pronounced in the original utterance.

A feature of the known method is that the sequence of current acoustic states is converted into an A-matrix, which is used in further analysis instead of the original speech signal. The A-matrix contains various parameters of the speech signal, including its phonetic transcription. A working hypothesis, i.e. the sequence of words presumably pronounced in the original utterance, is obtained from the A-matrix. The A-matrix is processed from left to right: the word pre-selection unit processes words from the list of the most probable words and produces information on how well a given word matches the initial portion of the speech signal; the control unit then analyzes this information and, using the available syntactic and semantic information, determines the list of words that may follow the given word and passes them to the word pre-selection unit. The procedure is repeated until the end of the utterance is reached.

The disadvantage of this method is low recognition accuracy, which arises because: 1) the strategy for searching for the working hypothesis is not optimal in the sense of maximizing its degree of match with the original speech signal, since a decision on the optimality of the hypothesis is made at every comparison of a word with the speech signal, so the optimal hypothesis may be lost; 2) the search for the working hypothesis is a two-stage process: the A-matrix is computed at the first stage and the hypothesis is found at the second, so the search relies on a phonetic transcription that may contain errors.

A method for recognizing words in continuous speech is also known that is implemented in the HWIM automatic speech understanding system of BBN [2].

A feature of this method is that the sequence of current acoustic states is converted into a set of phonetic transcriptions, which are recorded as a segment lattice. The resulting segment lattice is used in the search for the working hypothesis. The constructions that best correspond to words, regardless of their position, are sent to the word verification unit, where the degree of match is evaluated. The verification results are combined with the results of lexical matching and, if the combined score is high enough, the words are sent to the syntactic prediction unit. Words generated by the syntactic prediction unit according to the rules of the grammar in use are appended to the right and left of the base word and sent to the word verification unit. The process is repeated until the entire speech signal has been recognized. Words are matched using lexical decoding networks, which represent all possible phonetic representations of a word in all possible phonetic contexts.

The disadvantage of this method is low recognition accuracy, which is due to the following factors: 1) the strategy for searching for the working hypothesis is not optimal in the sense of maximizing its degree of match with the original speech signal, since a decision on the optimality of the hypothesis is made at every comparison of a word with the speech signal, so the optimal hypothesis may be lost; 2) the search for the working hypothesis is a two-stage process: the segment lattice is computed at the first stage and the working hypothesis is found at the second, so the search relies on a segment lattice that may contain errors. Unlike the first analogue, however, this method is more robust to phonetic transcription errors.

The closest to the proposed method is the method for recognizing words in continuous speech [3], taken as the prototype. In it, as a speech utterance is pronounced, samples of the acoustic signal of the utterance, digitized at a given quantization frequency, are taken periodically at fixed time intervals, and from the set of these samples a functional is computed that determines the current acoustic state. The resulting sequence of current acoustic states is used to restore the sequence of words (the working hypothesis) pronounced in the original utterance.

A feature of this method is that the working hypothesis is found directly from the sequence of current acoustic states by means of a lexical decoding network whose vertices are reference acoustic states and whose transitions specify the reference acoustic states that may come next. The lexical decoding network specifies the regularities with which reference acoustic states follow one another, in accordance with the grammatical and phonetic rules of the language.

Decoding begins by selecting the initial vertices of the lexical decoding network and finding the vertex whose reference acoustic state is closest to the current acoustic state. The number of the closest vertex is recorded in the working path storage unit; the working path is a sequence of vertex numbers. The next current acoustic state is compared with the states associated with the vertices of the network that may come next, and the closest vertex is again recorded in the working path storage unit. The process is repeated until no more current acoustic states corresponding to the speech signal arrive. When the process is complete, the working path that was found is converted into the working hypothesis.
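The step-by-step choice made by the prototype can be sketched as follows. This is a minimal illustration, not the implementation described in [3]: the names `greedy_decode`, `successors`, `reference` and `distance`, and the use of plain numbers as acoustic states, are assumptions made for the example.

```python
def greedy_decode(states, start_vertices, successors, reference, distance):
    """Prototype-style decoding: at each time step keep only the single
    vertex whose reference state is closest to the current acoustic state.

    states         -- sequence of current acoustic states (hypothetical scalars here)
    start_vertices -- vertices allowed at the first time step
    successors     -- dict: vertex -> list of vertices that may come next
    reference      -- dict: vertex -> reference acoustic state
    distance       -- function(current_state, reference_state) -> float
    """
    working_path = []
    candidates = list(start_vertices)
    for v in states:
        # The decision is final at every step; rejected candidates are forgotten.
        best = min(candidates, key=lambda j: distance(v, reference[j]))
        working_path.append(best)
        candidates = successors[best]
    return working_path


# Toy network with three vertices in a left-to-right chain (assumed data).
reference = {0: 0.0, 1: 1.0, 2: 2.0}
successors = {0: [0, 1], 1: [1, 2], 2: [2]}
path = greedy_decode([0.1, 0.9, 2.1], [0], successors, reference,
                     lambda a, b: abs(a - b))
```

Because only one vertex survives each step, a locally best choice can exclude the path that would have matched the whole signal best.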

The disadvantage of the prototype is low recognition accuracy, which is due to the following factors: 1) the strategy for searching for the working hypothesis is not optimal in the sense of maximizing its degree of match with the original speech signal, since a decision on the optimality of the hypothesis is made at every moment of time; 2) no language model [5, p. 539] is used in constructing the lexical decoding network; 3) information on the average duration of acoustic states [4, p. 259] is not taken into account when the degree of match is calculated.

The technical result obtained by implementing the invention is increased accuracy of word recognition in continuous speech.

This technical result is achieved as follows. In the known method for recognizing words in continuous speech, as a speech utterance is pronounced, samples of the acoustic signal of the utterance, digitized at a given quantization frequency, are taken periodically at fixed time intervals, and from the set of these samples a functional is computed that determines the current acoustic state; the resulting sequence of current acoustic states is used to restore the sequence of words (the working hypothesis) pronounced in the original utterance by applying a lexical decoding network that specifies the regularities with which reference acoustic states follow one another in the language. To increase recognition accuracy, a search is performed for the working hypothesis that is optimal in the sense of maximizing its degree of match with the original speech signal, using the moving-marker algorithm, and the working hypothesis is restored from the marker located at that moment at the final vertex of the lexical decoding network.

A feature of this method is that the search for the working hypothesis is optimal in the sense of maximizing its degree of match with the original speech signal, because the moving-marker algorithm used for the search is based on dynamic programming [6, p. 74].

In addition, a language model [5, p. 539] and/or state transition probabilities, through which the average phoneme durations are taken into account, can be used in constructing the lexical decoding network.

The invention is illustrated by the drawings: Figs. 1-7 show the stages of constructing the lexical decoding network; Figs. 8 and 9 show the moving-marker algorithm; Figs. 10-12 show a device implementing the method.

The lexical decoding network is created by performing the following operations: representing speech by a finite set of words; building the possible phonetic representations of the words; building a language model; creating a database of context-dependent phonemes (triphones); creating a word graph based on the language model; creating, for each word of the set Y, a phonetic network of the word that specifies its possible phonetic realizations; and expanding the word graph by means of the phonetic networks and replacing phonemes with triphones.

These operations are as follows.

1) Speech is represented by a finite set of words:

Y = {w_i},

where w_i is the i-th word, i = 1, ..., n_Y.

2) Words are represented as sequences of phonemes:

w_i = {(f_1j, f_2j, ..., f_kj, ...)}, k = 1, ..., n_fj, j = 1, ..., n_p,

where n_p is the number of possible phonetic representations of the word w_i, n_fj is the number of phonemes in the j-th phonetic representation of the word w_i, and f_kj is the k-th phoneme in the j-th phonetic representation of the word w_i.

3) A language model is created, for which the following probabilities are computed:

p(w_i/W_n),

where W_n is the sequence of n words preceding the word w_i.

4) A database of models of context-dependent phonemes (triphones) is created. Each model is a network (Fig. 6) whose vertices are reference acoustic states and whose arcs specify the possible transitions between reference acoustic states in the language. Each transition has an associated state transition probability; these probabilities implicitly specify the durations of the acoustic states.

5) The lexical decoding network is constructed taking items 1) to 4) into account.
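The language-model step in item 3) can be illustrated for the simplest case, where the history W_n is a single preceding word (n = 1). The sketch below estimates p(w_i/W_1) by relative frequency. The function name `bigram_model`, the sentence markers `<s>` and `</s>`, the absence of smoothing and the toy corpus are all assumptions made for the example; the text does not specify how the probabilities are estimated.

```python
from collections import Counter


def bigram_model(sentences):
    """Estimate p(w_i / w_{i-1}) by relative frequency, without smoothing.

    sentences -- iterable of word lists (hypothetical training corpus)
    Returns a dict mapping (previous_word, word) -> probability.
    """
    context_counts = Counter()
    pair_counts = Counter()
    for words in sentences:
        # Assumed sentence-boundary symbols, so that the first and last
        # words of a sentence also receive probabilities.
        padded = ["<s>"] + list(words) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return {pair: count / context_counts[pair[0]]
            for pair, count in pair_counts.items()}


# Toy corpus (assumed data).
corpus = [["call", "lawyer"], ["call", "doctor"], ["call", "lawyer"]]
model = bigram_model(corpus)
```

In a word graph built from such a model, each probability would label the arc from the node of the previous word to the node of the next word.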

The stages of constructing the lexical decoding network are shown using the example of a network that uses a bigram language model.

The stages of constructing the lexical decoding network are illustrated by the drawings: Fig. 1 shows the orthographic and phonetic representations of words; Fig. 2 shows the word graph based on the bigram language model, which is used to specify the word sequences possible in the language; Fig. 3 shows the UNK subgraph, used to model words that are not in the set Y; Fig. 4 shows the phonetic network of the word "АДВОКАТ" ("lawyer"); Fig. 5 shows the triphone network of the word "АДВОКАТ"; Fig. 6 shows the network of the word "АДВОКАТ" whose vertices are acoustic states; Fig. 7 shows the lexical decoding network.

At the first stage, the vocabulary Y of the spoken communication is defined. The set Y contains the class word UNK, which stands for words that are not in the set. Next, the lexical and phonetic representation of each word is determined (Fig. 1), and a language model for the vocabulary Y and a database of the triphones of the language are built. At the second stage, a decoding network template is created (Fig. 2) in accordance with the available language model. The template contains an initial node, in which the system resides before recognition starts, and a final node, in which the system resides after recognition ends. The node corresponding to the word UNK is a subgraph (Fig. 3) and is intended to model words that are not in the vocabulary. The structure of the UNK subgraph is chosen so that it can model any sequence of phonemes. At the third stage, a graph is built for each word that models all expected phonetic representations of the word (Fig. 4). The vertices of the graph are phonemes, and the arcs point to the phonemes that may come next. The final state of the word graph is drawn as a rectangle and marks the end of the word. The phonemes are then replaced by context-dependent phonemes (triphones) (Fig. 5), and these in turn by the acoustic states of which they consist (Fig. 6), which are taken from the triphone database. At the fourth stage, the word graphs that have been built are substituted into the decoding network template, which yields the lexical decoding network (Fig. 7).
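The replacement of phonemes by context-dependent phonemes and then by acoustic states can be sketched for a single linear pronunciation. This is an illustration only: the `left-center+right` label format, the `sil` boundary symbol, the figure of three states per triphone and the function names are assumptions, and a full phonetic network with alternative pronunciations (as in Fig. 4) would be a graph rather than a list.

```python
def to_triphones(phonemes, boundary="sil"):
    """Turn a phoneme sequence into context-dependent labels.

    Each phoneme is labelled with its left and right neighbours in an
    assumed 'left-center+right' format; `boundary` (assumed) stands for
    the context outside the word.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{left}-{center}+{right}"
            for left, center, right in zip(padded, padded[1:], padded[2:])]


def to_states(triphones, states_per_triphone=3):
    """Expand each triphone into a chain of acoustic-state identifiers.

    The number of states per triphone is an assumption of this sketch;
    in the described method the states come from the triphone database.
    """
    return [(triphone, k)
            for triphone in triphones
            for k in range(states_per_triphone)]


# Hypothetical three-phoneme fragment of a pronunciation.
triphones = to_triphones(["a", "d", "v"])
states = to_states(triphones)
```

Substituting such state chains for every word node of the template gives a network whose vertices are acoustic states, which is the form the decoder operates on.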

The operation of the moving-marker algorithm is illustrated by Figs. 8 and 9: Fig. 8 shows the data structure called a marker, and Fig. 9 shows the moving-marker algorithm.

To implement the moving-marker algorithm (Fig. 9), each vertex of the lexical decoding network contains a pointer to a data structure called a marker (Fig. 8), which stores information about the partial path ending at that vertex. A marker consists of two fields: the degree of match of the partial path passing through vertex j at time t, and a record of the state corresponding to vertex j at time t.

When the algorithm is initialized, a marker with a zero degree of match is placed in the initial node of the network. The algorithm then starts. When the next current acoustic state arrives, copies of the markers in each vertex of the lexical decoding network are moved to all vertices that may follow it, and the degree of match of each moved marker is recalculated according to a formula whose terms are: the previous value of the degree of match; p_ij, the probability of transition between the reference states associated with vertices i and j of the lexical decoding network; b_j(V_t), the degree of match between the current acoustic state V_t and the reference state associated with vertex j; w_i, the word to which the state associated with vertex i belongs; and w_j, the word to which the state associated with vertex j belongs. The original markers are then deleted. In each vertex of the lexical decoding network all markers are deleted except the one with the maximum degree of match; this procedure is called normalization of the marker set. Once all current acoustic states corresponding to the speech signal have been processed, the algorithm terminates. The working path is extracted from the marker located at the final vertex of the lexical decoding network, and the working hypothesis is obtained from it.
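The marker movement and normalization described above can be sketched as follows. This is a hedged illustration rather than the patented formula, which is not reproduced in this text. The sketch assumes that scores are additive (log-domain), that each arc carries one weight combining the transition term for p_ij and, on arcs that cross a word boundary, the language-model term for w_i and w_j, and that the final vertex is an ordinary vertex of the network. All names (`moving_marker_decode`, `arcs`, `emit_logprob`) are hypothetical.

```python
def moving_marker_decode(observations, arcs, emit_logprob, start, final):
    """Dynamic-programming search with one surviving marker per vertex.

    observations -- sequence of current acoustic states V_t
    arcs         -- dict: vertex -> list of (next_vertex, log_weight);
                    log_weight is assumed to hold log p_ij plus any
                    language-model term for a word-boundary arc
    emit_logprob -- function(j, v): assumed log-domain degree of match b_j(V_t)
    start, final -- initial and final vertices of the network
    Returns (score, path) of the marker at `final`, or None if no marker
    reached it.
    """
    # Initial marker: zero degree of match, empty path record.
    markers = {start: (0.0, [])}
    for v in observations:
        moved = {}
        for i, (score, path) in markers.items():
            for j, log_weight in arcs.get(i, []):
                new_score = score + log_weight + emit_logprob(j, v)
                # Normalization: keep only the best marker in each vertex.
                if j not in moved or new_score > moved[j][0]:
                    moved[j] = (new_score, path + [j])
        # The original markers are discarded; only moved copies remain.
        markers = moved
    return markers.get(final)


# Toy network (assumed data): scalar states, zero arc weights.
reference = {"A": 0.0, "B": 1.0, "C": 2.0}
arcs = {
    "start": [("A", 0.0), ("B", 0.0)],
    "A": [("A", 0.0), ("C", 0.0)],
    "B": [("B", 0.0), ("C", 0.0)],
    "C": [("C", 0.0)],
}
result = moving_marker_decode(
    [0.1, 1.9], arcs, lambda j, v: -(v - reference[j]) ** 2, "start", "C")
```

Unlike a step-by-step choice of the single closest vertex, every vertex keeps its own best partial path until the end, so the decision is made only once the whole signal has been processed.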

A device implementing the method for recognizing words in continuous speech can be realized as a computer program. In that case the device consists of data structures in the computer's main memory.

A device implementing the method for recognizing words in continuous speech is shown in Figs. 10, 11 and 12. Fig. 10 shows the block diagram of the system; Fig. 11 shows the block diagram of the acoustic analyzer; Fig. 12 shows the block diagram of the lexical analyzer.

The device for recognizing words in continuous speech, which uses a lexical decoding network, is shown in Fig. 10. It consists of an acoustic analyzer (block 1) and a lexical analyzer (block 2).

Block 1 determines the current acoustic state V_t.

Block 2 searches for the working hypothesis. The input of block 1 is connected to a microphone. The output of block 1 is connected to the input of block 2. The recognition result is obtained from the output of block 2.

Block 1, whose block diagram is shown in Fig. 11, contains: block 3, a frequency spectrum analyzer; and block 4, a calculator of the current acoustic state V_t.

Block 2, whose block diagram is shown in Fig. 12, contains: block 5, a store of the set of states; block 6, a store of the lexical decoding network; block 7, a calculator of the degree of coincidence b_j(V_t) between the current acoustic state V_t and the reference acoustic state associated with the current vertex j of the lexical decoding network; block 8, a marker store; block 9, a recognition result generator; block 10, a store of recalculated markers; block 11, a marker set normalization unit; block 12, a recognition result output unit; block 13, a control unit.

Block 3 computes the spectrum of the current segment of the speech utterance and converts it to digital form.

Block 4 computes the current acoustic state V_t corresponding to the current segment of speech.

Block 5 is a device that stores the database of acoustic states.

Block 6 is a device that stores the lexical decoding network.

Block 7 computes an estimate of the degree of coincidence b_j(V_t) between the current acoustic state V_t and the reference acoustic state corresponding to the given vertex j of the lexical decoding network.

Block 8 is a device that stores the marker database.

Block 9 generates the recognition results using the information stored in the marker located at the final vertex at the final moment of time.

Block 10 is a device that stores the database of recalculated markers.

Block 11 normalizes the set of markers.

Block 12 outputs the recognition results.

Block 13 controls the recognition system.

The word recognition system for continuous speech operates as follows (see Figs. 10, 11 and 12). The input utterance from the microphone is fed to the input of block 3 of acoustic analyzer 1.

Block 3 uses band-pass filters to extract the frequency spectrum and converts it to digital form, as in the prototype. The digitized signals are fed to the input of block 4.
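The filter bank of block 3 is not detailed here beyond the reference to the prototype. As a rough software stand-in, the sketch below sums discrete Fourier transform power inside a few frequency bands of one digitized frame. This swaps a DFT in for the band-pass filters and is not the prototype's circuit; the function name `band_energies`, the band edges and the sampling rate are illustrative assumptions.

```python
import cmath
import math


def band_energies(frame, sample_rate, bands):
    """Sum DFT power in each (low_hz, high_hz) band of one frame of samples."""
    n = len(frame)
    energies = [0.0] * len(bands)
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n  # centre frequency of DFT bin k
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        power = abs(coeff) ** 2
        for b, (low, high) in enumerate(bands):
            if low <= freq < high:
                energies[b] += power
    return energies


# Hypothetical frame: a 1000 Hz tone sampled at 8000 Hz, 256 samples long.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
bands = [(0, 500), (500, 1500), (1500, 4000)]
energies = band_energies(frame, sample_rate, bands)
print(energies.index(max(energies)))  # index of the band holding most energy
```

The resulting vector of band energies is the kind of digitized spectrum that a block such as block 4 could take as input.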

Block 4 computes the current acoustic state V_t and at the same time determines whether the current signal segment contains speech, as in the prototype. The current acoustic state is then passed to the input of block 13 of the lexical analyzer.

Unlike the prototype, block 13 controls the operation of the lexical analyzer. Having received the current acoustic state from the acoustic analyzer, the control unit instructs block 7 to create a marker associated with the initial vertex 0 of the lexical decoding network.

Unlike the prototype, block 7 creates a marker associated with the initial vertex 0 of the lexical decoding network and passes it through blocks 10 and 11 to block 8. Block 13 then instructs block 6 to begin traversing the vertices of the lexical decoding network.

Block 6 proceeds to the next vertex i of the lexical decoding network and passes to the input of block 8 the index of the marker associated with that vertex. The sequence of numbers of the states associated with the vertices to which the system can move from the current vertex is also passed to the input of block 5.

Block 8 passes the current marker to the input of block 7.

Block 5, on receiving a command from the control unit, passes to the input of block 7 the next state, associated with vertex j of the lexical decoding network.

Block 7 creates a copy of the marker received from block 8 for the state, received from block 5, that is associated with vertex j, recalculating the degree of coincidence by formula (1). The newly created marker is passed to block 10, which is the store of created markers.

After all vertices of the lexical decoding network have been traversed, the control unit sends to the input of block 10 a signal to normalize the markers accumulated in block 10. For normalization, the markers are passed to block 11.

Block 11 normalizes the set of markers by deleting all markers associated with the same vertex of the lexical decoding network except the marker with the maximum degree of coincidence. The normalized set of markers is passed to the input of block 8.

This completes the processing of the current acoustic state V₁. The system moves on to the next current acoustic state, V₂. The recognition procedure ends when the current segment contains no speech. In that case, the control unit signals block 6 that the recognition procedure is complete. Block 6 passes to block 8 the index of the marker associated with the final vertex N of the lexical decoding network. Block 8 passes this marker to block 9, which uses it to generate the recognition result as a sequence of recognized words. This sequence is passed to block 12, which displays the recognition result in a form convenient for the operator.
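One possible way for block 9 to turn the final marker into a word sequence is sketched below. The text does not say how the word history is kept inside a marker, so the sketch assumes the marker carries its working path as a tuple of vertex numbers, that each vertex is labelled with the word it belongs to, and that the first vertex of every word is flagged. The names `words_from_path`, `vertex_word` and `word_start` are hypothetical.

```python
def words_from_path(path, vertex_word, word_start):
    """Turn a working path (vertex numbers) into the sequence of recognized words."""
    words = []
    previous = None
    for vertex in path:
        # Emit a word each time the path enters the first vertex of a word
        # from a different vertex; self-loops stay inside the same word.
        if vertex in word_start and vertex != previous:
            words.append(vertex_word[vertex])
        previous = vertex
    return words


# Hypothetical two-word network: vertices 1-2 spell "turn", vertices 3-4 spell "on".
vertex_word = {1: "turn", 2: "turn", 3: "on", 4: "on"}
word_start = {1, 3}
path = (0, 1, 1, 2, 3, 3, 4, 4)
hypothesis = words_from_path(path, vertex_word, word_start)
print(hypothesis)
```

Flagging word-start vertices, rather than simply collapsing repeated word labels, lets the same word appear twice in a row in the working hypothesis.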

Thus, the recognition accuracy is increased because the search for the working hypothesis is optimal in the sense of the maximum degree of its coincidence with the original speech signal, since the moving marker algorithm used is based on dynamic programming, and because the lexical decoding network is built using a language model and transition probabilities between states, through which the average durations of phonemes are taken into account.
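The link between transition probabilities and average durations can be illustrated with a short calculation. It assumes, as in standard hidden Markov models, that each reference state has a self-transition with probability p, so that the time spent in the state follows a geometric distribution with mean 1 / (1 - p) frames. The text here does not spell out this construction, and the function names below are hypothetical.

```python
def expected_duration(self_loop_prob):
    """Mean number of frames spent in a state with the given self-loop probability."""
    return 1.0 / (1.0 - self_loop_prob)


def self_loop_for_duration(mean_frames):
    """Self-loop probability that gives the requested mean duration in frames."""
    return 1.0 - 1.0 / mean_frames


# A state that should last 5 frames on average needs a self-loop probability of 0.8.
p = self_loop_for_duration(5.0)
d = expected_duration(p)
print(p, d)
```

Under this assumption, a longer average phoneme duration translates into a self-transition probability closer to 1 for the states of that phoneme.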

Information sources

1. Klatt D.H. Review of the ARPA Speech Understanding Project, J. Acoust. Soc. America, 62, No. 4, pp. 1366, 1977.

2. Woods W.A., Bates M., Brown G., et al. Speech Understanding Systems: Final Tech. Progress Report, Bolt, Beranek, Newman, Inc. Rep. No. 3438, Cambridge, 1976.

3. RF Patent No. 2101782 C1, cl. 7 G10L 15/00.

4. Rabiner L.R. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, vol. 77, No. 2, February 1989.

5. Huang X., Acero A., Hon H.-W. Spoken Language Processing: a guide to theory, algorithms, and system development. - Prentice-Hall, Inc., 2001.

6. Mottl V.V., Muchnik I.B. Hidden Markov Models in Structural Signal Analysis. - Moscow: FIZMATLIT, 1999.

Claims

1. A method for recognizing words in continuous speech, in which, as a speech utterance is pronounced, samples of the acoustic signal of the utterance, digitized at a given quantization frequency, are taken periodically at fixed time intervals; a functional determining the current acoustic state is calculated from the set of these samples; and the resulting sequence of current acoustic states is used to restore the sequence of words (the working hypothesis) spoken in the original utterance, for which purpose a lexical decoding network is used that specifies the rules for the ordering of reference acoustic states in the language, characterized in that a search is carried out for a working hypothesis that is optimal in the sense of the maximum degree of its coincidence with the original speech signal, which is ensured by the use of a moving marker algorithm, and the working hypothesis is restored from the marker that at the given moment is located at the final vertex of the lexical decoding network.

2. The method for recognizing words in continuous speech according to claim 1, in which the lexical decoding network is created on the basis of a language model.

3. The method for recognizing words in continuous speech according to claim 1, in which, when the lexical decoding network is created, probabilities of transition between states are used that specify the average durations of the acoustic states.

4. The method for recognizing words in continuous speech according to claim 2, in which, when the lexical decoding network is created, probabilities of transition between states are used that specify the average durations of the acoustic states.