EA023695B1

EA023695B1 - Method for recognition of speech messages and device for carrying out the method

Info

Publication number: EA023695B1
Application number: EA201200945A
Authority: EA
Inventors: Михаил Васильевич Хитров; Кирилл Евгеньевич Левин
Original assignee: Ооо "Центр Речевых Технологий"
Priority date: 2012-07-16
Filing date: 2012-07-16
Publication date: 2016-07-29
Also published as: EA201200945A1

Abstract

Proposed are a method for automatic recognition of speech messages and a device for carrying out the method. The proposed device and method convert an input speech message into text through multifactor processing of the speech message using algorithms for evaluating the quality of the speech message, an algorithm for speaker identification, algorithms for finding syntagma boundaries, and algorithms for determining the topic of the speech message. The device comprises a common logic unit that combines diverse probabilistic estimates to reach an overall decision on the content of the speech message. The use of said device and method increases the reliability of recognition of voice search queries and of recordings of conferences or negotiations.

Description

The invention relates to the field of automatic speech recognition and, in particular, to a method for automatic recognition of speech messages and a device for carrying out the method. The invention can be used for recognizing news broadcasts and voice search queries, and for processing recordings of meetings and negotiations.

Background art

Various devices and methods for recognizing speech messages and individual words are currently known. The known solutions are based on comparing input speech signals with reference signals stored in corresponding dictionaries of speech samples and analyzing the match probabilities obtained from those comparisons.

Patent EP 1069551 discloses a device and method for recognizing words in a stream of continuous speech in order to receive user commands. The device and method implement a speech recognition algorithm based on hidden Markov models. According to that invention, a dictionary of reference speech samples is created in advance, which includes, for example, a set of standard user commands. These reference samples are compared with samples received from the user. After a voice message is received, a probability estimation unit determines the probability that the spoken phrase matches phrases from the dictionary of speech samples. If the inequality P > P_max holds, where P is the probability value obtained by comparison with a specific phrase from the dictionary of speech samples and P_max is the maximum threshold value, the user's phrase is assigned the value of that specific phrase.
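The threshold rule P > P_max described above can be illustrated with a minimal Python sketch. The function name `match_command`, the dictionary of per-phrase probabilities, and the choice to test only the best-scoring phrase are assumptions made for illustration; the cited patent is not known to specify them.

```python
def match_command(scores, p_max):
    """Return the dictionary phrase assigned to the user's phrase, or None.

    scores: hypothetical mapping {dictionary phrase: match probability P}.
    p_max:  threshold value P_max from the rule P > P_max.
    """
    if not scores:
        return None
    # Assumption: only the best-scoring phrase is tested against the threshold.
    phrase, p = max(scores.items(), key=lambda item: item[1])
    return phrase if p > p_max else None


if __name__ == "__main__":
    print(match_command({"open file": 0.91, "close file": 0.40}, p_max=0.8))
```

With a threshold the scores cannot exceed, the function returns `None`, which corresponds to rejecting the utterance rather than forcing a match.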

That solution is intended for recognizing individual words and, when used for word-by-word speech recognition, yields low recognition reliability because it provides no multifactor preprocessing of the voice message, the list of criteria for computing probabilities is limited, the dictionary of speech samples is limited, and no speaker dictionary can be created. Thus, that solution does not allow speech messages to be recognized with a high degree of reliability.

Patent RU 2223554 discloses a speech recognition device and a corresponding method that recognize words from input information about models of speech units, each of which is shorter than a word. The speech recognition device contains means for storing a set of vocabulary labels, which store label sequences of said speech units for general-purpose words commonly used to recognize words from the input speech of arbitrary speakers. The device also includes means for extracting label sequences for registered words, which generate label sequences corresponding to the connections between said speech units by using a set in which the connection conditions for the speech units are described, the label sequences of said speech units having the highest probability for the registered words in the input speech of a particular speaker. The device also includes registration means, which store, as parallel sets, said label sequences of speech units for general-purpose words and the generated label sequences for registered words. In the device, said speech units are acoustic events generated by splitting a hidden Markov model of a phoneme into individual states without changing the transition probabilities, the output probabilities, or the number of states.

That device enables word-by-word recognition of continuous speech and implements, at the hardware level, an extensible dictionary of speech-unit models, with word recognition performed using predefined models of arbitrary speakers. A drawback of the known device of patent RU 2223554 and of the method it implements is the low reliability of speech message recognition, owing to the limited probabilistic information used in recognition.

The closest analogue of the claimed method and device for recognizing speech messages is the method, and the device implementing it, described in patent RU 2296376. According to that method, an audio signal is received and preprocessed by extracting the intervals that correspond to words detected against background noise and by splitting them into segments of a standard duration shorter than the duration of a phoneme. Primary decoding of said speech component is then performed, in which bispectral features are formed and compared with reference phoneme features in order to decide which phoneme is recognized in each segment of a word. When the resulting set of phoneme letter codes of the word being recognized is compared with the sets of phoneme letter codes of dictionary words, using reference word features, an array of recognition score values is formed, each equal to the number of letter codes and pause codes of the word being recognized that match those of a dictionary word. The recognized word is taken to be the dictionary word whose comparison yields the maximum recognition score. Word-by-word recognition of speech messages can thus be provided.
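The scoring rule of the closest analogue (count the matching letter and pause codes, then pick the dictionary word with the highest count) might be sketched as follows. The position-by-position comparison, the `"_"` pause code, and the function names are assumptions for illustration; the cited patent may align the code sequences differently.

```python
def recognition_score(recognized, reference):
    """Count codes (letters or pauses) that match at the same position."""
    return sum(1 for a, b in zip(recognized, reference) if a == b)


def recognize_word(recognized, dictionary):
    """Return the dictionary word whose code sequence scores highest.

    dictionary: hypothetical mapping {word: list of phoneme letter codes}.
    """
    return max(dictionary,
               key=lambda word: recognition_score(recognized, dictionary[word]))


if __name__ == "__main__":
    words = {
        "cat": ["k", "a", "t", "_"],  # "_" stands for a pause code (assumed)
        "cap": ["k", "a", "p", "_"],
    }
    print(recognize_word(["k", "a", "t", "_"], words))
```

Because the score depends only on code coincidences, it carries no information about the topic or the speaker, which is the limitation the description goes on to address.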

The known method and device cannot determine the topic of a speech message or identify the speaker. These drawbacks limit the probabilistic information used in recognition and prevent a high level of recognition reliability from being achieved.

Disclosure of the invention

The following terms and definitions are used in this section.

Acoustic model (AM) - a set of statistical parameters of individual speech sounds that makes it possible to determine the most probable word sequences.

Language (linguistic) model (LM) - the set of possible word sequences in spoken language.

Syntagma - a group of several words united by semantic, grammatical, and phonetic compatibility.

Phone (of a word) - a unit of the sound level of a language, identified in the speech stream without regard to its phonemic affiliation (that is, without assigning it to any particular phoneme), or a concrete realization of a phoneme in speech.

The object of the present invention is to provide a technical solution that recognizes speech messages with a high degree of reliability and produces multi-level text markup in which individual phrases are attributed to different speakers.

This object is achieved by the proposed method, in which an audio signal is received, the received audio signal is preprocessed to extract its speech component, and primary decoding of said speech component is performed using data from a dictionary of speech samples. The proposed method is characterized in that, at the preprocessing stage, the time points of speaker changes and the syntagma boundaries are determined, and at the primary decoding stage the syntagma boundary data are used; the method further includes determining the topic of the speech component using a topic classifier, performing secondary decoding of said speech component using the data on its topic and the data of the dictionary of speech samples to obtain a word sequence in the form of text, identifying the speakers using speaker model data, and logically processing the resulting word sequence using the data on the topic of the speech component and the data on the speakers' identities to obtain multi-level text markup.
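One way to picture the multi-level text markup that the method produces is a list of phrases, each tagged with a speaker and a topic. The sketch below is only an illustration: the `MarkedPhrase` record, its fields, and the rule that a phrase belongs to the speaker whose change point most recently precedes the phrase start are all assumptions, not details taken from the description.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class MarkedPhrase:
    start: float   # phrase start time, seconds
    end: float     # phrase end time, seconds
    speaker: str   # identity from the speaker identification step
    topic: str     # label from the topic classifier
    text: str      # word sequence from secondary decoding


def build_markup(phrases, speaker_turns, topic):
    """Attach speaker and topic labels to decoded phrases.

    phrases:       list of (start, end, text) tuples.
    speaker_turns: list of (change_time, speaker_id), sorted by change_time.
    """
    change_times = [t for t, _ in speaker_turns]
    markup = []
    for start, end, text in phrases:
        # Index of the last speaker change at or before the phrase start.
        idx = bisect_right(change_times, start) - 1
        speaker = speaker_turns[idx][1] if idx >= 0 else "unknown"
        markup.append(MarkedPhrase(start, end, speaker, topic, text))
    return markup


if __name__ == "__main__":
    turns = [(0.0, "speaker A"), (5.0, "speaker B")]
    decoded = [(0.5, 2.0, "good morning"), (5.2, 7.0, "hello everyone")]
    for phrase in build_markup(decoded, turns, topic="meeting"):
        print(phrase)
```

A single topic per message is used here for brevity; since the description notes that the topic may change within a message, a fuller sketch would carry topic change points in the same way as speaker changes.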

The technical result achieved by the proposed method is improved accuracy of speech message recognition. The method can process complex speech messages that may belong to one or several speakers and in which the topic and the recording quality may change repeatedly. Multifactor preprocessing and the use of a topic classifier and speaker models provide probabilistic information, in the form of several hypotheses about the content of the speech message, sufficient to ensure a high level of recognition reliability.

According to a further embodiment, the proposed method is characterized in that grammatical agreement of the resulting word sequence is performed after secondary decoding.

According to a further embodiment, the proposed method is characterized in that, after the speech signal is received, the data are converted into a format suitable for recognition.

The object of the present invention is also achieved by the proposed device for recognizing speech messages, which includes an audio signal receiving module, a module for preprocessing the received audio signal that is configured to extract the speech component of the signal, and a speech recognition module including a decoder configured to perform primary decoding of the speech component of the audio signal using data from a dictionary of speech samples. The proposed device is characterized in that the decoder of the speech recognition module is a two-pass decoder configured to perform secondary decoding to obtain a word sequence in the form of text; the speech recognition module is configured to use data on the topic of said speech component obtained from a topic classifier; the preprocessing module is configured to determine the time points of speaker changes and the syntagma boundaries; the speech recognition module is configured to perform primary decoding using the syntagma boundary data; and the device further includes a module for identifying speakers using speaker model data and a logic module configured to logically process the resulting word sequence, taking into account the data on the topic of the speech component and the data on the speakers' identities, to obtain multi-level text markup.

The technical result provided by the device is improved accuracy of speech message recognition. The device achieves a high degree of recognition reliability because it implements multifactor preprocessing and the formation of a topic classifier and speaker models, which provide probabilistic information sufficient to ensure a high level of recognition reliability.

According to a further embodiment, the proposed device is characterized in that the preprocessing module is configured to determine the types and levels of interference and distortion in the audio signal.

According to a further embodiment, the proposed device is characterized in that the audio signal receiving module is configured to exchange data with the user, to control the data processing, to load speech data from various data sources, and to output information about the results. In particular, this lets the user load a speech message to be recognized from removable media.

According to a further embodiment, the proposed device is characterized in that it includes a conversion module configured to convert input data, which arrive in various formats and are stored on various media, into a format suitable for speech recognition.

According to a further embodiment, the proposed device is characterized in that it includes a grammatical agreement unit configured to grammatically analyze the word sequences obtained from secondary decoding.

According to a further embodiment, the proposed device is characterized in that it includes means for storing the results of speech message recognition.

Brief description of the drawings

A detailed description of the implementation of the invention follows, with reference to the accompanying drawings:

FIG. 1 shows a preferred embodiment of the device for recognizing speech messages according to the present invention;

FIG. 2 illustrates a preferred embodiment of the method for recognizing speech messages according to the present invention.

Implementation of the invention

FIG. 1 shows a preferred embodiment of the device for recognizing speech messages according to the present invention.

In the preferred embodiment, the proposed device is a hardware and software system including, for example, a computer system. As FIG. 1 shows, the device includes a user interface 1, a recognition task generation module 2, a preprocessing module 3, a speech recognition module 4 that includes a decoder, a topic classifier 5, an annotation module 6, a training module 7, a linguistic processing module 8, a result storage module 9, as well as a post-processing module 10, a speaker identification module 11 and a logic module 12, a compensation module 14, speech/pause detectors 15, a signal-to-noise ratio (SNR) calculator 16, and artificial neural networks 17 (shown in FIG. 2).

The operation of the proposed device is described below, with an explanation of the relationships between its modules.

The audio signal arrives at interface 1, which is configured to initiate the recognition process when an audio signal is received. The interface may include a keyboard and a computer mouse, as well as means for receiving an audio signal, means for loading speech data from various data sources, and means for outputting information about the results. Interface 1 lets the user control the data processing. Through interface 1, the speech data pass to module 2.

Module 2 is intermediate between interface 1 and module 3. It converts input data, which arrive in various formats and are stored on various media, into a format suitable for the speech recognition system. After this preparation, the speech signal passes from module 2 to module 3.

Module 3 converts the speech signal into a set of speech parameters, such as FBANK1 and FBANK2 (the results of processing the voice message with mel-frequency filter banks), F0 (the fundamental frequency of the speech signal) and MFCC (mel-frequency cepstral coefficients). These parameters make it possible to isolate the informative component of the signal and to reduce the inter-speaker and inter-session variability of the original signal. The input signal is converted using known algorithms such as MFCC, FFT and LCRC. As can be seen from FIG. 2, the main parameter set (FBANK1) is fed to the input of compensation module 14, where initial tuning to the voice message transmission channel is performed; during this tuning, detector 15 and calculator 16 determine the recording quality and provide data for further use in module 4. In addition, during the initial tuning to the transmission channel, the FBANK1 parameters are continuously compensated, which removes from the input signal the constant distortions introduced by the frequency response of the transmission channel.

The other, auxiliary sets (FBANK2) are fed to the inputs of the speech/pause, noise and interference detectors 15, the signal-to-noise ratio (SNR) calculator 16 and the artificial neural networks 17 (ANN 1, ANN 2). The neural networks compute the posterior probabilities that the input data vector belongs to the phone states. However, they compute these probabilities without regard to the permissible order of phonemes in speech; this order is taken into account in module 4 during decoding. The remaining parameter sets are used to find the time points at which speakers change and to determine possible syntagm boundaries. In addition, module 3 identifies the segments that contain speech and determines the types and levels of interference and distortion present in the input signal. Module 3 is configured to distinguish the main types of distortion that most affect recognition reliability: nonlinear distortion (overload) and additive interference in the transmission channel. To assess these distortions, the signal-to-noise ratio of the speech signal is determined, together with the segments whose amplitude changes are characteristic of overload distortion. An important function of module 3 is determining the informative part of the speech signal, which shortens recognition time by excluding pauses from the recognition process. After this preprocessing, the speech signal passes from module 3 to module 4.
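The channel compensation and signal-to-noise steps of module 3 can be illustrated with a minimal sketch. The description does not state which compensation or SNR formulas are used, so the code below assumes per-band mean subtraction over log filter-bank frames and a simple ratio of mean energies; the function names `compensate_channel` and `snr_db` are hypothetical.

```python
import math


def compensate_channel(frames):
    """Subtract the per-band mean from log filter-bank frames.

    Assumption: a stationary channel adds a constant offset to each band
    in the log domain, so removing the mean over time removes that offset.
    """
    n_bands = len(frames[0])
    means = [sum(f[b] for f in frames) / len(frames) for b in range(n_bands)]
    return [[f[b] - means[b] for b in range(n_bands)] for f in frames]


def snr_db(frame_energies, is_speech):
    """Estimate the SNR in dB from per-frame energies and speech/pause labels.

    Simplification: the mean energy of speech frames is treated as signal
    power and the mean energy of pause frames as noise power.
    """
    speech = [e for e, s in zip(frame_energies, is_speech) if s]
    noise = [e for e, s in zip(frame_energies, is_speech) if not s]
    if not speech or not noise:
        raise ValueError("need both speech and pause frames")
    p_noise = sum(noise) / len(noise)
    if p_noise <= 0:
        raise ValueError("noise power must be positive")
    return 10.0 * math.log10((sum(speech) / len(speech)) / p_noise)
```

In this sketch the speech/pause labels would come from a detector such as detector 15, and the resulting SNR value is the kind of quality estimate that could be passed on to module 4.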

Module 4 is configured to determine the grammatically most probable hypothesis for an unknown utterance, i.e. the most probable path through the recognition network, which consists of word models (which are, in turn, built from models of individual phones). The likelihood of a hypothesis depends on two factors: the probabilities of the phone sequence assigned by the acoustic model, and the probabilities of words following one another, determined by the language model. The likelihood of a hypothesis is obtained by multiplying the probability of the phone sequence by the probability of the word sequence; more specifically, the following are multiplied together: the probabilities of the phone states, the probabilities of transitions between these states, the probabilities of phones following one another within a word (a word may have several pronunciation variants) and the probabilities of words following one another. Module 4 achieves acceptable speed by performing a pruned (beam) search, which explores not all possible partial paths in the recognition network but only those whose overall likelihood exceeds a certain threshold. At each moment in time, the most probable partial path in the model is found, and its likelihood is used to compute the lower bound of the search. All paths whose likelihood falls below this bound are excluded from further consideration.
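The scoring and pruning described for module 4 can be sketched as follows. This is an illustration under stated assumptions rather than the decoder itself: probabilities are combined in the log domain (a sum of logarithms equals the logarithm of the product), and the lower bound is taken as the best score minus a fixed width. The names `hypothesis_log_likelihood`, `prune` and `beam_width` are hypothetical.

```python
import math


def hypothesis_log_likelihood(acoustic_probs, lm_probs):
    """Log of the product of all acoustic and language-model factors."""
    return (sum(math.log(p) for p in acoustic_probs)
            + sum(math.log(p) for p in lm_probs))


def prune(partial_paths, beam_width):
    """Keep only partial paths scoring within beam_width of the best path.

    partial_paths maps a path label to its log-likelihood. The lower bound
    is derived from the most probable partial path, as in the description;
    using a fixed width below the best score is an assumption.
    """
    if not partial_paths:
        return {}
    lower_bound = max(partial_paths.values()) - beam_width
    return {path: score for path, score in partial_paths.items()
            if score >= lower_bound}
```

Calling `prune` once per time step keeps the number of active paths small, which is how a pruned search trades a small risk of discarding the correct path for a large gain in speed.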

In the proposed device, the language model is built using module 8. In addition, the speed of module 4 can be further increased by training the acoustic models using module 7. This training involves rebuilding the acoustic models using the results of previous recognition.

Module 4 includes a two-pass decoder, which makes it possible to make the search conditions for the most probable word sequence progressively more complex. As can be seen from FIG. 2, information about the quality of the speech signal is used on the first decoder pass to fine-tune the acoustic models to the speech recording conditions, while data on the topic of the message make it possible to select a language model appropriate to that topic on the second decoder pass. It should be noted that during recognition a separate transformation of the speech signal features is performed for each speaker, bringing that speaker's characteristics toward those of an average speaker. To fine-tune the acoustic models to the recording conditions, the spectrum of the speech signal is compensated according to the interference level. In addition, for some non-speech events (for example, crackling, beeps, music), the corresponding information is fed directly to the decoder, which uses a different decoding mode in these segments: it only propagates existing hypotheses and does not generate new word hypotheses. This avoids inserting words that are known to be erroneous into the recognition results.

In addition, a separate language model is built in advance for each topic, for example a language model for political news or a language model for sports reporting. Once the topic has been determined, the decoder loads, on the second pass, the language model that most closely matches the topic. This makes it possible to recognize more accurately the words and turns of phrase characteristic of each topic.
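Selecting a topic-specific language model for the second pass can be sketched as a lookup keyed by the classifier's best-scoring topic. The description does not specify what happens when no model exists for a topic, so the fallback to a general model below is an assumption; `select_language_model` and the model names are hypothetical.

```python
def select_language_model(topic_scores, language_models, default="general"):
    """Return the language model for the highest-scoring topic.

    topic_scores maps topic names to classifier scores; language_models
    maps topic names to prebuilt models. Falling back to a default model
    when the topic has no dedicated model is an assumption.
    """
    best_topic = max(topic_scores, key=topic_scores.get)
    return language_models.get(best_topic, language_models[default])
```

For example, with scores `{"politics": 0.2, "sport": 0.7}` and a model table containing `"sport"`, the function returns the sports model for use on the second decoder pass.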

As can be seen from FIG. 1, after module 4 the speech component passes to module 5, where it is assigned a topic, and to module 6, where a summary of the speech component can be compiled. The summary contains several sentences taken from the full recognition result. These sentences are selected on the basis of an information gain criterion. To improve recognition accuracy, the proposed device is also supplemented with modules 10, 11 and 12. As can be seen from FIG. 2, the speech component from module 4 can pass to module 10, which analyzes the context of the recognition results and, based on the rules of Russian grammar, performs agreement of case and gender endings, correcting possible recognition errors.

The final stage of recognition is carried out in logic module 12. When recognizing the speech component, module 12 uses the probabilistic information that arrives at the decoder from module 11 and module 5. Module 11 includes speaker models, each of which is a vector of acoustic parameters with a dimension in the range 200-300, built automatically from a sample of a person's voice. The speaker models are created automatically from sample recordings of specific people whose speech may occur in the voice message. Module 11 may include data storage means for storing the speaker models. Module 12 combines the heterogeneous hypotheses, as described above, in order to output the most likely word sequence, in which sentence boundaries are marked and the segments with different topics and voices are indicated.
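Matching a speech segment against stored speaker models can be sketched as a nearest-model search over fixed-length vectors. The description states only that each model is a vector of 200-300 acoustic parameters; the cosine similarity measure, the rejection threshold and the name `identify_speaker` are assumptions made for illustration, and vectors are assumed to be nonzero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def identify_speaker(segment_vector, speaker_models, threshold=0.5):
    """Return (name, score) for the closest speaker model.

    name is None when no model reaches the threshold, i.e. the speaker is
    treated as unknown. The threshold value is illustrative only.
    """
    best_name, best_score = None, -1.0
    for name, model in speaker_models.items():
        score = cosine(segment_vector, model)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name if best_score >= threshold else None), best_score
```

The returned score is the kind of probabilistic evidence that a logic module could weigh together with the topic and word-sequence hypotheses.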

Finally, the recognition result passes to module 9, where it is stored.

FIG. 2 illustrates a preferred embodiment of the voice message recognition method according to the present invention. As can be seen from FIG. 2, at the initial stage of processing the received audio signal is fed to the speech parameter estimation module, which preprocesses the received audio signal to obtain several different parameter sets used by the subsequent processing modules and to extract the speech component of the signal, as described in detail above.

At the next stage of the proposed method, speech decoding and speaker identification are performed.

Decoding uses a two-pass decoder, which makes it possible to make the search conditions for the most probable word sequence progressively more complex.

At the next stage of the present method, the speech component is fed to the post-processing module, which analyzes the context of the recognition results and, based on the rules of Russian grammar, performs agreement of case and gender endings, correcting possible recognition errors.

At the next stage of the present method, the logic module combines the heterogeneous hypotheses about the topic of the message, the possible word sequence and the identity of the speaker in order to output the most likely word sequence, in which sentence boundaries are marked and the segments with different topics and voices are indicated. Using a single parameter estimation module significantly reduces the computational cost at this stage of processing.
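The multi-level markup produced at this stage can be pictured as a word sequence in which every word carries a speaker label and a topic label. The sketch below is only a data-structure illustration: it attaches labels by time overlap and does not model the probabilistic combination of hypotheses. The names `label_at` and `mark_up` and the tuple layouts are hypothetical.

```python
def label_at(time, segments):
    """Return the label of the (start, end, label) segment covering time."""
    for start, end, label in segments:
        if start <= time < end:
            return label
    return None


def mark_up(words, speaker_segments, topic_segments):
    """Attach a speaker and a topic to every recognized word.

    words is a list of (start_time, word) pairs; each segment list holds
    (start, end, label) tuples with half-open intervals [start, end).
    """
    return [
        {"word": word,
         "speaker": label_at(t, speaker_segments),
         "topic": label_at(t, topic_segments)}
        for t, word in words
    ]
```

A word whose start time falls outside every segment receives `None` for that level, which makes gaps in speaker or topic coverage visible in the output.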

At the final stage, the recognition result is stored.

The proposed method makes it possible to convert an input speech signal into text with high reliability.

Claims

1. A method for recognizing voice messages, in which an audio signal is received;

the received audio signal is preprocessed to extract the speech component of the signal;

primary decoding of said speech component is performed using data from a dictionary of speech samples, characterized in that, at the preprocessing stage, the time points of speaker changes and the syntagm boundaries are determined;

at the primary decoding stage, the syntagm boundary data are used; the topic of the speech component is determined using a topic classifier; secondary decoding of said speech component is performed using the data on its topic and data from the dictionary of speech samples to obtain a word sequence in the form of text;

the identities of the speakers are determined using speaker model data; and logical processing of the obtained word sequence is performed using the data on the topic of the speech component and the speaker identity data to obtain multi-level text markup.

2. The method according to claim 1, characterized in that, after secondary decoding, grammatical agreement of the resulting word sequence is performed.

3. The method according to claim 2, in which, after the speech signal is received, the data are converted into a format suitable for recognition.

4. A device for recognizing voice messages, comprising a module for receiving an audio signal;

a preprocessing module for the received audio signal, configured to extract the speech component of the signal;

a speech recognition module including a decoder configured to perform primary decoding of the speech component of the audio signal using data from a dictionary of speech samples, characterized in that the device includes a classifier of voice message topics, configured to determine the topic of the speech component;

the decoder of the speech recognition module is a two-pass decoder configured to perform secondary decoding to obtain a word sequence in the form of text, the speech recognition module being configured to use the data on the topic of said speech component obtained from the topic classifier;

the preprocessing module for the received audio signal is configured to determine the time points of speaker changes and the syntagm boundaries;

the speech recognition module is configured to perform primary decoding using the syntagm boundary data, and the device also includes a speaker identification module that uses speaker model data, and a logic module configured to logically process the obtained word sequence, taking into account the data on the topic of the speech component and the speaker identity data, to obtain multi-level text markup.

5. The device according to claim 4, characterized in that the pre-processing module is configured to determine the types and levels of interference and distortion of the audio signal.

6. The device according to claim 4, characterized in that the audio signal receiving module is configured to exchange data with a user, allowing the user to control data processing, load voice data from various data sources and output information about the results of operation.

7. The device according to claim 4, characterized in that it includes a conversion module, which is configured to convert the input data coming in various formats and stored on various media into a format suitable for speech recognition.

8. The device according to claim 4, characterized in that it includes a grammatical matching unit, configured to grammatically analyze the resulting sequences of words from the secondary decoding.

9. The device according to claim 4, characterized in that it includes means for storing speech recognition results.