RU61924U1

RU61924U1 - STATISTICAL SPEECH MODEL

Info

Publication number: RU61924U1
Application number: RU2006108050/22U
Authority: RU
Inventors: Михаил Николаевич Гусев; Игорь Вениаминович Жарков; Валерий Валерьевич Ситников
Original assignee: Михаил Николаевич Гусев
Priority date: 2006-03-14
Filing date: 2006-03-14
Publication date: 2007-03-10

Abstract

The utility model relates to the field of speech technology. The model generates a sound stream and can be used for speech analysis, synthesis and recognition, as well as for assessing the quality of vocoders and communication channels. The problem addressed by the utility model is the creation of a statistical speech model that combines the elements of a speech synthesizer, statistical data, and large speech corpora. The technical result is achieved by modifying a statistical speech model that includes an interface block, a speaker selection block containing a speaker selection generator, a sound selection block that forms samples of sounds, a speech stream generation block that converts these samples into audio signals with specified properties, and a database containing descriptions of typical speakers and other necessary information. The modifications are as follows:

- a module holding statistics on the population parameters of various regions is added to the speaker selection block, with its output connected to the input of the speaker selection generator;

- additional blocks are introduced: a typical-speaker sample block and a block for storing the prosody of the selected sounds or chains of sounds;

- two modules are added to the sound selection block: sound sequencing rules and allophone naming rules.

In addition, the structure of some blocks has been changed. The utility model has a number of advantages, including:

- a larger source speech corpus;

- inclusion in the database of additional statistical information for each typical speaker;

- availability of intonation contour descriptions for each typical speaker;

- the ability to work simultaneously with structural elements of different sizes and formats, and a number of others.

In addition, the statistical speech model is language independent, since all of its algorithms and interfaces remain the same. The proposed statistical model is currently being validated in a text-to-speech synthesis system and in a system for objective assessment of vocoder quality.

Description

The utility model relates to the field of speech technology. The model generates a sound stream and can be used for speech analysis, synthesis and recognition, as well as for assessing the quality of vocoders and communication channels.

The question of what a "statistical speech model" is can be approached from different angles. From an applied point of view, it is a tool for studying and modeling speech activity and a basis for building speech synthesis, analysis and recognition systems. However, the statistical model can also be viewed on a global scale, as a snapshot of the current state of the language that preserves it for posterity.

Traditionally, speech technology is divided into the following four areas:

- speech synthesis (primarily text-to-speech);

- speech recognition;

- speaker identification by voice;

- speech signal processing.

Until recently, most programs that synthesize and recognize human speech could be classed as "demonstration" systems. They might or might not take the form of practical solutions, but in either case they were not created with profit in mind. The main goal of their developers was to demonstrate the level of scientific achievement.

Teaching a computer to communicate in natural language is an extremely interesting and important task. Speech is the main way a person expresses thoughts, goals and desires, being the most natural and convenient form of conveying information. When entering commands from a keyboard, a person has to cast thoughts into a strict grammatical form and check the typed text on the screen. Manual input is slow and tedious: an ordinary person can type 10-20 words per minute, whereas up to 200 words per minute can be spoken. Moreover, not all users are able or willing to use a keyboard.

Speech input is the most attractive option for an untrained user: almost any task from various fields of human activity can be formulated in natural language. Speech input does not replace other means of exchanging information (I. Zharkov, P. Skrelin, M. Gusev, "Voice of Time", Computer Press No. 8, 2005).

Formally, the speech recognition process can be described in just a few sentences. In theory everything is simple, but in practice a multitude of "minor" problems emerges, which to this day prevent researchers from achieving a recognition rate suitable for commercial systems.

Speaker identification (verification) by voice suffers from the same problem as recognition: a high error rate. No one will want to protect a bank account with a voice password if, say, one person in a thousand can gain access to it.

Two basic approaches to building speech synthesis systems are known today: rule-based synthesis and compilative (concatenative) synthesis. Rule-based synthesis generates the physical characteristics of speech sounds from their mathematical descriptions and yields speech of low naturalness. Compilative synthesis cuts segments out of a natural speech sequence and then processes and splices them. Digital processing of the speech signal makes it possible to change the fundamental frequency and the duration of a signal fragment.

A vocoder models speech as the response of a system to an excitation over short time intervals. Examples of vocoder systems include linear prediction vocoders, homomorphic vocoders, channel vocoders, sinusoidal transform coders (STC), multiband excitation (MBE) vocoders, and improved multiband excitation (IMBE) vocoders. In these vocoders, speech is divided into short segments (typically 10-40 ms), each characterized by a set of model parameters. These parameters usually represent a few basic elements of each speech segment, such as its pitch, voicing state, and spectral envelope.

A vocoder may use any of a number of known representations for each of these parameters. For example, pitch may be represented as a pitch period, a fundamental frequency, or a long-term prediction delay. Similarly, the voicing state may be represented by one or more voiced/unvoiced decisions, by a voicing probability measure, or by the ratio of periodic to stochastic energy. The spectral envelope is often represented by the response of an all-pole filter, but it may also be represented by a set of spectral magnitudes or other spectral measurements (RF Patent No. 2214048).
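The equivalence between two of the pitch representations named above, and the size of a 10-40 ms segment in samples, can be illustrated with a minimal sketch. The function names and the 8 kHz sampling rate used in the example are assumptions made for illustration; they are not taken from the cited patent.

```python
def pitch_period_samples(f0_hz: float, sample_rate_hz: int) -> float:
    """Convert a fundamental frequency (Hz) into a pitch period in samples."""
    if f0_hz <= 0:
        raise ValueError("fundamental frequency must be positive")
    return sample_rate_hz / f0_hz


def fundamental_frequency_hz(period_samples: float, sample_rate_hz: int) -> float:
    """Convert a pitch period in samples back into a fundamental frequency (Hz)."""
    if period_samples <= 0:
        raise ValueError("pitch period must be positive")
    return sample_rate_hz / period_samples


def segment_length_samples(segment_ms: float, sample_rate_hz: int) -> int:
    """Number of samples in one analysis segment of the given duration."""
    return round(segment_ms * sample_rate_hz / 1000)


if __name__ == "__main__":
    rate = 8000  # assumed telephone-band sampling rate
    print(pitch_period_samples(100.0, rate))   # period of a 100 Hz voice
    print(segment_length_samples(20, rate))    # samples in a 20 ms segment
```

Either representation carries the same information, so the choice between them is a matter of quantization convenience in a particular coder.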

The multiband excitation (MBE) speech model is able to provide high-quality speech and to operate at medium to low bit rates. The model uses a flexible voicing structure that yields more natural-sounding speech and makes it more robust to acoustic background noise.

Most modern speech coders are based on some model of the vocal tract, which is used to form the coded speech signal. The parameters and signals of the model are quantized, and the information describing them is transmitted over the channel. The dominant coder model in cellular telephony applications is code-excited linear prediction (CELP).

Many codec implementations based on the CELP model are known, for example RF Patent No. 2223555. The coded speech is produced by passing an excitation signal through an all-pole synthesis filter whose order is typically 10. The excitation signal is formed as the sum of two signals, selected from their respective codebooks (one fixed and one adaptive) and multiplied by their respective gains. The codebook signals typically have a duration of 5 ms (a subframe), whereas the synthesis filter is typically updated every 20 ms (a frame). The parameters associated with the CELP model are the synthesis filter coefficients, the codebook entries, and the gains.
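The decoding step described above, a gain-weighted sum of two codebook vectors passed through an all-pole filter, can be sketched as follows. This is a minimal illustration and not the codec of the cited patent: the function names are hypothetical, and the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p for the filter denominator is an assumption.

```python
from typing import List, Optional, Sequence


def celp_excitation(fixed_vec: Sequence[float], adaptive_vec: Sequence[float],
                    gain_fixed: float, gain_adaptive: float) -> List[float]:
    """Excitation for one subframe: gain-weighted sum of the two codebook vectors."""
    if len(fixed_vec) != len(adaptive_vec):
        raise ValueError("codebook vectors must have the same length")
    return [gain_fixed * c + gain_adaptive * a
            for c, a in zip(fixed_vec, adaptive_vec)]


def all_pole_filter(excitation: Sequence[float], lpc: Sequence[float],
                    history: Optional[Sequence[float]] = None) -> List[float]:
    """Synthesis filter 1/A(z): y[n] = x[n] - sum_k lpc[k-1] * y[n-k].

    `history` holds previous outputs, most recent first (assumed convention).
    """
    order = len(lpc)
    past = list(history) if history is not None else [0.0] * order
    output = []
    for x in excitation:
        y = x - sum(a * y_prev for a, y_prev in zip(lpc, past))
        output.append(y)
        past = ([y] + past)[:order]  # keep only the last `order` outputs
    return output


if __name__ == "__main__":
    # Toy first-order filter; a real coder would typically use order 10.
    exc = celp_excitation([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], 1.0, 0.5)
    print(all_pole_filter(exc, [-0.5]))
```

At an assumed 8 kHz sampling rate, a 5 ms subframe corresponds to 40 excitation samples and a 20 ms frame to four subframes sharing one set of filter coefficients.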

A search of literature and patent sources did not reveal an unambiguous prototype in the prior art: the analogues found relate to individual parts of the model and therefore do not allow the stated problem to be solved satisfactorily.

The problem addressed by the utility model is the creation of a statistical speech model that combines the elements of a speech synthesizer, statistical data, and large speech corpora, with the following aims:

- improving the quality of synthetic speech produced by synthesis systems that use the model;

- creating sound streams for training speech recognition systems;

- creating sound streams for testing and assessing the quality of vocoders and communication channels.

An analysis of the descriptions of the analogues showed that the following known blocks can be used in the statistical model to solve the stated problem: a database containing speaker descriptions, sound selection blocks, a speech stream generation block, and an interface block that coordinates the control of the model's components and the model's interaction with the user.

The user of the model may be an automatic device or a human operator that forms requests through the interface.

The technical result is achieved by modifying a statistical speech model that includes an interface block, a speaker selection block containing a speaker sample generator, a sound selection block that forms samples of sounds, a speech stream generation block that converts these samples into audio signals with specified properties, and a database containing descriptions of typical speakers and other necessary information. The modifications are as follows:

- a population parameter statistics module is added to the speaker selection block, with its output connected to the input of the speaker sample generator;

- a typical-speaker sample block is inserted between the speaker selection block and the sound selection block;

- a prosody storage block is inserted between the sound selection block and the speech stream generation block and is also connected to one of the inputs of the interface block;

- two modules are added to the sound selection block: sound sequencing rules and allophone naming rules, whose outputs are connected to the inputs of the corresponding modules.

In line with these additional requirements, and taking into account the ways in which individual blocks interact during operation with the interface block and with individual elements of the statistical model, a modular structure was chosen for the main blocks of the model. The way the individual modules are connected within the blocks may differ somewhat depending on the particular implementation of the statistical speech model.

Most blocks of the model have a complex structure because of the large set of functions they perform. They are described in more detail in the example implementation of the statistical speech model.

The utility model is illustrated by Figures 1-4, and its implementation by Figures 5 and 6.

Fig. 1 shows the general structure of the statistical speech model;

Fig. 2 shows the structure of the speaker selection block;

Fig. 3 shows the structure of the sound selection block;

Fig. 4 shows the structure of the speech stream generation block;

Fig. 5 shows an example of using the statistical model in text-to-speech synthesis;

Fig. 6 shows an example of using the statistical model in a vocoder quality assessment system.

The general structure of the statistical model is shown in Figure 1.

Interface block 1 provides interaction with the outside world (the User). Block 1 also synchronizes the operation of the other blocks of the statistical model.

The speaker selection block (block 2) generates a sample of typical speakers (TS) or a sequence of TS indices. Depending on the command, it generates either a representative sample of TSs or a sample consisting of a single TS. The sample is representative in the sense that the distribution of speech parameters in it matches the distribution of speech parameters in the population described by the model.

The sequence of typical-speaker identifiers produced at the output of this block is stored for later use in the typical-speaker sample block (block 3).

The sound selection block (block 4) forms the prosody (descriptions of sounds). Depending on the command, the prosody is formed for a representative sample of sounds, for a given sequence of sounds, or for a single given sound.

The prosody is kept in the prosody buffer (block 5) until it is needed. The speech stream generation block (block 6) converts the descriptions of sounds into samples of the audio signal and passes the result to interface block 1.

The typical-speaker description block (block 7) stores the descriptions of the typical speakers and returns on request the required parts of the descriptions, their number, and the list of speakers. It also holds other information needed for the model to function; in other words, block 7 is the database of the statistical model.

The statistical speech model is not a thing in itself: it is intended to operate as part of various systems that need to model a speech stream, and these systems are the Users of the model.

Thus, the User can issue requests of the following types:

1. request the list of typical speakers (TS) represented in the model;

2. synthesize individual sounds in the voice of any TS;

3. synthesize chains of sounds in the voice of any TS;

4. generate a sound stream characterizing a single TS;

5. generate a sound stream characterizing the population described by the model;

6. cancel the generation of a sound stream.
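The six request types listed above could be represented, for illustration, as an enumeration with a small request record. All names in this sketch (RequestType, Request, NEEDS_SPEAKER, is_well_formed) are hypothetical; the utility model does not specify the format of the interface block's requests.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class RequestType(Enum):
    """The six request types, numbered as in the list above."""
    LIST_SPEAKERS = 1
    SYNTHESIZE_SOUND = 2
    SYNTHESIZE_CHAIN = 3
    STREAM_ONE_SPEAKER = 4
    STREAM_POPULATION = 5
    CANCEL_STREAM = 6


# Request types that refer to one specific typical speaker.
NEEDS_SPEAKER = {
    RequestType.SYNTHESIZE_SOUND,
    RequestType.SYNTHESIZE_CHAIN,
    RequestType.STREAM_ONE_SPEAKER,
}


@dataclass(frozen=True)
class Request:
    kind: RequestType
    speaker_index: Optional[int] = None  # index of the typical speaker, if any
    sounds: Tuple[str, ...] = ()         # sounds to synthesize, if any

    def is_well_formed(self) -> bool:
        """A speaker index must be present exactly when the request needs one."""
        return (self.speaker_index is not None) == (self.kind in NEEDS_SPEAKER)


if __name__ == "__main__":
    print(Request(RequestType.SYNTHESIZE_CHAIN, 3, ("t", "a")).is_well_formed())
```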

Speaker selection block 2 (Fig. 2) contains two modules: the speaker sample generator 2.1 and module 2.2, which holds statistics on the population parameters of the regions.

When generating a representative sample, the speaker selection block obtains statistical data on the population described by the model (module 2.2): region, age and sex composition, level of education, subdialect or dialect, occupation, and so on. The part of the typical-speaker descriptions (block 7) used by the module consists of ranges of values of the statistical parameters. These ranges make it possible to determine the percentage of the population corresponding to each typical speaker and to include each typical speaker in the sample in a number proportional to the composition of the population.

Having determined how many typical speakers of each kind to include in the sample, the speaker selection block outputs their indices in random order to block 3 for storage.

Once all the required indices have been stored, the block reports to interface block 1 that the command has been completed.

When a single specific speaker is selected, the speaker selection block simply passes that speaker's index to block 3 for storage and reports completion to block 1.
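The generation of a representative sample can be sketched as a proportional allocation followed by a shuffle. The description does not say how fractional counts are rounded; the largest-remainder rule used here is an assumption, as are the function name and the form of the input (a mapping from typical-speaker index to population share).

```python
import random
from typing import Dict, List, Optional


def representative_sample(population_share: Dict[int, float], sample_size: int,
                          rng: Optional[random.Random] = None) -> List[int]:
    """Return typical-speaker indices in random order, each index repeated
    in proportion to that speaker's share of the population."""
    rng = rng or random.Random()
    total = sum(population_share.values())
    if total <= 0:
        raise ValueError("population shares must sum to a positive value")
    # Ideal (fractional) number of occurrences of every typical speaker.
    exact = {i: sample_size * share / total for i, share in population_share.items()}
    counts = {i: int(q) for i, q in exact.items()}
    # Largest-remainder rounding (assumed): hand out the leftover slots
    # to the speakers with the largest fractional parts.
    leftover = sample_size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    indices = [i for i, n in counts.items() for _ in range(n)]
    rng.shuffle(indices)  # indices are passed on in random order
    return indices


if __name__ == "__main__":
    shares = {0: 50.0, 1: 30.0, 2: 20.0}  # percent of population per speaker
    print(representative_sample(shares, 10, random.Random(1)))
```

Selecting a single specific speaker corresponds to the degenerate case of a one-element list containing that speaker's index.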

Block 4, the sound selection block shown in Fig. 3, consists of the following modules connected in series: chain formation (4.1), assignment of intonation contours (4.2), allophone naming (4.3), determination of the duration and energy of the sound chain (4.4), and overlay of intonation contours (4.5).

In addition, for forming chains of sounds, the corresponding input of module 4.1 receives a signal from module 4.6 (sound sequencing rules), and the input of module 4.3 receives a signal from module 4.7 (allophone naming rules).

Sound selection block 4 operates in one of two modes: distribution generation mode, and synthesizer mode, in which the sound and its parameters are supplied by interface block 1. These modes are considered in more detail below.

In distribution generation mode, for each typical speaker (whose indices are taken from block 3), module 4.1 forms samples of sounds with the same parameters as in that speaker's speech. To do this, information on the frequency of occurrence of sounds is retrieved by index from the typical-speaker description block (block 7), and chains of sounds from pause to pause are prepared in accordance with the sequencing rules (module 4.6). The lengths of the chains are likewise determined by the typical speaker's parameters.

If for some reason the typical speaker's parameters contain no information on sound frequencies or chain lengths, statistics obtained from a large volume of texts are used instead. These average statistics are part of block 7 (typical speaker descriptions).
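The chain formation step with its fallback to average statistics can be sketched as follows. This is only an illustration under stated assumptions: the dictionary keys `sound_frequency` and `chain_length_frequency`, the `may_follow` callback standing in for the sequencing rules of module 4.6, and the weighted random draw are hypothetical choices, since the text does not specify the data layout or the rules themselves.

```python
import random

def pick_statistics(speaker_description, average_statistics, key):
    """Use the speaker's own statistics when present, else the average ones."""
    own = speaker_description.get(key)
    return own if own else average_statistics[key]

def generate_chain(speaker_description, average_statistics, may_follow, rng):
    """Build one pause-to-pause chain of sound names (hypothetical layout)."""
    sound_freq = pick_statistics(
        speaker_description, average_statistics, "sound_frequency")
    length_freq = pick_statistics(
        speaker_description, average_statistics, "chain_length_frequency")

    # Draw the chain length according to how often each length occurs.
    lengths = list(length_freq)
    target = rng.choices(lengths, weights=[length_freq[n] for n in lengths])[0]

    chain, previous = [], "pause"
    while len(chain) < target:
        # may_follow is an assumed stand-in for the sequencing rules.
        candidates = [s for s in sound_freq if may_follow(previous, s)]
        if not candidates:
            break  # the rules allow no continuation; end the chain early
        previous = rng.choices(
            candidates, weights=[sound_freq[s] for s in candidates])[0]
        chain.append(previous)
    return chain
```

Passing an empty speaker description makes both lookups fall back to the average statistics, which mirrors the case of a voice whose data are incomplete.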

Each chain is assigned an intonation contour (module 4.2). The contour parameters and information on how often each contour is used in speech are taken from the typical speaker descriptions. If the intonation contour parameters, or the distributions of sound durations or energy, are missing from a given speaker's description, average values are used; these are also part of block 7.

Taking into account the context (the sounds to the left and right) and the allophone naming rules (module 4.7), module 4.3 converts the sound names into allophone names.

Durations and energies are assigned to all sounds of the chain on the basis of the typical speaker's parameters (module 4.4). The intonation contour (IC) is then superimposed. Applying the intonation contours to the sound chains (module 4.5) determines the pitch of each sound.

As a result, all the required prosodic parameters are known at the output of module 4.5: duration, pitch, energy, and the typical speaker identifier. The generated prosody is stored in block 5. When prosody generation is complete, block 4 reports execution of the command to interface block 1.

The distribution of the prosodic parameters of the sounds in the chains matches their distribution in real speech.

In synthesizer mode, the input of sound selection block 4 receives either individual sounds or chains of sounds from pause to pause. When an individual sound is selected, all prosodic parameters (duration, fundamental frequency (F0), energy) are specified by block 1. Block 4 merely names the allophone using the context "pause"-"sound"-"pause", passes the description to block 5 for storage, and reports execution of the command to block 1.

When a sequence of sounds is generated, block 4 forms the prosody according to the parameters of the intonation contour, determines the allophone names, and passes them to the input of block 5. It then reports execution of the command to block 1.

Different procedures are used to determine sound durations in synthesizer mode and in sample generation mode. In sample generation, the durations are drawn from a statistical distribution, whereas synthesizer mode uses strict rules and the central values of the distributions. The rules for determining durations depend on the specific implementation (module 4.6).

Block 6, the speech stream formation block shown in Fig. 4, consists of modules connected in series: 6.1, formation of sound durations; 6.2, modification of the fundamental frequency (F0); 6.3, formation of the amplitude envelope; 6.4, processing of the junctions between sounds; 6.5, conversion to the format specified by interface block 1.

On receiving the prosody (from block 5) and the audio stream format (from block 1), speech stream formation block 6 retrieves labeled sound samples from the typical speaker description (block 7). In module 6.1, each sound is brought to the duration specified by the prosodic parameters. Different duration-modification algorithms (strategies) are used for different types of sounds so as to minimize distortion of sound quality. The specific algorithms are implementation details.

Once the allophone durations have been formed, the allophones are brought to the specified fundamental frequencies (F0) in module 6.2. F0 does not stay constant over the whole allophone but varies according to the movement specified in the prosodic parameters. To minimize distortion, the F0 of different types of sounds may be modified with different algorithms. The specific algorithms are implementation details.

Next, using the energy parameters specified in the prosody, the amplitude envelope of the chain's sounds is formed (module 6.3), and the junctions between sounds are morphed to minimize noise at the joins (module 6.4).

The audio signal is converted to the format specified by interface block 1 and passed to that block for subsequent use. The format conversion is performed in module 6.5.

The speech stream formed in block 6 enters interface block 1, which passes it to the User as it arrives. On receiving a report from block 6, the User is informed that formation of the speech stream is complete.

An implementation of the statistical speech model is described below. It was carried out on a personal computer in accordance with the module descriptions and block interaction diagrams given earlier. Software and computer hardware serve as the blocks of the model. Software may be developed to control the operation of individual blocks of the model.

For clarity, definitions of the terms used in describing the operation of the statistical speech model are given below.

Allophone - (from the Greek allos, "other", and phone, "sound") a variant of a phoneme determined by its phonetic environment.

Phoneme - (from the Greek phonema, "sound") the basic unit of the sound system of a language; the minimal element isolated by linear segmentation of speech.

Syntagma - (from the Greek syntagma, literally "built together, joined") in the broad sense, any sequence of linguistic elements linked by a determined-determining relation. In a narrower sense, a syntagma is a phrase isolated within a sentence (predicative, attributive, object syntagma, and so on), and a sentence is a chained sequence of syntagmas. Put more simply, a syntagma is a sequence of sounds from pause to pause.

Sonant - among the voiced consonants, a group of sonorant consonants, or sonants, is distinguished (for example, Russian l, r, m, n, y). They differ from noise consonants (both voiced and voiceless) in having a clear formant structure. This brings sonants close to vowels, although sonants have lower overall energy. The sonorants include, in particular, the nasals.

Fricative, plosive, affricate - by the nature of the noise-producing obstruction, consonants are divided into stops, fricatives, and trills. Stops are formed by the closure of two active articulators, for example the lower and upper lips (p, b, m), or of an active articulator against a passive one, for example the tongue against the palate (t, d, k). The closure may end with an abrupt release (a burst) or with a gradual release that passes into a narrow opening. The first case yields plosive consonants (p, b, t, d); the second yields affricates, for example Russian ts and ch, which are in effect compound sounds, since they have both a stop element and a fricative element.

Format - here, the sampling rate of the audio signal and the number of bits per sample.

The operation of the speaker selection block (block 2) is now considered in more detail.

The statistics on the composition and characteristics of the population are based on data obtained by Goskomstat of Russia from the 2002 All-Russian Population Census. For simplicity, this particular implementation uses only information on the age and sex composition of the population in module 2.2.

Accordingly, two criteria, sex and age, were used to link the typical speakers to the population statistics. Six typical speakers, defined rather loosely, were introduced.

Table 1

Speaker No.   Sex   Age                  Speaker No.   Sex   Age
1             m     below working age    4             f     below working age
2             m     of working age       5             f     of working age
3             m     above working age    6             f     above working age

The procedure for forming the sample of typical speakers, implemented in block 2, works as follows:

1. From the statistics on the age and sex composition of the population, the percentage of the population corresponding to each typical speaker is determined;

2. The percentages are converted to integers (by multiplying by 10);

3. The values are reduced: the greatest common divisor (GCD) of all the percentage values is found, and every value is divided by it;

4. The sum of the values (N_TD, where TD denotes a typical speaker) is computed and a uniform random number generator is set up. This sum equals the sample length, so a reasonable limit must be placed on the precision with which the percentages are converted to integers;

5. Intervals of values corresponding to the typical speakers are built ([0, N_TD1[, [N_TD1, N_TD1+N_TD2[, ...);

6. N_TD values are drawn from the random number generator. Each value that falls within an interval adds the corresponding typical speaker to the sample, which is passed to block 3 as the indices of the selected speakers.
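The six steps above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the model: the function names are hypothetical, the input is assumed to be a mapping from speaker number to population percentage with at least one non-zero entry, and a uniform integer draw stands in for the random number generator.

```python
import random
from functools import reduce
from math import gcd

def build_intervals(percent_by_speaker, scale=10):
    """Steps 2-5: integer weights, GCD reduction, half-open intervals."""
    # Step 2: convert the percentages to integers (multiply by 10).
    weights = {ts: round(p * scale) for ts, p in percent_by_speaker.items()}
    # Step 3: divide every value by their greatest common divisor.
    divisor = reduce(gcd, weights.values())
    # Step 5: lay the intervals [lower, upper) end to end.
    intervals, lower = [], 0
    for ts, weight in weights.items():
        upper = lower + weight // divisor
        intervals.append((lower, upper, ts))
        lower = upper
    return intervals

def draw_speaker_sample(percent_by_speaker, rng=None):
    """Step 6: draw N_TD values; each one selects a typical speaker."""
    rng = rng or random.Random()
    intervals = build_intervals(percent_by_speaker)
    n_total = intervals[-1][1]  # step 4: sum of reduced values = sample length
    sample = []
    for _ in range(n_total):
        value = rng.randrange(n_total)  # uniform over 0 .. n_total - 1
        for lower, upper, ts in intervals:
            if lower <= value < upper:
                sample.append(ts)
                break
    return sample
```

Because the sample length equals the sum of the reduced values, percentages that share a large common divisor give a short sample, while percentages kept to one decimal place with no common divisor can give a sample of up to 1000 entries.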

Depending on the command from the interface block (block 1), speaker selection block 2 either selects one specific speaker or generates a sample according to the algorithm described above (module 2.1). The statistical data (module 2.2) are organized as a data structure that is part of the database (block 7). The generated sample is saved in block 3 as a file.

Sound selection block 4 is a composite block made up of separate modules that implement its algorithm. On command from interface block 1, module 4.1 forms the sound chains. For this purpose, together with the chain formation command, module 4.1 receives the indices of the typical speakers from block 3 and the description of their parameters from block 7, in particular the sound frequency information, which is individual to each speaker. If for any reason these statistics are unavailable, they can be replaced with statistics obtained from text processing. General statistics naturally cannot model a typical speaker's parameters in full, but they make it possible to work with voices for which the data have not been completely prepared.

The sound chains are formed subject to the sound sequencing rules (module 4.6), which ensure that chains of sounds from pause to pause are generated in accordance with the sound frequencies. The formed chains are fed to module 4.2, where an intonation contour (IC) is superimposed on each chain.

The IC corresponds to the identifier of the typical speakers whose indices were held in block 3. In synthesizer mode, the identifier (number) of the intonation contour is a parameter of the command. In distribution generation mode, module 4.2 itself selects the contour numbers from the information in the typical speaker description (taken from block 7) or, if that is missing, from a general table. In general, each speaker has its own number of intonation contours, and that number is always even (half for syntagmas with a tail and half for syntagmas without). Depending on the command and the parameters, module 4.2 assigns each chain an IC identifier and a pause duration, after which the result is passed to the input of allophone naming module 4.3.

Allophones are named according to the naming rules (module 4.7), which replace the sound names with the names of combinatorial allophones (or simply allophones). Depending on the command, the chains are then passed to module 4.4, or to module 4.5 and/or interface block 1.

Module 4.7 can be implemented as a subroutine that defines the allophone naming rules, that is, how a name is built from the sound itself and from the sounds to its left and right (the context). Each sound has a name, which becomes the core of the combinatorial allophone's name. The neighboring sounds supply the context names (naming rules, module 4.7), which are added to the core on the left and right respectively.

For vowel allophones, vowels with different degrees of reduction are treated as different allophones and give different core names. For consonants, the allophones of hard and soft sounds are different sounds and give different core (and context) names. In general form, the name of a combinatorial allophone is written as:

<left context name><core name><right context name>.

If no information is available for an allophone about the sound to its left and/or right, the missing neighbors are replaced by pauses, and the context names are determined on the assumption that the sound is adjacent to pauses.
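The naming scheme can be illustrated with a short sketch. The pause symbol `_` and the choice to use a neighboring sound's own name as its context name are assumptions made for the example; the actual context names are set by the naming rules of module 4.7, which the text does not list.

```python
PAUSE = "_"  # assumed symbol for a pause; not specified in the description

def context_name(sound):
    """Hypothetical rule: a neighbor's context name is simply its own name."""
    return sound

def name_allophones(chain):
    """Return <left context><core><right context> for every sound in a chain."""
    names = []
    for i, core in enumerate(chain):
        # A missing neighbor on either side is treated as a pause.
        left = chain[i - 1] if i > 0 else PAUSE
        right = chain[i + 1] if i + 1 < len(chain) else PAUSE
        names.append(f"{context_name(left)}{core}{context_name(right)}")
    return names
```

For a single isolated sound, `name_allophones(["a"])` yields the pause-sound-pause name `_a_`, the context used when one sound is requested in synthesizer mode.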

Module 4.4 operates according to a known algorithm. It assigns durations and energy coefficients to all sounds in the formed chain, after which the processed sound chains are passed to module 4.5.

Different strategies for determining sound durations are used depending on the operating mode. In sample generation mode, sound durations are determined randomly, using random number generators and statistics on the duration distribution of each sound (which are part of the typical speaker description and are taken from block 7).

In synthesizer mode, sound durations are determined by a more complex algorithm that takes into account the statistical values as well as the chain lengths and the order of the sounds within them.
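The difference between the two strategies can be sketched as follows. Several things here are assumptions: durations are taken to be normally distributed with a per-sound mean and standard deviation in milliseconds, a lower bound keeps random draws positive, and synthesizer mode is reduced to returning the central value, because the rules that depend on chain length and sound order are not described in the text.

```python
import random

def assign_durations(chain, duration_stats, mode, rng=None, min_ms=10.0):
    """Assign a duration in ms to each sound (illustrative only).

    duration_stats maps a sound name to an assumed (mean_ms, std_ms) pair.
    """
    rng = rng or random.Random()
    durations = []
    for sound in chain:
        mean_ms, std_ms = duration_stats[sound]
        if mode == "generation":
            # Sample generation: draw from the sound's duration distribution.
            durations.append(max(min_ms, rng.gauss(mean_ms, std_ms)))
        elif mode == "synthesizer":
            # Synthesizer: start from the central value of the distribution;
            # the context-dependent rules are omitted in this sketch.
            durations.append(mean_ms)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return durations
```

With this split, repeated runs in generation mode reproduce the spread of durations found in the statistics, while synthesizer mode gives the same output every time for the same input.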

Next, module 4.5 determines, by a known algorithm, the pitch values on all vowels and voiced consonants in the chains and assigns them the parameters of the intonation contours, which are part of the typical speaker description and are taken from block 7. After the last chain has been processed, a message reporting completion of the command is sent to block 1.

The resulting parameters (descriptions) of the sound chains, with the assigned pitch values and the superimposed IC, are passed to block 5 and/or interface block 1, depending on the command parameters.

Block 5 is a memory block that stores the descriptions (prosody) of the sounds produced by block 4, so its operation needs no further explanation.

Speech stream formation block 6 is likewise a composite block made up of separate modules that implement its algorithm. The control signal from interface block 1 enters module 6.1, which analyzes the command, reads the prosody (the description of the sound chains) from block 5 and, if necessary, obtains the typical speaker description from block 7. Formation of the speech stream then begins, taking into account the format specified by interface block 1.

For each sound, on the basis of the prosody, module 6.1 selects the labeling of the audio signal from the typical speaker description and uses the labels to form the required sound duration. Different algorithms are used to modify the duration depending on the type of sound (plosives, fricatives, sonants, vowels, affricates).

Once all the labels to be included in the resulting signal have been determined, each label is assigned a specific period length (if the sound is a vowel or a voiced consonant). The period lengths are determined from the prosodic data and the base pitch value given in the typical speaker description. The list of labels is then passed to module 6.2.

Having received from module 6.1 the list of labels for the sound, the sound type, and the typical speaker identifier, module 6.2 extracts the audio data for each label from the typical speaker description and then brings the period lengths to the specified values. The well-known PSOLA algorithm (TD-PSOLA) is used for this, with some changes and additions. This algorithm was preferred because it provides sufficiently high quality of the converted voice and does not require significant computing resources.

One difference from the standard algorithm is that different window functions are used for raising and lowering the fundamental frequency (F0). When the period length is shortened (F0 is raised), the following cosine forward and reverse windows are used:

and when the period length is increased (F0 is lowered), the following linear forward and reverse windows are used:

NData = min(NOldData, NNewData);

Wca - weighting coefficients of the forward window;

Wcb - weighting coefficients of the reverse window;

i - coefficient index;

NOldData - length of the original period;

NNewData - length of the period being formed.

Another difference from the standard algorithm is that the samples of the generated period are normalized by the sum of the corresponding forward and reverse window coefficients. In addition, the pitch period is lengthened not in one step but in several.
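The period length change described above can be sketched as follows. This is only an illustration under stated assumptions: the window formulas are not reproduced in this text, so the raised-cosine and linear ramps, the way the start-aligned and end-aligned copies of the period are overlapped, and the per-step growth limit in `change_period_length` are all hypothetical choices, not the patented formulas.

```python
import math

def cosine_windows(n):
    """Forward (falling) and reverse (rising) raised-cosine weights, length n (assumed shape)."""
    if n < 2:
        return [1.0] * n, [1.0] * n
    wca = [0.5 * (1.0 + math.cos(math.pi * i / (n - 1))) for i in range(n)]
    wcb = [1.0 - w for w in wca]
    return wca, wcb

def linear_windows(n):
    """Forward (falling) and reverse (rising) linear weights, length n (assumed shape)."""
    if n < 2:
        return [1.0] * n, [1.0] * n
    wcb = [i / (n - 1) for i in range(n)]
    wca = [1.0 - w for w in wcb]
    return wca, wcb

def resize_period(old, n_new):
    """One step: overlap a start-aligned and an end-aligned copy of the period."""
    n_old = len(old)
    if n_new < 1 or n_new > 2 * n_old:
        raise ValueError("one step supports 1 <= n_new <= 2 * len(old)")
    if n_new == n_old:
        return list(old)
    n_data = min(n_old, n_new)          # NData = min(NOldData, NNewData)
    shortening = n_new < n_old
    wca, wcb = cosine_windows(n_data) if shortening else linear_windows(n_data)
    shift = n_new - n_old               # offset of the end-aligned copy
    out = []
    for i in range(n_new):
        j = i - shift                   # index into the end-aligned copy
        terms = []
        if i < n_old:
            terms.append((wca[i], old[i]))
        if 0 <= j < n_old:
            terms.append((wcb[i if shortening else j], old[j]))
        weight_sum = sum(w for w, _ in terms)
        if weight_sum > 0.0:
            # Normalize by the sum of the forward and reverse coefficients.
            out.append(sum(w * s for w, s in terms) / weight_sum)
        else:
            # Both weights vanish: fall back to a plain average.
            out.append(sum(s for _, s in terms) / len(terms))
    return out

def change_period_length(old, n_new):
    """Shorten in one step; lengthen in several steps (step size is hypothetical)."""
    cur = list(old)
    while len(cur) < n_new:
        step_target = min(n_new, len(cur) + max(1, len(cur) // 4))
        cur = resize_period(cur, step_target)
    if len(cur) > n_new:
        cur = resize_period(cur, n_new)
    return cur
```

With linear windows the two weights no longer sum to one once the copies are offset, which is where the normalization by the coefficient sum matters; limiting the growth per step keeps that sum well away from zero.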

After the durations of all periods have been formed, the labels, the audio data, and the sound description (prosody) are passed to module 6.3, envelope formation.

To determine the envelope, module 6.3 uses only the data received from module 6.2 and the data generated at earlier stages of the model's blocks. First, all signal samples are multiplied by the energy coefficient specified in the prosody by module 4.4. Then additional per-period energy coefficients are determined and applied in order to obtain a smooth level transition from the previous sound. After all energy coefficients have been applied, the maximum and minimum amplitude values of the last period are calculated and stored; they are used to determine the additional energy coefficients when the next sound is processed.
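A minimal sketch of the envelope stage follows. The text does not say how the additional per-period coefficients are computed, so the linear gain ramp over the first few periods, the `ramp_periods` parameter, and the function name are assumptions made purely for illustration.

```python
def apply_envelope(periods, energy, prev_extrema=None, ramp_periods=3):
    """Scale a sound (a list of pitch periods) and smooth its level onset.

    periods      -- list of periods, each a list of float samples
    energy       -- energy coefficient taken from the prosody
    prev_extrema -- (max, min) of the previous sound's last period, or None
    ramp_periods -- hypothetical number of periods used for the smooth transition
    """
    # Step 1: multiply every sample by the energy coefficient.
    scaled = [[s * energy for s in p] for p in periods]

    # Step 2: additional per-period coefficients (assumed: linear gain ramp that
    # starts at the previous sound's peak level and reaches 1.0 after the ramp).
    if prev_extrema is not None and scaled and scaled[0]:
        prev_peak = max(abs(prev_extrema[0]), abs(prev_extrema[1]))
        first_peak = max(abs(s) for s in scaled[0])
        if prev_peak > 0.0 and first_peak > 0.0:
            start_gain = prev_peak / first_peak
            n = min(ramp_periods, len(scaled))
            for k in range(n):
                gain = start_gain + (1.0 - start_gain) * k / n
                scaled[k] = [s * gain for s in scaled[k]]

    # Step 3: remember the extrema of the last period for the next sound.
    last = scaled[-1] if scaled else []
    extrema = (max(last), min(last)) if last else prev_extrema
    return scaled, extrema
```

The returned extrema are meant to be fed back as `prev_extrema` when the next sound is processed.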

The results (labels, audio data, and sound description) are passed to module 6.4, which joins the sounds.

Module 6.4 uses only the data received from module 6.3 and the data saved at the previous processing step. If the previous sound and the current sound cannot be morphed (for example, because the current sound is too short), the saved audio data are passed to module 6.5. If morphing is possible, the saved audio data are mixed with the beginning of the current sound, after which the audio data, excluding the part retained for joining with the next sound, are passed to module 6.5.
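The joining logic can be sketched as a small stateful object. The linear cross-fade, the `overlap` length, the test for "too short", and the handling of the current sound when no morph is possible are assumptions; the text only states which data are forwarded in each case.

```python
class SoundJoiner:
    """Joins consecutive sounds, keeping a tail of each for the next join."""

    def __init__(self, overlap):
        self.overlap = overlap   # hypothetical number of samples kept for joining
        self.saved = []          # tail retained from the previous sound

    def push(self, sound):
        """Accept the next sound and return the samples ready for output."""
        sound = list(sound)
        n = len(self.saved)
        if n == 0:
            prefix, body = [], sound
        elif len(sound) < n + self.overlap:
            # Morph impossible (sound too short): emit the saved data unchanged.
            prefix, body = self.saved, sound
        else:
            # Morph: fade the saved tail out while the new sound fades in.
            mixed = []
            for i in range(n):
                w = (i + 1) / (n + 1)
                mixed.append((1.0 - w) * self.saved[i] + w * sound[i])
            prefix, body = [], mixed + sound[n:]
        keep = min(self.overlap, len(body))
        cut = len(body) - keep
        self.saved = body[cut:]          # retained for the next join
        return prefix + body[:cut]

    def flush(self):
        """Emit whatever is still retained at the end of the stream."""
        out, self.saved = self.saved, []
        return out
```

Each successful morph shortens the stream by the length of the saved tail, because the tail and the start of the next sound occupy the same samples.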

Module 6.5 converts the audio signal received from module 6.4 to the specified format, whose parameters arrive in a command from interface unit 1. The audio signal is converted to the sampling rate and the number of bits per sample specified in the format, after which a report that generation of the speech signal has finished is sent to unit 1.
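As an illustration of the format conversion, the sketch below resamples by linear interpolation and quantizes to signed integers. Both choices are assumptions: the text does not name a resampling method, and a practical converter would also low-pass filter before reducing the sampling rate.

```python
def convert_format(samples, src_rate, dst_rate, bits):
    """Resample float samples in [-1.0, 1.0] and quantize to signed integers.

    Linear interpolation is used for simplicity (no anti-aliasing filter).
    """
    n_in = len(samples)
    n_out = (n_in * dst_rate) // src_rate
    full_scale = 2 ** (bits - 1) - 1
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate          # position in the source signal
        i = int(pos)
        frac = pos - i
        a = samples[i]
        b = samples[min(i + 1, n_in - 1)]      # hold the last sample at the end
        value = a + (b - a) * frac
        q = round(value * full_scale)
        # Clip to the representable range for the requested bit depth.
        out.append(max(-full_scale - 1, min(full_scale, q)))
    return out
```

For example, `convert_format(signal, 22050, 8000, 16)` would produce 16-bit samples at 8 kHz from a 22.05 kHz floating-point signal.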

The database (block 7) is a complex database containing TD descriptions and general tables. The TD descriptions and the general tables are created manually.

The database includes the following general tables: intonation contours, energies, durations, and frequencies of occurrence of sounds.

Typical speaker descriptions include:

required components:

- base value of the fundamental (pitch) frequency;

- sound base format flag (allophone/suballophone) and individual parameters of the audio signal processing algorithms;

- sound base consisting of sound fragments corresponding to the various sounds of Russian speech, with the necessary markup into pitch periods, etc.;

- parameters linking the TD to population statistics.

optional components, which are tables of:

- intonation types, energies, durations, and frequencies of occurrence of sounds.

To work with the model, all general tables must be filled in and a description of at least one TD must be created.
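The database layout described above might be modeled as follows. All class, field, and table names are hypothetical, and the rule that a speaker's own optional table takes precedence over the general one is an assumption, not something the text states.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# General tables that must all be filled before the model can be used.
GENERAL_TABLES = ("intonation_contours", "energies", "durations", "sound_frequencies")

@dataclass
class TypicalSpeaker:
    # Required components of a TD description.
    base_f0_hz: float                      # base fundamental frequency
    base_format: str                       # "allophone" or "suballophone"
    algorithm_params: Dict[str, float]     # individual processing parameters
    sound_base: Dict[str, List[float]]     # sound name -> samples (markup omitted)
    population_params: Dict[str, float]    # link to population statistics
    # Optional component: speaker-specific tables.
    own_tables: Dict[str, dict] = field(default_factory=dict)

@dataclass
class SpeechDatabase:
    general_tables: Dict[str, dict] = field(default_factory=dict)
    speakers: Dict[str, TypicalSpeaker] = field(default_factory=dict)

    def is_ready(self) -> bool:
        """All general tables filled and at least one TD described."""
        tables_ok = all(self.general_tables.get(name) for name in GENERAL_TABLES)
        return bool(tables_ok and self.speakers)

    def table_for(self, speaker_id: str, name: str) -> dict:
        """Assumed lookup rule: the speaker's own table, else the general one."""
        own = self.speakers[speaker_id].own_tables.get(name)
        return own if own else self.general_tables[name]
```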

Building the sound base is a long and very laborious manual process. Russian is known to have 6 vowel and 36 consonant sounds. However, there are many more allophones. For instance, the nature of a vowel's reduction depends on its quality and rhythmic position. When building the sound base, the position of the vowel relative to the stress must be taken into account. The following degrees of vowel reduction can be distinguished:

- stressed vowels "a0", "o0", "u0", "e0", "i0", "y0";

- first pretonic "a1", and pretonic "a1", "u1", "e1", "i1", "y1";

- second pretonic "a2";

- post-tonic "a4", "o4", "u4", "e4", "i4", "y4".

However, rhythmic position is not the only factor that affects vowel quality. The phonetic realizations of sounds (both vowels and consonants) also depend on their immediate contexts. Moreover, different sets of contexts can be distinguished for stressed and unstressed vowels. Some sounds that constitute different contexts for stressed vowels constitute the same context for unstressed vowels.

Besides being found, extracted, and named, a fragment corresponding to a speech sound must also be marked up appropriately. The markup differs for different sounds.

Two examples of applying the proposed statistical speech model are given below, although there may be many more. On the basis of the described implementation, a Russian text-to-speech synthesis system was developed (Fig. 5). The statistical model has also been used as a test signal generator in a system for assessing the quality of vocoders and communication channels (Fig. 6).

Speech synthesis

The statistical model was used to build a high-quality text-to-speech synthesis system. High quality here means high naturalness and intelligibility of the synthesized speech.

The general scheme of the developed system is shown in Fig. 5. It comprises an operator, an interface module, a linguistic processor, and the statistical speech model.

The synthesis system operates as follows. The interface module requests the list of typical speakers (TD) from the statistical speech model. The operator then selects one TD from the list, sets the audio signal format, and enters the text to be synthesized. Next, the interface module passes the text to the linguistic processor (LP). The LP splits the text into syntagms, assigns intonation types to them, marks stress, transcribes them, and passes the result to the statistical model.

The statistical model, using its built-in algorithms, generates the prosody and the audio stream. The audio stream and the prosody are passed to the interface module. At the operator's choice, the audio stream can either be played back through the computer's sound card or saved to a file for later use.

The prosodic data can be saved and opened for viewing and editing in an ordinary text editor.

Evaluation of vocoder quality

The statistical model is used as a source (generator) of a test signal for assessing the quality of vocoders and communication systems.

The use of the statistical model in a vocoder quality assessment system is shown in Fig. 6. The system consists of a vocoder, a quality assessment module, the statistical speech model, and files for storing the audio streams (the original one and the one processed by the vocoder).

The quality assessment module issues a command to the statistical speech model to generate an audio stream with parameters characteristic of the population described by the model. The audio stream generated by the model is saved to a file. The file is fed to the vocoder and to the quality assessment module. The audio signal that has passed through the encoding and decoding procedures implemented by the vocoder is also fed to the quality assessment module. The quality assessment module compares the two signals and outputs a score.
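The evaluation loop can be sketched as below. The text does not specify the comparison metric, so the signal-to-noise ratio used here is only a stand-in, and the `generate` and `vocoder` callables are hypothetical placeholders for the statistical speech model and the codec under test.

```python
import math

def snr_db(reference, processed):
    """Signal-to-noise ratio in dB between the original and processed signals."""
    n = min(len(reference), len(processed))
    signal = sum(r * r for r in reference[:n])
    noise = sum((r - p) ** 2 for r, p in zip(reference[:n], processed[:n]))
    if noise == 0.0:
        return math.inf          # signals are identical
    if signal == 0.0:
        return -math.inf         # silent reference, non-silent error
    return 10.0 * math.log10(signal / noise)

def evaluate_vocoder(generate, vocoder):
    """Generate a test signal, pass it through the vocoder, and score the result."""
    original = generate()                 # stands in for the statistical model
    processed = vocoder(original)         # encode followed by decode
    return snr_db(original, processed)

def coarse_quantizer(samples, levels=8):
    """Toy 'vocoder' used only to exercise the loop: rounds to a few levels."""
    return [round(s * levels) / levels for s in samples]
```

A call such as `evaluate_vocoder(lambda: test_signal, coarse_quantizer)` returns a single score; in the described system the original and processed streams would be read from files rather than passed in memory.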

The advantages of the proposed utility model include the following:

- use of a large number of classification features in segmenting and describing the speech corpus;

- the ability to work simultaneously with structural elements of different sizes and formats;

- use of statistical information when generating the audio stream;

- different approaches to changing the duration of sounds of different types, which minimize distortion of the perceptual properties of the sounds;

- different approaches to changing the fundamental frequency of sounds of different types, which minimize distortion of their perceptual properties;

- the ability to select from the database the sounds (or even chains of sounds) that require the least modification;

- the variety of contextual realizations of sounds makes it possible to synthesize a highly natural speech stream;

- high naturalness is further supported by the use of intonation contours specially selected for each TD.

In addition, the statistical speech model is language independent. The language the model works with is determined solely by the data with which the model is populated; all algorithms and interfaces remain unchanged.

The proposed statistical model is currently being validated through use in a text-to-speech synthesis system and in a system for objective assessment of vocoder quality.

Claims

1. A statistical speech model comprising an interface unit connected by corresponding inputs and outputs to a speaker selection unit that forms a sample of speakers, to a sound selection unit that selects sounds and determines their parameters, to a speech stream generation unit that performs operations on elements of speech signals, and to a database containing descriptions of typical speakers, which is also connected to the inputs of these units, characterized in that a population parameter statistics module is additionally included in the speaker selection unit, a speaker sample unit is included between the speaker selection unit and the sound selection unit, a prosody storage unit is additionally included between the sound selection unit and the speech stream generation unit, and rules for naming allophones and for the sequencing of sounds are added to the sound selection unit.

2. The statistical speech model according to claim 1, characterized in that the speaker selection unit consists of the following modules: a speaker sample generator and population parameter statistics, the output of the population parameter statistics module being connected to the input of the speaker sample generator, whose output is connected to the input of the speaker sample unit.

3. The statistical speech model according to claim 1, characterized in that the sound selection unit consists of series-connected modules for chain formation, intonation contour assignment, allophone naming, duration determination, and intonation contour application, as well as modules for sound sequencing rules and allophone naming rules, wherein the outputs of the last two modules are connected to additional inputs of the chain formation module and the allophone naming module, and the additional output of the allophone naming module is connected to the intonation contour application module, whose outputs are connected to the prosody storage unit and the interface unit.

4. The statistical speech model according to claim 1, characterized in that the speech stream generation unit comprises series-connected modules for duration formation, pitch modification, envelope formation, joint processing, and conversion to a given format, wherein the duration formation module is connected to the output of the prosody storage unit, and the format conversion module is connected to the output of the interface unit that specifies the target format, as well as to the corresponding inputs of that unit, which receive the command completion flag and the generated speech stream.