JPH0887295A - Sound source data generating method for voice synthesis - Google Patents

Sound source data generating method for voice synthesis

Info

Publication number
JPH0887295A
JPH0887295A JP6222314A JP22231494A
Authority
JP
Japan
Prior art keywords
sound source
voice
waveform
source data
amplitude
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP6222314A
Other languages
Japanese (ja)
Other versions
JP3289511B2 (en)
Inventor
Kiyoshi Ishida
清 石田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Meidensha Corp
Meidensha Electric Manufacturing Co Ltd
Original Assignee
Meidensha Corp
Meidensha Electric Manufacturing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Meidensha Corp, Meidensha Electric Manufacturing Co Ltd filed Critical Meidensha Corp
Priority to JP22231494A
Publication of JPH0887295A
Application granted
Publication of JP3289511B2
Anticipated expiration
Status: Expired - Fee Related

Abstract

PURPOSE: To provide a sound source data generating method for voice synthesis that eliminates the tone-quality degradation caused by amplitude abnormalities in synthesized waveforms arising from energy control in a voice synthesizer. CONSTITUTION: When the sound source of the sound source data is normalized so that its energy can be controlled according to prosody control, the maximum amplitude value of the original voice waveform used for normalization is corrected for each sound source data item using a previously prepared table (S7). Normalization is then performed by dividing the segmented sound source waveform by the corrected maximum amplitude value (S8). As a result, the energy transition from consonant to vowel and from vowel to consonant becomes smooth, and the tone-quality degradation of the synthesized voice is eliminated. The amplitude correction table is produced by visually inspecting the waveform of synthesized voice generated from sound source data normalized by the maximum amplitude value, identifying the frames (in pitch-period units) whose waveform amplitude is abnormal, and tabulating, for each voice data item, the amplitude correction value that removes the abnormality.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a method of creating sound source data for speech synthesis, for use in a Japanese rule-based speech synthesizer or the like that performs speech synthesis by amplitude control using speech data as the sound source.

[0002]

[Prior Art] FIG. 2 shows the flow of sound source data creation and speech synthesis in a conventional rule-based speech synthesizer. In general, speech synthesis converts input text into a phoneme symbol string by Japanese text processing or the like, generates patterns of time length (speech duration), pitch (tone height), and energy (loudness) for each phoneme by referring to a database or the like, and synthesizes a speech waveform from the speech data on the basis of these prosody control patterns.
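As an illustration of the flow just described, the following sketch shows how input text could be mapped to per-phoneme duration, pitch, and energy patterns before waveform synthesis. All function names, table entries, and numeric values are hypothetical stand-ins, not taken from the patent or any specific system.

```python
from dataclasses import dataclass

@dataclass
class Prosody:
    duration_ms: float  # time length (speech duration)
    pitch_hz: float     # pitch (tone height)
    energy: float       # energy (loudness), later used for amplitude control

def text_to_phonemes(text):
    # Stand-in for the Japanese text-processing stage that yields a phoneme
    # symbol string; real systems use morphological analysis and reading rules.
    return list(text)

def prosody_patterns(phonemes):
    # Stand-in for the database lookup that yields duration, pitch and energy
    # patterns per phoneme; the values below are illustrative only.
    table = {
        "a": Prosody(120.0, 220.0, 1.0),
        "m": Prosody(80.0, 200.0, 0.4),
        "i": Prosody(100.0, 240.0, 0.9),
    }
    return [table.get(p, Prosody(100.0, 200.0, 0.5)) for p in phonemes]

print(prosody_patterns(text_to_phonemes("amai")))
```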

[0003] Here, the sound source data consist of a vocal tract cross-sectional area pattern, which represents the vocal tract characteristics from the human throat to the mouth, and a sound source, and take the CV-VC data format (where C is a consonant and V is a vowel). FIGS. 3 and 4 show example speech waveforms of "ma": FIG. 3 shows the transition from /M/ (consonant C) to /A/ (vowel V), and FIG. 4 shows the overall shape. In FIG. 3, the pitch-period frames F1 to F6 are the consonant /M/, and the frames from F7 onward are the vowel /A/.

[0004] The sound source data are created by analyzing the target waveform of the speech to be synthesized (the waveform of natural speech, i.e. the original sound) and extracting, for each pitch-period frame (F1 to F14), a vocal tract cross-sectional area pattern and a first-order sound source. The first-order sound source is normalized by dividing it by the maximum amplitude value H of the analyzed waveform, and these vocal tract cross-sectional area patterns and normalized sound sources are stored in a database or the like and used as the sound source data for speech synthesis.
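A minimal sketch of this conventional normalization, assuming the first-order source and the analyzed original waveform are available as NumPy arrays (the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def normalize_source_conventional(first_order_source, original_waveform):
    # Conventional method: divide the extracted first-order sound source by
    # the maximum amplitude value H of the analyzed original waveform.
    h = np.max(np.abs(original_waveform))
    return first_order_source / h

# Each pitch-period frame (F1, F2, ...) would contribute one
# (vocal tract cross-sectional area pattern, normalized source) pair.
```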

[0005] Since speech synthesis performs amplitude control under the control of an energy control unit, the sound source prepared as sound source data is, as described above, the first-order sound source obtained from the analysis of the original waveform, normalized by dividing it by the maximum amplitude of the corresponding original speech waveform. Through this amplitude control, the normalized sound source is multiplied by the synthesis-time energy E', and the result is passed through a filter or the like having the vocal tract characteristics of the vocal tract cross-sectional area pattern of the corresponding speech data to become the synthesized speech.
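The amplitude control step could be sketched as below. The all-pole coefficients stand in for the vocal tract characteristics derived from the cross-sectional area pattern, which the patent does not spell out, so that part is an assumption made purely for illustration:

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_frame(normalized_source, energy, vocal_tract_a):
    # Amplitude control: scale the normalized source by the synthesis-time
    # energy E'.
    excitation = energy * normalized_source
    # Stand-in vocal tract filter: an all-pole filter whose denominator
    # coefficients vocal_tract_a are illustrative only.
    return lfilter([1.0], vocal_tract_a, excitation)

# One frame with a stand-in normalized source and a toy one-pole filter.
frame = synthesize_frame(np.random.randn(80) * 0.1, 0.8, [1.0, -0.95])
```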

[0006]

[Problems to be Solved by the Invention] However, because the conventional sound source normalization method described above normalizes by the maximum amplitude of the original waveform, a gap can appear at the waveform level during synthesis depending on the shape of the waveform. Taking FIG. 3 as an example, when a sinusoidal waveform such as /M/ and a complex waveform such as /A/ are normalized by the same criterion, the waveform transition of the original sound is often impaired, which leads to deterioration in sound quality. Specifically, if the consonant portion ends up with a larger amplitude than the vowel portion, intelligibility is lost, and if a more local waveform amplitude abnormality appears, a loud abnormal sound results. FIG. 5 shows an example waveform of the synthesized speech "amai": at the boundary between /A/ and /M/ the amplitude of the consonant /M/ is abnormally large, and the consonant /M/ as a whole also has a larger amplitude than the vowel /A/. Besides the normalization problem, thinning performed for time-length control and the like are also considered to be causes of such abnormalities.

[0007] The present invention has been made to solve the above problems, and its object is to provide a method of creating sound source data for speech synthesis that eliminates the sound quality degradation caused by amplitude abnormalities in the synthesized waveform that accompany energy control in a rule-based speech synthesizer or the like.

[0008]

[Means for Solving the Problems] To achieve the above object, the method of creating sound source data for speech synthesis according to the present invention is a method in which a sound source extracted by analyzing the speech waveform of the natural speech to be synthesized is normalized by the maximum amplitude value of that speech waveform, and the normalized sound source forms part of the sound source data. The method is characterized in that synthesized speech created using the sound source data is observed and, where an amplitude abnormality is found, an amplitude correction value that eliminates the abnormality is tabulated in advance; when the sound source is normalized, the corresponding amplitude correction value is read from the table to correct the maximum amplitude value, and the sound source is normalized with the corrected maximum amplitude value.

[0009]

[Operation] In the method of creating sound source data for speech synthesis according to the present invention, when the sound source of the sound source data is normalized so that its energy can be controlled according to prosody control, the maximum amplitude value of the natural speech waveform used for normalization is corrected for each sound source datum using a table prepared in advance by analyzing the waveforms of speech synthesized from sound sources normalized by that maximum amplitude value. This smooths the consonant-to-vowel and vowel-to-consonant energy transitions and eliminates the sound quality degradation of the synthesized speech.

[0010]

[Embodiments] Embodiments of the present invention will now be described in detail with reference to the drawings.

[0011] FIG. 1 is a flow chart of sound source data creation according to an embodiment of the present invention. In the figure, S1 to S9 denote processing steps. First, the original sound (natural speech) of the target speech to be synthesized is recorded, for example by an announcer (S1); the recording is made on tape or the like as an analog signal. Next, the recorded original sound is played back, A/D converted, sampled at an appropriate sampling frequency, and stored as a file (S2). It is preferable to cut unnecessary frequency components from the original sound by passing it through a low-pass or high-pass filter, or by filtering after the A/D conversion. Next, various control points are determined manually, for example by visually inspecting the filed speech waveform (S3); these control points include the waveform cut-out range, the C/V boundary, and parameters for prosody control and the like. Next, waveform mixing is performed as preprocessing (S4) to improve the connectivity between items of waveform data. Next, waveform analysis is performed for each target waveform (S5), and a vocal tract cross-sectional area pattern and a first-order sound source are extracted for each frame.
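Step S2 and the pitch-period framing implied by S3 and S5 might look like the sketch below; the cutoff frequency, filter order, and the representation of control points as sample indices are assumptions made for illustration only:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(x, fs, cutoff_hz=8000.0):
    # Cut unnecessary high-frequency components after A/D conversion (S2).
    b, a = butter(4, cutoff_hz / (fs / 2.0), btype="low")
    return filtfilt(b, a, x)

def cut_pitch_frames(x, pitch_marks):
    # Split the filed waveform into pitch-period frames F1, F2, ... using
    # manually determined control points (S3); the marks are assumed here
    # to be sample indices.
    return [x[s:e] for s, e in zip(pitch_marks[:-1], pitch_marks[1:])]

fs = 22050
signal = np.random.randn(fs)  # stand-in for one second of recorded original sound
frames = cut_pitch_frames(lowpass(signal, fs), [0, 180, 365, 540])
```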

[0012] Next, the waveform data are cut out (S6) and normalized. Whereas normalization was conventionally performed automatically using the maximum amplitude value of the waveform, in this embodiment the amplitude correction value (a multiplier) tabulated in advance is read out when the data are cut out and is multiplied by the maximum amplitude value used for normalizing the sound source data (S7), and the first-order sound source is divided by the corrected value to perform the normalization (S8). In this way, sound source data consisting of a vocal tract cross-sectional area pattern and a normalized sound source are finally obtained (S9).
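A minimal sketch of the corrected normalization of steps S7 and S8. The table is assumed here to be keyed by a (sound-source datum, frame) pair with a default multiplier of 1.0, following the description in the next paragraph; the key layout and names are illustrative only:

```python
import numpy as np

def normalize_source_corrected(first_order_source, original_waveform,
                               correction_table, datum, frame):
    # S7: read the tabulated amplitude correction value (multiplier) for this
    # sound-source datum and frame, and apply it to the maximum amplitude
    # value used for normalization (1.0 means no correction is needed).
    h = np.max(np.abs(original_waveform))
    corrected_h = h * correction_table.get((datum, frame), 1.0)
    # S8: divide the first-order sound source by the corrected value.
    return first_order_source / corrected_h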

[0013] The amplitude correction value table is created as follows. The waveforms of synthesized speech produced from sound source data normalized by the maximum amplitude value, as in the conventional method, are checked visually or by similar means; frames whose waveform amplitude is abnormal are identified in pitch-period units; and an amplitude correction value suitable for eliminating each abnormality is tabulated for each item of speech data. Setting the multiplier to 1.0 for frames that require no correction simplifies the correction processing.
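One possible layout for the amplitude correction table described above; the keys, frame labels, and numeric factors are purely illustrative, since the actual values are chosen by hand after inspecting the synthesized waveforms:

```python
# Correction multipliers entered by hand after checking synthesized waveforms;
# frames that need no correction are simply left at the default of 1.0.
amplitude_correction = {
    # (sound-source datum, pitch-period frame) -> multiplier
    ("MA", "F5"): 1.8,
    ("MA", "F6"): 1.4,
}

def correction_factor(datum, frame):
    return amplitude_correction.get((datum, frame), 1.0)
```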

[0014] In the conventional method, in which normalization is performed automatically with the maximum amplitude of the waveform when the sound source data are created and the patterns obtained by prosody control are applied at synthesis time, the consistency between consonants and vowels can be poor depending on the waveform shapes, and the balance of the overall energy transition of the synthesized speech can be lost. In this embodiment, the maximum amplitude value used for normalizing the sound source data is corrected by a table created in advance by analyzing speech waveforms synthesized from sound source data produced by the conventional method. A very smooth amplitude transition can therefore be achieved at the level of the synthesized waveform, and the sound quality degradation caused by amplitude abnormalities in the synthesized waveform that accompany energy control in a rule-based speech synthesizer can be eliminated.

[0015]

[Effects of the Invention] As is apparent from the above description, the method of creating sound source data for speech synthesis according to the present invention eliminates the synthesized-waveform amplitude abnormalities that arise from poor normalization during the normalization performed for energy control when creating the sound source data used for synthesized speech, and thus eliminates the degradation of sound quality.

[Brief Description of the Drawings]

FIG. 1 is a flow chart of sound source data creation according to an embodiment of the present invention.

FIG. 2 is a flow chart of sound source data creation and speech synthesis in the conventional technique.

FIG. 3 is a diagram showing the C-V transition portion of an example speech waveform.

FIG. 4 is a diagram showing the whole of the above example speech waveform.

FIG. 5 is a diagram showing an example waveform of synthesized speech produced by the conventional technique.

[Explanation of Symbols]

S1 to S9: processing steps

Claims (1)

[Claims]

Claim 1. A method of creating sound source data for speech synthesis in which a sound source extracted by analyzing the speech waveform of natural speech to be synthesized is normalized by the maximum amplitude value of the speech waveform and the normalized sound source forms part of the sound source data, the method comprising: observing synthesized speech created using the sound source data and, where an amplitude abnormality is found, tabulating in advance an amplitude correction value that eliminates the abnormality; and, when normalizing the sound source, reading the corresponding amplitude correction value from the table, correcting the maximum amplitude value, and normalizing the sound source with the corrected maximum amplitude value.
JP22231494A 1994-09-19 1994-09-19 How to create sound source data for speech synthesis Expired - Fee Related JP3289511B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP22231494A JP3289511B2 (en) 1994-09-19 1994-09-19 How to create sound source data for speech synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP22231494A JP3289511B2 (en) 1994-09-19 1994-09-19 How to create sound source data for speech synthesis

Publications (2)

Publication Number Publication Date
JPH0887295A true JPH0887295A (en) 1996-04-02
JP3289511B2 JP3289511B2 (en) 2002-06-10

Family

ID=16780423

Family Applications (1)

Application Number Title Priority Date Filing Date
JP22231494A Expired - Fee Related JP3289511B2 (en) 1994-09-19 1994-09-19 How to create sound source data for speech synthesis

Country Status (1)

Country Link
JP (1) JP3289511B2 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6993484B1 (en) 1998-08-31 2006-01-31 Canon Kabushiki Kaisha Speech synthesizing method and apparatus
WO2011118207A1 (en) * 2010-03-25 2011-09-29 日本電気株式会社 Speech synthesizer, speech synthesis method and the speech synthesis program
CN102822888A (en) * 2010-03-25 2012-12-12 日本电气株式会社 Speech synthesizer, speech synthesis method and the speech synthesis program
JPWO2011118207A1 (en) * 2010-03-25 2013-07-04 日本電気株式会社 Speech synthesis apparatus, speech synthesis method, and speech synthesis program
CN111554266A (en) * 2019-02-08 2020-08-18 夏普株式会社 Voice output device and electrical equipment

Also Published As

Publication number Publication date
JP3289511B2 (en) 2002-06-10

Similar Documents

Publication Publication Date Title
US5278943A (en) Speech animation and inflection system
US4862503A (en) Voice parameter extractor using oral airflow
EP2450887A1 (en) Voice converter with extraction and modification of attribute data
JPH031200A (en) Regulation type voice synthesizing device
EP1239463B1 (en) Voice analyzing and synthesizing apparatus and method, and program
JP2001109500A (en) Voice synthesis device and voice synthesis method
JPH05307399A (en) Voice analysis system
US6832192B2 (en) Speech synthesizing method and apparatus
JP3289511B2 (en) How to create sound source data for speech synthesis
Kreiman et al. Analysis and synthesis of pathological voice quality
US6829577B1 (en) Generating non-stationary additive noise for addition to synthesized speech
JP5075865B2 (en) Audio processing apparatus, method, and program
Holmes Copy synthesis of female speech using the JSRU parallel formant synthesiser.
JP2000010597A (en) Speech transforming device and method therefor
JP2000003200A (en) Voice signal processor and voice signal processing method
JP2900454B2 (en) Syllable data creation method for speech synthesizer
JP3949828B2 (en) Voice conversion device and voice conversion method
JP3967571B2 (en) Sound source waveform generation device, speech synthesizer, sound source waveform generation method and program
JP3294192B2 (en) Voice conversion device and voice conversion method
US6418406B1 (en) Synthesis of high-pitched sounds
JP3447220B2 (en) Voice conversion device and voice conversion method
JP2008262140A (en) Musical pitch conversion device and musical pitch conversion method
JP2000003187A (en) Method and device for storing voice feature information
JP3540160B2 (en) Voice conversion device and voice conversion method
JP2000099100A (en) Voice conversion device

Legal Events

Date Code Title Description
LAPS Cancellation because of no payment of annual fees