JPH0990970A

JPH0990970A - Speech synthesis device

Info

Publication number: JPH0990970A
Application number: JP7241460A
Authority: JP
Inventors: Toshio Hirai (平井 俊男); Yoshinori Sagisaka (匂坂 芳典); Norio Higuchi (樋口 宜男)
Original assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Current assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Priority date: 1995-09-20
Filing date: 1995-09-20
Publication date: 1997-04-04
Anticipated expiration: 2015-09-20
Also published as: JP2880433B2

Abstract

PROBLEM TO BE SOLVED: To synthesize the speech of one designated speaker by controlling the magnitude of an accent phrase based on the accent type of the accent phrase to be synthesized and on the position of the accent phrase in the sentence. SOLUTION: The device has a parameter sequence generation part 1, which generates a feature parameter sequence for speech synthesis from an input character string, and a speech synthesis part 2, which generates a speech signal from the generated feature parameter sequence and outputs it to a loudspeaker 3. The input character string is converted into the voice of a designated speaker by using F₀ control rules 31-33, which are created for each speaker and control the pitch frequency of the voice. The F₀ control rules 31-33 control the magnitude of a phrase based on the number of morae of the phrase to be synthesized and the number of morae of the preceding phrase, and control the magnitude of an accent phrase based on its accent type and its position in the sentence, thereby controlling the pitch frequency of the speech.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech synthesis device that synthesizes speech based on an input character string.

[0002]

[Prior Art] To model the pitch frequency, which is the fundamental frequency of speech (hereinafter, the F₀ frequency), a superposition model has long been used because it can parameterize the time-series pattern of the F₀ frequency (hereinafter, the F₀ pattern) efficiently with a small number of parameters (see, for example, Reference 1: Fujisaki et al., "A model for the fundamental frequency patterns of Japanese word accent and their generation mechanism," Journal of the Acoustical Society of Japan, Vol. 27, No. 9, pp. 445-453, September 1971). In this superposition model, the F₀ pattern is treated as the sum of a phrase component (also called the intonation component), which falls gently from the beginning to the end of a phrase, and an accent component corresponding to each accent phrase. Parameterizing the F₀ pattern with the superposition model has the following advantages.
(1) The model uses few free parameters, so F₀ control is easy to optimize by statistical analysis.
(2) Because the F₀ pattern is separated into two components, the phrase component and the accent component, before optimization, the control rules obtained from the optimization are relatively easy to interpret.
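For reference, the superposition described above is usually written as follows in the literature on this model. The notation is assumed here and does not appear in this publication: F_b is a baseline frequency, A_{p,i} and A_{a,j} are the magnitudes of the I phrase commands and J accent commands, G_p and G_a are the corresponding response functions, and T_{0,i}, T_{1,j}, T_{2,j} are command times. The sum is taken on the logarithm of F₀, as in the usual formulation.

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i=1}^{I} A_{p,i}\, G_p\!\left(t - T_{0,i}\right)
  + \sum_{j=1}^{J} A_{a,j}\left[ G_a\!\left(t - T_{1,j}\right) - G_a\!\left(t - T_{2,j}\right) \right]
```

The first sum is the phrase component and the second is the accent component.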

[0003] In addition, to diversify rule-synthesized speech, conversion rules among three speaking styles (normal, commercial, and read-aloud), hereinafter called the conventional example, have been proposed, for example in Reference 2: Abe et al., "Changes in speaking style and their evaluation," Proceedings of the Acoustical Society of Japan meeting, 3-P-18, October 1993. In this conventional example, the parameters of formant frequency, duration, fundamental frequency, and power are converted to prepare a total of five speaking styles: speech in each of the normal, commercial, and read-aloud styles, speech converted from the normal style to the commercial style, and speech converted from the normal style to the read-aloud style. The similarity among them is then evaluated.

[0004]

[Problem to Be Solved by the Invention] However, although the conventional example described above targets the conversion of speaking style, it synthesizes speech without considering speaker individuality. That is, the accent height associated with each accent type differs from person to person, and the conventional example cannot synthesize the voice of one designated speaker.

[0005] An object of the present invention is to solve the above problem and to provide a speech synthesis device capable of synthesizing the voice of one designated speaker.

[0006]

[Means for Solving the Problem] The speech synthesis device according to claim 1 of the present invention is a speech synthesis device that synthesizes speech based on an input character string, characterized by comprising conversion means for converting the input character string into the voice of a speaker designated in advance, using control rules that are created for each speaker and control the pitch frequency of speech.

[0007] The speech synthesis device according to claim 2 is the speech synthesis device according to claim 1, characterized in that the control rules control the pitch frequency of speech by controlling the magnitude of a phrase based on the number of morae of the phrase to be synthesized and the number of morae of the phrase preceding it, and by controlling the magnitude of an accent phrase based on the accent type of the accent phrase to be synthesized and the position of that accent phrase in the sentence.

[0008] Furthermore, the speech synthesis device according to claim 3 is the speech synthesis device according to claim 1 or 2, further comprising learning means for generating the control rules, wherein the learning means comprises: extraction means for extracting a pitch frequency pattern of speech from speech data; generation means for generating model parameters of a critical control model from the extracted pitch frequency pattern, using an analysis method based on the critical control model; and rule generation means for generating control rules that control the pitch frequency of speech, based on the extracted pitch frequency pattern and the generated model parameters of the critical control model.

[0009]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of the speech synthesis device of this embodiment, which has an F₀ control rule learning unit 20 that generates F₀ control rules for controlling the F₀ frequency. In FIG. 1, the speech synthesis device comprises a parameter sequence generation unit 1 and a speech synthesis unit 2. The parameter sequence generation unit 1 generates a feature parameter sequence for speech synthesis from an input character string, using the selectively connected F₀ control rule of one speaker (one of 31, 32, and 33), a voice quality control rule 41, and a phoneme duration control rule 42. The speech synthesis unit 2 generates a speech signal from the generated feature parameter sequence and outputs it to a loudspeaker 3. This embodiment is characterized in particular by converting the input character string into the voice of a speaker designated in advance, using the F₀ control rules 31, 32, and 33, which are created for each speaker and control the pitch frequency of speech. The F₀ control rules 31, 32, and 33 control the pitch frequency of speech by controlling the magnitude of a phrase based on the number of morae of the phrase to be synthesized and the number of morae of the phrase preceding it, and by controlling the magnitude of an accent phrase based on the accent type of the accent phrase to be synthesized and its position in the sentence.

[0010] A working memory 21, used as a work area when executing the F₀ control rule learning process described in detail later, is connected to the F₀ control rule learning unit 20. One of the speech data sets 11, 12, and 13 of speakers A, B, and C is selectively connected to the F₀ control rule learning unit 20 via a switch SW1, while one of the F₀ control rules 31, 32, and 33 of speakers A, B, and C is selectively connected to it via a switch SW2. The F₀ control rule learning unit 20 controls the switches SW1 and SW2 in an interlocked manner so that the speech data and the F₀ control rule of the same speaker are connected at the same time. In addition, one of the F₀ control rules 31, 32, and 33 of speakers A, B, and C is selectively connected to the parameter sequence generation unit 1 via a switch SW3. The operator sets the switch SW3 to select the F₀ control rule of the speaker whose voice is to be synthesized. The parameter sequence generation unit 1 is also connected to a conventional voice quality conversion control rule 41 and a conventional phoneme duration control rule 42, both described in detail later.

[0011] In this embodiment, the speech data 11, 12, and 13, the working memory 21, the F₀ control rules 31, 32, and 33, the voice quality control rule 41, and the phoneme duration control rule 42 are held in storage such as a hard disk. The F₀ control rule learning unit 20 and the parameter sequence generation unit 1 are implemented, for example, on a digital computer.

[0012] FIG. 2 is a flowchart of the F₀ control rule learning process executed by the F₀ control rule learning unit 20 of FIG. 1. First, in step S1, an F₀ pattern is extracted from the speech data in the speech data sets 11, 12, and 13. In step S2, model parameters of the critical control model are generated from the extracted F₀ pattern using an analysis method based on the critical control model. In step S3, control rules that control the pitch frequency of speech are generated from the extracted F₀ pattern and the model parameters of the critical control model, focusing on predetermined control factors. Here, the control factors are the number of morae of the phrase to be synthesized, the number of morae of the phrase preceding it, the accent type of the accent phrase to be synthesized, and the position of that accent phrase in the sentence. The F₀ control rules control the pitch frequency of speech by controlling the magnitude of the phrase based on the first two factors and the magnitude of the accent phrase based on the last two. The processing in each step is described in detail below.

[0013] First, the processing of step S1 is described. Each of the speech data sets 11, 12, and 13 contains speech signal data of read-aloud sentences (also called uttered speech sentences) from one speaker. In step S1, the speech signal data is subjected to A/D conversion and LPC analysis to extract feature parameter data. The extracted feature parameter data is then analyzed, for example by a known analysis method based on the critical damping model (see, for example, Reference 3: Fujisaki et al., "Analysis of voice fundamental frequency contours for declarative sentences of Japanese," Journal of the Acoustical Society of Japan (E), Vol. 5, No. 4, pp. 233-244, April 1984), the F₀ pattern and the model parameters are detected, and the result is labeled in units of phonemes, accent phrases, and phrases. Here, the feature parameter data comprises log power, 16th-order cepstral coefficients, Δ log power, and 16th-order Δ cepstral coefficients. The model parameters comprise the accent commands and the phrase commands, which include information on accent phrase boundaries. The analysis decomposes the F₀ frequency into a phrase component, which is the gently falling component of the F₀ frequency, and an accent component, which represents local rises and falls of the F₀ frequency. In the critical damping model, the phrase component and the accent component are treated as the responses of critically damped second-order linear systems to the phrase commands and the accent commands, respectively. The precise timing and magnitude of each command can be found by automatic analysis-by-synthesis, starting from the approximate timing of the phrase commands and accent commands obtained from the phoneme labeling information, accent phrase information, and phrase boundary information.
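The critically damped second-order responses mentioned above can be sketched in Python as follows. This follows the standard formulation of the Fujisaki model from the literature, not text taken from this publication. The function names, the command tuples, and the default values `alpha=3.0`, `beta=20.0`, and `gamma=0.9` are illustrative assumptions.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of a critically damped second-order system.

    alpha is the natural angular frequency of the phrase control
    mechanism (assumed value). The response is zero before the command.
    """
    if t < 0.0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of a critically damped second-order system.

    beta is the natural angular frequency of the accent control
    mechanism and gamma is a ceiling on the response (assumed values).
    """
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, base_hz, phrase_cmds, accent_cmds, alpha=3.0, beta=20.0):
    """Log F0 at time t as baseline + phrase component + accent component.

    phrase_cmds: list of (onset_time, magnitude) impulses.
    accent_cmds: list of (onset_time, offset_time, magnitude) steps.
    """
    value = math.log(base_hz)
    for onset, magnitude in phrase_cmds:
        value += magnitude * phrase_response(t - onset, alpha)
    for onset, offset, magnitude in accent_cmds:
        value += magnitude * (
            accent_response(t - onset, beta) - accent_response(t - offset, beta)
        )
    return value
```

Evaluating `math.exp(log_f0(t, ...))` on a grid of frame times yields a synthetic F₀ pattern that analysis-by-synthesis can compare with the measured one.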

[0014] Next, the processing of step S2 is described. The Fujisaki model, researched and proposed by Fujisaki, is known as one of the superposition-type control models mentioned above (see, for example, Reference 1). Hill climbing has conventionally been used for parameterization with the Fujisaki model (see, for example, Reference 3). That is, all free parameters of the Fujisaki model are varied, and the parameter set that minimizes the mean squared estimation error of the F₀ pattern is taken as the analysis result for that F₀ pattern. This can be viewed as a search problem whose search-space dimension equals the total number of parameters. Conventionally, the angular frequency, input time, and magnitude were treated as free parameters for each phrase command, and the angular frequency, onset time, offset time, and magnitude for each accent command, so the search space had (3I + 4J) dimensions, where I is the number of phrase commands and J is the number of accent commands. Of these parameters, the magnitudes of the accent commands can be determined uniquely by the least-squares method once the measured F₀ values and the other parameters are given. This lowers the dimension of the search space by J and shortens the computation time. The method is formulated so that the reliability of the F₀ value at each time point (here, the local maximum of the autocorrelation function obtained when computing the F₀ frequency from the speech) can be used to weight the F₀ estimation error at that time point. This accounts for the fact that the reliability of F₀ values obtained from speech data is not uniform over time. In this embodiment, the F₀ pattern was parameterized by hill climbing using this method.
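The reduction from (3I + 4J) to (3I + 3J) search dimensions can be illustrated with a small weighted least-squares sketch. It assumes the timing parameters are fixed, so that each accent command contributes a known basis curve and only the magnitudes remain to be solved. The function names and data layout are hypothetical, and the linear solver is plain Gaussian elimination rather than anything specified in this publication.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: accent bases are not independent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_accent_magnitudes(bases, target, weights):
    """Weighted least-squares magnitudes for J accent commands.

    bases[j][k]: response of accent command j at frame k for fixed timing.
    target[k]:   measured log F0 minus baseline and phrase component.
    weights[k]:  reliability of the F0 value at frame k.
    """
    count = len(bases)
    gram = [
        [sum(w * a * b for w, a, b in zip(weights, bases[i], bases[j]))
         for j in range(count)]
        for i in range(count)
    ]
    rhs = [
        sum(w * a * y for w, a, y in zip(weights, bases[i], target))
        for i in range(count)
    ]
    return solve_linear(gram, rhs)


def search_dimensions(phrase_count, accent_count, solve_magnitudes=True):
    """Dimension of the hill-climbing search space."""
    per_accent = 3 if solve_magnitudes else 4
    return 3 * phrase_count + per_accent * accent_count
```

With a single constant basis, the fitted magnitude is the reliability-weighted mean of the target, which shows how frames with a low autocorrelation peak influence the result less.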

[0015] Next, the generation of the F₀ control rules in step S3 is described. Rules for estimating the attributes of each command from the control factors above, which are considered to affect the phrase commands and accent commands, are obtained by the known spatial multiple-split quantification method (Multiple Split Regression; see, for example, Reference 4: Iwahashi et al., "Statistical modeling of speech control by the space-division quantification method," Proceedings of the Acoustical Society of Japan meeting, 1-5-11, pp. 237-238, October 1992; hereinafter, the MSR method). As in regression tree analysis, the MSR method generates a model by growing a binary tree, choosing at each step the split that minimizes the sum of squared errors between the model estimates and the measured values. In addition, the MSR method allows a node other than a leaf node to share a branching condition across the entire subtree below it, so modeling can be done efficiently with few parameters. The control factors used to grow the binary tree at nodes close to the root node affect the estimates for many samples, so they can be judged to be important control factors that are deeply involved in controlling the F₀ frequency.
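The tree-growing step described above, choosing the split that minimizes the sum of squared errors, can be sketched as follows. This shows only the ordinary regression-tree split search that the MSR method shares with regression trees. It does not implement the sharing of branching conditions across subtrees, and the sample layout and factor names are hypothetical.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(samples, factor_names):
    """Find the binary split with the smallest total squared error.

    samples: list of (factors, magnitude), where factors maps a control
             factor name to a numeric level.
    Returns (factor_name, threshold, total_error) for the condition
    "factor <= threshold", or None if no split is possible.
    """
    best = None
    for name in factor_names:
        levels = sorted({factors[name] for factors, _ in samples})
        # The largest level cannot be a threshold: the "no" side would be empty.
        for threshold in levels[:-1]:
            yes = [y for factors, y in samples if factors[name] <= threshold]
            no = [y for factors, y in samples if factors[name] > threshold]
            total = sum_squared_error(yes) + sum_squared_error(no)
            if best is None or total < best[2]:
                best = (name, threshold, total)
    return best
```

Applying `best_split` recursively to the "yes" and "no" subsets grows a plain regression tree. Factors chosen near the root then correspond to the control factors the text identifies as most important.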

[0016] The quantities that a command estimation model could estimate include the command magnitude and timing information such as the onset time. Timing information can be estimated with a small number of rules, whereas estimating the command magnitude requires complex rules. This embodiment therefore takes the magnitude of each command as the estimation target. Here, the command estimation models for the phrase commands and the accent commands are together called the F₀ control rules.

[0017] Representative statistical modeling methods for generating an estimation model include Quantification Type I (see, for example, Reference 5: Hayashi et al., "Quantification Theory and Data Processing," Asakura Shoten, 1982) and regression trees (see, for example, Reference 6: Breiman et al., "Classification and Regression Trees," Wadsworth Statistics/Probability Series, U.S.A., 1984). Quantification Type I is a linear multiple regression model whose explanatory variable space consists of the control factors. Because it assumes that the control factors are independent, it cannot express dependencies between factors. A regression tree divides the explanatory variable space successively and assumes that the divided subspaces are independent, so it cannot express dependencies between the divided subspaces. The MSR method, in contrast, resolves the problems of both Quantification Type I and regression trees by introducing, into the regression tree analysis, parameters that are shared by multiple splits. The result of Quantification Type I can be regarded as a special solution of the MSR method in which splitting is allowed only at the root node, and the result of a regression tree as a special solution of the MSR method in which simultaneous splitting at multiple nodes is prohibited.

[0018] As with a regression tree, MSR analysis produces a tree-structured model. FIG. 3 shows an example of a model obtained by the MSR method. In this example, the observed value can be estimated from two control factors, C₁ and C₂. The estimate is obtained by following, from the root node at the top, the branches whose conditions the control factors satisfy, adding the quantity a_i written at each node along the way. The estimate is the total accumulated on reaching a terminal node. The quantities a_i are calculated from the control factors and observed values in a large amount of data using the normal equations. However, as with Quantification Type I and regression trees, the parameter values cannot be determined uniquely, so constraints must be placed on some of them. In this embodiment, the quantity at each node on the side that does not satisfy the condition (for example, the node on the No branch of the condition C₁ ≦ 5) is set to 0, and the other quantities are then solved for. In the example of FIG. 3, a₃, a₇, a₉, and a₁₁ are set to 0. Under this constraint, since a₃, a₇, a₉, and a₁₁ are all 0, the root node quantity a₁ equals the mean of the observed values of the data that reach the rightmost node in the bottom row, that is, the data that take the non-satisfying side at every branch. The quantity a₁ can be regarded as the initial value from which an estimate is computed.
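The way an estimate is accumulated along a path through the tree can be sketched as follows. The tree here is a small hypothetical example, not the tree of FIG. 3, and the dictionary-based node layout is an assumption made for illustration. Following the constraint described above, every node on a "no" branch carries the quantity 0.

```python
def estimate(node, factors):
    """Sum the node quantities along the path selected by the control factors."""
    total = 0.0
    while node is not None:
        total += node["quantity"]
        condition = node.get("condition")
        if condition is None:  # terminal node reached
            break
        name, threshold = condition
        node = node["yes"] if factors[name] <= threshold else node["no"]
    return total


# Hypothetical tree: the root quantity acts as the initial value, and
# quantities on "yes" branches are corrections to it.
EXAMPLE_TREE = {
    "quantity": 0.6,
    "condition": ("C1", 5),
    "yes": {
        "quantity": -0.1,
        "condition": ("C2", 2),
        "yes": {"quantity": 0.05},
        "no": {"quantity": 0.0},
    },
    "no": {"quantity": 0.0},
}
```

For data that fail every condition, `estimate` returns the root quantity alone, which matches the reading of a₁ as an initial value.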

[0019] In FIG. 3, the part of the tree enclosed by the dotted line is an example of an analysis result specific to the MSR method. This subtree was produced as follows: the node a₃ was split, generating the nodes a₆ and a₇, and then the node a₃ was split again, which split both of the nodes a₆ and a₇. The quantities a₁₀ and a₁₁ can be regarded as shared parameters. Quantification Type I allows splitting only at the root node, and a regression tree allows splitting only at terminal nodes, so neither can express a split within a subtree such as the one in this example.

[0020] As described above, the MSR method used in this embodiment includes and extends the concepts of Quantification Type I and regression trees. Furthermore, the shared parameters limit the growth in the number of model parameters, so modeling can be done efficiently with few parameters. For these reasons, this embodiment uses the MSR method as its statistical modeling method.

[0021] By using the MSR method to analyze the relationship between the phrase commands and accent commands obtained from a large amount of speech data by the processing above and the control factors that affect each command, models that estimate the phrase commands and accent commands from the control factors are obtained. Each model consists of a binary tree structure and model parameters. The binary tree serves as a rule for classifying each command according to the control factors, and the model parameters are used to calculate the estimates. Examining the structure of the binary tree obtained by the analysis makes it possible to analyze, for example, which control factors affect each command. The magnitude of a model parameter also indicates whether the classification associated with that parameter has a large effect on each command.

[0022] FIGS. 4 and 5 show the F₀ control rules of four speakers, M1, M2, F1, and F2, where M1 and M2 are male speakers and F1 and F2 are female speakers. FIG. 4 shows the control amounts for the number of morae of the current phrase and for the number of morae of the preceding phrase. FIG. 5 shows the control amounts for the accent type of the accent phrase and for the position of the accent phrase in the sentence. Here, a mora is a beat that corresponds substantially to one kana character. As for accent type, an accent phrase whose accent falls on the first mora is called type 1, one whose accent falls on the second mora is called type 2, and so on. Table 1 shows the F₀ control rules for speaker F2 in FIGS. 4 and 5.

【００２３】[0023]

[Table 1] Specific example of the F₀ control rules for each phoneme sequence <speaker F2> (corresponds to FIG. 4(d) and FIG. 5(d))
───────────────────────────────────
(1) Initialize the magnitudes of the current phrase, the preceding phrase, and the accent command to the predetermined initial value for the speaker (0.6).
───────────────────────────────────
(2) Decision control on the number of morae in the current phrase
(2-1) If the current phrase is 1 to 3 morae long, reduce the magnitude of the current phrase by 0.15 from the initial value.
(2-2) If the current phrase is 4 to 6 morae long, reduce the magnitude of the current phrase by 0.05 from the initial value.
(2-3) If the current phrase is 7 to 12 morae long, reduce the magnitude of the current phrase by 0.025 from the initial value.
(2-4) If the current phrase is 13 or more morae long, reduce the magnitude of the current phrase by 0.025 from the initial value.
───────────────────────────────────
(3) Decision control on the number of morae in the preceding phrase
(3-1) If the preceding phrase is 1 or more morae long, reduce the magnitude of the preceding phrase by 0.0125 from the initial value.
(3-2) If there is no preceding phrase, leave the magnitude of the preceding phrase unchanged from the initial value.
───────────────────────────────────
(4) Decision control on the accent type of the accent phrase
(4-1) If the accent type is type 1 or type 2, increase the magnitude of the accent phrase by 0.05 from the initial value.
(4-2) If the accent type is type 3 or higher, leave the magnitude of the accent phrase unchanged from the initial value.
(4-3) If there is no accent phrase, reduce the magnitude of the accent phrase by 0.2 from the initial value.
───────────────────────────────────
(5) Decision control on the position of the accent phrase in the sentence
(5-1) If the accent phrase is at the beginning of the sentence, leave the magnitude of the accent phrase unchanged from the initial value.
(5-2) If the accent phrase is in the middle of the sentence, leave the magnitude of the accent phrase unchanged from the initial value.
(5-3) If the accent phrase is at the end of the sentence, reduce the magnitude of the accent phrase by 0.25 from the initial value.
───────────────────────────────────
(Note) The phrase command magnitude is controlled by the sum of the control amounts in (2) and (3) of Table 1, and the accent phrase magnitude is controlled by the sum of the control amounts in (4) and (5) of Table 1.
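The rules of Table 1 can be read as a small lookup procedure. The Python sketch below is one illustrative reading, not the patented implementation. It assumes, following the note in Table 1, that the adjustments from (2) and (3) are added to the initial value to give the phrase command magnitude, and those from (4) and (5) to give the accent magnitude. The function names, the position labels, and the use of `None` for "no accent phrase" are hypothetical.

```python
INITIAL = 0.6  # initial magnitude for speaker F2, rule (1) of Table 1


def phrase_magnitude(n_morae, has_preceding_phrase):
    """Phrase command magnitude from rules (2) and (3) of Table 1."""
    if n_morae < 1:
        raise ValueError("a phrase must contain at least 1 mora")
    if n_morae <= 3:
        delta_length = -0.15    # (2-1)
    elif n_morae <= 6:
        delta_length = -0.05    # (2-2)
    else:
        delta_length = -0.025   # (2-3) and (2-4) use the same amount
    # (3-1) applies whenever a preceding phrase exists, (3-2) otherwise.
    delta_preceding = -0.0125 if has_preceding_phrase else 0.0
    return INITIAL + delta_length + delta_preceding


def accent_magnitude(accent_type, position):
    """Accent magnitude from rules (4) and (5) of Table 1.

    accent_type: integer accent type, or None when there is no accent phrase.
    position: "initial", "medial" or "final" (hypothetical labels).
    """
    if accent_type is None:
        delta_type = -0.2       # (4-3)
    elif accent_type in (1, 2):
        delta_type = 0.05       # (4-1)
    elif accent_type >= 3:
        delta_type = 0.0        # (4-2)
    else:
        raise ValueError("Table 1 lists no rule for this accent type")
    if position not in ("initial", "medial", "final"):
        raise ValueError("unknown position label")
    delta_position = -0.25 if position == "final" else 0.0  # (5-1) to (5-3)
    return INITIAL + delta_type + delta_position
```

Under these assumptions, a 5-mora phrase with a preceding phrase gets 0.6 - 0.05 - 0.0125 = 0.5375, and a sentence-final type 1 accent phrase gets 0.6 + 0.05 - 0.25 = 0.4.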

[0024] Next, Tables 2, 3, and 4 show examples of the F₀ control rules 31, 32, and 33, the voice quality control rule 41, and the phoneme duration control rule 42, respectively, which are used to control each parameter for each phoneme or phoneme sequence when producing the synthesized speech "Kyou wa yoi tenki desu" ("The weather is nice today"). In Table 3, the acoustic feature parameters are 34-dimensional parameters that include log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients.

【００２５】[0025]

[Table 2] Example of F₀ control rules
───────────────────────────────────
Phoneme sequence    F₀ control rules
───────────────────────────────────
kyo'uwa             F₀ control rule for the phrase magnitude
                    F₀ control rule for the accent magnitude
───────────────────────────────────
yo'ite'Nkidesu      F₀ control rule for the phrase magnitude
                    F₀ control rule for the magnitude of accent 1
                    F₀ control rule for the magnitude of accent 2
───────────────────────────────────

【００２６】[0026]

[Table 3] Example of voice quality control rules
─────────────────
Phoneme    Acoustic feature parameters
─────────────────
ky         (0.05, 0.03, …)
o          (0.45, 0.38, …)
u          (0.25, 0.42, …)
w          (0.32, 0.30, …)
a          (0.12, 0.45, …)
…          …
─────────────────

【００２７】[0027]

[Table 4] Example of phoneme duration control rules
─────────────
Phoneme    Phoneme duration
─────────────
ky         0.054 s
o          0.120 s
u          0.095 s
w          0.080 s
a          0.110 s
…          …
─────────────
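To illustrate how the durations in Table 4 could be placed on a time axis, the following Python sketch accumulates them into start and end times for each phoneme. It assumes the phonemes are simply concatenated without overlap or pauses, which Table 4 does not state. The names `DURATIONS` and `phoneme_timeline` are hypothetical.

```python
from itertools import accumulate

# Phoneme durations in seconds, copied from the rows listed in Table 4.
DURATIONS = [("ky", 0.054), ("o", 0.120), ("u", 0.095), ("w", 0.080), ("a", 0.110)]


def phoneme_timeline(durations):
    """Return (phoneme, start, end) triples, assuming back-to-back phonemes."""
    ends = list(accumulate(d for _, d in durations))
    starts = [0.0] + ends[:-1]
    return [(p, s, e) for (p, _), s, e in zip(durations, starts, ends)]


for phoneme, start, end in phoneme_timeline(DURATIONS):
    print(f"{phoneme:>2}  {start:.3f} s to {end:.3f} s")
```

With the five rows shown in Table 4, the last phoneme ends at about 0.459 s.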

[0028] The operation of the speech synthesizer shown in FIG. 1 is described below. As shown in FIG. 1, the character string to be synthesized is input to the parameter sequence generation unit 1. Based on the input character string, the parameter sequence generation unit 1 selects control parameter data, comprising the F₀ frequency, the acoustic feature parameters, and the phoneme durations, using an F₀ control rule (one of 31, 32, and 33) that controls the F₀ frequency, the voice quality control rule 41 that controls the acoustic feature parameters, and the phoneme duration control rule 42 that controls the phoneme durations. Based on the selected parameter data, it performs processing such as time alignment (for example, by the DTW method) and interpolation of the speech spectrum, generates time-series data of, for example, 16th-order cepstrum coefficients, and outputs the data to the speech synthesis unit 2. The speech synthesis unit 2 comprises a pulse generator, a noise generator, a variable-gain amplifier, and a filter. It generates a speech signal from the input time-series data and outputs the signal to the speaker 3, thereby producing synthesized speech corresponding to the input character string.
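The pulse generator, noise generator, and variable-gain amplifier named above are the usual excitation stage of a source-filter synthesizer. The Python sketch below illustrates only that stage, under assumptions not taken from this document: a pulse train is used when F₀ is positive (voiced), white noise otherwise (unvoiced), and the sample rate is 16 kHz. The function name `excitation` is hypothetical, and the filter driven by the cepstrum coefficients is omitted.

```python
import random


def excitation(f0_hz, gain, n_samples, sample_rate=16000, seed=0):
    """Return n_samples of excitation scaled by gain.

    Voiced (f0_hz > 0): unit pulses spaced one pitch period apart.
    Unvoiced (f0_hz <= 0): uniform white noise in [-1, 1].
    """
    if f0_hz > 0:
        # Pitch period in samples, rounded to the nearest integer.
        period = max(1, round(sample_rate / f0_hz))
        return [gain if n % period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [gain * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


voiced = excitation(100.0, 1.0, 320)   # 100 Hz gives one pulse every 160 samples
unvoiced = excitation(0.0, 0.5, 320)   # noise scaled by the gain
```

In a complete synthesizer this signal would be passed through the filter whose coefficients follow the time-series data supplied by the parameter sequence generation unit 1.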

[0029] In the above embodiment, the F₀ control rules may also be generated by having the conversion-target speaker utter a small amount of speech data and substituting the F₀ control rules generated from that data for those generated from a large amount of speech data.

【００３０】[0030]

[Examples] Using the speech synthesizer of FIG. 1, the present inventors applied the F₀ control rule learning process to speech databases to generate F₀ control rules that estimate the magnitudes of phrase commands and accent commands. F₀ control rules were generated and analyzed for a plurality of speakers, and the commonality of the important control factors across speakers was examined.

[0031] As speech material for generating the F₀ control rules, 500 sentences uttered by each of four speakers (two male and two female), 2,000 sentences in total, were used (see, for example, prior-art reference 7, "Abe et al., Proceedings of the Acoustical Society of Japan, pp. 267-268, October 1989"). The utterances are sentences selected from newspapers and magazines. Table 5 shows the number of phrase commands and accent commands in each set of speech data. F₀ control rules were generated for each speech database using the processing method described above, and the individual control rules were analyzed.

【００３２】[0032]

[Table 5] Number of commands contained in each speech database
───────────────────────────────────
Speaker            M1      M2      F1      F2
───────────────────────────────────
Phrase commands    1903    1684    1425    1532
Accent commands    3200    3176    3306    3119
───────────────────────────────────
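The counts in Table 5 are easier to compare when expressed per sentence. The Python sketch below divides each count by the 500 sentences per speaker stated in paragraph [0031]. The names `COMMANDS` and `per_sentence` are hypothetical, and the averages are derived here for illustration rather than reported in the document.

```python
SENTENCES_PER_SPEAKER = 500  # from paragraph [0031]

# Command counts copied from Table 5.
COMMANDS = {
    "M1": {"phrase": 1903, "accent": 3200},
    "M2": {"phrase": 1684, "accent": 3176},
    "F1": {"phrase": 1425, "accent": 3306},
    "F2": {"phrase": 1532, "accent": 3119},
}


def per_sentence(commands, n_sentences=SENTENCES_PER_SPEAKER):
    """Average number of commands per sentence for each speaker."""
    return {
        speaker: {kind: count / n_sentences for kind, count in counts.items()}
        for speaker, counts in commands.items()
    }


for speaker, avg in per_sentence(COMMANDS).items():
    print(f"{speaker}: {avg['phrase']:.2f} phrase, {avg['accent']:.2f} accent per sentence")
```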

[0033] The control factors used to generate the F₀ control rules and the analysis of the control rules are described next. The parameters used in the critical damping model are the input time and magnitude for a phrase command, and the onset time, offset time, and magnitude for an accent command. Of these, timing information such as the input time has been reported to be controllable with a small number of simple rules (see, for example, prior-art reference 8, "Kaiki et al., IEICE Technical Report, SP92-6, March 1992"). In contrast, appropriate control of the command magnitudes is important for improving the naturalness and intelligibility of synthesized speech. The magnitudes of the phrase commands and accent commands were therefore taken as the targets of estimation when generating the F₀ control rules.
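The command parameters listed above (input time and magnitude for phrase commands; onset, offset, and magnitude for accent commands) match the commonly used Fujisaki formulation of a critically damped F₀ model. The Python sketch below assumes that standard formulation: ln F₀(t) is a base value plus phrase components and accent components. The constants `ALPHA`, `BETA`, and `GAMMA` and the base frequency in the example are typical illustrative values, not values taken from this document, and all function names are hypothetical.

```python
import math

ALPHA, BETA, GAMMA = 3.0, 20.0, 0.9  # assumed typical constants


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the critically damped phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, base_f0, phrase_cmds, accent_cmds):
    """ln F0 at time t.

    phrase_cmds: iterable of (input_time, magnitude)
    accent_cmds: iterable of (onset_time, offset_time, magnitude)
    """
    value = math.log(base_f0)
    for t0, magnitude in phrase_cmds:
        value += magnitude * phrase_response(t - t0)
    for t1, t2, magnitude in accent_cmds:
        value += magnitude * (accent_response(t - t1) - accent_response(t - t2))
    return value


# One phrase command and one accent command, sampled every 0.1 s.
contour = [
    math.exp(log_f0(0.1 * k, 100.0, [(0.0, 0.5)], [(0.2, 0.6, 0.4)]))
    for k in range(15)
]
```

In this form the timing parameters place each command on the time axis, while the magnitudes that the F₀ control rules estimate scale the height of the corresponding component.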

[0034] First, the control factors used to estimate the phrase command magnitude, and their effects, are described. The following four control factors were considered for estimating the phrase command magnitude.
(A1) Length of the current phrase (specifically, the number of morae in the current phrase), divided into 5 categories.
(A2) Length of the preceding phrase (specifically, the number of morae in the preceding phrase), divided into 6 categories.
(A3) Position of the current phrase in the sentence, divided into 2 categories: sentence-final or not sentence-final.
(A4) Accent type of the first accent phrase in the current phrase, divided into 4 categories.

[0035] When the current phrase is short, the phrase component need not be held at a high value for long, so shorter phrases can be expected to have smaller phrase commands. When the preceding phrase is short, the current phrase begins before the phrase component of the preceding phrase has decayed sufficiently, and here too the phrase command is expected to be smaller. For these reasons, the lengths of the current and preceding phrases were used as control factors for estimating the phrase command magnitude. In addition, the F₀ frequency drops markedly at the end of a sentence in speech, so a sentence-final phrase command is expected to be smaller than one in any other position. The position of the phrase in the sentence was therefore also used to estimate the phrase command magnitude. Finally, to keep the F₀ frequency from becoming too high at the start of a phrase, the accent type, a factor strongly correlated with the size of the accent component, was used as a factor that suppresses the phrase command magnitude.

[0036] A model that estimates the phrase command magnitude from these control factors was generated and analyzed. The lengths of the current and preceding phrases were confirmed to be important control factors in all of the speech databases. For factor (A4), it was found that, for three of the four speakers, the phrase becomes smaller when an accent phrase having an accent nucleus (hereinafter, an undulating-type accent phrase) is at the start of the phrase.

[0037] Next, the control factors used to estimate the accent command magnitude, and their effects, are described. The following four control factors were considered for estimating the accent command magnitude.
(B1) Length of the current accent phrase (specifically, the number of morae in the current accent phrase), divided into 4 categories.
(B2) Accent type of the current accent phrase, divided into 4 categories.
(B3) Accent type of the preceding accent phrase, divided into 5 categories.
(B4) Position of the current accent phrase in the sentence, divided into 3 categories: sentence-initial, sentence-medial, and sentence-final.

[0038] It is well known that the accent component becomes smaller when the accent phrase is short or when the accent type is the flat type, so these were considered as control factors. In the inventors' experiments, the accent command tended to be larger the smaller the number indicating the accent type, that is, the fewer the beats (morae) pronounced "high". The undulating-type accent phrases were therefore classified more finely for analysis (type 1, type 2, types 3 to 5, and type 6 or higher). When the preceding accent phrase is of the undulating type, the energy for raising the F₀ frequency may be consumed in the preceding accent phrase, making the current accent phrase smaller, so the accent type of the preceding accent phrase was added as a control factor. Furthermore, as described above, position in the sentence is treated as a control factor for estimating the phrase command magnitude. The accent command magnitude may likewise differ between sentence-initial, sentence-medial, and sentence-final positions, so position was also considered as a factor here.

[0039] An accent command estimation model was generated from these control factors and the measured accent command magnitudes, and then analyzed. For factor (B4), the accent command of an accent phrase at the end of a sentence was confirmed to be smaller in every speaker's estimation model. In the present experiment, which handled a larger amount of speech data, no notable individual differences were found in the effects on the magnitudes of the phrase commands and accent commands.

[0040] As described above, according to this embodiment of the present invention, an input character string is converted into the speech of a speaker designated in advance using F₀ control rules, created for each speaker, that control the pitch frequency of speech. The F₀ control rules control the pitch frequency of speech by controlling the magnitude of the current phrase based on the number of morae in the phrase to be synthesized and the number of morae in the preceding phrase, and by controlling the magnitude of the accent phrase based on the accent type of the accent phrase to be synthesized and the position of that accent phrase in the sentence. A speech synthesizer capable of synthesizing the speech of one designated speaker can therefore be provided. In addition, the F₀ control rule learning unit 20 extracts a pitch frequency pattern from speech data, generates model parameters of the critical control model from the extracted pattern using an analysis method based on the critical control model, and generates control rules that control the pitch frequency of speech. F₀ control rules that are optimal and faithful for synthesizing the speech of one designated speaker can therefore be created automatically and easily.

【００４１】[0041]

[Effects of the Invention] As detailed above, the speech synthesizer according to the present invention comprises conversion means for converting an input character string into the speech of a speaker designated in advance, using control rules, created for each speaker, that control the pitch frequency of speech. Here, the control rules control the pitch frequency of speech by controlling the magnitude of the current phrase based on the number of morae in the phrase to be synthesized and the number of morae in the preceding phrase, and by controlling the magnitude of the accent phrase based on the accent type of the accent phrase to be synthesized and the position of that accent phrase in the sentence. This has the particular effect that a speech synthesizer capable of synthesizing the speech of one designated speaker can be provided.

[0042] The speech synthesizer further comprises learning means for generating the control rules. The learning means comprises: extracting means for extracting a pitch frequency pattern of speech from speech data; generating means for generating model parameters of the critical control model from the pitch frequency pattern extracted by the extracting means, using an analysis method based on the critical control model; and creating means for creating control rules that control the pitch frequency of speech, based on the pitch frequency pattern extracted by the extracting means and the model parameters of the critical control model generated by the generating means. As a result, F₀ control rules that are optimal and faithful for synthesizing the speech of one designated speaker can be created automatically and easily.

[Brief description of drawings]

FIG. 1 is a block diagram of a speech synthesizer according to one embodiment of the present invention.

FIG. 2 is a flowchart of the F₀ control rule learning process executed by the F₀ control rule learning unit of FIG. 1.

FIG. 3 is a diagram showing an example of modeling by the spatial multiple-division quantification method (MSR) used in the F₀ control rule learning unit of FIG. 1.

FIG. 4 is a graph showing an example of the F₀ control rules for phrase commands created by the F₀ control rule learning unit of FIG. 1.

FIG. 5 is a graph showing an example of the F₀ control rules for accent phrases created by the F₀ control rule learning unit of FIG. 1.

[Explanation of symbols]

1 ... parameter sequence generation unit, 2 ... speech synthesis unit, 3 ... speaker, 11 ... speech data of speaker A, 12 ... speech data of speaker B, 13 ... speech data of speaker C, 20 ... F₀ control rule learning unit, 21 ... working memory, 31 ... F₀ control rules of speaker A, 32 ... F₀ control rules of speaker B, 33 ... F₀ control rules of speaker C, 41 ... sound quality control rule, 42 ... phoneme duration data.

─────────────────────────────────────────────────────
Continuation of front page
(72) Inventor: Yoshinori Sagisaka (匂坂芳典), c/o ATR Interpreting Telecommunications Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto
(72) Inventor: Norio Higuchi (樋口宜男), c/o ATR Interpreting Telecommunications Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto

Claims

[Claims]

1. A speech synthesizer that synthesizes speech based on an input character string, comprising conversion means for converting the input character string into the speech of a speaker designated in advance, using control rules, created for each speaker, that control the pitch frequency of speech.

2. The speech synthesizer according to claim 1, wherein the control rules control the pitch frequency of speech by controlling the magnitude of the current phrase based on the number of morae in the phrase to be synthesized and the number of morae in the preceding phrase that precedes it, and by controlling the magnitude of the accent phrase based on the accent type of the accent phrase to be synthesized and the position of that accent phrase in the sentence.

3. The speech synthesizer according to claim 1, further comprising learning means for generating the control rules, the learning means comprising: extracting means for extracting a pitch frequency pattern of speech from speech data; generating means for generating model parameters of a critical control model from the pitch frequency pattern extracted by the extracting means, using an analysis method based on the critical control model; and creating means for creating control rules that control the pitch frequency of speech, based on the pitch frequency pattern extracted by the extracting means and the model parameters of the critical control model generated by the generating means.