JP3133347B2

JP3133347B2 - Prosody control device

Info

Publication number: JP3133347B2
Application number: JP02410265A
Authority: JP
Inventors: 哲也酒寄
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1990-12-13
Filing date: 1990-12-13
Publication date: 2001-02-05
Anticipated expiration: 2016-02-05
Also published as: JPH04363634A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a prosody control device for a rule-based speech synthesizer.

[0002]

[Prior Art] One prosody control method for rule-based speech synthesizers is disclosed in Japanese Patent Application Laid-Open No. 63-46498, filed by the present applicant under the title "Prosody Control Method". It controls prosody using multivariate analysis techniques such as quantification type I or multiple regression analysis. As a further example of that method, also devised by the present applicant, regression coefficient sets are used selectively according to the speaker and the situation at the time, so that different speaker characteristics and tones of voice can be expressed easily.

[0003]

[Problems to Be Solved by the Invention] Speech produced by conventional rule-based speech synthesizers has the serious drawback that it lacks naturalness and sounds unnatural to the listener. This problem stems mainly from prosody control. Recent research has gradually reduced this unnaturalness, and it is becoming possible to synthesize natural-sounding speech. However, humans do not always speak with a fixed prosodic pattern, and prosodic patterns naturally differ from speaker to speaker. The prosody of conventional synthesized speech therefore shows little variation, and in practice it is still painful to listen to for long periods.

[0004] When a prosodic pattern is generated by the conventional methods described above, the selection of variables for analyzing and predicting the prosodic pattern is an important issue. In regression analysis in particular, adding variables indiscriminately can reduce prediction accuracy for unseen inputs. Because the conventional method uses a single variable set for every speaker and tone of voice, the variables must be chosen as a common denominator across all of them, which makes fine-grained analysis and prediction difficult.

[0005]

[Means for Solving the Problems] The present invention concerns a rule-based speech synthesizer that reads out speech unit parameter sequences prepared in advance according to an input symbol string representing phonemes and prosody, concatenates the speech unit parameter sequences according to speech parameter concatenation rules, and adds prosody by calculating, according to prosody control rules, the phoneme durations, pitch pattern, amplitude pattern, and the like that correspond to the input symbol string. In such a synthesizer, category mapping means is provided that extracts a plurality of variables from the input symbol string according to variable extraction rules, selects from these variables an optimal variable set for each speaker or speaking style, and stores the variable extraction rules, the optimized variable set, and an optimal coefficient set obtained from the optimal variable set, the input symbol string, and speech data. Variable storage selection means is also provided that selects, according to the speaker or speaking style, a variable set from among those stored by the category mapping means. The variable set selected by the variable storage selection means and the corresponding coefficient set are retrieved from among those stored by the category mapping means, and the prosodic pattern of the synthesized speech is generated from the variable set, the coefficient set, and the input symbol string.

[0006]

[Operation] By using the category mapping means and the variable storage selection means, the prosodic pattern of synthesized speech, which has conventionally been uniform, can be controlled in fine detail, including the selection of variables.

[0007]

[Embodiment] An embodiment of the present invention will be described with reference to the drawings. FIGS. 1 and 2 show an example configuration of the prosody control device in a rule-based speech synthesizer. As shown in FIG. 1, a labeling unit 2 is connected to a template speech unit 1, and a category mapping unit 3, serving as the category mapping means, is connected to the labeling unit 2. A regression analysis unit 4 is connected to the category mapping unit 3. As shown in FIG. 2, a regression prediction unit 5 is also connected to the category mapping unit 3.

[0008] The category mapping unit 3 extracts a plurality of variables from an input symbol string 6 according to category determination rules 7, which serve as the variable extraction rules, and selects from these variables an optimal variable set 8 for each speaker or speaking style. Variable storage selection means (not shown) is also provided; it stores the category determination rules 7, the optimized variable set 8, and a regression coefficient set 9, and uses them selectively according to the intended use of the synthesized speech and the speaker characteristics.
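As a rough illustration of what the variable storage selection means keeps and hands back, the following is a minimal sketch in Python. The patent describes no data layout, so the class names `ProsodyProfile` and `ProfileStore`, the `(speaker, style)` key, the two example rules, and all numeric values are assumptions made here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A category determination rule: (symbol string, position) -> category name.
Rule = Callable[[List[str], int], str]


@dataclass
class ProsodyProfile:
    """What is stored for one speaker and speaking style (hypothetical layout)."""
    rules: Dict[str, Rule]                      # category determination rules 7
    variable_set: List[str]                     # optimized variable set 8
    coefficients: Dict[Tuple[str, str], float]  # regression coefficient set 9


class ProfileStore:
    """Toy stand-in for the variable storage selection means."""

    def __init__(self) -> None:
        self._profiles: Dict[Tuple[str, str], ProsodyProfile] = {}

    def store(self, speaker: str, style: str, profile: ProsodyProfile) -> None:
        self._profiles[(speaker, style)] = profile

    def select(self, speaker: str, style: str) -> ProsodyProfile:
        # Raises KeyError if no analysis was stored for this combination.
        return self._profiles[(speaker, style)]


def phoneme(symbols: List[str], i: int) -> str:
    """Hypothetical rule: the category is the phoneme label itself."""
    return symbols[i]


def position(symbols: List[str], i: int) -> str:
    """Hypothetical rule: is this the last symbol of the string?"""
    return "final" if i == len(symbols) - 1 else "non_final"


rules = {"phoneme": phoneme, "position": position}
store = ProfileStore()
# Illustrative numbers only (milliseconds); not taken from the patent.
store.store("speaker_a", "reading", ProsodyProfile(
    rules, ["phoneme", "position"],
    {("phoneme", "k"): -40.0, ("position", "non_final"): -30.0}))
store.store("speaker_a", "casual", ProsodyProfile(
    rules, ["phoneme"], {("phoneme", "k"): -25.0}))

profile = store.select("speaker_a", "casual")
print(profile.variable_set)  # the variable set can differ per speaking style
```

The point of the sketch is that the variable set is stored per profile alongside its coefficients, rather than one variable set being shared by all speakers and styles.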

[0009] In this embodiment, phoneme duration is taken as an example of a prosodic pattern, and the case where quantification type I is used to analyze and predict the prosodic pattern is considered. In the following, a set of speech data uttered by a human is assumed to be given in advance; it is referred to here as the template speech. The configuration and operation of this embodiment are described separately for the analysis phase, in which the features of the prosodic patterns contained in the template speech are analyzed and learned, and the synthesis phase, in which prosodic patterns are generated from the results.
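For readers unfamiliar with quantification type I, it is commonly written as a linear regression on category indicator (dummy) variables. The patent text gives no formula, so the notation below is an assumption introduced here: y_i is the measured duration of the i-th phoneme, J is the number of factor items, K_j is the number of categories of item j, c is a constant term, and a_{jk} are the category coefficients.

```latex
\hat{y}_i = c + \sum_{j=1}^{J} \sum_{k=1}^{K_j} a_{jk}\,\delta_i(j,k),
\qquad
\delta_i(j,k) =
\begin{cases}
1 & \text{if sample } i \text{ falls in category } k \text{ of item } j,\\
0 & \text{otherwise,}
\end{cases}
\qquad
\min_{c,\,a_{jk}} \sum_{i=1}^{N} \bigl( y_i - \hat{y}_i \bigr)^2 .
```

Because each sample falls in exactly one category per item, the indicators of an item sum to one, so one coefficient per item is usually fixed (for example at zero) to make the solution unique.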

[0010] First, the operation during analysis is described. FIG. 1 shows the configuration used during analysis. The labeling unit 2 determines the phoneme boundary positions in the human-uttered template speech obtained from the template speech unit 1, and the duration of each phoneme is measured and used as the objective (dependent) variable. Meanwhile, the category mapping unit 3 derives the explanatory variables from the label string assigned by the labeling unit 2, using the category determination rules 7. In quantification type I, an explanatory variable is generally called a factor item and is qualitative: it indicates which category a sample falls into. The important questions here are which factor items to adopt and how to divide them into categories. In this embodiment, therefore, a common-denominator variable set is not used; instead, variable selection is optimized for the template speech being analyzed. Known techniques can be used for variable selection, such as forward selection, backward elimination, or stepwise selection, with the variance ratio (F ratio), the degrees-of-freedom-adjusted contribution ratio (adjusted coefficient of determination), or the like as the evaluation function. The regression analysis unit 4 performs regression analysis using the optimal variable set selected in this way, thereby obtaining the regression coefficient set 9.
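The analysis phase can be sketched as dummy-coded least squares combined with forward selection scored by the adjusted coefficient of determination. This is a minimal sketch under stated assumptions, not the patent's implementation: the function names, the toy items `phoneme`, `position`, and `noise`, and the duration values are all made up here, and the code assumes more samples than parameters and a full-rank design.

```python
def design_matrix(samples, items):
    """Dummy-code the given factor items.

    The first (sorted) category of each item is the baseline and gets no
    column, which keeps the design full rank alongside the constant term.
    """
    names = [("const", "")]
    for item in items:
        names += [(item, c) for c in sorted({s[item] for s in samples})[1:]]
    rows = [[1.0] + [float(s[item] == c) for item, c in names[1:]]
            for s in samples]
    return rows, names


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * p for x, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit(samples, y, items):
    """Least-squares fit via the normal equations.

    Returns (coefficient dict, adjusted R^2). Assumes len(y) > parameters.
    """
    X, names = design_matrix(samples, items)
    p, n = len(names), len(y)
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    coef = solve(XtX, Xty)
    pred = [sum(c * x for c, x in zip(coef, r)) for r in X]
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    mean = sum(y) / n
    ss_tot = sum((yi - mean) ** 2 for yi in y)
    adj_r2 = 1.0 - (ss_res / (n - p)) / (ss_tot / (n - 1))
    return dict(zip(names, coef)), adj_r2


def forward_select(samples, y, candidates):
    """Greedy forward selection scored by adjusted R^2."""
    chosen, best = [], fit(samples, y, [])[1]
    remaining = list(candidates)
    while remaining:
        scored = [(fit(samples, y, chosen + [c])[1], c) for c in remaining]
        score, item = max(scored)
        if score <= best + 1e-9:  # no real improvement: stop
            break
        chosen.append(item)
        remaining.remove(item)
        best = score
    return chosen


# Toy "template speech": candidate item values per phoneme, with measured
# durations in ms. All values are made up for illustration.
samples = [
    {"phoneme": p, "position": q, "noise": r}
    for p in ("a", "k") for q in ("final", "non_final") for r in ("x", "y")
]
y = [151.0, 149.0, 119.0, 121.0, 109.0, 111.0, 81.0, 79.0]

variable_set = forward_select(samples, y, ["phoneme", "position", "noise"])
coefficient_set, score = fit(samples, y, variable_set)
print(variable_set)
print(coefficient_set)
```

In this toy data the `noise` item carries no information about duration, so the adjusted score penalizes it and the selection stops after `phoneme` and `position`. Backward elimination or a variance-ratio criterion, both named in the text, would slot into the same structure.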

[0011] Next, the operation during synthesis is described. FIG. 2 shows the configuration used during synthesis. It is assumed that an input symbol string 6, corresponding to the label string used as input during analysis, has been obtained from the text, concept, or the like that is to be spoken. The category mapping unit 3 maps this input symbol string 6 to the appropriate categories using the category determination rules 7 used during analysis and the optimized variable set 8. The phoneme duration pattern corresponding to the input symbol string 6 is then calculated using the regression coefficient set 9 obtained during analysis. Synthesizing speech in this way yields synthesized speech that reproduces the characteristics of the given template speech. Furthermore, by analyzing template speech from different speakers and speaking styles separately, storing the category determination rules 7, the optimized variable set 8, and the regression coefficient set 9 for each by the variable storage selection means, and using them as appropriate, a wide variety of prosodic patterns 10 can be set appropriately and richly varied speech can be reproduced.
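The synthesis-time calculation can be sketched as follows: each symbol is mapped to one category per selected item, and the matching coefficients are summed. This is a minimal sketch, assuming a constant term plus per-category offsets; the function `predict_durations`, the two rules, and the coefficient values are hypothetical and not taken from the patent.

```python
from typing import Callable, Dict, List, Tuple

# A category determination rule: (symbol string, position) -> category name.
Rule = Callable[[List[str], int], str]


def predict_durations(
    symbols: List[str],
    rules: Dict[str, Rule],
    variable_set: List[str],
    coefficients: Dict[Tuple[str, str], float],
    constant: float,
) -> List[float]:
    """Compute one duration (ms) per symbol from a stored coefficient set.

    Categories without a stored coefficient are treated as the baseline
    (offset 0.0), which is an assumption of this sketch.
    """
    durations = []
    for i in range(len(symbols)):
        d = constant
        for item in variable_set:
            category = rules[item](symbols, i)  # category mapping
            d += coefficients.get((item, category), 0.0)
        durations.append(d)
    return durations


def phoneme(symbols: List[str], i: int) -> str:
    """Hypothetical rule: the category is the phoneme label itself."""
    return symbols[i]


def position(symbols: List[str], i: int) -> str:
    """Hypothetical rule: is this the last symbol of the string?"""
    return "final" if i == len(symbols) - 1 else "non_final"


rules = {"phoneme": phoneme, "position": position}
# Illustrative coefficient set (ms); baseline is phoneme "a" in final position.
coefficients = {("phoneme", "k"): -40.0, ("position", "non_final"): -30.0}

pattern = predict_durations(
    ["k", "a", "k", "a"], rules, ["phoneme", "position"], coefficients, 150.0)
print(pattern)
```

Swapping in a different variable set and coefficient set for another speaker or speaking style changes the resulting pattern without changing this function, which is the behavior the paragraph above describes.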

[0012]

[Effects of the Invention] The present invention concerns a rule-based speech synthesizer that reads out speech unit parameter sequences prepared in advance according to an input symbol string representing phonemes and prosody, concatenates the speech unit parameter sequences according to speech parameter concatenation rules, and adds prosody by calculating, according to prosody control rules, the phoneme durations, pitch pattern, amplitude pattern, and the like that correspond to the input symbol string. In such a synthesizer, category mapping means is provided that extracts a plurality of variables from the input symbol string according to variable extraction rules, selects from these variables an optimal variable set for each speaker or speaking style, and stores the variable extraction rules, the optimized variable set, and an optimal coefficient set obtained from the optimal variable set, the input symbol string, and speech data. Variable storage selection means is also provided that selects, according to the speaker or speaking style, a variable set from among those stored by the category mapping means. The variable set selected by the variable storage selection means and the corresponding coefficient set are retrieved from among those stored by the category mapping means, and the prosodic pattern of the synthesized speech is generated from the variable set, the coefficient set, and the input symbol string. As a result, the prosodic pattern of synthesized speech, which has conventionally been uniform, can be controlled in fine detail, including the selection of variables, and natural speech can thereby be synthesized.

[Brief description of the drawings]

[FIG. 1] A block diagram showing the configuration used during speech analysis in an embodiment of the present invention.

[FIG. 2] A block diagram showing the configuration used during speech synthesis in an embodiment of the present invention.

Reference numerals:
3 Category mapping means
6 Input symbol string
7 Variable extraction rule
8 Variable set

Continuation of the front page (58) Fields searched (Int. Cl.⁷, DB name): G10L 13/00 - 13/08; G10L 21/00 - 21/06

Claims

(57) [Claims]

1. A prosody control device for a rule-based speech synthesizer that reads out speech unit parameter sequences prepared in advance according to an input symbol string representing phonemes and prosody, concatenates said speech unit parameter sequences according to speech parameter concatenation rules, and adds prosody by calculating, according to prosody control rules, the phoneme durations, pitch pattern, amplitude pattern, and the like that correspond to the input symbol string, the prosody control device being characterized in that: category mapping means is provided that extracts a plurality of variables from the input symbol string according to variable extraction rules, selects from these variables an optimal variable set for each speaker or speaking style, and stores said variable extraction rules, said optimized variable set, and an optimal coefficient set obtained from the optimal variable set, said input symbol string, and speech data; variable storage selection means is provided that selects, according to the speaker or speaking style, a variable set from among those stored by said category mapping means; and the variable set selected by the variable storage selection means and the corresponding coefficient set are retrieved from among those stored by said category mapping means, and a prosodic pattern of the synthesized speech is generated from the variable set, the coefficient set, and the input symbol string.