JP2001092481A

JP2001092481A - Method for rule speech synthesis

Info

Publication number: JP2001092481A
Application number: JP26988499A
Authority: JP
Inventors: Hiroyuki Hirai; 啓之平井; Makoto Hashimoto; 誠橋本; Hideji Nishida; 秀治西田; Kazuyoshi Okura; 計美大倉; Hiroki Onishi; 宏樹大西
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1999-09-24
Filing date: 1999-09-24
Publication date: 2001-04-06

Abstract

PROBLEM TO BE SOLVED: To provide a rule-based speech synthesis method that performs speech synthesis using CV and VC unit speech segments prepared in advance and that reduces both the size of the speech dictionary and the distortion at the connection surface. SOLUTION: The rule-based speech synthesis method of this invention is characterized by dividing CV and VC unit speech segments, using an HMnet, into synthesis units finer than phoneme units, creating an intra-phoneme network spanning all the synthesis units, and selecting speech segments corresponding to an input phoneme sequence on the basis of the obtained intra-phoneme network.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a rule-based speech synthesis method.

[0002]

[Prior Art] In rule-based speech synthesis, speech is generally synthesized by selecting connection units (speech segments) extracted from speech data according to the text to be synthesized, modifying their prosody, and concatenating them.

[0003] Connection units such as (C, V), (CV, VC), and VCV are used. Here, C denotes a consonant and V denotes a vowel.

[0004] The number of connection units is smallest when phonemes (C, V) are used as the connection units. At the boundary between phonemes, however, the acoustic properties differ depending on the types of the preceding and succeeding phonemes, so if only one segment is held for each phoneme, distortion at the connection surface increases. It is therefore generally necessary to hold multiple segments for each phoneme. Holding multiple segments for each connection unit improves sound quality but increases the size of the speech dictionary.

[0005] When (CV, VC) are used as the connection units, about 500 segments are needed to synthesize Japanese even if only one segment is held for each connection unit. In terms of sound quality as well, if only one segment is held for each connection unit, connection units whose preceding or succeeding phonemes differ will sometimes be joined, which causes large connection distortion.

[0006] When (CV, VC) are used as the connection units, similar waveforms are contained redundantly even across different connection units. To remove this redundancy, the connection units can be made finer and portions with similar acoustic properties shared as a single connection unit (synthesis unit). Moreover, the influence of differing preceding and succeeding phonemes does not appear uniformly across the whole phoneme. With finer connection units, the number of connection units can therefore be increased efficiently only in the portions where the influence is large.

[0007] This idea also holds when phonemes (C, V) are used as the connection units, and has already been proposed by Ito et al. (see Japanese Unexamined Patent Publication No. H9-222898). In that proposal, an acoustic parameter network is stored in which, for each phoneme, multiple arcs are linked via nodes and an acoustic feature parameter is attached to each arc; a sequence of arcs is searched for according to the input phoneme sequence, a sequence of acoustic feature parameters is formed, and speech is synthesized.

[0008] In that method, a network is formed for each phoneme. For a speech database consisting of CV and VC unit segments prepared in advance, on the other hand, a network spanning all the synthesis units must be constructed, so the method cannot be applied straightforwardly.

[0009]

[Problem to Be Solved by the Invention] An object of the present invention is to provide a rule-based speech synthesis method that performs speech synthesis using CV and VC unit segments prepared in advance, and that can reduce the size of the speech dictionary while also reducing distortion at the connection surface.

[0010]

[Means for Solving the Problem] The rule-based speech synthesis method according to the present invention is characterized in that CV and VC unit segments are divided, using an HMnet, into synthesis units finer than phoneme units; an intra-phoneme network spanning all the synthesis units is generated; and segments corresponding to an input phoneme sequence are selected on the basis of the obtained intra-phoneme network.

[0011] The intra-phoneme network is generated, for example, by: a step of splitting states by the successive state splitting method, starting from an initial HMnet state composed of a plurality of first states obtained from the start portions of the individual CV and VC unit segments prepared in advance, one second state obtained from the central portions of all the CV and VC unit segments, and a plurality of third states obtained from the end portions of the individual CV and VC unit segments; and a step of determining a representative segment for each state of the HMnet on the basis of the HMnet split to an arbitrary number of states and the individual CV and VC unit segments.

[0012] The step of determining a representative segment for each state of the HMnet consists, for example, of: a step of dividing each training segment among the states, using the individual CV and VC unit segments and the HMnet split to an arbitrary number of states, to create segment candidates for each state; and a step of taking as the total connection points the connection points between all states of the HMnet plus all combinations of ends and starts of the HMnet that consist of the same phoneme, selecting from the segment candidates of each state the combination of segments that minimizes the sum of the connection distortions at the total connection points, and taking the selected segments as the representative segments.

[0013]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0014] [1] Description of the features of the invention

[0015] The present invention is a speech synthesis method using CV and VC as connection units, in which the speech dictionary is compressed by forming a network over all the connection units using a hidden Markov network (HMnet).

[0016] An HMnet is a technique used in speech recognition: when training HMMs, states are shared among different phonemes and allophones, which reduces the number of parameters of the overall model and aims at estimating a statistically stable model (see Takami and Sagayama, "Automatic generation of hidden Markov networks by the successive state splitting (SSS) method," Proceedings of the Acoustical Society of Japan Meeting, fiscal year Heisei 3, 2-5-1 (1991-9)). In the present invention, one connection unit for synthesis (synthesis unit) is assigned to each state of the HMnet.

[0017] FIG. 1 shows an example of an intra-phoneme network obtained by dividing CV and VC unit segments, using an HMnet, into synthesis units finer than phoneme units.

[0018] Each ellipse represents a state, and one connection unit for synthesis (a unit finer than a phoneme unit) is assigned to each state. A label such as "a-s,z,n" inside an ellipse indicates that segments beginning with [a] and ending with [s], [z], or [n] pass through that state.
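As an illustration of the state labels in FIG. 1, the following minimal Python sketch parses a label of the form "a-s,z,n" and checks whether a given segment passes through the state. The names `StateLabel`, `parse_state_label`, and `passes_through` are hypothetical and do not come from the patent, and the sketch assumes every label has exactly one starting phoneme followed by a comma-separated list of ending phonemes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateLabel:
    start: str        # phoneme the passing segments begin with
    ends: frozenset   # phonemes the passing segments may end with


def parse_state_label(label: str) -> StateLabel:
    """Parse a FIG. 1 style label such as 'a-s,z,n' (assumed format)."""
    start, _, ends = label.partition("-")
    end_set = frozenset(e.strip() for e in ends.split(",") if e.strip())
    return StateLabel(start.strip(), end_set)


def passes_through(label: StateLabel, segment: str) -> bool:
    """Return True if a segment written as 'x-y' passes through the state."""
    start, _, end = segment.partition("-")
    return start == label.start and end in label.ends
```

For example, the label "a-s,z,n" admits the segment "a-s" but not "s-a".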

[0019] [2] Description of the method for generating the intra-phoneme network

[0020] The following describes a method for generating the intra-phoneme network by dividing CV and VC segments more finely using an HMnet.

[0021] FIG. 2 shows the initial state of the HMnet.

[0022] The left column shows the plurality of first states. Of all the CV and VC segments prepared for training, the segments beginning with [a] are extracted, and the state consisting of the mean and variance of the acoustic parameters computed from their start portions is denoted [a*]; the state computed in the same way from the segments beginning with [u] is denoted [u*]; and so on. The central state is the single second state, consisting of the mean and variance of the acoustic parameters computed from the central portions of all the CV and VC segments. The right column shows the plurality of third states, each consisting of the mean and variance of the acoustic parameters computed from the end portions of the individual CV and VC segments.
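The construction of the initial states can be sketched as follows. This is a simplified illustration, not the patent's implementation: acoustic parameters are treated as scalars rather than feature vectors, the dictionary keys (`first`, `last`, `start`, `center`, `end`) are hypothetical, and grouping the third states by final phoneme is an assumption made by analogy with the first states.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def mean_var(values):
    """Mean and population variance of a list of scalar parameters."""
    return fmean(values), pvariance(values)


def initial_states(segments):
    """Build the first, second, and third initial states.

    segments: list of dicts with keys 'first' and 'last' (phoneme labels)
    and 'start', 'center', 'end' (lists of scalar acoustic parameters).
    """
    first_pool = defaultdict(list)
    third_pool = defaultdict(list)
    center_pool = []
    for seg in segments:
        first_pool[seg["first"]].extend(seg["start"])   # e.g. [a*], [u*]
        center_pool.extend(seg["center"])               # one pooled state
        third_pool[seg["last"]].extend(seg["end"])      # assumed grouping
    first = {f"{p}*": mean_var(v) for p, v in first_pool.items()}
    second = mean_var(center_pool)
    third = {f"*{p}": mean_var(v) for p, v in third_pool.items()}
    return first, second, third
```

Each state is returned as a `(mean, variance)` pair; a real system would presumably hold a mean vector and a covariance per state.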

[0023] The successive state splitting method was used to split the states. In the successive state splitting method as used for speech recognition, splitting in the contextual direction is performed on the basis of the phoneme type or the phoneme environment; in this embodiment, however, states are added regardless of the phoneme type.

[0024] The reason is that distortion of the acoustic parameters during speech synthesis is affected not only by the phoneme environment but also by other factors such as the phonological environment, and states are to be allocated to those factors as well.

[0025] Once the states have been split to an arbitrary number of states by the successive state splitting method, as shown for example in FIG. 1, the representative segment of each state is determined as follows.

[0026] (1) Using each training segment (CV or VC segment) and the HMnet split to an arbitrary number of states, each training segment is divided among the states by the Viterbi algorithm to create segment candidates for each state.

[0027] That is, a training segment is divided among the states according to the ratio of the durations spent in each state along the path through the HMnet that corresponds to that training segment. If the duration of a state with no self-transition (a transition from a state to itself) is taken as 1, the duration of a state with self-transitions is (1 + n), where n is the number of self-transitions.

[0028] (2) The connection relations between all states are obtained from the state transitions of the HMnet.

[0029] (3) The total connection points are defined as these connection relations plus all combinations of ends and starts of the HMnet that consist of the same phoneme.

[0030] (4) From the segment candidates of each state, the combination of segments that minimizes the sum of the connection distortions at the total connection points is selected, and the selected segments are taken as the representative segments.
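The duration rule used in step (1), where a state with n self-transitions receives a relative duration of 1 + n, can be sketched as below. The function name `split_by_state_duration` and the frame-based representation are assumptions for illustration; the patent does not specify how boundaries are rounded.

```python
def split_by_state_duration(num_frames, self_loops):
    """Divide a segment of num_frames frames among the states on its path.

    self_loops[i] is the number n of self-transitions at the i-th state on
    the Viterbi path, so that state gets a relative duration of 1 + n.
    Returns one (begin, end) frame range per state, end exclusive.
    """
    weights = [1 + n for n in self_loops]
    total = sum(weights)
    bounds = [0]
    acc = 0
    for w in weights:
        acc += w
        # Boundary placed in proportion to the accumulated duration.
        bounds.append(round(num_frames * acc / total))
    return [(bounds[i], bounds[i + 1]) for i in range(len(weights))]
```

With a 12-frame segment and self-transition counts `[0, 2, 1]`, the relative durations are 1 : 3 : 2, so the three states receive frames 0 to 2, 2 to 8, and 8 to 12.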

[0031] Simulated annealing (SA) was used to select the representative segments in (4) above. The connection distortion was calculated by Equation 1 below.

[0032]

[Equation 1]

[0033] D_F0, D_pow, and D_cep denote, respectively, the difference in fundamental frequency, the difference in power, and the cepstral distance at the connection point of the two synthesis units being joined. w_F0, w_pow, and w_cep denote the weighting coefficients by which D_F0, D_pow, and D_cep are respectively multiplied.
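The selection of representative segments can be sketched as follows. Equation 1 itself is not reproduced in this text, so the sketch assumes the distortion is the weighted sum D = w_F0 * D_F0 + w_pow * D_pow + w_cep * D_cep, which is consistent with the definitions above but is an assumption. The data layout (`head`/`tail` feature dicts), the list `connections` of total connection points, and the annealing schedule (`steps`, `t0`, `cooling`) are likewise hypothetical.

```python
import math
import random


def connection_distortion(left, right, w_f0=1.0, w_pow=1.0, w_cep=1.0):
    """Assumed weighted-sum distortion between two joined edges.

    left/right: dicts with 'f0' and 'pow' (floats) and 'cep' (list of floats).
    """
    d_f0 = abs(left["f0"] - right["f0"])
    d_pow = abs(left["pow"] - right["pow"])
    d_cep = math.dist(left["cep"], right["cep"])  # Euclidean cepstral distance
    return w_f0 * d_f0 + w_pow * d_pow + w_cep * d_cep


def total_distortion(choice, candidates, connections):
    """Sum of distortions over all connection points (a -> b)."""
    return sum(
        connection_distortion(candidates[a][choice[a]]["tail"],
                              candidates[b][choice[b]]["head"])
        for a, b in connections
    )


def select_representatives(candidates, connections, steps=2000,
                           t0=1.0, cooling=0.995, seed=0):
    """Pick one candidate index per state by simulated annealing."""
    rng = random.Random(seed)
    states = list(candidates)
    choice = {s: 0 for s in states}
    cost = total_distortion(choice, candidates, connections)
    best, best_cost = dict(choice), cost
    temp = t0
    for _ in range(steps):
        state = rng.choice(states)
        old = choice[state]
        choice[state] = rng.randrange(len(candidates[state]))
        new_cost = total_distortion(choice, candidates, connections)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature falls.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(choice), cost
        else:
            choice[state] = old
        temp *= cooling
    return best, best_cost
```

Here `connections` is assumed to already contain both the state-to-state links of the HMnet and the same-phoneme end-to-start combinations. Recomputing the full sum at every step keeps the sketch short; an incremental update over only the affected connection points would be the natural optimization.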

[0034] [3] Description of the speech synthesis method

[0035] The following describes speech synthesis using the intra-phoneme network (HMnet) of FIG. 1.

[0036] Here, synthesis of the word "asa" (あさ) is described. "asa" is synthesized from the segment "a-s" and the segment "s-a".

[0037] (1) First, the path through which "a-s" passes is extracted, as shown in FIG. 3.

[0038] (2) Next, the path through which "s-a" passes is extracted, as shown in FIG. 4.

[0039] (3) These paths are then connected, as shown in FIG. 5.

[0040] In this case there is a path through state A and a path through state B. The connection distortion of the two paths is compared, and the path with the smaller connection distortion is selected. The representative segments of the states on the selected path are adjusted to an appropriate pitch, duration, and amplitude and then concatenated to generate the synthesized speech.
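The choice between the path through state A and the path through state B can be sketched as a minimum over candidate paths. The function names and the `join_cost` callback are hypothetical; the sketch assumes that a path's connection distortion is the sum of the distortions at each junction between consecutive states, which the patent does not state explicitly.

```python
def path_cost(path, join_cost):
    """Sum of connection distortions between consecutive states on a path."""
    return sum(join_cost(a, b) for a, b in zip(path, path[1:]))


def choose_path(paths, join_cost):
    """Return the candidate path with the smallest total distortion."""
    return min(paths, key=lambda p: path_cost(p, join_cost))


# Toy example: two candidate paths that differ only in the middle state.
# The state names and distortion values are illustrative, not from FIG. 5.
toy_costs = {
    ("s1", "A"): 2.0, ("A", "s2"): 1.0,
    ("s1", "B"): 0.5, ("B", "s2"): 1.0,
}
selected = choose_path(
    [["s1", "A", "s2"], ["s1", "B", "s2"]],
    lambda a, b: toy_costs[(a, b)],
)
```

In this toy case the path through B is selected, since its total distortion (1.5) is smaller than that of the path through A (3.0).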

[0041]

[Effects of the Invention] According to the present invention, in a speech synthesis method that performs speech synthesis using CV and VC unit segments prepared in advance, the size of the speech dictionary can be reduced and distortion at the connection surface can also be reduced.

[Brief description of the drawings]

FIG. 1 is a schematic diagram showing an example of an intra-phoneme network obtained by dividing CV and VC unit segments, using an HMnet, into synthesis units finer than phoneme units.

FIG. 2 is a schematic diagram showing the initial state of the HMnet.

FIG. 3 is a schematic diagram showing the path through which "a-s" passes in the HMnet of FIG. 1.

FIG. 4 is a schematic diagram showing the path through which "s-a" passes in the HMnet of FIG. 1.

FIG. 5 is a schematic diagram showing the path obtained by connecting the path of FIG. 3 and the path of FIG. 4.

Continuation of the front page
(72) Inventor: Hideji Nishida, c/o Sanyo Electric Co., Ltd., 2-5-5 Keihan-Hondori, Moriguchi-shi, Osaka
(72) Inventor: Kazuyoshi Okura, c/o Sanyo Electric Co., Ltd., 2-5-5 Keihan-Hondori, Moriguchi-shi, Osaka
(72) Inventor: Hiroki Onishi, c/o Sanyo Electric Co., Ltd., 2-5-5 Keihan-Hondori, Moriguchi-shi, Osaka
F-term (reference): 5D045 AA07

Claims

[Claims]

1. A rule-based speech synthesis method, characterized in that CV and VC unit segments are divided, using an HMnet, into synthesis units finer than phoneme units; an intra-phoneme network spanning all the synthesis units is generated; and segments corresponding to an input phoneme sequence are selected on the basis of the obtained intra-phoneme network.

2. The rule-based speech synthesis method according to claim 1, wherein the intra-phoneme network is generated by: a step of splitting states by the successive state splitting method, starting from an initial HMnet state composed of a plurality of first states obtained from the start portions of individual CV and VC unit segments prepared in advance, one second state obtained from the central portions of all the CV and VC unit segments, and a plurality of third states obtained from the end portions of the individual CV and VC unit segments; and a step of determining a representative segment for each state of the HMnet on the basis of the HMnet split to an arbitrary number of states and the individual CV and VC unit segments.

3. The rule-based speech synthesis method according to claim 2, wherein the step of determining a representative segment for each state of the HMnet consists of: a step of dividing each training segment among the states, using the individual CV and VC unit segments and the HMnet split to an arbitrary number of states, to create segment candidates for each state; and a step of taking as the total connection points the connection points between all states of the HMnet plus all combinations of ends and starts of the HMnet that consist of the same phoneme, selecting from the segment candidates of each state the combination of segments that minimizes the sum of the connection distortions at the total connection points, and taking the selected segments as the representative segments.