JP5498439B2

JP5498439B2 - Phoneme Labeling Data Phoneme Duration Length Conversion Method, Apparatus and Program

Info

Publication number: JP5498439B2
Application number: JP2011121435A
Authority: JP
Inventors: 歩相名神山; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-05-31
Filing date: 2011-05-31
Publication date: 2014-05-21
Anticipated expiration: 2031-05-31
Also published as: JP2012247729A

Description

The present invention relates to a phoneme labeling data phoneme duration conversion method that converts the phoneme labeling data of a phoneme labeling database into, for example, phoneme labeling data for a different speech synthesis system, and to an apparatus and a program therefor.

Phoneme labeling means assigning labels that indicate the types of the phonemes uttered in speech data and the boundaries between them. When a speech database has already been phoneme-labeled under an existing phoneme system and is to be used for speech synthesis, speech recognition, or the like based on a different phoneme system, it must in principle be relabeled under the new phoneme system.

Relabeling under a new phoneme system can be done manually or automatically. An automatic labeling method is disclosed, for example, in Non-Patent Document 1.

Takashi Nakamura, Noboru Miyazaki, Hideyuki Mizuno, "A study of a multi-stage automatic labeling method that handles pronunciation variation," Proceedings of the Acoustical Society of Japan, pp. 265-268, September 2009

Conventional manual labeling is highly accurate, but it takes considerable time and money. Automatic labeling keeps the cost low, but its labeling accuracy is low, and for some applications its output cannot be used as is. Converting an existing phoneme database into one that matches the target system therefore appears to be the most practical approach. However, because phoneme boundaries are difficult to convert mechanically, no apparatus or method for converting such a phoneme database has existed so far.

That is, a single phoneme in the new phoneme system may correspond to several phonemes in the existing phoneme system, or vice versa. Phoneme boundaries therefore cannot be converted by simple rules, which has made such conversion difficult to realize.

The present invention has been made in view of this problem. Its object is to provide a phoneme labeling data phoneme duration conversion method that converts an existing phoneme database to a different phoneme system accurately and at low cost, together with an apparatus and a program therefor.

The phoneme labeling data phoneme duration conversion method of the present invention comprises a phoneme duration distribution estimation process, a phoneme conversion process, and a phoneme duration division process. The phoneme duration distribution estimation process takes two inputs: reference labeling data, which is phoneme labeling data of the conversion target speaker in the new phoneme system, and multi-speaker labeling data, which contains phoneme types in a number sufficient to yield statistically reliable mean and variance values of the phoneme duration of each phoneme type in a phoneme system for a plurality of speakers. It linearly regresses the reference labeling data on the multi-speaker labeling data and obtains the phoneme duration distribution, that is, the mean and variance of the conversion target speaker's phoneme duration for every phoneme type in the multi-speaker labeling data. The phoneme conversion process converts existing phoneme labeling data, which is the conversion target speaker's phoneme labeling data before conversion, into phoneme labeling data of the new phoneme system. The phoneme duration division process takes the phoneme duration distribution and the phoneme labeling data of the new phoneme system as input. Phoneme labeling data in which one piece of phoneme information has one phoneme duration is passed through unchanged as new phoneme labeling data. Phoneme labeling data in which one phoneme duration carries several pieces of phoneme information has its phoneme duration divided among those pieces and is then output as new phoneme labeling data.

The phoneme labeling data phoneme duration conversion method of the present invention linearly regresses a small amount of phoneme labeling data in the new phoneme system on a large amount of multi-speaker labeling data, and obtains the conversion target speaker's phoneme duration distribution for every phoneme type in the multi-speaker labeling data. Phoneme labeling data in which one phoneme label contains several phonemes is then divided into per-phoneme durations on the basis of the mean and variance values obtained from the large amount of multi-speaker labeling data, and is output as phoneme labeling data of the new phoneme system. A large amount of phoneme labeling data in the new phoneme system can therefore be obtained accurately from a small amount of phoneme labeling data.

FIG. 1: A diagram showing an example of the functional configuration of the phoneme labeling data phoneme duration conversion device 100 of the present invention.
FIG. 2: A diagram showing the operation flow of the phoneme labeling data phoneme duration conversion device 100.
FIG. 3: A diagram showing an example of the functional configuration of the phoneme duration distribution estimation unit 40.
FIG. 4: A diagram showing the operation flow of the phoneme duration distribution estimation unit 40.
FIG. 5: A diagram showing the operation flow of the phoneme conversion unit 50.
FIG. 6: A diagram showing the operation flow of the phoneme duration division unit 60.
FIG. 7: A diagram illustrating the correlation between the mean phoneme duration μ_x′ of the conversion target speaker and the mean phoneme duration μ_x of the multiple speakers.
FIG. 8: A diagram illustrating the correlation between the phoneme duration variance σ_x′² of the conversion target speaker and the phoneme duration variance σ_x² of the multiple speakers.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in the drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example of the functional configuration of the phoneme labeling data phoneme duration conversion device 100 of the present invention, and FIG. 2 shows its operation flow. The device 100 comprises a phoneme duration distribution estimation unit 40, a phoneme conversion unit 50, and a phoneme duration division unit 60. The function of each unit is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

The phoneme duration distribution estimation unit 40 takes the reference labeling data 10 and the multi-speaker labeling data 20 as input, linearly regresses the reference labeling data 10 on the multi-speaker labeling data 20, and obtains the phoneme duration distribution, that is, the mean and variance of the conversion target speaker's phoneme duration for every phoneme type in the multi-speaker labeling data 20 (step S40). As indicated by the broken line in FIG. 1, the phoneme duration distribution may be stored in a ROM, a RAM, or a hard disk.

The reference labeling data 10 is a small amount of phoneme labeling data of the conversion target speaker in the new phoneme system. Its size is such that statistically reliable mean and variance values of phoneme duration can be obtained for no more than half of the phonemes. For example, when the multi-speaker labeling data 20 consists of 6022 sentences and 61 phoneme types, the reference labeling data 10 consists of about 5 sentences and 14 phoneme types. Each sentence is assumed to contain at least a subject and a predicate. Experiments confirmed that statistically reliable mean and variance values are obtained with 5 or more sentences.

The multi-speaker labeling data 20 is phoneme labeling data of multiple speakers in the new phoneme system. It contains all phoneme types in a number sufficient to yield statistically reliable mean and variance values of the phoneme duration of each phoneme type.

The phoneme conversion unit 50 converts the existing phoneme labeling data 30 into pre-division phoneme labeling data, which is phoneme labeling data of the new phoneme system (step S50). Table 1 shows part of the existing phoneme labeling data 30 as an example. Like the phoneme duration distribution, the pre-division phoneme labeling data may be stored in a ROM, a RAM, or a hard disk.

Conversion of the existing phoneme labeling data to the new phoneme system is performed in the following two stages.
The existing phoneme labeling data 30 is data in which phoneme labels and phoneme durations correspond one to one. It is existing data of the conversion target speaker in a phoneme system other than the new phoneme system that is the conversion target, and the phoneme conversion unit 50 converts it into pre-division phoneme labeling data of the new phoneme system.
Table 2 shows part of the pre-division phoneme labeling data as an example. Table 2 is an example of the first stage, in which the phonemes of Table 1 are converted according to the new phoneme system.

The "t" of the existing phoneme labeling data 30 corresponds to "cT,T" in the pre-division phoneme labeling data. Thus "t" corresponds to two phonemes in the new phoneme system. At this stage, however, the durations of the individual phonemes are not determined. Instead, the phoneme information of the two phonemes is paired with their single total phoneme duration.
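The two data shapes described above can be sketched as follows. This is only an illustrative representation: the tuple layout and the helper name `needs_division` are assumptions, and the values are the "t,70" / "a,60" and "cT,T,70" / "A,60" entries quoted in the description.

```python
# Existing labeling data: one (phoneme label, duration) pair per entry.
existing_labels = [("t", 70), ("a", 60)]

# Pre-division labeling data: a tuple of new-system phonemes that share one
# total duration. A single-phoneme entry is a tuple of length one.
pre_division_labels = [(("cT", "T"), 70), (("A",), 60)]


def needs_division(entry):
    """Return True when one duration is shared by several phonemes.

    Hypothetical helper, not named in the source text.
    """
    phonemes, _duration = entry
    return len(phonemes) > 1
```

Entries for which `needs_division` is false can be passed through unchanged, and the others go to the division stage.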

The phoneme duration division unit 60 takes as input the phoneme duration distribution output by the phoneme duration distribution estimation unit 40 and the pre-division phoneme labeling data output by the phoneme conversion unit 50. Phoneme labeling data in which one piece of phoneme information has one phoneme duration is passed through unchanged as new phoneme labeling data. Phoneme labeling data in which one phoneme duration carries several pieces of phoneme information has its phoneme duration divided among those pieces and is output as new phoneme labeling data (step S60).

Table 3 shows part of the new phoneme labeling data as an example. Table 3 is an example in which the phonemes of Table 2 have been converted into new phoneme labeling data.

The "cT,T" of the pre-division phoneme labeling data is divided into two pieces of phoneme information, "cT:70ms" and "T:20ms". The pre-division phoneme labeling data "A", in which one phoneme duration corresponds to one piece of phoneme information, is output unchanged as new phoneme labeling data.

This division of the phoneme duration is based on the phoneme duration distribution obtained by linear regression between the small amount of reference labeling data 10 and the large amount of multi-speaker labeling data 20. A large amount of phoneme labeling data in the new phoneme system can therefore be generated accurately from a small, partial set of phoneme labeling data in the new phoneme system.

The operation of each unit of the phoneme labeling data phoneme duration conversion device 100 is described in more detail below.
[Phoneme duration distribution estimation unit]
FIG. 3 shows an example of the functional configuration of the phoneme duration distribution estimation unit 40, and FIG. 4 shows its operation flow.
The phoneme duration distribution estimation unit 40 comprises mean/variance calculation means 41, linear regression equation estimation means 42, and phoneme duration distribution estimation means 43. Letting X be the set of all phonemes in the multi-speaker labeling data 20, the mean/variance calculation means 41 obtains the mean μ_x and the variance σ_x² of the phoneme duration of every phoneme x ∈ X (steps S41a to S41c).

Likewise, for the reference labeling data 10, letting X′ be the set of all its phonemes (where X′ ⊆ X), the means obtains, for every phoneme x ∈ X′, the number of occurrences n_x′, the mean phoneme duration μ_x′, and the variance σ_x′², and from these the set of means M₁ and the set of variances Σ₁ (step S41e).
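The per-phoneme statistics computed by the mean/variance calculation means can be sketched as below. This is a minimal sketch, not the patented implementation: the function name `duration_stats`, the `(phoneme, duration)` input layout, and the use of the population variance (dividing by the count rather than by the count minus one) are all assumptions, since the text does not say which variance estimator is used.

```python
from collections import defaultdict


def duration_stats(labels):
    """Return {phoneme: (count, mean, variance)} from (phoneme, duration) pairs."""
    durations = defaultdict(list)
    for phoneme, duration in labels:
        durations[phoneme].append(duration)

    stats = {}
    for phoneme, values in durations.items():
        count = len(values)
        mean = sum(values) / count
        # Population variance; this choice of estimator is an assumption.
        variance = sum((v - mean) ** 2 for v in values) / count
        stats[phoneme] = (count, mean, variance)
    return stats
```

The same function can be applied to the multi-speaker data (to get μ_x and σ_x²) and to the reference data (to get n_x′, μ_x′, and σ_x′²).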

The linear regression equation estimation means 42 obtains, by the least squares method, the linear regression equations that relate the mean μ_x′ and variance σ_x′² of the reference labeling data to the mean μ_x and variance σ_x² of the multi-speaker labeling data 20 (step S42a).

Let n be the prescribed number of occurrences of a phoneme at which statistically sufficiently reliable mean and variance values are obtained (n is preferably at least 5). For the set Y = {x | x ∈ X′, n_x′ ≥ n} of all phonemes with n_x′ ≥ n, the parameters a₁, a₂, b₁, b₂ are obtained by the least squares method as shown in the following equation.

Here, N(Y) is the number of elements of the phoneme set Y.
Let Z = X − Y = {x | x ∈ X, n_x′ < n} be the set of phonemes that appear in the multi-speaker labeling data 20 but either do not appear in the reference labeling data 10 or appear too few times to be statistically reliable (phonemes x with n_x′ < n).
For every phoneme x of Z (x ∈ Z), the phoneme duration distribution estimation means 43 obtains the mean μ_x′ and the variance σ_x′² using equation (3), and from these the set of means M₂ and the set of variances Σ₂ (step S43a).

The following mean set M and variance set Σ are then obtained as the phoneme duration distribution of all phonemes in the phoneme set X (step S43b).
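The regression and estimation steps can be sketched as follows. The regression equations themselves are not reproduced in this text, so the sketch assumes the usual straight-line form, with the target speaker's value modeled as `a * (multi-speaker value) + b`, fitted once for the means and once for the variances. The function names, the dictionary layouts, and the choice to keep measured values for the reliable set Y while using regressed values for Z are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b; returns (a, b).

    Needs at least two points with distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def estimate_distribution(multi_stats, ref_stats, min_count=5):
    """Estimate {phoneme: (mean, variance)} for the target speaker.

    multi_stats: {phoneme: (mean, variance)} from the multi-speaker data.
    ref_stats:   {phoneme: (count, mean, variance)} from the reference data.
    min_count:   the threshold n of the text.
    """
    # Set Y: phonemes seen often enough in the reference data.
    reliable = {x for x in multi_stats
                if x in ref_stats and ref_stats[x][0] >= min_count}
    a1, b1 = fit_line([multi_stats[x][0] for x in reliable],
                      [ref_stats[x][1] for x in reliable])
    a2, b2 = fit_line([multi_stats[x][1] for x in reliable],
                      [ref_stats[x][2] for x in reliable])

    result = {}
    for x, (mean, variance) in multi_stats.items():
        if x in reliable:
            # Measured directly from the reference data.
            result[x] = (ref_stats[x][1], ref_stats[x][2])
        else:
            # Set Z: predicted from the multi-speaker statistics.
            result[x] = (a1 * mean + b1, a2 * variance + b2)
    return result
```

With only a few reliable phonemes the fitted line is sensitive to outliers, which is presumably why the text asks for n of at least 5 occurrences per phoneme.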

[Phoneme conversion unit]
The operation of the phoneme conversion unit 50 is described with reference to its operation flow shown in FIG. 5. The phoneme conversion unit 50 converts the existing phoneme labeling data 30 into phoneme labeling data of the new phoneme system.

First, when a phoneme sequence x = (x₁, x₂, …, x_n) in the existing phoneme system corresponds to a phoneme sequence y = (y₁, y₂, …, y_m) in the new phoneme system, let F denote the complete set of conversion rules from the existing phoneme system to the new one, so that F(x) = y. Let X be the set of all phoneme sequences x for which F(x) is defined (the domain of F) (step S500). Table 4 shows examples of conversion rules.

Next, the number of phonemes L of the longest phoneme sequence in the set X is obtained (step S501). The following operations are then applied to all of the existing phoneme labeling data 30, and pre-division phoneme labeling data in the new phoneme system is output.

Let i be the label data number in the existing phoneme labeling data 30, and set i ← 1 (step S502). In Table 1 above, for example, the i-th label data is "t,70" for i = 1 and "a,60" for i = 2. The number k of phonemes constituting a phoneme sequence is then set to k ← L (step S503). With N the number of phoneme labels in the existing phoneme labeling data, the subsequent processing is repeated until N < i, that is, until all label data have been converted (step S509).

The i-th phoneme information x_i through the (i+k−1)-th phoneme information x_{i+k−1} of the existing phoneme labeling data 30 are acquired, and the phoneme sequence x = (x_i, x_{i+1}, …, x_{i+k−1}) is formed (step S504).
If x ∈ X (that is, x is defined as a pre-conversion phoneme sequence in the conversion rules), the following operations are performed (Yes in step S505).
Letting t_i be the i-th phoneme duration information of the existing phoneme labeling data 30, the phoneme information y = F(x) of the new phoneme system and the total phoneme duration t′ are obtained (step S507).

The phoneme information y and the phoneme duration information t′ are output as pre-division phoneme labeling data (step S508). To advance the label data number by the number of converted phonemes, i ← i + k is set. Processing then returns to the step that sets the number k of phonemes constituting a phoneme sequence to k ← L (step S503).

If x ∈ X does not hold (that is, x is not defined as a pre-conversion phoneme sequence in the conversion rules), k ← k − 1 is set to reduce the number of phonemes constituting the phoneme sequence (step S506), and processing returns to step S504.
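The loop of steps S500 to S509 amounts to a greedy longest-match rewrite, which can be sketched as below. The function name `convert_labels`, the dictionary form of the rule set F, and the example rules are hypothetical (Table 4 is not reproduced in this text). The text does not say what happens when no rule matches even a single phoneme, so raising an error in that case is an assumption.

```python
def convert_labels(labels, rules):
    """Greedy longest-match conversion of (phoneme, duration) labels.

    labels: list of (old phoneme, duration) pairs.
    rules:  {tuple of old phonemes: tuple of new phonemes} (the rule set F).
    Returns a list of (tuple of new phonemes, total duration) entries.
    """
    max_len = max(len(key) for key in rules)  # L in the text
    output = []
    i = 0
    while i < len(labels):
        k = min(max_len, len(labels) - i)
        while k > 0:
            old_seq = tuple(p for p, _ in labels[i:i + k])
            if old_seq in rules:
                total = sum(d for _, d in labels[i:i + k])  # t' in the text
                output.append((rules[old_seq], total))
                i += k  # skip the phonemes just converted
                break
            k -= 1  # try a shorter sequence
        else:
            # Not covered by the source text; assumed behavior.
            raise KeyError(f"no conversion rule for {labels[i][0]!r}")
    return output


# Hypothetical rules in the spirit of the "t" -> "cT,T" example.
example_rules = {("t",): ("cT", "T"), ("a",): ("A",)}
print(convert_labels([("t", 70), ("a", 60)], example_rules))
```

Trying the longest candidate first means a multi-phoneme rule takes priority over single-phoneme rules that cover the same labels.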

[Phoneme duration division unit]
The operation of the phoneme duration division unit 60 is described with reference to its operation flow shown in FIG. 6. For those entries of the pre-division labeling data obtained by the phoneme conversion unit 50 whose phoneme information has been converted into several phonemes, the phoneme duration division unit 60 divides the phoneme duration.

The following operations are applied to all of the pre-division labeling data.
Let i be the label data number in the pre-division phoneme labeling data, and set i ← 1 (step S600). In Table 2 above, for example, the label data is "cT,T,70" for i = 1 and "A,60" for i = 2. With N the number of phoneme labels in the pre-division phoneme labeling data, processing ends when N < i, that is, when all label data have been divided (step S609).

The phoneme information x_i = (x_i,1, x_i,2, …, x_i,n) and the phoneme duration information t_i of the i-th pre-division phoneme labeling data are acquired (step S601). In Table 2 above, for i = 1, x_i = (cT, T) and t_i = 70, with x_i,1 = cT and x_i,2 = T.

The number of phonemes n constituting the phoneme sequence x_i is obtained (step S602). For example, n = 2 when x_i = (cT, T).
For every phoneme x_i,k (k = 1, 2, …, n) constituting the phoneme sequence x_i, the mean phoneme duration μ_{x_i,k} and the variance σ²_{x_i,k} are acquired from the phoneme duration distribution, and the total M of the means and the total S of the variances are obtained (step S603).

Next, the following operations are performed to output each piece of phoneme information and its phoneme duration.
Let k be the index of a phoneme within the phoneme sequence x_i, and set k ← 1 (step S604). If n < k, that is, when all phonemes of the phoneme sequence x_i have been output, i ← i + 1 is set and processing returns to step S601 to handle the next label data.

Let x be the phoneme information to be output, and set x ← x_i,k (step S605). For example, when x_i = (cT, T), x = x_i,k = cT for k = 1 and x = x_i,k = T for k = 2.
The mean μ_x and the variance σ_x² of the phoneme duration of the phoneme information x are acquired from the phoneme duration distribution, and the divided phoneme duration information t′ is obtained by equation (13) (step S606).
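The division step can be sketched as follows. Equation (13) is not reproduced in this text, so the rule used here is an assumption and is not presented as the patent's formula: each phoneme receives its own mean duration plus a share of the residual (t_i − M) that is proportional to its variance, which is one common way to apportion a total duration using the totals M and S. The function name and the example distribution values are hypothetical.

```python
def split_duration(phonemes, total, distribution):
    """Split `total` across `phonemes`.

    phonemes:     tuple of phoneme labels sharing one duration.
    total:        the pre-division duration t_i.
    distribution: {phoneme: (mean, variance)} for the target speaker.
    Returns a list of (phoneme, duration) pairs that sum to `total`.
    """
    if len(phonemes) == 1:
        # One phoneme, one duration: pass through unchanged.
        return [(phonemes[0], total)]

    mean_sum = sum(distribution[p][0] for p in phonemes)  # M in the text
    var_sum = sum(distribution[p][1] for p in phonemes)   # S in the text
    residual = total - mean_sum
    # ASSUMED rule: mean plus variance-weighted share of the residual.
    return [(p, distribution[p][0] + residual * distribution[p][1] / var_sum)
            for p in phonemes]


# Hypothetical distribution values, for illustration only.
example_distribution = {"cT": (50.0, 100.0), "T": (20.0, 100.0)}
print(split_duration(("cT", "T"), 90, example_distribution))
```

Under this assumed rule the parts always add up to the original duration, and phonemes whose duration varies more absorb more of the difference between the observed total and the expected total. The sketch does not guard against a zero variance total or against negative results when the observed total is much shorter than M.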

The phoneme information x and the phoneme duration information t′ thus obtained are output as new phoneme labeling data (step S607). Then k ← k + 1 is set, and processing returns to step S605 to handle the next phoneme.
[Linear regression]
The effectiveness of the present invention is explained by showing the correlation coefficients of the linear regression equations obtained by the linear regression equation estimation means 42 of the phoneme duration distribution estimation unit 40. The correlations obtained from actual data are shown in FIG. 7 and FIG. 8. FIG. 7 shows the correlation of the mean phoneme duration, and FIG. 8 shows the correlation of its variance.

The correlation coefficients were obtained from the following data. The multi-speaker labeling data consisted of 33763 sentences with 117 phoneme types, the data of conversion target speaker 1 consisted of 6022 sentences with 89 phoneme types, and the data of conversion target speaker 2 consisted of 94 sentences with 56 phoneme types.

The correlation coefficient between the mean μ_x′ and the mean μ_x was γ = 0.99 for speaker 1 (■ in FIG. 7) and γ = 0.96 for speaker 2 (◆ in FIG. 7), a very high correlation. The correlation between the variance σ_x′² and the variance σ_x² was also good, with γ = 0.81 for speaker 1 (■ in FIG. 8) and γ = 0.73 for speaker 2 (◆ in FIG. 8).
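A correlation coefficient of this kind can be computed as sketched below. The text does not state which coefficient γ denotes, so the sketch assumes the ordinary Pearson product-moment correlation between per-phoneme values of the target speaker and of the multiple speakers. The function name is hypothetical, and the sketch is not a reproduction of the reported experiment.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)
```

Here `xs` would hold μ_x (or σ_x²) and `ys` would hold μ_x′ (or σ_x′²) for the phonemes common to both data sets.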

These correlation coefficients show that, with the phoneme labeling data phoneme duration conversion device 100 of the present invention, a large amount of phoneme labeling data in the new phoneme system can be obtained accurately from a small amount of phoneme labeling data.

When the processing means of the above device are realized by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on the computer realizes the processing means of each device on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which it is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be realized in hardware.

Claims

A phoneme duration distribution estimation process that takes as input reference labeling data, which is phoneme labeling data of a conversion target speaker in a new phoneme system, and multi-speaker labeling data, which contains phoneme types in a number sufficient to yield statistically reliable mean and variance values of the phoneme duration of each phoneme type in a phoneme system for a plurality of speakers, linearly regresses the reference labeling data on the multi-speaker labeling data, and obtains a phoneme duration distribution consisting of the mean and variance of the phoneme duration of the conversion target speaker for every phoneme type in the multi-speaker labeling data;
A phoneme conversion process that converts existing phoneme labeling data, which is the phoneme labeling data of the conversion target speaker before conversion, into phoneme labeling data of the new phoneme system; and
A phoneme duration division process that takes the phoneme duration distribution and the phoneme labeling data of the new phoneme system as input, passes phoneme labeling data in which one piece of phoneme information has one phoneme duration through unchanged as new phoneme labeling data, and, for phoneme labeling data in which one phoneme duration carries a plurality of pieces of phoneme information, divides the phoneme duration among the pieces of phoneme information and outputs the result as new phoneme labeling data;
A phoneme labeling data phoneme duration conversion method comprising the above processes.

In the phoneme labeling data phoneme duration conversion method according to claim 1,
The phoneme duration distribution estimation process includes:
A mean/variance calculation step of calculating the mean and variance values of each of the reference labeling data and the multi-speaker labeling data;
A linear regression equation estimation step of estimating a linear relational expression that linearly regresses the mean and variance values of the reference labeling data on the multi-speaker labeling data; and
A phoneme duration distribution estimation step of estimating, by the linear relational expression, the mean and variance of the phoneme duration of the conversion target speaker for all phoneme types in the multi-speaker labeling data;
A phoneme labeling data phoneme duration conversion method characterized by the above.

In the phoneme labeling data phoneme duration conversion method according to claim 1 or 2,
The phoneme duration division process includes:
A phoneme number acquisition step of obtaining the number of phonemes n constituting a phoneme sequence x_i;
A total value acquisition step of acquiring, for every phoneme x_i,k (k = 1, 2, …, n) constituting the phoneme sequence x_i, the mean phoneme duration μ_{x_i,k} and the variance σ²_{x_i,k} from the phoneme duration distribution, and obtaining the total M of the means and the total S of the variances; and
A phoneme duration information acquisition step of obtaining the phoneme duration information t′ of the new phoneme labeling data by

where t_i is the phoneme duration before division;
A phoneme labeling data phoneme duration conversion method characterized by the above.

Reference labeling data, which is phoneme labeling data of a conversion target speaker in a new phoneme system;
Multi-speaker labeling data, which is phoneme labeling data containing all phoneme types in a number sufficient to yield statistically reliable mean and variance values of the phoneme duration of each phoneme type in a phoneme system for a plurality of speakers;
Existing phoneme labeling data, which is the phoneme labeling data of the conversion target speaker before conversion;
A phoneme duration distribution estimation unit that takes the reference labeling data and the multi-speaker labeling data as input, linearly regresses the reference labeling data on the multi-speaker labeling data, and obtains a phoneme duration distribution consisting of the mean and variance of the phoneme duration of the conversion target speaker for every phoneme type in the multi-speaker labeling data;
A phoneme conversion unit that converts the existing phoneme labeling data into pre-division phoneme labeling data, which is phoneme labeling data of the new phoneme system; and
A phoneme duration division unit that takes the phoneme duration distribution and the phoneme labeling data of the new phoneme system as input, passes phoneme labeling data in which one piece of phoneme information has one phoneme duration through unchanged as new phoneme labeling data, and, for phoneme labeling data in which one phoneme duration carries a plurality of pieces of phoneme information, divides the phoneme duration among the pieces of phoneme information and outputs the result as new phoneme labeling data;
A phoneme labeling data phoneme duration conversion device comprising the above.

A program for causing a computer to function as the phoneme labeling data phoneme duration conversion device according to claim 4.