JP4787769B2

JP4787769B2 - F0 value time series generating apparatus, method thereof, program thereof, and recording medium thereof

Info

Publication number: JP4787769B2
Application number: JP2007027547A
Authority: JP
Inventors: 昇宮崎
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-02-07
Filing date: 2007-02-07
Publication date: 2011-10-05
Anticipated expiration: 2027-02-07
Also published as: JP2008191525A

Abstract

PROBLEM TO BE SOLVED: To generate F0 value time series for a variety of speaking tones at low cost.

SOLUTION: A text is input in which a boundary position and an accent type are determined for each accent phrase and a start time and an end time are determined for each mora (3, S2). Using a prosodic event table, a plurality of prosodic events are associated with designated positions of each accent phrase according to its accent type (12, S8). Using a tone-specific prosodic event table, a tone-specific prosodic event corresponding to an occurrence condition met by the accent phrase is added (13, S10). A prosodic event parameter is generated for each prosodic event from a prosodic event parameter database (22, S12), and a delta function is generated for each prosodic event from a generation function table (16, S16). An initial F0 value is calculated for each accent phrase (18, S18), and the F0 value time series is generated for each accent phrase from the delta function and the initial F0 value (20, S20).

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention belongs to the field of text-to-speech synthesis, which generates synthesized speech from text. In particular, it relates to an F0 value time series generation apparatus that generates the prosodic pattern of speech in order to give the speech appropriate intonation, and to a corresponding method, program, and recording medium.

In the following description, the F0 value is the fundamental frequency of speech at a given point in time, and the F0 value time series is the sequence of F0 values over the duration of the synthesized speech.
Prior art 1 is a method of generating the F0 value time series for synthesized speech that applies an HMM based on multi-space probability distributions (multi-space probability distribution HMM: MSD-HMM) and models pitch and spectrum in a unified manner, using feature parameters that combine pitch parameters and spectral parameters. A statistical model such as an HMM is trained on the temporal change and duration of the F0 value of each phoneme, and a plausible F0 value time series is generated from the trained model. Details are described in Non-Patent Document 1.

Prior art 2 uses a generation process model that expresses the F0 value time series by combining a phrase component, which declines gradually over each pause phrase consisting of several accent phrases, with an accent component specified for each accent phrase. The F0 value time series is obtained by supplying the model with parameters such as the decline parameter of the phrase component and the amplitude and position parameters of the accent component. Details are described in Non-Patent Document 2.
Here, a mora, as the term is used in Non-Patent Documents 1 and 2, is a phonological unit of sound with a fixed temporal length. For example, the word "chokoreeto" (chocolate) consists of the moras "cho", "ko", "re", the long vowel "e", and "to".

An accent phrase is a linguistic unit containing zero or one accent nucleus and is usually formed from one or more bunsetsu (phrases). The accent nucleus is the mora that carries the accent. Japanese accent phrases fall broadly into the following three types, (1) to (3), according to the position of the accent nucleus.
(1) Type 0 accent phrase: the F0 value of the first mora is relatively low and the F0 values of the second and subsequent moras are relatively high; that is, an accent phrase that contains no accent nucleus.
(2) Type 1 accent phrase: the F0 value of the first mora is relatively high and the F0 values of the second and subsequent moras are relatively low; that is, an accent phrase in which the first mora is the accent nucleus.
(3) Type n accent phrase (n is an integer of 2 or more): the F0 value of the first mora is relatively low, the F0 values from the second mora to the n-th mora are relatively high, and the F0 values from the (n+1)-th mora onward are relatively low; that is, an accent phrase in which the n-th mora from the beginning is the accent nucleus.
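The three accent types above can be summarized as a relative high/low label per mora. The following is a minimal sketch under that reading; the function name `accent_pattern` and the "H"/"L" labels are illustrative assumptions and do not appear in the source.

```python
def accent_pattern(accent_type: int, n_moras: int) -> list[str]:
    """Return a relative 'H' (high) or 'L' (low) F0 label for each mora.

    accent_type 0 means no accent nucleus; accent_type k >= 1 means the
    k-th mora is the accent nucleus. Labels are illustrative only.
    """
    if n_moras < 1 or not 0 <= accent_type <= n_moras:
        raise ValueError("accent type must lie between 0 and the mora count")
    if accent_type == 0:
        # Type 0: low first mora, high from the second mora onward.
        return ["L"] + ["H"] * (n_moras - 1)
    if accent_type == 1:
        # Type 1: high first mora (the nucleus), low afterwards.
        return ["H"] + ["L"] * (n_moras - 1)
    # Type n (n >= 2): low first mora, high through the n-th mora, low after it.
    return ["L"] + ["H"] * (accent_type - 1) + ["L"] * (n_moras - accent_type)


# A four-mora phrase of type 3, such as "so-re-de-wa" in the embodiment.
print(accent_pattern(3, 4))
```

The sketch only encodes the coarse classification; actual F0 contours are continuous and depend on the parameters described later.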

Non-Patent Document 3 describes a technique for assigning boundaries to accent phrases and a technique for assigning an accent type to each accent phrase.
Prior art 3 is a case-based method that uses templates: a large number of F0 value time series extracted from real speech are collected, and a time series that is syntactically similar to the speech to be synthesized is retrieved and used. Details are described in Patent Document 1.
All of these methods have succeeded in synthesizing reasonably natural speech.
Non-Patent Document 1: "Pitch pattern generation by multi-space probability distribution HMM", IEICE Transactions D-II, Vol. J38-D-II, No. 7, July 2000, pp. 1600-1609
Non-Patent Document 2: "Analysis of voice fundamental frequency contours for declarative sentences of Japanese", Journal of the Acoustical Society (E), Vol. 5, No. 4 (1984)
Non-Patent Document 3: Asano, Matsuoka, Takagi, Ohara, "Method of setting reading and prosodic information for speech synthesis using morphological analysis by a multistage analysis method, and its word dictionary structure", Natural Language Processing, Vol. 6, No. 2, Jan. 1999
Patent Document 1: Japanese Patent No. 3420964

All of the conventional methods assume a so-called read-aloud tone, in which an announcer reads text in a flat, matter-of-fact way. Text-to-speech synthesis, however, is not used only for the read-aloud tone. For example, synthesizing speech that resembles the tone of a telephone reception operator would allow part of the operator's work to be done by machine. Likewise, synthesizing speech that resembles the lively tone of sports news or play-by-play commentary would allow information that professional announcers do not normally cover, such as the results of amateur baseball games, to be converted to speech and broadcast by community-based local stations.

When speech is to be synthesized in such a variety of tones, each of the conventional methods has problems and is difficult to use as it is.

First problem
In prior art 1, which synthesizes the F0 value time series from an HMM, the F0 value time series is learned and synthesized phoneme by phoneme. To generate speech resembling a new tone, the mean F0 value of each phoneme, its derivative, and in some cases its second derivative must be learned as model parameters, so the number of model parameters grows. A very large amount of training data must therefore be collected for statistical learning, which raises the cost.

Next, the problem with prior art 3 is described. In the case-based template method, when the target tone of the synthesized speech changes, a large amount of speech in the target tone must be collected and the templates rebuilt. As with the method that synthesizes the F0 value time series from an HMM, this raises the cost. The first problem is thus one of cost.

Second problem
Next, the second problem is described. A generation process model such as that of prior art 2 presupposes a component that declines gradually. The F0 value of speech does not always decline gradually, however: in a questioning tone it rises toward the end of the utterance, and in emphatic speech it may stay level without declining. When the generation process model is used to synthesize speech in a tone other than the read-aloud tone, the structure of the model is mismatched with the characteristics of the speech, and the speech may not be represented correctly. The second problem is thus that speech in a tone other than the read-aloud tone may not be represented correctly.

The present invention relates to an F0 value time series generation apparatus that receives a text to which a boundary position and an accent type have been assigned for each accent phrase and in which a start time and an end time have been determined for each mora, and that generates the F0 value time series of speech. The apparatus comprises a prosodic event unit and an F0 value time series unit.

The prosodic event unit generates prosodic events from the accent type and the start and end times of each mora, using a prosodic event parameter table, and generates prosodic event parameters for each prosodic event. The F0 value time series unit generates an F0 value time series for each accent phrase using the prosodic event parameters and predetermined generation functions.

The prosodic event unit may comprise a prosodic event generation unit, a tone-specific prosodic event addition unit, and a prosodic event parameter generation unit. The prosodic event generation unit uses a prosodic event table to generate a plurality of prosodic events that are associated with designated positions of the accent phrase according to its accent type. The tone-specific prosodic event addition unit uses a tone-specific prosodic event table and, if the accent phrase meets an occurrence condition, adds the tone-specific prosodic event corresponding to that condition at the designated position of the accent phrase. The prosodic event parameter generation unit generates prosodic event parameters for each prosodic event using a prosodic event parameter database and information on the accent phrase.

The F0 value time series unit may comprise a delta function generation unit, an initial F0 value generation unit, and an F0 value time series generation unit. The delta function generation unit applies the prosodic event parameters to the generation function obtained from a generation function table for each prosodic event, and generates the sum of the generation functions of all prosodic events as the delta function of the accent phrase. The initial F0 value generation unit obtains an initial F0 value for each accent phrase using an initial F0 value parameter database and information on the accent phrase. The F0 value time series generation unit generates the F0 value time series for each accent phrase from the delta function and the initial F0 value.
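The data flow of the F0 value time series unit can be sketched as follows. The summation of per-event generation functions into a delta function follows the text. How the delta function and the initial F0 value are combined is not spelled out in this passage, so the sketch assumes that the delta function is the rate of change of F0 and accumulates it from the initial value; that reading, the sampling step `dt`, and all names are assumptions made for illustration.

```python
from typing import Callable, Sequence

Generator = Callable[[float], float]


def delta_function(generators: Sequence[Generator]) -> Generator:
    """Sum the generation functions of all prosodic events of one accent phrase."""
    return lambda t: sum(g(t) for g in generators)


def f0_series(delta: Generator, f0_init: float,
              t_start: float, t_end: float, dt: float = 0.01) -> list[float]:
    """Accumulate the delta function from the initial F0 value.

    ASSUMPTION: delta(t) is treated as the rate of change of F0, so the
    series is f0_init plus a running sum of delta(t) * dt.
    """
    n_steps = int(round((t_end - t_start) / dt))
    series, value = [], f0_init
    for k in range(n_steps + 1):
        series.append(value)
        value += delta(t_start + k * dt) * dt
    return series


# Two placeholder generation functions (constant rates, purely illustrative).
delta = delta_function([lambda t: 10.0, lambda t: -4.0])
series = f0_series(delta, f0_init=120.0, t_start=0.0, t_end=0.3, dt=0.1)
print(series)
```

Real generation functions would be the event-specific functions of equations (1) to (4) with their parameters applied; the constants here only exercise the plumbing.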

Furthermore, the plurality of prosodic events may be rising, descending, gently descending, and swell events.

Furthermore, the prosodic event parameter database may store normalized prosodic event parameters, in which case the generated normalized prosodic event parameters are converted according to information on the mora or on the accent phrase and output as prosodic event parameters.

The following explains how the above configuration solves the first and second problems.
First, the solution to the first problem is described. The F0 value time series of an accent phrase is expressed using only the position parameter, magnitude parameter, and duration parameter of each of the prosodic events predetermined for the accent phrase, together with one initial F0 value per accent phrase. For example, with the above configuration six prosodic events are generated for the accent phrase "Kanagawa-ken dewa" ("in Kanagawa Prefecture"). The F0 value time series of this accent phrase can therefore be expressed with 3 × 6 + 1 = 19 parameters.

In the F0 value time series generation method of prior art 1, the mean and variance of the F0 value, of its first derivative, and of its second derivative must be held for each phoneme, that is, six parameters per phoneme. For example, to generate the F0 value time series of the accent phrase "Kanagawa-ken dewa", six parameters must be held for each of the 15 phonemes in "KANAGAWAKENDEWA", so 90 parameters are needed.

When the number of parameters is small, as in the configuration of the present invention, the amount of training data needed to generate appropriate parameters decreases accordingly, which lowers the cost of generating the F0 value time series. The problem of prior art 1 can therefore be solved.
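The two parameter counts compared above can be reproduced with a few lines of arithmetic. The helper names are illustrative; the figures (three parameters per prosodic event plus one initial F0 value, and mean and variance for three features per phoneme) are taken from the text.

```python
PARAMS_PER_EVENT = 3      # position, magnitude, duration
FEATURES_PER_PHONEME = 3  # F0, first derivative, second derivative
STATS_PER_FEATURE = 2     # mean and variance


def event_model_param_count(n_events: int) -> int:
    """Parameters for one accent phrase: per-event parameters plus one initial F0."""
    return PARAMS_PER_EVENT * n_events + 1


def phoneme_model_param_count(n_phonemes: int) -> int:
    """Parameters for one accent phrase in the per-phoneme statistical model."""
    return n_phonemes * FEATURES_PER_PHONEME * STATS_PER_FEATURE


phonemes = list("KANAGAWAKENDEWA")  # one letter per phoneme, as counted in the text
print(event_model_param_count(6), phoneme_model_param_count(len(phonemes)))  # prints "19 90"
```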

Unlike prior art 3, the configuration of the present invention has no notion of templates, so templates never need to be rebuilt, and the problem of prior art 3 can be solved.

Next, the solution to the second problem is described. As stated above, prior art 2 could not generate an appropriate F0 value time series for a tone in which, for example, the F0 value rises at the end of the utterance to form a question. The configuration of the present invention, by contrast, uses prosodic events that produce local movements in the F0 value time series and uses no component that prescribes the movement of the utterance as a whole. To lower the F0 value at the end of an utterance, a "descending" prosodic event is used; to raise it, a "rising" prosodic event is used. F0 value time series can therefore be generated for synthesized speech resembling a variety of tones, and the configuration of the present invention solves the second problem.

The best mode for carrying out the invention is described below.

This embodiment assumes text as input. FIG. 1 shows an example of the functional configuration of Embodiment 1, and FIG. 2 is a flowchart of its main processing flow. In the following description, the input text is assumed to be the question "Sore dewa yoroshii desu ka" ("Well then, is that all right?").

First, the text "Sore dewa yoroshii desu ka", for which the F0 value time series is to be generated, is input from the text input unit 3-1 (step S2). The desired speed of the F0 value time series to be generated (hereinafter, the speech rate) is input from the speech rate input unit 3-2. In the following description, the speech rate is 0.2 seconds per mora.

The accent phrase dividing/assigning unit 2 assigns a boundary position to each accent phrase of the input text and assigns an accent type to each accent phrase (step S4). This processing is described in Non-Patent Document 3. For the text "Sore dewa yoroshii desu ka", a boundary is placed between the accent phrase "sore dewa" and the accent phrase "yoroshii desu ka". An accent type and a reading are then assigned to each of the two accent phrases. In "sore dewa", the third mora "de" is the accent nucleus, so the accent type is type 3. In "yoroshii desu ka", the third mora "shi" is the accent nucleus, so the accent type is also type 3. The accent type is assigned to each accent phrase as shown, for example, in FIG. 3. The accent phrase dividing/assigning unit 2 outputs its result in, for example, the format shown in FIG. 3, and the result is input to the mora dividing/assigning unit 4.

The mora dividing/assigning unit 4 divides the text into moras and assigns a start time and an end time to each mora (step S6). To simplify the description, the mora dividing/assigning unit 4 is assumed to give every mora the same length, 0.2 seconds per mora, which equals the speech rate. The method of mora division is not limited to this.

As shown in FIG. 4, "Sore dewa yoroshii desu ka" is divided into the moras "so", "re", "de", "wa", "yo", "ro", "shi", the long vowel "i", "de", "su", and "ka". If the start time of the first mora "so" is 0.11 seconds, then, since each mora lasts 0.2 seconds, the end time of "so" is 0.31 seconds. The next mora "re" starts at 0.31 seconds and ends at 0.51 seconds. Start and end times are assigned to all remaining moras in the same way, as shown in FIG. 4. The mora dividing/assigning unit 4 outputs its result in, for example, the format of FIG. 4.
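The equal-length timing described above can be sketched as follows, using the start time of 0.11 seconds and the speech rate of 0.2 seconds per mora from the text. The function name, the tuple layout, and the romanized mora labels are assumptions made for illustration.

```python
def assign_mora_times(moras, start=0.11, seconds_per_mora=0.2):
    """Give each mora an equal-length (start, end) interval, in seconds."""
    timed = []
    for i, mora in enumerate(moras):
        begin = round(start + i * seconds_per_mora, 2)
        end = round(start + (i + 1) * seconds_per_mora, 2)
        timed.append((mora, begin, end))
    return timed


# Romanized moras of the example sentence; "i" stands for the long-vowel mora.
MORAS = ["so", "re", "de", "wa", "yo", "ro", "shi", "i", "de", "su", "ka"]
timed = assign_mora_times(MORAS)
for mora, begin, end in timed:
    print(f"{mora:>3}  {begin:.2f}  {end:.2f}")
```

With these values the last mora "ka" starts at 2.11 seconds, which matches the start time used later when the tone-specific rising event is added.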

For a different input text, "Kyou wa yoku harete, kimochi no yoi ichinichi desu." ("It is a fine, sunny, pleasant day today."), the moras, the start and end times assigned to each mora, the accent phrases, and the accent type assigned to each accent phrase are as shown in FIG. 5.

The input text, now annotated with the boundary position and accent type of each accent phrase and the start and end times of each mora, is input to the prosodic event generation unit 12. Using the prosodic event table, the prosodic event generation unit 12 generates a plurality of prosodic events according to the accent type and associates them with the designated positions of the accent phrase (step S8).

Here, a prosodic event is a command that produces a local movement in the F0 value time series, for example one of four kinds: a sharp rise, a sharp fall, a gentle fall, or a swell. The prosodic event table is stored in the prosodic event table storage unit 28. An example of the prosodic event table is shown in FIG. 6. For example, if the accent phrase is type 0, the prosodic events corresponding to prosodic event IDs 0 to 2, namely a descending event, a rising event, and a gently descending event, are assigned to the accent phrase. The descending event is assigned at the start time of the first mora of the accent phrase, the rising event at the end time of the first mora, and the gently descending event at the end time of the accent phrase, that is, the end time of the last mora. If the accent phrase is type 1 or type n, prosodic events are likewise assigned at the generation positions shown in FIG. 6.
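For the type 0 case, the table lookup described above can be sketched as follows. Only the type 0 rows are spelled out in the text, so only those are encoded; the rows for type 1 and type n are in FIG. 6, which is not reproduced here. The class and function names, and the pairing of each ID (0, 1, 2) with one event kind in the order listed, are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProsodicEvent:
    event_id: int  # ID from the prosodic event table (assumed ordering)
    kind: str      # "descending", "rising", or "gently descending"
    time: float    # generation position in seconds


def type0_events(timed_moras):
    """Generate the three events of a type 0 accent phrase.

    timed_moras is a list of (mora, start, end) tuples for one accent phrase.
    """
    first_start, first_end = timed_moras[0][1], timed_moras[0][2]
    phrase_end = timed_moras[-1][2]  # end time of the last mora
    return [
        ProsodicEvent(0, "descending", first_start),
        ProsodicEvent(1, "rising", first_end),
        ProsodicEvent(2, "gently descending", phrase_end),
    ]


# Hypothetical three-mora type 0 phrase with 0.2 s per mora.
example_phrase = [("a", 0.0, 0.2), ("b", 0.2, 0.4), ("c", 0.4, 0.6)]
events = type0_events(example_phrase)
for event in events:
    print(event)
```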

Specifically, since the accent type of the accent phrase "sore dewa" is type 3, the prosodic events corresponding to prosodic event IDs 6 to 11 are assigned. A descending event is assigned at 0.11 seconds, the start time of the first mora "so", and a rising event is assigned at 0.31 seconds, the end time of the first mora "so". In this way, the prosodic event generation unit 12 generates a plurality of prosodic events at the designated positions of each accent phrase. FIG. 7A shows the prosodic events assigned to the accent phrase "sore dewa".

A plurality of prosodic events are likewise assigned to the accent phrase "yoroshii desu ka", as shown in FIG. 7B. The prosodic event generation unit 12 outputs its result in, for example, the format of FIGS. 7A and 7B, and the result is input to the tone-specific prosodic event addition unit 13.

Using the tone-specific prosodic event table, the tone-specific prosodic event addition unit 13 checks whether an accent phrase meets an occurrence condition and, if it does, adds the tone-specific prosodic event corresponding to that condition at the designated position of the accent phrase (step S10). The tone-specific prosodic event table is stored in the tone-specific prosodic event table storage unit 30.

FIG. 8 is an example of the tone-specific prosodic event table. For example, if an accent phrase meets the occurrence condition "the particle 'ka' is at the end of the utterance", that is, if the last mora of the accent phrase is the particle "ka", a rising event with prosodic event ID 100 is added at the start time of "ka".

With the tone-specific prosodic event table of FIG. 8, the accent phrase "sore dewa" does not meet the occurrence condition, but the accent phrase "yoroshii desu ka" meets the condition "the particle 'ka' is at the end of the utterance". A rising event (a tone-specific prosodic event) is therefore added at 2.11 seconds, the start time of "ka". An example of the result is shown in FIG. 9. The prosodic event IDs shown in FIGS. 6 and 8 are labels used for convenience in describing this embodiment and are not necessarily required when carrying out the invention.
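The addition step for the single occurrence condition described above can be sketched as follows. Deciding whether the final "ka" is actually a particle requires linguistic analysis that is outside this passage, so the sketch takes it as a boolean input. The function name, the tuple layout, and the base events shown (only the first two events of the phrase, with assumed IDs) are illustrative assumptions.

```python
RISING_EVENT_ID = 100  # ID of the tone-specific rising event in the example table


def add_tone_events(events, timed_moras, ends_with_particle_ka):
    """Return the events plus a rising event at the start of the final "ka".

    events: list of (event_id, kind, time) tuples for one accent phrase.
    timed_moras: list of (mora, start, end) tuples for the same phrase.
    ends_with_particle_ka: True if the occurrence condition is met.
    """
    if not ends_with_particle_ka:
        return list(events)
    ka_start = timed_moras[-1][1]  # start time of the last mora
    return list(events) + [(RISING_EVENT_ID, "rising", ka_start)]


# Accent phrase "yo-ro-shi-i-de-su-ka" with 0.2 s per mora, starting at 0.91 s.
phrase = [("yo", 0.91, 1.11), ("ro", 1.11, 1.31), ("shi", 1.31, 1.51),
          ("i", 1.51, 1.71), ("de", 1.71, 1.91), ("su", 1.91, 2.11),
          ("ka", 2.11, 2.31)]
# Only the first two base events are shown; their IDs are assumed.
base_events = [(6, "descending", 0.91), (7, "rising", 1.11)]
result = add_tone_events(base_events, phrase, ends_with_particle_ka=True)
print(result[-1])
```

A fuller version would iterate over all rows of the tone-specific table, each pairing an occurrence condition with an event and a generation position.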

Thus, the prosodic event generation unit 12 and the tone-specific prosodic event addition unit 13 use no global event that is presumed to affect the entire utterance, unlike the generation process model of prior art 2. In the tone-specific prosodic event addition unit 13, the presence of the particle "ka" at the end of the utterance is taken to mean that the accent phrase is in a questioning tone, so F0 rises at "ka". An accurate F0 value time series is therefore generated even for such a questioning tone. Even when speech is synthesized in other tones that differ from the read-aloud tone, for example an overly familiar tone, the speech can be represented correctly depending on the settings of the tone-specific prosodic event table, and the second problem can be solved. The tone-specific prosodic event addition unit 13 outputs its result in, for example, the format shown in FIG. 9, and the result is input to the prosodic event parameter generation unit 14.

The prosodic event parameter generation unit 14 generates prosodic event parameters for each prosodic event, using the prosodic event parameter database and the phonetic and linguistic context at the position with which the prosodic event is associated (step S14). The prosodic event parameter database is stored in the prosodic event parameter database storage unit 24. The prosodic event parameters are described first.

Each prosodic event generated by the prosodic event generation unit 12 and the tone-specific prosodic event addition unit 13 is associated with a generation function according to its kind. As shown in the generation function table of FIG. 10, described later, a rising event is associated with, for example, the generation function of equation (1).

A descending event is associated with equation (2).

A gently descending event is associated with the following equation (3).
$$-A\,(m-t)\,\sigma^{2}\exp\bigl(-\sigma\,(m-t)\bigr) \tag{3}$$

A swell event is associated with equation (4).

In equations (1) to (4), t is time and A, m, and σ are the prosodic event parameters that must be generated. A is the amplitude parameter, which gives the amplitude of the function and corresponds to the size of the intonation movement caused by the prosodic event. m is the position parameter, which gives the position of the generation function: it expresses how far from the generation position associated with the prosodic event the change in F0 value actually occurs. σ is the duration parameter, which expresses how long the change in F0 value caused by the prosodic event takes.
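Equation (3), the only generation function whose form survives in this text, can be evaluated directly once A, m, and σ are given. The sketch below does so. The text does not state the domain of the function; because the expression grows without bound for t > m, the sketch assumes it is zero after m, which is an assumption and not something the source confirms. The function name is illustrative.

```python
import math


def gentle_descent(t: float, A: float, m: float, sigma: float) -> float:
    """Evaluate equation (3): -A (m - t) sigma^2 exp(-sigma (m - t)).

    ASSUMPTION: the function is taken to be zero for t > m, since the source
    does not give its domain and the expression diverges after m.
    """
    if t > m:
        return 0.0
    return -A * (m - t) * sigma ** 2 * math.exp(-sigma * (m - t))


# With A = 2 and sigma = 4, the most negative value is reached at
# t = m - 1/sigma and equals -A * sigma / e.
A, m, sigma = 2.0, 2.0, 4.0
trough = gentle_descent(m - 1.0 / sigma, A, m, sigma)
print(trough, -A * sigma / math.e)
```

Under this reading, a larger σ concentrates the descent closer to m, and A scales its depth, consistent with the roles of the duration and amplitude parameters described above.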

Different prosodic event parameter values may be generated for prosodic events of the same kind. For example, when the input text is "Kyou wa yoku harete, kimochi no yoi ichinichi desu" as shown in FIG. 5, the accent phrase "kyou wa" and the accent phrase "harete" both have accent type 1. The prosodic event generation unit 12 therefore assigns the events with prosodic event IDs 3, 4, and 5 of FIG. 6 to both. In general, however, the sentence-initial accent phrase "kyou wa" and the mid-sentence accent phrase "harete" are not uttered with exactly the same intonation. For example, the intonation of the mid-sentence accent phrase "harete" is smaller, so it may be appropriate to reduce the value of its amplitude parameter A.

It has also been observed that the time at which the F0 value begins to fall differs depending on whether the accent nucleus is followed by the moraic nasal "n". In such cases, the position parameter m of the prosodic event must be generated appropriately for the situation. Thus, even for prosodic events of the same type, the generated prosodic event parameters may differ because the situation differs: whether the phrase is at the beginning or in the middle of a sentence, whether a moraic nasal follows the accent nucleus, and so on.

Prosodic event parameters must therefore be generated appropriately for each prosodic event according to these various situations. At the same time, because text-to-speech applications must be able to produce synthesized speech for any text, the prosodic event parameter generation unit 14 must be able to generate appropriate prosodic event parameters in every situation.

One conceivable way to generate appropriate prosodic event parameters across such varied situations is to build the prosodic event parameter database using, for example, decision-tree-based context clustering for each prosodic event.

The phonetic and linguistic situation at the location associated with a prosodic event can be, for example, the situation of the accent phrase. The following therefore describes how prosodic event parameters are generated using the prosodic event parameter database and the situation of the accent phrase.

FIG. 11 shows an example of a decision tree that constitutes the prosodic event parameter database for rising events. As FIG. 11 shows, the decision tree is, for example, a binary tree in which each node is assigned a question that can be answered YES or NO. The tree is followed to the right child node if the answer to the question about the situation of the generated prosodic event is YES, and to the left child node if it is NO; whatever the situation in which the prosodic event was generated, a leaf is eventually reached. Each leaf (terminal node) specifies the prosodic event parameters A, p, and q. The position parameter p and the duration parameter q are normalized values of the position parameter m and the duration parameter σ, respectively (hereinafter, the normalized position parameter p and the normalized duration parameter q). The normalization is described later.
With the prosodic event parameter database structured as such a decision tree, suitable prosodic event parameters can be generated for a prosodic event in any situation.

Next, the concrete flow of prosodic event parameter generation is described with reference to FIG. 11, using the rising event with reference number 8201 in FIG. 9B (hereinafter, prosodic event 8201). The accent phrase to which this rising event is attached is "yoroshii desu ka" ("Are you sure?").

First, for the accent phrase "yoroshii desu ka", the question at node 601, the root node, is examined: "Is this the sentence-initial phrase?" The accent phrase is not the sentence-initial phrase but the second phrase, so the answer is NO. The path labeled NO is followed to node 602.

Next, the question at node 602 is examined: "Is the current accent type type 1?" The accent type of the current accent phrase "yoroshii desu ka" is type 3, so the answer is NO. The path labeled NO is followed to node 603.

Next, the question at node 603 is examined: "Is the accent type of the immediately preceding phrase type 0?" The preceding accent phrase is "sore de wa" ("well then"), whose accent type is type 3, so the answer is NO. The path labeled NO is followed to node 604. Node 604 is a leaf node: it has no question and instead records values of the amplitude parameter A, the normalized position parameter p, and the normalized duration parameter q. The prosodic event parameters of prosodic event 8201 are therefore generated as A = 2.2, p = −0.2, and q = 0.1.
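The traversal just described can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the class names, the `context` keys, and the `lookup` helper are hypothetical, and because only the NO path (601, 602, 603, 604) is described in the text, the YES subtrees are left unspecified.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Union

@dataclass
class Leaf:
    A: float  # amplitude parameter
    p: float  # normalized position parameter
    q: float  # normalized duration parameter

@dataclass
class Node:
    question: Callable[[dict], bool]      # YES/NO question about the context
    yes: Optional[Union["Node", Leaf]]    # child followed when the answer is YES
    no: Optional[Union["Node", Leaf]]     # child followed when the answer is NO

def lookup(tree, context):
    """Follow YES/NO answers from the root until a leaf is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(context) else node.no
        if node is None:
            raise LookupError("branch not specified in this sketch")
    return node

# Only the NO path is known from the description; YES subtrees are placeholders.
node_603 = Node(lambda c: c["prev_accent_type"] == 0, yes=None, no=Leaf(2.2, -0.2, 0.1))
node_602 = Node(lambda c: c["accent_type"] == 1, yes=None, no=node_603)
node_601 = Node(lambda c: c["sentence_initial"], yes=None, no=node_602)

# Context of the accent phrase "yoroshii desu ka" (hypothetical key names).
context = {"sentence_initial": False, "accent_type": 3, "prev_accent_type": 3}
leaf = lookup(node_601, context)
print(leaf.A, leaf.p, leaf.q)
```

A separate tree of the same form would be looked up for each type of prosodic event.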

FIG. 11 shows the decision tree for rising events; similar decision trees are prepared for falling events, gentle falling events, and climax events. For every prosodic event, the prosodic event parameters are then generated by the procedure above, using the decision tree that corresponds to the type of the event.

As shown in FIG. 12, the accent phrase "yoroshii desu ka" is then represented, for each prosodic event, by a set of four values: the generation location associated with the prosodic event, the amplitude parameter A, the normalized position parameter p, and the normalized duration parameter q. The prosodic event parameter generation unit 14 outputs these, for example, in the table format shown in FIG. 12, and they are input to the prosodic event parameter conversion unit 22.

FIG. 11 shows a decision tree that determines the amplitude parameter A, the normalized position parameter p, and the normalized duration parameter q together, but a separate decision tree could instead be built for each type of parameter. A different decision tree could also be used for each prosodic event ID shown in FIG. 6 and FIG. 7. In the example of FIG. 11, the questions concern the accent type of the accent phrase and dependency relations, but questions can be framed from many other viewpoints, such as the position of the accent phrase in the input text, the morphological or phonological information of the words before and after the location where the prosodic event is generated, or the sum of the amplitudes of the prosodic events generated before the prosodic event whose parameters are being generated.

The prosodic event parameter conversion unit 22 converts the normalized prosodic event parameters generated by the prosodic event parameter generation unit 14 (the normalized position parameter p and the normalized duration parameter q) into prosodic event parameters according to mora information or accent phrase information (step S14). Specifically, the normalized position parameter p and the normalized duration parameter q are converted into the position parameter m and the duration parameter σ, respectively. The following describes the case in which the conversion is based on mora information.

The normalization of the position parameter m and the duration parameter σ is as follows. The normalized position parameter p and the normalized duration parameter q described above are expressed in units of the average mora length of the accent phrase that contains the prosodic event. For example, the accent phrase "yoroshii desu ka" contains seven moras. Referring to FIG. 4 and related figures, the start time of its first mora "yo" is 0.91 seconds and the end time of its last mora "ka" is 2.31 seconds. The duration of the accent phrase is therefore 1.4 seconds, and the average mora length over the whole phrase is 0.2 seconds per mora. The generation location of rising event 902 in FIG. 12 is 1.11 seconds, and its normalized position parameter p is −0.2. This means a shift of −0.2 moras from 1.11 seconds, the end time of the first mora "yo": −0.2 (normalized position parameter of rising event 902) × 0.2 (average mora length) = −0.04 seconds. The value of the position parameter m is therefore 1.07 seconds, which is 0.04 seconds earlier.
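The conversion from normalized parameters to seconds can be sketched as below. The function name `denormalize` and its argument names are assumptions made for illustration; the numbers are the ones given for the accent phrase and rising event 902, and the same scaling by the average mora length is applied to q.

```python
def denormalize(p, q, anchor_time, phrase_start, phrase_end, n_moras):
    """Convert normalized (p, q), in average mora lengths, to (m, sigma) in seconds."""
    avg_mora = (phrase_end - phrase_start) / n_moras  # seconds per mora
    m = anchor_time + p * avg_mora                    # shift from the generation location
    sigma = q * avg_mora                              # duration scaled the same way
    return m, sigma, avg_mora

m, sigma, avg_mora = denormalize(
    p=-0.2, q=0.1,
    anchor_time=1.11,                  # generation location of rising event 902
    phrase_start=0.91, phrase_end=2.31,
    n_moras=7,
)
print(round(avg_mora, 3), round(m, 3))  # prints: 0.2 1.07
```

Because p and q are stored in mora units, the same table entries remain usable when the phrase is spoken faster or slower; only `avg_mora` changes.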

Similarly, the normalized duration parameter q is 0.1. Multiplying it by the average mora length of 0.2 seconds gives 0.02 seconds as the value of the duration parameter σ. The normalized position parameter p and the normalized duration parameter q of the other prosodic events are converted in the same way, producing, for example, a table such as that shown in FIG. 13, which is output from the prosodic event parameter conversion unit 22 and input to the delta function generation unit 16.

The unit of the normalized position parameter p and the normalized duration parameter q is not limited to the average mora length; units such as seconds or milliseconds can also be used directly. With absolute units such as seconds or milliseconds, however, the values of the prosodic event parameters are strongly affected by the speech rate. To generate synthesized speech at a faster or slower speech rate than usual, decision trees for position and duration parameters matched to the desired speech rate would then be needed, and the prosodic event parameter database would have to hold many decision trees, one set per speech rate. This would increase the cost of building the prosodic event parameter database and require the prosodic event parameter database storage unit 24 to store an enormous amount of data. The example of FIG. 12 therefore uses the average mora length as the unit, so that positions and durations are expressed stably regardless of speech rate.

For each prosodic event, the delta function generation unit 16 applies the prosodic event parameters A, m, and σ to a predetermined generation function, and computes the sum of the generation functions for all prosodic events to generate the delta function FD(t) of the F0 value time series in the accent phrase (step S16). The predetermined functions can be, for example, equations (1) to (4) above, but are not limited to them. In the following description, the predetermined generation functions are taken to be equations (1) to (4). The generation function table is stored in the generation function table storage unit 32, and FIG. 10 described above is an example of the generation function table.

As shown in FIG. 10, each generation function and its general shape are associated with a type of prosodic event. For example, the generation function for a falling event is equation (2) above. The delta function generation unit 16 consists of a generation function generation unit 162 and an addition unit 164.

First, the generation function generation unit 162 generates the generation function corresponding to each prosodic event contained in the input. The resulting generation functions are shown in FIG. 14. Reference numbers 1001 to 1007 of the generation functions in FIG. 14 correspond to reference numbers 901 to 907 of the prosodic events in FIG. 13, respectively. The generated generation functions 1001 to 1007 are all input to the addition unit 164.

The addition unit 164 obtains the delta function FD(t) by summing the generation functions 1001 to 1007 at each time from the start time to the end time of the input accent phrase. An example of the delta function FD(t) is shown in FIG. 15. The delta function FD(t) is the derivative of the F0 value time series, that is, a function that indicates the increase and decrease of the F0 value time series. In this way, a delta function FD(t) is generated for each accent phrase.
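The summation performed by the addition unit can be sketched as follows. Equations (1) to (4) are not reproduced in this text, so the sketch substitutes a Gaussian bump whose area equals A; that shape, the event values, and the sampling step are assumptions for illustration only, not the patent's generation functions.

```python
import math

def rising(t, A, m, sigma):
    # ASSUMED shape (not the patent's equation (1)): Gaussian bump with area A.
    return A * math.exp(-0.5 * ((t - m) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def falling(t, A, m, sigma):
    # The description states that the falling-event function is the rising one negated.
    return -rising(t, A, m, sigma)

def delta_function(events, times):
    """Return FD(t) sampled at `times`: the sum of all generation functions."""
    return [sum(fn(t, A, m, sigma) for fn, A, m, sigma in events) for t in times]

# Hypothetical events for one accent phrase: (generation function, A, m, sigma).
events = [(rising, 0.30, 1.07, 0.05), (falling, 0.20, 1.60, 0.08)]

start, end, step = 0.91, 2.31, 0.01   # phrase start and end from the text; step is assumed
n_samples = int(round((end - start) / step)) + 1
times = [start + i * step for i in range(n_samples)]
fd = delta_function(events, times)    # positive where F0 rises, negative where it falls
```

Each event contributes only near its own position m, so FD(t) stays close to zero between events.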

Note that, among the generation functions in the generation function table of FIG. 10, the generation function of the falling event (equation (2) above) and that of the climax event (equation (4) above) can be expressed in terms of the generation function of the rising event (equation (1) above), as follows. The generation function of the falling event is the generation function of the rising event with its sign reversed.

For the generation function of the climax event, the durations of the rising-event and falling-event generation functions are first halved, that is, σ is replaced with σ/2. The rising-event generation function with the halved duration is then shifted by σ/2 in the negative direction, that is, m is replaced with m − σ/2, and the falling-event generation function with the halved duration is shifted by σ/2 in the positive direction, that is, m is replaced with m + σ/2. Adding the resulting rising-event and falling-event generation functions gives the generation function of the climax event. All the generation functions can therefore be expressed using only the generation function of the rising event (equation (1) above) and the generation function of the gentle falling event (equation (3) above).
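The composition described above can be sketched as follows. As before, the Gaussian used for the rising event is an assumed stand-in for equation (1), which is not reproduced here; only the substitution of σ/2 and m ∓ σ/2 is taken from the description.

```python
import math

def rising(t, A, m, sigma):
    # ASSUMED shape standing in for equation (1): Gaussian bump with area A.
    return A * math.exp(-0.5 * ((t - m) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def falling(t, A, m, sigma):
    # Falling event: the rising-event function with its sign reversed.
    return -rising(t, A, m, sigma)

def climax(t, A, m, sigma):
    # Halve the duration, shift the rise earlier and the fall later by sigma/2, then add.
    half = sigma / 2.0
    return rising(t, A, m - half, half) + falling(t, A, m + half, half)

# Illustrative values only: F0 rises just before m and falls just after it.
A, m, sigma = 1.0, 1.0, 0.1
print(climax(0.95, A, m, sigma) > 0, climax(1.05, A, m, sigma) < 0)
```

With any symmetric rising shape, the two halves cancel at t = m and their areas cancel overall, so the event lifts F0 temporarily and returns it to its previous level.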

The initial F0 value generation unit 18 obtains an initial F0 value for each accent phrase using the initial F0 value parameter database and the accent phrase information (step S18).

The initial F0 value parameter database is stored in the initial F0 value parameter database storage unit 26. Like the prosodic event parameter database described above, it can be structured, for example, as a binary tree. FIG. 16 shows a configuration example of the initial F0 value parameter database.

Taking the accent phrase "yoroshii desu ka" as a concrete example, the question at node 1201, the root node, is examined first: "Is the accent type of the current accent phrase type 1?" The accent type of "yoroshii desu ka" is type 3, so the answer is NO, and the path labeled NO is followed to node 1202. Next, the question at node 1202 is examined: "Is the accent type of the current phrase type 0?" The answer is NO, so the path labeled NO is followed to node 1204. Next, the question at node 1204 is examined: "Is the phrase sentence-initial?" The accent phrase "yoroshii desu ka" is not the sentence-initial phrase, so the answer is NO, and the path labeled NO is followed to node 1207. Node 1207 is a leaf node: it has no question and instead records an initial F0 value. The initial F0 value of "yoroshii desu ka" is therefore determined to be 5.2. The structure of the initial F0 value parameter database is not limited to a binary tree; various structures are possible. In this way, the initial F0 value generation unit 18 obtains an initial F0 value for each accent phrase and inputs it to the F0 value time series generation unit 20. As shown in FIG. 17, the initial F0 value generation unit 18 may also output a combination of the table shown in FIG. 13, the delta function shown in FIG. 15, and the initial F0 value.

The F0 value time series generation unit 20 generates an F0 value time series for each accent phrase from the per-phrase delta function supplied by the delta function generation unit 16 and the per-phrase initial F0 value supplied by the initial F0 value generation unit 18 (step S20).

Specifically, for example, the F0 value time series F(t) for each accent phrase is generated by adding the initial F0 value to the integral of the delta function FD(t), where t is any time between the start time and the end time. That is, the F0 value time series F(t) is generated by the following equation (5).
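Equation (5) itself is not reproduced in this text. From the description (the initial F0 value added to the integral of the delta function, taken from the start time of the accent phrase up to t), it presumably has a form like the following. The symbols F_init for the initial F0 value, t_2 for the end time, and τ for the integration variable are notation assumed here, not taken from the original.

```latex
F(t) = F_{\mathrm{init}} + \int_{t_1}^{t} FD(\tau)\, d\tau , \qquad t_1 \le t \le t_2
```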

Here, t1 denotes the start time of the accent phrase. The integral on the right-hand side reflects the fact that, as described above, the delta function FD(t) is the derivative of the F0 value time series, so integrating FD(t) yields the F0 value time series. FIG. 18 shows the result of computing equation (5) in the F0 value time series generation unit 20, that is, the F0 value time series generated for the accent phrase "yoroshii desu ka".

As described above, for the accent phrase "yoroshii desu ka", for example, an initial F0 value is generated together with three prosodic event parameters for each of seven prosodic events. The F0 value time series can therefore be expressed with a total of only 22 prosodic parameters. In contrast, prior art 1 requires six parameters for each of the 13 phonemes of "YOROSIIDESUKA", that is, 78 parameters. This embodiment can thus generate the F0 value time series with fewer parameters, which lowers cost and solves the first problem described above.

Furthermore, the accent type of the accent phrase "yoroshii desu ka" is type 3, and in addition to the F0 movement corresponding to that accent type, a rise in F0 corresponding to the final "ka" is realized. Because F0 value time series can thus be generated for various tones of voice, not only the interrogative tone, the second problem described above is also solved.

In Embodiment 2, to simplify processing, the prosodic event generation unit 12, the tone-specific prosodic event addition unit 13, the prosodic event parameter generation unit 14, and the prosodic event parameter conversion unit 22 described in Embodiment 1 are integrated into a prosodic event unit 54, with which the F0 value time series generation device 52 operates.

FIG. 19 shows a functional configuration example of Embodiment 2. Using the prosodic event parameter table, the prosodic event unit 54 generates prosodic events from the accent type and the start and end times of each mora, adds tone-specific prosodic events, and generates prosodic event parameters for each prosodic event and each tone-specific prosodic event.

The prosodic event parameter table is stored in the prosodic event parameter table storage unit 29. FIG. 20 shows the prosodic event parameter table. The prosodic event parameter table is obtained, for example, by merging the prosodic event table of FIG. 6 with the tone-specific prosodic event table of FIG. 8 and adding the corresponding prosodic event parameters to each prosodic event and each tone-specific prosodic event. The prosodic event unit does not use the prosodic event table of FIG. 6, the tone-specific prosodic event table of FIG. 8, or the prosodic event parameter database of FIG. 11.

First, text in which the boundary position and accent type of each accent phrase have been assigned and the start and end times of each mora have been determined is input to the prosodic event unit 54. Using the prosodic event parameter table, the prosodic event unit 54 generates prosodic events according to the accent type of each accent phrase and, at the same time, obtains the amplitude parameter A, position parameter m, and duration parameter σ corresponding to each prosodic event. The subsequent processing by the delta function generation unit 16 and the other units is the same as in Embodiment 1, so its description is omitted.

Embodiment 2 can be implemented at lower cost than Embodiment 1.

In Embodiment 3, the delta function generation unit 16, the initial F0 value generation unit 18, and the F0 value time series generation unit 20 described in Embodiment 1 are integrated into an F0 value time series unit 58, with which the prosodic event parameter generation device 56 performs its processing. FIG. 21 shows a functional configuration example of Embodiment 3.

The F0 value time series unit 58 generates an F0 value time series for each accent phrase using the prosodic event parameters and predetermined generation functions. Examples of the predetermined generation functions are the generation function of the rising event (equation (1) above) and the generation function of the gentle falling event (equation (3) above). As described above, the generation functions of the falling event and the climax event can be derived from the generation function of the rising event.

First, the F0 value time series unit 58 obtains the generation function corresponding to each prosodic event, and the prosodic event parameters obtained by the prosodic event parameter conversion unit 22 are applied to these generation functions. The initial F0 value is obtained, for example, by the method described in Embodiment 1, and the F0 value time series is obtained from the generation functions and the initial F0 value.

In Embodiment 1, the F0 value time series was obtained by adding the generation functions and then integrating. In Embodiment 3, the F0 value time series can instead be obtained, for example, by integrating each generation function and then adding the results. Embodiment 3 is useful in that the objective is achieved even when the processing does not follow the order described in Embodiment 1.
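Because integration is linear, the two orders of processing give the same F0 value time series. The sketch below checks this numerically under stated assumptions: the Gaussian generation function is a stand-in for equation (1), which is not reproduced here, and the event values are hypothetical; only the initial value 5.2 and the phrase start time 0.91 seconds come from the worked example.

```python
import math

def bump(t, A, m, sigma):
    # ASSUMED generation function: Gaussian with area A (stand-in for equation (1)).
    return A * math.exp(-0.5 * ((t - m) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def bump_integral(t, t1, A, m, sigma):
    # Closed-form integral of `bump` from t1 to t, via the error function.
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - m) / (sigma * math.sqrt(2.0))))
    return A * (cdf(t) - cdf(t1))

# Hypothetical events: (sign, A, m, sigma); sign -1 marks a falling event.
EVENTS = [(+1, 0.30, 1.07, 0.05), (-1, 0.20, 1.60, 0.08)]
F0_INIT = 5.2   # initial (logarithmic) F0 value from the worked example
T1 = 0.91       # start time of the accent phrase

def f0_sum_then_integrate(t, n=2000):
    # Embodiment 1 order: add the generation functions, then integrate (trapezoidal rule).
    def fd(x):
        return sum(s * bump(x, A, m, sg) for s, A, m, sg in EVENTS)
    h = (t - T1) / n
    area = h * (0.5 * fd(T1) + sum(fd(T1 + i * h) for i in range(1, n)) + 0.5 * fd(t))
    return F0_INIT + area

def f0_integrate_then_sum(t):
    # Embodiment 3 order: integrate each generation function, then add the results.
    return F0_INIT + sum(s * bump_integral(t, T1, A, m, sg) for s, A, m, sg in EVENTS)

for t in (1.0, 1.5, 2.31):
    print(t, round(f0_sum_then_integrate(t), 5), round(f0_integrate_then_sum(t), 5))
```

When each generation function has a closed-form integral, as the assumed Gaussian does, the second order avoids numerical integration altogether.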

In Embodiment 4, the F0 value time series generation device 60 consists of the prosodic event unit 54 described in Embodiment 2 and the F0 value time series unit 58 described in Embodiment 3. FIG. 22 shows a functional configuration example of Embodiment 4. The processing is as described in Embodiments 2 and 3, so its description is omitted.

In the F0 value time series generation processes described above, a time series of the logarithm of the F0 value is generated first, and the F0 value time series is then synthesized using an exponential function. Accordingly, the prosodic event parameters of the generation functions and the initial F0 values given as examples are values in the logarithmic F0 domain. This reflects the finding that changes in F0 in the logarithmic domain correspond well to perceived changes. Of course, when linear F0 values are used instead of logarithmic ones, the F0 value time series can be generated directly by the same processing, provided that the values in the prosodic event parameter database and the initial F0 value parameter database are linear F0 values.
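The final conversion from the logarithmic domain can be sketched as below. The description does not state the base of the logarithm or the unit of F0; the sketch assumes the natural logarithm and hertz, and the function name `log_to_linear` is hypothetical.

```python
import math

def log_to_linear(log_f0_series):
    # Assumes natural-logarithm F0 values; the description does not state the base.
    return [math.exp(v) for v in log_f0_series]

log_f0 = [5.2, 5.3, 5.4]   # 5.2 is the initial F0 value from the worked example
f0 = log_to_linear(log_f0)
print([round(v, 1) for v in f0])

# Equal steps in the log domain correspond to equal frequency ratios.
print(round(f0[1] / f0[0], 4), round(f0[2] / f0[1], 4))
```

Under these assumptions the initial value 5.2 corresponds to roughly 181 Hz, and each step of 0.1 in the log domain multiplies the frequency by the same factor, which is consistent with the stated motivation that log-domain changes track perceived changes.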

The F0 value time series generation device of the present invention is not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the invention. The processes described for the F0 value time series generation device need not be executed sequentially in the order described; they may be executed in parallel or individually, according to the processing capability of the device that executes them or as needed.

When the processing of the F0 value time series generation device of this invention is implemented by a computer, the processing content of the functions that the device should have is described by a program. Executing this program on a computer realizes the processing functions of the F0 value time series generation device on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first, for example, temporarily stores the program recorded on the portable recording medium, or the program transferred from the server computer, in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program.

As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it. The computer may also execute processing according to the received program sequentially, each time the program is transferred to it from the server computer. The processing described above may further be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized only through execution instructions and acquisition of results. The program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (for example, data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the F0 value time-series generation device is configured by executing a predetermined program on a computer, but at least part of these processing contents may instead be implemented in hardware.

A block diagram showing an example functional configuration of Example 1 of the present invention.
A flowchart showing the main processing flow of Example 1 of the present invention.
A diagram showing an example output of the accent phrase division/assignment unit 2.
A diagram showing an example output of the mora division/assignment unit 4.
A diagram showing another example output of the mora division/assignment unit 4.
A diagram showing an example of the prosodic event table.
A diagram showing an example output of the prosodic event generation unit 12.
A diagram showing an example of the tone-specific prosodic event table.
A diagram showing an example output of the tone-specific prosodic event addition unit 13.
A diagram showing an example of the generation function table.
A diagram showing an example configuration of the prosodic event parameter database.
A diagram showing an example output of the prosodic event parameter generation unit 14.
A diagram showing an example output of the prosodic event parameter conversion unit 22.
A diagram showing an example output of the generation function generation unit 162.
A diagram showing an example output of the addition unit 164.
A diagram showing an example configuration of the initial F0 value parameter database.
A diagram showing an example output of the initial F0 value generation unit 18.
A diagram showing an example output of the F0 value time-series generation unit 20.
A diagram showing an example functional configuration of Example 2 of the present invention.
A diagram showing an example of the prosodic event parameter table.
A diagram showing an example functional configuration of Example 3 of the present invention.
A diagram showing an example functional configuration of Example 4 of the present invention.

Claims

An F0 value time-series generation device that receives, as input, text in which a boundary position and an accent type are assigned to each accent phrase and a start time and an end time are determined for each mora, and that generates an F0 value time series of speech, the device comprising:
a prosodic event section that generates prosodic events from the accent type and the start time and end time of each mora using a prosodic event parameter table, and generates a prosodic event parameter for each prosodic event; and
an F0 value time-series section that generates an F0 value time series for each accent phrase using the prosodic event parameters and a predetermined generation function,
wherein the prosodic events are ascending, descending, gentle descent, and excitement,
the generation function is, for each of the prosodic events,

(expression not reproduced in this text)

and the prosodic event parameters are A, σ, and m.

The F0 value time-series generation device according to claim 1,
wherein the F0 value time-series section comprises:
a delta function generation unit that, for each prosodic event, applies the prosodic event parameters to the generation function obtained from a generation function table, and generates the sum of the generation functions corresponding to all prosodic events as a delta function of the F0 value time series in the accent phrase;
an initial F0 value generation unit that obtains an initial F0 value for each accent phrase using an initial F0 value parameter database and accent phrase information; and
an F0 value time-series generation unit that generates the F0 value time series for each accent phrase from the delta function and the initial F0 value.
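
The structure recited in this claim can be sketched roughly as follows: one generation function per prosodic event is summed into a delta function, which is then combined with the initial F0 value of the accent phrase. The actual functional form is not reproduced in this text, so a Gaussian-shaped bump in the parameters A, σ, and m is assumed here purely for illustration. The additive combination with the initial F0 value, all function names, and all numeric values are likewise hypothetical.

```python
import math

def generation_function(t, A, sigma, m):
    """Assumed form only: a Gaussian bump of height A, width sigma, centre m.
    The form used in the patent is not reproduced in this text."""
    return A * math.exp(-((t - m) ** 2) / (2.0 * sigma ** 2))

def delta_function(t, events):
    """Sum of the generation functions of all prosodic events in one accent phrase."""
    return sum(generation_function(t, e["A"], e["sigma"], e["m"]) for e in events)

def f0_time_series(times, events, initial_f0):
    """Assumed combination: initial F0 value plus the delta function at each time."""
    return [initial_f0 + delta_function(t, events) for t in times]

# Hypothetical accent phrase with two prosodic events (values invented for illustration).
events = [
    {"A": 30.0, "sigma": 0.05, "m": 0.10},
    {"A": -20.0, "sigma": 0.08, "m": 0.40},
]
times = [i * 0.01 for i in range(60)]
series = f0_time_series(times, events, initial_f0=120.0)
```

Under these assumptions, the series rises above the initial value near the first event and dips below it near the second, which is the qualitative behaviour a sum of signed event contributions would give.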

The F0 value time-series generation device according to claim 1 or 2,
wherein the prosodic event section comprises:
a prosodic event generation unit that, using a prosodic event table in place of the prosodic event parameter table, generates a plurality of prosodic events associated with designated positions in an accent phrase according to the accent type;
a tone-specific prosodic event addition unit that, using a tone-specific prosodic event table, adds the tone-specific prosodic event corresponding to an occurrence condition to the designated position of the accent phrase when the accent phrase satisfies that occurrence condition; and
a prosodic event parameter generation unit that generates a prosodic event parameter for each prosodic event using a prosodic event parameter database and the speech and linguistic context at the position with which the prosodic event is associated,
wherein the tone-specific prosodic events are ascending, descending, gentle descent, and excitement.
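
The two table lookups recited in this claim can be sketched as below. The table contents are not reproduced in this text, so every entry, key name, and the "question" tone label are invented for illustration. The sketch only shows the control flow: events are first generated from the accent type, then tone-specific events are added when an occurrence condition holds.

```python
# Hypothetical prosodic event table: accent type -> [(event name, mora index)].
PROSODIC_EVENT_TABLE = {
    0: [("ascending", 0), ("gentle descent", 1)],
    1: [("ascending", 0), ("descending", 1)],
}

# Hypothetical tone-specific table rows: tone, occurrence condition, event, mora index.
# A mora index of -1 stands for the last mora of the accent phrase.
TONE_EVENT_TABLE = [
    {"tone": "question",
     "condition": lambda phrase: phrase["sentence_final"],
     "event": "ascending",
     "mora": -1},
]

def generate_events(phrase):
    """Look up the prosodic events for the phrase's accent type."""
    rows = PROSODIC_EVENT_TABLE[phrase["accent_type"]]
    return [{"event": name, "mora": mora} for name, mora in rows]

def add_tone_events(phrase, events, tone):
    """Append tone-specific events whose occurrence condition the phrase satisfies."""
    result = list(events)
    for row in TONE_EVENT_TABLE:
        if row["tone"] == tone and row["condition"](phrase):
            mora = row["mora"] % phrase["mora_count"]  # resolve -1 to the last mora
            result.append({"event": row["event"], "mora": mora})
    return result

# Hypothetical accent phrase: accent type 1, four moras, at the end of a sentence.
phrase = {"accent_type": 1, "mora_count": 4, "sentence_final": True}
events = add_tone_events(phrase, generate_events(phrase), tone="question")
```

Keeping the tone-specific rows in a separate table means a new tone can be supported by adding rows, without touching the accent-type lookup.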

The F0 value time-series generation device according to claim 3,
wherein the prosodic event parameter database stores normalized prosodic event parameters (hereinafter referred to as normalized prosodic event parameters), and
the device further comprises a prosodic event parameter conversion unit that converts the normalized prosodic event parameters generated by the prosodic event parameter generation unit according to mora information or accent phrase information, and outputs the resulting prosodic event parameters.

An F0 value time-series generation method that receives, as input, text in which a boundary position and an accent type are assigned to each accent phrase and a start time and an end time are determined for each mora, and that generates an F0 value time series of speech, the method comprising:
a prosodic event process in which prosodic event means generates prosodic events from the accent type and the start time and end time of each mora using a prosodic event parameter table, and generates a prosodic event parameter for each prosodic event; and
an F0 value time-series process in which F0 value time-series means generates an F0 value time series for each accent phrase using the prosodic event parameters and a predetermined generation function,
wherein the prosodic events are ascending, descending, gentle descent, and excitement,
the generation function is, for each of the prosodic events,

(expression not reproduced in this text)

and the prosodic event parameters are A, σ, and m.

The F0 value time-series generation method according to claim 5,
wherein the F0 value time-series process comprises:
a delta function generation process in which delta function generation means, for each prosodic event, applies the prosodic event parameters to the generation function obtained from a generation function table, and generates the sum of the generation functions corresponding to all prosodic events as a delta function of the F0 value time series in the accent phrase;
an initial F0 value generation process in which initial F0 value generation means obtains an initial F0 value for each accent phrase using an initial F0 value parameter database and accent phrase information; and
an F0 value time-series generation process in which F0 value time-series generation means generates the F0 value time series for each accent phrase from the delta function and the initial F0 value.

The F0 value time-series generation method according to claim 5 or 6,
wherein the prosodic event process comprises:
a prosodic event generation process in which prosodic event generation means, using a prosodic event table in place of the prosodic event parameter table, generates a plurality of prosodic events associated with designated positions in an accent phrase according to the accent type;
a tone-specific prosodic event addition process in which tone-specific prosodic event addition means, using a tone-specific prosodic event table, adds the tone-specific prosodic event corresponding to an occurrence condition to the designated position of the accent phrase when the accent phrase satisfies that occurrence condition; and
a prosodic event parameter generation process in which prosodic event parameter generation means generates a prosodic event parameter for each prosodic event using a prosodic event parameter database and the speech and linguistic context at the position with which the prosodic event is associated,
wherein the tone-specific prosodic events are ascending, descending, gentle descent, and excitement.

The F0 value time-series generation method according to claim 7,
wherein the prosodic event parameter database stores normalized prosodic event parameters (hereinafter referred to as normalized prosodic event parameters), and
the method further comprises a prosodic event parameter conversion process in which prosodic event parameter conversion means converts the normalized prosodic event parameters generated in the prosodic event parameter generation process according to mora information or accent phrase information, and outputs the resulting prosodic event parameters.

An F0 value time series generation program for causing a computer to execute each process of the F0 value time series generation device according to claim 1.

A computer-readable recording medium on which the F0 value time series generation program according to claim 9 is recorded.