JP5320341B2

JP5320341B2 - Utterance text set creation method, utterance text set creation device, and utterance text set creation program

Info

Publication number: JP5320341B2
Application number: JP2010112423A
Authority: JP
Inventors: 公人田中; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-05-14
Filing date: 2010-05-14
Publication date: 2013-10-23
Anticipated expiration: 2030-05-14
Also published as: JP2011242470A

Description

The present invention relates to an utterance text set creation method, an utterance text set creation apparatus, and an utterance text set creation program for creating the text set that a speaker reads aloud when a speech unit DB for speech synthesis is constructed.

A waveform-concatenation speech synthesis system synthesizes speech by concatenating speech units, and therefore has a speech unit DB for speech synthesis. A speech unit is a short piece of speech data prepared in advance. Speech unit types include, for example, CV, VCV, CVC, and [C]V*, and the choice depends on the individual TTS (text-to-speech) system. Here C denotes a consonant, V denotes a vowel, [C] indicates that C may be absent, and V* denotes a chain of one or more Vs. To construct the speech unit DB for speech synthesis, a speaker must read an utterance text set aloud in advance, and the speech units are obtained from the resulting natural speech data. Natural speech data is speech data recording what a speaker uttered in natural units such as words and sentences. The speech unit DB for speech synthesis is a database containing only the speech units needed for speech synthesis, extracted from the natural speech data. For more natural speech synthesis, the speech unit DB should contain as many of the speech units needed for synthesis as possible. The utterance text set must therefore consist of sentences from which speech units can be collected efficiently.

When speech containing diverse tones, utterance styles, and rich emotions is to be synthesized with high quality, it is known from Non-Patent Document 1 that a speech unit DB created from speech containing the target tone, utterance style, or emotion (hereinafter referred to as "X tone") yields higher-quality synthesized speech than a speech unit DB created from speech uttered in a reading tone. The likely cause is that prosodic and spectral characteristics differ for each variation of the X tone, so that synthesizing from reading-tone units requires large prosodic modification and introduces spectral differences, which reduce naturalness. An utterance style is a set of acoustic characteristics arising from the speaker's environment, culture, and so on: for example, a dialect, fast speech, careless speech, polite speech, slow speech, or speech that is not clearly articulated. An emotion is, for example, a sad way of speaking or a cheerful way of speaking. Tone refers to the manner in which words are spoken, and is treated here as a concept that covers speech containing the utterance styles and emotions described above. Prosodic features are voice pitch, intonation, rhythm, pauses, and the like. A spectrum is a representation of speech decomposed into frequency components, showing the strength at each frequency.

In general, utterance text sets have been created using an algorithm that maximizes the coverage of the phoneme sequences and prosodic features of a large volume of Japanese text (see Non-Patent Document 2). A phoneme sequence is a sequence of phonemes, that is, the kana reading of the text. A phoneme is the smallest unit of sound used to distinguish meaning in a given language, such as a vowel or a consonant. Coverage is the probability that the speech units needed to synthesize the text being processed by a waveform-concatenation speech synthesis system are contained in the speech unit DB for speech synthesis, taking the phoneme environment, the phoneme duration, and the fundamental frequency pattern into account.
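As a rough illustration of the coverage notion above, the following Python sketch treats coverage as the fraction of required speech unit occurrences that are present in a unit inventory. It is a simplification: it ignores phoneme environment, duration, and fundamental frequency pattern, and the function name `coverage` and the sample units are hypothetical, not taken from the patent.

```python
from collections import Counter

def coverage(required_units, inventory):
    """Return the fraction of required unit occurrences found in the inventory."""
    counts = Counter(required_units)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for unit, n in counts.items() if unit in inventory)
    return covered / total

# Hypothetical CV-style units needed to synthesize some text.
needed = ["a", "shi", "ta", "wa", "a", "me"]
# Hypothetical set of units available in the speech unit DB.
db_units = {"a", "shi", "ta"}
print(coverage(needed, db_units))
```

Here four of the six required occurrences are available, so the sketch reports a coverage of about 0.67.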

[Non-Patent Document 1] Koji Onishi, Takashi Masuko, Takao Kobayashi, "Examination of generation of different utterance styles in HMM speech synthesis", IEICE Technical Report, 2003, Vol. 102, No. 619 (SP2002-17), pp. 17-22
[Non-Patent Document 2] Hisashi Kawai, Norio Higuchi, Seiichi Yamamoto, "Creation of a waveform unit data set for speech synthesis considering fundamental frequency and phoneme duration", IEICE Transactions (D-II), August 1999, Vol. J82-D-II, No. 8, pp. 1229-1238

When an utterance text set is created, a speech synthesis program is used to estimate phoneme sequences from a large volume of Japanese text written in mixed kanji and kana, but a general speech synthesis program assumes that the text is read in a reading tone. With the conventional technique, therefore, a speaker uttering in the X tone may not utter the text as estimated. For example, if an utterance text set is created by estimating phoneme sequences with a general speech synthesis program that assumes a reading tone, and a speaker then reads that set in the X tone, a difference (reading fluctuation) is expected between the phoneme sequence estimated by the speech synthesis program and the phoneme sequence actually obtained from the X-tone utterance. For example, for the word "明日" (ashita, "tomorrow"), a general speech synthesis program estimates the phoneme sequence "あし^た" (where ^ denotes devoicing), but when the word is uttered with surprise the final vowel is lengthened, giving "あし^たー". In an emphatic utterance, "あした！", the "し" may not be devoiced.

When the phoneme sequence assumed at the time the utterance text set is generated differs in this way from the phoneme sequence actually obtained by uttering in the X tone, the "maximization of the coverage of phoneme sequences and prosodic features" computed when the utterance text set was generated is not achieved as expected, and the quality of the synthesized speech deteriorates as a result.

To solve the above problem, the utterance text set creation technique according to the present invention stores in advance a parameter distribution conversion function that converts a parameter distribution obtained from natural speech data in a reading tone into a parameter distribution obtained from natural speech data in the target X tone. It performs speech synthesis processing on an utterance text set candidate with a speech synthesis program, obtains predetermined parameters from the synthesized speech data, converts the distribution of the obtained parameters with the parameter distribution conversion function, and evaluates the utterance text set candidate using the converted parameter distribution.

The present invention has the effect that an utterance text set maximizing the coverage of phoneme sequences and prosodic features can be generated even when the text is uttered in a tone other than the reading tone.

FIG. 1: Block diagram of the utterance text set creation unit.
FIG. 2: Processing flow of the utterance text set creation unit.
FIG. 3: Block diagram of the conversion function creation unit.
FIG. 4: Processing flow of the conversion function creation unit.
FIG. 5: (a-1) segment distribution in the reading tone, (a-2) segment distribution in the X tone, (b-1) duration distribution in the reading tone, (b-2) duration distribution in the X tone, (c-1) F0 distribution in the reading tone, (c-2) F0 distribution in the X tone.
FIG. 6: Block diagram of the text set creation unit.
FIG. 7: Processing flow of the text set creation unit.

Hereinafter, embodiments of the present invention will be described in detail.

<Utterance text set creation apparatus 1000>
An utterance text set creation apparatus 1000 according to the first embodiment will be described with reference to FIGS. 1 and 2. The utterance text set creation apparatus 1000 creates the text set that a speaker reads aloud when a speech unit DB for speech synthesis is constructed.

The utterance text set creation apparatus 1000 includes an input/output interface unit 101, a conversion function creation unit 100, a storage unit 203, and a text set creation unit 200.

When the utterance text set creation apparatus 1000 receives a conversion function creation instruction from the creator of the utterance text set (hereinafter referred to as the "user") via the input/output interface unit 101, the conversion function creation unit 100 creates, by offline processing, parameter distribution conversion functions that convert a parameter distribution obtained from natural speech data in the reading tone into a parameter distribution obtained from natural speech data in the target X tone (for example, the segment distribution conversion function, duration distribution conversion function, and F0 distribution conversion function described later) (s100). It stores them in the storage unit 203 before the utterance text set candidates described later are created.

Further, when the utterance text set creation apparatus 1000 receives a text set creation instruction and tone designation information from the user via the input/output interface unit 101 (s101), the text set creation unit 200 creates, by online processing, a text set that maximizes coverage when uttered in the X tone (s200), and outputs it to the user via the input/output interface unit 101. The processing of each unit is described below.

<Input/output interface unit 101>
The input/output interface unit 101 accepts input from the user and outputs information to the user. It consists of, for example, an input interface through which data is input (a keyboard, a mouse, or the like) and an output interface through which data is output (a display, a printer, or the like), or of input/output terminals for such interfaces. When the utterance text set creation apparatus 1000 is a server or the like on a network and the user accesses it via the network, the input/output interface unit 101 may be a communication unit or the like for exchanging data with the user.

<Storage unit 203>
The storage unit 203 stores and reads out, one by one, the data that is input and output and the data produced during computation, so that each computation can proceed. The data does not necessarily have to be stored in the storage unit 203, however, and may instead be passed directly between units. The segment distribution conversion function DB 234, the duration distribution conversion function DB 236, and the F0 distribution conversion function DB 238 described later may be part of the storage unit 203.

<Conversion function creation unit 100>
The conversion function creation unit 100 uses, for example, natural speech uttered by a human in the X tone to extract the differences between the reading tone and the X tone (the difference in the appearance frequency distribution of the speech units contained in the phoneme sequences, the difference in the appearance frequency distribution of phoneme durations for each speech unit, and the difference in the appearance frequency distribution of fundamental frequency patterns for each speech unit). From these it obtains the parameter distribution conversion functions and stores them in the storage unit 203.

The conversion function creation unit 100 will be described with reference to FIGS. 3 and 4. The conversion function creation unit 100 includes a natural speech DB 110, a phoneme labeling unit 111, a first parameter distribution extraction unit 120, and a parameter distribution conversion function calculation unit 130.

(Natural speech DB 110 and phoneme labeling unit 111)
The natural speech DB 110 stores reading-tone natural speech data and X-tone natural speech data in advance. For example, each set of natural speech data is obtained by reading the same text (for example, "旋回する", "to turn") aloud in the reading tone and in the X tone. The X tone may have various variations, in which case natural speech data is created and stored for each variation.

On receiving the conversion function creation instruction via the input/output interface unit 101, the phoneme labeling unit 111 acquires the reading-tone natural speech data and the X-tone natural speech data from the natural speech DB 110 (s110), attaches phoneme labels (for example, /seNkaisuru/) to each set of natural speech data manually or automatically (s111), and outputs each set of acquired natural speech data together with its label data to the segment distribution extraction unit 123.

(First parameter distribution extraction unit 120)
The first parameter distribution extraction unit 120 obtains predetermined parameters from reading-tone natural speech data, in which a predetermined document is read aloud in the reading tone, and from X-tone natural speech data, in which the same document is read aloud in the X tone, and extracts the distribution of the parameters for each (s120).

For example, the first parameter distribution extraction unit 120 includes an all-speech-unit variation storage unit 122, a segment distribution extraction unit 123, a duration distribution extraction unit 125, and an F0 distribution extraction unit 127.

{All-speech-unit variation storage unit 122 and segment distribution extraction unit 123}
The segment distribution extraction unit 123 refers to the all-speech-unit variation storage unit 122, assigns a speech unit number to each speech unit obtained from each set of natural speech data (s123a), and extracts the distribution of the appearance frequencies of the speech units (hereinafter referred to as the "segment distribution") (s123b).

The all-speech-unit variation storage unit 122 stores speech units (or feature values obtained from the speech units, label data corresponding to the speech units, or the like) together with a speech unit number for each speech unit. The speech units stored in the all-speech-unit variation storage unit 122 depend on the text-to-speech synthesis system being developed.

The segment distribution extraction unit 123 receives each set of natural speech data and its label data, searches the all-speech-unit variation storage unit 122 using the speech units obtained from the natural speech data as keys, and acquires the speech unit number for each speech unit. From the number of times each speech unit number is obtained (its appearance frequency), it determines the appearance frequency of every speech unit type and extracts the segment distribution. It transmits the segment distribution to the segment distribution conversion function calculation unit 133, and transmits each set of natural speech data and the speech unit numbers associated with it to the duration distribution extraction unit 125 and the F0 distribution extraction unit 127. It additionally transmits the label data attached to each set of natural speech data to the duration distribution extraction unit 125.
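The counting step described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the mapping `unit_numbers` stands in for the all-speech-unit variation storage unit 122, unit numbers are assumed to run from 0 to N-1, and the syllable-like units are hypothetical.

```python
from collections import Counter

def segment_distribution(units, unit_numbers):
    """Count how often each speech unit number occurs.

    units: speech units observed in the labeled natural speech data.
    unit_numbers: mapping from speech unit to its number (assumed 0..N-1).
    Returns a list of length N holding the appearance frequency per number.
    """
    counts = Counter(unit_numbers[u] for u in units)
    return [counts.get(i, 0) for i in range(len(unit_numbers))]

# Hypothetical inventory and one labeled utterance (/seNkaisuru/).
unit_numbers = {"se": 0, "N": 1, "ka": 2, "i": 3, "su": 4, "ru": 5}
reading_units = ["se", "N", "ka", "i", "su", "ru"]
print(segment_distribution(reading_units, unit_numbers))
```

Running the same function on the reading-tone data and on the X-tone data gives the two distributions that the segment distribution conversion function relates to each other.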

{Duration distribution extraction unit 125}
The duration distribution extraction unit 125 receives the label data and the speech unit numbers, uses them to calculate the phoneme duration for each speech unit (s125a), extracts the distribution of the appearance frequencies of the phoneme durations for each speech unit (hereinafter referred to as the "duration distribution") (s125b), and transmits it to the duration distribution conversion function calculation unit 135. The phoneme duration is calculated as vector data. For example, if the phonemes of the speech unit "KAS" have durations of 12 ms for K, 22 ms for A, and 11 ms for S, the vector data is (12, 22, 11). The phoneme duration of each speech unit may, however, be represented by another conventional technique.
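A minimal Python sketch of steps s125a and s125b is shown below. It assumes, purely for illustration, that the label data gives start and end times in milliseconds for each phoneme of a speech unit; the function name, the data layout, and the unit number 7 are hypothetical.

```python
from collections import Counter, defaultdict

def duration_distributions(labeled_units):
    """Build, per speech unit number, a histogram of duration vectors.

    labeled_units: iterable of (unit_number, [(phoneme, start_ms, end_ms), ...]).
    Returns {unit_number: Counter({duration_vector: appearance_frequency})}.
    """
    dists = defaultdict(Counter)
    for unit_number, phones in labeled_units:
        # One duration per phoneme, kept together as a vector (s125a).
        vector = tuple(end - start for _, start, end in phones)
        dists[unit_number][vector] += 1  # appearance frequency (s125b)
    return dists

# The "KAS" example from the text: K = 12 ms, A = 22 ms, S = 11 ms.
data = [
    (7, [("K", 0, 12), ("A", 12, 34), ("S", 34, 45)]),
    (7, [("K", 100, 112), ("A", 112, 134), ("S", 134, 145)]),
]
dists = duration_distributions(data)
print(dists[7])
```

In practice, continuous durations would probably need to be quantized before counting so that similar vectors fall into the same bin; the text does not specify how this is done.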

{F0 distribution extraction unit 127}
The F0 distribution extraction unit 127 receives the natural speech data, the speech unit numbers, and the label data, uses them to extract the fundamental frequency pattern for each speech unit (s127a), extracts the distribution of the appearance frequencies of the fundamental frequency patterns for each speech unit (hereinafter referred to as the "F0 distribution") (s127b), and transmits it to the F0 distribution conversion function calculation unit 137. The fundamental frequency pattern is calculated as vector data. For example, if the average fundamental frequencies of the phonemes of the speech unit "ASU" are 120 Hz for A, 0 Hz for S (S is an unvoiced consonant and has no fundamental frequency), and 220 Hz for U, the vector data is (120, 0, 220). There are various other ways of specifying the fundamental frequency pattern, and the fundamental frequency pattern of each speech unit may be represented by another conventional technique. For example, vector data consisting of the mean of the fundamental frequency over the speech unit, its variance, the frequency at the start point, and the frequency at the end point may be extracted. Alternatively, instead of taking an average for each phoneme, the temporal change of the fundamental frequency may be approximated by a polyline with three points.
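The per-phoneme averaging in the "ASU" example can be sketched in Python as below. The frame-based F0 track, the convention that unvoiced frames are marked 0, the rounding to whole hertz, and the function name are all assumptions made for illustration; the text only specifies that the vector holds one average per phoneme.

```python
def f0_pattern(f0_track_hz, phone_frames):
    """Return the mean F0 per phoneme as a tuple.

    f0_track_hz: F0 value per analysis frame, with 0 marking unvoiced frames.
    phone_frames: (start_frame, end_frame) per phoneme, end exclusive.
    A phoneme with no voiced frame yields 0.
    """
    pattern = []
    for start, end in phone_frames:
        voiced = [f for f in f0_track_hz[start:end] if f > 0]
        pattern.append(round(sum(voiced) / len(voiced)) if voiced else 0)
    return tuple(pattern)

# Hypothetical frames for the unit "ASU": A is voiced, S is unvoiced, U is voiced.
track = [118, 120, 122, 0, 0, 218, 220, 222]
frames = [(0, 3), (3, 5), (5, 8)]
print(f0_pattern(track, frames))
```

Counting how often each resulting vector appears for a given speech unit, in the same way as for duration vectors, gives that unit's F0 distribution.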

(Parameter distribution conversion function calculation unit 130)
The parameter distribution conversion function calculation unit 130 calculates parameter distribution conversion functions that convert the parameter distributions obtained from the reading-tone natural speech data into the parameter distributions obtained from the X-tone natural speech data (s130).

For example, the parameter distribution conversion function calculation unit 130 includes a segment distribution conversion function calculation unit 133, a duration distribution conversion function calculation unit 135, and an F0 distribution conversion function calculation unit 137.

{Segment distribution conversion function calculation unit 133}
The segment distribution conversion function calculation unit 133 receives the segment distributions obtained from each set of natural speech data, calculates a segment distribution conversion function that converts the segment distribution obtained from the reading-tone natural speech data (FIG. 5 (a-1)) into the segment distribution obtained from the X-tone natural speech data (FIG. 5 (a-2)) (s133), and transmits it to the segment distribution conversion function DB 234 for registration. The upper part of FIG. 5 illustrates the segment distribution conversion function f, which converts the appearance frequency distribution of speech units from the reading tone to the X tone. On the horizontal axes of (a-1) and (a-2), the N speech unit numbers are arranged in order from the left. The vertical axis is the appearance frequency. The conversion function f converts the distribution on the left into the distribution on the right. In this way, differences between the two tones, such as those due to reading fluctuation, can be incorporated into the conversion function f.

For example, let $N$ be the number of speech unit types. For each speech unit, the difference between the reading-tone segment distribution $\{u_{1w}, u_{2w}, \ldots, u_{Nw}\}$ and the X-tone segment distribution $\{u_{1x}, u_{2x}, \ldots, u_{Nx}\}$, namely $\{u_{1w}-u_{1x},\ u_{2w}-u_{2x},\ \ldots,\ u_{Nw}-u_{Nx}\}$, is calculated and stored. In the segment distribution conversion unit 233 described later, the segment distribution conversion function converts an input segment distribution by subtracting this difference from it. Alternatively, the segment distribution conversion function may, for example, convert the input segment distribution by multiplying it, speech unit by speech unit, by the ratio between the reading-tone segment distribution and the X-tone segment distribution. The reading-tone segment distribution may also be converted into the X-tone segment distribution by other methods. The segment distribution conversion function calculation unit 133 calculates one segment distribution conversion function for each variation of the X tone and transmits them to the segment distribution conversion function DB 234 for registration.
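The difference-based and ratio-based variants described above might look as follows in Python. This is a sketch under stated assumptions: distributions are plain lists indexed by speech unit number and are assumed to be on comparable scales, negative results are clamped to zero, the ratio is taken as X tone divided by reading tone, and a ratio of 1.0 is used where the reading-tone frequency is zero. None of these details are specified in the text, and the function names are hypothetical.

```python
def make_difference_conversion(reading_dist, x_dist):
    """Store the per-unit difference (reading minus X) and subtract it from inputs."""
    diffs = [w - x for w, x in zip(reading_dist, x_dist)]

    def convert(dist):
        # Clamping at zero is an added assumption to keep frequencies non-negative.
        return [max(d - delta, 0) for d, delta in zip(dist, diffs)]

    return convert

def make_ratio_conversion(reading_dist, x_dist):
    """Store the per-unit ratio (X / reading, assumed direction) and multiply inputs by it."""
    ratios = [x / w if w else 1.0 for w, x in zip(reading_dist, x_dist)]

    def convert(dist):
        return [d * r for d, r in zip(dist, ratios)]

    return convert

# Hypothetical distributions over N = 3 speech unit numbers.
reading = [10, 4, 6]
x_tone = [8, 7, 5]
f_diff = make_difference_conversion(reading, x_tone)
f_ratio = make_ratio_conversion(reading, x_tone)
print(f_diff(reading))        # reproduces the X-tone distribution
print(f_diff([20, 8, 12]))    # applied to a distribution from synthesized speech
print(f_ratio([20, 8, 12]))
```

Applying either function to the reading-tone distribution itself reproduces the X-tone distribution, which is the property the text requires of the conversion function f.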

{Duration distribution conversion function calculation unit 135}
The duration distribution conversion function calculation unit 135 receives the duration distribution of each speech unit for each set of natural speech data, calculates a duration distribution conversion function that converts the duration distribution of each speech unit obtained from the reading-tone natural speech data (FIG. 5 (b-1)) into the duration distribution of each speech unit obtained from the X-tone natural speech data (FIG. 5 (b-2)) (s135), and transmits it to the duration distribution conversion function DB 236 for registration. Consequently, (number of X tone variations) × (number N of speech unit types) duration distribution conversion functions are registered in the duration distribution conversion function DB 236. The middle part of FIG. 5 illustrates the duration distribution conversion function, which converts the appearance frequency distribution of phoneme durations from the reading tone to the X tone. The left side shows the appearance frequencies of the phoneme durations of a given speech unit i in the reading tone (where M_i is the number of variations of the phoneme duration vector), and the right side shows the appearance frequencies of the phoneme durations of the same speech unit i in the X tone. On the horizontal axes of (b-1) and (b-2), the M_i phoneme duration vectors are arranged in order from the left. The vertical axis is the appearance frequency. The conversion function g_i converts the distribution on the left into the distribution on the right.

For example, let $M_i$ be the number of variations of the phoneme duration vector for a given speech unit $i$. For each speech unit, the difference between the reading-tone duration distribution $\{u_{1w}, u_{2w}, \ldots, u_{M_i w}\}$ and the X-tone duration distribution $\{u_{1x}, u_{2x}, \ldots, u_{M_i x}\}$, namely $\{u_{1w}-u_{1x},\ u_{2w}-u_{2x},\ \ldots,\ u_{M_i w}-u_{M_i x}\}$, is calculated and stored. In the duration distribution conversion unit 235 described later, the duration distribution conversion function converts an input duration distribution by subtracting this difference from it. This processing is performed for all speech units. The reading-tone duration distribution may also be converted into the X-tone duration distribution by other methods. The F0 distribution conversion function calculation unit 137 and the F0 distribution conversion unit 237 described later can obtain the F0 distribution conversion function and convert the F0 distribution by the same kind of processing.
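The per-unit difference described above can be sketched in Python as follows, with one stored difference table per speech unit number. The data layout (a dictionary of `Counter` objects keyed by duration vector), the dropping of non-positive frequencies, and the function name are illustrative assumptions rather than details given in the text.

```python
from collections import Counter

def make_duration_conversions(reading, x_tone):
    """Build a per-unit, difference-based conversion g_i.

    reading, x_tone: {unit_number: Counter({duration_vector: frequency})}.
    """
    diffs = {}
    for i in reading.keys() | x_tone.keys():
        w = reading.get(i, Counter())
        x = x_tone.get(i, Counter())
        # Difference (reading minus X) for every duration vector seen in either tone.
        diffs[i] = {v: w[v] - x[v] for v in w.keys() | x.keys()}

    def convert(i, dist):
        delta = diffs.get(i, {})
        out = {v: dist.get(v, 0) - delta.get(v, 0) for v in dist.keys() | delta.keys()}
        # Dropping non-positive frequencies is an added assumption.
        return {v: n for v, n in out.items() if n > 0}

    return convert

# Hypothetical distributions for speech unit number 7.
reading = {7: Counter({(12, 22, 11): 3, (10, 20, 10): 1})}
x_tone = {7: Counter({(12, 22, 11): 1, (14, 30, 12): 2})}
g = make_duration_conversions(reading, x_tone)
print(g(7, reading[7]))
```

The same structure would apply to F0 pattern vectors, with fundamental frequency patterns taking the place of duration vectors.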

{F0 distribution conversion function calculation unit 137}
The F0 distribution conversion function calculation unit 137 receives the F0 distribution of each speech unit for each set of natural speech data, calculates an F0 distribution conversion function that converts the F0 distribution of each speech unit obtained from the reading-tone natural speech data (FIG. 5 (c-1)) into the F0 distribution of each speech unit obtained from the X-tone natural speech data (FIG. 5 (c-2)) (s137), and transmits it to the F0 distribution conversion function DB 238 for registration. Consequently, (number of X tone variations) × (number N of speech unit types) F0 distribution conversion functions are registered in the F0 distribution conversion function DB 238. The lower part of FIG. 5 illustrates the function that converts the F0 distribution from the reading tone to the X tone. The left side shows the appearance frequencies of the fundamental frequency patterns of a given speech unit i in the reading tone (where L_i is the number of variations of the fundamental frequency pattern vector), and the right side shows the appearance frequencies of the fundamental frequency patterns of the same speech unit i in the X tone. On the horizontal axes of (c-1) and (c-2), the L_i fundamental frequency pattern vectors are arranged in order from the left. The vertical axis is the appearance frequency. The conversion function h_i converts the distribution on the left into the distribution on the right.

<Text set creation unit 200>
The text set creation unit 200 will be described with reference to FIGS. 6 and 7. The text set creation unit 200 includes an utterance text set candidate creation unit 210, a large-volume Japanese DB 211, a second parameter distribution extraction unit 220, a parameter distribution conversion unit 230, an evaluation unit 250, and an end determination unit 260. In FIG. 6, the parameter distribution conversion unit 230 and the conversion function DBs 234, 236, and 238 are the parts added by the present invention; the other parts may perform iterative processing equivalent to the conventional technique (for example, Non-Patent Document 2). Iterative processing includes the "exchange method" and the "greedy algorithm", among others; FIGS. 6 and 7 show the exchange method as an example.

(Utterance text set candidate creation unit 210 and large-volume Japanese DB 211)
Upon receiving a text set creation instruction via the input/output interface unit 101, the utterance text set candidate creation unit 210 extracts a predetermined number of sentences (for example, 500) from the large-volume Japanese sentence DB 211, creates the first utterance text set candidate (hereinafter "T") (s210), and transmits it to the second parameter distribution extraction unit 220. Information specifying the number of sentences to extract from the large-volume Japanese sentence DB 211 (hereinafter "extraction number designation information") may be added to the text set creation instruction. The extraction number designation information is a value that the user specifies and inputs at the outset.
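As a rough illustration of step s210, the sketch below draws the initial candidate T from a sentence collection. The function name `create_initial_candidate`, the in-memory list standing in for the sentence DB, and the `seed` parameter are assumptions for the example; random selection follows the wording of the claims.

```python
import random
from typing import List, Optional, Sequence


def create_initial_candidate(
    corpus: Sequence[str], n_sentences: int = 500, seed: Optional[int] = None
) -> List[str]:
    """Pick `n_sentences` distinct sentences at random from `corpus`."""
    if n_sentences > len(corpus):
        raise ValueError("corpus holds fewer sentences than requested")
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(list(corpus), n_sentences)


# Hypothetical stand-in for the large-volume Japanese sentence DB 211.
corpus = [f"sentence {i}" for i in range(1000)]
T = create_initial_candidate(corpus, n_sentences=500, seed=0)
```

Here `n_sentences` plays the role of the extraction number designation information.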

(Second parameter distribution extraction unit 220)
The second parameter distribution extraction unit 220 performs speech synthesis processing on T with a speech synthesis program, obtains predetermined parameters from the resulting speech synthesis data, and extracts the distribution of those parameters (s220).

For example, the second parameter distribution extraction unit 220 includes a phoneme sequence, fundamental frequency pattern, and phoneme duration extraction unit 221, a segment distribution extraction unit 223, a duration distribution extraction unit 225, and an F0 distribution extraction unit 227.

{Phoneme sequence, fundamental frequency pattern, and phoneme duration extraction unit 221}
The phoneme sequence, fundamental frequency pattern, and phoneme duration extraction unit 221 receives the utterance text set candidate, performs speech synthesis processing on it with a speech synthesis program, estimates the phoneme sequence, fundamental frequency patterns, and phoneme durations from the speech synthesis data, extracts them (s221), and transmits them to the segment distribution extraction unit 223.

{Segment distribution extraction unit 223}
The segment distribution extraction unit 223 receives the phoneme sequence, fundamental frequency patterns, and phoneme durations, uses the phoneme sequence to obtain the appearance frequency of each speech unit, extracts the segment distribution (s223), and transmits it to the segment distribution conversion unit 233. It also transmits each speech unit together with its associated phoneme duration to the duration distribution extraction unit 225, and each speech unit together with its associated fundamental frequency pattern to the F0 distribution extraction unit 227.

{Duration distribution extraction unit 225}
The duration distribution extraction unit 225 receives the segment distribution and the phoneme durations for each speech unit, obtains the phoneme durations of each speech unit, extracts the duration distribution from their appearance frequencies (s225), and transmits it to the duration distribution conversion unit 235.

{F0 distribution extraction unit 227}
The F0 distribution extraction unit 227 receives the segment distribution and the fundamental frequency patterns for each speech unit, obtains the fundamental frequency patterns of each speech unit, extracts the F0 distribution from their appearance frequencies (s227), and transmits it to the F0 distribution conversion unit 237.

Note that the segment distribution extraction unit 223 extracts a single segment distribution, whereas the duration distribution extraction unit 225 and the F0 distribution extraction unit 227 extract duration distributions and F0 distributions, respectively, one for each speech unit variation.
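The three extraction steps (s223, s225, s227) amount to counting. The sketch below assumes, purely for illustration, that the synthesizer output has already been reduced to a list of `(unit_id, duration_pattern, f0_pattern)` tuples with hashable pattern labels; that representation and the name `extract_distributions` are not from the source.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple

Unit = Tuple[Hashable, Hashable, Hashable]  # (unit id, duration pattern, F0 pattern)


def extract_distributions(
    units: Iterable[Unit],
) -> Tuple[Counter, Dict[Hashable, Counter], Dict[Hashable, Counter]]:
    """Return one segment distribution plus per-unit duration and F0 distributions."""
    segment: Counter = Counter()
    duration: Dict[Hashable, Counter] = defaultdict(Counter)
    f0: Dict[Hashable, Counter] = defaultdict(Counter)
    for unit_id, duration_pattern, f0_pattern in units:
        segment[unit_id] += 1                      # appearance frequency of the unit
        duration[unit_id][duration_pattern] += 1   # duration distribution of this unit
        f0[unit_id][f0_pattern] += 1               # F0 distribution of this unit
    return segment, dict(duration), dict(f0)


units = [
    ("ka", "short", "flat"),
    ("ka", "long", "flat"),
    ("sa", "short", "rise"),
    ("ka", "short", "fall"),
]
segment_dist, duration_dist, f0_dist = extract_distributions(units)
```

The result mirrors the note above: a single `segment_dist`, and one duration distribution and one F0 distribution per speech unit that appears.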

(Parameter distribution conversion unit 230)
The parameter distribution conversion unit 230 receives tone designation information via the input/output interface unit 101, retrieves the parameter distribution conversion function from the storage unit 203 based on the tone designation information, and uses that function to convert the parameter distribution obtained from the speech synthesis data (s230). For example, the parameter distribution conversion unit 230 includes a segment distribution conversion unit 233, a duration distribution conversion unit 235, and an F0 distribution conversion unit 237.

{Segment distribution conversion unit 233}
The segment distribution conversion unit 233 receives the tone designation information and the segment distribution, searches the segment distribution conversion function DB 234 in the storage unit 203 using the tone designation information as a key, retrieves the corresponding segment distribution conversion function, uses it to convert the received segment distribution (obtained from the speech synthesis data) (s233), and transmits the converted segment distribution to the evaluation unit 250.

{Duration distribution conversion unit 235}
The duration distribution conversion unit 235 receives the tone designation information and the duration distribution, searches the duration distribution conversion function DB 236 in the storage unit 203 using the tone designation information as a key, retrieves the corresponding duration distribution conversion function, uses it to convert the received duration distribution (obtained from the speech synthesis data) (s235), and transmits the converted duration distribution to the evaluation unit 250.

{F0 distribution conversion unit 237}
The F0 distribution conversion unit 237 receives the tone designation information and the F0 distribution, searches the F0 distribution conversion function DB 238 in the storage unit 203 using the tone designation information as a key, retrieves the corresponding F0 distribution conversion function, uses it to convert the received F0 distribution (obtained from the speech synthesis data) (s237), and transmits the converted F0 distribution to the evaluation unit 250.
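The lookup-then-convert pattern shared by units 233, 235, and 237 can be sketched as a keyed table of functions. The dictionary keyed by `(tone, unit_id)`, the name `convert_distribution`, and the example "polite" tone with its doubling function are hypothetical; the source only states that the tone designation information is used as the search key.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Converter = Callable[[Sequence[float]], List[float]]
FunctionDB = Dict[Tuple[str, Hashable], Converter]


def convert_distribution(
    function_db: FunctionDB, tone: str, unit_id: Hashable, dist: Sequence[float]
) -> List[float]:
    """Look up the conversion function for (tone, unit) and apply it."""
    try:
        convert = function_db[(tone, unit_id)]
    except KeyError:
        raise KeyError(f"no conversion function for tone={tone!r}, unit={unit_id!r}")
    return convert(dist)


# Hypothetical DB holding one function for one tone and one speech unit.
f0_function_db: FunctionDB = {
    ("polite", 7): lambda dist: [2.0 * v for v in dist],
}
converted = convert_distribution(f0_function_db, "polite", 7, [1.0, 2.0])
```

A table of this shape holds (number of X tones) × (number of speech unit types) entries for the duration and F0 functions, which matches the sizes stated for DBs 236 and 238.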

(Evaluation unit 250)
The evaluation unit 250 calculates an evaluation function using the converted parameter distributions (segment distribution, duration distribution, and F0 distribution), evaluates the utterance text set candidate (s250), and transmits the evaluation result to the utterance text set candidate creation unit 210 via the end determination unit 260. The evaluation function is calculated, for example, by the method of Non-Patent Document 2.

For example, let $N$ be the number of speech unit types, let $\{u_1, u_2, \ldots, u_N\}$ be the appearance frequencies of the speech units appearing in the utterance text set candidate, and let $p_i$ be the relative appearance frequency of $u_i$. Let $N_i$ be the number of phoneme duration types corresponding to $u_i$, let $\{v_{i1}, v_{i2}, \ldots, v_{iN_i}\}$ be their respective appearance frequencies, and let $q_{ij}$ be the relative appearance frequency of $v_{ij}$. The fundamental frequency pattern can be handled in the same way as the phoneme duration.

An index $r_i$ is introduced to represent the degree of coverage achieved for the speech unit $u_i$, where

[equation defining $r_i$ not reproduced in this text]

and $d_{ij}(T)$ is a function that takes the value 1 when the utterance text set candidate $T$ contains a waveform segment that can realize the fundamental frequency and phoneme duration of $v_{ij}$ through modification within the allowable range of quality degradation, and 0 otherwise.

The total coverage of the speech units contained in the utterance text set candidate $T$ is

[equation for the total coverage not reproduced in this text]

Among the phoneme durations and fundamental frequency patterns belonging to the same speech unit, those with a higher appearance frequency contribute more to the evaluation criterion that measures how good the coverage is. This total may be used as the evaluation function. In addition, a nonlinear function or the like may be introduced as a mechanism for adjusting the weight between the spread of the phoneme environment and the spread of the fundamental frequency patterns and speech unit durations (see Non-Patent Document 2).
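Because the two formulas are not reproduced in this text, the sketch below relies on an assumed form that is merely consistent with the surrounding description: $r_i = \sum_j q_{ij}\, d_{ij}(T)$ for the per-unit index and $\sum_i p_i\, r_i$ for the total. The function name `coverage_score` and the set-based representation of $d_{ij}(T)$ are likewise illustrative assumptions, not the patent's definitions.

```python
from typing import Dict, Hashable, Set, Tuple


def coverage_score(
    p: Dict[Hashable, float],
    q: Dict[Hashable, Dict[Hashable, float]],
    covered: Set[Tuple[Hashable, Hashable]],
) -> float:
    """Weighted coverage under the ASSUMED form sum_i p_i * sum_j q_ij * d_ij(T).

    p[i]       relative appearance frequency of speech unit i
    q[i][j]    relative appearance frequency of prosodic variant j of unit i
    covered    pairs (i, j) for which d_ij(T) == 1
    """
    total = 0.0
    for i, p_i in p.items():
        # r_i: share of unit i's prosodic variants that the candidate covers.
        r_i = sum(q_ij for j, q_ij in q.get(i, {}).items() if (i, j) in covered)
        total += p_i * r_i
    return total


p = {"a": 0.75, "b": 0.25}
q = {"a": {"short": 0.5, "long": 0.5}, "b": {"short": 1.0}}
score = coverage_score(p, q, covered={("a", "short"), ("b", "short")})
```

Under this assumed form, frequent variants of frequent units weigh most heavily, which is the property the paragraph above attributes to the evaluation criterion.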

(End determination unit 260)
The end determination unit 260 determines whether an end condition is satisfied (s260) and transmits the end determination result to the utterance text set candidate creation unit 210. The end condition is, for example, that the number of sentences for which an exchange has been attempted has reached a predetermined value, or that the value of the evaluation function is greater than or equal to a predetermined value.

[Iterative processing]
The utterance text set candidate creation unit 210 receives the evaluation result and the end determination result. If the end determination result indicates that the end condition is satisfied (s260), it outputs the current utterance text set candidate as the utterance text set (s315). If the end determination result indicates that the end condition is not satisfied (s260), it creates a new utterance text set candidate (s210) and repeats the processing (s210 to s260).

A new utterance text set candidate may be created by taking an arbitrary sentence from the large-volume Japanese DB 211 and exchanging it with an arbitrary sentence in the current utterance text set candidate. In this case, the candidate with the exchanged sentence and the candidate without the exchange may both be stored in the storage unit 203, and the candidate with the lower evaluation may be deleted according to the evaluation result of the evaluation unit 250. From the second iteration onward, each step only needs to process the difference, so the processing can be performed efficiently.
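The exchange loop (s210 to s260) can be sketched as below. This is an illustrative outline under several assumptions: the names `exchange_search`, `max_trials`, and `target` are invented for the example, `evaluate` stands in for the whole extract-convert-evaluate chain, and the candidate is re-evaluated in full on each trial rather than incrementally.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def exchange_search(
    corpus: Sequence,
    initial: Sequence,
    evaluate: Callable[[List], float],
    max_trials: int = 1000,
    target: Optional[float] = None,
    seed: Optional[int] = None,
) -> Tuple[List, float]:
    """Swap one sentence per trial and keep whichever candidate scores higher."""
    rng = random.Random(seed)
    best = list(initial)
    best_score = evaluate(best)
    for _ in range(max_trials):  # end condition: number of attempted exchanges
        if target is not None and best_score >= target:
            break  # end condition: evaluation value reached the threshold
        trial = list(best)
        # Replace an arbitrary sentence with an arbitrary corpus sentence.
        # (This sketch does not prevent picking a sentence already present.)
        trial[rng.randrange(len(trial))] = rng.choice(corpus)
        trial_score = evaluate(trial)
        if trial_score > best_score:  # discard the lower-rated candidate
            best, best_score = trial, trial_score
    return best, best_score


# Toy usage: integers stand in for sentences, their sum for the evaluation.
best, best_score = exchange_search(
    list(range(100)), [0, 1, 2], lambda t: float(sum(t)), max_trials=200, seed=1
)
```

A production version would exploit the point made above and update the distributions only for the sentence that changed.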

<Program>
The utterance text set creation device described above can also be implemented on a computer. In that case, a program that causes the computer to function as the target device (a device having the functional configuration shown in the drawings of the embodiment), or a program that causes the computer to execute each step of its processing procedure (as shown in the embodiment), is loaded into the computer from a recording medium such as a CD-ROM, a magnetic disk, or a semiconductor storage device, or downloaded via a communication line, and then executed.

<Effect>
In the present invention, for each X tone, the differences in utterance that tend to arise relative to the reading tone are extracted and patterned in advance, and the utterance text set is corrected according to those differences so that the coverage of phoneme sequences and prosodic features is maximized. Coverage can thus be maximized even when the text is uttered in a tone other than the reading tone. When a speaker reads the utterance text set created by the present invention and the speech unit DB is constructed from the resulting natural speech data, degradation of synthesized speech in the X tone can be prevented.

[Modifications]
The utterance text set creation device 1000 need not include the conversion function creation unit 100. For example, conversion functions created by another device may be stored in the storage unit 203.

The utterance text set creation device 1000 converts three parameter distributions (segment distribution, duration distribution, and F0 distribution), but it is sufficient to convert at least one of them. Estimation accuracy decreases, but the amount of data and computation can be reduced. Parameter distributions other than these three may also be converted.

In Embodiment 1, the utterance text set candidates are evaluated using the method described in Non-Patent Document 2, but other existing techniques may be used for the evaluation.

The present invention can be used to create the text set that a speaker reads aloud when a speech unit DB for speech synthesis is constructed. When a speaker reads, in the X tone, the text set created by the utterance text set creation device 1000 of the present invention, a speech unit DB for speech synthesis can be constructed that maximizes the coverage of phoneme sequences and prosodic features in the X tone, and using that DB enables high-quality synthesized speech in the X tone.

1000 Utterance text set creation device
100 Conversion function creation unit
101 Input/output interface unit
110 Natural speech DB
111 Phoneme labeling unit
120 First parameter distribution extraction unit
130 Parameter distribution conversion function calculation unit
200 Text set creation unit
203 Storage unit
210 Utterance text set candidate creation unit
220 Second parameter distribution extraction unit
230 Parameter distribution conversion unit
234 Segment distribution conversion function DB
236 Duration distribution conversion function DB
238 F0 distribution conversion function DB
250 Evaluation unit
260 End determination unit

Claims

An utterance text set creation method for creating a text set to be read aloud by a speaker when constructing a speech unit DB for speech synthesis,
wherein a parameter distribution conversion function for converting a parameter distribution obtained from reading-tone natural speech data into a parameter distribution obtained from natural speech data of a target X tone is stored in a storage unit in advance,
the method comprising: an utterance text set candidate creation step of randomly extracting a predetermined number of sentences from a large-volume Japanese sentence DB and creating an utterance text set candidate;
a second parameter distribution extraction step of performing speech synthesis processing on the utterance text set candidate with a speech synthesis program, obtaining predetermined parameters from the speech synthesis data, and extracting a distribution of the obtained parameters;
a parameter distribution conversion step of retrieving the parameter distribution conversion function from the storage unit and converting the parameter distribution obtained from the speech synthesis data using the parameter distribution conversion function; and
an evaluation step of evaluating the utterance text set candidate using the converted parameter distribution.

The utterance text set creation method according to claim 1, further comprising:
a first parameter distribution extraction step of obtaining predetermined parameters from reading-tone natural speech data, in which a predetermined document is read aloud in the reading tone, and from X-tone natural speech data, in which the same document is read aloud in the X tone, and extracting the respective parameter distributions;
a parameter distribution conversion function calculation step of calculating a parameter distribution conversion function for converting the parameter distribution obtained from the reading-tone natural speech data into the parameter distribution obtained from the X-tone natural speech data; and
a step of storing the parameter distribution conversion function in the storage unit before the utterance text set candidate is created.

An utterance text set creation method for creating a text set to be read aloud by a speaker when constructing a speech unit DB for speech synthesis,
wherein a storage unit stores in advance a segment distribution conversion function, a duration distribution conversion function, and an F0 distribution conversion function that respectively convert the appearance frequency distribution of the speech units (hereinafter "segment distribution"), the distribution of appearance frequencies of phoneme durations for each speech unit (hereinafter "duration distribution"), and the distribution of appearance frequencies of fundamental frequency patterns for each speech unit (hereinafter "F0 distribution"), each obtained from reading-tone natural speech data, into the segment distribution, the duration distribution for each speech unit, and the F0 distribution for each speech unit obtained from natural speech data of a target X tone,
the method comprising: an utterance text set candidate creation step of randomly extracting a predetermined number of sentences from a large-volume Japanese sentence DB and creating an utterance text set candidate;
a phoneme sequence, fundamental frequency pattern, and phoneme duration extraction step of performing speech synthesis processing on the utterance text set candidate with a speech synthesis program and extracting a phoneme sequence, fundamental frequency patterns, and phoneme durations from the speech synthesis data;
a second segment distribution extraction step of obtaining the appearance frequency of each speech unit from the phoneme sequence and extracting a segment distribution;
a second duration distribution and F0 distribution extraction step of obtaining the phoneme durations for each speech unit and extracting a duration distribution, and obtaining the fundamental frequency patterns for each speech unit and extracting an F0 distribution;
a parameter distribution conversion step of retrieving the segment distribution conversion function, the duration distribution conversion function, and the F0 distribution conversion function from the storage unit and using them to convert, respectively, the segment distribution, the duration distribution, and the F0 distribution obtained from the speech synthesis data; and
an evaluation step of calculating an evaluation function using the converted segment distribution, duration distribution, and F0 distribution and evaluating the utterance text set candidate.

The utterance text set creation method according to claim 3, further comprising:
a phoneme labeling step of assigning phoneme labels to reading-tone natural speech data, in which a predetermined document is read aloud in the reading tone, and to X-tone natural speech data, in which the same document is read aloud in the X tone;
a first segment distribution extraction step of referring to an all-speech-unit variation storage unit, assigning a speech unit number to each speech unit obtained from each set of natural speech data, and extracting the appearance frequency distribution of the speech units;
a first duration distribution and F0 distribution extraction step of calculating the phoneme durations for each speech unit and extracting a duration distribution, and extracting the fundamental frequency patterns for each speech unit and extracting an F0 distribution;
a parameter distribution conversion function calculation step of calculating a segment distribution conversion function, a duration distribution conversion function, and an F0 distribution conversion function that convert the segment distribution, duration distribution, and F0 distribution obtained from the reading-tone natural speech data into the segment distribution, duration distribution, and F0 distribution obtained from the X-tone natural speech data; and
a step of storing the segment distribution conversion function, the duration distribution conversion function, and the F0 distribution conversion function in the storage unit before the utterance text set candidate is created.

An utterance text set creation device for creating a text set to be read aloud by a speaker when constructing a speech unit DB for speech synthesis, comprising:
a storage unit that stores in advance a parameter distribution conversion function for converting a parameter distribution obtained from reading-tone natural speech data into a parameter distribution obtained from natural speech data of a target X tone;
an utterance text set candidate creation unit that randomly extracts a predetermined number of sentences from a large-volume Japanese sentence DB and creates an utterance text set candidate;
a second parameter distribution extraction unit that performs speech synthesis processing on the utterance text set candidate with a speech synthesis program, obtains predetermined parameters from the speech synthesis data, and extracts a distribution of the obtained parameters;
a parameter distribution conversion unit that retrieves the parameter distribution conversion function from the storage unit and converts the parameter distribution obtained from the speech synthesis data using the parameter distribution conversion function; and
an evaluation unit that evaluates the utterance text set candidate using the converted parameter distribution.

The utterance text set creation device according to claim 5, further comprising:
a first parameter distribution extraction unit that obtains predetermined parameters from reading-tone natural speech data, in which a predetermined document is read aloud in the reading tone, and from X-tone natural speech data, in which the same document is read aloud in the X tone, and extracts the respective parameter distributions; and
a parameter distribution conversion function calculation unit that calculates a parameter distribution conversion function for converting the parameter distribution obtained from the reading-tone natural speech data into the parameter distribution obtained from the X-tone natural speech data,
wherein the storage unit stores the parameter distribution conversion function before the utterance text set candidate is created.

An utterance text set creation device for creating a text set to be read aloud by a speaker when constructing a speech unit DB for speech synthesis, comprising:
a storage unit that stores in advance a segment distribution conversion function, a duration distribution conversion function, and an F0 distribution conversion function that respectively convert the appearance frequency distribution of the speech units (hereinafter "segment distribution"), the distribution of appearance frequencies of phoneme durations for each speech unit (hereinafter "duration distribution"), and the distribution of appearance frequencies of fundamental frequency patterns for each speech unit (hereinafter "F0 distribution"), each obtained from reading-tone natural speech data, into the segment distribution, the duration distribution for each speech unit, and the F0 distribution for each speech unit obtained from natural speech data of a target X tone;
an utterance text set candidate creation unit that randomly extracts a predetermined number of sentences from a large-volume Japanese sentence DB and creates an utterance text set candidate;
a phoneme sequence, fundamental frequency pattern, and phoneme duration extraction unit that performs speech synthesis processing on the utterance text set candidate with a speech synthesis program and extracts a phoneme sequence, fundamental frequency patterns, and phoneme durations from the speech synthesis data;
a second segment distribution extraction unit that obtains the appearance frequency of each speech unit from the phoneme sequence and extracts a segment distribution;
a second duration distribution extraction unit that obtains the phoneme durations for each speech unit and extracts a duration distribution;
a second F0 distribution extraction unit that obtains the fundamental frequency patterns for each speech unit and extracts an F0 distribution;
a parameter distribution conversion unit that retrieves the segment distribution conversion function, the duration distribution conversion function, and the F0 distribution conversion function from the storage unit and uses them to convert, respectively, the segment distribution, the duration distribution, and the F0 distribution obtained from the speech synthesis data; and
an evaluation unit that calculates an evaluation function using the converted segment distribution, duration distribution, and F0 distribution and evaluates the utterance text set candidate.

The utterance text set creation device according to claim 7, further comprising:
a phoneme labeling unit that assigns phoneme labels to reading-tone natural speech data, in which a predetermined document is read aloud in the reading tone, and to X-tone natural speech data, in which the same document is read aloud in the X tone;
a first segment distribution extraction unit that refers to an all-speech-unit variation storage unit, assigns a speech unit number to each speech unit obtained from each set of natural speech data, and extracts a segment distribution;
a first duration distribution extraction unit that calculates the phoneme durations for each speech unit and extracts a duration distribution;
a first F0 distribution extraction unit that extracts the fundamental frequency patterns for each speech unit and extracts an F0 distribution; and
a parameter distribution conversion function calculation unit that calculates a segment distribution conversion function, a duration distribution conversion function, and an F0 distribution conversion function that convert the segment distribution, duration distribution, and F0 distribution obtained from the reading-tone natural speech data into the segment distribution, duration distribution, and F0 distribution obtained from the X-tone natural speech data,
wherein the storage unit stores the segment distribution conversion function, the duration distribution conversion function, and the F0 distribution conversion function before the utterance text set candidate is created.

An utterance text set creation program for causing a computer to function as the utterance text set creation device according to any one of claims 5 to 8.