JP4533255B2

JP4533255B2 - Speech synthesis apparatus, speech synthesis method, speech synthesis program, and recording medium therefor

Info

Publication number: JP4533255B2
Application number: JP2005186454A
Authority: JP
Inventors: 光昭磯貝; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-06-27
Filing date: 2005-06-27
Publication date: 2010-09-01
Anticipated expiration: 2025-06-27
Also published as: JP2007004011A

Abstract

PROBLEM TO BE SOLVED: To provide a waveform-concatenation speech synthesizer and method that can obtain synthesized speech having natural intonation.
SOLUTION: The synthesizer stores in memory a speech waveform database and a speech information database whose entries hold prosodic information containing F0 pattern fine information and F0 pattern outline information of the speech. It generates prosodic information A, containing F0 pattern information, from the phoneme sequence obtained by analyzing the input text. Following the phoneme sequence, it then calculates a distance measure between prosodic information A and prosodic information B of the entries in the speech information database (including the cost between the F0 pattern information in prosodic information A and the F0 pattern outline information in prosodic information B), selects from the speech information database the entries whose prosodic information minimizes the calculation result, reads the speech waveform data from the speech waveform database according to the selected entries, and finally synthesizes speech by connecting the data.
COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a speech synthesizer, a speech synthesis method, a speech synthesis program, and a recording medium therefor that take text as input and output speech corresponding to the text, and more specifically to a waveform-concatenation speech synthesis technology that synthesizes speech by selecting and connecting speech waveform data.

In recent speech synthesis technology, a waveform-concatenation speech synthesis method has been proposed in which a speech waveform database is built from a large amount of real-voice data, ranging from several tens of minutes to several tens of hours, speech waveforms of appropriate length are selected from the database on suitable criteria according to the input text, and the selected waveforms are connected to create synthesized speech (see Patent Document 1).

FIG. 1 shows a configuration example of a speech synthesizer for such a waveform-concatenation speech synthesis method.
The speech synthesizer (1) comprises an external storage device (2) such as a hard disk, a text analysis unit (10), a prosody generation unit (11), a speech waveform selection unit (12), and a speech synthesis unit (13).

This is described in more detail. The speech synthesizer (1) takes text as input and outputs synthesized speech. The external storage device (2) stores a speech waveform database (3) and a speech information database (4). The speech waveform database (3) is a set of speech waveform data (speech waveform segments) obtained by applying well-known A/D conversion to speech data of words and sentences read aloud and cutting the result into synthesis units suitable for assembling synthesized speech (for example, phonemes). It is stored in a storage area of the external storage device (2).

The speech information database (4) is, for example as shown in FIG. 2, a data structure (table) of entries in which various pieces of information are associated with each phoneme, the phoneme being the unit (synthesis unit) suitable for assembling synthesized speech. It is stored in a storage area of the external storage device (2). Each entry of the speech information database (4) shown in FIG. 2 consists of a speech waveform segment number, which is the serial number of the speech waveform segment; phoneme label information indicating the utterance content; phoneme duration information indicating how long the phoneme is uttered; power information obtained by normalizing the average power of the phoneme section; F0 pattern information representing the time course of the pitch of the phoneme; and information indicating the position of the speech waveform data in the speech waveform database (3) (hereinafter, speech waveform data position information).
An entry of the speech information database (4) and the corresponding speech waveform data (speech waveform segment) in the speech waveform database (3) are associated with each other through the speech waveform data position information in the speech information database (4).
The F0 pattern fine information of each entry represents an F0 pattern that retains, as is, the fine fluctuations of the F0 pattern of the real voice.

The text analysis unit (10) performs morphological analysis on the input text and outputs a phoneme string and an accent type corresponding to the input text.

The prosody generation unit (11) takes the information output by the text analysis unit (10) as input, estimates for each phoneme the F0 pattern (fundamental frequency pattern), the phoneme duration (length of the phoneme's utterance), and the power information (loudness of the speech), and outputs them. Here, "estimation" means fixing the information required for speech synthesis (F0 pattern, phoneme duration, power information) to specific values.

The speech waveform selection unit (12) follows the order of the phoneme string output by the text analysis unit (10) and takes as targets the per-phoneme F0 pattern, phoneme duration, and power information output by the prosody generation unit (11). It selects from the speech information database (4) the combination of speech waveform segments (the optimal speech waveform segment sequence) that has small distortion with respect to these targets and minimizes the connection distortion between adjacent segments when they are connected, and it outputs the speech waveform segment number of each segment in that sequence (in the order of the phoneme string output by the text analysis unit (10)). Dynamic programming or a similar method is used to determine the optimal speech waveform segment sequence.

The speech synthesis unit (13) takes as input the speech waveform segment numbers of the optimal speech waveform segment sequence selected by the speech waveform selection unit (12), reads the corresponding speech waveform data from the speech waveform database (3) (by referring to the speech waveform data position information), connects the data in order to generate continuous speech, and outputs it as synthesized speech.
Japanese Patent No. 2761552

The speech waveform data stored in the speech waveform database is real voice. The F0 pattern of real voice fluctuates finely and, as in the schematic diagram of FIG. 3, often has a fine structure in which the F0 pattern (reference numeral 101 in FIG. 3) dips, particularly in consonant portions (the /R/ portion in FIG. 3).

On the other hand, the target F0 pattern obtained by the prosody generation unit does not reflect the fine fluctuations of a real-voice F0 pattern. Therefore, in the segment selection process of the speech waveform selection unit (12), a mismatch can arise between the target F0 pattern (which does not reflect the fine fluctuations) and the F0 pattern of a selected speech waveform segment (which does).

This causes sound quality degradation due to unnatural intonation, F0 pattern gaps at the joints between speech waveform segments, and the like, so that synthesized speech with perceptually suitable intonation is not generated.

A typical example is described with reference to the schematic diagrams of FIGS. 4 and 5. In FIGS. 4 and 5, reference numeral 102 denotes the target F0 pattern. Here it is desirable that speech waveform segments with continuous F0 patterns (reference numerals 103a, 103b, 103c), as shown in FIG. 4, be selected, because they would allow synthesis of speech with natural intonation that reproduces an F0 pattern that is smooth, free of connection gaps, and has the fine structure of real voice. With the conventional speech synthesis method, however, the target F0 pattern does not reflect the fine fluctuations of a real-voice F0 pattern, so its distance from the (real-voice) F0 pattern becomes large in the consonant portion. As a result, speech waveform segments such as those in FIG. 5 (reference numerals 104a, 104b, 104c) are selected, whose F0 pattern distortion is small but whose F0 patterns are discontinuous.

In view of the above problem, an object of the present invention is to provide a waveform-concatenation speech synthesizer, speech synthesis method, speech synthesis program, and recording medium therefor that yield synthesized speech with natural intonation.

To solve the above problem, the present invention stores in storage means a speech waveform database that collects speech waveform data, and a speech information database consisting of entries that indicate the correspondence between prosodic information and the speech waveform data in the speech waveform database. The prosodic information includes F0 pattern information of the speech, which consists of F0 pattern fine information, retaining the fine fluctuations of the real-voice F0 pattern, and F0 pattern outline information, in which the finely fluctuating portions of the F0 pattern fine information are corrected. The input text is analyzed to generate a phoneme sequence, and prosodic information A, which includes the F0 pattern information of the speech for each synthesis unit, is generated from this phoneme sequence. Then, following the phoneme sequence, a distance measure (cost) between prosodic information A and prosodic information B of the entries in the speech information database is calculated; this includes calculating the cost between the F0 pattern information in prosodic information A and the F0 pattern outline information in prosodic information B. The entries whose prosodic information minimizes the calculation result are selected from the speech information database, the speech waveform data is read from the speech waveform database according to the selected entries, and the speech waveform data is connected to synthesize speech.
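The cost between prosodic information A and prosodic information B can be illustrated with a minimal sketch. The text does not give a formula here, so everything specific below is an assumption made for illustration: the mean squared difference as the F0 distance, the resampling to a common length, the squared-difference terms for duration and power, the weights `w_f0`, `w_dur`, `w_pow`, and the dictionary keys. The only point taken from the text is that the target F0 pattern is compared with the entry's outline information rather than its fine information.

```python
from typing import Dict, List, Sequence


def resample(contour: Sequence[float], n: int) -> List[float]:
    """Linearly resample a contour to n points (n >= 2) so contours of different lengths can be compared."""
    if len(contour) == 1:
        return [float(contour[0])] * n
    out = []
    for k in range(n):
        pos = k * (len(contour) - 1) / (n - 1)
        i = min(int(pos), len(contour) - 2)
        frac = pos - i
        out.append(contour[i] * (1.0 - frac) + contour[i + 1] * frac)
    return out


def f0_cost(target_f0: Sequence[float], entry_f0: Sequence[float], n: int = 16) -> float:
    """Mean squared difference (Hz^2) between two F0 contours (assumed distance)."""
    a, b = resample(target_f0, n), resample(entry_f0, n)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / n


def target_cost(target: Dict, entry: Dict,
                w_f0: float = 1.0, w_dur: float = 1.0, w_pow: float = 1.0) -> float:
    """Cost between prosodic information A (target) and B (database entry).

    The F0 term compares the target with the entry's OUTLINE contour.
    """
    return (w_f0 * f0_cost(target["f0"], entry["f0_outline"])
            + w_dur * (target["duration"] - entry["duration"]) ** 2
            + w_pow * (target["power"] - entry["power"]) ** 2)


# Hypothetical data: a smooth target and an entry whose real-voice contour dips.
target = {"f0": [120.0, 122.5, 125.0, 127.5, 130.0], "duration": 80.0, "power": 0.5}
entry = {
    "f0_fine": [120.0, 105.0, 95.0, 108.0, 130.0],      # dip of the real voice kept
    "f0_outline": [120.0, 122.5, 125.0, 127.5, 130.0],  # dip corrected
    "duration": 80.0,
    "power": 0.5,
}
print(target_cost(target, entry))               # cost against the outline contour
print(f0_cost(target["f0"], entry["f0_fine"]))  # what a fine-contour comparison would give
```

With this made-up data the entry matches the target exactly when its outline is used, while a comparison against the fine contour would penalize it for the consonant dip.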

Alternatively, instead of generating the F0 pattern outline information in advance, it may be generated each time the speech synthesis process that produces synthesized speech from text is run.

Furthermore, in addition to the cost between prosodic information A and prosodic information B, the cost between entries may be calculated (this calculation includes at least the cost between the F0 pattern fine information of the respective entries), and the entries whose prosodic information minimizes the calculation result may be selected from the speech information database.
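The cost between entries can be sketched in the same spirit. The function names, the squared F0 gap at the joint, and the weight `w_concat` are assumptions for illustration. The text only states that this cost includes a cost between the F0 pattern fine information of the entries.

```python
from typing import List, Sequence


def concat_cost(prev_fine_f0: Sequence[float], next_fine_f0: Sequence[float]) -> float:
    """Squared F0 gap (Hz^2) at the joint between two consecutive segments.

    Uses the FINE contours, because the joint is made from the real waveforms.
    """
    gap = next_fine_f0[0] - prev_fine_f0[-1]
    return gap * gap


def sequence_cost(target_costs: Sequence[float],
                  fine_contours: List[Sequence[float]],
                  w_concat: float = 1.0) -> float:
    """Total cost of a candidate segment sequence: target costs plus joint costs."""
    total = sum(target_costs)
    for prev, nxt in zip(fine_contours, fine_contours[1:]):
        total += w_concat * concat_cost(prev, nxt)
    return total


# Hypothetical contours: the first pair joins smoothly, the second leaves a 15 Hz gap.
print(concat_cost([100.0, 110.0], [110.0, 120.0]))
print(concat_cost([100.0, 110.0], [95.0, 100.0]))
```

Comparing fine contours at the joint while comparing outline contours against the target keeps the two concerns separate: shape matching ignores the consonant dips, and continuity is judged on what the waveforms actually contain.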

A speech synthesis program that makes a computer function as the speech synthesizer of the present invention allows the computer to operate as a speech synthesizer. A computer-readable program recording medium on which this speech synthesis program is recorded makes it possible to have other computers function as the speech synthesizer and to distribute the speech synthesis program.

According to the present invention, the calculation of the distance measure (cost) between prosodic information A, generated from the phoneme sequence obtained by text analysis, and prosodic information B of an entry in the speech information database includes the cost between the F0 pattern information in prosodic information A and the F0 pattern outline information in prosodic information B. This avoids a mismatch in F0 pattern shape between the target F0 pattern and the F0 pattern of the selected speech waveform segment, making it possible to obtain synthesized speech with natural intonation, in which sound quality degradation caused by unnatural intonation, in particular an unnatural accent type, is reduced.

In addition, by calculating the cost between entries in addition to the cost between prosodic information A and prosodic information B, and by including in that calculation the cost between the F0 pattern fine information of the respective entries, F0 gaps at the joints between speech waveform segments can be avoided. It is therefore possible to obtain synthesized speech with natural intonation, in which sound quality degradation caused by F0 pattern discontinuity is reduced.

<First Embodiment>
A first embodiment of the speech synthesizer, method, and so on according to the present invention is described below.
FIG. 6 is a hardware configuration diagram illustrating the hardware configuration of the speech synthesizer according to the first embodiment.
FIG. 7 is a functional configuration diagram illustrating the functional configuration of the speech synthesizer according to the first embodiment.
FIG. 8 is a diagram showing the processing flow of speech synthesis according to the first embodiment.
FIG. 9 is a diagram showing the data configuration of the speech information database according to the first embodiment.
FIG. 10 is a diagram (part 1) illustrating an example of a method for generating F0 pattern outline information.
FIG. 11 is a diagram (part 2) illustrating an example of a method for generating F0 pattern outline information.
FIG. 12 is a diagram (part 3) illustrating an example of a method for generating F0 pattern outline information.

As illustrated in FIG. 6, the speech synthesizer (500) has an input unit (51) to which a keyboard or the like can be connected, an output unit (52) to which a liquid crystal display or the like can be connected, a communication unit (53) to which a communication device (for example, a communication cable) capable of communicating with the outside of the speech synthesizer (500) can be connected, a CPU (Central Processing Unit) (54) [which may include a cache memory or the like], a RAM (55) and a ROM (56) as memory, an external storage device (57) such as a hard disk, and a bus (58) that connects the input unit (51), output unit (52), communication unit (53), CPU (54), RAM (55), ROM (56), and external storage device (57) so that data can be exchanged among them. If necessary, the speech synthesizer (500) may also be provided with a device (drive) that can read and write a storage medium such as a CD-ROM.

The text input to the speech synthesizer (500) may be entered through the input unit (51), but in this embodiment the text is assumed to be stored in advance in the external storage device (57). The present invention places no particular restriction on the type of text; in this embodiment the text is Japanese written in a mixture of kanji and kana.

The external storage device (57) of the speech synthesizer (500) stores a program for speech synthesis and the data required for the processing of this program. Data obtained by the processing of these programs is stored as appropriate in the RAM, the external storage device, or the like.

More specifically, a predetermined storage area of the external storage device (57) stores, as data required for the processing of the program, a speech waveform database (571) that collects speech waveform data for each synthesis unit (the phoneme in this embodiment; other units, such as the pitch of the speech or CV syllables, are also possible), and a speech information database (572) consisting of entries that indicate the correspondence between prosodic information, which includes the F0 pattern information of the speech for each phoneme, and the speech waveform data in the speech waveform database.

The speech waveform data (speech waveform segments) in the speech waveform database (571) is obtained by applying well-known A/D conversion to real-voice speech data of words and sentences read aloud and cutting it into phoneme units, for example by applying a window function as appropriate.

The speech information database (572) is, for example as shown in FIG. 9, a data structure (table) of entries in which various pieces of information are associated with each phoneme as the unit. Each entry in the speech information database (572) shown in FIG. 9 consists of a speech waveform segment number, which is the serial number of the speech waveform segment; phoneme label information indicating the utterance content; phoneme duration information indicating how long the phoneme is uttered; power information obtained by normalizing the average power of the phoneme section; F0 pattern information representing the time course of the pitch (frequency) of the phoneme; and information indicating the position of the speech waveform data in the speech waveform database (571) (hereinafter, speech waveform data position information). An entry of the speech information database (572) and the corresponding speech waveform data (speech waveform segment) in the speech waveform database (571) are associated with each other through the speech waveform data position information in the speech information database (572).

The F0 pattern information of each entry in the speech information database (572) consists of F0 pattern fine information and F0 pattern outline information. The F0 pattern fine information represents an F0 pattern that retains, as is, the fine fluctuations of the real-voice F0 pattern. The F0 pattern outline information, by contrast, represents an F0 pattern in which the finely fluctuating portions of the F0 pattern in the F0 pattern fine information have been corrected.
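The entry layout described for the speech information database (572) can be pictured as a small record type. This is only a sketch: the class name, field names, units, the list-of-samples representation of the waveform database, and the stored `waveform_length` are assumptions that do not come from the text, which specifies only which pieces of information an entry holds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeechInfoEntry:
    """One entry (row) of the speech information database, one per phoneme segment."""
    segment_number: int      # serial number of the speech waveform segment
    phoneme_label: str       # utterance content, e.g. "A", "R", "U"
    duration_ms: float       # phoneme duration (the unit is an assumption)
    power: float             # normalized average power of the phoneme section
    f0_fine: List[float]     # F0 pattern fine information: real-voice contour
    f0_outline: List[float]  # F0 pattern outline information: corrected contour
    waveform_offset: int     # position of the waveform data in the waveform database
    waveform_length: int     # number of samples (assumed to be stored with the position)


def load_waveform(entry: SpeechInfoEntry, waveform_db: List[int]) -> List[int]:
    """Return the waveform samples that an entry points to."""
    start = entry.waveform_offset
    return waveform_db[start:start + entry.waveform_length]
```

Keeping both contours in the same entry lets segment selection consult the outline for shape matching and the fine contour for joint continuity without touching the waveform data itself.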

An example of a method for generating the F0 pattern outline information is now described with reference to FIGS. 10, 11, and 12. The F0 pattern outline information is generated by correcting the finely fluctuating portions of the F0 pattern in the F0 pattern fine information; more specifically, by correcting the finely fluctuating portions that are associated with consonants.

As an example, the following describes a process that obtains the F0 pattern of the F0 pattern outline information by removing the fine fluctuation of the F0 pattern in a consonant section (here, the /R/ section). Reference numeral 201 in FIG. 10 indicates the F0 pattern of the phonemes /A/ /R/ /U/ of a certain utterance.

First, within each of the vowel sections on either side of the consonant section (here, the /A/ and /U/ sections), the peak point with the highest F0 pattern value is found. This peak point can be found by referring to the F0 pattern fine information and the phoneme duration in the entry. In FIG. 11, the point indicated by reference numeral 202 in the /A/ section and the point indicated by reference numeral 203 in the /U/ section are the peak points with the highest F0 pattern value in the respective vowel sections.

Next, linear interpolation is performed between the peak points found in the vowel sections. In this example, the broken line indicated by reference numeral 204 in FIG. 11 shows the F0 pattern obtained by the linear interpolation. Although linear interpolation is used as the correction method here, the correction is not limited to it and may instead use, for example, spline interpolation. This processing yields the F0 patterns shown in FIG. 12 (reference numerals 205a, 205b, 205c), which are the F0 pattern outline information of the respective phonemes.

As is clear from the above description, note that the corrected F0 pattern is not limited to the F0 pattern of the consonant portion: part of the F0 pattern of the vowel portions (in the above example, the portion of /A/ from its peak point to its end point and the portion of /U/ from its start point to its peak point) may also be corrected.
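The two steps described above (find the F0 peak in each flanking vowel section, then interpolate linearly between the peaks) can be sketched as follows. The frame-wise list representation, the function name, and the sample values are assumptions for illustration. The text describes the procedure only in terms of the figures.

```python
from typing import List, Tuple


def f0_outline(f0: List[float], consonant: Tuple[int, int]) -> List[float]:
    """Replace the contour between the two vowel peaks by a straight line.

    f0        : frame-wise fine F0 contour of a vowel-consonant-vowel sequence
    consonant : (start, end) frame indices of the consonant section, end exclusive
    """
    c_start, c_end = consonant
    # Step 1: peak (highest F0 value) inside each vowel section flanking the consonant.
    left_peak = max(range(0, c_start), key=lambda i: f0[i])
    right_peak = max(range(c_end, len(f0)), key=lambda i: f0[i])

    # Step 2: linear interpolation between the two peak points.
    outline = list(f0)
    span = right_peak - left_peak
    for i in range(left_peak + 1, right_peak):
        w = (i - left_peak) / span
        outline[i] = (1.0 - w) * f0[left_peak] + w * f0[right_peak]
    return outline


# Hypothetical /A/ /R/ /U/ contour: frames 0-2 are /A/, 3-4 are /R/, 5-8 are /U/.
fine = [100.0, 120.0, 110.0, 90.0, 80.0, 95.0, 115.0, 130.0, 125.0]
outline = f0_outline(fine, (3, 5))
# Per-phoneme outline information, corresponding to 205a, 205b, 205c in FIG. 12.
a_part, r_part, u_part = outline[:3], outline[3:5], outline[5:]
print(outline)
```

In this made-up contour the dip in the /R/ frames disappears, and the tail of /A/ after its peak and the head of /U/ before its peak are also replaced by the interpolated line, which is the side effect on vowel portions noted above. Frames outside the two peaks are left unchanged.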

The external storage device (57) of the speech synthesizer (500) also stores a program for implementing a text analysis unit that analyzes the input text and generates a phoneme sequence; a program for implementing a prosody generation unit that generates, from the phoneme sequence, prosodic information including at least the F0 pattern information of the speech for each phoneme; a program for implementing a speech waveform selection unit that calculates a distance measure (cost) between this prosodic information and the prosodic information of entries in the speech information database and sequentially selects from the speech information database the entries whose prosodic information minimizes the calculation result; and a program for implementing a speech synthesis unit that reads speech waveform data from the speech waveform database according to the sequentially selected entries and connects the data to synthesize speech. Control programs for controlling the processing based on these programs are also stored as appropriate.

In the speech synthesizer (500) according to the first embodiment, each program stored in the external storage device (57) and the data required for its processing are read into the RAM (55) as needed and interpreted, executed, and processed by the CPU (54). As a result, the CPU (54) implements the predetermined functions (text analysis unit, prosody generation unit, speech waveform selection unit, speech synthesis unit), and speech synthesis is thereby achieved.

Next, the flow of speech synthesis in the speech synthesizer (500) is described step by step with reference to FIGS. 7 and 8.
The speech synthesizer (500) of the first embodiment comprises a text analysis unit (541), a prosody generation unit (542), a speech waveform selection unit (543), and a speech synthesis unit (544) (see FIG. 7).

First, the text analysis unit (541) reads the text stored in the external storage device (57), performs morphological analysis on it, and outputs a phoneme sequence representing the phoneme string, accent type, breath group (phrase) positions, and so on corresponding to the text (step S1).

In outline, the morphological analysis works as follows. The text analysis unit (541) converts the text into a phoneme string by referring to a word model, a kanji-to-kana conversion model, a kana-to-phoneme conversion model, and the like (these are also stored in the external storage device (57) as needed). When the text is Japanese, accents shift or disappear when several words combine to form a phrase or the like, so these rules (accent combination rules) are stored in advance as data, for example in the external storage device (57), and the text analysis unit (541) determines the accent type of the text according to them. Furthermore, Japanese text characteristically tends to carry one accent per semantic or grammatical unit, so these rules (phrase rules) are likewise stored in advance as data, for example in the external storage device (57), and the text analysis unit (541) follows them to determine, as a breath group, a run of several connected units that each carry one accent.
The outline given here is only one example of morphological analysis and is not intended to exclude other methods. Various morphological analysis methods can be used in the speech synthesizer and method of the present invention; since they are achieved by known techniques, their details are omitted.

The prosody generation unit (542) takes the information (phoneme sequence) output by the text analysis unit (541) as input, estimates for each phoneme the F₀ pattern (fundamental frequency pattern), the phoneme duration (the length of the phoneme's utterance), and the power information (the loudness of the speech), and outputs them (step S2). The phoneme duration and the power information are set as appropriate according to rules prepared in advance, based on the position of the phoneme within the breath group, the speaking rate, the phoneme environment before and after the phoneme, and so on. The F₀ pattern is obtained with, for example, the so-called Fujisaki model. As already stated, "estimation" means fixing the information required for speech synthesis (F₀ pattern, phoneme duration, power information) to specific values. Because known prosodic information generation methods can be used in the speech synthesis apparatus and method of the present invention, their details are omitted.

The speech waveform selection unit (543) follows the order of the phoneme sequence output by the text analysis unit (541) and treats the per-phoneme F₀ pattern, phoneme duration, and power information output by the prosody generation unit (542) as targets. From the speech information database (572) it selects the combination of speech waveform segments (the optimal speech waveform segment sequence) that has small distortion relative to these targets and that minimizes the connection distortion between adjacent segments when they are joined. It then outputs the speech waveform number of each segment in that sequence, in the order of the phoneme sequence output by the text analysis unit (541) (step S3). Hereinafter, a distance measure defined from distortion is called a cost. Dynamic programming or a similar method is used to determine the minimum-cost optimal speech waveform segment sequence.

The selection of the optimal speech waveform segment sequence in the speech waveform selection unit (543) is now described in more detail. The case described here selects the sequence phoneme by phoneme. The speech waveform selection unit (543) is assumed to receive, one at a time, the phoneme sequence corresponding to a single breath group rather than the entire phoneme sequence output by the text analysis unit (541) for the text. This is because the junction between the final phoneme of one breath group and the initial phoneme of the following breath group need not be considered. Depending on the length of the text and other factors, the entire phoneme sequence output by the text analysis unit (541) may of course be given to the speech waveform selection unit (543) as input.

Hereinafter, the i-th target phoneme of the phoneme sequence input to the speech waveform selection unit (543) (the target phoneme sequence) is written $t_i$, and the i-th candidate segment of the speech waveform segment sequence of entries read from the speech information database (572) (the candidate segment sequence) is written $u_i$. A candidate segment $u_i$ read from the speech information database (572) is one whose phoneme, as given by the phoneme label field of the database, is the same as that of the target phoneme $t_i$.

To determine the combination of candidate segments (the optimal candidate segment sequence) that minimizes, over one whole breath group, both the distortion between the target phoneme sequence and the candidate segment sequence and the connection distortion between adjacent candidate segments, the speech waveform selection unit (543) computes for each phoneme a distance measure representing the distortion between the target phoneme $t_i$ and the candidate segment $u_i$, as the cost $C(t_i, u_i)$.

As one example, the cost $C(t_i, u_i)$ is defined as the weighted sum of the sub-costs described below:

$$
\begin{aligned}
C(t_i, u_i) ={}& \mathrm{Wtf}\cdot\mathrm{Stf}(t_i, u_i) + \mathrm{Wtdur}\cdot\mathrm{Stdur}(t_i, u_i) + \mathrm{Wtpow}\cdot\mathrm{Stpow}(t_i, u_i) \\
&+ \mathrm{Wcf}\cdot\mathrm{Scf}(u_{i-1}, u_i) + \mathrm{Wcpow}\cdot\mathrm{Scpow}(u_{i-1}, u_i) + \mathrm{Wcenv}\cdot\mathrm{Scenv}(u_{i-1}, u_i)
\end{aligned}
\tag{1}
$$

$\mathrm{Stf}(t_i, u_i)$ represents the distortion between the F₀ pattern of the target phoneme $t_i$ and the F₀ pattern in the F₀ pattern outline information of the candidate segment $u_i$. Writing the F₀ pattern of $t_i$ as $\mathrm{Ft}(t_i)$ and the F₀ pattern of $u_i$ (in its F₀ pattern outline information) as $\mathrm{Fu}(u_i)$, it is the squared difference $\mathrm{Stf}(t_i, u_i) = \{\mathrm{Ft}(t_i) - \mathrm{Fu}(u_i)\}^2$. This is hereinafter called the target F₀ sub-cost.
Note that in the prior art $\mathrm{Fu}(u_i)$ is the F₀ pattern in the F₀ pattern fine information of $u_i$, whereas in the present invention it is the F₀ pattern in the F₀ pattern outline information of $u_i$.

$\mathrm{Stdur}(t_i, u_i)$ represents the distortion in duration between the target phoneme $t_i$ and the candidate segment $u_i$. Writing the duration of $t_i$ as $\mathrm{DURt}(t_i)$ and the duration of $u_i$ as $\mathrm{DURu}(u_i)$, it is the squared difference $\mathrm{Stdur}(t_i, u_i) = \{\mathrm{DURt}(t_i) - \mathrm{DURu}(u_i)\}^2$. This is hereinafter called the target duration sub-cost.

$\mathrm{Stpow}(t_i, u_i)$ represents the distortion in power between the target phoneme $t_i$ and the candidate segment $u_i$. Writing the power of $t_i$ as $\mathrm{POWt}(t_i)$ and the power of $u_i$ as $\mathrm{POWu}(u_i)$, it is the squared difference $\mathrm{Stpow}(t_i, u_i) = \{\mathrm{POWt}(t_i) - \mathrm{POWu}(u_i)\}^2$. This is hereinafter called the target power sub-cost.

$\mathrm{Scf}(u_{i-1}, u_i)$ represents the distortion of the F₀ pattern (in each case taken from the F₀ pattern fine information) at the junction between the candidate segment $u_i$ and the preceding candidate segment $u_{i-1}$. Writing the F₀ pattern value at the start point of $u_i$ as $\mathrm{FSu}(u_i)$ and the F₀ pattern value at the end point of $u_{i-1}$ as $\mathrm{FEu}(u_{i-1})$, it is the squared difference $\mathrm{Scf}(u_{i-1}, u_i) = \{\mathrm{FSu}(u_i) - \mathrm{FEu}(u_{i-1})\}^2$. This is hereinafter called the connection F₀ sub-cost.
Note that the F₀ pattern used in this connection F₀ sub-cost is the F₀ pattern in the F₀ pattern fine information.

$\mathrm{Scpow}(u_{i-1}, u_i)$ represents the distortion in power at the junction between the candidate segment $u_i$ and the preceding candidate segment $u_{i-1}$. Writing the power at the start point of $u_i$ as $\mathrm{POWSu}(u_i)$ and the power at the end point of $u_{i-1}$ as $\mathrm{POWEu}(u_{i-1})$, it is the squared difference $\mathrm{Scpow}(u_{i-1}, u_i) = \{\mathrm{POWSu}(u_i) - \mathrm{POWEu}(u_{i-1})\}^2$. This is hereinafter called the connection power sub-cost.

$\mathrm{Scenv}(u_{i-1}, u_i)$ represents the difference in phoneme environment between the candidate segment $u_i$ and the preceding candidate segment $u_{i-1}$. It is defined from the acoustic similarity (for example, spectral similarity) between the phoneme that precedes $u_i$ and the phoneme that follows $u_{i-1}$. This is hereinafter called the connection phoneme environment sub-cost. For example, if the preceding phoneme of $u_i$ and the following phoneme of $u_{i-1}$ are identical, then $\mathrm{Scenv}(u_{i-1}, u_i) = 0$. These values may, for example, be tabulated in advance as an acoustic similarity database, from which the value corresponding to the acoustic similarity between the preceding phoneme of $u_i$ and the following phoneme of $u_{i-1}$ is read as needed.

Here, Wtf is the weight for $\mathrm{Stf}(t_i, u_i)$, Wtdur is the weight for $\mathrm{Stdur}(t_i, u_i)$, Wtpow is the weight for $\mathrm{Stpow}(t_i, u_i)$, Wcf is the weight for $\mathrm{Scf}(u_{i-1}, u_i)$, Wcpow is the weight for $\mathrm{Scpow}(u_{i-1}, u_i)$, and Wcenv is the weight for $\mathrm{Scenv}(u_{i-1}, u_i)$.

Of these sub-costs, $\mathrm{Stf}(t_i, u_i)$, $\mathrm{Stdur}(t_i, u_i)$, and $\mathrm{Stpow}(t_i, u_i)$ are obtained from the difference between the target information (F₀ pattern, phoneme duration, power information) produced by the prosody generation unit (542) and the F₀ pattern, phoneme duration, and power information of the candidate segment $u_i$.

$\mathrm{Scf}(u_{i-1}, u_i)$, $\mathrm{Scpow}(u_{i-1}, u_i)$, and $\mathrm{Scenv}(u_{i-1}, u_i)$ are sub-costs obtained from the differences in F₀ pattern, power information, and phoneme environment between candidate segments.
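The weighted sum of equation (1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each F₀ pattern is reduced to a single representative value per phoneme, all weights are set to 1.0, the connection sub-costs are taken as zero for the first segment of a breath group (which has no predecessor), and `env_mismatch` is a hypothetical 0/1 stand-in for the acoustic similarity database. The class, field, and function names are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Target:
    f0: float     # Ft(t_i): target F0 (one representative value, an assumption)
    dur: float    # DURt(t_i): target phoneme duration
    power: float  # POWt(t_i): target power


@dataclass
class Candidate:
    f0_outline: float   # Fu(u_i): F0 taken from the outline information
    f0_start: float     # FSu(u_i): fine F0 at the segment's start point
    f0_end: float       # FEu(u_i): fine F0 at the segment's end point
    dur: float          # DURu(u_i)
    power: float        # POWu(u_i)
    power_start: float  # POWSu(u_i)
    power_end: float    # POWEu(u_i)
    prev_phoneme: str   # phoneme that preceded u_i in the recording
    next_phoneme: str   # phoneme that followed u_i in the recording


# Hypothetical weights; the source does not give numeric values.
UNIT_WEIGHTS: Dict[str, float] = {
    "tf": 1.0, "tdur": 1.0, "tpow": 1.0, "cf": 1.0, "cpow": 1.0, "cenv": 1.0,
}


def env_mismatch(prev: Candidate, cur: Candidate) -> float:
    """Placeholder Scenv: 0 when the phoneme after u_{i-1} equals the phoneme before u_i."""
    return 0.0 if prev.next_phoneme == cur.prev_phoneme else 1.0


def unit_cost(
    t: Target,
    u: Candidate,
    prev: Optional[Candidate] = None,
    w: Optional[Dict[str, float]] = None,
    scenv: Callable[[Candidate, Candidate], float] = env_mismatch,
) -> float:
    """C(t_i, u_i) of equation (1) as a weighted sum of six sub-costs."""
    w = UNIT_WEIGHTS if w is None else w
    # Target sub-costs: the F0 term uses the OUTLINE F0 of the candidate.
    cost = (
        w["tf"] * (t.f0 - u.f0_outline) ** 2
        + w["tdur"] * (t.dur - u.dur) ** 2
        + w["tpow"] * (t.power - u.power) ** 2
    )
    # Connection sub-costs: the F0 term uses the FINE F0 at the junction.
    if prev is not None:
        cost += (
            w["cf"] * (u.f0_start - prev.f0_end) ** 2
            + w["cpow"] * (u.power_start - prev.power_end) ** 2
            + w["cenv"] * scenv(prev, u)
        )
    return cost
```

Keeping `f0_outline` separate from `f0_start` and `f0_end` mirrors the distinction the text draws: the target F₀ sub-cost reads the outline information, while the connection F₀ sub-cost reads the fine information.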

The F₀ pattern, duration, and power information of the candidate segment $u_i$ needed to calculate these sub-costs can be obtained from the speech information database (572). The cost C for the phoneme sequence of one whole breath group is then obtained by equation (2), where N is the number of phonemes in the breath group.

By finding the optimal candidate segment sequence that minimizes C with a method such as dynamic programming, the speech waveform selection unit (543) selects the optimal speech waveform segment sequence for the targets of one breath group and obtains the speech waveform segment number of each segment in it, in the order of the phoneme sequence input to the speech waveform selection unit (543). The speech waveform selection unit (543) performs the same processing for every breath group (that is, for the entire phoneme sequence of the text) and outputs the speech waveform segment numbers of the optimal speech waveform segment sequence corresponding to the entire phoneme sequence.
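A dynamic-programming search of this kind can be sketched as a Viterbi-style pass over the candidates of one breath group. The sketch assumes that the total cost C is the sum of the per-phoneme costs, that each per-phoneme cost splits into a target part and a connection part, and that every target phoneme has at least one candidate. The function name `select_units` and the callback signatures are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")  # target phoneme type
U = TypeVar("U")  # candidate segment type


def select_units(
    targets: Sequence[T],
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[T, U], float],
    join_cost: Callable[[U, U], float],
) -> List[int]:
    """Return, for each target, the index of the candidate on the minimum-cost path."""
    n = len(targets)
    if n == 0:
        return []
    # best[i][k]: minimum accumulated cost of any path ending in candidates[i][k].
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    # back[i][k]: index of the predecessor candidate on that best path.
    back = [[-1] * len(candidates[0])]
    for i in range(1, n):
        row, ptr = [], []
        for u in candidates[i]:
            costs = [
                best[i - 1][j] + join_cost(p, u)
                for j, p in enumerate(candidates[i - 1])
            ]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j_min] + target_cost(targets[i], u))
            ptr.append(j_min)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [k]
    for i in range(n - 1, 0, -1):
        k = back[i][k]
        path.append(k)
    path.reverse()
    return path
```

With M candidates per phoneme and N phonemes, this evaluates on the order of N·M² connection costs instead of enumerating all Mᴺ combinations.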

The speech synthesis unit (544) takes as input the sequence of speech waveform segment numbers of the optimal speech waveform segment sequence selected by the speech waveform selection unit (543). It reads the speech waveform data corresponding to each number from the speech waveform database (571), referring to the speech waveform data position information of the entry identified by that number, connects the speech waveform data in order to generate continuous speech, and outputs the result as synthesized speech (step S4). The speech synthesis method in the speech synthesis unit (544) is realized by a known technique such as the waveform superposition method.

Because the cost calculation of the speech waveform selection unit (543) uses, for the target F₀ sub-cost, the F₀ pattern in the F₀ pattern outline information of each candidate segment, speech waveform segments whose outline F₀ patterns (reference numerals 302a, 302b, and 302c in FIG. 14) are close to the target F₀ pattern (reference numeral 301 in FIG. 13) become more likely to be selected, as shown schematically in FIGS. 13 and 14.

The advantage of using different kinds of F₀ pattern information for different sub-costs, as in this embodiment (the F₀ pattern in the F₀ pattern outline information for the target F₀ sub-cost, and the F₀ pattern in the F₀ pattern fine information for the connection F₀ sub-cost), is explained with reference to FIGS. 15 and 16. FIGS. 15 and 16 each show candidate speech waveform segments for the same target F₀ pattern. In both figures, reference numeral 401 denotes the target F₀ pattern.

In FIG. 15, reference numerals 402a, 402b, and 402c denote the F₀ patterns in the F₀ pattern fine information of certain speech waveform segments, and reference numerals 403a, 403b, and 403c denote the F₀ patterns in the F₀ pattern outline information corresponding to 402a, 402b, and 402c, respectively. Likewise, in FIG. 16, reference numerals 404a, 404b, and 404c denote the F₀ patterns in the F₀ pattern fine information of certain speech waveform segments, and reference numerals 405a, 405b, and 405c denote the F₀ patterns in the F₀ pattern outline information corresponding to 404a, 404b, and 404c, respectively.

When the F₀ pattern in the F₀ pattern fine information is used for the connection F₀ sub-cost, the F₀ patterns denoted 402a, 402b, 402c, 404a, 404b, and 404c are used to calculate the sub-cost. When the F₀ pattern in the F₀ pattern outline information is used for the connection F₀ sub-cost instead, the F₀ patterns denoted 403a, 403b, 403c, 405a, 405b, and 405c are used.

In this example, when the connection F₀ sub-cost is computed from the F₀ patterns in the F₀ pattern outline information (in FIG. 15, at the junction between 403a and 403b and the junction between 403b and 403c; in FIG. 16, at the junction between 405a and 405b and the junction between 405b and 405c), its value is about the same in both figures, as FIGS. 15 and 16 show. However, the distortion of the F₀ pattern relative to the target in the /U/ portion is smaller for the speech waveform segments of FIG. 15, so the speech waveform segments with the F₀ patterns of FIG. 15 (403a, 403b, 403c) are likely to be selected.

The speech synthesis unit would then read and connect the speech waveform data corresponding to the entries whose F₀ pattern information contains the F₀ patterns 403a, 403b, and 403c (which are F₀ patterns in the F₀ pattern outline information). That speech waveform data, however, has the properties of the F₀ patterns 402a, 402b, and 402c (which are F₀ patterns in the F₀ pattern fine information), and there is a marked connection distortion between the phonemes /A/ and /R/. Speech synthesized from such speech waveform data loses its smoothness and sounds unnatural. In this case, therefore, selecting the speech waveform segments of FIG. 16 gives smaller differences at the junctions of the natural-speech F₀ patterns, and the quality of the synthesized speech (its perceived smoothness and naturalness) is expected to be higher.

For this reason, the F₀ pattern in the F₀ pattern fine information is used for the connection F₀ sub-cost, so that the perceived smoothness and naturalness of the synthesized speech are not lost.

<Second Embodiment>
In the first embodiment, the F₀ pattern outline information is stored in advance as a component of the speech information database (572). In the second embodiment, by contrast, for reasons such as saving storage capacity in the external storage device, the F₀ pattern outline information is not generated in advance but is generated each time the speech synthesis process produces synthesized speech from text.
Functions and processes identical to those of the first embodiment are given the same reference numerals and their description is omitted; only the differences from the first embodiment are described.

The F₀ pattern information of each entry in the speech information database (672) of the second embodiment is the F₀ pattern fine information described in the first embodiment. In the second embodiment, the F₀ pattern outline information described in the first embodiment is not a component of the F₀ pattern information of each entry. That is, the speech information database (672) of the second embodiment has the data structure shown in FIG. 2.

In addition to the programs described in the first embodiment, the external storage device (57) of the speech synthesis apparatus (600) according to the second embodiment also stores a program for obtaining F₀ pattern outline information from the F₀ pattern in the F₀ pattern fine information of each entry. A control program for controlling the processing based on these programs is also stored as appropriate.

In the speech synthesis apparatus (600), each program stored in the external storage device (57) and the data needed for its processing are read into the RAM (55) as needed and are interpreted, executed, and processed by the CPU (54). As a result, the CPU (54) realizes the prescribed functions (text analysis unit, prosody generation unit, outline information generation unit, speech waveform selection unit, speech synthesis unit), and speech synthesis is thereby achieved.

Next, the flow of speech synthesis in the speech synthesis apparatus (600) is described step by step with reference to FIGS. 17 to 19.
FIG. 17 is a functional block diagram illustrating the functional configuration of the speech synthesis apparatus according to the second embodiment.
FIG. 18 is a diagram showing the processing flow of speech synthesis according to the second embodiment.
FIG. 19 is a diagram showing the processing flow for generating F₀ pattern outline information.

The speech synthesis apparatus (600) of the second embodiment comprises a text analysis unit (541), a prosody generation unit (542), an outline information generation unit (645), a speech waveform selection unit (543), and a speech synthesis unit (544) (see FIG. 17).

Steps S1 and S2 are the same as in the first embodiment, and their description is omitted.

After step S2, the outline information generation unit (645) reads the F₀ pattern information (F₀ pattern fine information) of the entries in the speech information database (672) and generates F₀ pattern outline information from this F₀ pattern fine information (step S2a). Because the generation of F₀ pattern outline information is as described in the first embodiment, the explanation follows that description (see FIGS. 10, 11, and 12).

F₀ pattern outline information is generated by correcting the finely fluctuating portions of the F₀ pattern in the F₀ pattern fine information; more specifically, by correcting the fine fluctuations of that F₀ pattern that are associated with consonants. As an example, the following describes how the F₀ pattern in the F₀ pattern outline information is obtained by removing the fine fluctuation of the F₀ pattern in a consonant interval (the /R/ interval).

First, within each of the vowel intervals on either side of the consonant interval (the /R/ interval), namely the /A/ and /U/ intervals, the peak point at which the F₀ pattern takes its highest value is found (step S2a1). These peak points can be found by referring to the F₀ pattern fine information and the phoneme durations in the entries. In FIG. 11, the point denoted 202 in the /A/ interval and the point denoted 203 in the /U/ interval are the peak points of highest F₀ value in their respective vowel intervals.

Next, linear interpolation is performed between the peak points found in the respective vowel intervals (step S2a2). The correction is not limited to linear interpolation; spline interpolation, for example, may be used instead. The broken line denoted 204 in FIG. 11 is the F₀ pattern obtained by linear interpolation between the peak points of the vowel intervals. This processing yields the F₀ patterns shown in FIG. 12 (reference numerals 205a, 205b, and 205c), which constitute the F₀ pattern outline information of the respective phonemes.
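Steps S2a1 and S2a2 can be sketched in Python as follows. The sketch assumes the fine F₀ pattern is available as one value per frame across a vowel, consonant, vowel run such as /A/ /R/ /U/, that every frame has a defined F₀ value, and that the vowel intervals are given as half-open frame ranges. The function name `outline_f0` and this frame-based representation are illustrative choices, not taken from the source.

```python
from typing import List, Tuple


def outline_f0(
    f0: List[float],
    left_vowel: Tuple[int, int],
    right_vowel: Tuple[int, int],
) -> List[float]:
    """Replace the F0 values between the two vowel peaks by a straight line.

    f0          : fine F0 values, one per frame, spanning vowel-consonant-vowel
    left_vowel  : half-open frame range [start, end) of the left vowel
    right_vowel : half-open frame range [start, end) of the right vowel
    """
    ls, le = left_vowel
    rs, r_end = right_vowel
    # Step S2a1: peak (highest F0) frame within each vowel interval.
    p1 = max(range(ls, le), key=f0.__getitem__)
    p2 = max(range(rs, r_end), key=f0.__getitem__)
    # Step S2a2: linear interpolation between the two peak points.
    out = list(f0)
    for k in range(p1 + 1, p2):
        frac = (k - p1) / (p2 - p1)
        out[k] = f0[p1] + frac * (f0[p2] - f0[p1])
    return out
```

Slicing the returned list at the phoneme boundaries then gives one outline pattern per phoneme. Frames outside the span between the two peaks are left unchanged, while the vowel frames between a peak and the consonant boundary are overwritten along with the consonant frames.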

As is clear from the above, it should be noted that the portion of the F₀ pattern (in the F₀ pattern fine information) that is corrected is not limited to the consonant part. Part of the F₀ pattern of the vowel parts may also be corrected (in the example above, the part of the F₀ pattern from the peak point of /A/ to its end point, and the part from the start point of /U/ to its peak point).

The outline information generation unit (645) adds the generated F₀ pattern outline information to the F₀ pattern information of the entry of the corresponding phoneme and stores it (step S2a3).
At this point, therefore, the F₀ pattern information of the entry contains both the F₀ pattern fine information and the F₀ pattern outline information (see FIG. 9).

Reading the F₀ pattern information (F₀ pattern fine information) of every entry and generating F₀ pattern outline information from it for each entry may be wasteful. The outline information generation unit (645) may therefore, for the phonemes contained in the phoneme sequence generated by the text analysis unit (541), refer to the phoneme label field of the speech information database (672), read the F₀ pattern information (F₀ pattern fine information) of the matching entries, and generate F₀ pattern outline information for each of those phonemes.
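This on-demand variant can be sketched as a filter over the database entries. In the sketch, entries are plain dictionaries with hypothetical keys `phoneme`, `f0_fine`, and `f0_outline`, and `make_outline` is a caller-supplied function. Both are assumptions made for illustration; in particular, the generation described above also needs the neighbouring vowel intervals, which this simplified callback signature leaves out.

```python
from typing import Callable, Dict, Iterable, List

Entry = Dict[str, object]


def add_outlines_on_demand(
    entries: List[Entry],
    target_phonemes: Iterable[str],
    make_outline: Callable[[object], object],
) -> int:
    """Generate outline information only for entries whose phoneme is needed.

    Returns the number of entries that were updated.
    """
    needed = set(target_phonemes)
    updated = 0
    for entry in entries:
        # Skip phonemes absent from the input text and entries already processed.
        if entry["phoneme"] in needed and "f0_outline" not in entry:
            entry["f0_outline"] = make_outline(entry["f0_fine"])
            updated += 1
    return updated
```

Entries whose phoneme label does not occur in the target phoneme sequence are never read beyond their label, which is the saving the paragraph above describes.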

Steps S3 and S4 are the same as in the first embodiment, and their description is omitted.

The speech synthesis apparatus and method of the present invention are not limited to the embodiments described above and may be modified as appropriate without departing from the gist of the invention. For example, a sub-cost indicating how closely the slope of the target F₀ pattern matches the slope of the candidate segment's F₀ pattern may be introduced as a further sub-cost in the cost calculation of the speech waveform selection unit (added to the right-hand side of equation (1)), and the F₀ pattern in the F₀ pattern outline information may be used to calculate it. Moreover, the processes described for the speech synthesis apparatus and method need not be executed in time series in the order described; they may be executed in parallel or individually, according to the processing capability of the apparatus that executes them or as needed.

When the processing functions of the speech synthesis apparatus are realized by a computer, the processing content of the functions that the apparatus should have is described by a program. Executing this program on the computer realizes the processing functions of the speech synthesis apparatus on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another way of executing the program, the computer may read the program directly from the portable recording medium and execute processing according to it, or the computer may execute processing according to the received program each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are implemented only through execution instructions and retrieval of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the speech synthesizer is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be implemented in hardware.

The speech synthesizer and method of the present invention are useful for text-to-speech conversion.

A diagram showing an example of the functional configuration of a conventional speech synthesizer.
A diagram showing an example of a speech information database.
A diagram showing an example of F₀ pattern fine information.
A diagram showing an example of the F₀ pattern of a speech waveform segment selection result.
A diagram showing an example of the F₀ pattern of a speech waveform segment selection result.
A hardware configuration diagram illustrating the hardware configuration of the speech synthesizer according to the first embodiment.
A functional configuration diagram illustrating the functional configuration of the speech synthesizer according to the first embodiment.
A diagram showing the processing flow of speech synthesis according to the first embodiment.
A diagram showing the data structure of the speech information database according to the first embodiment.
A diagram showing an example of a method for generating F₀ pattern outline information (part 1).
A diagram showing an example of a method for generating F₀ pattern outline information (part 2).
A diagram showing an example of a method for generating F₀ pattern outline information (part 3).
A diagram showing an example of a target F₀ pattern.
A diagram showing an example of the F₀ pattern of a speech waveform segment selection result.
A diagram showing the relationship between the target F₀ pattern and the F₀ pattern of the speech waveform segment selection result (part 1).
A diagram showing the relationship between the target F₀ pattern and the F₀ pattern of the speech waveform segment selection result (part 2).
A functional configuration diagram illustrating the functional configuration of the speech synthesizer according to the second embodiment.
A diagram showing the processing flow of speech synthesis according to the second embodiment.
A diagram showing the processing flow for generating F₀ pattern outline information.

Explanation of symbols

101 F₀ pattern of a speech waveform segment (F₀ pattern fine information)
102 Target F₀ pattern
103a F₀ pattern of a speech waveform segment (F₀ pattern fine information)
103b F₀ pattern of a speech waveform segment (F₀ pattern fine information)
103c F₀ pattern of a speech waveform segment (F₀ pattern fine information)
104a F₀ pattern of a speech waveform segment (F₀ pattern fine information)
104b F₀ pattern of a speech waveform segment (F₀ pattern fine information)
104c F₀ pattern of a speech waveform segment (F₀ pattern fine information)
201 F₀ pattern of a speech waveform segment (F₀ pattern fine information)
201 Peak point with the highest F₀ pattern value within a vowel section, in an example of the method for generating F₀ pattern outline information
203 Peak point with the highest F₀ pattern value within a vowel section, in an example of the method for generating F₀ pattern outline information
204 Linear interpolation result in an example of the method for generating F₀ pattern outline information
205a Generated F₀ pattern of a speech waveform segment (F₀ pattern outline information)
205b Generated F₀ pattern of a speech waveform segment (F₀ pattern outline information)
205c Generated F₀ pattern of a speech waveform segment (F₀ pattern outline information)
301 Target F₀ pattern
302a F₀ pattern of a speech waveform segment (F₀ pattern outline information)
302b F₀ pattern of a speech waveform segment (F₀ pattern outline information)
302c F₀ pattern of a speech waveform segment (F₀ pattern outline information)
401 Target F₀ pattern
402a F₀ pattern of a speech waveform segment (F₀ pattern fine information)
402b F₀ pattern of a speech waveform segment (F₀ pattern fine information)
402c F₀ pattern of a speech waveform segment (F₀ pattern fine information)
403a F₀ pattern of a speech waveform segment (F₀ pattern outline information)
403b F₀ pattern of a speech waveform segment (F₀ pattern outline information)
403c F₀ pattern of a speech waveform segment (F₀ pattern outline information)
404a F₀ pattern of a speech waveform segment (F₀ pattern fine information)
404b F₀ pattern of a speech waveform segment (F₀ pattern fine information)
404c F₀ pattern of a speech waveform segment (F₀ pattern fine information)
405a F₀ pattern of a speech waveform segment (F₀ pattern outline information)
405b F₀ pattern of a speech waveform segment (F₀ pattern outline information)
405c F₀ pattern of a speech waveform segment (F₀ pattern outline information)
500 Speech synthesizer
541 Text analysis unit
542 Prosody generation unit
543 Speech waveform selection unit
544 Speech synthesis unit
571 Speech waveform database
572 Speech information database
600 Speech synthesizer
645 Outline information generation unit
672 Speech information database

Claims

A speech synthesizer that generates a phoneme sequence from input text, reads speech waveform data corresponding to the phoneme sequence from a speech waveform database in units suitable for assembling synthesized speech (synthesis units), and synthesizes speech by concatenating the speech waveform data, the speech synthesizer comprising:
text analysis means for analyzing the input text and generating a phoneme sequence of the text;
prosody generation means for generating, from the phoneme sequence generated by the text analysis means, prosody information including at least F₀ pattern information of speech for each synthesis unit;
storage means for storing a speech waveform database in which speech waveform data is collected for each synthesis unit, and a speech information database consisting of entries each indicating the correspondence between prosody information, including F₀ pattern information of speech for each synthesis unit, and speech waveform data in the speech waveform database;
speech waveform selection means for calculating, along the phoneme sequence generated by the text analysis means, at least a distance measure (cost) between the prosody information generated by the prosody generation means and the prosody information of the entries in the speech information database, and for selecting from the speech information database the entries having the prosody information that minimizes the calculation result; and
speech synthesis means for reading speech waveform data from the speech waveform database according to the entries selected by the speech waveform selection means and synthesizing speech by concatenating the speech waveform data,
wherein the F₀ pattern information of speech for each synthesis unit in an entry of the speech information database consists of
F₀ pattern fine information, which retains the fine variations of the F₀ pattern of real speech, and F₀ pattern outline information, which is generated by interpolating between the peak points indicating the maximum value of the F₀ pattern fine information in each of the vowel portions on both sides of a consonant portion in the F₀ pattern fine information, and
wherein the cost calculation in the speech waveform selection means includes at least a calculation of the cost between the F₀ pattern information in the prosody information generated by the prosody generation means and the F₀ pattern outline information in the entries of the speech information database.
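
The generation of F₀ pattern outline information by interpolating between vowel peak points can be illustrated with the sketch below. It is a hedged example only: the claim states that interpolation is performed between the peak points of the vowel portions on both sides of a consonant portion, but the use of linear interpolation, the representation of vowel sections as `(start, end)` sample-index pairs, the decision to leave samples outside the outermost peaks unchanged, and the function names are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def vowel_peak_index(f0: Sequence[float], start: int, end: int) -> int:
    """Index of the highest F0 value within the vowel section f0[start:end]."""
    return max(range(start, end), key=lambda i: f0[i])


def generate_outline(f0_fine: Sequence[float],
                     vowel_sections: Sequence[Tuple[int, int]]) -> List[float]:
    """Build an outline contour from a fine contour.

    vowel_sections must be ordered and non-overlapping (assumption).
    Samples strictly between consecutive vowel peaks are replaced by a
    straight line joining the two peaks; all other samples are copied.
    """
    outline = list(f0_fine)
    peaks = [vowel_peak_index(f0_fine, s, e) for s, e in vowel_sections]
    for left, right in zip(peaks, peaks[1:]):
        span = right - left
        for i in range(left + 1, right):
            ratio = (i - left) / span
            outline[i] = f0_fine[left] + ratio * (f0_fine[right] - f0_fine[left])
    return outline
```

For example, with a fine contour that dips sharply across a consonant, `generate_outline([100, 120, 110, 80, 70, 90, 130, 125], [(0, 3), (5, 8)])` bridges the dip with a line from the first vowel peak (120) to the second (130), so the outline reflects the overall intonation rather than the local consonant-induced drop.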

A speech synthesizer that generates a phoneme sequence from input text, reads speech waveform data corresponding to the phoneme sequence from a speech waveform database in units suitable for assembling synthesized speech (synthesis units), and synthesizes speech by concatenating the speech waveform data, the speech synthesizer comprising:
text analysis means for analyzing the input text and generating a phoneme sequence of the text;
prosody generation means for generating, from the phoneme sequence generated by the text analysis means, prosody information including at least F₀ pattern information of speech for each synthesis unit;
storage means for storing a speech waveform database in which speech waveform data is collected for each synthesis unit, and a speech information database consisting of entries each indicating the correspondence between prosody information, including F₀ pattern fine information of speech for each synthesis unit, and speech waveform data in the speech waveform database;
outline information generation means for generating F₀ pattern outline information by interpolating between the peak points indicating the maximum value of the F₀ pattern fine information in each of the vowel portions on both sides of a consonant portion in the F₀ pattern fine information of speech for each synthesis unit in the entries of the speech information database;
speech waveform selection means for calculating, along the phoneme sequence generated by the text analysis means, at least a distance measure (cost) between the prosody information generated by the prosody generation means and the prosody information of the entries in the speech information database, and for selecting from the speech information database the entries having the prosody information that minimizes the calculation result; and
speech synthesis means for reading speech waveform data from the speech waveform database according to the entries selected by the speech waveform selection means and synthesizing speech by concatenating the speech waveform data,
wherein the cost calculation in the speech waveform selection means includes at least a calculation of the cost between the F₀ pattern information in the prosody information generated by the prosody generation means and the F₀ pattern outline information generated by the outline information generation means.

The speech synthesizer according to claim 1 or 2, wherein the speech waveform selection means
calculates, along the phoneme sequence generated by the text analysis means, the distance measure (cost) between the prosody information generated by the prosody generation means and the prosody information of the entries in the speech information database, together with the cost between entries, and selects from the speech information database the entries having the prosody information that minimizes the calculation result, and
wherein the calculation of the cost between entries includes at least a calculation of the cost between the F₀ pattern fine information of the respective entries.
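
One way to picture the selection described here, where the cost against the generated prosody uses the F₀ pattern outline information and the cost between entries uses the F₀ pattern fine information, is the dynamic-programming sketch below. It is only an illustration under assumptions: the source does not specify the search procedure or the cost formulas in this section, so the Viterbi-style search, the mean-absolute-difference target cost, the boundary-difference concatenation cost, and the `Entry` structure are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Entry:
    waveform_id: int          # hypothetical key into the speech waveform database
    f0_fine: List[float]      # F0 pattern fine information
    f0_outline: List[float]   # F0 pattern outline information


def target_cost(target_f0: Sequence[float], entry: Entry) -> float:
    """Assumed measure: mean absolute difference against the outline contour."""
    return sum(abs(t - o) for t, o in zip(target_f0, entry.f0_outline)) / len(target_f0)


def concat_cost(prev: Entry, nxt: Entry) -> float:
    """Assumed measure: F0 jump at the join, taken from the fine contours."""
    return abs(prev.f0_fine[-1] - nxt.f0_fine[0])


def select_entries(targets: Sequence[Sequence[float]],
                   candidates: Sequence[Sequence[Entry]]) -> List[Entry]:
    """Pick one entry per synthesis unit so that the summed cost is minimal."""
    best = [target_cost(targets[0], c) for c in candidates[0]]
    back: List[List[int]] = []
    for k in range(1, len(targets)):
        new_best, pointers = [], []
        for c in candidates[k]:
            costs = [best[j] + concat_cost(p, c)
                     for j, p in enumerate(candidates[k - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_min] + target_cost(targets[k], c))
            pointers.append(j_min)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    idx = min(range(len(best)), key=best.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [candidates[k][i] for k, i in enumerate(path)]
```

Using the outline contour for the target cost lets a candidate match the intended intonation even when its fine contour dips around consonants, while the fine contour is still what determines whether two adjacent segments join smoothly.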

A speech synthesis method that generates a phoneme sequence from input text, reads speech waveform data corresponding to the phoneme sequence from a speech waveform database in units suitable for assembling synthesized speech (synthesis units), and synthesizes speech by concatenating the speech waveform data, wherein
storage means stores a speech waveform database in which speech waveform data is collected for each synthesis unit, and a speech information database consisting of entries each indicating the correspondence between prosody information, including F₀ pattern information of speech for each synthesis unit, and speech waveform data in the speech waveform database, the method comprising:
a text analysis step in which text analysis means analyzes the input text and generates a phoneme sequence of the text;
a prosody generation step in which prosody generation means generates, from the phoneme sequence generated in the text analysis step, prosody information including at least F₀ pattern information of speech for each synthesis unit;
a speech waveform selection step in which speech waveform selection means calculates, along the phoneme sequence generated in the text analysis step, at least a distance measure (cost) between the prosody information generated in the prosody generation step and the prosody information of the entries in the speech information database stored in the storage means, and selects from the speech information database stored in the storage means the entries having the prosody information that minimizes the calculation result; and
a speech synthesis step in which speech synthesis means reads speech waveform data from the speech waveform database stored in the storage means according to the entries selected in the speech waveform selection step and synthesizes speech by concatenating the speech waveform data,
wherein the F₀ pattern information of speech for each synthesis unit in an entry of the speech information database stored in the storage means consists of
F₀ pattern fine information, which retains the fine variations of the F₀ pattern of real speech, and F₀ pattern outline information, which is generated by interpolating between the peak points indicating the maximum value of the F₀ pattern fine information in each of the vowel portions on both sides of a consonant portion in the F₀ pattern fine information, and
wherein the cost calculation in the speech waveform selection step includes at least a calculation of the cost between the F₀ pattern information in the prosody information generated in the prosody generation step and the F₀ pattern outline information in the entries of the speech information database.

A speech synthesis method that generates a phoneme sequence from input text, reads speech waveform data corresponding to the phoneme sequence from a speech waveform database in units suitable for assembling synthesized speech (synthesis units), and synthesizes speech by concatenating the speech waveform data, wherein
storage means stores a speech waveform database in which speech waveform data is collected for each synthesis unit, and a speech information database consisting of entries each indicating the correspondence between prosody information, including F₀ pattern fine information of speech for each synthesis unit, and speech waveform data in the speech waveform database, the method comprising:
a text analysis step in which text analysis means analyzes the input text and generates a phoneme sequence of the text;
a prosody generation step in which prosody generation means generates, from the phoneme sequence generated in the text analysis step, prosody information including at least F₀ pattern information of speech for each synthesis unit;
an outline information generation step in which outline information generation means generates F₀ pattern outline information by interpolating between the peak points indicating the maximum value of the F₀ pattern fine information in each of the vowel portions on both sides of a consonant portion in the F₀ pattern fine information of speech for each synthesis unit in the entries of the speech information database stored in the storage means;
a speech waveform selection step in which speech waveform selection means calculates, along the phoneme sequence generated in the text analysis step, at least a distance measure (cost) between the prosody information generated in the prosody generation step and the prosody information of the entries in the speech information database stored in the storage means, and selects from the speech information database the entries having the prosody information that minimizes the calculation result; and
a speech synthesis step in which speech synthesis means reads speech waveform data from the speech waveform database stored in the storage means according to the entries selected in the speech waveform selection step and synthesizes speech by concatenating the speech waveform data,
wherein the cost calculation in the speech waveform selection step includes at least a calculation of the cost between the F₀ pattern information in the prosody information generated in the prosody generation step and the F₀ pattern outline information generated in the outline information generation step.

The speech synthesis method according to claim 4 or 5, wherein the speech waveform selection step
calculates, along the phoneme sequence generated in the text analysis step, the distance measure (cost) between the prosody information generated in the prosody generation step and the prosody information of the entries in the speech information database, together with the cost between entries, and selects from the speech information database the entries having the prosody information that minimizes the calculation result, and
wherein the calculation of the cost between entries includes at least a calculation of the cost between the F₀ pattern fine information of the respective entries.

A speech synthesis program for causing a computer to function as the speech synthesizer according to any one of claims 1 to 3.

A computer-readable program recording medium on which the speech synthesis program according to claim 7 is recorded.