JP4580317B2 - Speech synthesis apparatus and speech synthesis program

Info

Publication number: JP4580317B2
Application number: JP2005270735A
Authority: JP
Inventors: 寛之世木; 徹都木
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2005-09-16
Filing date: 2005-09-16
Publication date: 2010-11-10
Anticipated expiration: 2025-09-16
Also published as: JP2007079476A

Description

The present invention relates to a speech synthesizer and a speech synthesis program that synthesize speech using accented phonemes.

Conventionally, speech synthesizers that output speech synthesis data based on input text data are known (see, for example, Patent Document 1). The speech synthesizer disclosed in Patent Document 1 has a speech database in which phonemes and their utterance times are recorded. It decomposes the input text data into phonemes, searches the speech database using the phoneme as the search unit, and outputs as speech synthesis data the search result that minimizes the sum of the concatenation cost and the phonological-prosodic cost.

Also known, in speech recognition, is a method of clustering phonemes as triphones, that is, phonemes that take into account the phonemes placed immediately before and after them (the preceding and following phoneme environment) (see, for example, Non-Patent Document 1).
Patent Document 1: JP H10-49193 A (paragraphs 0014 to 0018, FIG. 1)
Non-Patent Document 1: S. J. Young, J. J. Odell, and P. C. Woodland, "Tree-based state tying for high accuracy acoustics modeling", Proc. of ARPA Human Language Technology Workshop, 1994, vol. 1, pp. 307-312

However, the clustering method disclosed in Non-Patent Document 1 considers only the preceding and following phoneme environment, that is, only the characteristics of the spectral envelope, and therefore ignores the influence of accent, which arises from changes in the fundamental frequency. As a result, a speech synthesizer that uses this clustering method produces synthesized speech (speech synthesis data) that sounds unnatural because adjacent phonemes are joined unnaturally; in other words, the naturalness of the synthesized speech is degraded.

The present invention has been made in view of the above problem, and its object is to provide a speech synthesizer and a speech synthesis program that can prevent the loss of naturalness that makes synthesized speech (speech synthesis data) sound unnatural.

To achieve this object, the speech synthesizer according to claim 1 of the present invention comprises text data analysis means, phoneme clustering means, phoneme accent clustering means, a speech database, speech data search means, and speech data concatenation means.

With this configuration, the speech synthesizer uses the text data analysis means to perform morphological analysis on the input text data and convert it into accented phonemes. Here, a morpheme is the smallest character string that still carries meaning, and morphological analysis means breaking a sentence down to the level of morphemes and analyzing it. An accent includes at least one of pitch, intensity, length, and rhythm. The speech synthesizer then uses the phoneme clustering means to cluster the accented phonemes produced by the text data analysis means according to the phonemes placed before and after each accented phoneme, and uses the phoneme accent clustering means to cluster the accented phonemes clustered by the phoneme clustering means according to the accents of the phonemes placed before and after each accented phoneme. Here, clustering means classifying a set of phonemes into predetermined clusters (groups) by focusing on some attribute. Clustering by the accents of the phonemes placed before and after an accented phoneme uses at least one of the pitch, intensity, length, and rhythm of the preceding and following sounds.

The speech synthesizer then uses the speech data search means to compute, by Viterbi search, the concatenation score of each speech data sequence generated by combining speech data corresponding to the accented phonemes clustered by the phoneme accent clustering means, and searches the speech database for the speech data sequence with the maximum concatenation score. The speech database stores speech data corresponding to accented phonemes that have been clustered by the phonemes placed before and after them and by the accents of those phonemes. Here, Viterbi search is a technique that retains only the history of the hypothesis (a combination of speech data corresponding to accented phonemes) that gives the best (maximum) score. The speech synthesizer then uses the speech data concatenation means to concatenate the speech data sequence found by the speech data search means. This concatenation produces the speech synthesis data, which is output from the speech synthesizer.

The speech synthesizer according to claim 2 comprises text data analysis means, phoneme clustering means, phoneme accent clustering means, phoneme string storage means, a speech database, phoneme string dividing means, speech data search means, and speech data concatenation means.

With this configuration, the speech synthesizer uses the text data analysis means to perform morphological analysis on the input text data and convert it into accented phonemes. It then uses the phoneme clustering means to cluster the accented phonemes produced by the text data analysis means according to the phonemes placed before and after each accented phoneme, and uses the phoneme accent clustering means to cluster the accented phonemes clustered by the phoneme clustering means according to the accents of the phonemes placed before and after each accented phoneme. The phoneme string storage means stores in advance strings of accented phonemes that have been clustered by the phonemes placed before and after them and by the accents of those phonemes, and the speech database stores the speech data corresponding to the accented phoneme strings held in the phoneme string storage means. The speech synthesizer then uses the phoneme string dividing means to divide the text data, which has been converted into the phonemes clustered by the phoneme accent clustering means, into accented phoneme strings stored in the phoneme string storage means. The input text data is thus divided into a plurality of pre-registered phoneme strings.

The speech synthesizer then uses the speech data search means to compute, by Viterbi search, the concatenation score of each speech data sequence generated by combining speech data corresponding to the accented phoneme strings produced by the phoneme string dividing means, and searches the speech database for the speech data sequence with the maximum concatenation score. The speech synthesizer then uses the speech data concatenation means to concatenate the speech data sequence found by the speech data search means. This concatenation produces the speech synthesis data, which is output from the speech synthesizer.

The speech synthesis program according to claim 3 causes a computer to function as text data analysis means, phoneme clustering means, phoneme accent clustering means, speech data search means, and speech data concatenation means in order to synthesize speech corresponding to text data.

With this configuration, the speech synthesis program uses the text data analysis means to perform morphological analysis on the input text data and convert it into accented phonemes. It then uses the phoneme clustering means to cluster the accented phonemes produced by the text data analysis means according to the phonemes placed before and after each accented phoneme, and uses the phoneme accent clustering means to cluster the accented phonemes clustered by the phoneme clustering means according to the accents of the phonemes placed before and after each accented phoneme. The program then uses the speech data search means to compute, by Viterbi search, the concatenation score of each speech data sequence generated by combining speech data corresponding to the accented phonemes clustered by the phoneme accent clustering means, and searches for the speech data sequence with the maximum concatenation score. Finally, it uses the speech data concatenation means to concatenate the speech data sequence found by the speech data search means.

The speech synthesis program according to claim 4 causes a computer to function as text data analysis means, phoneme clustering means, phoneme accent clustering means, phoneme string dividing means, speech data search means, and speech data concatenation means in order to synthesize speech corresponding to text data.

With this configuration, the speech synthesis program uses the text data analysis means to perform morphological analysis on the input text data and convert it into accented phonemes. It then uses the phoneme clustering means to cluster the accented phonemes produced by the text data analysis means according to the phonemes placed before and after each accented phoneme, and uses the phoneme accent clustering means to cluster the accented phonemes clustered by the phoneme clustering means according to the accents of the phonemes placed before and after each accented phoneme. The program then uses the phoneme string dividing means to divide the text data, which has been converted into the phonemes clustered by the phoneme accent clustering means, into accented phoneme strings. Next, it uses the speech data search means to compute, by Viterbi search, the concatenation score of each speech data sequence generated by combining speech data corresponding to the accented phoneme strings produced by the phoneme string dividing means, and searches for the speech data sequence with the maximum concatenation score. Finally, it uses the speech data concatenation means to concatenate the speech data sequence found by the speech data search means.

According to the invention of claim 1 or claim 3, the preceding and following accent environment is taken into account for each accented phoneme in addition to the preceding and following phoneme environment, so synthesized speech with correct accents can be produced. As a result, the loss of naturalness that makes synthesized speech sound unnatural can be prevented.

According to the invention of claim 2 or claim 4, the input text data is divided into pre-registered accented phoneme strings, and these phoneme strings are used as the search units for searching the speech database. This prevents the search from selecting phonemes whose preceding and following phoneme environments differ, and shortens the time required for speech synthesis processing. As a result, degradation of the sound quality of the synthesized speech data can be prevented.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
(First embodiment)
[Configuration of speech synthesizer]
FIG. 1 is a functional block diagram showing the configuration of the speech synthesizer according to the first embodiment. The speech synthesizer 1 synthesizes the speech to be output as a speech data sequence based on input text data. It includes a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), an HDD (Hard Disk Drive), an input/output interface, and the like (not shown), and implements the functions described below when the CPU loads a program stored on the HDD or the like into the RAM. As shown in FIG. 1, the speech synthesizer 1 includes input means 2, storage means 3, and speech synthesis control means 4.

The input means 2 inputs text data to the speech synthesis control means 4 and consists of, for example, a keyboard and a mouse. The input means 2 may instead consist of a communication interface or the like that receives text data via a communication network (not shown) such as the Internet.
The storage means 3 includes pronunciation word storage means 31, connection probability storage means 32, a speech database 33, and a memory 34.

The pronunciation word storage means 31 stores accented phonemes word by word and is a general recording medium such as an HDD.
The connection probability storage means 32 stores word connection probabilities and is a general recording medium such as an HDD.
The speech database 33 stores unit speech (speech data) and is a general recording medium such as an HDD. The unit speech stored in the speech database 33 is based on phonemes in which every vowel is labeled with a high or low accent. These phonemes are accented phonemes that have been clustered by the phonemes placed before and after them and by the accents of those phonemes, and they are classified into a plurality of clusters. Here, an accent includes at least one of pitch, intensity, length, and rhythm. The unit of organization of the database is, for example, a "sentence" made up of a set of such accented phonemes. A "sentence number" may be assigned to each sentence, and the utterance time of each sentence (utterance start time and utterance end time) may also be recorded.
The memory 34 is rewritable storage such as a semiconductor memory or a magnetic memory and is used for processing by the speech synthesis control means 4 and the like.

The speech synthesis control means 4 synthesizes speech data for accented phonemes while taking into account both the preceding and following phoneme environment and the preceding and following accent environment. As shown in FIG. 1, it includes text data analysis means 41, phoneme clustering means 42, phoneme accent clustering means 43, speech data search means 44, speech data correction means 45, and speech data concatenation means 46.

The text data analysis means 41 refers to dictionary data (not shown) stored in the storage means 3, performs morphological analysis on the input text data, converts it into accented phonemes, and outputs them to the phoneme clustering means 42. Here, a morpheme is the smallest character string that still carries meaning, and morphological analysis means breaking a sentence down to the level of morphemes and analyzing it. For example, the Japanese sentence 「今日の天気は晴れです」 ("Today's weather is sunny") can be segmented as 「今日・の・天気・は・晴れ・です」. In this embodiment, the text data analysis means 41 refers to the pronunciation word storage means 31 and the connection probability storage means 32 to perform the conversion into accented phonemes. The text data analysis means 41 may use connection probabilities of parts of speech rather than of words. It may also perform the conversion based on pronunciation rules predetermined for word connections instead of connection probabilities.

The phoneme clustering means 42 clusters (classifies) the accented phonemes produced by the text data analysis means 41 according to the phonemes placed before and after each accented phoneme, and outputs the result to the phoneme accent clustering means 43. The phoneme clustering means 42 performs this clustering by, for example, the known method described in Non-Patent Document 1. Here, clustering means classifying a set of phonemes into predetermined clusters (groups) by focusing on some attribute. The attribute that the phoneme clustering means 42 focuses on is the type of the phonemes placed before and after the accented phoneme (the preceding and following phoneme environment).

The phoneme accent clustering means 43 clusters the accented phonemes clustered by the phoneme clustering means 42 according to the accents of the phonemes placed before and after each accented phoneme. That is, it clusters by focusing on the attribute of the accents of the neighboring phonemes (the preceding and following accent environment). In this embodiment, the phoneme accent clustering means 43 uses the pitch (high or low) of the preceding and following sounds as the accent environment.

Specifically, when the phoneme of interest (the center phoneme) is, for example, a Japanese vowel, the accents of the preceding and following vowels may each be high or low, so the phoneme accent clustering means 43 classifies the vowel into one of the following seven types (cluster “1” to cluster “7”). Here, high and low indicate whether the fundamental frequency is above or below a predetermined value.

(Cluster “1”) = (Low, Low, Low) or (High, High, High)
(Cluster “2”) = (Low, Low, High)
(Cluster “3”) = (Low, High, Low)
(Cluster “4”) = (Low, High, High)
(Cluster “5”) = (High, Low, Low)
(Cluster “6”) = (High, Low, High)
(Cluster “7”) = (High, High, Low)
Here, for example, cluster “3”, which is (Low, High, Low), denotes the cluster of phonemes for which the accent of the preceding vowel is low, the accent of the vowel itself (the center phoneme) is high, and the accent of the following vowel is low.

When the accent of the vowel is "low" and there is no preceding vowel, the accent of the preceding vowel is regarded as "low", and the vowel is classified into cluster “1” or cluster “2”. When the accent of the vowel is "low" and there is no following vowel, the accent of the following vowel is regarded as "low", and the vowel is classified into cluster “1” or cluster “5”.
Similarly, when the accent of the vowel is "high" and there is no preceding vowel, the accent of the preceding vowel is regarded as "high", and the vowel is classified into cluster “1” or cluster “7”. When the accent of the vowel is "high" and there is no following vowel, the accent of the following vowel is regarded as "high", and the vowel is classified into cluster “1” or cluster “4”.
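The seven vowel clusters and the boundary rules above can be sketched as a small lookup. This is only an illustration under stated assumptions: the names `VOWEL_CLUSTERS`, `vowel_cluster`, and `cluster_vowel_sequence` are hypothetical, and accents are represented as the strings "high" and "low", which the description does not prescribe.

```python
from typing import List, Optional

# (preceding vowel, center vowel, following vowel) -> cluster number,
# following the seven vowel clusters listed in the description.
VOWEL_CLUSTERS = {
    ("low", "low", "low"): 1,
    ("high", "high", "high"): 1,
    ("low", "low", "high"): 2,
    ("low", "high", "low"): 3,
    ("low", "high", "high"): 4,
    ("high", "low", "low"): 5,
    ("high", "low", "high"): 6,
    ("high", "high", "low"): 7,
}


def vowel_cluster(prev: Optional[str], center: str, nxt: Optional[str]) -> int:
    """Return the accent cluster (1 to 7) of a center vowel.

    A missing neighbor (None) is regarded as having the same accent as the
    center vowel, which is the boundary rule given in the description.
    """
    if prev is None:
        prev = center
    if nxt is None:
        nxt = center
    return VOWEL_CLUSTERS[(prev, center, nxt)]


def cluster_vowel_sequence(accents: List[str]) -> List[int]:
    """Assign a cluster to every vowel in a sequence of 'high'/'low' accents."""
    clusters = []
    for i, center in enumerate(accents):
        prev = accents[i - 1] if i > 0 else None
        nxt = accents[i + 1] if i + 1 < len(accents) else None
        clusters.append(vowel_cluster(prev, center, nxt))
    return clusters
```

For instance, `vowel_cluster(None, "low", "high")` falls under the first boundary rule and yields cluster 2, one of the two clusters the description allows for a low vowel with no preceding vowel.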

When the phoneme of interest (the center phoneme) is, for example, a Japanese consonant, the phoneme accent clustering means 43 classifies it into one of the following three types (cluster “8” to cluster “10”), where each cluster is written as (accent of the preceding vowel, accent of the following vowel). When there is no preceding or following vowel, the consonant is classified into cluster “8”.
(Cluster “8”) = (Low, Low) or (High, High)
(Cluster “9”) = (Low, High)
(Cluster “10”) = (High, Low)
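The consonant rule can be sketched in the same way. The function name `consonant_cluster` and the "high"/"low" string representation are assumptions made for this illustration only.

```python
from typing import Optional

_VALID = {"high", "low"}


def consonant_cluster(prev_vowel: Optional[str], next_vowel: Optional[str]) -> int:
    """Return the accent cluster (8, 9, or 10) of a center consonant.

    The arguments are the accents of the preceding and following vowels.
    None means that the vowel does not exist, in which case the description
    assigns cluster 8.
    """
    for accent in (prev_vowel, next_vowel):
        if accent is not None and accent not in _VALID:
            raise ValueError(f"unknown accent: {accent!r}")
    if prev_vowel is None or next_vowel is None:
        return 8                      # no vowel before or after
    if prev_vowel == next_vowel:
        return 8                      # (Low, Low) or (High, High)
    if (prev_vowel, next_vowel) == ("low", "high"):
        return 9                      # (Low, High)
    return 10                         # (High, Low)
```

A consonant between a low vowel and a high vowel therefore gets `consonant_cluster("low", "high") == 9`, while an utterance-initial consonant gets cluster 8 regardless of the following vowel.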

The speech data search means 44 computes, by Viterbi search, the concatenation score of each speech data sequence generated by combining speech data corresponding to the accented phonemes clustered by the phoneme accent clustering means 43, searches the speech database 33 for the speech data sequence with the maximum concatenation score, and outputs it to the speech data correction means 45. The speech data search means 44 uses the phoneme as its search unit. Viterbi search is a technique that retains only the history of the hypothesis (a combination of speech data corresponding to accented phonemes) that gives the best (maximum) score.
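The search described above can be illustrated with a generic Viterbi dynamic program over candidate units. This is a minimal sketch, not the patented procedure: the function `viterbi_best_sequence` and its `score` callback are hypothetical, and the pairwise score is supplied by the caller rather than computed from formula (1).

```python
from typing import Callable, List, Sequence, TypeVar

U = TypeVar("U")


def viterbi_best_sequence(
    candidates: Sequence[Sequence[U]],
    score: Callable[[U, U], float],
) -> List[U]:
    """Pick one candidate per position so that the summed pairwise score
    of adjacent candidates is maximal.

    candidates[i] holds the speech units available for the i-th phoneme.
    For every candidate only the best-scoring history is kept.
    """
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every position needs at least one candidate")

    best = [[0.0] * len(candidates[0])]   # best cumulative score per candidate
    back = [[-1] * len(candidates[0])]    # index of the best predecessor

    for i in range(1, len(candidates)):
        row_best, row_back = [], []
        for cur in candidates[i]:
            totals = [
                best[i - 1][k] + score(prev, cur)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_best = max(range(len(totals)), key=totals.__getitem__)
            row_best.append(totals[k_best])
            row_back.append(k_best)
        best.append(row_best)
        back.append(row_back)

    # Trace back from the best final candidate.
    k = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][k])
        k = back[i][k]
    path.reverse()
    return path
```

As a usage illustration, units could be tuples of (name, starting F0, ending F0) with a score that penalizes the F0 jump at each join, for example `lambda a, b: -abs(a[2] - b[1])`. That score is a stand-in chosen for the example and is much simpler than the concatenation score the description defines.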

The concatenation score is calculated as follows. Let the two accented phonemes be phoneme A and phoneme B, and suppose that phoneme B is connected after phoneme A. The concatenation score Sc(A, B) in this case can be obtained, for example, by the following formula (1).

In formula (1), $p^{E}_{A}$ is the fundamental frequency at the end of phoneme A, $p^{I}_{B}$ is the fundamental frequency at the beginning of phoneme B, $c^{E}_{jA}$ is the feature value at the end of phoneme A in the j-th dimension, and $c^{I}_{jB}$ is the feature value at the beginning of phoneme B in the j-th dimension.

Also in formula (1), the terms labeled (a) and (b) are the variances, in the j-th dimension, of the hidden Markov models (HMMs) of the clusters $T^{E}_{A}$ and $T^{I}_{B}$ that contain the triphone at the end of phoneme A (the triphone itself if phoneme A is a triphone); the terms labeled (c) and (d) are the means, in the j-th dimension, of the HMM of the cluster c(A) that contains the triphone at the end of phoneme A; d is the total number of feature dimensions; $\omega_7$ and $\omega_8$ are positive weights; and a is a positive constant. $\delta_{AB}$ is 0 when phoneme A and phoneme B appear consecutively in the speech database 33 and 1 otherwise.

The speech data correction means 45 corrects the speech data sequence found by the speech data search means 44 using the speech data sequences that precede and follow it. Specifically, the speech data correction means 45 averages the fundamental frequency of the speech data sequence found by the speech data search means 44, corrects the deviation of each item of speech data from that average, and outputs the corrected speech data sequence to the speech data concatenation means 46. The method described in JP H2-47700 A is applied for this correction. The speech data correction means 45 is not an essential component and may be omitted from the speech synthesizer 1.

The speech data concatenation means 46 concatenates the items of speech data contained in the speech data sequence corrected by the speech data correction means 45 and outputs the result to the speech output device SP. When the speech synthesizer 1 does not include the speech data correction means 45, the speech data concatenation means 46 concatenates the speech data sequence found by the speech data search means 44 and outputs it to the speech output device SP. The speech output device SP may be any device that outputs sound, for example a speaker or a display device with a speaker (a liquid crystal display, a CRT (Cathode Ray Tube), or the like). The speech synthesizer 1 may also output to the speech output device SP through a communication interface (not shown) via a communication network (not shown) such as the Internet.

[Operation of speech synthesizer]
Next, the operation of the speech synthesizer will be described with reference to FIG. 2 (and to FIG. 1 as appropriate). FIG. 2 is a flowchart showing the operation of the speech synthesizer shown in FIG. 1. First, the speech synthesizer 1 receives text data through the input means 2 (step S1), and the text data analysis means 41 performs morphological analysis on the input text data and converts it into accented phonemes (step S2). The speech synthesizer 1 then clusters the accented phonemes by the preceding and following phonemes using the phoneme clustering means 42 (step S3), and by the accents of the preceding and following phonemes using the phoneme accent clustering means 43 (step S4).

続いて、音声合成装置１は、音声データ探索手段４４によって、数式（１）に基づいて、連結スコアが最大となる音声データ列を音声データベース３３から探索する（ステップＳ５）。そして、音声合成装置１は、音声データ補正手段４５によって、探索された音声データ列を補正する（ステップＳ６）。これにより、音声データ列は前後の音声データ列に滑らかに接続することができる。そして、音声合成装置１は、音声データ連結手段４６によって、補正された音声データ列を連結する（ステップＳ７）。これにより、連結された音声合成データは、音声出力装置ＳＰから合成音声として出力される。なお、音声合成装置１は、音声データ補正手段４５を備えていない場合、ステップＳ５に続いて、ステップＳ７に進む。 Subsequently, the speech synthesizer 1 searches the speech database 33 for a speech data string having the maximum connection score based on the mathematical formula (1) by the speech data search means 44 (step S5). Then, the voice synthesizer 1 corrects the searched voice data string by the voice data correction unit 45 (step S6). Thereby, the audio data string can be smoothly connected to the preceding and succeeding audio data strings. Then, the speech synthesizer 1 connects the corrected audio data strings by the audio data connecting means 46 (step S7). Thereby, the connected speech synthesis data is output as synthesized speech from the speech output device SP. If the speech synthesizer 1 does not include the speech data correction unit 45, the process proceeds to step S7 following step S5.
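The flow of steps S1 to S7 can be sketched as a simple function pipeline. This is only an illustrative outline under stated assumptions: the function name `synthesize` and the callables passed to it are hypothetical stand-ins for the means 41 to 46, and the optional `correct` argument models the case where the speech data correction means 45 is absent and processing goes from step S5 directly to step S7.

```python
from typing import Any, Callable, Optional

Stage = Callable[[Any], Any]


def synthesize(
    text: str,
    analyze: Stage,           # stand-in for text data analysis means 41 (S2)
    cluster_phonemes: Stage,  # stand-in for phoneme clustering means 42 (S3)
    cluster_accents: Stage,   # stand-in for phoneme accent clustering means 43 (S4)
    search: Stage,            # stand-in for speech data search means 44 (S5)
    connect: Stage,           # stand-in for speech data connecting means 46 (S7)
    correct: Optional[Stage] = None,  # stand-in for correction means 45 (S6)
) -> Any:
    """Run the stages in the order of the flowchart (hypothetical sketch)."""
    phonemes = analyze(text)                 # S1 input, S2 analysis
    triphones = cluster_phonemes(phonemes)   # S3
    labels = cluster_accents(triphones)      # S4
    units = search(labels)                   # S5
    if correct is not None:                  # S6 is skipped when means 45 is absent
        units = correct(units)
    return connect(units)                    # S7
```

Passing `correct=None` reproduces the shortened path described at the end of the paragraph above.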

[具体的な音声合成例]
次に、図３を参照（適宜図１参照）して、音声合成装置１の具体的な音声合成例を説明する。図３は、図１に示した音声合成装置の具体的な音声合成例を示す説明図であって、（ａ）はテキストデータ解析手段の出力例、（ｂ）は音素クラスタリング手段および音素アクセントクラスタリング手段の出力例を示している。 [Specific speech synthesis example]
Next, a specific speech synthesis example of the speech synthesizer 1 will be described with reference to FIG. 3 (refer to FIG. 1 as appropriate). FIG. 3 is an explanatory diagram showing a specific speech synthesis example of the speech synthesizer shown in FIG. 1, where (a) is an output example of text data analysis means, and (b) is phoneme clustering means and phoneme accent clustering. An output example of the means is shown.

一例として、音声合成装置１の入力手段２にテキストデータ（入力日本語テキスト）として、「〈文頭〉おはようございます〈文末〉」が入力されたものとする。すると、音声合成装置１は、テキストデータ解析手段４１によって、入力日本語テキストをアクセント付き音素に変換し、図３の（ａ）に示すように、音素の列「ｏ（低）ｈａ（高）ｙｏ：（高）ｇｏ（高）ｚａ（高）ｉ（高）ｍａ（高）ｓｕ（低）」を出力する。 As an example, it is assumed that “<Sentence> Good morning <End of sentence>” is input as text data (input Japanese text) to the input means 2 of the speech synthesizer 1. Then, the speech synthesizer 1 converts the input Japanese text into accented phonemes by the text data analysis means 41, and as shown in FIG. 3A, the phoneme string “o (low) ha (high)”. yo: (high) go (high) za (high) i (high) ma (high) su (low) "is output.

そして、図３の（ｂ）に示すように、音声合成装置１は、音素クラスタリング手段４２によって、図３の（ａ）に示した連続した音素の列を構成する各音素３０１を、当該音素３０１の前後に配置された音素（前後の音素環境）を考慮してクラスタリングしたアクセント付き音素（トライフォン、音素クラスタリング３０２）の列を出力する。すなわち、「ｏ（低）＋ｈｏ−ｈ＋ａｈ−ａ（高）＋ｙａ−ｙ＋ｏ：ｙ−ｏ：（高）＋ｇｏ：−ｇ＋ｏｇ−ｏ（高）＋ｚｏ−ｚ＋ａｚ−ａ（高）＋ｉａ−ｉ（高）＋ｍｉ−ｍ＋ａｍ−ａ（高）＋ｓａ−ｓ＋ｕｓ−ｕ（低）」が出力される。ここで、例えば、トライフォン「ｏ−ｈ＋ａ」は、中心音素は「ｈ」であって、中心音素の前に配置された音素（先行音素）が「ｏ」で、中心音素の後ろに配置された音素（後続音素）が「ａ」であることを意味している。 Then, as shown in FIG. 3(b), the speech synthesizer 1 uses the phoneme clustering means 42 to cluster each phoneme 301 in the continuous phoneme sequence of FIG. 3(a) according to the phonemes placed before and after it (its preceding and following phoneme environment), and outputs a sequence of accented phonemes (triphones, phoneme clustering 302). That is, "o(low)+h, o−h+a, h−a(high)+y, a−y+o:, y−o:(high)+g, o:−g+o, g−o(high)+z, o−z+a, z−a(high)+i, a−i(high)+m, i−m+a, m−a(high)+s, a−s+u, s−u(low)" is output. Here, for example, the triphone "o−h+a" means that the central phoneme is "h", the phoneme placed before it (preceding phoneme) is "o", and the phoneme placed after it (following phoneme) is "a".
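As an illustration of the triphone notation, the following minimal sketch builds the label sequence of FIG. 3(b) from an accented phoneme list. The function name `to_triphones` and the `(symbol, accent)` tuple representation are assumptions made for this example, and ASCII `-` stands in for the `−` used in the text; the patent does not specify how the phoneme clustering means 42 is implemented.

```python
from typing import List, Optional, Tuple

Phoneme = Tuple[str, Optional[str]]  # (symbol, accent); accent is None for consonants


def to_triphones(phonemes: List[Phoneme]) -> List[str]:
    """Label each phoneme as 'preceding-center(accent)+following'."""
    labels = []
    for i, (symbol, accent) in enumerate(phonemes):
        center = symbol if accent is None else f"{symbol}({accent})"
        # The first phoneme has no preceding context, the last no following context.
        left = f"{phonemes[i - 1][0]}-" if i > 0 else ""
        right = f"+{phonemes[i + 1][0]}" if i + 1 < len(phonemes) else ""
        labels.append(f"{left}{center}{right}")
    return labels


# Accented phonemes for the example sentence, following FIG. 3(a).
OHAYOU = [
    ("o", "low"), ("h", None), ("a", "high"), ("y", None), ("o:", "high"),
    ("g", None), ("o", "high"), ("z", None), ("a", "high"), ("i", "high"),
    ("m", None), ("a", "high"), ("s", None), ("u", "low"),
]
```

Calling `to_triphones(OHAYOU)` yields the fourteen labels listed above, starting with `o(low)+h` and `o-h+a` and ending with `s-u(low)`.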

さらに、音声合成装置１は、音素アクセントクラスタリング手段４３によって、音素クラスタリング手段４２から出力されたアクセント付き音素（音素クラスタリング３０２）を、アクセント付き音素（中心音素）の前後に配置された音素（先行音素、後続音素）のアクセント（前後のアクセント環境）を考慮してクラスタリングしたアクセント付き音素（音素アクセントクラスタリング３０３）の列として音声データ探索手段４４に出力する。すなわち、「ｏ＾２＋ｈｏ−ｈ＾９＋ａｈ−ａ＾４＋ｙａ−ｙ＾８＋ｏ：ｙ−ｏ：＾１＋ｇｏ：−ｇ＾８＋ｏｇ−ｏ＾１＋ｚｏ−ｚ＾８＋ａｚ−ａ＾１＋ｉａ−ｉ＾１＋ｍｉ−ｍ＾８＋ａｍ−ａ＾６＋ｓａ−ｓ＾１０＋ｕｓ−ｕ＾５」が出力される。ここで、例えば、「ｏ−ｈ＾９＋ａ」は、中心音素が前記した（クラスタ“９”）に属する音素「ｈ」であり、前の音素（先行音素）が「ｏ」であり、後ろの音素（後続音素）が「ａ」であることを意味している。 Further, the speech synthesizer 1 uses the phoneme accent clustering means 43 to cluster the accented phonemes output from the phoneme clustering means 42 (phoneme clustering 302) according to the accents of the phonemes placed before and after each accented phoneme (central phoneme), that is, the preceding and following accent environment, and outputs the result to the speech data search means 44 as a sequence of accented phonemes (phoneme accent clustering 303). That is, "o^2+h, o−h^9+a, h−a^4+y, a−y^8+o:, y−o:^1+g, o:−g^8+o, g−o^1+z, o−z^8+a, z−a^1+i, a−i^1+m, i−m^8+a, m−a^6+s, a−s^10+u, s−u^5" is output. Here, for example, "o−h^9+a" means that the central phoneme is the phoneme "h" belonging to the cluster "9" described above, the preceding phoneme is "o", and the following phoneme is "a".
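To make the label format concrete, here is a small sketch that parses a label such as `o-h^9+a` into its preceding phoneme, central phoneme, accent cluster number, and following phoneme. The `Unit` type, the `parse_label` function, and the use of ASCII `-` in place of `−` are assumptions for illustration only; the mapping from accent environments to cluster numbers is defined elsewhere in the description and is not reproduced here.

```python
import re
from typing import NamedTuple, Optional


class Unit(NamedTuple):
    preceding: Optional[str]  # None at the start of the sequence
    center: str
    cluster: int              # accent cluster number written after '^'
    following: Optional[str]  # None at the end of the sequence


# preceding and following parts are optional; the center and cluster are required.
_LABEL = re.compile(
    r"^(?:(?P<preceding>[^\-\^+]+)-)?"
    r"(?P<center>[^\-\^+]+)\^(?P<cluster>\d+)"
    r"(?:\+(?P<following>[^\-\^+]+))?$"
)


def parse_label(label: str) -> Unit:
    """Parse 'preceding-center^cluster+following' (hypothetical helper)."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"not an accent-clustered label: {label!r}")
    return Unit(
        match["preceding"], match["center"], int(match["cluster"]), match["following"]
    )
```

For example, `parse_label("o-h^9+a")` gives a unit whose center is `h`, whose cluster is 9, and whose contexts are `o` and `a`, matching the reading given in the paragraph above.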

続いて、音声合成装置１は、音声データ探索手段４４によって、該当するクラスタに属する音声データの組み合わせについて音声データベース３３の探索を行い、連結スコアを最大にする音声データ列を出力する。このとき、音素（例えば、先行音素、分類された中心音素および後続音素の組である「ｏ−ｈ＾９＋ａ」や、分類された中心音素である「ｈ＾９」）を探索単位として、連結スコアが最大となる音声データの組み合わせが音声データベース３３から探索される。 Subsequently, the speech synthesizer 1 searches the speech database 33 for speech data combinations belonging to the corresponding cluster by the speech data search means 44, and outputs a speech data string that maximizes the connection score. At this time, a phoneme (for example, “o−h ^ 9 + a” which is a set of the preceding phoneme, the classified central phoneme and the subsequent phoneme, and “h ^ 9” which is the classified central phoneme) is connected as a search unit. A combination of voice data that maximizes the score is searched from the voice database 33.

具体的には、音声データ探索手段４４は、まず、音声データベース３３中の「ｏ＾３＋ｈ」から音声データベース３３中の「ｏ−ｈ＾９＋ａ」に接続する音声データ列の全ての組み合わせについて、数式（１）を使用して求められる連結スコアを計算する。 Specifically, the speech data search means 44 first calculates, using formula (1), the concatenation score for every combination of speech data sequences that connect an "o^3+h" in the speech database 33 to an "o−h^9+a" in the speech database 33.

計算された結果、音声データベース３３中、１番始めの「ｏ−ｈ＾９＋ａ」に接続する「ｏ＾３＋ｈ」の音声データ列（出力候補）の中で連結スコアが一番大きいものが音声データ探索手段４４によりメモリ３４に記録される。そして、数式（１）を使用して求められる連結スコアの計算と、記録動作とが音声データベース３３中の全ての「ｏ−ｈ＾９＋ａ」について実行される。 As a result of the calculation, among the speech data sequences (output candidates) of "o^3+h" that connect to the first "o−h^9+a" in the speech database 33, the one with the largest concatenation score is recorded in the memory 34 by the speech data search means 44. The concatenation score calculation using formula (1) and the recording operation are then executed for every "o−h^9+a" in the speech database 33.

さらに、「ｈ−ａ＾４＋ｙ」についても同様に、音声データベース３３中、１番始めの「ｈ−ａ＾４＋ｙ」に接続する「ｏ−ｈ＾９＋ａ」の音声データ列（出力候補）の中で連結スコアが一番大きいものが音声データ探索手段４４で記録される。そして、数式（１）を使用して求められる連結スコアの計算と、記録動作とが音声データベース３３の全ての「ｈ−ａ＾４＋ｙ」について実行される。 Similarly, for "h−a^4+y", among the speech data sequences (output candidates) of "o−h^9+a" that connect to the first "h−a^4+y" in the speech database 33, the one with the largest concatenation score is recorded by the speech data search means 44. The concatenation score calculation using formula (1) and the recording operation are then executed for every "h−a^4+y" in the speech database 33.

以下、同様にして、各アクセント付き音素を接続する連結スコアが求められ、最後に、入力日本語テキストに対応する音声データ列（出力候補）の組み合わせの中で、連結スコアが一番大きいものが探索されることとなる。この探索された音声データ列は、音声データ補正手段４５によって滑らかに接続され、音声データ連結手段４６によって連結され最終的に一つの音声データとなって、合成音声として音声出力装置ＳＰから出力される。 In the same manner, a concatenated score for connecting each accented phoneme is obtained, and finally, the combination of speech data strings (output candidates) corresponding to the input Japanese text has the largest concatenated score. Will be searched. The searched audio data string is smoothly connected by the audio data correcting unit 45, connected by the audio data connecting unit 46, and finally becomes one audio data, which is output from the audio output device SP as synthesized speech. .
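The search described above, which keeps for each candidate the best-scoring way of reaching it and finally selects the overall maximum, has the shape of a Viterbi search. The sketch below is one possible rendering under explicit assumptions: `join_score` is a hypothetical stand-in for formula (1), which is not reproduced in this part of the description, and the total score is assumed to be the sum of pairwise scores.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def viterbi_best_path(
    candidates: Sequence[Sequence[Hashable]],
    join_score: Callable[[Hashable, Hashable], float],
) -> Tuple[float, List[Hashable]]:
    """Pick one candidate per position so that the summed join score is maximal.

    candidates[i] lists the database entries usable for the i-th unit.
    join_score(a, b) is an assumed stand-in for the patent's formula (1).
    """
    if not candidates or any(len(layer) == 0 for layer in candidates):
        raise ValueError("every position needs at least one candidate")
    # history[i][c] = (best score of a path ending in c, predecessor of c)
    history = [{c: (0.0, None) for c in candidates[0]}]
    for layer in candidates[1:]:
        prev = history[-1]
        current = {}
        for c in layer:
            # Record only the best predecessor, as in the recording step above.
            best_prev = max(prev, key=lambda p: prev[p][0] + join_score(p, c))
            current[c] = (prev[best_prev][0] + join_score(best_prev, c), best_prev)
        history.append(current)
    last = max(history[-1], key=lambda c: history[-1][c][0])
    total = history[-1][last][0]
    path = [last]
    for layer in reversed(history[1:]):  # follow back-pointers to the first layer
        last = layer[last][1]
        path.append(last)
    path.reverse()
    return total, path
```

If formula (1) is multiplicative (for example, a product of connection probabilities), the same structure applies with log scores substituted for `join_score`; that choice is an assumption of this sketch, not something stated in the text.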

第１の実施形態によれば、音声合成装置１は、アクセント付き音素について、前後の音素環境に加えて前後のアクセント環境を考慮して、音声データベース３３を探索するので、アクセントの正しい合成音声を作成することが可能になる。その結果、合成音声が不自然に聞こえてしまう自然性の劣化を防止することができる。 According to the first embodiment, the speech synthesizer 1 searches the speech database 33 for accented phonemes in consideration of the front and back accent environments in addition to the front and back phoneme environments. It becomes possible to create. As a result, it is possible to prevent the deterioration of naturalness that the synthesized speech is heard unnaturally.

（第２の実施形態）
［音声合成装置の構成］
図４は、第２の実施形態の音声合成装置の構成を示す機能ブロック図である。
音声合成装置１Ａは、図４に示すように、記憶手段３Ａと、音声合成制御手段４Ａの機能が異なる点を除いて、図１に示した音声合成装置１と同一の構成なので、同一の構成には同一の符号を付し、説明を省略する。 (Second Embodiment)
[Configuration of speech synthesizer]
FIG. 4 is a functional block diagram showing the configuration of the speech synthesizer according to the second embodiment.
As shown in FIG. 4, the speech synthesizer 1A has the same configuration as the speech synthesizer 1 shown in FIG. 1 except that the storage unit 3A and the speech synthesis control unit 4A have different functions. Are denoted by the same reference numerals, and description thereof is omitted.

記憶手段３Ａは、音素列リスト（音素列記憶手段）３５を備える。
音素列リスト３５は、アクセント付き音素の前後に配置された音素および該音素のアクセントでクラスタリングされたアクセント付き音素の列を記憶したものであって、ＨＤＤ等の一般的な記録媒体である。この音素列リスト３５の作成方法については、特開２００５−７０１６４号公報に記載された方法を適用することが出来る。
なお、音声データベース３３に記憶されている単位音声は、全ての母音にアクセントの高低が付与されている音素または音素列（音素列分割仮説データ）を基盤としている。 The storage means 3A includes a phoneme string list (phoneme string storage means) 35.
The phoneme string list 35 stores phonemes arranged before and after accented phonemes, and strings of accented phonemes clustered by the accents of the phonemes, and is a general recording medium such as an HDD. As a method for creating the phoneme string list 35, a method described in Japanese Patent Application Laid-Open No. 2005-70164 can be applied.
Note that the unit speech stored in the speech database 33 is based on a phoneme or phoneme sequence (phoneme sequence division hypothesis data) in which accents are assigned to all vowels.

音声合成制御手段４Ａは、音素列分割手段５１と、音声データ探索手段４４Ａとを備える。音素列分割手段５１は、音素アクセントクラスタリング手段４３でクラスタリングされた音素に変換された入力テキストデータを、音素列リスト３５に記憶されたアクセント付き音素の列（音素列分割仮説データ）に分割し、音声データ探索手段４４Ａに出力するものである。この音素列分割手段５１は、例えば、接続点数や音素数に基づいて、音素列分割仮説データを出力する。なお、このアクセント付き音素の列への分割に関しては、特開２００５−７０１６５号公報に記載されている方法を適用することができる。 The voice synthesis control unit 4A includes a phoneme string division unit 51 and a voice data search unit 44A. The phoneme string dividing means 51 divides the input text data converted into the phonemes clustered by the phoneme accent clustering means 43 into accented phoneme strings (phoneme string dividing hypothesis data) stored in the phoneme string list 35, This is output to the voice data search means 44A. The phoneme string dividing means 51 outputs phoneme string dividing hypothesis data based on the number of connection points and the number of phonemes, for example. Note that the method described in Japanese Patent Laid-Open No. 2005-70165 can be applied to the division of accented phonemes into columns.

音声データ探索手段４４Ａは、音素列分割手段５１で分割されたアクセント付き音素の列に対応する音声データを組み合わせることによって生成される音声データ列の連結スコアをビタービサーチによって計算し、連結スコアが最大となる音声データ列を音声データベース３３から探索し、音声データ補正手段４５に出力するものである。なお、音声データ補正手段４５を備えていない場合には音声データ連結手段４６に出力する。
この音声データ探索手段４４Ａは、アクセント付き音素の列（音素列）を探索単位とする。このアクセント付き音素の列は、図３の（ｂ）において音素アクセントクラスタリング３０３で示した例で表現すると、この音素アクセントクラスタリング３０３の連結したものに相当し、例えば、連続した中心音素の列と先行音素または／および後続音素の組である「（ｏ＾２ｈ＾９）＋ａ」や、「ｏ−（ｈ＾９ａ＾４ｙ＾８）＋ｏ」などで表現される。 The speech data search means 44A calculates a concatenation score of speech data strings generated by combining speech data corresponding to accented phoneme strings divided by the phoneme string split means 51 by viterbi search, and the concatenation score is The maximum audio data string is searched from the audio database 33 and output to the audio data correcting means 45. If the audio data correcting means 45 is not provided, the audio data connecting means 46 is output.
The speech data search means 44A uses a string of accented phonemes (phoneme string) as its search unit. In terms of the example shown as phoneme accent clustering 303 in FIG. 3(b), such a string corresponds to a concatenation of those units, and is written, for example, as "(o^2 h^9)+a" or "o−(h^9 a^4 y^8)+o", that is, a run of consecutive central phonemes together with a preceding and/or following phoneme.

［音声合成装置の動作］
次に、図５を参照（適宜図４参照）して、音声合成装置１Ａの動作について説明する。図５は、図４に示した音声合成装置の動作を示すフローチャートである。音声合成装置１Ａは、ステップＳ２１〜Ｓ２４を順次処理する。これらのステップＳ２１〜Ｓ２４は、それぞれ図２に示したステップＳ１〜Ｓ４と同一なので説明を省略する。ステップＳ２４に続けて、音声合成装置１Ａは、音素列分割手段５１によって、音素アクセントクラスタリング手段４３でクラスタリングされた音素に変換された入力テキストデータをアクセント付き音素の列に分割する（ステップＳ２５）。そして、音声合成装置１Ａは、ステップＳ２６〜Ｓ２８を順次処理する。これらのステップＳ２６〜Ｓ２８は、それぞれ図２に示したステップＳ５〜Ｓ７と同一なので説明を省略する。ただし、ステップＳ２６では、音声データ探索手段４４Ａは、アクセント付き音素（例えば、「ｏ−ｈ＾９＋ａ」や「ｈ＾９」）を対象とするのではなく、なるべく分割されたアクセント付き音素の列（例えば、「（ｏ＾２ｈ＾９）＋ａ」、「ｏ−（ｈ＾９ａ＾４ｙ＾８）＋ｏ」）を対象として、連結スコアが最大となる音声データ列を探索する。 [Operation of speech synthesizer]
Next, the operation of the speech synthesizer 1A will be described with reference to FIG. 5 (and FIG. 4 as appropriate). FIG. 5 is a flowchart showing the operation of the speech synthesizer shown in FIG. 4. The speech synthesizer 1A sequentially processes steps S21 to S24. These steps are the same as steps S1 to S4 shown in FIG. 2, respectively, so their description is omitted. Following step S24, the speech synthesizer 1A uses the phoneme string dividing means 51 to divide the input text data, which has been converted into phonemes clustered by the phoneme accent clustering means 43, into strings of accented phonemes (step S25). The speech synthesizer 1A then sequentially processes steps S26 to S28. These steps are the same as steps S5 to S7 shown in FIG. 2, respectively, so their description is omitted. In step S26, however, the speech data search means 44A searches for the speech data sequence with the maximum concatenation score using, wherever possible, the strings of accented phonemes produced by the division (for example, "(o^2 h^9)+a" and "o−(h^9 a^4 y^8)+o") as its targets, rather than individual accented phonemes (for example, "o−h^9+a" or "h^9").

次に、前記したステップＳ２５の処理を具体的に説明する。一例として、音声合成装置１Ａの入力手段２にテキストデータ（入力日本語テキスト）として、「〈文頭〉おはようございます〈文末〉」が入力されたものとする。この場合、音声合成装置１Ａは、図３の（ａ）で示したように、テキストデータ解析手段４１によって、連続した音素の列「ｏ（低）ｈａ（高）ｙｏ：（高）ｇｏ（高）ｚａ（高）ｉ（高）ｍａ（高）ｓｕ（低）」を出力する。そして、音声合成装置１Ａの音素列分割手段５１に、図３の（ｂ）で示したように、アクセントクラスタリングされた音素列「ｏ＾２＋ｈｏ−ｈ＾９＋ａｈ−ａ＾４＋ｙａ−ｙ＾８＋ｏ：ｙ−ｏ：＾１＋ｇｏ：−ｇ＾８＋ｏｇ−ｏ＾１＋ｚｏ−ｚ＾８＋ａｚ−ａ＾１＋ｉａ−ｉ＾１＋ｍｉ−ｍ＾８＋ａｍ−ａ＾６＋ｓａ−ｓ＾１０＋ｕｓ−ｕ＾５」が入力されることとなる。そして、音素列分割手段５１は、入力された１４個の音素（中心音素）から成る音素列を、「（ｏ＾２ｈ＾９）＋ａ」、「ｈ−（ａ＾４ｙ＾８ｏ：＾１ｇ＾８）＋ｏ」、「ｇ−（ｏ＾１ｚ＾８ａ＾１ｉ＾１）＋ｍ」、および、「ｉ−（ｍ＾８ａ＾６ｓ＾１０ｕ＾５）」の４つの音素列分割仮説データ（アクセント付き音素の列）に分割して、最終的な出力結果として出力することとなる。 Next, the processing of step S25 described above will be explained concretely. As an example, assume that "<start of sentence> Good morning (おはようございます) <end of sentence>" is input as text data (input Japanese text) to the input means 2 of the speech synthesizer 1A. In this case, as shown in FIG. 3(a), the speech synthesizer 1A uses the text data analysis means 41 to output the continuous phoneme sequence "o(low) h a(high) y o:(high) g o(high) z a(high) i(high) m a(high) s u(low)". Then, as shown in FIG. 3(b), the accent-clustered phoneme sequence "o^2+h, o−h^9+a, h−a^4+y, a−y^8+o:, y−o:^1+g, o:−g^8+o, g−o^1+z, o−z^8+a, z−a^1+i, a−i^1+m, i−m^8+a, m−a^6+s, a−s^10+u, s−u^5" is input to the phoneme string dividing means 51 of the speech synthesizer 1A. The phoneme string dividing means 51 then divides the input phoneme sequence, which consists of 14 phonemes (central phonemes), into four pieces of phoneme string division hypothesis data (strings of accented phonemes), namely "(o^2 h^9)+a", "h−(a^4 y^8 o:^1 g^8)+o", "g−(o^1 z^8 a^1 i^1)+m", and "i−(m^8 a^6 s^10 u^5)", and outputs them as the final result.
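The division in step S25 can be illustrated with a small dynamic-programming sketch that splits the sequence into the fewest registered strings, that is, the fewest connection points. This is an assumption made for illustration: the description only says that the division may be based on the number of connection points or phonemes and defers the actual method to JP 2005-70165. The function name `split_into_strings` and the `registered` set (a stand-in for the phoneme string list 35) are hypothetical, and ASCII `-` stands in for `−`.

```python
from typing import List, Sequence, Set, Tuple


def split_into_strings(
    units: Sequence[str], registered: Set[Tuple[str, ...]]
) -> List[str]:
    """Split center labels such as 'h^9' into the fewest registered strings."""
    n = len(units)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[k]: fewest pieces covering units[:k]
    back = [0] * (n + 1)    # back[k]: start index of the last piece
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] + 1 < best[end] and tuple(units[start:end]) in registered:
                best[end] = best[start] + 1
                back[end] = start
    if best[n] == inf:
        raise ValueError("sequence cannot be covered by the registered strings")
    bounds = []
    end = n
    while end > 0:
        bounds.append((back[end], end))
        end = back[end]
    bounds.reverse()
    symbol = lambda unit: unit.split("^")[0]  # 'h^9' -> 'h'
    labels = []
    for start, end in bounds:
        left = f"{symbol(units[start - 1])}-" if start > 0 else ""
        right = f"+{symbol(units[end])}" if end < n else ""
        labels.append(f"{left}({' '.join(units[start:end])}){right}")
    return labels
```

With the fourteen center labels of the example and a list that registers the four strings named above, the function returns those four labels in order, the first being `(o^2 h^9)+a`.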

第２の実施形態によれば、音声合成装置１Ａは、入力されたテキストデータを音素列リスト３５に予め登録されているアクセント付き音素の列に分割し、このアクセント付き音素の列を、音声データベース３３を探索する探索単位として使用するため、探索する際に前後の音素環境が異なる音素を探索することを防止し、音声合成処理に要する時間を短縮できる。その結果、合成した音声合成データの音質の低下を防止することができる。 According to the second embodiment, the speech synthesizer 1A divides the input text data into accented phoneme strings registered in the phoneme string list 35 in advance, and the accented phoneme strings are converted into a speech database. Since 33 is used as a search unit for searching, it is possible to prevent searching for phonemes having different phoneme environments before and after searching, and to shorten the time required for speech synthesis processing. As a result, it is possible to prevent deterioration in sound quality of the synthesized speech synthesis data.

以上、各実施形態に基づいて本発明を説明したが、本発明はこれらに限定されるものではない。例えば、音声合成装置１（１Ａ）の各構成を一つずつの過程と捉えた音声合成方法とみなすことや、各構成の処理を汎用のコンピュータ言語で記述した音声合成プログラムとみなすことも可能である。この場合、音声合成装置１（１Ａ）と同様の効果を得ることができる。 As mentioned above, although this invention was demonstrated based on each embodiment, this invention is not limited to these. For example, it is possible to regard each configuration of the speech synthesizer 1 (1A) as a speech synthesis method in which each configuration is regarded as one process, or as a speech synthesis program in which the processing of each configuration is described in a general-purpose computer language. is there. In this case, the same effect as the speech synthesizer 1 (1A) can be obtained.

また、各実施形態では、音声データベース３３は、基盤としている音素の母音にアクセントの高低を付与されているものとして説明したが、母音にアクセントの強弱を付与するようにしてもよく、また、母音の無声化に対しても、母音があるものとして高低もしくは強弱を付与するようにしてもよい。
また、各実施形態では、音素アクセントクラスタリング手段４３は、前後のアクセント環境として、前後の音の高低を用いるものとしたが、前後の音の高低、強弱、長短、リズムのうちの少なくとも１つを用いればよい。 In each embodiment, the speech database 33 has been described as being based on phonemes whose vowels are given a high or low accent; however, the vowels may instead be given a strong or weak accent, and a devoiced vowel may also be given a high/low or strong/weak accent as though the vowel were present.
Further, in each embodiment, the phoneme accent clustering means 43 uses the pitch (high/low) of the preceding and following sounds as the accent environment, but it is sufficient to use at least one of the pitch, stress (strong/weak), length (long/short), and rhythm of the preceding and following sounds.

次に、本発明の効果を確認した実施例について説明する。第２の実施形態の音声合成装置１Ａを用いて、前後のアクセント環境まで考慮してクラスタリングすることにより合成した合成音声（実施例）と、従来のように前後の音素環境のみを考慮してクラスタリングすることにより合成した合成音声（比較例）とを、自然性（より自然に聞こえるか）に関して比較した。 Next, examples in which the effects of the present invention have been confirmed will be described. Clustering considering synthesized speech (example) synthesized by clustering in consideration of up to and including the accent environment using the speech synthesizer 1A of the second embodiment, and the conventional phoneme environment as in the past. The synthesized speech (comparative example) synthesized by doing so was compared in terms of naturalness (whether it sounds more natural).

[対比較実験]
音声データベース３３に予め蓄積したデータは、１９９６年６月３日から２００１年６月２２日までのＮＨＫニュースデータベースに存在する森田アナウンサにより発声された２５４８４文章と森田アナウンサが読み上げたバランス文（音素環境をバランスさせて作成した文）である９９文章の計７９．５時間分を全て収めたものであり、総トライフォン数は３５６万であり、異なりトライフォン数は８４５２である。また、評価用テキストには、２００１年６月２５日から６月２９日までの番組「ＮＨＫニュース１０」で森田アナウンサが発声した９６文章（音素数１３２６７）を使用した。 [Comparison experiment]
The data stored in advance in the speech database 33 comprise 25,484 sentences uttered by announcer Morita in the NHK news database from June 3, 1996 to June 22, 2001, and 99 balanced sentences (sentences composed to balance phoneme environments) read aloud by announcer Morita, totaling 79.5 hours. The total number of triphones is 3.56 million, and the number of distinct triphones is 8,452. As the evaluation text, 96 sentences (13,267 phonemes) uttered by announcer Morita in the program "NHK News 10" from June 25 to June 29, 2001 were used.

まず、対比較実験について説明する。この対比較実験は、防音室内でスピーカを用いて行い、当該実験の被験者は、音声評定の経験のある３名の女性である。また、この対比較実験では、評価用テキストである９６文章全てを実施例と比較例とについて受聴させ、それぞれの受聴は１回のみに限定した。この対比較実験の各試行は、実施例と比較例とを対でランダムな順序で呈示し、被験者がより自然に感じる方を選択するように当該被験者に指示を与えた。なお、この対比較実験は、各被験者に適度な時間間隔で休憩をとってもらいながら行った。 First, a comparative experiment will be described. This paired comparison experiment is performed using speakers in a soundproof room, and the subjects of the experiment are three women who have experience in voice evaluation. In this comparative comparison experiment, all 96 sentences as evaluation texts were listened to for the example and the comparative example, and each listening was limited to once. In each trial of the comparative experiment, the example and the comparative example were presented in pairs in a random order, and the subject was instructed to select the subject that felt more natural. This comparative experiment was conducted while having each subject take a break at an appropriate time interval.

この対比較実験の結果、全体（ｔｏｔａｌ）で５６．０％の音声に関して、音声合成装置１Ａによって合成した合成音声（実施例）の方が、従来のように前後のアクセント環境を考慮しないもの（比較例）に比べて、自然であると評価された。二項検定を用いると、危険率は５％で、この差は有意である。 As a result of this paired comparison experiment, for 56.0% of the utterances overall (total), the synthesized speech produced by the speech synthesizer 1A (example) was judged more natural than speech synthesized conventionally without considering the preceding and following accent environment (comparative example). According to a binomial test, this difference is significant at the 5% significance level.
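For readers unfamiliar with the binomial test mentioned above, the following sketch computes an exact one-sided p-value for a paired preference count under the null hypothesis that both systems are preferred equally often. The function name `binomial_p_value` is hypothetical, and the choice of a one-sided test is an assumption; the description states neither the sidedness of the test nor the raw preference counts, so no figures from the experiment are reproduced here.

```python
from math import comb


def binomial_p_value(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact one-sided p-value P(X >= successes) for X ~ Binomial(trials, p)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )
```

A difference would be called significant at the 5% level when the returned value for the observed count is below 0.05.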

[５段階品質評価実験]
次に、５段階品質評価実験について説明する。この５段階品質評価実験は、前後のアクセント環境を考慮した音声データベース３３を利用して作成した合成音声（実施例）と、前後のアクセント環境を考慮しない音声データベースを利用して作成した合成音声（比較例）と、自然音声とに対して５段階で品質評価を行ったものである。 [5-level quality evaluation experiment]
Next, the five-grade quality evaluation experiment will be described. In this experiment, quality was rated on a five-point scale for synthesized speech created using the speech database 33 that takes the preceding and following accent environment into account (example), synthesized speech created using a speech database that does not take it into account (comparative example), and natural speech.

この５段階品質評価実験は、対比較実験と同様に、防音室内で、スピーカを用いて行っており、被験者は音声評定の経験がある３名の女性である。各試行では、評価用データをランダムな順序で被験者に呈示し、被験者は自然性の違いを評価する。この自然性の評価は、“５”（自然である）、“４”（不自然な部分はあるが気にならない）、“３”（少し気になる）、“２”（気になる）、“１”（非常に気になる）の５段階で品質評価を行うこととした。なお、品質評価に先立ち、被験者には、自然音声を３文章聞かせて、どの程度の音声であれば、自然に聞こえるとするかといった評価基準（インストラクション）を与えた。また、評価用テキストとして実際に放送されたニュース文を利用しているので、１文の長さが平均で１０秒程度と長いことから、受聴は１回のみに限定し、適度な間隔で休憩を挟みながら行った。 This five-step quality evaluation experiment is performed using a speaker in a soundproof room as in the comparative comparison experiment, and the subjects are three women who have experience in voice evaluation. In each trial, the evaluation data is presented to the subject in a random order, and the subject evaluates the difference in naturalness. The evaluation of this naturalness is “5” (natural), “4” (unnatural part but not bothered), “3” (somewhat worried), “2” (worried) , "1" (very worrisome) was decided to perform quality evaluation in five stages. Prior to the quality evaluation, the subjects were given 3 sentences of natural speech and given an evaluation standard (instruction) as to how much speech should be heard naturally. In addition, since the news sentence actually broadcasted is used as the evaluation text, the average length of one sentence is about 10 seconds. Therefore, listening is limited to one time and breaks are made at appropriate intervals. I went while holding it.

ここで、平均オピニオン評点（ＭＯＳ：Mean Opinion Score）について図６を参照して説明する。図６は、図４に示した音声合成装置を使用した５段階品質評価実験の結果を示すグラフである。図６に示すグラフから、実施例の場合のＭＯＳは「４．１９」となり、比較例のＭＯＳは「３．９５」となった。実施例のＭＯＳ「４．１９」は、「自然である」と「不自然な部分があるが気にならない」との間の自然性を持つと言え、比較例のＭＯＳ「３．９５」と比べ、良い評価であると言える。なお、自然音声のＭＯＳは「４．９９」となった。 Here, the mean opinion score (MOS) will be described with reference to FIG. 6. FIG. 6 is a graph showing the results of the five-grade quality evaluation experiment using the speech synthesizer shown in FIG. 4. According to the graph in FIG. 6, the MOS of the example was 4.19 and the MOS of the comparative example was 3.95. The example's MOS of 4.19 lies between "natural" and "has unnatural parts, but they are not bothersome", and is a better rating than the comparative example's MOS of 3.95. The MOS of natural speech was 4.99.

また、図６に示すように、実施例では、全体の４９％の合成音声が“５”（自然である）と評価されている。そのため、実施例では、自然音声と変わらない品質の音声データが高頻度で合成されていると言える。なお、“２”および“１”の評価を受けたものは全体の８％である。 Further, as shown in FIG. 6, in the embodiment, 49% of the total synthesized speech is evaluated as “5” (natural). For this reason, in the embodiment, it can be said that voice data having the same quality as natural voice is synthesized with high frequency. In addition, 8% of the total received evaluations of “2” and “1”.

第１の実施形態に係る音声合成装置の構成を示す機能ブロック図である。 A functional block diagram showing the configuration of the speech synthesizer according to the first embodiment.
図１に示した音声合成装置の動作を示すフローチャートである。 A flowchart showing the operation of the speech synthesizer shown in FIG. 1.
図１に示した音声合成装置の具体的な音声合成例を示す説明図であって、（ａ）はテキストデータ解析手段の出力例、（ｂ）は音素クラスタリング手段および音素アクセントクラスタリング手段の出力例を示している。 An explanatory diagram showing a specific speech synthesis example of the speech synthesizer shown in FIG. 1, in which (a) shows an output example of the text data analysis means and (b) shows an output example of the phoneme clustering means and the phoneme accent clustering means.
第２の実施形態の音声合成装置の構成を示す機能ブロック図である。 A functional block diagram showing the configuration of the speech synthesizer of the second embodiment.
図４に示した音声合成装置の動作を示すフローチャートである。 A flowchart showing the operation of the speech synthesizer shown in FIG. 4.
図４に示した音声合成装置を使用した５段階品質評価実験の結果を示すグラフである。 A graph showing the results of the five-grade quality evaluation experiment using the speech synthesizer shown in FIG. 4.

Explanation of symbols

１，１Ａ音声合成装置
２入力手段
３，３Ａ記憶手段
３１発音単語記憶手段
３２接続確率記憶手段
３３音声データベース
３４メモリ
３５音素列リスト（音素列記憶手段）
４，４Ａ音声合成制御手段
４１テキストデータ解析手段
４２音素クラスタリング手段
４３音素アクセントクラスタリング手段
４４，４４Ａ音声データ探索手段
４５音声データ補正手段
４６音声データ連結手段
５１音素列分割手段
ＳＰ音声出力装置
1, 1A Speech synthesizer
2 Input means
3, 3A Storage means
31 Pronunciation word storage means
32 Connection probability storage means
33 Speech database
34 Memory
35 Phoneme string list (phoneme string storage means)
4, 4A Speech synthesis control means
41 Text data analysis means
42 Phoneme clustering means
43 Phoneme accent clustering means
44, 44A Speech data search means
45 Speech data correction means
46 Speech data connecting means
51 Phoneme string dividing means
SP Speech output device

Claims

Text data analysis means for converting input text data into accented phonemes by performing morphological analysis,
Phoneme clustering means for clustering accented phonemes converted by the text data analyzing means with phonemes arranged before and after the accented phonemes;
Phoneme accent clustering means for clustering accented phonemes clustered by the phoneme clustering means with phoneme accents arranged before and after the accented phonemes;
A speech database for storing speech data corresponding to phonemes arranged before and after the accented phonemes and accented phonemes clustered by the accents of the phonemes;
Speech data search means for calculating, by Viterbi search, a concatenation score of a speech data sequence generated by combining speech data corresponding to the accented phonemes clustered by the phoneme accent clustering means, and for searching the speech database for the speech data sequence that maximizes the concatenation score;
Voice data connection means for connecting the voice data strings searched by the voice data search means;
A speech synthesizer comprising:

Text data analysis means for converting input text data into accented phonemes by performing morphological analysis,
Phoneme clustering means for clustering accented phonemes converted by the text data analyzing means with phonemes arranged before and after the accented phonemes;
Phoneme accent clustering means for clustering accented phonemes clustered by the phoneme clustering means with phoneme accents arranged before and after the accented phonemes;
Phoneme string storage means for storing a phoneme arranged before and after the accented phoneme and a string of accented phonemes clustered by the accent of the phoneme;
A speech database for storing speech data corresponding to accented phonemes stored in the phoneme string storage means;
Phoneme string dividing means for dividing the text data converted into phonemes clustered by the phoneme accent clustering means into accented phoneme strings stored in the phoneme string storage means;
Speech data search means for calculating, by Viterbi search, a concatenation score of a speech data sequence generated by combining speech data corresponding to the accented phoneme strings divided by the phoneme string dividing means, and for searching the speech database for the speech data sequence that maximizes the concatenation score;
Voice data connection means for connecting the voice data strings searched by the voice data search means;
A speech synthesizer comprising:

In order to synthesize speech corresponding to text data,
Text data analysis means for converting input text data into accented phonemes by morphological analysis,
Phoneme clustering means for clustering accented phonemes converted by the text data analyzing means with phonemes arranged before and after the accented phonemes;
Phoneme accent clustering means for clustering accented phonemes clustered by the phoneme clustering means with phoneme accents arranged before and after the accented phonemes;
Speech data search means for calculating, by Viterbi search, a concatenation score of a speech data sequence generated by combining speech data corresponding to the accented phonemes clustered by the phoneme accent clustering means, and for searching for the speech data sequence having the maximum concatenation score;
Voice data connecting means for connecting voice data strings searched by the voice data searching means;
A speech synthesis program characterized by functioning as

In order to synthesize speech corresponding to text data,
Text data analysis means for converting input text data into accented phonemes by morphological analysis,
Phoneme clustering means for clustering accented phonemes converted by the text data analyzing means with phonemes arranged before and after the accented phonemes;
Phoneme accent clustering means for clustering accented phonemes clustered by the phoneme clustering means with phoneme accents arranged before and after the accented phonemes;
Phoneme string dividing means for dividing the text data converted into phonemes clustered by the phoneme accent clustering means into accented phoneme strings;
Speech data search means for calculating, by Viterbi search, a concatenation score of a speech data sequence generated by combining speech data corresponding to the accented phoneme strings divided by the phoneme string dividing means, and for searching for the speech data sequence that maximizes the concatenation score;
Voice data connecting means for connecting voice data strings searched by the voice data searching means;
A speech synthesis program characterized by functioning as