JP2007163667A

JP2007163667A - Voice synthesizer and voice synthesizing program

Info

Publication number: JP2007163667A
Application number: JP2005357727A
Authority: JP
Inventors: Norifumi Oide; 訓史大出; Hiroyuki Segi; 寛之世木; Toru Tsugi; 徹都木
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2005-12-12
Filing date: 2005-12-12
Publication date: 2007-06-28
Anticipated expiration: 2025-12-12
Also published as: JP4829605B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer and a speech synthesis program capable of suppressing degradation of the sound quality of synthesized speech by not connecting speech units whose speech waveforms do not join smoothly.
SOLUTION: The speech synthesizer 1, which synthesizes speech from input text data, comprises: a prosody database 7 for storing prosody data; a speech unit database 11 for storing speech unit data; a language analysis unit 3 for linguistically analyzing the text data; a prosody generation unit 5 for generating prosodic information; a speech unit search unit 9 for searching the speech unit data; and a speech unit synthesis unit 13 for connecting the speech unit data.
COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a speech synthesizer and a speech synthesis program that synthesize speech from text data.

Conventionally, as a technique for synthesizing speech from input text data, a method has been disclosed in which the text data is linguistically analyzed, the intonation, amplitude, and duration of each phoneme obtained from the analysis are treated as absolute values (so-called target costs), the speech units stored in a database are searched, and the retrieved speech units are connected (for example, Patent Documents 1 and 2).

The methods disclosed there use a relatively small database (one holding little data). Because few speech units are stored, speech units that connect smoothly cannot be found unless the intonation, amplitude, and duration of each phoneme are treated as absolute values, and the speech synthesized from those units would sound unnatural.

In recent years, however, the databases used for speech synthesis have become enormous (the amount of stored data has grown dramatically). Many natural combinations of various speech units are now stored, and it is no longer rare for speech units corresponding to part or all of the input text data to be stored in the database as they are.

As a result, another technique for synthesizing speech from input text data has been disclosed, in which prosodic information is generated from the text data, the speech units stored in a database are searched, on the basis of this prosodic information, for units whose speech waveforms (speech waveform data) connect smoothly, and the retrieved speech units are connected (for example, Patent Document 3). The method disclosed there uses a relatively large database (one holding a large amount of data).
JP 2001-100777 A; JP H11-338488 A; JP 2004-139033 A

However, the conventional method of generating prosodic information and then synthesizing speech preferentially selects, from the speech units stored in the database, those that match the prosodic information. In other words, the speech units derived from the prosodic information are treated as fixed, so units whose waveform continuity (smoothness) is poor when connected end up being retrieved and connected, and the sound quality of the synthesized speech deteriorates.

Accordingly, an object of the present invention is to solve the above problem and to provide a speech synthesizer and a speech synthesis program that can suppress degradation of the sound quality of synthesized speech by not connecting speech units whose waveform continuity is poor.

To solve the above problem, the speech synthesizer according to claim 1 is a speech synthesizer that synthesizes speech from input text data and comprises prosody data storage means, speech unit data storage means, language analysis means, prosodic information generation means, speech unit data search means, and speech unit data synthesis means.

With this configuration, the speech synthesizer is provided in advance with prosody data storage means, which stores prosody data associating phonemes with information on prosody, and speech unit data storage means, which stores speech unit data associating phonemes with speech waveforms. The information on prosody indicates part of speech, accent phrase, presence or absence of an accent nucleus, word boundary, phrase boundary, clause boundary, breath-group boundary, dependency, and presence or absence of focus (emphasis; for example, speaking an important word loudly and slowly with a deliberate pause so as to distinguish it from the other words in the sentence). The speech synthesizer uses the language analysis means to analyze the text data linguistically and decompose it into a phoneme string. The prosodic information generation means then uses the prosody data stored in the prosody data storage means to generate, for the phoneme string produced by the language analysis means, prosodic information consisting of the allowable ranges of the fundamental frequency, amplitude, and duration of the phonemes located before and after any given phoneme, together with transition probabilities.

The speech unit data search means then searches the speech unit data stored in the speech unit data storage means on the basis of the prosodic information generated by the prosodic information generation means, and outputs the combination of speech unit data that minimizes the connection cost and the prosody cost. Finally, the speech unit data synthesis means synthesizes and outputs the combination of speech unit data output by the speech unit data search means.

The speech synthesizer according to claim 2 is the speech synthesizer according to claim 1, characterized in that, when analyzing the text data linguistically, the language analysis means analyzes at least one of: the type of phoneme or part of speech, the dependency distance obtained by syntactic analysis, the presence or absence of punctuation, the presence or absence of an accent nucleus, the clause within a breath group, the position of a word, and the number of morae in the word.

With this configuration, the language analysis means analyzes at least one of the type of phoneme or part of speech, the dependency distance obtained by syntactic analysis, the presence or absence of punctuation, the presence or absence of an accent nucleus, the clause within a breath group, the position of a word, and the number of morae in the word, which improves the reliability of the prosodic information generated by the prosodic information generation means.

The speech synthesizer according to claim 3 is the speech synthesizer according to claim 1 or 2, characterized in that, when generating the prosodic information, the prosodic information generation means uses prosody data whose information on prosody falls within a preset range.

With this configuration, the information on prosody in the prosody data used by the prosodic information generation means is narrowed down in advance to a preset range, so that irregular connections can be excluded from the speech unit data searched by the speech unit data search means. The amount of speech unit data to be searched is also reduced, which improves the synthesis speed.

The speech synthesizer according to claim 4 is the speech synthesizer according to any one of claims 1 to 3, characterized in that, when generating the prosodic information, the prosodic information generation means calculates the ratio between the degrees of influence of the prosody cost and the connection cost according to the distribution of the transition probabilities.

With this configuration, when generating the prosodic information, the speech synthesizer calculates the ratio between the degrees of influence of the prosody cost and the connection cost according to the distribution of the transition probabilities, for example when the transition probabilities have no peak, and can thereby ensure appropriate connections between phonemes.

The speech synthesis program according to claim 5 causes a computer that includes prosody data storage means, which stores prosody data associating phonemes with information on prosody, and speech unit data storage means, which stores speech unit data associating the phonemes with speech waveforms, to function as language analysis means, prosodic information generation means, speech unit data search means, and speech unit data synthesis means, in order to synthesize speech from input text data.

With this configuration, the speech synthesis program uses the language analysis means to analyze the text data linguistically and decompose it into a phoneme string, and uses the prosodic information generation means to generate, from the prosody data stored in the prosody data storage means, prosodic information consisting of the allowable ranges of the fundamental frequency, amplitude, and duration of the phonemes located before and after any given phoneme in that phoneme string, together with transition probabilities. The program then uses the speech unit data search means to search the speech unit data stored in the speech unit data storage means on the basis of the generated prosodic information and to output the combination of speech unit data that minimizes the connection cost and the prosody cost, and uses the speech unit data synthesis means to synthesize and output that combination.

According to the inventions of claims 1 and 5, prosodic information consisting of the allowable ranges of the fundamental frequency, amplitude, and duration of the phonemes located before and after any given phoneme, together with transition probabilities, is generated for the phoneme string obtained by decomposing the text data, and the speech unit data is searched on the basis of this prosodic information. Speech units whose waveform continuity is poor are therefore not connected, and degradation of the sound quality of the synthesized speech can be suppressed.

According to the invention of claim 2, analyzing the text data linguistically on the basis of various elements improves the reliability of the generated prosodic information, so the sound quality of the synthesized speech can be kept high.

According to the invention of claim 3, narrowing the information on prosody in the prosody data down to a preset range in advance makes it possible to exclude irregular connections from the speech unit data that is searched.

According to the invention of claim 4, calculating the ratio between the degrees of influence of the prosody cost and the connection cost according to the distribution of the transition probabilities ensures appropriate connections between phonemes, so the sound quality of the synthesized speech can be kept high.

Next, embodiments of the present invention will be described in detail with reference to the drawings as appropriate.
<Configuration of speech synthesizer>
FIG. 1 is a block diagram of the speech synthesizer. As shown in FIG. 1, the speech synthesizer 1 synthesizes speech from input text data and includes a language analysis unit (language analysis means) 3, a prosody generation unit (prosodic information generation means) 5, a prosody database (prosody data storage means) 7, a speech unit search unit (speech unit data search means) 9, a speech unit database (speech unit data storage means) 11, and a speech unit synthesis unit (speech unit data synthesis means) 13.

This speech synthesizer 1 performs waveform-concatenation speech synthesis using large-scale databases (diverse in data type and large in capacity), namely the prosody database 7 and the speech unit database 11. In waveform-concatenation synthesis, the speech unit data must be connected smoothly in order to improve the sound quality of the synthesized speech. When connecting speech unit data, however, it is preferable to avoid, as far as possible, modifying the prosody of the speech waveforms by signal processing. In other words, where the continuity of the speech waveforms is concerned, widening the range of combinations considered (selecting from many combinations of speech unit data) and using the stored speech unit data as it is inevitably yields natural-sounding speech. The sound quality of the synthesized speech is thus better when the speech unit data to be connected is not subjected to prosody modification by signal processing.

For this reason, the speech synthesizer 1 does not use absolute values of phoneme features as in the conventional approach. Instead, it searches for and connects the combination of speech unit data that sounds more natural on the basis of the prosodic information generated by the prosody generation unit 5 (described in detail later), that is, the allowable ranges that the phonemes to be connected may take and the transition probabilities.

The language analysis unit 3 linguistically analyzes the input text data, divides it into a phoneme string (the search units), and outputs the result to the prosody generation unit 5. In addition to the phoneme string, this analysis determines word boundaries, the part of speech of each word, accent phrase boundaries, the positions of accent nuclei, clause boundaries, phrase boundaries, the dependency relations between clauses, whether the sentence ends as a question, the extent of any quotation involving words in the text data, whether words in the text data stand in a contrastive or a parallel relation, and focus (important words).

In the phoneme string, at least three phonemes are consecutive. The first phoneme of the text data has no preceding phoneme and the last has no following phoneme, but this absence of a phoneme, that is, silence, is itself treated as a phoneme "sil" (the first letters of "silent"), so that the beginning and end of a sentence are also handled as three consecutive phonemes.
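The "sil" padding described above can be illustrated with a short Python sketch. The function name phoneme_triples and the representation of phonemes as plain strings are assumptions made for illustration; the source does not specify a data format.

```python
def phoneme_triples(phonemes, pad="sil"):
    """Return every window of three consecutive phonemes.

    A silence pseudo-phoneme is added at both ends so that the first
    and last phonemes also sit in the middle of a three-phoneme window.
    """
    padded = [pad] + list(phonemes) + [pad]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


if __name__ == "__main__":
    for triple in phoneme_triples(["a", "i", "u"]):
        print(triple)
```

With this padding, an utterance of n phonemes always yields n windows, one centered on each phoneme.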

The language analysis unit 3 may also determine, as needed, the importance of words and sentence characteristics such as whether a sentence is declarative or interrogative. For example, in conversational language (spoken language) as used in everyday dialogue, there are questions consisting of a single noun (for example, "What?" or "Where?"). When such a question is input as text data, the word does not carry its usual accent and intonation, so the language analysis unit 3 needs to determine the sentence characteristics and separate these cases (treat them as distinct words).

The prosody generation unit 5 generates prosodic information consisting of the allowable ranges of the fundamental frequency (F0), amplitude, and duration of the phonemes located before and after any given phoneme in the phoneme string divided by the language analysis unit 3, together with their transition probabilities.

The allowable range consists of the direction of change (whether the value increases or decreases over time) of the fundamental frequency (F0), amplitude, and duration when moving from the current phoneme to the next, and the range over which they change (the range of possible values). For the pitch of a phoneme, for example, the range of change is a span of fundamental frequency (for example, 30 Hz to 40 Hz). The transition probability is the frequency with which a given phoneme transitions to each phoneme within the allowable range, expressed as a probability for each such phoneme.
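As a rough illustration of how a direction of change, a range of change, and a transition probability might be estimated from observed data, the following Python sketch works on pairs of F0 values. The function name transition_stats, the input format, and the choice to count an unchanged value as "down" are all assumptions; the source does not describe the estimation procedure at this level of detail.

```python
from collections import Counter


def transition_stats(f0_pairs):
    """Summarize F0 changes between consecutive phonemes.

    f0_pairs: list of (f0_current_hz, f0_next_hz) observations.
    Returns the observed range of change and the relative frequency
    of rising versus falling transitions.
    """
    deltas = [nxt - cur for cur, nxt in f0_pairs]
    # Assumption: a zero change is counted as "down" to keep two classes.
    counts = Counter("up" if d > 0 else "down" for d in deltas)
    total = len(deltas)
    probs = {k: counts[k] / total for k in ("up", "down")}
    return {"range": (min(deltas), max(deltas)), "prob": probs}


if __name__ == "__main__":
    observed = [(100, 130), (100, 140), (120, 110), (110, 145)]
    print(transition_stats(observed))
```

The same summary could be computed for amplitude and duration by passing the corresponding value pairs.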

The prosody database 7 stores prosody data associating phonemes with information on prosody, and is implemented on a general-purpose recording medium. The information on prosody includes part of speech, accent phrase, presence or absence of an accent nucleus, word boundary, phrase boundary, clause boundary, breath-group boundary, dependency, and presence or absence of focus.

The prosody database 7 stores, as prosody data, the results of analyzing recorded speech of a number of sentences prepared in advance: acoustic analysis results (fundamental frequency (F0), spectrum, amplitude, phoneme duration, and so on) and linguistic analysis results (part of speech, word boundaries, syntactic structure, presence or absence of focus, and so on).

The prosody generation unit 5 and the prosody database 7 are now described with reference to FIG. 2. FIG. 2 schematically shows the prosody data stored in the prosody database 7 and the prosodic information generated by the prosody generation unit 5. As shown in FIG. 2, the prosody generation unit 5 clusters the prosody data stored in the prosody database 7 on the basis of the result of linguistically analyzing the input text data, and generates prosodic information consisting of allowable ranges and transition probabilities. In other words, when generating the prosodic information, it narrows the information on prosody down to a preset range.

As an example of this clustering, FIG. 2(a) shows the case in which attention is paid to the direction of change of the fundamental frequency (F0). The clustering obtains the fundamental frequency (F0), amplitude, and duration in units of morae (units consisting of a consonant phoneme and a vowel phoneme) and groups the data so that linguistic features are similar. In the resulting groups (hereinafter, clusters), the allowable range and transition probability for the phonemes in a word differ among three cases: (A) the word contains no accent nucleus and lies in the middle of an accent phrase; (B) the word contains an accent nucleus; and (C) the word is at a breath-group boundary.

In FIG. 2(a), the circles schematically represent phonemes. An arrow pointing upward from a circle indicates that the fundamental frequency (F0) of the connected phoneme increases, and an arrow pointing downward indicates that it decreases. The range covered by the arrows is the allowable range of the fundamental frequency (F0) of the connected phoneme.

FIG. 2(b) shows the direction of change (increase or decrease) and its transition probability for each of A, B, and C. As shown in FIG. 2(c), the prosody generation unit 5 obtains the median and the possible range for each of A, B, and C.
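The per-cluster median and range can be sketched in Python as follows. The labels "A", "B", and "C" follow the three cases in the text, but the function name cluster_summary, the input format, and the sample numbers are hypothetical and are not taken from FIG. 2.

```python
from collections import defaultdict
from statistics import median


def cluster_summary(samples):
    """Compute the median and range of F0 change for each cluster.

    samples: iterable of (cluster_label, delta_f0_hz) pairs, where the
    label is the linguistic class assigned to the phoneme.
    """
    groups = defaultdict(list)
    for label, delta in samples:
        groups[label].append(delta)
    return {
        label: {"median": median(values), "range": (min(values), max(values))}
        for label, values in groups.items()
    }


if __name__ == "__main__":
    # Illustrative values only.
    data = [("A", 5), ("A", 15), ("A", 10), ("B", -30), ("B", -20), ("C", -60)]
    for label, summary in cluster_summary(data).items():
        print(label, summary)
```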

Returning to FIG. 1, the description of the configuration of the speech synthesizer 1 continues.
The speech unit search unit 9 searches the speech unit data stored in the speech unit database 11 on the basis of the prosodic information generated by the prosody generation unit 5, and outputs the combination of speech unit data that minimizes the connection cost and the prosody cost.

The connection cost is the error in fundamental frequency (F0) and in spectrum at, and in the vicinity of, the connection point between speech unit data. In general, the connection cost quantifies how well the search units used in speech synthesis (here, speech unit data) join: a low value means a good join and a high value a poor one.
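One possible way to turn the F0 error and the spectral error at a connection point into a single number is a weighted sum, sketched below in Python. The weighted-sum form, the Euclidean spectral distance, the weights w_f0 and w_spec, and the dictionary layout of a unit boundary are all assumptions; the source states only that both errors are measured.

```python
import math


def connection_cost(left_end, right_start, w_f0=1.0, w_spec=1.0):
    """Score the join between the end of one unit and the start of the next.

    Each argument is a dict with "f0" (Hz) and "spectrum" (a sequence of
    spectral coefficients of equal length). Lower values mean a smoother join.
    """
    f0_error = abs(left_end["f0"] - right_start["f0"])
    spectral_error = math.dist(left_end["spectrum"], right_start["spectrum"])
    return w_f0 * f0_error + w_spec * spectral_error


if __name__ == "__main__":
    end_of_a = {"f0": 120.0, "spectrum": [1.0, 2.0]}
    start_of_i = {"f0": 125.0, "spectrum": [4.0, 6.0]}
    print(connection_cost(end_of_a, start_of_i))
```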

The prosody cost is obtained, for each phoneme pair, that is, each pair of consecutive speech unit data, from the transition probability for the direction of change of the fundamental frequency (F0) (the probability of rising or of falling) and the transition probability for the direction of change of the duration (the probability of lengthening or of shortening). In general, the prosody cost quantifies how continuous the prosody is across the search units used in speech synthesis: a low value means good continuity and a high value poor continuity.
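The source does not give a formula for the prosody cost. A common way to convert probabilities into costs is the negative logarithm, and the Python sketch below uses that as an assumption: a pair whose F0 and duration move in the likely direction gets a low cost, and a pair that moves in the unlikely direction gets a high one. The names direction and prosody_cost and the dictionary layout are hypothetical.

```python
import math


def direction(previous, following):
    """Classify a change as "up" or "down" (no change counts as "down")."""
    return "up" if following > previous else "down"


def prosody_cost(prev_unit, next_unit, f0_prob, dur_prob, eps=1e-6):
    """Negative log probability of the observed F0 and duration directions.

    prev_unit, next_unit: dicts with "f0" (Hz) and "dur" (seconds).
    f0_prob, dur_prob: dicts mapping "up" and "down" to transition
    probabilities predicted for this phoneme pair.
    """
    p_f0 = f0_prob[direction(prev_unit["f0"], next_unit["f0"])]
    p_dur = dur_prob[direction(prev_unit["dur"], next_unit["dur"])]
    # eps guards against log(0) when a direction was never observed.
    return -math.log(max(p_f0, eps)) - math.log(max(p_dur, eps))


if __name__ == "__main__":
    f0_prob = {"up": 0.8, "down": 0.2}
    dur_prob = {"up": 0.5, "down": 0.5}
    prev = {"f0": 120.0, "dur": 0.08}
    rising = {"f0": 130.0, "dur": 0.09}
    falling = {"f0": 110.0, "dur": 0.09}
    print(prosody_cost(prev, rising, f0_prob, dur_prob))
    print(prosody_cost(prev, falling, f0_prob, dur_prob))
```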

For example, even if the sounds join well (the connection cost is low), the prosody cost rises when the combination contains a phoneme pair (speech unit data pair) that falls outside the possible range for the direction of change. The speech unit search unit 9 excludes such phoneme pairs (speech unit data pairs) from the combinations of speech unit data.

Here, a "phoneme pair (speech unit data pair) that falls outside the possible range for the direction of change" means any of the following: a pair in which the fundamental frequency of the following phoneme is lower than that of the preceding phoneme although it has a high probability of being higher; a pair in which the fundamental frequency of the following phoneme is higher than that of the preceding phoneme although it has a high probability of being lower; a pair in which the duration of the following phoneme is shorter than that of the preceding phoneme although it has a high probability of being longer; and a pair in which the duration of the following phoneme is longer than that of the preceding phoneme although it has a high probability of being shorter.
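The four cases above reduce to one test applied separately to F0 and to duration: the observed direction contradicts the direction that has a high probability. The Python sketch below expresses that test. The function name is_out_of_range and the threshold of 0.5 used to decide what counts as a "high" probability are assumptions, since the source does not give a numeric criterion.

```python
def is_out_of_range(prev_value, next_value, p_up, threshold=0.5):
    """Return True when the observed change contradicts the likely direction.

    prev_value, next_value: F0 (or duration) of the preceding and
    following phonemes. p_up: predicted probability that the value
    increases; the probability of a decrease is taken as 1 - p_up.
    """
    rises = next_value > prev_value
    falls = next_value < prev_value
    if p_up > threshold and falls:
        return True
    if (1.0 - p_up) > threshold and rises:
        return True
    return False


if __name__ == "__main__":
    print(is_out_of_range(120.0, 110.0, p_up=0.9))  # falls although a rise is likely
    print(is_out_of_range(120.0, 130.0, p_up=0.9))  # rises as expected
```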

To improve the accuracy of the synthesized speech, the speech unit search unit 9 may recalculate the direction of prosodic change (allowable range and transition probability) on the basis of the prosody data output from the prosody generation unit 5 and the speech unit data stored in the speech unit database 11.

When the allowable range is wide, that is, wider than a preset range (an excess allowable range), or when the distribution of transition probabilities has no clear peak, the speech unit search unit 9 regards the direction of prosodic change as unimportant and calculates a degree of influence that reduces the effect of the prosody cost. In other words, the speech unit search unit 9 determines the ratio between the degrees of influence of the prosody cost and the connection cost. The higher this degree of influence, the more the effect of the prosody cost is diluted, and as a result the combination of speech unit data that minimizes the connection cost is preferentially adopted.
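A minimal Python sketch of this weighting follows. It assumes the total cost is the connection cost plus a weighted prosody cost, and that the weight drops to a small constant when the allowable range exceeds a preset width or the up/down probabilities are nearly equal. The names prosody_weight and total_cost, the peak measure, and the constants min_peak and reduced are hypothetical; the source does not specify how the ratio is computed.

```python
def prosody_weight(p_up, range_width, max_width, min_peak=0.2, reduced=0.1):
    """Choose how strongly the prosody cost should count.

    p_up: probability that the value rises (the fall probability is 1 - p_up).
    range_width: width of the allowable range; max_width: preset limit.
    Returns 1.0 normally, or the reduced weight when the range is too
    wide or the distribution has no clear peak.
    """
    peakedness = abs(2.0 * p_up - 1.0)  # 0 when flat, 1 when one direction is certain
    if range_width > max_width or peakedness < min_peak:
        return reduced
    return 1.0


def total_cost(connection, prosody, weight):
    """Combine the two costs; a small weight lets the connection cost dominate."""
    return connection + weight * prosody


if __name__ == "__main__":
    w = prosody_weight(p_up=0.55, range_width=10.0, max_width=20.0)
    print(w, total_cost(2.0, 4.0, w))
```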

The speech unit database 11 stores speech unit data associating phonemes with speech waveforms (speech waveform data), and is implemented on a general-purpose recording medium.

The speech unit synthesis unit 13 connects the speech unit data according to the combination found by the speech unit search unit 9 and outputs the result as synthesized speech.

With reference to FIG. 3, the difference between the conventional method and the method of the present invention in connecting the speech unit data held in the speech unit database 11 is now described. FIG. 3 schematically shows speech unit data held in the speech unit database 11 and how the speech unit data is connected under predicted prosody (conventional and present invention). FIG. 3(a) shows the fundamental frequency (F0) of the speech unit data for "a" (ア), "i" (イ), and "u" (ウ); the wavy lines associated with each of them (three shown for each) represent the shapes of the speech waveforms.

In FIG. 3(b), where the conventional predicted prosody is shown, the speech segment data "a" (ア), "i" (イ), and "u" (ウ) are represented as black dots (x, y, z). Where the predicted prosody of the present invention is shown, they are represented as angles α and β that span the allowable range of the next consecutive speech segment data.

When the speech segment data "a", "i", and "u" shown in FIG. 3(a) are to be connected, the conventional predicted prosody, as shown in FIG. 3(b), selects the speech segment data closest to the absolute values (the black dots in FIG. 3(b)) even if those data connect poorly. In contrast, the predicted prosody of the present invention uses the allowable range of the next speech segment data to select speech segment data whose direction of pitch change matches and which connect well (that is, which minimize the connection cost and the prosody cost).
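The contrast between the two kinds of predicted prosody can be sketched as two cost functions. This is only an illustration under stated assumptions: the source expresses the allowable range as angles α and β in FIG. 3(b), whereas this sketch substitutes a lower and upper bound on the F0 change from the previous segment, and the names `absolute_cost` and `range_cost` are hypothetical.

```python
def absolute_cost(f0, target_f0):
    """Conventional style: distance from a single absolute target value."""
    return abs(f0 - target_f0)


def range_cost(prev_f0, f0, delta_lo, delta_hi):
    """Range style (assumed form): zero cost when the F0 change from the
    previous segment lies inside the allowable range, otherwise the
    distance to the nearest edge of that range."""
    delta = f0 - prev_f0
    if delta_lo <= delta <= delta_hi:
        return 0.0
    return float(min(abs(delta - delta_lo), abs(delta - delta_hi)))


# Previous segment at 120 Hz; the allowable change is a rise of 5 to 25 Hz.
rising = range_cost(120, 130, 5, 25)    # inside the range
falling = range_cost(120, 110, 5, 25)   # wrong direction of change
```

Under the range-based cost, any candidate that rises by an allowed amount is equally acceptable on prosody, so the connection cost is free to pick the candidate whose waveform joins best.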

Returning to FIG. 1: in the speech synthesizer 1, for the phoneme string into which the language analysis unit 3 decomposes the text data, the prosody generation unit 5 generates prosodic information consisting of the allowable ranges of the fundamental frequency, amplitude, and duration of the phonemes located before and after any given phoneme, together with transition probabilities. The speech segment search unit 9 then searches the speech segment data on the basis of this prosodic information. As a result, speech segments with poor waveform continuity are not connected, and degradation of the sound quality of the synthesized speech can be suppressed.

Further, in the speech synthesizer 1, the language analysis unit 3 analyzes the input text data on the basis of a variety of features, which improves the reliability of the generated prosodic information and keeps the sound quality of the synthesized speech high.

Further, in the speech synthesizer 1, the prosodic data are narrowed down in advance to those whose prosody-related information falls within a preset range. This reduces the amount of speech segment data to be searched and improves the speed at which synthesized speech is generated.

Furthermore, in the speech synthesizer 1, when the allowable range is wider than the preset range (an excess allowable range) or when the transition probabilities have no peak, the speech segment search unit 9 calculates the degree of influence of the prosody cost on the allowable range according to the excess allowable range or the distribution of the transition probabilities. This secures appropriate connections between phonemes and keeps the sound quality of the synthesized speech high.

<Operation of speech synthesizer>
Next, the operation of the speech synthesizer 1 is described with reference to the flowchart shown in FIG. 4 (see also FIG. 1 as appropriate).
First, the speech synthesizer 1 uses the language analysis unit 3 to perform language analysis on the input text data and outputs the language analysis result to the prosody generation unit 5 (step S1).

Next, the speech synthesizer 1 uses the prosody generation unit 5 to generate prosodic information consisting of allowable ranges and transition probabilities, based on the language analysis result output from the language analysis unit 3 and the prosodic data stored in the prosody database 7, and outputs it to the speech segment search unit 9 (step S2). The speech synthesizer 1 then uses the speech segment search unit 9 to search the speech segment data stored in the speech segment database 11, based on the prosodic information output from the prosody generation unit 5, for the combination of speech segment data that minimizes the connection cost and the prosody cost, and outputs that combination to the speech segment synthesis unit 13 (step S3).
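The search in step S3 can be sketched as a dynamic-programming pass over the candidates for each phoneme. The source does not say which search algorithm is used, so this Viterbi-style approach is an assumption, as are the function name `search_segments`, the representation of each candidate by a single F0 value, and the toy cost functions in the example.

```python
def search_segments(candidates, prosody_cost, connection_cost):
    """Return (total_cost, path) minimizing prosody plus connection cost.

    candidates: one list of candidate segments per phoneme position.
    prosody_cost(i, c): cost of candidate c at position i.
    connection_cost(a, b): cost of joining segment a to segment b.
    """
    # Best (cost, path) ending in each candidate of the first position.
    best = [(prosody_cost(0, c), [c]) for c in candidates[0]]
    for i in range(1, len(candidates)):
        new_best = []
        for c in candidates[i]:
            # Cheapest way to reach c from any candidate of position i - 1.
            cost, path = min(
                ((pc + connection_cost(pp[-1], c), pp) for pc, pp in best),
                key=lambda t: t[0],
            )
            new_best.append((cost + prosody_cost(i, c), path + [c]))
        best = new_best
    return min(best, key=lambda t: t[0])


# Toy example: candidates are F0 values in Hz (illustrative numbers only).
targets = [130, 150, 160]
cands = [[100, 130], [110, 150], [115, 160]]
total, path = search_segments(
    cands,
    prosody_cost=lambda i, c: abs(c - targets[i]) / 10,
    connection_cost=lambda a, b: abs(a - b),
)
```

In this toy setting the smoothly connected path `[100, 110, 115]` wins over the path that hits every target exactly, because the prosody cost is weighted lightly relative to the connection cost.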

Finally, the speech synthesizer 1 uses the speech segment synthesis unit 13 to synthesize (connect) the combination of speech segment data found by the speech segment search unit 9 and outputs the result as synthesized speech (step S4).
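Steps S1 to S4 amount to a four-stage pipeline, which can be sketched as follows. The function `synthesize` and its four callable parameters are hypothetical stand-ins for the language analysis unit 3, prosody generation unit 5, speech segment search unit 9, and speech segment synthesis unit 13. The source does not define such an interface.

```python
def synthesize(text, analyze, generate_prosody, search, concatenate):
    """Run the four steps in order; each stage is passed in as a callable."""
    analysis = analyze(text)              # S1: language analysis
    prosody = generate_prosody(analysis)  # S2: allowable ranges + transition probabilities
    segments = search(prosody)            # S3: lowest-cost segment combination
    return concatenate(segments)          # S4: connect segments into speech


# Stub stages, only to show the data flow (not real signal processing).
result = synthesize(
    "aiu",
    analyze=lambda text: list(text),
    generate_prosody=lambda phonemes: [(p, None) for p in phonemes],
    search=lambda prosody: [p.upper() for p, _ in prosody],
    concatenate=lambda segments: "".join(segments),
)
```

Passing the stages as callables keeps the sketch independent of any particular database or cost function, in line with the remark in the text that the device can equally be realized as a program.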

The embodiment of the present invention has been described above, but the present invention is not limited to it. For example, although the embodiment was described as the speech synthesizer 1, the processing of each component of the device 1 can also be implemented as a speech synthesis program written in a general-purpose or special-purpose computer language. In that case, the same effects as those of the speech synthesizer 1 can be obtained.

FIG. 1 is a block diagram of a speech synthesizer according to an embodiment of the present invention.
FIG. 2 is a diagram schematically showing the prosody database and the prosody generation unit.
FIG. 3 is a diagram schematically showing the speech segment database.
FIG. 4 is a flowchart for explaining the operation of the speech synthesizer shown in FIG. 1.

Explanation of symbols

1 Speech synthesizer
3 Language analysis unit (language analysis means)
5 Prosody generation unit (prosodic information generation means)
7 Prosody database (prosodic data storage means)
9 Speech segment search unit (speech segment data search means)
11 Speech segment database (speech segment data storage means)
13 Speech segment synthesis unit (speech segment data synthesis means)

Claims

A speech synthesizer for synthesizing speech from input text data, comprising:
prosodic data storage means for storing prosodic data in which phonemes are associated with prosody-related information;
speech segment data storage means for storing speech segment data in which the phonemes are associated with speech waveforms;
language analysis means for performing language analysis on the text data and decomposing the text data into a phoneme string;
prosodic information generation means for generating, for the phoneme string decomposed by the language analysis means and using the prosodic data stored in the prosodic data storage means, prosodic information consisting of the allowable ranges of the fundamental frequency, amplitude, and duration of the phonemes located before and after any given phoneme, together with transition probabilities;
speech segment data search means for searching the speech segment data stored in the speech segment data storage means on the basis of the prosodic information generated by the prosodic information generation means, and outputting the combination of speech segment data that minimizes the connection cost and the prosody cost; and
speech segment data synthesis means for synthesizing and outputting the combination of speech segment data output by the speech segment data search means.

The speech synthesizer according to claim 1, wherein the language analysis means, when performing language analysis on the text data, analyzes at least one of: the type of phoneme or part of speech, the dependency distance obtained by syntactic analysis, the presence or absence of punctuation marks, the presence or absence of an accent nucleus, the position of the phrase or word within the breath group, and the number of morae in the word.

3. The speech synthesizer wherein the prosodic information generation means, when generating the prosodic information, uses the prosodic data whose prosody-related information falls within a preset range.

2. The speech synthesizer according to claim 3, wherein the prosodic information generation means, when generating the prosodic information, calculates the ratio of the degree of influence between the prosody cost and the connection cost according to the distribution of the transition probabilities.

A speech synthesis program for synthesizing speech from input text data, the program causing a computer that comprises prosodic data storage means for storing prosodic data in which phonemes are associated with prosody-related information, and speech segment data storage means for storing speech segment data in which the phonemes are associated with speech waveforms, to function as:
language analysis means for performing language analysis on the text data and decomposing the text data into a phoneme string;
prosodic information generation means for generating, for the phoneme string decomposed by the language analysis means and using the prosodic data stored in the prosodic data storage means, prosodic information consisting of the allowable ranges of the fundamental frequency, amplitude, and duration of the phonemes located before and after any given phoneme, together with transition probabilities;
speech segment data search means for searching the speech segment data stored in the speech segment data storage means on the basis of the prosodic information generated by the prosodic information generation means, and outputting the combination of speech segment data that minimizes the connection cost and the prosody cost; and
speech segment data synthesis means for synthesizing and outputting the combination of speech segment data output by the speech segment data search means.