JP4841339B2

JP4841339B2 - Prosody correction device, speech synthesis device, prosody correction method, speech synthesis method, prosody correction program, and speech synthesis program

Info

Publication number: JP4841339B2
Application number: JP2006188406A
Authority: JP
Inventors: 洋一郎八幡; 俊夫赤羽
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2006-07-07
Filing date: 2006-07-07
Publication date: 2011-12-21
Anticipated expiration: 2026-07-07
Also published as: JP2008015362A

Abstract

PROBLEM TO BE SOLVED: To generate intonation that expresses a question in a prosody correction device or a speech synthesis device without changing the processing according to the accent type of the input text data, without generating various pattern data in advance, and without storing a large amount of data such as pattern data for prosody correction.

SOLUTION: Among the pitches assigned in advance to the phonetic symbols of the text data, the pitch (PitL) of the terminal phoneme is corrected to a first pitch (PitL1) and a second pitch (PitL2). The values of the first and second pitches are calculated using the maximum value (PitP) of the pitches in the pre-assigned pitch series.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to prosody correction, and more particularly to a prosody correction device, a speech synthesis device, a prosody correction method, a speech synthesis method, a prosody correction program, and a speech synthesis program for correcting the prosody obtained by analyzing input text data.

Techniques for outputting text data expressed as electronic data as speech have long been studied, and in recent years products such as mobile phones with a function for reading incoming e-mail aloud in synthesized speech have been commercialized.

As one example of a conventional speech synthesizer, Non-Patent Document 1 discloses a speech synthesizer that, from an input character/symbol sequence, detects word boundaries, divides the sequence into words, converts the words into phoneme symbol strings, and assigns word accent and sentence intonation (prosody generation). It then generates the control signals needed to drive a speech synthesizer based on the phoneme symbol strings and the prosodic information, using previously stored speech data and rules. The speech synthesis technique disclosed in Non-Patent Document 1 is the well-known rule-based speech synthesis technique itself.

As another example of a conventional speech synthesizer, Patent Document 1 discloses a pitch pattern generation method that receives language attribute and phoneme duration information generated by analyzing an input sentence, generates a basic pitch pattern for generating a synthesized speech signal, selects one correction pattern for correcting the basic pitch pattern from a correction pattern dictionary containing a plurality of correction patterns generated in advance, and generates a corrected pitch pattern by adding the selected correction pattern to the basic pitch pattern with their end positions aligned.
Patent Document 1: JP 2004-226505 A
Non-Patent Document 1: Shuzo Saito and Kazuo Nakata, "Basics of Speech Information Processing", 1st edition, Ohmsha, Ltd., November 30, 1981, pp. 166-171

The technique disclosed in Non-Patent Document 1, which generates prosody from an input character/symbol sequence to produce synthesized speech, generally generates prosody that assumes a declarative sentence. In other words, with the technique of Non-Patent Document 1 it is difficult to express effectively, in synthesized speech, sentence-final features that convey the speaker's intention, such as those of an interrogative sentence.

One can also readily conceive of expressing an interrogative sentence by correcting the terminal pitch (fundamental frequency) of the generated prosody, either adding a predetermined value to it or multiplying it by a predetermined factor.

However, this approach of adding a predetermined value to (or multiplying) the terminal pitch has a problem. Suppose the value is set so that a type 1 accent (the accent type whose accent nucleus is on the first mora) yields natural-sounding interrogative synthesized speech. Applying the same value to a type 0 accent (no accent nucleus) then makes the pitch at the end of the word abnormally high. Conversely, suppose the value is set so that a type 0 accent yields natural-sounding interrogative synthesized speech. Applying the same value to a type 1 accent (or any accent type other than type 0, such as type 2 or type 3) then fails to raise the pitch at the end of the word enough, and intonation that expresses a question cannot be obtained.
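The mismatch described above can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than data from the patent: the per-mora pitch contours, the 80 Hz offset, and the helper name `raise_terminal` are all made up for the example.

```python
from typing import List

# Hypothetical offset, imagined as tuned so that a type 1 accent sounds natural.
FIXED_OFFSET_HZ = 80.0

# Hypothetical per-mora pitch contours (Hz) for a four-mora word.
TYPE1 = [200.0, 150.0, 130.0, 110.0]  # accent nucleus on the first mora: pitch falls
TYPE0 = [130.0, 180.0, 185.0, 190.0]  # no accent nucleus: pitch stays high


def raise_terminal(contour: List[float], offset: float) -> List[float]:
    """Return a copy of the contour with a fixed offset added to the last pitch."""
    corrected = list(contour)
    corrected[-1] += offset
    return corrected


for name, contour in (("type 1", TYPE1), ("type 0", TYPE0)):
    corrected = raise_terminal(contour, FIXED_OFFSET_HZ)
    # How far the corrected terminal pitch ends up above the contour's own peak.
    overshoot = corrected[-1] - max(contour)
    print(name, "terminal:", corrected[-1], "relative to peak:", overshoot)
```

With these made-up numbers the type 1 word ends at 190 Hz, just under its own 200 Hz peak, while the type 0 word ends at 270 Hz, 80 Hz above its peak. The same constant therefore produces very different results depending on where the uncorrected contour ends.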

The technique disclosed in Patent Document 1, which selects from a plurality of correction patterns generated in advance and adds the selected pattern to the basic pitch pattern, can obtain intonation that expresses a question by generating pitch correction patterns for interrogative sentences and the like in advance. It can also handle the accent-type differences described above by generating correction patterns for each accent type in advance.

However, correction patterns for a wide variety of conditions must be generated in advance, and the generated patterns must be stored. This requires laborious work, and the memory needed to hold the correction patterns raises the cost of the speech synthesizer.

The present invention has been made in view of the above problems. Its object is to provide a prosody correction device, a speech synthesis device, a prosody correction method, a speech synthesis method, a prosody correction program, and a speech synthesis program that can generate intonation expressing a question without changing the processing according to the accent type of the input text data, without creating various pattern data in advance, and without holding a large amount of data such as pattern data for prosody correction.

A prosody correction device according to the present invention comprises: pitch peak extraction means for extracting a maximum value from the sentence end of a pitch sequence assigned in advance to a phoneme symbol sequence; terminal phoneme pitch generation means for generating, using the extracted maximum pitch value, a terminal phoneme pitch, which is a pitch to be assigned to the terminal phoneme of the phoneme symbol sequence; and terminal phoneme pitch correction means for correcting the pitch sequence assigned in advance to the phoneme symbol sequence by assigning the generated terminal phoneme pitch to the pitch of the terminal phoneme of the phoneme symbol sequence. The terminal phoneme pitch generation means calculates a reference pitch corresponding to a value between the extracted maximum pitch value and the pitch assigned in advance to the terminal phoneme of the phoneme symbol sequence, and generates a first pitch and a second pitch as the terminal phoneme pitch. The first pitch is generated using a predetermined value representing the amount of pitch change in the terminal phoneme of the phoneme symbol sequence and is smaller than the reference pitch. The second pitch is generated using the predetermined value and is larger than the reference pitch. The amount of change between the first pitch and the second pitch is calculated using the predetermined value and is the difference or ratio between the first pitch and the second pitch.

In the prosody correction device according to the present invention, the pitch peak extraction means preferably extracts the maximum value from the sentence-final accent phrase or the sentence-final breath group of the pitch sequence assigned in advance to the phoneme symbol sequence.
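Restricting the peak search to the sentence-final accent phrase can be sketched as follows. This is a minimal illustration, not the patented implementation. The function name `extract_pitch_peak`, the representation of the pitch sequence as one value per phoneme in Hz, and the list of accent-phrase start indices are all assumptions made for the example.

```python
from typing import Sequence


def extract_pitch_peak(pitches: Sequence[float],
                       phrase_starts: Sequence[int]) -> float:
    """Return PitP, the maximum pitch within the sentence-final accent phrase.

    pitches       -- pitch assigned in advance to each phoneme (assumed unit: Hz)
    phrase_starts -- index of the first phoneme of each accent phrase, ascending
                     (a hypothetical representation of the phrase boundaries)
    """
    if not pitches:
        raise ValueError("pitch sequence is empty")
    # Fall back to the whole sequence when no phrase boundary is known.
    start = phrase_starts[-1] if phrase_starts else 0
    return max(pitches[start:])


# Hypothetical values: two accent phrases, the second starting at index 4.
pitches = [120.0, 180.0, 170.0, 150.0, 130.0, 160.0, 140.0, 110.0]
print(extract_pitch_peak(pitches, [0, 4]))  # 160.0
```

Note that the global maximum (180.0) lies in the first phrase and is ignored. Only the final phrase contributes to PitP in this sketch.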

In the prosody correction device according to the present invention, the terminal phoneme pitch generation means preferably sets the first pitch to a value smaller than the extracted maximum pitch value and sets the second pitch to a value larger than the extracted maximum pitch value.

In the prosody correction device according to the present invention, let PitP be the extracted maximum value, PitL the pitch assigned in advance to the terminal phoneme of the phoneme symbol sequence, PTDIFF the predetermined value representing the amount of pitch change, and m, M, n, and N values relating to the degree to which the pitch of the terminal phoneme is corrected. The terminal phoneme pitch generation means preferably uses these values to calculate the reference pitch PitS according to
PitS = PitL + (PitP - PitL) × m / M
and to calculate PDif, the amount of change between the first pitch and the second pitch, according to
PDif = PTDIFF × n / N
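The two formulas can be sketched in a few lines. The expressions for PitS and PDif follow the text. How PDif is split between the first and second pitch is not stated in this passage, so the even split around PitS below is an assumption chosen only to satisfy the stated constraints (first pitch below PitS, second pitch above it, difference equal to PDif). The function name `terminal_pitches` is also hypothetical, and the sketch covers only the "difference" reading of the change amount, not the "ratio" reading.

```python
from typing import Tuple


def terminal_pitches(pit_p: float, pit_l: float, ptdiff: float,
                     m: int, M: int, n: int, N: int) -> Tuple[float, float]:
    """Return (PitL1, PitL2), the first and second pitch of the terminal phoneme."""
    if M == 0 or N == 0:
        raise ValueError("M and N must be non-zero")
    # Reference pitch: equals PitL when m = 0 and PitP when m = M.
    pit_s = pit_l + (pit_p - pit_l) * m / M
    # Change amount between the first and second pitch.
    p_dif = ptdiff * n / N
    # ASSUMPTION: split the change amount evenly around the reference pitch.
    pit_l1 = pit_s - p_dif / 2
    pit_l2 = pit_s + p_dif / 2
    return pit_l1, pit_l2


# Hypothetical values in Hz, with full correction (m = M, n = N).
print(terminal_pitches(200.0, 100.0, 80.0, m=1, M=1, n=1, N=1))  # (160.0, 240.0)
```

Because PitS is derived from the peak PitP of the pre-assigned sequence, the same call works for any accent type without a per-type table. In this example the result also brackets PitP (160 < 200 < 240).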

In the prosody correction device according to the present invention, the values m, M, n, and N relating to the degree of correction of the terminal phoneme pitch preferably satisfy m = n and M = N.

The prosody correction device according to the present invention preferably further comprises power correction means for correcting the power assigned in advance to the terminal phoneme of the phoneme symbol sequence, or phoneme duration correction means for correcting the duration assigned in advance to the terminal phoneme of the phoneme symbol sequence.

In the prosody correction device according to the present invention, the power correction means preferably corrects the power of the terminal phoneme using the power assigned in advance to the terminal phoneme.

In the prosody correction device according to the present invention, the power correction means preferably generates, by correcting the power of the terminal phoneme, a first power and a second power to be assigned to the terminal phoneme. The second power is generated based on a predetermined value representing an amount of attenuation relative to the power assigned in advance to the terminal phoneme, and is smaller than that pre-assigned power.

In the prosody correction device according to the present invention, the phoneme duration correction means preferably corrects the phoneme duration of the terminal phoneme using the phoneme duration assigned in advance to the terminal phoneme.

In the prosody correction device according to the present invention, the phoneme duration correction means preferably generates, by correcting the phoneme duration of the terminal phoneme, a first phoneme duration and a second phoneme duration whose values are smaller than the phoneme duration assigned in advance to the terminal phoneme, based on a predetermined value representing a proportion of that pre-assigned phoneme duration.
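The power and duration corrections can be sketched together. The text fixes only the constraints: the second power is smaller than the pre-assigned power and derived from an attenuation value, and the durations are derived from a proportion of the pre-assigned duration. Everything else below is an assumption made for illustration: the multiplicative form of the attenuation, leaving the first power unchanged, using a separate proportion for each duration and keeping both below the original, the default constants, and the names `TerminalSegments` and `correct_power_and_duration`.

```python
from typing import NamedTuple


class TerminalSegments(NamedTuple):
    """Corrected values for the two parts of the terminal phoneme (hypothetical type)."""
    power1: float
    power2: float
    duration1: float
    duration2: float


def correct_power_and_duration(power: float, duration: float,
                               attenuation: float = 0.5,
                               ratio1: float = 0.5,
                               ratio2: float = 0.75) -> TerminalSegments:
    """Derive first/second power and duration from the pre-assigned values.

    attenuation, ratio1 and ratio2 are hypothetical constants in (0, 1).
    """
    for value in (attenuation, ratio1, ratio2):
        if not 0.0 < value < 1.0:
            raise ValueError("attenuation and ratios must lie strictly between 0 and 1")
    # ASSUMPTION: the first power keeps the pre-assigned value.
    power1 = power
    # ASSUMPTION: attenuation is applied as a multiplicative factor.
    power2 = power * attenuation
    # ASSUMPTION: each duration is its own proportion of the original,
    # so both stay smaller than the pre-assigned duration.
    duration1 = duration * ratio1
    duration2 = duration * ratio2
    return TerminalSegments(power1, power2, duration1, duration2)


segments = correct_power_and_duration(power=8.0, duration=100.0)
print(segments)
```

Like the pitch correction, this uses only values already assigned to the terminal phoneme plus a few constants, so no stored pattern data is involved.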

A speech synthesis device according to the present invention comprises: text analysis means for analyzing input text data to obtain phoneme symbols for the text data and sentence-final prosody control information indicating how the sentence-final prosody is to be controlled; prosody generation means for generating, based on the obtained phoneme symbols, prosodic information including at least a pitch sequence of the text data; pitch peak extraction means for extracting a maximum pitch value from the sentence end of the pitch sequence generated by the prosody generation means; terminal phoneme pitch generation means for generating, using the extracted maximum pitch value, a terminal phoneme pitch, which is a pitch to be assigned to the terminal phoneme of the text data; prosody correction means for correcting the prosodic information of the text data generated by the prosody generation means by assigning the terminal phoneme pitch to the pitch of the terminal phoneme of the text data; and synthesis means for synthesizing a speech signal for the text data using the obtained phoneme symbols and the corrected prosodic information. The terminal phoneme pitch generation means calculates a reference pitch corresponding to a value between the extracted maximum pitch value and the pitch assigned in advance to the terminal phoneme of the phoneme symbol sequence, and generates a first pitch and a second pitch as the terminal phoneme pitch. The first pitch is generated using a predetermined value representing the amount of pitch change in the terminal phoneme of the phoneme symbol sequence and is smaller than the reference pitch. The second pitch is generated using the predetermined value and is larger than the reference pitch. The amount of change between the first pitch and the second pitch is calculated using the predetermined value and is the difference or ratio between the first pitch and the second pitch.
In the speech synthesis device according to the present invention, the pitch peak extraction means preferably extracts the maximum value from the sentence-final accent phrase or the sentence-final breath group of the pitch sequence assigned in advance to the phoneme symbol sequence.

The speech synthesis device according to the present invention preferably further comprises speech unit storage means for storing a plurality of speech units, each representing speech information corresponding to one or more phoneme symbols.

A prosody correction method according to the present invention comprises the steps of: extracting a maximum pitch value from the sentence end of a pitch sequence assigned in advance to a phoneme symbol sequence; generating, using the extracted maximum pitch value, a terminal phoneme pitch, which is a pitch to be assigned to the terminal phoneme of the phoneme symbol sequence; and assigning the terminal phoneme pitch to the terminal phoneme of the phoneme symbol sequence. The step of generating the terminal phoneme pitch calculates a reference pitch corresponding to a value between the extracted maximum pitch value and the pitch assigned in advance to the terminal phoneme of the phoneme symbol sequence, and further generates a first pitch and a second pitch as the terminal phoneme pitch. The first pitch is generated using a predetermined value representing the amount of pitch change in the terminal phoneme of the phoneme symbol sequence and is smaller than the reference pitch. The second pitch is generated using the predetermined value and is larger than the reference pitch. The amount of change between the first pitch and the second pitch is calculated using the predetermined value and is the difference or ratio between the first pitch and the second pitch.
In the prosody correction method according to the present invention, the step of extracting the maximum pitch value preferably extracts the maximum value from the sentence-final accent phrase or the sentence-final breath group of the pitch sequence assigned in advance to the phoneme symbol sequence.

A prosody correction program according to the present invention is a program for correcting the prosody of the terminal phoneme of a phoneme symbol sequence, and implements the prosody correction method described above on a computer.

A speech synthesis method according to the present invention comprises the steps of: analyzing input text data to obtain phoneme symbols for the text data and sentence-final prosody control information indicating how the sentence-final prosody is to be controlled; generating, based on the obtained phoneme symbols, prosodic information including at least a pitch sequence of the text data; extracting a maximum pitch value from the sentence end of the generated pitch sequence; generating, using the extracted maximum pitch value, a terminal phoneme pitch, which is a pitch to be assigned to the terminal phoneme of the text data; correcting the generated prosodic information by assigning the terminal phoneme pitch to the terminal phoneme of the text data; and synthesizing a speech signal for the text data using the obtained phoneme symbols and the corrected prosodic information. The step of generating the terminal phoneme pitch calculates a reference pitch corresponding to a value between the extracted maximum pitch value and the pitch assigned in advance to the terminal phoneme of the phoneme symbol sequence, and further generates a first pitch and a second pitch as the terminal phoneme pitch. The first pitch is generated using a predetermined value representing the amount of pitch change in the terminal phoneme of the phoneme symbol sequence and is smaller than the reference pitch. The second pitch is generated using the predetermined value and is larger than the reference pitch. The amount of change between the first pitch and the second pitch is calculated using the predetermined value and is the difference or ratio between the first pitch and the second pitch.
In the speech synthesis method according to the present invention, the step of extracting the maximum pitch value preferably extracts the maximum value from the sentence-final accent phrase or the sentence-final breath group of the pitch sequence assigned in advance to the phoneme symbol sequence.

A speech synthesis program according to the present invention is a program that synthesizes a speech signal based on input text data, and implements the speech synthesis method described above on a computer.

According to the present invention, a terminal phoneme pitch (a pitch to be assigned to the terminal phoneme of a phoneme symbol sequence) is generated using the maximum pitch value extracted from the pitch sequence assigned in advance to the phoneme symbol sequence. Assigning this terminal phoneme pitch to the terminal phoneme of the phoneme symbol sequence corrects the pitch corresponding to the terminal phoneme in the pre-assigned pitch sequence. In other words, because the terminal phoneme pitch is generated using the maximum of the pitches assigned in advance to the phoneme symbol sequence, there is no need to change the processing according to the accent type of the input text data, no need to create various pattern data in advance, and no need to hold a large amount of data such as pattern data for prosody correction.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[Configuration of speech synthesizer]
First, the configuration of a speech synthesizer according to an embodiment of the present invention (a speech synthesizer that includes a prosody correction device according to an embodiment of the present invention) will be described. FIG. 1 is a diagram showing the hardware configuration of the speech synthesizer.

Referring to FIG. 1, speech synthesizer 100 includes a central processing unit 110 that controls its overall operation. Central processing unit 110 includes a correction/synthesis unit 111, which performs processing for correcting the prosodic information of text data and processing for synthesizing a speech signal for the text data using the phoneme symbols of the text data and the corrected prosodic information.

Speech synthesizer 100 further includes: an antenna 120 used to exchange information with external devices; a communication control unit 121 that controls transmission and reception of information via antenna 120; a flash memory 130 that stores information such as various setting values and e-mail documents received from external devices; an input unit 140 operated to enter information from outside; a prosody information table storage unit 150 that stores a prosody information table described later; a ROM (Read Only Memory) 160 that stores the program executed by correction/synthesis unit 111 and various data such as a speech unit dictionary; a display unit 170 composed of a liquid crystal display or the like; a speaker 181; and an amplifier 180 for outputting sound through speaker 181. Input unit 140 consists of, for example, suitably arranged buttons or, when display unit 170 forms a touch panel, touch buttons displayed on display unit 170.

FIG. 2 is a functional block diagram illustrating the configuration of correction/synthesis unit 111 of FIG. 1.
Referring to FIG. 2, correction/synthesis unit 111 receives text data received via antenna 120, text data stored in flash memory 130, or text data stored in ROM 160. It analyzes the text data, corrects the prosody corresponding to the final portion of the text data based on the analysis result, generates a synthesized speech signal using the prosodic information containing the corrected prosody together with the phoneme symbols of the text data, and outputs the signal as speech data. The configuration of correction/synthesis unit 111 is described below.

In correction/synthesis unit 111, the text data is first input to text analysis unit 1. The text data input here is electronic text data that can be handled as digital data, consisting of a character code sequence expressed in, for example, ASCII (American Standard Code for Information Interchange) code or SJIS (Shift Japan Industrial Standard) code. Text analysis unit 1 performs analysis, including morphological analysis and syntactic analysis, on the input text data and generates: a phoneme symbol string, which is a sequence of phoneme symbols representing the reading; language attribute information representing language attributes such as accent type, part of speech, number of morae, and dependency relations; and sentence-final prosody control information indicating how the sentence-final prosody is to be controlled, for example as a question.

The speech unit dictionary (stored in ROM 160 of FIG. 1) stores, for each phoneme, the speech waveform information needed for synthesis. The language attribute information described above also includes information identifying, among the phonemes stored in the speech unit dictionary, the phoneme corresponding to each phoneme symbol. The dictionary used to generate synthesized speech need not store the speech waveforms and other information needed for synthesis per phoneme. The information may instead be stored per speech unit defined in some other way, for example per syllable (CV), VC (vowel-consonant), or VCV (vowel center-consonant-vowel center) unit. Likewise, the dictionary need not store the speech waveform information itself. It may store speech unit information defined in another form, for example parameters for generating the speech waveforms needed for synthesis (such as sound source and spectrum parameters).

Text analysis unit 1 is connected to prosody generation unit 2 and synthesis unit 4, and outputs the generated phoneme symbol string and language attribute information to both. Text analysis unit 1 is also connected to prosody correction unit 3 and outputs the generated sentence-final prosody control information to it. Text analysis unit 1 is implemented by, for example, a dedicated LSI (Large Scale Integration) circuit.

韻律生成部２は、パワー生成部２１、音韻継続時間長生成部２２、および、ピッチ生成部２３から構成される。 The prosody generation unit 2 includes a power generation unit 21, a phoneme duration generation unit 22, and a pitch generation unit 23.

パワー生成部２１は、テキスト解析部１から音韻記号列と言語属性情報を入力される。そして、音韻記号列と言語属性情報に基づいて、入力された音韻記号列中の各音韻記号に対応するパワー情報（波形の振幅を決定する際に用いる情報）を生成する。パワー生成部２１は、韻律補正部３に接続され、生成したパワー情報を韻律補正部３に出力する。なお、パワー生成部２１において生成される、各音韻記号に対応するパワー情報を、以下「標準パワー情報」と呼ぶ。 The power generation unit 21 receives the phoneme symbol string and the language attribute information from the text analysis unit 1. Then, based on the phoneme symbol string and the language attribute information, power information (information used when determining the amplitude of the waveform) corresponding to each phoneme symbol in the input phoneme symbol string is generated. The power generation unit 21 is connected to the prosody correction unit 3 and outputs the generated power information to the prosody correction unit 3. The power information generated by the power generation unit 21 and corresponding to each phoneme symbol is hereinafter referred to as “standard power information”.

音韻継続時間長生成部２２は、テキスト解析部１から音韻記号列と言語属性情報とを入力され、音韻記号列と言語属性情報に基づいて、音韻記号列中の各音韻記号に対応する継続時間長を表す音韻継続時間長情報を生成する。音韻継続時間長生成部２２は、ピッチ生成部２３と韻律補正部３に接続され、生成した音韻継続時間長情報をピッチ生成部２３と韻律補正部３に出力する。なお、言語継続時間長生成部２２が生成した各音韻記号に対応する継続時間長情報を、以下「標準継続時間長情報」と呼ぶ。 The phoneme duration generation unit 22 receives the phoneme symbol string and the language attribute information from the text analysis unit 1 and, based on them, generates phoneme duration information representing the duration corresponding to each phoneme symbol in the phoneme symbol string. The phoneme duration generation unit 22 is connected to the pitch generation unit 23 and the prosody correction unit 3, and outputs the generated phoneme duration information to the pitch generation unit 23 and the prosody correction unit 3. The duration information generated by the phoneme duration generation unit 22 for each phoneme symbol is hereinafter referred to as “standard phoneme duration information”.

ピッチ生成部２３は、テキスト解析部１から音韻記号列と言語属性情報とを入力されるとともに、音韻継続時間長生成部２２から音韻継続時間長情報が入力され、これらに基づいて、音韻記号列中の各音韻記号に対応するピッチ（基本周波数Ｆ０）を表すピッチ情報を生成する。ピッチ生成部２３は、韻律補正部３と接続され、生成したピッチ情報を韻律補正部３に出力する。なお、ピッチ生成部２３が生成した各音韻記号に対応するピッチ情報を、以下「標準ピッチ情報」と呼ぶ。 The pitch generation unit 23 receives the phoneme symbol string and the language attribute information from the text analysis unit 1, and also receives the phoneme duration information from the phoneme duration generation unit 22. Based on these, it generates pitch information representing the pitch (fundamental frequency F0) corresponding to each phoneme symbol in the phoneme symbol string. The pitch generation unit 23 is connected to the prosody correction unit 3 and outputs the generated pitch information to the prosody correction unit 3. The pitch information generated by the pitch generation unit 23 for each phoneme symbol is hereinafter referred to as “standard pitch information”.
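The three kinds of standard prosody information described above (power, phoneme duration, pitch) can be pictured as one record per phoneme symbol. The following is only a minimal sketch: the class name `PhonemeProsody`, the helper `total_duration`, and the units chosen for duration and pitch are assumptions for illustration and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeProsody:
    """Hypothetical container for the standard prosody of one phoneme symbol."""
    symbol: str         # phoneme symbol, e.g. "s" or "o"
    power_db: float     # standard power information (dB)
    duration_ms: float  # standard phoneme duration (unit assumed: milliseconds)
    pitch_hz: float     # standard pitch, i.e. F0 (unit assumed: Hz)


def total_duration(sequence: List[PhonemeProsody]) -> float:
    """Sum the phoneme durations of a phoneme symbol string."""
    return sum(p.duration_ms for p in sequence)
```

Keeping duration next to pitch in the same record mirrors the fact that the pitch generation unit 23 takes the phoneme duration information as one of its inputs.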

なお、韻律生成部２は、たとえば専用のＬＳＩによって構成される。 The prosody generation unit 2 is configured by, for example, a dedicated LSI.
韻律補正部３は、パワー補正部３１、音韻継続時間長補正部３２、および、ピッチ補正部３３から構成される。 The prosody correction unit 3 includes a power correction unit 31, a phoneme duration correction unit 32, and a pitch correction unit 33.

韻律補正部３は、テキスト解析部１から文末韻律制御情報を入力され、入力された文末韻律制御情報に基づいて、韻律を補正するか否かを切り替える。 The prosody correction unit 3 receives the sentence end prosody control information from the text analysis unit 1 and, based on it, switches whether or not to correct the prosody.

パワー補正部３１は、パワー生成部２１から標準パワー情報を入力され、入力された標準パワー情報の一部を補正する。補正処理の詳細については後述する。パワー補正部３１は、合成部４と接続され、補正したパワー情報を合成部４に出力する。 The power correction unit 31 receives the standard power information from the power generation unit 21 and corrects a part of the input standard power information. Details of the correction processing will be described later. The power correction unit 31 is connected to the synthesis unit 4 and outputs the corrected power information to the synthesis unit 4.

音韻継続時間長補正部３２は、音韻継続時間長生成部２２から標準音韻継続時間長情報を入力され、入力された標準音韻継続時間長情報の一部を補正する。補正処理の詳細については後述する。音韻継続時間長補正部３２は、合成部４と接続され、補正した音韻継続時間長情報を合成部４に出力する。 The phoneme duration correction unit 32 receives the standard phoneme duration information from the phoneme duration generator 22 and corrects part of the input standard phoneme duration information. Details of the correction processing will be described later. The phoneme duration correction unit 32 is connected to the synthesis unit 4 and outputs the corrected phoneme duration information to the synthesis unit 4.

ピッチ補正部３３は、ピッチ生成部２３から標準ピッチ情報を入力され、入力された標準ピッチ情報の一部を補正する。補正処理の詳細については後述する。ピッチ補正部３３は、合成部４と接続され、補正したピッチ情報を合成部４に出力する。 The pitch correction unit 33 receives the standard pitch information from the pitch generation unit 23 and corrects a part of the input standard pitch information. Details of the correction processing will be described later. The pitch correction unit 33 is connected to the synthesis unit 4 and outputs corrected pitch information to the synthesis unit 4.

なお、韻律補正部３は、たとえば専用のＬＳＩによって構成される。 The prosody correction unit 3 is configured by, for example, a dedicated LSI.
合成部４は、テキスト解析部１から音韻記号列と言語属性情報を入力されるとともに、韻律補正部３から、補正パワー情報、補正音韻継続時間長情報、および、補正ピッチ情報を入力される。 The synthesis unit 4 receives the phoneme symbol string and the language attribute information from the text analysis unit 1, and receives the corrected power information, the corrected phoneme duration information, and the corrected pitch information from the prosody correction unit 3.

そして、合成部４は、入力された音韻記号列などの情報に基づいて音声素片を選択し、入力された韻律情報（補正パワー情報、補正音韻継続時間長情報、補正ピッチ情報）に基づいて選択された音声素片を接続することにより、音声信号を合成する。音声素片を接続して音声信号を合成する方法については、たとえば、特開２００１−１０９５００号公報に詳しい。なお、補正パワー情報、補正音韻継続時間長情報、補正ピッチ情報に基づいて合成された音声信号を、以下「補正合成音声」と呼ぶ。 Then, the synthesis unit 4 selects a speech segment based on information such as the input phonemic symbol string, and based on the input prosodic information (corrected power information, corrected phoneme duration length information, corrected pitch information). An audio signal is synthesized by connecting the selected speech segments. A method for synthesizing an audio signal by connecting audio segments is detailed in, for example, Japanese Patent Application Laid-Open No. 2001-109500. Note that the speech signal synthesized based on the corrected power information, the corrected phoneme duration information, and the corrected pitch information is hereinafter referred to as “corrected synthesized speech”.

合成部４が、合成した音声信号をデジタルアンプ（アンプ１８０）に出力することにより、音声合成装置１００では、スピーカ１８１から、入力されたテキストデータに対応した音声が出力される。なお、合成部４は、合成した音声信号を、図示しないＤ／Ａ（デジタル−アナログ）コンバータ、スピーカ、ヘッドホンなどに出力するように構成されていても良い。また、合成部４は、たとえば専用のＬＳＩによって構成される。 The synthesizer 4 outputs the synthesized voice signal to the digital amplifier (amplifier 180), whereby the voice synthesizer 100 outputs a voice corresponding to the input text data from the speaker 181. The synthesizing unit 4 may be configured to output the synthesized audio signal to a D / A (digital-analog) converter, a speaker, headphones, or the like (not shown). The combining unit 4 is configured by a dedicated LSI, for example.

本実施の形態では、韻律補正部３が音声合成装置１００の構成要素の一つとなる場合について説明したが、韻律補正部３は、入力されるパワー情報、音韻継続時間長情報、ピッチ情報を補正し、補正後のパワー情報、音韻継続時間長情報、ピッチ情報を出力する独立した装置として用いてもよい。 In the present embodiment, the case where the prosody correction unit 3 is one of the components of the speech synthesizer 100 has been described. However, the prosody correction unit 3 may also be used as an independent device that corrects input power information, phoneme duration information, and pitch information, and outputs the corrected power information, phoneme duration information, and pitch information.

たとえば、予めＲＯＭなどの不揮発性の記憶媒体に記憶された音声波形、または、ＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）などの通信経路もしくはマイクなどを介して入力されてＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）などの記憶媒体に一時的に記憶された音声波形を、周知の音声分析技術によって、音響特徴パラメータ系列と音源パラメータ系列に分解し、そして、分解して得られる音響特徴パラメータ系列に対して周知の音声認識処理を施すことにより音韻記号列と音韻境界情報とを生成するとともに、生成した音韻境界情報と上記の音源パラメータ系列とを用いて、標準パワー情報、標準音韻継続時間長情報、標準ピッチ情報を生成する。そして、生成した標準パワー情報、標準音韻継続時間長情報、標準ピッチ情報を、韻律補正部３に入力し、韻律補正部３から出力される補正パワー情報、補正韻律継続時間長情報、補正ピッチ情報と、前述の分解して得られた音響特徴パラメータ系列、音源パラメータ系列、音韻記号列とを用いて、周知の音声合成処理を施して補正合成音声を生成することができる。 For example, a speech waveform stored in advance in a non-volatile storage medium such as a ROM, or a speech waveform input via a communication path such as a LAN (Local Area Network) or via a microphone and temporarily stored in a storage medium such as a RAM (Random Access Memory), is decomposed into an acoustic feature parameter sequence and a sound source parameter sequence by a known speech analysis technique. A known speech recognition process is then applied to the resulting acoustic feature parameter sequence to generate a phoneme symbol string and phoneme boundary information, and the standard power information, standard phoneme duration information, and standard pitch information are generated from the phoneme boundary information and the sound source parameter sequence. The generated standard power information, standard phoneme duration information, and standard pitch information are input to the prosody correction unit 3. A known speech synthesis process is then applied to the corrected power information, corrected phoneme duration information, and corrected pitch information output from the prosody correction unit 3, together with the acoustic feature parameter sequence, sound source parameter sequence, and phoneme symbol string obtained by the decomposition, to generate corrected synthesized speech.

また、上述の説明では、テキスト解析部１を構成要素として含む場合について説明した。なお、たとえば入力テキストデータが漢字かな混じり文を含まず、カタカナ表記とアクセント記号や文末記号などの制御記号で構成されることが既知の場合、テキスト解析部１は、形態素解析や構文解析の処理を省略し、読み方を表す音韻記号の系列を表す音韻記号列と、アクセント型、モーラ数などの言語属性を表す言語属性情報と、疑問など文末韻律の制御方法を表す文末韻律制御情報とを生成するようにしてもよい。 The above description covers the case where the text analysis unit 1 is included as a component. If, for example, the input text data is known to contain no mixed kanji-kana text and to consist of katakana notation and control symbols such as accent marks and sentence-end marks, the text analysis unit 1 may omit morphological analysis and syntax analysis and generate the phoneme symbol string representing the sequence of phoneme symbols for the reading, the language attribute information representing language attributes such as accent type and number of morae, and the sentence end prosody control information representing how the sentence-end prosody is to be controlled, such as for a question.

また、音声合成装置を、音韻記号列、言語属性情報、文末韻律制御情報が入力されるように構成することも可能である。この場合、音声合成装置において、テキスト解析部１は不要となる。 In addition, the speech synthesizer can be configured such that phoneme symbol strings, language attribute information, and sentence end prosody control information are input. In this case, the text analysis unit 1 is not required in the speech synthesizer.

また、上述のテキスト解析部１、韻律生成部２、韻律補正部３、合成部４の一部または全部は、専用のＬＳＩではなく、パーソナルコンピュータなどの一般的なコンピュータやマイクロプロセッサで実現してもよい。また、中央処理部１１０がマイクロプロセッサを備えていれば、当該マイクロプロセッサによって実現されても良い。この場合、たとえば後述する音声合成処理または韻律補正処理を、コンピュータに実行させるためのプログラムとして記述してもよい。そして、この場合、上記プログラムは、ＣＤ−ＲＯＭ（ＣｏｍｐａｃｔＤｉｓｃ − ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）などの記録媒体に記録されて流通し、ＣＤ−ＲＯＭドライブなどによりコンピュータが備えるＲＡＭなどに読み込まれ、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ：中央演算装置）で実行される。なお、プログラムが記録される記録媒体は、ＣＤ−ＲＯＭのほかに、たとえばフレキシブルディスク、カセットテープ、ハードディスク、光ディスク、ＭＯ（Ｍａｇｎｅｔｏ−Ｏｐｔｉｃａｌｄｉｓｃ）、ＭＤ（ＭｉｎｉＤｉｓｃ、登録商標）、ＤＶＤ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｃ）、ＩＣカード（メモリカードを含む）、光カード、マスクＲＯＭ、ＥＰＲＯＭ、ＥＥＰＲＯＭ、フラッシュＲＯＭなどの半導体メモリなどの固定的にプログラムを記録する媒体でもよい。さらに、インターネットなどのネットワークを介して他の装置からダウンロードされてもよい。 In addition, some or all of the text analysis unit 1, prosody generation unit 2, prosody correction unit 3, and synthesis unit 4 described above may be realized not by a dedicated LSI but by a general-purpose computer such as a personal computer, or by a microprocessor. If the central processing unit 110 includes a microprocessor, they may be realized by that microprocessor. In this case, for example, the speech synthesis process or the prosody correction process described later may be written as a program to be executed by a computer. The program is then recorded and distributed on a recording medium such as a CD-ROM (Compact Disc - Read Only Memory), read into a RAM or the like of the computer by a CD-ROM drive or the like, and executed by a CPU (Central Processing Unit). Besides a CD-ROM, the recording medium on which the program is recorded may be any medium that holds the program in a fixed manner, for example a flexible disk, a cassette tape, a hard disk, an optical disk, an MO (Magneto-Optical disc), an MD (MiniDisc, registered trademark), a DVD (Digital Versatile Disc), an IC card (including a memory card), an optical card, or a semiconductor memory such as a mask ROM, an EPROM, an EEPROM, or a flash ROM. The program may also be downloaded from another device via a network such as the Internet.

また、上述のテキスト解析部１、韻律生成部２、韻律補正部３、合成部４の一部または全部は、専用のＬＳＩとコンピュータやマイクロプロセッサとを組み合わせることによって実現されてもよく、また、一部または全部を一つの専用のＬＳＩとして構成してもよい。 Further, part or all of the text analysis unit 1, the prosody generation unit 2, the prosody correction unit 3, and the synthesis unit 4 described above may be realized by combining a dedicated LSI with a computer or a microprocessor. A part or the whole may be configured as one dedicated LSI.

以下、テキスト解析部１、韻律生成部２、韻律補正部３、合成部４が専用のＬＳＩによって構成される場合を例として説明する箇所については、上述と同様に種々の方法にて実現することができる。 In the following, wherever the description takes as an example the case in which the text analysis unit 1, the prosody generation unit 2, the prosody correction unit 3, and the synthesis unit 4 are configured by dedicated LSIs, these units can likewise be realized by the various methods described above.

［補正合成音声を生成するための処理］ [Process for generating corrected synthesized speech]
次に、音声合成装置１００において実行される具体的な処理内容について説明する。 Next, specific processing contents executed in the speech synthesizer 100 will be described.

図３は、音声合成装置１００において実行される補正合成音声を生成するための処理（音声合成処理）のフローチャートである。 FIG. 3 is a flowchart of a process (speech synthesis process) for generating a corrected synthesized speech executed in the speech synthesizer 100.

図３を参照して、テキスト解析部１にテキストデータが入力されると、テキスト解析部１は、入力されたテキストデータに対して、形態素解析と構文解析を含む解析を行ない、読み方を表す音韻記号の系列を表す音韻記号列を生成し（ステップＳ１０１）、アクセント型、品詞、モーラ数、係り受けなどの言語属性を表す言語属性情報を生成し（ステップＳ１０２）、疑問など文末韻律の制御方法を表す文末韻律制御情報を生成する（ステップＳ１０３）。 Referring to FIG. 3, when text data is input to the text analysis unit 1, the text analysis unit 1 performs analysis including morphological analysis and syntax analysis on the input text data. It generates a phoneme symbol string representing the sequence of phoneme symbols for the reading (step S101), generates language attribute information representing language attributes such as accent type, part of speech, number of morae, and dependency (step S102), and generates sentence end prosody control information representing how the sentence-end prosody is to be controlled, such as for a question (step S103).

ここで、音韻記号列は、たとえば入力テキストデータが「外で遊ぶ？」のとき、「ｓ」「ｏ」「ｔ」「ｏ」「ｄ」「ｅ」「ａ」「ｓ」「ｏ」「ｂ」「ｕ」のように表される。音韻記号列を生成する処理は、入力テキストデータ全体に対して一度の処理で生成してもよく、入力テキストデータを部分に区切って複数回の処理で生成してもよい。なお、入力テキストデータは、「ソトデアソブ？」などの表音可能な他の表現でも良く、上記に限定されるものではない。また、入力テキストデータの代わりに、音韻記号列「ｓ」「ｏ」「ｔ」「ｏ」「ｄ」「ｅ」「ａ」「ｓ」「ｏ」「ｂ」「ｕ」が直接入力されるようにしても良い。音韻記号列の表現についても、「ｓ」「ｏ」「ｔ」「ｏ」「ｄ」「ｅ」「ａ」「ｓ」「ｏ」「ｂ」「ｕ」は一例であり、音の特徴を示す単位に対応付けられた記号であれば他の表現（「Ｓ」「Ｏ」「Ｔ」「Ｏ」「Ｄ」…の大文字等）を用いても良いことは言うまでもない。 Here, when the input text data is, for example, “外で遊ぶ？” (“Play outside?”), the phoneme symbol string is expressed as “s” “o” “t” “o” “d” “e” “a” “s” “o” “b” “u”. The phoneme symbol string may be generated in a single pass over the entire input text data, or in several passes by dividing the input text data into parts. The input text data is not limited to the above and may be given in another phonetically readable form, such as the katakana “ソトデアソブ？” (“sotodeasobu?”). Instead of input text data, the phoneme symbol string “s” “o” “t” “o” “d” “e” “a” “s” “o” “b” “u” may be input directly. The notation of the phoneme symbol string is likewise only an example: it goes without saying that other notations (such as the capital letters “S” “O” “T” “O” “D” ...) may be used as long as the symbols are associated with units that represent the characteristics of the sound.

文末韻律制御情報は、たとえば入力テキストデータが「外で遊ぶ？」のとき、「？」記号を検出することにより、ＲＡＭなどの記憶装置など（図示せず）に、疑問フラグとして「１」を設定する。また、「？」記号が検出されない場合、疑問フラグとして「０」を設定する。 As for the sentence end prosody control information, when the input text data is, for example, “外で遊ぶ？” (“Play outside?”), the “？” symbol is detected and a question flag of “1” is set in a storage device such as a RAM (not shown). If no “？” symbol is detected, the question flag is set to “0”.
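The question-flag setting described above can be sketched as follows. This is only an illustrative sketch: the function name `question_flag` is hypothetical, and accepting the half-width "?" in addition to the full-width "？" is an assumption rather than something the specification states.

```python
def question_flag(text: str) -> int:
    """Return 1 if a question mark is detected in the input text, else 0.

    Assumption: both the full-width and the half-width question mark count.
    """
    return 1 if ("？" in text or "?" in text) else 0
```

For the example input “外で遊ぶ？” the sketch yields a flag of 1, and for the same text without the question mark it yields 0.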

次に、パワー生成部２１は、音韻記号列（ステップＳ１０１で生成されたもの）と言語属性情報（ステップＳ１０２で生成されたもの）を入力されることにより、これらを用いて、音韻記号列中の各音韻記号に対応するパワー情報（波形の振幅を決定する際に用いる情報）を生成する（ステップＳ１０４）。 Next, the power generation unit 21 receives the phoneme symbol string (generated in step S101) and the language attribute information (generated in step S102), and uses them in the phoneme symbol string. Power information (information used when determining the amplitude of the waveform) corresponding to each phoneme symbol is generated (step S104).

ここで、パワー情報の生成では、たとえば、周知の手法（周知の多変量解析など）を用いて予めパワー情報テーブルを作成し、当該パワー情報テーブルをＲＯＭまたはＲＡＭなどの記憶装置（図示せず）に予め記憶しておき、そして、作成されたパワー情報テーブルを参照して、音韻記号列や言語属性情報に応じたパワー情報を生成すればよい。また、ここで生成されるパワー情報は、上記した標準パワー情報である。 Here, in the generation of power information, for example, a power information table is created in advance using a known method (such as a known multivariate analysis), and the power information table is stored in a storage device (not shown) such as a ROM or a RAM. The power information corresponding to the phoneme symbol string and the language attribute information may be generated by referring to the created power information table. The power information generated here is the standard power information described above.

次に、音韻継続時間長生成部２２は、音韻記号列（ステップＳ１０１で生成されたもの）と言語属性情報（ステップＳ１０２で生成されたもの）とを入力されることにより、これらを用いて、音韻記号列中の各音韻記号に対応する継続時間長を表す音韻継続時間長情報を生成する（ステップＳ１０５）。 Next, the phoneme duration generation unit 22 receives the phoneme symbol string (generated in step S101) and the language attribute information (generated in step S102), and uses them, Phoneme duration information representing the duration length corresponding to each phoneme symbol in the phoneme symbol string is generated (step S105).

ここで、音韻継続時間長情報の生成では、たとえば、周知の手法（周知の多変量解析など）を用いて予め音韻継続時間長情報テーブルを作成し、当該テーブルをＲＯＭまたはＲＡＭなどの記憶装置（図示せず）に予め記憶しておき、作成された音韻継続時間長情報テーブルを参照して、音韻記号列や言語属性情報に応じた音韻継続時間長情報を生成すればよい。また、ここで生成される音韻継続時間長情報は、上記した標準音韻継続時間長情報である。 Here, in the generation of phoneme duration information, for example, a phoneme duration information table is created in advance using a known method (well-known multivariate analysis or the like), and the table is stored in a storage device such as a ROM or RAM ( It may be stored in advance (not shown) and the phoneme duration information corresponding to the phoneme symbol string and the language attribute information may be generated by referring to the created phoneme duration information table. Further, the phoneme duration information generated here is the standard phoneme duration information described above.

次に、ピッチ生成部２３は、音韻記号列（ステップＳ１０１で生成されたもの）と、言語属性情報（ステップＳ１０２で生成されたもの）と、音韻継続時間長情報（ステップＳ１０５で生成されたもの）とを入力されることにより、これらを用いて、音韻記号列中の各音韻記号に対応するピッチを表すピッチ情報を生成する（ステップＳ１０６）。 Next, the pitch generator 23 generates a phoneme symbol string (generated in step S101), language attribute information (generated in step S102), and phoneme duration information (generated in step S105). ) Is used to generate pitch information representing the pitch corresponding to each phoneme symbol in the phoneme symbol string (step S106).

ここで、ピッチ情報の生成は、たとえば、藤崎モデルに基づくピッチ生成などの周知の手法を用いて生成すればよい。また、ここで生成されるピッチ情報は、上記した標準ピッチ情報である。 Here, the generation of the pitch information may be generated by using a known method such as pitch generation based on the Fujisaki model. The pitch information generated here is the standard pitch information described above.

なお、本実施の形態では、母音中心にピッチ情報を割り当てる点ピッチに基づくピッチ制御を用いる。点ピッチ間に割り当てる具体的なピッチの値は、たとえば、線形補間により算出される。 In the present embodiment, pitch control based on a point pitch that assigns pitch information to the vowel center is used. A specific pitch value assigned between the point pitches is calculated by, for example, linear interpolation.
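The linear interpolation between point pitches mentioned above can be sketched as follows. The function name `interpolate_pitch`, the representation of point pitches as (time, pitch) pairs, and the choice to hold the first and last point pitch constant outside the covered range are assumptions made for illustration.

```python
from typing import List, Tuple


def interpolate_pitch(points: List[Tuple[float, float]], t: float) -> float:
    """Linearly interpolate the pitch at time t.

    points: (time, pitch) pairs at vowel centers, sorted by time.
    Outside the covered range the nearest point pitch is held (assumption).
    """
    if not points:
        raise ValueError("at least one point pitch is required")
    if t <= points[0][0]:
        return points[0][1]
    if t >= points[-1][0]:
        return points[-1][1]
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            if t1 == t0:
                return p1
            return p0 + (p1 - p0) * (t - t0) / (t1 - t0)
    raise AssertionError("unreachable for time-sorted points")
```

With point pitches (0.0, 100.0) and (10.0, 200.0), the value at t = 5.0 is 150.0, halfway between the two.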

次に、韻律補正部３は、ステップＳ１０３で生成した文末韻律制御情報を参照して疑問文であるか否かを判定し、疑問文であると判定すればステップＳ１０８へ、疑問文ではないと判定すればステップＳ１１１へ、それぞれ処理を進める（ステップＳ１０７）。なお、韻律補正部３は、ステップＳ１０７において、たとえば、ＲＡＭなどの記憶装置（図示せず）に記憶される疑問フラグの設定値を参照し、疑問フラグが「１」の場合、ステップＳ１０８に処理を進めて韻律補正処理（後述するステップＳ１０８〜ステップＳ１１０の総称、または、これらの少なくとも一つ）を行ない、疑問フラグが「０」の場合、韻律補正処理を行なうことなくステップＳ１１１に処理を進める。ここで、韻律補正処理を行なわない場合、韻律補正部３で、ステップＳ１０４〜ステップＳ１０６で生成した標準韻律情報（標準パワー情報、標準音韻継続時間長情報、標準ピッチ情報）を、そのまま合成部４に受け渡すように構成してもよく、また、音声合成装置に設けられる制御部（図示せず）でデータの流れを制御することにより、韻律生成部２で生成される標準韻律情報を、韻律補正部３を介することなく直接合成部４に出力するように構成してもよい。 Next, the prosody correction unit 3 refers to the sentence end prosody control information generated in step S103 and determines whether or not the sentence is a question. If it is a question, the process proceeds to step S108; if not, the process proceeds to step S111 (step S107). In step S107, the prosody correction unit 3 refers to, for example, the value of the question flag stored in a storage device such as a RAM (not shown). If the question flag is “1”, the process proceeds to step S108 and the prosody correction process (a collective name for steps S108 to S110 described later, or at least one of them) is performed. If the question flag is “0”, the process proceeds to step S111 without performing the prosody correction process. When the prosody correction process is not performed, the prosody correction unit 3 may pass the standard prosody information generated in steps S104 to S106 (standard power information, standard phoneme duration information, standard pitch information) to the synthesis unit 4 unchanged. Alternatively, a control unit (not shown) provided in the speech synthesizer may control the data flow so that the standard prosody information generated by the prosody generation unit 2 is output directly to the synthesis unit 4 without passing through the prosody correction unit 3.
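The branch at step S107 can be sketched as a small dispatch function. This is a sketch under stated assumptions: the name `apply_sentence_end_control`, the dictionary representation of prosody information, and the `correct` callback standing in for steps S108 to S110 are all hypothetical.

```python
from typing import Callable, Dict, List

Prosody = Dict[str, List[float]]  # hypothetical: keys such as "power", "duration", "pitch"


def apply_sentence_end_control(
    standard: Prosody,
    question_flag: int,
    correct: Callable[[Prosody], Prosody],
) -> Prosody:
    """Step S107: correct the prosody only when the question flag is 1.

    When the flag is 0 the standard prosody is passed through unchanged,
    which corresponds to proceeding directly to step S111.
    """
    if question_flag == 1:
        return correct(standard)
    return standard
```

Passing the correction in as a callback reflects that the prosody correction process may consist of all of steps S108 to S110 or only some of them.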

ステップＳ１０８では、パワー補正部３１は、ステップＳ１０４で生成されたパワー情報を入力されることにより、入力された標準パワー情報の一部を補正する（ステップＳ１０８）。ここでの補正処理の詳細については後述する。 In step S108, the power correction unit 31 receives the power information generated in step S104 and corrects a part of the input standard power information (step S108). Details of the correction process will be described later.

次に、音韻継続時間長補正部３２は、ステップＳ１０５で生成された音韻継続時間長情報を入力されることにより、入力された標準音韻継続時間長情報の一部を補正する（ステップＳ１０９）。ここでの補正処理の詳細については後述する。 Next, the phoneme duration correction unit 32 receives the phoneme duration information generated in step S105 and corrects a part of the input standard phoneme duration information (step S109). Details of the correction process will be described later.

次に、ピッチ補正部３３は、ステップＳ１０６で生成されたピッチ情報を入力されることにより、入力された標準ピッチ情報の一部を補正する（ステップＳ１１０）。ここでの補正処理の詳細については後述する。 Next, the pitch correction unit 33 receives the pitch information generated in step S106, thereby correcting a part of the input standard pitch information (step S110). Details of the correction process will be described later.

ステップＳ１１１では、合成部４は、ステップＳ１０１で生成された音韻記号列と、ステップＳ１０２で生成された言語属性情報と、ステップＳ１０８〜ステップＳ１１０で生成された補正韻律情報（補正パワー情報、補正音韻継続時間長情報、補正ピッチ情報）またはステップＳ１０４〜ステップＳ１０６で生成された標準韻律情報とを入力され、ＲＯＭやＲＡＭなどの記憶装置（図示せず）などに予め記憶された複数の音声素片情報から、音韻記号列に応じた音声素片情報を選択する。 In step S111, the synthesizing unit 4 includes the phoneme symbol string generated in step S101, the language attribute information generated in step S102, and the corrected prosodic information (corrected power information and corrected phoneme generated in steps S108 to S110). Continuous time length information, correction pitch information) or standard prosody information generated in steps S104 to S106, and a plurality of speech segments stored in advance in a storage device (not shown) such as a ROM or RAM. From the information, speech unit information corresponding to the phoneme symbol string is selected.

そして、合成部４は、ステップＳ１１１で選択された音声素片情報を用いて、入力される補正韻律情報（または標準韻律情報）に基づいて音声素片情報を接続することにより、音声信号を合成する（ステップＳ１１２）。 The synthesizing unit 4 synthesizes the speech signal by connecting the speech unit information based on the input corrected prosody information (or standard prosody information) using the speech unit information selected in step S111. (Step S112).

なお、上述の説明において、ステップＳ１０８〜ステップＳ１１０で、パワー情報、音韻継続時間長情報、ピッチ情報を補正しているが、本発明の一つの特徴はピッチ補正部３３の構成または処理内容（詳細は後述）である。つまり、本発明に従った韻律補正装置および音声合成装置は、少なくともピッチ補正部３３を備えていれば良い。このことから、ステップＳ１０８のパワー補正処理およびステップＳ１０９の音韻継続時間長補正処理の一部または全部は、除かれてもよい。なお、この場合、韻律補正部３は、パワー補正部３１と音韻継続時間長補正部３２の一部または全部を除いて構成されてもよい。 In the above description, the power information, phoneme duration information, and pitch information are corrected in steps S108 to S110. However, one feature of the present invention lies in the configuration or processing of the pitch correction unit 33 (described in detail later). That is, the prosody correction device and the speech synthesizer according to the present invention need only include at least the pitch correction unit 33. Accordingly, part or all of the power correction process in step S108 and the phoneme duration correction process in step S109 may be omitted. In that case, the prosody correction unit 3 may be configured without part or all of the power correction unit 31 and the phoneme duration correction unit 32.

［パワー補正部の構成］ [Configuration of power correction unit]
次に、音声合成装置１００におけるパワー補正部３１の詳細な構成について説明する。 Next, a detailed configuration of the power correction unit 31 in the speech synthesizer 100 will be described.

図４は、図２のパワー補正部３１の構成を示す機能ブロック図である。 FIG. 4 is a functional block diagram showing the configuration of the power correction unit 31 in FIG. 2.
パワー補正部３１は、終端音韻パワー生成部３１１と終端音韻パワー設定部３１２を含む。これらは、たとえばそれぞれ専用のＬＳＩによって構成され、そして、いずれもパワー生成部２１から標準パワー情報を入力される。そして、終端音韻パワー生成部３１１は、入力された標準パワー情報を用いて終端音韻のパワーの補正値を生成し、終端音韻パワー設定部３１２は、標準パワー情報の中の終端音韻のパワー情報を終端音韻パワー生成部３１１が生成した補正値に置換（設定）することにより、補正パワー情報を生成し、合成部４に出力する。 The power correction unit 31 includes a terminal phoneme power generation unit 311 and a terminal phoneme power setting unit 312. Each is configured by, for example, a dedicated LSI, and both receive the standard power information from the power generation unit 21. The terminal phoneme power generation unit 311 generates a correction value for the power of the terminal phoneme using the input standard power information. The terminal phoneme power setting unit 312 generates corrected power information by replacing (setting) the power information of the terminal phoneme in the standard power information with the correction value generated by the terminal phoneme power generation unit 311, and outputs it to the synthesis unit 4.

なお、終端音韻とは、音韻記号列の終端部である。具体的には、上記した音韻記号列（「ｓ」「ｏ」「ｔ」「ｏ」「ｄ」「ｅ」「ａ」「ｓ」「ｏ」「ｂ」「ｕ」）では、「ｕ」が終端音韻である。 The terminal phoneme is the terminal part of the phoneme symbol string. Specifically, in the above phoneme symbol string (“s” “o” “t” “o” “d” “e” “a” “s” “o” “b” “u”), “u” Is the terminal phoneme.

［パワー補正処理］ [Power correction process]
次に、図５〜図７をさらに参照して、パワー補正部３１の具体的な処理内容の一例を説明する。なお、図５は、図３のパワー補正処理（Ｓ１０８）のサブルーチンのフローチャートである。また、図６は、標準韻律情報（標準パワー情報、標準音韻継続時間長情報、および、標準ピッチ情報）の一例を模式的に示す図であり、図７は、図６に示された情報に対応する補正韻律情報（補正パワー情報、補正音韻継続時間長情報、および、補正ピッチ情報）の一例を模式的に示す図である。 Next, an example of the specific processing of the power correction unit 31 will be described with further reference to FIGS. 5 to 7. FIG. 5 is a flowchart of the subroutine of the power correction process (S108) in FIG. 3. FIG. 6 is a diagram schematically showing an example of standard prosody information (standard power information, standard phoneme duration information, and standard pitch information), and FIG. 7 is a diagram schematically showing an example of the corrected prosody information (corrected power information, corrected phoneme duration information, and corrected pitch information) corresponding to the information shown in FIG. 6.

なお、図６と図７は、それぞれグラフ形式で記載されている。それぞれのグラフの元となったデータを、表１（図６）と表２（図７）に示す。 FIG. 6 and FIG. 7 are described in a graph format. Table 1 (FIG. 6) and Table 2 (FIG. 7) show the data used as the basis of each graph.

図６と図７では、各音韻記号のピッチが「●」で示され、パワーが「○」で示されている。また、図６と図７では、縦軸にピッチとパワーが定義され、横軸に時間が定義されており、各音韻記号に対する音韻継続時間が縦方向の破線で区切られて示されている。 In FIG. 6 and FIG. 7, the pitch of each phoneme symbol is indicated by “●”, and the power is indicated by “◯”. In FIGS. 6 and 7, pitch and power are defined on the vertical axis, time is defined on the horizontal axis, and phoneme durations for each phoneme symbol are shown separated by vertical broken lines.

なお、図６と図７から理解されるように、本実施の形態では、韻律情報が補正されることにより、音韻記号列の終端音韻に対して、パワー情報、音韻継続時間長情報、ピッチ情報が、それぞれ２つ設定されるようになる。本明細書では、以下、補正後の韻律情報における終端音韻に対応したパワー情報、音韻継続時間長情報、ピッチ情報のそれぞれについて、先に発音される音声に対応するものを、終端音韻前半パワー、終端音韻前半継続時間、終端音韻前半ピッチ（または、「終端音韻第１ピッチ」）と呼び、また、後に発音される音声に対応するものを、終端音韻後半パワー、終端音韻後半継続時間、終端音韻後半ピッチ（または、「終端音韻第２ピッチ」）と呼ぶ。 As can be understood from FIGS. 6 and 7, in the present embodiment, correcting the prosody information results in two values each of power information, phoneme duration information, and pitch information being set for the terminal phoneme of the phoneme symbol string. In this specification, of the power information, phoneme duration information, and pitch information corresponding to the terminal phoneme in the corrected prosody information, those corresponding to the sound uttered first are hereinafter called the terminal phoneme first-half power, the terminal phoneme first-half duration, and the terminal phoneme first-half pitch (or “terminal phoneme first pitch”), and those corresponding to the sound uttered later are called the terminal phoneme second-half power, the terminal phoneme second-half duration, and the terminal phoneme second-half pitch (or “terminal phoneme second pitch”).

図５を主に参照して、パワー補正処理では、まずステップＳ１０８１で、終端音韻パワー生成部３１１が、標準パワー情報から終端音韻のパワー（ＰｏｗＬ）を取得する。 Referring mainly to FIG. 5, in the power correction process, first, in step S1081, the terminal phoneme power generation unit 311 acquires the power of the terminal phoneme (PowL) from the standard power information.

次に、終端音韻パワー生成部３１１は、ステップＳ１０８２で終端音韻前半パワー（ＰｏｗＬ１）を算出し、ステップＳ１０８３で終端音韻後半パワー（ＰｏｗＬ２）を算出する。なお、終端音韻前半パワーは、標準パワー情報における終端音韻のパワーとされ、また、終端音韻後半パワーは、標準パワー情報における終端音韻のパワーから所定の値（ＰＷＤＯＷＮ）を差し引いたものとされる。 Next, the terminal phoneme power generation unit 311 calculates the terminal phoneme first-half power (PowL1) in step S1082 and the terminal phoneme second-half power (PowL2) in step S1083. The terminal phoneme first-half power is set to the power of the terminal phoneme in the standard power information, and the terminal phoneme second-half power is set to the power of the terminal phoneme in the standard power information minus a predetermined value (PWDOWN).

そして、終端音韻パワー設定部３１２が、標準パワー情報に対して、終端音韻のパワーを、終端音韻前半パワーと終端音韻後半パワーに入れ替えて（Ｓ１０８４，Ｓ１０８５）処理を図３にリターンさせる。 Then, the terminal phoneme power setting unit 312 replaces the power of the terminal phoneme with the power of the first half of the terminal phoneme and the power of the second half of the terminal phoneme with respect to the standard power information (S1084, S1085), and returns the processing to FIG.

図５を参照して説明したパワー補正処理によれば、標準韻律情報における終端音韻のパワー（ＰｏｗＬ）は、補正韻律情報では、終端音韻前半パワー（ＰｏｗＬ１）と終端音韻後半パワー（ＰｏｗＬ２）に入れ替えられる。 According to the power correction process described with reference to FIG. 5, the power of the terminal phoneme (PowL) in the standard prosody information is replaced, in the corrected prosody information, by the terminal phoneme first-half power (PowL1) and the terminal phoneme second-half power (PowL2).

なお、図６、図７、表１、および、表２から理解されるように、本実施の形態では、これらの値の具体例として、ＰｏｗＬ＝ＰｏｗＬ１＝５９（ｄＢ）、および、ＰｏｗＬ２＝５０（ｄＢ）が示されている。つまり、ＰＷＤＯＷＮ＝９とされている。なお、ＰＷＤＯＷＮの値については、ユーザにより適宜変更とされても良い。 As can be understood from FIGS. 6 and 7 and Tables 1 and 2, the present embodiment gives PowL = PowL1 = 59 (dB) and PowL2 = 50 (dB) as specific examples of these values. That is, PWDOWN = 9. The value of PWDOWN may be changed by the user as appropriate.
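The power correction of steps S1081 to S1085 can be sketched as follows, using the PWDOWN = 9 (dB) example above. The function name `correct_terminal_power` and the list representation of the power series are assumptions for illustration; only PowL, PowL1, PowL2, and PWDOWN come from the description.

```python
from typing import List

PWDOWN = 9.0  # dB; the example value given in this embodiment


def correct_terminal_power(standard_power: List[float],
                           pwdown: float = PWDOWN) -> List[float]:
    """Replace the terminal phoneme power with first-half and second-half powers."""
    if not standard_power:
        raise ValueError("standard power information must not be empty")
    pow_l = standard_power[-1]  # S1081: terminal phoneme power PowL
    pow_l1 = pow_l              # S1082: first-half power equals PowL
    pow_l2 = pow_l - pwdown     # S1083: second-half power is PowL - PWDOWN
    # S1084, S1085: replace PowL with PowL1 and PowL2 in the series
    return standard_power[:-1] + [pow_l1, pow_l2]
```

With a terminal phoneme power of 59 dB, the corrected series ends in 59 dB followed by 50 dB, matching the example values; the corrected series is one element longer than the standard one because the terminal phoneme now carries two power values.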

[Configuration of phoneme duration correction unit]
Next, the detailed configuration of the phoneme duration correction unit 32 in the speech synthesizer 100 will be described.

FIG. 8 is a functional block diagram showing the configuration of the phoneme duration correction unit 32 of FIG. 2.
The phoneme duration correction unit 32 includes a terminal phoneme duration generation unit 321 and a terminal phoneme duration setting unit 322. Each is configured by, for example, a dedicated LSI, and both receive the standard phoneme duration information from the terminal phoneme duration generation unit 22. The terminal phoneme duration generation unit 321 uses the input standard phoneme duration information to generate correction values for the phoneme duration of the terminal phoneme. The terminal phoneme duration setting unit 322 replaces (sets) the phoneme duration of the terminal phoneme in the standard phoneme duration information with the correction values generated by the terminal phoneme duration generation unit 321, thereby generating corrected phoneme duration information, and outputs it to the synthesis unit 4.

[Phoneme duration correction process]
Next, an example of the specific processing of the phoneme duration correction unit 32 will be described with further reference to FIGS. 5, 6, and 9. FIG. 9 is a flowchart of the subroutine of the phoneme duration correction process (S109) of FIG. 3.

Referring mainly to FIG. 9, in the phoneme duration correction process, first, in step S1091, the terminal phoneme duration generation unit 321 acquires the phoneme duration of the terminal phoneme (DurL) from the standard phoneme duration information.

Next, the terminal phoneme duration generation unit 321 calculates the terminal phoneme first-half duration (DurL1) in step S1092 and the terminal phoneme second-half duration (DurL2) in step S1093. The first-half duration is the product of the phoneme duration of the terminal phoneme (DurL) in the standard phoneme duration information and a predetermined first value (DRWGT1), and the second-half duration is the product of DurL and a predetermined second value (DRWGT2). In this embodiment, 0.8 is used as an example of the first value (DRWGT1) and 0.5 as an example of the second value (DRWGT2); the values are not limited to these and may be changed by the user as appropriate.

Then, the terminal phoneme duration setting unit 322 replaces the phoneme duration of the terminal phoneme in the standard phoneme duration information with the terminal phoneme first-half duration and the terminal phoneme second-half duration (S1094, S1095), and returns the processing to FIG. 3.

According to the phoneme duration correction process described with reference to FIG. 9, the phoneme duration of the terminal phoneme (DurL) in the standard prosodic information is replaced, in the corrected prosodic information, with the terminal phoneme first-half duration (DurL1) and the terminal phoneme second-half duration (DurL2).

As can be understood from FIGS. 6 and 7 and Tables 1 and 2, this embodiment gives DurL = 94 (msec), DurL1 = 75 (msec), and DurL2 = 47 (msec) as specific examples of these values.
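The duration correction of steps S1091 to S1095 can be sketched in the same way. The function name, the list representation, and the non-terminal sample values are assumptions, as is the rounding to whole milliseconds; rounding is chosen here only because the text lists DurL1 = 75 msec for 94 × 0.8.

```python
DRWGT1 = 0.8  # example weight for the first half (from this embodiment)
DRWGT2 = 0.5  # example weight for the second half (from this embodiment)


def correct_terminal_duration(durations_ms, w1=DRWGT1, w2=DRWGT2):
    """Replace the terminal phoneme's duration DurL with (DurL1, DurL2).

    durations_ms is a hypothetical list holding one duration per phoneme.
    Rounding to whole milliseconds is an assumption, not stated in the text.
    """
    if not durations_ms:
        raise ValueError("duration series must not be empty")
    dur_l = durations_ms[-1]          # S1091: acquire DurL
    dur_l1 = round(dur_l * w1)        # S1092: DurL1 = DurL * DRWGT1
    dur_l2 = round(dur_l * w2)        # S1093: DurL2 = DurL * DRWGT2
    # S1094, S1095: substitute the two values for the single terminal value
    return durations_ms[:-1] + [dur_l1, dur_l2]


# The first two values are made up; the last is the DurL from the text.
print(correct_terminal_duration([60, 80, 94]))
```

Because both weights are below 1, each half is shorter than the original terminal phoneme, although their sum (75 + 47 msec in the example) is longer than DurL.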

[Configuration of pitch correction unit]
Next, the detailed configuration of the pitch correction unit 33 in the speech synthesizer 100 will be described.

FIG. 10 is a functional block diagram showing the configuration of the pitch correction unit 33.
Referring to FIG. 10, the pitch peak extraction unit 331 receives the standard pitch information generated by the pitch generation unit 21. The pitch peak extraction unit 331 extracts the peak pitch (the maximum pitch value) from the pitch sequence included in the standard pitch information. The range of the pitch sequence from which the peak pitch is extracted may be, for example, the sentence-final accent phrase or the sentence-final breath group. The pitch peak extraction unit 331 is connected to the terminal phoneme pitch generation unit 332 and outputs the extracted peak pitch to it. The pitch peak extraction unit 331 is configured by, for example, a dedicated LSI.

The terminal phoneme pitch generation unit 332 receives the standard pitch information (generated by the pitch generation unit 21) and the peak pitch (extracted by the pitch peak extraction unit 331). The terminal phoneme pitch generation unit 332 calculates a reference pitch from the input peak pitch, calculates a pitch change amount, and then uses the reference pitch and the pitch change amount to generate the two pitches assigned to the terminal phoneme (the terminal phoneme first pitch and the terminal phoneme second pitch). The detailed processing for generating these two pitches is described later. The terminal phoneme pitch generation unit 332 is connected to the terminal phoneme pitch setting unit 333 and outputs the generated terminal phoneme first pitch and terminal phoneme second pitch to it. The terminal phoneme pitch generation unit 332 is configured by, for example, a dedicated LSI.

The terminal phoneme pitch setting unit 333 receives the standard pitch information (generated by the pitch generation unit 21) and the terminal phoneme first pitch and terminal phoneme second pitch (generated by the terminal phoneme pitch generation unit 332). Of the pitch sequence included in the input standard pitch information, the terminal phoneme pitch setting unit 333 leaves every pitch other than the one assigned to the terminal phoneme (the last pitch in the time direction) unmodified, and corrects the standard pitch information by replacing the pitch assigned to the terminal phoneme with the input terminal phoneme first pitch and terminal phoneme second pitch. In other words, it generates the corrected pitch information by changing the terminal phoneme pitch from the single input value to the two generated values. The terminal phoneme pitch setting unit 333 is connected to the synthesis unit 4 and outputs the corrected pitch information, which includes the terminal phoneme first pitch and the terminal phoneme second pitch, to the synthesis unit 4. The terminal phoneme pitch setting unit 333 is configured by, for example, a dedicated LSI.

Some or all of the pitch peak extraction unit 331, the terminal phoneme pitch generation unit 332, and the terminal phoneme pitch setting unit 333 described above may be implemented not by dedicated LSIs but by a general-purpose computer, such as a personal computer, or by a microprocessor. In that case, for example, the pitch correction process described later may be written as a program to be executed by the computer.

[Pitch correction process]
Next, an example of the specific processing of the pitch correction unit 33 will be described with further reference to FIGS. 5, 6, and 11. FIG. 11 is a flowchart of the subroutine of the pitch correction process (S110) of FIG. 3.

Referring to FIG. 11, in the pitch correction process, first, the pitch peak extraction unit 331 extracts the peak pitch (PitP) from the pitch sequence in the standard pitch information (step S1101).

Next, the terminal phoneme pitch generation unit 332 acquires the pitch assigned to the terminal phoneme (PitL) from the standard pitch information (step S1102) and, using the peak pitch (PitP) extracted in step S1101, calculates by equation (1) the reference pitch (PitS) that serves as the reference value in generating the terminal phoneme pitches (step S1103).

PitS = PitP … Equation (1)
Next, the terminal phoneme pitch generation unit 332 uses a default pitch change amount (PTDIFF), stored in advance in a storage device such as a ROM or RAM, to calculate by equation (2) the pitch change amount (PDif) that is used in the subsequent processing as the change between the terminal phoneme first pitch and the terminal phoneme second pitch (step S1104).

PDif = PTDIFF … Equation (2)
Here, the default pitch change amount (PTDIFF) is a value representing the amount of change when pitch is expressed logarithmically; for example, when pitch is expressed as a natural logarithm, "0.6" may be used.

Next, using the reference pitch (PitS) calculated in step S1103 and the pitch change amount (PDif) calculated in step S1104, the terminal phoneme pitch generation unit 332 calculates by equation (3) the terminal phoneme first pitch (PitL1), which is assigned to the first half of the terminal phoneme (step S1105).

Log(PitL1) = Log(PitS) − PDif/2 … Equation (3)
Next, using the reference pitch (PitS) calculated in step S1103 and the pitch change amount (PDif) calculated in step S1104, the terminal phoneme pitch generation unit 332 calculates by equation (4) the terminal phoneme second pitch (PitL2), which is assigned to the second half of the terminal phoneme (step S1106).

Log(PitL2) = Log(PitS) + PDif/2 … Equation (4)
Then, the terminal phoneme pitch setting unit 333 sets the terminal phoneme first pitch (PitL1) calculated in step S1105 and the terminal phoneme second pitch (PitL2) calculated in step S1106 by substituting them for the pitch corresponding to the terminal phoneme in the standard pitch information (steps S1107 and S1108).
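Steps S1101 to S1108 with equations (1) to (4) can be sketched as below. This is an illustrative sketch under stated assumptions: the function name and the list-of-Hz representation are hypothetical, the sample contour is made up, and the caller is assumed to pass only the range (for example, the sentence-final accent phrase) from which the peak should be taken.

```python
import math

PTDIFF = 0.6  # default pitch change in natural-log units (example from the text)


def correct_terminal_pitch(pitches_hz, ptdiff=PTDIFF):
    """Replace the terminal phoneme's pitch PitL with the pair (PitL1, PitL2).

    pitches_hz is a hypothetical list holding one pitch (Hz) per phoneme.
    """
    if not pitches_hz:
        raise ValueError("pitch series must not be empty")
    pit_p = max(pitches_hz)      # S1101: peak pitch PitP
    pit_l = pitches_hz[-1]       # S1102: PitL (acquired, unused by eqs. (1)-(4))
    pit_s = pit_p                # S1103: equation (1)
    p_dif = ptdiff               # S1104: equation (2)
    pit_l1 = math.exp(math.log(pit_s) - p_dif / 2)   # S1105: equation (3)
    pit_l2 = math.exp(math.log(pit_s) + p_dif / 2)   # S1106: equation (4)
    # S1107, S1108: substitute the two pitches for the single terminal pitch
    return pitches_hz[:-1] + [pit_l1, pit_l2]


# Hypothetical contour in Hz, not taken from the patent's tables.
corrected = correct_terminal_pitch([120.0, 150.0, 110.0])
print([round(p, 1) for p in corrected])
```

The two generated pitches straddle the peak in the log domain, so their log difference equals PDif whatever the original terminal pitch was.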

In the above description, equations (1) to (4) are examples and do not limit the detailed method of calculation in the present invention; it suffices that all of the following five conditions are satisfied.

・The reference pitch (PitS) is calculated using the peak pitch (PitP).
・The pitch change amount (PDif) is calculated using the default pitch change amount (PTDIFF).
・The terminal phoneme first pitch (PitL1) is smaller than the reference pitch (PitS).
・The terminal phoneme second pitch (PitL2) is larger than the reference pitch (PitS).
・The difference (or ratio) between the terminal phoneme first pitch (PitL1) and the terminal phoneme second pitch (PitL2) is correlated with the pitch change amount (PDif).

Each equation may also be changed as appropriate as long as the same result is obtained. For example, when calculating the terminal phoneme second pitch (PitL2) in step S1106, equation (5), which uses the terminal phoneme first pitch (PitL1) calculated in step S1105, may be used instead of equation (4).

Log(PitL2) = Log(PitL1) + PDif … Equation (5)
In the above description, the series of steps S1101 to S1108 has been described, but the order of processing is not limited except where a step refers to the result of a preceding step. For example, the pitch peak extraction in step S1101 and the terminal phoneme pitch acquisition in step S1102 may be performed in the opposite order or simultaneously.
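The equivalence of equation (5) with equations (3) and (4) can be verified directly. The short derivation below uses only the symbols defined in the text; rearranging equation (3) gives Log(PitS) = Log(PitL1) + PDif/2, which is substituted into equation (4).

```latex
\begin{aligned}
\log(\mathrm{PitL2}) &= \log(\mathrm{PitS}) + \frac{\mathrm{PDif}}{2}
  && \text{equation (4)} \\
&= \left(\log(\mathrm{PitL1}) + \frac{\mathrm{PDif}}{2}\right) + \frac{\mathrm{PDif}}{2}
  && \text{by equation (3)} \\
&= \log(\mathrm{PitL1}) + \mathrm{PDif}
  && \text{equation (5)}
\end{aligned}
```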

Next, how the pitch information is corrected by the pitch correction process described with reference to FIG. 11 will be explained concretely with further reference to FIGS. 12 to 15. FIGS. 12 to 15 show the change in the natural logarithm of the pitch corresponding to the phoneme symbol string "s" "o" "t" "o" "d" "e" "a" "s" "o" "b" "u". In these figures, the vertical axis is the natural logarithm of the pitch and the horizontal axis is time. FIG. 12 shows an example of a pitch sequence according to the standard pitch information, and FIGS. 13 to 15 show first to third examples, respectively, of pitch sequences generated when the pitch correction unit 33 corrects the pitch sequence of FIG. 12.

The phoneme duration and pitch of each phoneme symbol in FIG. 12 are those shown in Table 1. The phoneme duration and pitch of each phoneme symbol in FIG. 13 are those shown in Table 2. Table 3 shows the phoneme duration and pitch of each phoneme symbol in FIG. 14, and Table 4 shows those of each phoneme symbol in FIG. 15. Tables 3 and 4 also list the power of each phoneme symbol for use in the explanation given later.

First, a description is given with reference to FIGS. 12 and 15.
FIG. 15 shows the corrected pitch information produced in step S110 (filled circles in the figure) and the standard pitch information before correction (dashed circles in the figure). In FIG. 15, PitP is the pitch peak extracted in step S1101, and PitL is the terminal phoneme pitch acquired in step S1102. PitS is the reference pitch calculated in step S1103 (PitP itself), and PDif is the pitch change amount calculated in step S1104 (0.6). PitL1 is the terminal phoneme first pitch calculated in step S1105, and PitL2 is the terminal phoneme second pitch calculated in step S1106.

As can be understood from FIGS. 12 and 15, the pitch correction process replaces the standard terminal phoneme pitch PitL with the terminal phoneme first pitch PitL1 and the terminal phoneme second pitch PitL2, so that a sentence-final intonation that tended to fall can be corrected to a rising intonation (an intonation considered suitable for interrogative sentences).

One conceivable method is to correct PitL by adding the pitch change amount PDif to the pitch PitL that corresponds to the terminal phoneme in the standard pitch information. With accent type 0 (no accent nucleus), however, the pitch PitL is close to the pitch peak PitP, so the sentence-final pitch is expected to become unnaturally high (too high). In this embodiment, by contrast, the sentence-final pitch is expressed by a value smaller (PitL1) and a value larger (PitL2) than the reference pitch PitS calculated from the pitch peak PitP, so a natural interrogative intonation can be obtained stably regardless of accent type.
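The contrast between the two methods can be illustrated numerically. The sketch below is hypothetical throughout: both contours are invented for illustration and are not taken from the patent's tables, the function names are made up, and the comparison method is assumed to add PDif to PitL in the natural-log domain.

```python
import math

PDIF = 0.6  # pitch change amount in natural-log units (example from the text)


def naive_final_pitch(pitches_hz, p_dif=PDIF):
    """Comparison method: raise the terminal pitch PitL itself by PDif."""
    return math.exp(math.log(pitches_hz[-1]) + p_dif)


def peak_based_final_pitches(pitches_hz, p_dif=PDIF):
    """Method of the embodiment: place PitL1 and PitL2 around PitS = PitP."""
    log_pit_s = math.log(max(pitches_hz))
    return math.exp(log_pit_s - p_dif / 2), math.exp(log_pit_s + p_dif / 2)


# Hypothetical contours in Hz.
type0_like = [150.0, 180.0, 178.0, 176.0]   # ends close to the peak
falling = [150.0, 180.0, 140.0, 110.0]      # ends well below the peak

for name, contour in (("type0_like", type0_like), ("falling", falling)):
    naive = naive_final_pitch(contour)
    pit_l1, pit_l2 = peak_based_final_pitches(contour)
    print(f"{name}: naive={naive:.1f} Hz, "
          f"PitL1={pit_l1:.1f} Hz, PitL2={pit_l2:.1f} Hz")
```

With these inputs the final pitch of the comparison method depends strongly on where the contour ends, and it overshoots for the contour that ends near the peak. The peak-based values are identical for both contours because they share the same peak.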

In FIGS. 10 and 11, pitch is shown in units of frequency (Hz), whereas in FIGS. 12 to 15 pitch is shown as the natural logarithm of frequency. Just as raising an interval by one octave on a musical instrument doubles the pitch of the sound, when discussing trends in the height of a sound it is in some cases considered more appropriate to compare the natural logarithms of pitch than to compare the pitch values themselves. For this reason, in the pitch correction of this embodiment, pitch is handled in the form of its natural logarithm.
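As a worked illustration of why the logarithmic form is convenient, a fixed step in the natural-log domain corresponds to a fixed frequency ratio regardless of the starting frequency f. The symbols below follow the text; the numerical values are rounded.

```latex
\begin{aligned}
\ln(2f) - \ln(f) &= \ln 2 \approx 0.693
  && \text{(one octave, for any } f\text{)} \\
\ln(\mathrm{PitL2}) - \ln(\mathrm{PitL1}) &= \mathrm{PDif}
  \;\Longrightarrow\;
  \frac{\mathrm{PitL2}}{\mathrm{PitL1}} = e^{\mathrm{PDif}} \\
\mathrm{PDif} = 0.6 &\;\Longrightarrow\;
  \frac{\mathrm{PitL2}}{\mathrm{PitL1}} = e^{0.6} \approx 1.82
\end{aligned}
```

With the example value PDif = 0.6, the rise across the terminal phoneme is therefore a little less than one octave.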

[Modification of pitch correction process]
FIG. 16 is a flowchart of a modification of the pitch correction process described with reference to FIG. 11.

In this modification, the processing in steps S2101, S2102, and S2105 to S2108 is the same as the processing in steps S1101, S1102, and S1105 to S1108 described with reference to FIG. 11, respectively. Compared with the processing shown in FIG. 11, step S1103 is changed to step S2103 and step S1104 is changed to step S2104. That is, the calculation of the reference pitch and the calculation of the pitch change amount are changed.

In calculating the reference pitch in step S2103, the terminal phoneme pitch generation unit 332 calculates the reference pitch (PitS) according to the following equation (6).

PitS = PitL + (PitP − PitL) × m/M … Equation (6)
In calculating the pitch change amount in step S2104, the terminal phoneme pitch generation unit 332 calculates the pitch change amount (PDif) according to the following equation (7).

PDif = PTDIFF × n/N … Equation (7)
In the modification shown in FIG. 16, "m" and "M" in equation (6) and "n" and "N" in equation (7) are values for adjusting the pitch of the terminal phoneme after correction; for example, the user can enter them via the input unit 140.

Specifically, setting m smaller or M larger reduces the value of the reference pitch (PitS), and therefore lowers the pitch of the terminal phoneme as a whole.

Likewise, setting n smaller or N larger reduces the pitch change amount (PDif), and therefore reduces the degree of rise within the terminal phoneme.

The processing shown in FIG. 11 can also be regarded as the case where m = M and n = N.
FIGS. 13, 14, and 15 show, in that order, the pitch sequences for "m/M = 1/3, n/N = 1/3", "m/M = 2/3, n/N = 2/3", and "m/M = 3/3, n/N = 3/3". In these figures, D11, D21, and D31 are each the difference between the maximum pitch (PitP) and the pitch of the terminal phoneme (PitL) in the pitch sequence according to the standard pitch information, so D11 = D21 = D31. D12, D22, and D32 are each the difference between the reference pitch (PitS) and the pitch of the terminal phoneme (PitL) in the pitch sequence according to the standard pitch information. In the example shown in FIG. 15, the reference pitch (PitS) equals the maximum pitch (PitP), so D31 = D32.

Comparing FIGS. 13 to 15, the values of m (m/M) and n (n/N) are set progressively larger in the order of FIG. 13, FIG. 14, and FIG. 15. Correspondingly, in that order, the values of PitL1 and PitL2 at the terminal phoneme become higher, and the amount of change from PitL1 to PitL2 becomes larger.

Thus, in this embodiment, adjusting the values of m, n, M, and N makes it possible to adjust the values of PitL1 and PitL2, that is, how far the end of the utterance rises when the input text data is spoken, and also to adjust the amount of change from PitL1 to PitL2, that is, the degree of inflection at the end of the utterance (the position originally treated as the terminal phoneme). In this embodiment, the strength of the intonation that expresses a question can therefore be adjusted easily by adjusting the values of m, n, M, and N.

Furthermore, instead of adjusting the reference pitch (PitS) and the pitch change amount (PDif) independently, setting m = n and M = N so that both are adjusted from the same adjustment values makes it even easier to adjust the strength of the intonation that expresses a question.
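The modification with equations (6) and (7) can be sketched as follows. The function name, the list-of-Hz representation, and the sample contour are assumptions. Equation (6) is applied literally to pitch values in Hz; the text does not state whether the interpolation is meant in Hz or in the log domain, so that choice is also an assumption.

```python
import math

PTDIFF = 0.6  # default pitch change in natural-log units (example from the text)


def correct_terminal_pitch_variant(pitches_hz, m=1, M=1, n=1, N=1, ptdiff=PTDIFF):
    """Variant pitch correction (S2101 to S2108) with adjustable m/M and n/N."""
    if not pitches_hz:
        raise ValueError("pitch series must not be empty")
    pit_p = max(pitches_hz)                          # S2101: peak pitch PitP
    pit_l = pitches_hz[-1]                           # S2102: terminal pitch PitL
    pit_s = pit_l + (pit_p - pit_l) * m / M          # S2103: equation (6)
    p_dif = ptdiff * n / N                           # S2104: equation (7)
    pit_l1 = math.exp(math.log(pit_s) - p_dif / 2)   # S2105: equation (3)
    pit_l2 = math.exp(math.log(pit_s) + p_dif / 2)   # S2106: equation (4)
    return pitches_hz[:-1] + [pit_l1, pit_l2]        # S2107, S2108


# Hypothetical contour in Hz, not taken from the patent's tables.
contour = [150.0, 180.0, 140.0, 110.0]

# One shared strength setting k/3 for both ratios (m = n = k, M = N = 3).
for k in (1, 2, 3):
    out = correct_terminal_pitch_variant(contour, m=k, M=3, n=k, N=3)
    print(f"k={k}: PitL1={out[-2]:.1f} Hz, PitL2={out[-1]:.1f} Hz")
```

Using one shared ratio mirrors the m = n, M = N case: a single setting moves both the height and the steepness of the final rise, and the defaults m = M and n = N reduce to the processing of FIG. 11.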

For reference, FIGS. 17 and 18 show the corrected pitch sequence and the corrected power for each phoneme symbol, corresponding to FIGS. 14 and 15, respectively. The horizontal and vertical axes of FIGS. 17 and 18 are the same as those of FIGS. 12 and 13. As in FIGS. 12 and 13, pitch is indicated by "●" and power by "○" in FIGS. 17 and 18.

In the embodiment described above, the pitch peak extraction unit 331 extracts the maximum pitch in the pitch sequence according to the standard prosodic information, the terminal phoneme pitch generation unit 332 uses that maximum to generate the terminal phoneme first pitch and the terminal phoneme second pitch (the pitches given to the terminal phoneme of the input text data), and the terminal phoneme pitch setting unit 333 generates the corrected pitch information, which is a pitch sequence corresponding to the text data input to the text analysis unit 1.

Because the maximum pitch (PitP) in the pitch sequence according to the standard prosodic information is used to correct that pitch sequence, an intonation that expresses a question can be generated without creating diverse pattern data in advance for correcting the pitch sequence, and without holding a large amount of data such as pattern data for prosody correction.

The terminal phoneme first pitch described above is calculated by subtracting half of the pitch change amount (PDif) from PitP, and the terminal phoneme second pitch is calculated by adding half of PDif to PitP, both in the logarithmic domain of equations (3) and (4). This makes it possible to generate, at the terminal phoneme, an intonation that expresses a question, and also to generate a natural intonation that follows the intonation originally given to the input text data.

In this embodiment, the terminal phoneme first pitch is made smaller than the terminal phoneme second pitch, so an intonation that expresses a question can be generated at the terminal phoneme.

In this embodiment, the terminal phoneme first pitch is set smaller than the pitch of the terminal phoneme according to the standard pitch information (PitL: the pitch given in advance to the terminal phoneme), and the terminal phoneme second pitch is set larger than PitL, so a natural intonation that expresses a question and follows the intonation of the sentence can be generated at the terminal phoneme.

In this embodiment, as shown in FIG. 13 or FIG. 14, the reference pitch is calculated so as to lie between PitP and PitL above, the terminal phoneme first pitch is set smaller than the reference pitch, and the terminal phoneme second pitch is set larger than the reference pitch. A natural intonation that expresses a question and follows the intonation of the sentence can therefore be generated at the terminal phoneme regardless of the accent type of the sentence.

In this embodiment, power correction is applied to the terminal phoneme in addition to pitch correction, so an intonation that expresses a question can be generated.

In this embodiment, the correction values for the power of the terminal phoneme are generated from the power of the terminal phoneme according to the standard power information (PowL: the power given in advance to the terminal phoneme), so a natural intonation that expresses a question can be generated at the terminal phoneme.

In this embodiment, the terminal phoneme first-half power (PowL1) is set to PowL above and the terminal phoneme second-half power (PowL2) is set to a value smaller than PowL1, so a natural intonation that expresses a question and follows the intonation of the sentence can be generated at the terminal phoneme.

In this embodiment, the corrected phoneme duration information is generated from the phoneme duration of the terminal phoneme according to the standard phoneme duration information (the phoneme duration given in advance to the terminal phoneme), so a natural intonation that expresses a question can be generated at the terminal phoneme.

In this embodiment, the terminal phoneme first-half duration (DurL1) and the terminal phoneme second-half duration (DurL2) are generated from the phoneme duration of the terminal phoneme according to the standard phoneme duration information (DurL) and from DRWGT1 and DRWGT2, both of which are smaller than 1. Both durations are therefore shorter than DurL, and a natural intonation that expresses a question and follows the intonation of the sentence can be generated at the terminal phoneme.

In the embodiment described above, the prosody correction unit 3 represents one embodiment of the prosody correction device of the present invention. The prosody correction device need only include at least the pitch correction unit 33.

Also, in the embodiment described above, the speech synthesizer 100 including the correction/synthesis unit 111 constitutes one embodiment of the speech synthesizer of the present invention. The speech synthesizer need only include at least the pitch correction unit 33 and the synthesis unit 4.

The embodiment disclosed herein should be considered illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

The present invention can be applied to devices that output, as speech through a speaker, headphones, or the like, text data that is recorded in advance or dynamically read via a network or the like. Examples include telephones, home appliances, game machines, game machine software, personal computers, and personal computer software.

The present invention can also be applied to devices that output, as speech through a speaker, headphones, or the like, voice data that is recorded in advance, dynamically read via a network or the like, or recorded via a microphone or the like. Examples include telephones, game machines, game machine software, personal computers, and personal computer software.

The present invention can also be applied to devices that output, via a network or the like, text data or voice data that is recorded in advance or dynamically read via a network or the like. Examples include interactive voice guidance servers and interactive game servers.

FIG. 1 is a diagram showing the hardware configuration of a speech synthesizer according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing the configuration of the correction/synthesis unit of the speech synthesizer of FIG. 1.
FIG. 3 is a flowchart of the speech synthesis process executed in the speech synthesizer of FIG. 1.
FIG. 4 is a functional block diagram showing the configuration of the power correction unit of FIG. 2.
FIG. 5 is a flowchart of the power correction subroutine of FIG. 3.
FIG. 6 is a diagram schematically showing an example of the standard prosodic information handled in the speech synthesizer of FIG. 1.
FIG. 7 is a diagram schematically showing an example of the corrected prosodic information handled in the speech synthesizer of FIG. 1.
FIG. 8 is a functional block diagram showing the configuration of the phoneme duration correction unit of FIG. 2.
FIG. 9 is a flowchart of the phoneme duration correction subroutine of FIG. 3.
FIG. 10 is a functional block diagram showing the configuration of the pitch correction unit of FIG. 2.
FIG. 11 is a flowchart of the pitch correction subroutine of FIG. 3.
FIG. 12 is a diagram showing an example of a pitch sequence according to the standard pitch information handled in the speech synthesizer of FIG. 1.
FIG. 13 is a diagram showing a first example of a pitch sequence generated by correcting the pitch sequence of FIG. 12 in the pitch correction unit of FIG. 10.
FIG. 14 is a diagram showing a second example of a pitch sequence generated by correcting the pitch sequence of FIG. 12 in the pitch correction unit of FIG. 10.
FIG. 15 is a diagram showing a third example of a pitch sequence generated by correcting the pitch sequence of FIG. 12 in the pitch correction unit of FIG. 10.
FIG. 16 is a flowchart of a modification of the process shown in FIG. 11.
FIG. 17 is a diagram schematically showing another example of the corrected prosodic information handled in the speech synthesizer of FIG. 1.
FIG. 18 is a diagram schematically showing yet another example of the corrected prosodic information handled in the speech synthesizer of FIG. 1.

Explanation of symbols

1 text analysis unit, 2 prosody generation unit, 3 prosody correction unit, 4 synthesis unit, 21 power generation unit, 22 phoneme duration generation unit, 23 pitch generation unit, 31 power correction unit, 32 phoneme duration correction unit, 33 pitch correction unit, 100 speech synthesizer, 110 central processing unit, 111 correction/synthesis unit, 120 antenna, 121 communication control unit, 130 flash memory, 140 input unit, 150 prosodic information table storage unit, 160 ROM, 170 display unit, 180 amplifier, 181 speaker, 311 terminal phoneme power generation unit, 312 terminal phoneme power setting unit, 321 terminal phoneme duration generation unit, 322 terminal phoneme duration setting unit, 331 pitch peak extraction unit, 332 terminal phoneme pitch generation unit, 333 terminal phoneme pitch setting unit.

Claims

1. A prosody correction device comprising:
pitch peak extraction means for extracting a maximum pitch value from the end of a pitch sequence given in advance to a phoneme symbol sequence;
terminal phoneme pitch generation means for generating, using the extracted maximum pitch value, a terminal phoneme pitch that is a pitch to be given to the terminal phoneme in the phoneme symbol sequence; and
terminal phoneme pitch correction means for correcting the pitch sequence given in advance to the phoneme symbol sequence by giving the generated terminal phoneme pitch to the terminal phoneme in the phoneme symbol sequence,
wherein the terminal phoneme pitch generation means
calculates a reference pitch corresponding to a value between the extracted maximum pitch value and the pitch given in advance to the terminal phoneme in the phoneme symbol sequence, and
generates a first pitch and a second pitch as the terminal phoneme pitch,
the first pitch is generated using a predetermined value representing the amount of change in pitch within the terminal phoneme of the phoneme symbol sequence, and is smaller than the reference pitch,
the second pitch is generated using the predetermined value and is larger than the reference pitch,
the amount of change between the first pitch and the second pitch is calculated using the predetermined value, and
the amount of change between the first pitch and the second pitch is a difference or a ratio between the first pitch and the second pitch.

2. The prosody correction device according to claim 1, wherein the pitch peak extraction means extracts the maximum value from the accent phrase at the end, or the breath group at the end, of the pitch sequence given in advance to the phoneme symbol sequence.

3. The prosody correction device according to claim 1 or 2, wherein the terminal phoneme pitch generation means sets the first pitch to a value smaller than the extracted maximum pitch value and sets the second pitch to a value larger than the extracted maximum pitch value.

4. The prosody correction device according to claim 1, wherein, where PitP is the extracted maximum value, PitL is the pitch given in advance to the terminal phoneme in the phoneme symbol sequence, PTDIFF is the predetermined value representing the amount of change in pitch, and m, M, n, and N are values relating to the degree of correction of the pitch of the terminal phoneme in the phoneme symbol sequence, the terminal phoneme pitch generation means calculates the reference pitch PitS according to
PitS = PitL + (PitP−PitL) × m / M
and calculates PDif, the amount of change between the first pitch and the second pitch, according to
PDif = PTDIFF × n / N.
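
The two formulas of this claim can be sketched in Python as follows. The function names are hypothetical, and the predetermined value is written PTDIFF as in the second formula. The claims only require the first pitch to lie below PitS, the second pitch to lie above it, and their difference or ratio to follow from PDif. Placing the two pitches symmetrically around PitS, with PDif taken as a difference, is therefore an assumption made here for illustration.

```python
from typing import Sequence, Tuple


def reference_pitch(pit_p: float, pit_l: float, m: float, M: float) -> float:
    """PitS = PitL + (PitP - PitL) * m / M."""
    return pit_l + (pit_p - pit_l) * m / M


def pitch_change(ptdiff: float, n: float, N: float) -> float:
    """PDif = PTDIFF * n / N."""
    return ptdiff * n / N


def terminal_phoneme_pitches(
    end_pitches: Sequence[float], ptdiff: float, m: float, M: float, n: float, N: float
) -> Tuple[float, float]:
    """Return (first pitch, second pitch) for the terminal phoneme.

    end_pitches is the end portion of the pitch sequence (for example the
    final accent phrase); its last element is the terminal phoneme pitch PitL.
    """
    pit_p = max(end_pitches)   # PitP: peak of the end portion
    pit_l = end_pitches[-1]    # PitL: pitch given in advance to the terminal phoneme
    pit_s = reference_pitch(pit_p, pit_l, m, M)
    p_dif = pitch_change(ptdiff, n, N)
    # Assumption: PDif is a difference, split evenly around PitS.
    first_pitch = pit_s - p_dif / 2
    second_pitch = pit_s + p_dif / 2
    return first_pitch, second_pitch
```

With 0 < m / M < 1, PitS falls strictly between PitL and PitP, which matches the requirement that the reference pitch correspond to a value between the two.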

5. The prosody correction device according to claim 4, wherein the values m, M, n, and N relating to the degree of correction of the pitch of the terminal phoneme satisfy m = n and M = N.

6. The prosody correction device according to claim 1, further comprising power correction means for correcting the power given in advance to the terminal phoneme in the phoneme symbol sequence, or phoneme duration correction means for correcting the duration given in advance to the terminal phoneme in the phoneme symbol sequence.

7. The prosody correction device according to claim 6, wherein the power correction means corrects the power of the terminal phoneme using the power given in advance to the terminal phoneme.

8. The prosody correction device according to claim 7, wherein the power correction means
corrects the power of the terminal phoneme to generate a first power and a second power to be given to the terminal phoneme, and
generates the second power, based on a predetermined value representing an amount of attenuation relative to the power given in advance to the terminal phoneme, so that the second power is smaller than the power given in advance to the terminal phoneme.

9. The prosody correction device according to claim 6, wherein the phoneme duration correction means corrects the phoneme duration of the terminal phoneme using the phoneme duration given in advance to the terminal phoneme.

10. The prosody correction device according to claim 9, wherein the phoneme duration correction means corrects the phoneme duration of the terminal phoneme by generating, based on a predetermined value representing a ratio to the phoneme duration given in advance to the terminal phoneme, a first phoneme duration and a second phoneme duration that are each shorter than the phoneme duration given in advance to the terminal phoneme.

11. A speech synthesizer comprising:
text analysis means for analyzing input text data and acquiring phoneme symbols for the text data and sentence-end prosody control information representing how the sentence-end prosody is to be controlled;
prosody generation means for generating, based on the acquired phoneme symbols, prosodic information including at least a pitch sequence of the text data;
pitch peak extraction means for extracting a maximum pitch value from the end of the pitch sequence generated by the prosody generation means;
terminal phoneme pitch generation means for generating, using the extracted maximum pitch value, a terminal phoneme pitch that is a pitch to be given to the terminal phoneme of the text data;
prosody correction means for correcting the prosodic information of the text data generated by the prosody generation means by giving the terminal phoneme pitch to the terminal phoneme of the text data; and
synthesis means for synthesizing a speech signal for the text data using the acquired phoneme symbols and the corrected prosodic information,
wherein the terminal phoneme pitch generation means
calculates a reference pitch corresponding to a value between the extracted maximum pitch value and the pitch given in advance to the terminal phoneme in the phoneme symbol sequence, and
generates a first pitch and a second pitch as the terminal phoneme pitch,
the first pitch is generated using a predetermined value representing the amount of change in pitch within the terminal phoneme of the phoneme symbol sequence, and is smaller than the reference pitch,
the second pitch is generated using the predetermined value and is larger than the reference pitch,
the amount of change between the first pitch and the second pitch is calculated using the predetermined value, and
the amount of change between the first pitch and the second pitch is a difference or a ratio between the first pitch and the second pitch.

12. The speech synthesizer according to claim 11, wherein the pitch peak extraction means extracts the maximum value from the accent phrase at the end, or the breath group at the end, of the pitch sequence given in advance to the phoneme symbol sequence.

13. The speech synthesizer according to claim 11 or 12, further comprising speech unit storage means for storing a plurality of speech units each representing speech information corresponding to one or more phoneme symbols.

14. A prosody correction method comprising the steps of:
extracting a maximum pitch value from the end of a pitch sequence given in advance to a phoneme symbol sequence;
generating, using the extracted maximum pitch value, a terminal phoneme pitch that is a pitch to be given to the terminal phoneme of the phoneme symbol sequence; and
giving the terminal phoneme pitch to the terminal phoneme of the phoneme symbol sequence,
wherein the step of generating the terminal phoneme pitch includes
calculating a reference pitch corresponding to a value between the extracted maximum pitch value and the pitch given in advance to the terminal phoneme in the phoneme symbol sequence, and
generating a first pitch and a second pitch as the terminal phoneme pitch,
the first pitch is generated using a predetermined value representing the amount of change in pitch within the terminal phoneme of the phoneme symbol sequence, and is smaller than the reference pitch,
the second pitch is generated using the predetermined value and is larger than the reference pitch,
the amount of change between the first pitch and the second pitch is calculated using the predetermined value, and
the amount of change between the first pitch and the second pitch is a difference or a ratio between the first pitch and the second pitch.

15. The prosody correction method according to claim 14, wherein the step of extracting the maximum pitch value extracts the maximum value from the accent phrase at the end, or the breath group at the end, of the pitch sequence given in advance to the phoneme symbol sequence.

16. A prosody correction program for correcting the prosody of the terminal phoneme of a phoneme symbol sequence,
the program causing a computer to execute the prosody correction method according to claim 14 or 15.

17. A speech synthesis method comprising the steps of:
analyzing input text data and acquiring phoneme symbols for the text data and sentence-end prosody control information representing how the sentence-end prosody is to be controlled;
generating, based on the acquired phoneme symbols, prosodic information including at least a pitch sequence of the text data;
extracting a maximum pitch value from the end of the generated pitch sequence;
generating, using the extracted maximum pitch value, a terminal phoneme pitch that is a pitch to be given to the terminal phoneme of the text data;
correcting the generated prosodic information by giving the terminal phoneme pitch to the terminal phoneme of the text data; and
synthesizing a speech signal for the text data using the acquired phoneme symbols and the corrected prosodic information,
wherein the step of generating the terminal phoneme pitch includes
calculating a reference pitch corresponding to a value between the extracted maximum pitch value and the pitch given in advance to the terminal phoneme in the phoneme symbol sequence, and
generating a first pitch and a second pitch as the terminal phoneme pitch,
the first pitch is generated using a predetermined value representing the amount of change in pitch within the terminal phoneme of the phoneme symbol sequence, and is smaller than the reference pitch,
the second pitch is generated using the predetermined value and is larger than the reference pitch,
the amount of change between the first pitch and the second pitch is calculated using the predetermined value, and
the amount of change between the first pitch and the second pitch is a difference or a ratio between the first pitch and the second pitch.

18. The speech synthesis method according to claim 17, wherein the step of extracting the maximum pitch value extracts the maximum value from the accent phrase at the end, or the breath group at the end, of the pitch sequence given in advance to the phoneme symbol sequence.

19. A speech synthesis program for synthesizing a speech signal based on input text data,
the program causing a computer to execute the speech synthesis method according to claim 17 or 18.