JP2006337476A

JP2006337476A - Voice synthesis method and system

Info

Publication number: JP2006337476A
Application number: JP2005159123A
Authority: JP
Inventors: Masaaki Yamada; 雅章山田; Yasuo Okuya; 泰夫奥谷; Michio Aizawa; 道雄相澤
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2005-05-31
Filing date: 2005-05-31
Publication date: 2006-12-14
Also published as: WO2006129814A1

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesis method, and a system therefor, that improve the quality of synthesized speech by adjusting prosody appropriately.
SOLUTION: The speech synthesis method selects segments, determines whether or not to change the prosody of each selected segment, calculates a prosody change target value for each segment for which it was determined that the prosody is to be changed, and changes the prosody of each such segment so that it matches the calculated prosody change target value.
COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a speech synthesis method for synthesizing desired speech.

Speech synthesis techniques for synthesizing desired speech have long existed. Synthesis is achieved by connecting speech segments that correspond to the desired utterance content and adjusting them so that the result has the desired prosody.

One typical speech synthesis technique is based on a source-vocal tract model. In this approach, the speech segments are sequences of vocal tract parameters. Synthesized speech is obtained by using these parameters to filter a pulse train that simulates vocal cord vibration, or noise that simulates the noise of exhaled breath.

More recently, a technique called corpus-based speech synthesis has come into wide use (see, for example, Non-Patent Document 1). In this technique, speech with many variations is recorded in advance, and synthesis consists only of connecting segments. Prosody is adjusted by appropriately selecting, from a large number of segments, segments that have the desired prosody.

In general, corpus-based speech synthesis is considered to yield more natural, higher-quality synthesized speech than synthesis based on a source-vocal tract model. This is said to be because it involves no processing, such as modeling or signal processing, that transforms the speech (and thereby degrades it).
Non-Patent Document 1: Segi and Takagi (世木、都木), "Segment selection method for high-quality speech synthesis using recorded speech of news programs," IEICE Technical Report, SP2003-35, pp. 1-6, June 2003.

However, corpus-based speech synthesis has the problem that the quality of the synthesized speech suffers when no segment with the desired prosody can be found. In particular, a prosodic gap arises between a segment that deviates from the desired prosody and the segments on either side of it, which greatly impairs the naturalness of the synthesized speech.

To solve the above problem, the speech synthesis method of the present invention comprises: a step of selecting segments; a step of determining, for each selected segment, whether or not to change its prosody; a step of calculating a prosody change target value for each segment for which it was determined that the prosody is to be changed; a step of changing the prosody of each such segment so that it matches the calculated prosody change target value; and a step of connecting the segments whose prosody has been changed with the segments for which it was determined that the prosody is not to be changed.

According to the present invention, prosody can be adjusted appropriately and the quality of synthesized speech can be improved.

Preferred embodiments of the present invention are described below with reference to the drawings.

FIG. 1 shows the hardware configuration of an embodiment of the present invention. Reference numeral 1 denotes a central processing unit that performs numerical calculation, control, and similar processing, and carries out operations according to the procedure of the present invention. Reference numeral 2 denotes an audio output unit that outputs speech. Reference numeral 3 denotes an input unit such as a touch panel, keyboard, mouse, or buttons, which the user uses to give operating instructions to the apparatus. In a device that operates autonomously without user instructions, the input unit may be omitted.

Reference numeral 4 denotes an external storage unit such as a disk device or non-volatile memory, which holds the language analysis dictionary 401, the prosody prediction parameters 402, the speech segment database 403, and other data used for speech synthesis. The external storage unit 4 also holds whichever of the information kept in the RAM 6 must be retained permanently. The external storage unit 4 may be a portable medium such as a CD-ROM or memory card, which improves convenience.

Reference numeral 5 denotes a read-only memory (ROM), which stores the program code 501 for implementing the present invention, fixed data (not shown), and the like. In the present invention, however, the division of roles between the external storage unit 4 and the ROM 5 is arbitrary. For example, the program code 501 may be installed in the external storage unit 4 instead of the ROM 5. Reference numeral 6 denotes a memory for temporary information, such as a RAM, which holds temporary data, various flags, and the like. The central processing unit 1 through the RAM 6 are connected by a bus 7.

Next, the processing flow of this embodiment is described with reference to FIG. 2. First, in step S1, the input speech synthesis target is analyzed. When the target is natural-language text such as 「今日は良い天気です。」 ("The weather is nice today."), this step uses natural language processing techniques such as morphological analysis and syntactic analysis, consulting the language analysis dictionary 401 as needed. When the target is instead written in an artificial language for speech synthesis, such as 「キョ’ーワ／ヨ’イ／テ’ンキデス。」, a dedicated analysis process is performed in this step.

Next, in step S2, the phoneme sequence of the synthesized speech is determined from the analysis result of step S1. Then, in step S3, the factors used for segment selection are obtained from the analysis result of step S1. These factors include the mora count and accent type of the phrase to which each phoneme belongs, the phoneme's position within the sentence, phrase, or word, and the phoneme type.

Next, in step S4, appropriate segments are selected from the speech segment database 403 stored in the external storage unit 4, based on the result of step S3. Besides the result of step S3, segments can also be selected so as to reduce the gaps in spectral shape and prosody between adjacent segments.

Next, in step S5, the prosody values of the segments selected in step S4 are obtained. The values can be measured directly from the selected segments, or values measured in advance can be stored in the external storage unit 4 and read out. Although prosody generally refers to fundamental frequency (F0), duration, and power, this step may obtain only some of these. Because corpus-based speech synthesis by default does not change prosody, values for prosodic features that will not be subject to change need not be obtained.

Next, in step S6, the degree of dissociation between the prosody value of each segment obtained in step S5 and the prosody values of its adjacent segments is evaluated. If the degree of dissociation is greater than a threshold, processing proceeds to step S7; otherwise, processing proceeds to step S9.

When the prosodic feature is F0, a single segment has multiple values, so the degree of dissociation can be evaluated in several ways. For example, the evaluation can use a representative value such as the mean, intermediate value, maximum, or minimum of F0, the F0 at the center of the segment, or the F0 at the end points of the segment. The slope of F0 can also be used.
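The determination in step S6 can be sketched as follows. This is a minimal illustration, not the method of the embodiment: the function names, the choice of "distance from the mean of the neighbouring representative values" as the degree of dissociation, and the threshold in Hz are all assumptions made for the example.

```python
from statistics import mean

def representative_f0(contour, method="mean"):
    """Reduce one segment's F0 contour (a list of Hz values) to a single value."""
    if method == "mean":
        return mean(contour)
    if method == "max":
        return max(contour)
    if method == "min":
        return min(contour)
    if method == "center":
        return contour[len(contour) // 2]
    raise ValueError(f"unknown method: {method}")

def dissociation_degrees(contours, method="mean"):
    """Hypothetical measure: distance between each segment's representative F0
    and the mean representative F0 of its existing neighbours."""
    reps = [representative_f0(c, method) for c in contours]
    degrees = []
    for i, rep in enumerate(reps):
        neighbours = [reps[j] for j in (i - 1, i + 1) if 0 <= j < len(reps)]
        degrees.append(abs(rep - mean(neighbours)) if neighbours else 0.0)
    return degrees

def needs_change(contours, threshold_hz, method="mean"):
    """Step S6 decision: True where the degree of dissociation exceeds the threshold."""
    return [d > threshold_hz for d in dissociation_degrees(contours, method)]
```

With five segments whose mean F0 values are 120, 125, 180, 122, and 118 Hz and an assumed threshold of 40 Hz, only the third segment is flagged for a prosody change.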

In step S7, the prosody value after the change is calculated. Most simply, a constant can be added to or multiplied with the segment prosody value obtained in step S5 so that the degree of dissociation used in step S6 is minimized. Alternatively, the amount of change can be limited to just enough to bring the degree of dissociation within the threshold. The post-change value can also be obtained by interpolating the prosody values of the segments for which step S6 determined that the prosody is not to be changed.
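Two of the options described for step S7 can be sketched as below, under stated assumptions: interpolation is done linearly over the segment index (the source does not say over which axis to interpolate), and the "constant added" variant shifts a whole F0 contour so that its mean lands on the target. Both function names are hypothetical.

```python
def interpolate_targets(values, change_flags):
    """Give each flagged segment a target interpolated from the nearest
    unflagged segments on either side (linear in segment index)."""
    kept = [i for i, flagged in enumerate(change_flags) if not flagged]
    if not kept:
        raise ValueError("at least one segment must be left unchanged")
    targets = list(values)
    for i, flagged in enumerate(change_flags):
        if not flagged:
            continue
        left = max((k for k in kept if k < i), default=None)
        right = min((k for k in kept if k > i), default=None)
        if left is None:
            targets[i] = values[right]   # no unchanged segment to the left
        elif right is None:
            targets[i] = values[left]    # no unchanged segment to the right
        else:
            w = (i - left) / (right - left)
            targets[i] = (1 - w) * values[left] + w * values[right]
    return targets

def shift_contour(contour, target_mean):
    """Add a constant to every F0 sample so the contour mean equals target_mean."""
    offset = target_mean - sum(contour) / len(contour)
    return [f0 + offset for f0 in contour]
```

For representative values 120, 125, 180, 122, and 118 Hz with only the third segment flagged, the interpolated target is 123.5 Hz, and shifting the contour [170, 180, 190] to that mean gives [113.5, 123.5, 133.5]. The shifted contour is only the target; the waveform itself would still have to be modified in step S8.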

Next, in step S8, the prosody of the segment is changed according to the value calculated in step S7. Any of the methods used in conventional speech synthesis (for example, PSOLA) can be used to change the prosody.

In step S9, according to the result of step S6, the segments selected in step S4 and the segments whose prosody was changed in step S8 are connected and output as synthesized speech.

With this configuration, even when no segment with the desired prosody can be found, the prosodic gap between a segment that deviates from the desired prosody and the segments on either side of it is reduced. As a result, the quality of the synthesized speech is kept from being greatly impaired.

In a second embodiment, prosody is predicted from the language analysis result, and the predicted prosody is used.

The processing flow of this embodiment is described with reference to FIG. 3. Steps S1 and S2 are the same as in the first embodiment. Next, in step S101, the factors used for prosody prediction are obtained from the analysis result of step S1. These factors include the mora count and accent type of the phrase to which each phoneme belongs, the phoneme's position within the sentence, phrase, or word, the phoneme type, and information about adjacent phrases and words.

Next, in step S102, prosody is predicted from the prosody prediction factors obtained in step S101 and the prosody prediction parameters 402 stored in the external storage unit 4. Although prosody generally refers to fundamental frequency (F0), duration, and power, this step may predict only some of these. Because corpus-based speech synthesis can use the prosody of the segments themselves, not all of this information needs to be predicted.

Next, as in the first embodiment, segment selection factor acquisition (step S3), segment selection (step S4), and segment prosody acquisition (step S5) are performed. Because this embodiment predicts prosody, the prosody predicted in step S102 can also be used as information for selecting segments.

Next, in step S103, the degree of dissociation between the prosody value predicted in step S102 and the segment prosody value obtained in step S5 is evaluated. If the degree of dissociation is greater than a threshold, processing proceeds to step S104; otherwise, processing proceeds to step S9.

When the prosodic feature is F0, a single segment has multiple values, so the degree of dissociation can be evaluated in several ways. For example, the evaluation can use a representative value such as the mean, intermediate value, maximum, or minimum of F0, the F0 at the center of the segment, or the mean squared error between the predicted prosody values and the segment prosody values.
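The mean squared error variant of the step S103 determination can be sketched as follows. It assumes the predicted and measured F0 contours have already been sampled at the same time points, which the source does not specify; the function names and the threshold (in Hz squared) are illustrative.

```python
def mean_squared_error(predicted, measured):
    """Mean squared error between two equally sampled F0 contours (Hz)."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("contours must be non-empty and of equal length")
    return sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(predicted)

def should_change(predicted, measured, threshold):
    """Step S103 decision: change prosody when the error exceeds the threshold."""
    return mean_squared_error(predicted, measured) > threshold
```

For example, a segment measured at 150 Hz throughout against a prediction of 120 Hz has an error of 900 and is flagged under an assumed threshold of 400, while a segment measured at 125 Hz (error 25) is left unchanged.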

Next, in step S104, the prosody value after the change is calculated. Most simply, the post-change value can be set equal to the prosody value predicted in step S102. Alternatively, the amount of change can be limited to just enough to bring the degree of dissociation used in step S103 within the threshold. Steps S8 and S9 are the same as in the first embodiment.
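The second option in step S104, limiting the change to what is needed to bring the dissociation within the threshold, can be sketched for a single representative value. The sketch assumes the degree of dissociation is the absolute difference between the segment value and the predicted value; `limited_target` is a hypothetical name.

```python
def limited_target(segment_value, predicted_value, threshold):
    """Move the segment value toward the prediction only as far as needed
    for the absolute difference to equal the threshold."""
    diff = segment_value - predicted_value
    if abs(diff) <= threshold:
        return segment_value  # already within the threshold: no change
    direction = 1 if diff > 0 else -1
    return predicted_value + direction * threshold
```

With a predicted value of 120 Hz and a threshold of 20 Hz, a segment at 180 Hz is moved to 140 Hz rather than all the way to 120 Hz, which keeps the amount of signal modification, and hence the degradation it may cause, smaller than a full change to the predicted value.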

In a third embodiment, segment selection is performed again for the segments that the preceding embodiment determined should have their prosody changed.

The processing flow of this embodiment is described with reference to FIG. 4. Steps S1 through S5 are performed as in the second embodiment. Next, in step S103, the same determination as in the second embodiment is made. If the degree of dissociation is greater than the threshold, processing proceeds to step S201; otherwise, processing proceeds to step S9.

In step S201, the factors used for reselecting a segment are obtained. In addition to the factors used in step S3, information about the segments for which step S103 determined that the prosody is not to be changed can be used. For example, the prosody values of those segments can be used to improve prosodic continuity. Spectral continuity with the segments obtained in step S5 can also be taken into account.

Next, in step S202, a segment is selected based on the result of step S201, in the same manner as in step S4.

Then, as in the first embodiment, prosody is changed where the dissociation between the prosody values of adjacent segments is large (steps S6, S7, and S8). Finally, synthesized speech is output as in the preceding embodiments (step S9).

This configuration makes it possible to select more suitable segments.
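The reselection of steps S103, S201, and S202 can be sketched as below. The data layout is an assumption: each segment position has one representative prosody value and a list of candidate values standing in for the alternatives in the speech segment database. Using the mean of the unchanged neighbours as the reselection reference is one possible reading of "improving prosodic continuity", not the embodiment's stated rule.

```python
from statistics import mean

def reselect_segments(values, predicted, candidates, threshold):
    """Re-pick flagged segments from their candidate lists."""
    # Step S103: flag segments that deviate too far from the predicted prosody.
    flags = [abs(v - p) > threshold for v, p in zip(values, predicted)]
    result = list(values)
    for i, flagged in enumerate(flags):
        if not flagged:
            continue
        # Step S201: prosody values of adjacent segments judged acceptable.
        kept = [values[j] for j in (i - 1, i + 1)
                if 0 <= j < len(values) and not flags[j]]
        # Fall back to the predicted value when no neighbour was kept.
        reference = mean(kept) if kept else predicted[i]
        # Step S202: choose the candidate closest to the reference.
        result[i] = min(candidates[i], key=lambda c: abs(c - reference))
    return result
```

After reselection, the adjacent-segment check of the first embodiment (steps S6 to S8) would still be applied to the result, so a reselected segment that remains out of line can have its prosody changed.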

It goes without saying that the object of the present invention is also achieved by supplying a system or apparatus with a storage medium on which is recorded the program code of software that implements the functions of the embodiments described above, and having a computer (or CPU or MPU) of that system or apparatus read and execute the program code stored on the storage medium.

In this case, the program code read from the storage medium itself implements the functions of the embodiments described above, and the storage medium storing the program code constitutes the present invention.

Storage media that can be used to supply the program code include, for example, flexible disks, hard disks, optical disks, magneto-optical disks, CD-ROMs, CD-Rs, magnetic tapes, non-volatile memory cards, and ROMs.

It also goes without saying that the invention covers not only the case where the functions of the embodiments described above are implemented by the computer executing the program code it has read, but also the case where an OS (operating system) or the like running on the computer performs part or all of the actual processing according to the instructions of the program code, and that processing implements the functions of the embodiments.

Likewise, it goes without saying that the invention covers the case where the program code read from the storage medium is written to a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like on that board or unit performs part or all of the actual processing according to the instructions of the program code, and that processing implements the functions of the embodiments.

FIG. 1 is a block diagram showing the hardware configuration in an embodiment of the present invention.
FIG. 2 is a flowchart showing the processing flow in the first embodiment.
FIG. 3 is a flowchart showing the processing flow in the second embodiment.
FIG. 4 is a flowchart showing the processing flow in the third embodiment.

Explanation of symbols

1 Central processing unit that performs processing such as numerical calculation and control
2 Audio output unit that presents speech to the user
3 Input unit such as a touch panel, keyboard, mouse, or buttons
4 External storage unit such as a disk device or non-volatile memory
5 Read-only memory
6 Memory that holds temporary information, such as a RAM
7 Bus
401 Language analysis dictionary used for speech synthesis
402 Prosody prediction parameters used for speech synthesis
403 Speech segment database used for speech synthesis
501 Program code for implementing the present invention

Claims

A speech synthesis method comprising:
a step of selecting a segment;
a step of determining whether or not to change the prosody of the selected segment;
a step of calculating a prosody change target value for a segment for which it is determined, as a result of the determination, that the prosody is to be changed;
a step of changing the prosody of the segment for which it is determined that the prosody is to be changed so that the prosody becomes the calculated prosody change target value; and
a step of connecting the segment whose prosody has been changed and a segment for which it is determined, as a result of the determination, that the prosody is not to be changed.

The speech synthesis method according to claim 1, wherein whether or not to change the prosody is determined based on a prosody dissociation degree between adjacent segments.

The speech synthesis method according to claim 1, wherein the prosodic change target value is calculated based on a prosodic value of a segment determined not to change prosody as a result of the determination.

The speech synthesis method according to claim 1, further comprising a step of predicting prosody, wherein whether or not to change the prosody is determined based on a degree of dissociation from the predicted prosody.

The speech synthesis method according to claim 1, further comprising a step of predicting prosody, wherein the prosody change target value is calculated based on the predicted prosody value.

The speech synthesis method according to claim 1, further comprising a step of reselecting a segment with respect to a segment determined to change prosody as a result of the determination.

The speech synthesis method according to claim 6, wherein the reselection of the segment is performed based on information of a segment determined not to change the prosody as a result of the determination.

The speech synthesis method according to claim 6, further comprising a step of re-determining whether or not to change the prosody of the reselected segment, wherein the prosody change is performed on a segment for which it is re-determined that the prosody is to be changed.

A control program for causing a computer to execute the speech synthesis method according to claim 1.

A speech synthesis apparatus comprising:
means for selecting a segment;
means for determining whether or not to change the prosody of the selected segment;
means for calculating a prosody change target value for a segment for which it is determined, as a result of the determination, that the prosody is to be changed;
means for changing the prosody of the segment for which it is determined that the prosody is to be changed so that the prosody becomes the calculated prosody change target value; and
means for connecting the segment whose prosody has been changed and a segment for which it is determined, as a result of the determination, that the prosody is not to be changed.