JP2005352327A

JP2005352327A - Device and program for speech synthesis

Info

Publication number: JP2005352327A
Application number: JP2004174943A
Authority: JP
Inventors: Shigeaki Komatsu; 慈明小松; Akiko Yamato; 亜紀子大和
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2004-06-14
Filing date: 2004-06-14
Publication date: 2005-12-22

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesis device and program that can obtain a natural long sound more easily.

SOLUTION: Text to be synthesized is input (S1), language analysis is performed (S2), and the pitch and volume of each syllable of the analysis result are calculated using normal speech unit data (S3, S5). For each syllable it is then judged whether the syllable is a long sound (S9), whether it has an accent (S13), and whether it comes immediately before a pause (S17). When the syllable is not a long sound (NO at S9), the normal speech unit data are left as they are (S11). When the syllable is a long sound and has no accent (YES at S9, NO at S13), the last phoneme is extended using the normal speech unit data (S15). When the syllable is a long sound, has an accent, and is not before a pause (YES at S9, YES at S13, NO at S17), the normal speech unit data are replaced with first speech unit data for long sounds (S19). When the syllable is a long sound, has an accent, and is before a pause (YES at S9, YES at S13, YES at S17), the data are replaced with second speech unit data for long sounds (S21).

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a speech synthesizer.

Speech synthesizers that generate synthesized speech corresponding to input text, and speech synthesis programs that cause a computer to function as such a synthesizer, have been developed. One key to improving the quality of such devices and programs is how long sounds are processed. For example, in Patent Document 1, the input text is analyzed with reference to a language dictionary, and portions that may be lengthened are detected by checking the analysis result against a lengthening candidate list prepared in advance. For each portion that has a lengthening candidate, the lengthening/non-lengthening rules are consulted to determine whether lengthening or non-lengthening has been set, and lengthening is performed if it has been set. When lengthening is performed, the phoneme of the immediately preceding vowel is extended.

Patent Document 1: JP 2001-343987 A

However, extending the immediately preceding vowel to produce a long sound, as the conventional speech synthesizer does, can make the long sound unnatural, like a buzzer. An alternative is to repeat the immediately preceding vowel, but the result can sound discontinuous. To remove this unnaturalness, each combination of a preceding phoneme and a long sound could be stored in the database as a single phoneme, but storing every possible combination would require an enormous amount of storage.

The present invention has been made to solve the above problem, and its object is to provide a speech synthesizer and a speech synthesis program that obtain a more natural long sound with a simple configuration.

To achieve the above object, the speech synthesizer according to claim 1 of the present invention comprises: normal speech unit data storage means storing normal speech unit data corresponding to reading information of text; reading information generation means for analyzing input text using a dictionary and generating accented reading information; and speech synthesis means for outputting synthesized speech by applying the normal speech unit data stored in the normal speech unit data storage means to the reading information generated by the reading information generation means. The speech synthesizer is characterized by further comprising: first long-sound speech unit data storage means storing first speech unit data for long sounds corresponding to reading information of long sounds; long sound detection means for detecting a long sound in the reading information generated by the reading information generation means; accent determination means for determining whether the long sound detected by the long sound detection means includes an accent; and speech unit data changing means for applying, when the accent determination means determines that there is an accent, the first speech unit data for long sounds stored in the first long-sound speech unit data storage means in place of the normal speech unit data.

In addition to the configuration of claim 1, the speech synthesizer according to claim 2 of the present invention comprises second long-sound speech unit data storage means storing second speech unit data for long sounds corresponding to reading information of a long sound immediately followed by a pause, and pause determination means for determining whether a pause exists in the reading information immediately after the long sound detected by the long sound detection means. When the pause determination means determines that there is a pause, the speech unit data changing means applies the second speech unit data for long sounds stored in the second long-sound speech unit data storage means in place of the normal speech unit data. When the pause determination means determines that there is no pause, it applies the first speech unit data for long sounds stored in the first long-sound speech unit data storage means.

The speech synthesis program according to claim 3 of the present invention causes a computer to execute a reading information generation step of analyzing input text using a dictionary and generating accented reading information, and a speech synthesis step of outputting synthesized speech by applying, to the reading information generated in the reading information generation step, normal speech unit data stored in normal speech unit data storage means so as to correspond to reading information of text. The program is characterized by causing the computer to further execute: a long sound detection step of detecting a long sound in the reading information generated in the reading information generation step; an accent determination step of determining whether the long sound detected in the long sound detection step includes an accent; and a speech unit data changing step of applying, when it is determined in the accent determination step that there is an accent, first speech unit data for long sounds, stored in first long-sound speech unit data storage means so as to correspond to reading information of long sounds, in place of the normal speech unit data.

In addition to the configuration of claim 3, the speech synthesis program according to claim 4 of the present invention causes the computer to further execute a pause determination step of determining whether a pause exists in the reading information immediately after the long sound detected in the long sound detection step. In the speech unit data changing step, when it is determined in the pause determination step that there is a pause, second speech unit data for long sounds, stored in second long-sound speech unit data storage means so as to correspond to reading information of a long sound immediately followed by a pause, are applied in place of the normal speech unit data. When it is determined in the pause determination step that there is no pause, the first speech unit data for long sounds are applied.

According to the speech synthesizer of claim 1 of the present invention, two patterns of speech unit data for long sounds are prepared, one for accented and one for unaccented long sounds. When text to be synthesized is input, long sounds are first detected, each long sound is checked for an accent, and the speech unit data matching the accented or unaccented case are applied, so a more natural synthesized sound can be output.

According to the speech synthesizer of claim 2 of the present invention, in addition to the effect of the invention of claim 1, accented long sounds are further divided according to whether a pause follows immediately, so that three patterns of speech data for long sounds are prepared and an even more natural synthesized sound can be output.

According to the speech synthesis program of claim 3 of the present invention, two patterns of speech unit data for long sounds are prepared, one for accented and one for unaccented long sounds. When text to be synthesized is input, long sounds are first detected, each long sound is checked for an accent, and the speech unit data matching the accented or unaccented case are applied, so a more natural synthesized sound can be output.

According to the speech synthesis program of claim 4 of the present invention, in addition to the effect of the invention of claim 3, accented long sounds are further divided according to whether a pause follows immediately, so that three patterns of speech data for long sounds are prepared and an even more natural synthesized sound can be output.

Next, the best mode for carrying out the present invention is described with reference to the drawings. First, the configuration of a speech synthesizer 1 to which the present invention is applied is described with reference to FIG. 1. FIG. 1 is a block diagram showing the electrical configuration of the speech synthesizer 1.

As shown in FIG. 1, the speech synthesizer 1 is provided with a CPU 10 that controls the entire speech synthesizer 1. Connected to the CPU 10 via a bus are a ROM 12 storing the BIOS and the like, a RAM 13 used as a work area for the various arithmetic processes executed by the CPU 10, an external storage device 20 such as a hard disk, an audio unit 31 that converts the synthesized speech into an analog speech waveform signal and applies a predetermined amplification, and an operation panel 41 for inputting various operations. A timer 11 that generates an interrupt to the CPU 10 at predetermined intervals is also connected to the CPU 10, and a speaker 32 is connected to the audio unit 31.

The external storage device 20 stores a speech synthesis program 21 with which the speech synthesizer 1 performs the speech synthesis process, a language dictionary 22 for language analysis of the input text, and three kinds of speech unit data 23 to 25 that make up the synthesized speech. The language dictionary 22 stores the data needed for language analysis, such as morphemes, morpheme readings, parts of speech, accent types, and grammar. The three kinds of speech unit data are normal speech unit data 23, used for synthesizing normal sounds; first speech unit data for long sounds 24, used for accented long sounds; and second speech unit data for long sounds 25, used for accented long sounds immediately followed by a pause.

The normal speech unit data 23 holds speech units cut out from utterances of each Japanese phoneme, together with the frequency (pitch), duration, and power (volume) corresponding to each phoneme. Here, a phoneme is a unit of sound used in a given language (Japanese in this embodiment) and is the smallest unit that produces a difference in meaning. The normal speech unit data 23 contains no speech unit data specific to long sounds. Therefore, when speech is synthesized for a long sound using the normal speech unit data 23, the immediately preceding phoneme is stretched.
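The stretching described above can be illustrated with a minimal sketch. The patent does not give a data layout, so the `SpeechUnit` class, its field names, and the `extend_for_long_sound` helper with its `factor` parameter are all hypothetical; only the three stored properties (pitch, duration, power) come from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SpeechUnit:
    """One entry of the normal speech unit data (field names are assumptions)."""
    phoneme: str
    pitch_hz: float      # frequency (pitch)
    duration_ms: float   # duration
    power: float         # power (volume)


def extend_for_long_sound(unit: SpeechUnit, factor: float = 2.0) -> SpeechUnit:
    """Return a copy with a longer duration; pitch and power are unchanged.

    The factor of 2.0 is an arbitrary illustrative default, not a value
    taken from the patent.
    """
    if factor < 1.0:
        raise ValueError("factor must be >= 1.0")
    return replace(unit, duration_ms=unit.duration_ms * factor)


# Hypothetical unit for the vowel "u", stretched to form a long sound.
u = SpeechUnit(phoneme="u", pitch_hz=120.0, duration_ms=80.0, power=0.6)
long_u = extend_for_long_sound(u)
```

Because only the duration changes, the stretched unit keeps a flat pitch contour, which is one way to picture why a long sound produced this way can sound buzzer-like.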

The first speech unit data for long sounds 24 stores speech units cut out from collected utterances of long sounds that have the same phoneme as the immediately preceding phoneme and are not immediately followed by a pause. The second speech unit data for long sounds 25 stores speech units cut out from collected utterances of long sounds that have the same phoneme as the immediately preceding phoneme and are immediately followed by a pause. In Japanese accent, the pitch drops sharply from the accented syllable toward the next syllable. In natural human speech, however, even among long sounds that carry the same accent, the way the pitch falls differs depending on whether a pause follows immediately. Preparing two kinds of speech unit data for synthesizing long sounds, selected by the presence or absence of a pause, therefore makes an even more natural synthesized sound possible.

Next, the speech synthesis process performed by the speech synthesizer 1 configured as above is described with reference to FIGS. 2 and 3. FIG. 2 is a flowchart showing the flow of the speech synthesis process. FIG. 3 is an explanatory diagram showing an example of the reading string 100 to be processed by the speech synthesis process executed by the speech synthesizer 1.

First, the text to be synthesized is input (S1). The target text is normally input as a sentence mixing kanji and kana. It may be input from the operation panel 41 or from a connected keyboard. The speech synthesizer 1 may also be connected to a network and receive the text from outside, read a file stored in the external storage device 20, or read the text from a storage medium such as a flexible disk or CD-ROM; the input method is not limited. As an example, suppose the sentence 「ジョー、一週間ばかりニューヨークを取材したよ」 ("Joe, I spent about a week reporting in New York") is input.

Next, the target text input in S1 is morphologically analyzed by a well-known method using the language dictionary 22, and a language analysis process that assigns readings, accents, and pauses is executed (S2). As a result, the reading string 100 shown in FIG. 3 is output. In it, the reading is given as a katakana string, "|" marks a pause, "'" marks an accent, and ">" marks an accent delimiter. Although not shown, the language analysis process also outputs the syllable boundaries.
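To make the marker notation concrete, here is a small parsing sketch. It is not the patent's implementation: FIG. 3 is not reproduced in this text, so the sample string is invented, the markers are assumed to be the ASCII characters `|`, `'`, and `>`, and the syllable grouping rule (small kana, ッ, ン, and ー attach to the preceding kana) is an assumption standing in for the syllable boundaries that the language analysis outputs. The names `Syllable` and `parse_reading` are hypothetical.

```python
from dataclasses import dataclass

PAUSE, ACCENT, ACCENT_DELIMITER = "|", "'", ">"
# Kana assumed to continue the preceding syllable instead of starting a new one.
CONTINUERS = set("ーッンャュョァィゥェォ")


@dataclass
class Syllable:
    kana: str
    accented: bool = False
    before_pause: bool = False

    @property
    def is_long(self) -> bool:
        # Assumption: a long sound is written with the katakana mark "ー".
        return "ー" in self.kana


def parse_reading(reading: str) -> list[Syllable]:
    """Split a marked katakana reading into syllables with accent/pause flags."""
    syllables: list[Syllable] = []
    for ch in reading:
        if ch == ACCENT:
            if syllables:
                syllables[-1].accented = True
        elif ch == PAUSE:
            if syllables:
                syllables[-1].before_pause = True
        elif ch == ACCENT_DELIMITER:
            continue  # boundary only; it carries no per-syllable flag here
        elif ch in CONTINUERS and syllables and not syllables[-1].before_pause:
            syllables[-1].kana += ch
        else:
            syllables.append(Syllable(ch))
    return syllables


# Invented sample covering the start of the example sentence.
syllables = parse_reading("ジョ'ー|イッシューカン")
```

With this sample, the first syllable comes out as a long, accented syllable directly before a pause, which is the combination the later selection steps treat specially.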

Next, for each syllable of the reading string 100 output by S2, the pitch (S3) and the volume (S5) are calculated by a well-known method using the normal speech unit data 23, and the results are stored in a buffer in the RAM 13 (S7).

Next, for each syllable of the output result 100, it is determined whether the syllable is a long sound, and a speech unit data changing process that changes the speech unit data for long sounds is executed (S9 to S21). First, it is determined whether the syllable being processed is a long sound (S9). If it is not (S9: NO), the speech unit data used for that syllable need not be changed, so the normal speech unit data 23 is selected as it is (S11) and the process proceeds to S23.

If the syllable being processed is a long sound (S9: YES), it is further determined whether the syllable has an accent (S13). If it has no accent (S13: NO), the speech unit data for long sounds are not used; instead, the normal speech unit data 23 is used and the duration of the immediately preceding phoneme is extended to form the long sound (S15). The process then proceeds to S23.

If the syllable being processed is a long sound (S9: YES) and has an accent (S13: YES), it is further determined whether the syllable comes immediately before a pause (S17). If it does not (S17: NO), the speech unit data stored in the first speech unit data for long sounds 24 is selected for the long sound (S19), and the process proceeds to S23. As described above, the first speech unit data for long sounds 24 consists of speech units cut out from collected utterances of long sounds that have the same phoneme as the immediately preceding phoneme and are not followed by a pause, so using it produces a long sound that is natural to the ear.

If the syllable being processed is a long sound (S9: YES), has an accent (S13: YES), and comes immediately before a pause (S17: YES), the speech unit data stored in the second speech unit data for long sounds 25 is selected for the long sound (S21). As described above, the second speech unit data for long sounds 25 consists of speech units cut out from collected utterances of long sounds immediately followed by a pause, and its pitch falls differently from that of the first speech unit data for long sounds 24. Using the second speech unit data for long sounds 25 when a pause follows therefore produces a more natural long sound.
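The three judgments S9, S13, and S17 form a small decision tree, which can be sketched as follows. The step numbers and the four outcomes come from the description; the `UnitChoice` enum and the `choose_unit` function are hypothetical names introduced only for illustration.

```python
from enum import Enum


class UnitChoice(Enum):
    NORMAL = "normal speech unit data 23, unchanged (S11)"
    NORMAL_EXTENDED = "normal speech unit data 23, preceding phoneme extended (S15)"
    LONG_FIRST = "first speech unit data for long sounds 24 (S19)"
    LONG_SECOND = "second speech unit data for long sounds 25 (S21)"


def choose_unit(is_long: bool, accented: bool, before_pause: bool) -> UnitChoice:
    """Select the speech unit data for one syllable following S9, S13, S17."""
    if not is_long:          # S9: NO
        return UnitChoice.NORMAL
    if not accented:         # S13: NO
        return UnitChoice.NORMAL_EXTENDED
    if not before_pause:     # S17: NO
        return UnitChoice.LONG_FIRST
    return UnitChoice.LONG_SECOND  # S17: YES
```

Note that the pause check is reached only for accented long sounds, so an unaccented long sound before a pause still falls back to extending the normal unit.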

Because the processing of S9 to S21 has selected the speech unit data to use for the syllable being processed, the pitch and volume calculated in S3 and S5 are changed according to that selection (S23) and stored in the buffer of the RAM 13 (S25). It is then determined whether all syllables have been processed (S27). If they have (S27: YES), the process ends; the synthesized speech signal stored in the buffer is sent to the audio unit 31, converted into a speech waveform signal, and output from the speaker 32 as sound. If unprocessed syllables remain (S27: NO), S9 to S25 are repeated with the next syllable as the syllable to be processed.

Next, the above processing is described concretely with reference to the example shown in FIG. 3. The first syllable 101, 「ジョー」 (jō), is a long sound (S9: YES), has an accent (S13: YES), and is immediately followed by a pause (S17: YES). The normal speech data "jo" and the second speech unit data for long sounds 25 "o-~" are therefore selected (S21), the pitch and volume are changed using the second speech unit data for long sounds 25, which is the model for long sounds before a pause (S23), and the result is stored in the buffer (S25).

The next syllable, 「イッ」, is not a long sound (S9: NO), so the normal speech unit data is selected as it is (S11) and the pitch and volume are not changed (S23). The third syllable 102, 「シュー」 (shū), is a long sound (S9: YES) but has no accent (S13: NO). The duration of the vowel part "u" of the selected normal speech unit data "shu" is therefore extended (S15).

The fourth to sixth syllables, 「カン」, 「バ」, 「カ」, and 「リ」, are not long sounds (S9: NO), so the normal speech unit data is selected as it is (S11) and the pitch and volume are not changed (S23).

The seventh syllable 103, 「ニュー」 (nyū), is a long sound (S9: YES) but has no accent (S13: NO). The duration of the vowel part "u" of "nyu", the corresponding speech unit stored in the normal speech unit data 23, is therefore extended (S15), the pitch and volume are changed accordingly (S23), and the result is stored in the buffer (S25).

The eighth syllable 104, 「ヨー」 (yō), is a long sound (S9: YES) and has an accent (S13: YES), but no pause follows it (S17: NO). The first speech unit data for long sounds 24 "o-" is therefore selected (S19), the pitch and volume are changed using the first speech unit data for long sounds 24, which is the model for long sounds without a pause (S23), and the result is stored in the buffer (S25).

None of the syllables from the ninth to the last is a long sound (S9: NO), so the normal speech unit data is selected as it is (S11) and the pitch and volume are not changed (S23). At this point the speech unit data to use has been determined or changed for every syllable, the pitch and volume have been set, and the synthesized speech signal has been generated and stored in the buffer, so the process ends and the generated synthesized speech signal is sent to the audio unit 31. The audio unit 31 converts the synthesized speech signal into a speech waveform signal, which is output from the speaker 32 as sound.
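The walk-through above can be replayed as a short sketch of the per-syllable loop (S9 to S27). This is an illustration under stated assumptions, not the patent's code: the names `LONG_SOUND_CHOICE`, `select_unit`, `synthesis_plan`, and `EXAMPLE` are hypothetical, a long sound is detected simply by the mark "ー", and the accent flags given for non-long syllables are placeholders because S9 never consults them. Only the first nine syllables named in the walk-through are listed.

```python
# Outcome for syllables that ARE long sounds, keyed by (accented, before_pause).
LONG_SOUND_CHOICE = {
    (False, False): "23 extended (S15)",
    (False, True): "23 extended (S15)",
    (True, False): "24 (S19)",
    (True, True): "25 (S21)",
}


def select_unit(kana: str, accented: bool, before_pause: bool) -> str:
    """Return which speech unit data the syllable uses."""
    if "ー" not in kana:  # S9: NO (assumed long-sound test)
        return "23 (S11)"
    return LONG_SOUND_CHOICE[(accented, before_pause)]


def synthesis_plan(syllables):
    """Loop over all syllables (S27) and buffer each selection (S25)."""
    buffer = []
    for kana, accented, before_pause in syllables:
        buffer.append((kana, select_unit(kana, accented, before_pause)))
    return buffer


# (kana, accented, before_pause) for the syllables named in the walk-through.
EXAMPLE = [
    ("ジョー", True, True),
    ("イッ", False, False),
    ("シュー", False, False),
    ("カン", False, False),
    ("バ", False, False),
    ("カ", False, False),
    ("リ", False, False),
    ("ニュー", False, False),
    ("ヨー", True, False),
]

plan = synthesis_plan(EXAMPLE)
for kana, choice in plan:
    print(kana, "->", choice)
```

The sketch stops at selection; the pitch and volume adjustment of S23 and the waveform output through the audio unit 31 are outside its scope.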

As described above, different models are prepared in advance as speech unit data for long sounds, according to the presence or absence of an accent and of an immediately following pause. Long sounds and pauses are detected and the corresponding speech unit data are used, so synthesized speech with more natural long sounds can be output. That is, even among long sounds, in an accented syllable the pitch drops sharply across the accent, so a dedicated long-sound model joined to the preceding phoneme is unlikely to sound like a different sound and tends to sound more natural than stretching the immediately preceding phoneme with the normal speech unit data. In addition, the way the pitch falls differs depending on whether a pause follows immediately, so preparing separate long-sound models for the two cases yields an even more natural long sound.

In the above embodiment, the CPU 10 executing the language analysis process in S2 of the flowchart of FIG. 2 functions as the reading information generation means of the present invention, and the CPU 10 executing the pitch and volume calculations in S3 and S5 functions as the speech synthesis means. The CPU 10 determining in S9 whether a syllable is a long sound functions as the long sound detection means, the CPU 10 determining in S13 whether there is an accent functions as the accent determination means, and the CPU 10 determining in S17 whether there is a pause functions as the pause determination means. The CPU 10 selecting the first or second speech unit data for long sounds in S19 and S21 and changing the pitch and volume in S23 functions as the speech unit data changing means of the present invention.

FIG. 1 is a block diagram showing the electrical configuration of the speech synthesizer 1.
FIG. 2 is a flowchart showing the flow of the speech synthesis process.
FIG. 3 is an explanatory diagram showing an example of the reading string 100 to be processed by the speech synthesis process executed by the speech synthesizer 1.

Explanation of symbols

1 Speech synthesizer
10 CPU
13 RAM
20 External storage device
21 Speech synthesis program
22 Language dictionary
23 Normal speech unit data
24 First speech unit data for long sounds
25 Second speech unit data for long sounds

Claims

1. A speech synthesizer comprising:
normal speech unit data storage means for storing normal speech unit data corresponding to reading information of text;
reading information generation means for analyzing input text using a dictionary and generating accented reading information; and
speech synthesis means for outputting synthesized speech by applying the normal speech unit data stored in the normal speech unit data storage means to the reading information generated by the reading information generation means,
the speech synthesizer being characterized by further comprising:
first long-sound speech unit data storage means for storing first speech unit data for long sounds corresponding to reading information of long sounds;
long sound detection means for detecting a long sound in the reading information generated by the reading information generation means;
accent determination means for determining whether the long sound detected by the long sound detection means includes an accent; and
speech unit data changing means for applying, when the accent determination means determines that there is an accent, the first speech unit data for long sounds stored in the first long-sound speech unit data storage means in place of the normal speech unit data.

2. The speech synthesizer according to claim 1, further comprising:
second long-sound speech unit data storage means for storing second speech unit data for long sounds corresponding to reading information of a long sound immediately followed by a pause; and
pause determination means for determining whether a pause exists in the reading information immediately after the long sound detected by the long sound detection means,
wherein the speech unit data changing means applies, in place of the normal speech unit data, the second speech unit data for long sounds stored in the second long-sound speech unit data storage means when the pause determination means determines that there is a pause, and applies the first speech unit data for long sounds stored in the first long-sound speech unit data storage means when the pause determination means determines that there is no pause.

3. A speech synthesis program for causing a computer to execute:
a reading information generation step of analyzing input text using a dictionary and generating accented reading information; and
a speech synthesis step of outputting synthesized speech by applying, to the reading information generated in the reading information generation step, normal speech unit data stored in normal speech unit data storage means so as to correspond to reading information of text,
the program being characterized by causing the computer to further execute:
a long sound detection step of detecting a long sound in the reading information generated in the reading information generation step;
an accent determination step of determining whether the long sound detected in the long sound detection step includes an accent; and
a speech unit data changing step of applying, when it is determined in the accent determination step that there is an accent, first speech unit data for long sounds, stored in first long-sound speech unit data storage means so as to correspond to reading information of long sounds, in place of the normal speech unit data.

4. The speech synthesis program according to claim 3, causing the computer to further execute a pause determination step of determining whether a pause exists in the reading information immediately after the long sound detected in the long sound detection step,
wherein, in the speech unit data changing step, when it is determined in the pause determination step that there is a pause, second speech unit data for long sounds, stored in second long-sound speech unit data storage means so as to correspond to reading information of a long sound immediately followed by a pause, are applied in place of the normal speech unit data, and when it is determined in the pause determination step that there is no pause, the first speech unit data for long sounds are applied.