JP2006337468A

JP2006337468A - Device and program for speech synthesis

Info

Publication number: JP2006337468A
Application number: JP2005159003A
Authority: JP
Inventors: Takashi Ito; 孝伊藤
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2005-05-31
Filing date: 2005-05-31
Publication date: 2006-12-14

Abstract

PROBLEM TO BE SOLVED: To provide a device and a program for speech synthesis that synthesize speech in a plurality of voices. SOLUTION: In full-text selection mode, "all together", "repetition", or "troll" is specified, and the kinds of speech are specified. Alternatively, a kind of speech is specified for each accent phrase in accent phrase mode. In the "all together" mode, a plurality of speeches are output simultaneously, producing the effect of a plurality of persons reading a text aloud together. In the "repetition" mode, each accent phrase is output in the leading speech kind and then in the speech kind specified for "repetition", producing the effect of each accent phrase spoken by the leading speech being repeated by the "repetition" speech. In the "troll" mode, the specified speech kinds start their output in the specified order, each after the first accent phrase of the preceding speech kind has been output, producing the effect of a plurality of persons reading the text aloud one after another, as in a troll (round). In the accent phrase mode, a different number of persons read the text aloud in a different kind of speech for each accent phrase. COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a speech synthesizer and a speech synthesis program, and more particularly to a speech synthesizer and a speech synthesis program that synthesize speech in a plurality of voices.

Conventionally, speech synthesis techniques have been proposed that apply various effects to the output speech so that more effective speech can be output for the intended purpose. For example, the game device of the invention described in Patent Document 1 leaves a lingering tone by outputting the sound corresponding to the last character once more at low volume, and produces an echo effect, adding continuity and reverberation to the output speech, by mixing in, for the pronunciation of each character, copies of the same waveform at successively lower volume, each delayed by a fixed time. In the text-to-speech device of the invention described in Patent Document 2, when the user makes a "repeat designation", the important parts of the output speech (the parts containing numerals) are repeated.

As one such effect, a plurality of voices are synthesized together. In the singing voice synthesizer of the invention described in Patent Document 3, when singing voices divided into a plurality of parts are to be sung as a chorus, the singing voice signals, which are the synthesized waveforms of the individual parts generated by a singing voice signal generation unit, are added together by a chorus signal generation unit to create a chorus signal, which is converted into an analog signal and output as singing. In the singing voice synthesizer of the invention described in Patent Document 4, as in the singing voice synthesizer of the invention described in Patent Document 1, the voices generated by the singing voice synthesis unit for each part are combined into a chorus by a chorus generation unit; the voices of different parts are output from separate output units; or the singing voice of a part synthesized earlier is stored temporarily in an external storage device and, when the singing voice of another part is synthesized later, is retrieved and combined with it for output as a chorus.

Japanese Patent No. 3252896; JP-A-5-197384; Japanese Patent No. 3333022; Japanese Patent No. 3514263

However, the game device of the invention described in Patent Document 1 merely uses a single type of voice to give the speech a lingering tone or an echo effect, and the text-to-speech device of the invention described in Patent Document 2 merely repeats in the same voice as the speech first output; neither produces effects using a plurality of voice types. In the speech synthesizers described in Patent Documents 3 and 4, when the voices of all parts are generated first and then combined and output, the more parts there are, the longer it takes to generate all the voices, so a long time elapses between the start of generation and the start of output. Furthermore, outputting the voices of different parts from separate output units, as in the speech synthesizer described in Patent Document 4, requires as many output units as there are voice types.

The present invention has been made to solve the above problems, and an object thereof is to provide a speech synthesizer and a speech synthesis program that synthesize speech in a plurality of voices.

To solve the above problems, the speech synthesizer of the invention according to claim 1 comprises: acoustic dictionary storage means for storing, for each of a plurality of voice types, an acoustic dictionary that is a set of acoustic models including at least a phoneme model created from phoneme data obtained by analyzing speech into an acoustic parameter sequence and a prosody model created from fundamental frequency data obtained by analyzing speech; sentence input means for inputting a sentence for which speech is to be generated; language analysis means for decomposing the sentence input by the sentence input means into words and determining their parts of speech, determining for each accent phrase an accent type indicating its accent position, and determining the reading of the sentence; analysis result storage means for storing the analysis result produced by the language analysis means; voice type designation means for designating one or more of the voice types of the acoustic dictionaries stored in the acoustic dictionary storage means; acoustic model selection means for selecting acoustic models from the acoustic dictionaries based on the analysis result stored in the analysis result storage means and the voice types designated by the voice type designation means; and speech generation means for generating speech based on the acoustic models selected by the acoustic model selection means.

In the speech synthesizer of the invention according to claim 2, in addition to the configuration of the invention according to claim 1, the speech generation means generates speech for each accent phrase, word, morpheme, or character based on the acoustic models selected by the acoustic model selection means.

In the speech synthesizer of the invention according to claim 3, in addition to the configuration of the invention according to claim 1 or 2, the voice type designation means comprises whole-sentence voice type designation means for designating one or more voice types for the whole of the sentence for which speech is to be generated.

In the speech synthesizer of the invention according to claim 4, in addition to the configuration of the invention according to any one of claims 1 to 3, the voice type designation means comprises per-accent-phrase voice designation means for designating one or more voice types for each accent phrase in the sentence for which speech is to be generated.

In the speech synthesizer of the invention according to claim 5, in addition to the configuration of the invention according to any one of claims 1 to 4, the voice type designation means comprises per-part voice type designation means for designating the voice type of each of two parts, a first part and a second part, and the speech generation means comprises repeat speech generation means for synthesizing speech such that, for each predetermined block of the sentence, the speech of the voice type designated for the second part is output after the speech of the voice type designated for the first part has been output, so that the second part repeats the first part.

In the speech synthesizer of the invention according to claim 6, in addition to the configuration of the invention according to any one of claims 1 to 5, the voice type designation means comprises multi-part voice type designation means for designating the voice type of each of a plurality of parts, and order designation means for designating the order in which the parts output speech. The speech generation means comprises round speech generation means for synthesizing speech such that the parts sing a round in the order designated by the order designation means: the part designated first begins by outputting the speech of the predetermined blocks of the sentence, and each second or later part begins its output when the part one place ahead of it has finished outputting its first block.

In the speech synthesizer of the invention according to claim 7, in addition to the configuration of the invention according to claim 5 or 6, the block is at least one of a sentence delimited by a predetermined symbol, an accent phrase, a word, or a character.

In the speech synthesizer of the invention according to claim 8, in addition to the configuration of the invention according to any one of claims 3 to 7, the voice type designation means comprises designation method selection means for selecting one of the per-accent-phrase voice designation means, the whole-sentence voice type designation means, the per-part voice type designation means, and the multi-part voice type designation means.

The speech synthesis program of the invention according to claim 9 causes a computer to function as the various processing means of the speech synthesizer according to any one of claims 1 to 8.

In the speech synthesizer of the invention according to claim 1, the acoustic dictionary storage means stores, for each of a plurality of voice types, an acoustic dictionary that is a set of acoustic models including at least a phoneme model created from phoneme data obtained by analyzing speech into an acoustic parameter sequence and a prosody model created from fundamental frequency data obtained by analyzing speech; the sentence input means inputs a sentence for which speech is to be generated; the language analysis means decomposes the input sentence into words and determines their parts of speech, determines for each accent phrase an accent type indicating its accent position, and determines the reading of the sentence; the analysis result storage means stores the analysis result; the voice type designation means designates one or more of the voice types of the stored acoustic dictionaries; the acoustic model selection means selects acoustic models from the acoustic dictionaries based on the stored analysis result and the designated voice types; and the speech generation means generates speech based on the selected acoustic models. Because the voice types to be output can be designated, a wide variety of speech can be output by combining voice types.

In the speech synthesizer of the invention according to claim 2, in addition to the effects of the invention according to claim 1, the speech generation means can generate speech for each accent phrase, word, morpheme, or character based on the acoustic models selected by the acoustic model selection means. Because output does not wait until speech for the entire text has been generated, the time from the start of synthesis to the start of output is short, and speech is output smoothly without delay.
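The latency benefit of generating speech unit by unit can be illustrated with a minimal Python sketch. It is not the patent's implementation: `synthesize_stream` and the stand-in `fake_synth` are hypothetical names, and the stand-in only records the order of calls and returns dummy samples. The point is that the first unit can be handed to the output stage before any later unit has been generated.

```python
from typing import Callable, Iterable, Iterator, List


def synthesize_stream(
    units: Iterable[str],
    synthesize_unit: Callable[[str], List[float]],
) -> Iterator[List[float]]:
    """Yield audio one unit (for example, one accent phrase) at a time."""
    for unit in units:
        yield synthesize_unit(unit)


# Hypothetical stand-in synthesizer: records call order, returns dummy samples.
calls: List[str] = []


def fake_synth(unit: str) -> List[float]:
    calls.append(unit)
    return [0.0] * len(unit)


stream = synthesize_stream(["kyou wa", "ii", "tenki desu"], fake_synth)
first_audio = next(stream)               # output could begin at this point
calls_before_first_output = list(calls)  # only the first unit has been generated
remaining_audio = list(stream)           # later units are generated afterwards
```

A batch design would call the synthesizer for every unit before returning anything, so the delay before the first sound would grow with the length of the text.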

In the speech synthesizer of the invention according to claim 3, in addition to the effects of the invention according to claim 1 or 2, the whole-sentence voice type designation means of the voice type designation means can designate one or more voice types for the whole of the sentence for which speech is to be generated. This produces the effect of a plurality of people speaking the whole sentence together.

In the speech synthesizer of the invention according to claim 4, in addition to the effects of the invention according to any one of claims 1 to 3, the per-accent-phrase voice designation means of the voice type designation means can designate one or more voice types for each accent phrase in the sentence. Only part of the sentence can therefore be made to sound as if spoken by a plurality of people, allowing a variety of effects.

In the speech synthesizer of the invention according to claim 5, in addition to the effects of the invention according to any one of claims 1 to 4, the per-part voice type designation means of the voice type designation means can designate the voice type of each of two parts, a first part and a second part. The repeat speech generation means of the speech generation means can synthesize speech such that, for each predetermined block of the sentence, the speech of the voice type designated for the second part is output after the speech of the voice type designated for the first part has been output, so that the second part repeats the first part. Repetition can therefore be performed in a voice type of the user's choice.

In the speech synthesizer of the invention according to claim 6, in addition to the effects of the invention according to any one of claims 1 to 5, the multi-part voice type designation means of the voice type designation means can designate the voice type of each of a plurality of parts, and the order designation means can designate the order in which the parts output speech. The round speech generation means of the speech generation means can synthesize speech such that the parts sing a round in the designated order: the part designated first begins by outputting the speech of the predetermined blocks of the sentence, and each second or later part begins its output when the part one place ahead of it has finished outputting its first block. Speech in the desired voice types can therefore be output in the desired order, allowing a variety of effects.

In the speech synthesizer of the invention according to claim 7, in addition to the effects of the invention according to claim 5 or 6, the block can be at least one of a sentence delimited by a predetermined symbol, an accent phrase, a word, or a character, so speech can be output in short cycles, smoothly and without delay.

In the speech synthesizer of the invention according to claim 8, in addition to the effects of the invention according to any one of claims 3 to 7, the designation method selection means of the voice type designation means can select one of the per-accent-phrase voice designation means, the whole-sentence voice type designation means, the per-part voice type designation means, and the multi-part voice type designation means, so a plurality of voice types can be output with a variety of effects.

The speech synthesis program of the invention according to claim 9 can cause a computer to function as the various processing means of the speech synthesizer according to any one of claims 1 to 8.

Embodiments of the present invention will now be described with reference to the drawings. In the speech synthesizer and speech synthesis program of the present invention, for the text to be output as speech, "simultaneous" (一斉), "repeat" (復唱), or "round" (輪唱) can be designated in the "full-text selection mode", together with the voice types to be output. Alternatively, the voice types to be output can be designated for each accent phrase in the "accent phrase mode".

With "simultaneous" in the full-text selection mode, if a plurality of voice types are designated, the voices are output at the same time, so that the speech sounds as if several people were reading the text aloud together. If only one voice type is designated, the speech sounds as if one person were reading the text in that voice. With "repeat" in the full-text selection mode, each accent phrase is output in the voice type designated as the leading voice (hereinafter the "lead voice type") and then in the voice type designated as the repeating voice (hereinafter the "repeat voice type"), so that the repeat voice repeats, accent phrase by accent phrase, the text output in the lead voice. If a plurality of repeat voice types are designated, they are output at the same time. With "round" in the full-text selection mode, the designated voice types begin output in the designated order, each starting after the preceding voice type has finished outputting its first accent phrase, so that, as in a round in choral singing, the speech sounds as if several people were reading aloud one after another.
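The ordering of the three full-text selection modes can be sketched in Python as follows. This is an illustrative model only, under two simplifying assumptions that are not in the source: every accent phrase occupies one time slot of equal length, and the first entry of `voices` is the lead voice in "repeat" and the first voice in the round order. The function name `schedule` and the mode strings are hypothetical.

```python
from typing import List, Tuple

Slot = List[Tuple[str, str]]  # (voice type, accent phrase) pairs sounded together


def schedule(mode: str, phrases: List[str], voices: List[str]) -> List[Slot]:
    """Return time slots; every pair in one slot is output at the same time."""
    if mode == "simultaneous":
        # All designated voices speak each accent phrase together.
        return [[(v, p) for v in voices] for p in phrases]
    if mode == "repeat":
        # The lead voice speaks a phrase, then all repeat voices speak it together.
        lead, repeaters = voices[0], voices[1:]
        slots: List[Slot] = []
        for p in phrases:
            slots.append([(lead, p)])
            if repeaters:
                slots.append([(r, p) for r in repeaters])
        return slots
    if mode == "round":
        # Voice k starts once voice k-1 has finished its first accent phrase.
        total = len(phrases) + len(voices) - 1
        return [
            [(v, phrases[t - k]) for k, v in enumerate(voices)
             if 0 <= t - k < len(phrases)]
            for t in range(total)
        ]
    raise ValueError(f"unknown mode: {mode}")


round_slots = schedule("round", ["A", "B", "C"], ["female", "girl"])
repeat_slots = schedule("repeat", ["A", "B"], ["male", "female", "anime"])
```

In `round_slots` the second voice enters one slot late and finishes one slot after the first voice, which is the staggered reading the "round" setting describes.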

In the "accent phrase mode", speech is output in the voice types designated for each accent phrase, so that the speech sounds as if a different person were reading each accent phrase, or as if the number of readers changed from one accent phrase to the next. In the speech synthesizer and speech synthesis program of this embodiment, when a selection has been made in the full-text selection mode, control is performed so that voice types cannot be designated in the accent phrase mode.

First, the mobile phone 1, which is the speech synthesizer of this embodiment, will be described with reference to FIG. 1. FIG. 1 is an external view of the mobile phone 1, and FIG. 2 is a block diagram showing the electrical configuration of the mobile phone 1. FIG. 3 is a schematic diagram showing the configuration of the RAM 22, FIG. 4 is a schematic diagram showing the configuration of the simultaneous information storage area 224 of the RAM 22, FIG. 5 is a schematic diagram showing the configuration of the repeat information storage area 225 of the RAM 22, FIG. 6 is a schematic diagram showing the configuration of the round information storage area 226 of the RAM 22, and FIG. 7 is a schematic diagram showing the configuration of the accent phrase mode information storage area 227 of the RAM 22.

As shown in FIG. 1, the mobile phone 1 is provided with a display screen 2, a numeric keypad input unit 3, a multi-button 4 having four direction buttons and an enter button, a call start button 5, a call end button 6, a microphone 7, a speaker 8, function selection buttons 9 and 10, and an antenna 12 (see FIG. 2). The numeric keypad input unit 3, the multi-button 4, the call start button 5, the call end button 6, and the function selection buttons 9 and 10 constitute a key input unit 38 (see FIG. 2).

As shown in FIG. 2, the mobile phone 1 is also provided with an analog front end 36 that amplifies the audio signal from the microphone 7 and the audio to be output from the speaker 8; an audio codec unit 35 that digitizes the audio signal amplified by the analog front end 36 and converts the digital signal received from the modem unit 34 into an analog signal so that it can be amplified by the analog front end 36; the modem unit 34, which performs modulation and demodulation; and a transmission/reception unit 33 that amplifies and detects radio waves received from the antenna 12, and that modulates a carrier signal with the signal received from the modem unit 34 and amplifies it.

The mobile phone 1 is further provided with a control unit 20 that controls the mobile phone 1 as a whole. The control unit 20 incorporates a CPU 21, a RAM 22 that temporarily stores data, and a clock function unit 23. Connected to the control unit 20 are the key input unit 38 for entering characters and the like, the display screen 2, a nonvolatile memory 30 storing programs and the acoustic dictionaries of the various voice types, and a melody generator 32 that generates ring tones. A speaker 37 that sounds the ring tone generated by the melody generator 32 is connected to the melody generator 32.

This embodiment uses five voice types, "male", "female", "boy", "girl", and "anime": male is the first voice type, female the second, boy the third, girl the fourth, and anime the fifth. The nonvolatile memory 30 stores five acoustic dictionaries for generating speech in these five voice types.

As shown in FIG. 3, the RAM 22 is provided with various storage areas for the variables and generated data used in the speech synthesis processing. For example, the text storage area 221 stores the text for which speech synthesis has been instructed. The analysis result storage area 222 stores the results of the morphological analysis processing and the accent phrase formation processing. The parameter information storage area 223 stores the settings designated in the full-text selection mode and the accent phrase mode. Specifically, when the full-text selection mode is set, "1" is set if "simultaneous" is selected, "2" if "repeat" is selected, and "3" if "round" is selected; when a voice type has been set for each accent phrase in the accent phrase mode, "9" is set. The initial value is "0".
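The values held in the parameter information storage area 223 can be written down as a small Python enumeration. The numeric codes come from the description above; the names `ModeParameter` and `is_full_text_mode` are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class ModeParameter(IntEnum):
    """Values held in the parameter information storage area 223."""
    UNSET = 0          # initial value
    SIMULTANEOUS = 1   # full-text selection mode, "simultaneous"
    REPEAT = 2         # full-text selection mode, "repeat"
    ROUND = 3          # full-text selection mode, "round"
    ACCENT_PHRASE = 9  # voice types set per accent phrase


def is_full_text_mode(value: int) -> bool:
    """True when the stored value denotes a full-text selection mode setting."""
    return ModeParameter(value) in (
        ModeParameter.SIMULTANEOUS,
        ModeParameter.REPEAT,
        ModeParameter.ROUND,
    )
```

A check of this kind is one way the per-accent-phrase designation could be blocked while a full-text selection is active; how the embodiment actually enforces that restriction is not shown in this section.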

The simultaneous information storage area 224 is used when "simultaneous" is selected in the full-text selection mode, the repeat information storage area 225 is used when "repeat" is selected in the full-text selection mode, and the round information storage area 226 is used when "round" is selected in the full-text selection mode. The accent phrase mode information storage area 227 is used when the accent phrase mode is designated. The synthesized speech storage area 228 stores the result of adding together the voice data of a plurality of voices.

As shown in FIG. 4, the simultaneous information storage area 224 of the RAM 22 has a setting field and a voice data field for each of the first to fifth voice types. Each setting field stores whether that voice type is set as a voice to be output: in this embodiment, "1" if it is set and "0" if it is not. Each voice data field stores the voice data (sound source signal) generated for that voice type. The voice data stored here is combined and output as synthesized speech. "m" is a variable used to count voice types in the simultaneous processing described later (see FIG. 13).
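The layout of the simultaneous information storage area 224 and the combining step can be sketched in Python as below. The names `VoiceEntry` and `mix_enabled` are hypothetical. Mixing by sample-wise addition follows the description of the synthesized speech storage area 228 as holding added voice data; the use of float samples, zero-padding of shorter signals, and the absence of clipping or normalization are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VoiceEntry:
    enabled: int = 0  # setting field: 1 = output this voice type, 0 = do not
    samples: List[float] = field(default_factory=list)  # voice data field


def mix_enabled(entries: List[VoiceEntry]) -> List[float]:
    """Add the voice data of all enabled entries sample by sample."""
    active = [e.samples for e in entries if e.enabled == 1]
    length = max((len(s) for s in active), default=0)
    mixed = [0.0] * length
    for samples in active:
        for i, value in enumerate(samples):
            mixed[i] += value
    return mixed


# Five entries, one per voice type (m = 1..5); "male" and "girl" are enabled here.
area_224 = [VoiceEntry() for _ in range(5)]
area_224[0] = VoiceEntry(1, [0.5, 0.25, 0.0])
area_224[1] = VoiceEntry(0, [9.0, 9.0, 9.0])  # data present but not enabled
area_224[3] = VoiceEntry(1, [0.25, 0.25])
mixed = mix_enabled(area_224)
```

Entries whose setting field is 0 contribute nothing, so a single enabled voice type passes through unchanged, matching the one-voice case of "simultaneous".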

As shown in FIG. 5, the repeat information storage area 225 of the RAM 22 has a setting field and a voice data field for each of the lead voice type and the first, second, third, and fourth repeat voice types, together with a repeater count field. Each setting field holds the number indicating a voice type, and the repeater count field holds the number of voice types designated as repeat voice types. For example, if the lead voice type is "male" and the repeat voice types are "female" and "anime", "1" is set in the lead voice type setting field, "2" in the first repeat voice type setting field, "5" in the second, and "2" in the repeater count field. Each voice data field stores the voice data (sound source signal) generated for that voice type. The voice data stored here is combined and output as synthesized speech. "p" is a variable used to count repeat voice types in the repeat processing described later (see FIG. 14).
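The worked example for the repeat information storage area 225 can be reproduced with a short Python sketch. The voice type numbers follow the embodiment (male 1, female 2, boy 3, girl 4, anime 5); the names `RepeatInfo` and `build_repeat_info` are hypothetical, and the voice data fields are left out to keep the sketch small.

```python
from dataclasses import dataclass
from typing import List

VOICE_NUMBERS = {"male": 1, "female": 2, "boy": 3, "girl": 4, "anime": 5}
MAX_REPEATERS = 4  # the area provides four repeat voice type entries


@dataclass
class RepeatInfo:
    lead_setting: int           # number of the lead voice type
    repeat_settings: List[int]  # numbers of the repeat voice types, in order
    repeater_count: int         # how many repeat voice types are designated


def build_repeat_info(lead: str, repeaters: List[str]) -> RepeatInfo:
    """Fill the setting fields and the repeater count from voice type names."""
    if len(repeaters) > MAX_REPEATERS:
        raise ValueError("at most four repeat voice types fit in the area")
    return RepeatInfo(
        lead_setting=VOICE_NUMBERS[lead],
        repeat_settings=[VOICE_NUMBERS[r] for r in repeaters],
        repeater_count=len(repeaters),
    )


info = build_repeat_info("male", ["female", "anime"])
```

With a "male" lead and "female" and "anime" repeaters, the sketch yields the values given in the example: 1 for the lead, 2 and 5 for the repeaters, and a repeater count of 2.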

As shown in FIG. 6, the troll information storage area 226 of the RAM 22 has a voice type field and a per-morpheme voice data field for each position in the troll order, together with a number-of-singers field and a field for the number of morphemes in the first accent phrase. Each voice type field holds the number identifying the voice type output at that position, the number-of-singers field holds the number of voice types designated for the troll, and the field for the number of morphemes in the first accent phrase holds the value calculated in the troll process (see FIG. 15) described later. For example, when the troll order is designated as “female” first, “girl” second, and “anime” third, “2” is set in the first voice type field, “4” in the second, “5” in the third, and “3” in the number-of-singers field. The per-morpheme voice data field holds the voice data (sound source signal) of each morpheme for each voice type. Because the present embodiment uses five voice types, five morpheme storage areas are provided. Writing a voice data field as (order, morpheme), the voice data generated for one morpheme is stored in (q, q + (number of morphemes in the first accent phrase) × (q − 1)). Here “q” is a variable for counting the troll order in the troll process (see FIG. 15) described later. When speech is output, the voice data stored in (1, 1), (2, 1), (3, 1), (4, 1), and (5, 1) is output.

As shown in FIG. 7, the accent phrase mode information storage area 227 of the RAM 22 has, for each voice type, a per-accent-phrase setting field and a per-accent-phrase voice data field. The per-accent-phrase setting field can hold, for each voice type, as many settings as there are accent phrases in the text to be synthesized; each setting records whether the voice of that voice type is output for that accent phrase. In the present embodiment, “1” is stored when output is set and “0” when it is not. Likewise, the per-accent-phrase voice data field can hold, for each voice type, as many items of voice data (sound source signals) as there are accent phrases, and stores the voice data generated for the voice types set for output. Note that “m” is a variable used to count voice types and “n” is a variable used to count accent phrases in the accent phrase process (see FIG. 16) described later.

Next, with reference to FIGS. 8 to 11, the screen for entering the text to be output as speech and the screens displayed when selecting the “full-text selection mode” and the “accent phrase mode” will be described. FIG. 8 shows the main screen 290, FIG. 9 shows the voice output screen 200, FIG. 10 shows the full-text selection mode screen 210, and FIG. 11 shows the accent phrase mode screen 230.

The main screen 290 shown in FIG. 8 is displayed when the mobile phone 1 is operated to start the speech synthesis program. It has a text input field 291 for entering the text to be output as speech, a reference button 292, an OK button, and a cancel button. When the text input field 291 is selected, text can be entered by operating the key input unit 38. When the reference button 292 is selected, the text to be entered in the text input field 291 can be chosen from the text of mail stored in the nonvolatile memory 30 of the mobile phone 1 or from text on a page displayed by connecting to the Internet. When the OK button is selected, the text in the text input field 291 is stored in the text storage area 221 of the RAM 22 as the text to be output as speech. When the cancel button is selected, the speech synthesis process ends.

The voice output screen 200 shown in FIG. 9 is displayed when the OK button is selected on the main screen 290, and is used to select the “full-text selection mode” and the “accent phrase mode”. It has a text display field 201, an output button, and a cancel button. The text display field 201 shows the text that was entered in the text input field 291 of the main screen 290 and stored in the text storage area 221 as the text to be synthesized, with the mode selection tags “◇” and “▽” inserted. While the text display field 201 is selected, a cursor 202 is displayed; moving the cursor onto a tag with the four direction buttons of the multi-button 4 and pressing the select button displays the screen for the mode corresponding to that tag. The ◇ tag is the setting tag for the “full-text selection mode”, and the ▽ tag is the setting tag for the “accent phrase mode”.

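In the example shown in FIG. 9, the text display field 201 displays “◇▽運動会の▽思い出。▽手作りの▽応援用ハッピを▽着て、▽一所懸命▽踊った事。” (roughly, “Memories of the sports day: wearing a handmade happi coat for cheering and dancing with all my might.”), with a ◇ tag at the head of the text and a ▽ tag in front of each phrase.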

When the ◇ tag is selected, the full-text selection mode screen 210 shown in FIG. 10 is displayed. It has a mode selection field 211, a voice type selection field 212, an order designation field 213, an OK button, and a cancel button. The mode selection field 211 consists of radio buttons, so that one of “all together”, “repetition”, and “troll” can be selected. The voice type selection field 212 consists of check boxes, so that one or more voice types can be selected from “male”, “female”, “boy”, “girl”, and “anime”. Numerical values can be entered in the order designation field 213. In “repetition”, the voice type for which “1” is entered becomes the leading voice type; in “troll”, the troll is performed in the order of the numbers entered here. In “all together”, values entered in the order designation field 213 are not used; in “repetition” and “troll”, values entered for voice types that are not selected in the voice type selection field 212 are not used. When the OK button is selected, the settings are stored in the predetermined storage area of the RAM 22 according to the selected mode, and the display returns to the voice output screen 200. When the cancel button is selected, the entered settings are not stored and the display returns to the voice output screen 200.
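The way the check boxes and the order numbers combine into stored settings can be sketched as follows. This is only an illustration: the function names, the dictionary layout, and the rule that repeating voice types are listed in ascending voice-type number are assumptions, and the numbering 1 to 5 for “male”, “female”, “boy”, “girl”, “anime” is inferred from the examples given for FIGS. 5 and 6.

```python
# Voice-type numbering inferred from the examples in the description (assumption for 3 = boy).
VOICE_TYPES = {1: "male", 2: "female", 3: "boy", 4: "girl", 5: "anime"}


def repetition_settings(checked, order):
    """Build hypothetical "repetition" settings.

    checked: set of voice-type numbers ticked in the voice type selection field.
    order:   dict mapping voice-type number -> number typed in the order field.
    """
    # Numbers typed for voice types that are not ticked are ignored.
    usable = {v: n for v, n in order.items() if v in checked}
    leaders = [v for v, n in usable.items() if n == 1]
    if len(leaders) != 1:
        raise ValueError("exactly one ticked voice type must have order 1")
    leader = leaders[0]
    # Assumption: the remaining ticked voice types repeat, listed by voice-type number.
    repeaters = sorted(v for v in checked if v != leader)
    return {"leader": leader, "repeaters": repeaters, "count": len(repeaters)}


def troll_settings(checked, order):
    """Build hypothetical "troll" settings: ticked voice types sorted by order number."""
    usable = {v: n for v, n in order.items() if v in checked}
    sequence = sorted(usable, key=usable.get)
    return {"sequence": sequence, "count": len(sequence)}
```

With “male” as the leader and “female” and “anime” ticked, `repetition_settings({1, 2, 5}, {1: 1, 2: 2, 5: 3})` yields leader 1, repeaters 2 and 5, and a count of 2, which matches the FIG. 5 example.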

When a ▽ tag is selected on the voice output screen 200, the accent phrase mode screen 230 shown in FIG. 11 is displayed. It has a voice type selection field 231, an OK button, and a cancel button. The voice type selection field 231 consists of check boxes, so that one or more voice types can be selected from “male”, “female”, “boy”, “girl”, and “anime”. When the OK button is selected, the settings are stored in the accent phrase mode information storage area 227 of the RAM 22 according to the entered selections, and the display returns to the voice output screen 200. When the cancel button is selected, the entered settings are not stored and the display returns to the voice output screen 200.

Next, the speech synthesis process will be described with reference to the flowcharts of FIGS. 12 to 16. FIG. 12 is a flowchart of the main process of the speech synthesis program in the present embodiment. FIGS. 13, 14, 15, and 16 are flowcharts of the all-together process, the repetition process, the troll process, and the accent phrase process, respectively, each of which is performed within the main process.

The main process shown in FIG. 12 starts when an instruction to perform speech synthesis is given on the mobile phone 1. First, initial processing such as clearing the various storage areas is performed (S1). Then the main screen 290 of the speech synthesis process is displayed, and once the text to be output as speech has been acquired (S2), the morphological analysis process (S3) and the accent phrase formation process (S4) are performed.

In S2, the main screen 290 is displayed and the text to be output as speech is entered in the text input field 291, either by typing characters with the key input unit 38 or by inserting the text of mail stored in the nonvolatile memory 30 of the mobile phone 1 or text on a page displayed by connecting to the Internet. When the OK button is selected on the main screen 290, the text is stored in the text storage area 221 of the RAM 22, and the output text is thereby acquired.

In the morphological analysis process, a language dictionary (not shown) holding part-of-speech information, reading information, connection information, accent information, and the like is referenced, morphological analysis is performed by the well-known longest-match method, and the text stored in the text storage area 221 is analyzed into morphemes (parts of speech). In the accent phrase formation process, the connection information of the language dictionary is referenced and the morphemes are grouped into accent phrases. The accent position is also determined from the accent information of the language dictionary. For words whose accent position moves when they are combined into a compound word, the accent position is changed accordingly. Finally, the reading information is referenced and the character string is replaced with a katakana string. For the text “一週間ばかり、ニューヨークを取材した。” (“I reported from New York for about a week.”), the analysis result “イッシューカンバカリ（６）｜ニューヨークヲ（３）シュザイシタ（０）” is output. Here “｜” indicates a breath-group boundary, the parentheses mark the accent phrase boundaries, and the number in parentheses indicates the accent position of the accent phrase. The results of the morphological analysis process (S3) and the accent phrase formation process (S4) are stored in the analysis result storage area 222. At this point the number of accent phrases in the text to be output as speech is also calculated and stored in the analysis result storage area 222.

Then the parameter selection process is performed (S5). In this process, the voice output screen 200 (see FIG. 9) is displayed on the display screen 2, and selecting the ◇ tag or a ▽ tag displays the full-text selection mode screen 210 (see FIG. 10) or the accent phrase mode screen 230 (see FIG. 11), on which the various settings for speech output are made. When the output button is selected on the voice output screen 200, if a selection has been made in the mode selection field 211 of the “full-text selection mode”, a value from “1” to “3” corresponding to that selection is stored in the parameter information storage area 223 of the RAM 22; if no selection has been made in the mode selection field 211 and the “accent phrase mode” has settings, “9” is stored. If neither mode has settings, the value remains at its initial value “0”. If either mode is set, the settings are stored in the respective storage areas of the RAM 22, and the parameter selection process ends.

Then the value stored in the parameter information storage area 223 of the RAM 22 is read (S6). If the value is “1”, “2”, “3”, or “9”, that is, if a mode has been set (S7: YES), speech may be output in more than one voice, so the process proceeds to S8. If the value is none of these, that is, if no mode has been set (S7: NO), speech is output in a single voice type: cepstrum analysis processing is performed on the entire text based on the results of the morphological analysis process and the accent phrase formation process to generate a sound source signal (S11), and the sound source signal is output as speech through an MLSA filter (S12). The process then ends.

In this cepstrum analysis processing, based on the result of the morphological analysis, phoneme models stored in an acoustic dictionary in the nonvolatile memory 30 are selected to generate a phoneme sequence (in the present embodiment, the acoustic dictionary of the first voice type, “male”, is used), and the phoneme models of the individual phonemes are concatenated to generate a mel-cepstrum sequence and a voiced/unvoiced information sequence (hereinafter, the mcep sequence). The acoustic dictionary stores a list of 38 phoneme models: “a, b, by, ch, cl, d, dy, e, f, fy, g, gy, h, hy, i, j, k, ky, m, my, n, N, ny, o, p, pau, py, r, ry, s, sh, t, ts, ty, u, w, y, z”. In some cases the preceding and following phonemic environment and the prosodic environment are also taken into account. These phoneme models are obtained by mel-cepstrum analysis of natural speech. The duration of each phoneme model is divided into frames (one frame is 10 ms), and mel-cepstrum coefficients are stored for each frame, together with information on whether the frame is voiced or unvoiced. Note that “pau” denotes a pause.

In addition, a prosody model sequence is generated by selecting, from the prosody models in the acoustic dictionary, the prosody models that match the accent phrase boundaries and accent types obtained in the accent phrase formation process of S4. For the example “一週間ばかり、ニューヨークを取材した。” (“I reported from New York for about a week.”), the prosody model sequence “(9, 6), pau, (6, 3), (5, 0)” is generated. This means a 9-mora (beat) prosody model of accent type 6, followed by a pause, then a 6-mora model of accent type 3 and a 5-mora model of accent type 0. The generated prosody models are then concatenated to produce a pitch sequence. During concatenation, the mora lengths are stretched or compressed to match the length of each phoneme in the phoneme model sequence, so that the pitch sequence stays synchronized with the phoneme models.
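How the (mora count, accent type) pairs relate to the katakana analysis result can be checked with a short sketch. The function names and the string format handling below are illustrative assumptions; the description does not say how the embodiment counts morae. The sketch counts every katakana character as one mora except the small kana that merge with the preceding character.

```python
import re

# Small kana (ァ ィ ゥ ェ ォ ャ ュ ョ ヮ) merge with the preceding kana into one mora.
# The small tsu (ッ) and the long-vowel mark (ー) each count as a mora of their own.
SMALL_KANA = set("\u30a1\u30a3\u30a5\u30a7\u30a9\u30e3\u30e5\u30e7\u30ee")

# One accent phrase: a katakana reading followed by the accent position
# in full-width parentheses (U+FF08, U+FF09).
PHRASE = re.compile(r"([\u30a1-\u30f6\u30fc]+)\uff08(\d+)\uff09")


def count_morae(katakana):
    """Count morae in a katakana reading."""
    return sum(1 for ch in katakana if ch not in SMALL_KANA)


def prosody_sequence(analysis):
    """Turn an analysis result string into a list of (morae, accent type) and 'pau'."""
    sequence = []
    # The full-width vertical bar (U+FF5C) marks a breath-group boundary, i.e. a pause.
    for i, group in enumerate(analysis.split("\uff5c")):
        if i > 0:
            sequence.append("pau")
        for reading, accent in PHRASE.findall(group):
            sequence.append((count_morae(reading), int(accent)))
    return sequence
```

Applied to the analysis result quoted in the description, the sketch returns `[(9, 6), 'pau', (6, 3), (5, 0)]`, the same sequence as in the text.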

Then a sound source signal is generated from the voiced/unvoiced information of the generated mcep sequence and from the generated pitch sequence. In the sound source signal, a pulse train based on the pitch sequence is generated for voiced sections, and a noise signal is generated for unvoiced sections.
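A minimal sketch of this kind of pulse/noise excitation is shown below. It is not the embodiment's implementation: the 16 kHz sample rate, the per-frame pitch given in Hz, the unit-amplitude pulses, and the Gaussian noise are all assumptions made for illustration. Only the 10 ms frame length comes from the description.

```python
import random


def excitation(voiced_flags, pitch_hz, sample_rate=16000, frame_ms=10, seed=0):
    """Build an excitation signal frame by frame.

    voiced_flags: one bool per frame (True = voiced).
    pitch_hz:     one pitch value per frame, used only for voiced frames.
    """
    rng = random.Random(seed)
    samples_per_frame = sample_rate * frame_ms // 1000
    signal = []
    next_pulse = 0.0  # sample position of the next pulse within a voiced run
    for voiced, f0 in zip(voiced_flags, pitch_hz):
        start = len(signal)
        if voiced and f0 > 0:
            frame = [0.0] * samples_per_frame
            period = sample_rate / f0
            if next_pulse < start:
                # A new voiced run begins: restart the pulse train at this frame.
                next_pulse = float(start)
            while next_pulse < start + samples_per_frame:
                frame[int(next_pulse) - start] = 1.0
                next_pulse += period
        else:
            # Unvoiced frame: white noise.
            frame = [rng.gauss(0.0, 1.0) for _ in range(samples_per_frame)]
        signal.extend(frame)
    return signal
```

Carrying `next_pulse` across consecutive voiced frames keeps the pulse spacing continuous at frame boundaries, which is one simple design choice among several.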

If the value read from the parameter information storage area 223 of the RAM 22 is “1”, “2”, “3”, or “9”, that is, if a mode has been set (S7: YES), it is determined whether the “full-text selection mode” has been designated (S8). If the value is not “1”, “2”, or “3”, the “full-text selection mode” has not been designated (S8: NO), which means that the “accent phrase mode” has been designated, so the accent phrase process is performed (S13, see FIG. 16) and the process ends. If the “full-text selection mode” has been designated (S8: YES) and the value is “1”, meaning “all together” is selected (S9: YES), the all-together process is performed (S14, see FIG. 13). If the value is “2”, meaning “repetition” is selected (S9: NO, S10: YES), the repetition process is performed (S15, see FIG. 14). If the value is “3”, meaning “troll” is selected (S9: NO, S10: NO), the troll process is performed (S16, see FIG. 15). The process then ends.
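The branching of S7 to S10 amounts to a small dispatch on the stored mode value. The sketch below only returns the name of the branch taken; the function name and the returned labels are hypothetical.

```python
def select_process(mode_value):
    """Return which branch of the main process runs for a stored mode value."""
    if mode_value not in (1, 2, 3, 9):   # S7: NO, no mode set
        return "single_voice"            # S11, S12
    if mode_value == 9:                  # S8: NO, accent phrase mode
        return "accent_phrase_process"   # S13
    if mode_value == 1:                  # S9: YES
        return "all_together_process"    # S14
    if mode_value == 2:                  # S9: NO, S10: YES
        return "repetition_process"      # S15
    return "troll_process"               # S9: NO, S10: NO -> S16
```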

The all-together process will now be described with reference to the flowchart of FIG. 13. First, the variable n for counting accent phrases is set to its initial value “1” (S21). The analysis result of the n-th accent phrase is then read from the analysis result storage area 222 (S22). Next, the variable m for counting voice types is set to its initial value “1” (S23), and it is determined whether the m-th voice type is set as a voice to be output (S24). This is determined by whether “1” is stored in the setting field of the m-th voice type in the all-together information storage area 224.

If “1” is stored in the setting field, that is, if the voice type is set for output (S24: YES), cepstrum processing is performed on the n-th accent phrase with the m-th voice type, and the resulting sound source signal is stored in the voice data field of the m-th voice type in the all-together information storage area 224 (S25). Then “1” is added to the voice type counter m (S26), and the process proceeds to S27. If “1” is not stored in the setting field, that is, if the voice type is not set for output (S24: NO), no voice data (sound source signal) needs to be generated, so the process proceeds to S27 without doing anything.

In S27, whether all voice types have been processed is determined by whether the value of the variable m is larger than “5” (the number of voice types in the present embodiment). If it is not larger than “5”, not all voice types have been processed yet (S27: NO), so the process returns to S24 and the next voice type is processed. The processing of S24 to S27 is repeated, and when the value of the variable m exceeds “5” (S27: YES), all the voice data (sound source signals) stored in the voice data fields of the all-together information storage area 224 are added together, the waveform is processed by level correction, the result is stored in the synthesized speech storage area 228 (S28), and the speech is output (S29).

Then “1” is added to the variable n that counts accent phrases (S30). If the value of n is not larger than the number of accent phrases of the output text stored in the analysis result storage area 222 (S31: NO), the process returns to S22 and the next accent phrase is processed (S22 to S31). The processing of S22 to S31 is repeated, and when all accent phrases have been processed (S31: YES), the all-together process ends, control returns to the main process, and the main process also ends.
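The per-accent-phrase mixing can be sketched as below. The `synthesize` callback stands in for the cepstrum processing and is hypothetical, and the description does not specify what “level correction” does; peak normalization is used here purely as an assumption.

```python
def mix(signals):
    """Add signals sample by sample, then apply a simple level correction.

    Assumption: level correction is modelled as scaling down so that the
    peak does not exceed 1.0; the description does not define it.
    """
    length = max((len(s) for s in signals), default=0)
    total = [0.0] * length
    for s in signals:
        for i, x in enumerate(s):
            total[i] += x
    peak = max((abs(x) for x in total), default=0.0)
    if peak > 1.0:
        total = [x / peak for x in total]
    return total


def all_together(accent_phrases, enabled, synthesize):
    """Return one mixed signal per accent phrase.

    enabled:    dict voice-type number (1..5) -> bool, the setting fields.
    synthesize: hypothetical callback (phrase, voice_type) -> list of samples.
    """
    outputs = []
    for phrase in accent_phrases:                      # loop over n (S22 to S31)
        signals = [synthesize(phrase, m)               # S25
                   for m in range(1, 6) if enabled[m]]  # S24 for m = 1..5
        outputs.append(mix(signals))                   # S28; output happens at S29
    return outputs
```

Mixing and outputting one accent phrase at a time, rather than the whole text, is what keeps the delay before the first sound short.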

In this way, when “all together” is selected in the “full-text selection mode” and more than one voice type is designated, the effect is that of several people reading the same text aloud together. Moreover, because cepstrum processing is performed and speech is output one accent phrase at a time, little time elapses between the user's instruction to output speech and the start of the output, even when several voice types are output simultaneously, so speech is output smoothly.

Next, the repetition process will be described with reference to the flowchart of FIG. 14. First, the variable n for counting accent phrases is set to its initial value “1” (S42). The analysis result of the n-th accent phrase is then read from the analysis result storage area 222 (S43), the acoustic dictionary of the voice type set as the leading voice type in the repetition information storage area 225 is referenced, cepstrum processing is performed on the n-th accent phrase, and the resulting sound source signal is stored in the voice data field of the leading voice type in the repetition information storage area 225 (S44). The speech is then output (S45).

Next, the variable p for counting repeating voice types is set to its initial value “1” (S46). Cepstrum processing is performed on the n-th accent phrase with the voice type indicated by the value stored in the p-th setting field, and the resulting sound source signal is stored in the p-th voice data field of the repeating voice types in the repetition information storage area 225 (S47). Then “1” is added to the variable p (S48), and whether all repeating voice types have been processed is determined by whether p has become larger than the number of repeaters (S49). If not all repeating voice types have been processed (S49: NO), the process returns to S47 and cepstrum analysis processing is performed for the next repeating voice type (S47). The processing of S47 to S49 is repeated, and when all repeating voices have been processed (S49: YES), the voice data (sound source signals) stored in the voice data fields of the repeating voice types are added together, the waveform is processed by level correction, the result is stored in the synthesized speech storage area 228 (S50), and the speech is output (S51).

Then “1” is added to the variable n that counts accent phrases (S52). If the value of n is not larger than the number of accent phrases (S53: NO), the process returns to S43 and the next accent phrase is processed (S43 to S53). The processing of S43 to S53 is repeated, and when all accent phrases have been processed (S53: YES), the repetition process ends, control returns to the main process, and the main process also ends.
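The resulting output order can be written down as a simple schedule: for each accent phrase, the leading voice speaks alone, then the repeating voices speak together. The function below is an illustrative sketch with a hypothetical name; it lists which voice types sound at each output step and does not synthesize anything.

```python
def repetition_schedule(num_phrases, leader, repeaters):
    """List (accent phrase number, voice types sounding) for each output step."""
    schedule = []
    for n in range(1, num_phrases + 1):
        schedule.append((n, [leader]))          # S44, S45: leading voice alone
        schedule.append((n, list(repeaters)))   # S47 to S51: repeating voices mixed
    return schedule
```

For the FIG. 5 example (leader 1, repeaters 2 and 5) and two accent phrases, the schedule is phrase 1 by voice 1, phrase 1 by voices 2 and 5, then the same pattern for phrase 2.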

In this way, when “repetition” is selected in the “full-text selection mode”, one accent phrase is first output in the voice designated as the leading voice type, and the same accent phrase is then output in the voices designated as repeating voice types. Here too, because cepstrum processing is performed and speech is output one accent phrase at a time, little time elapses between the user's instruction to output speech and the start of the output, even when several voice types are output simultaneously, so speech is output smoothly.

Next, the troll process will be described with reference to the flowchart of FIG. 15. First, the number of morphemes in the first accent phrase is calculated and set in the corresponding field of the troll information storage area 226 (S61). Then the variable s for counting morphemes is set to its initial value “1” (S62), the analysis result of the s-th morpheme is read from the analysis result storage area 222 (S63), and the variable q for counting the troll order is set to its initial value “1” (S64). Cepstrum processing is performed on the s-th morpheme with the voice type indicated by the value set in the q-th voice type field of the troll information storage area 226 to generate a sound source signal (S65), which is stored in the (q + (number of morphemes in the first accent phrase) × (q − 1))-th field of the q-th per-morpheme voice data field in the troll information storage area 226 (S66).

Then “1” is added to the variable q that counts the troll order (S67), and whether processing has reached the last voice type is determined by whether the value of q is larger than the number of singers (S68). If processing has not reached the end (S68: NO), the process returns to S65 and the voice type next in order is processed (S65 to S68). The processing of S65 to S68 is repeated, and when the last voice type has been processed (S68: YES), the voice data (sound source signals) stored in (1, 1), (2, 1), (3, 1), (4, 1), and (5, 1) of the per-morpheme voice data fields are added together, the waveform is processed by level correction, the result is stored in the synthesized speech storage area 228 (S69), and the speech is output (S70). The contents of the per-morpheme voice data fields are then shifted (S71). Specifically, the voice data in (1, 1), (2, 1), (3, 1), (4, 1), and (5, 1), which has just been output, is deleted. The voice data in (1, 2) is stored in (1, 1), that in (2, 2) in (2, 1), that in (3, 2) in (3, 1), that in (4, 2) in (4, 1), and that in (5, 2) in (5, 1). Likewise, the voice data in (1, 3) is stored in (1, 2), that in (2, 3) in (2, 2), that in (3, 3) in (3, 2), that in (4, 3) in (4, 2), and that in (5, 3) in (5, 2). In this manner, the data in every field that holds voice data is shifted to the field immediately before it.

Then “1” is added to the variable s for counting morphemes (S72), and whether all morphemes have been processed is determined by whether the value of s exceeds the total number of morphemes (S73). If s does not exceed the number of morphemes (S73: NO), the process returns to S63 and the next morpheme is processed (S63 to S73). Steps S63 to S73 are repeated, and when all morphemes have been processed (S73: YES), the round processing ends, control returns to the main process, and the main process also ends.

In this way, each voice type starts its output, in the designated order, after the first accent phrase of the preceding voice type has been output, so that the speech sounds as if several people were reading aloud one after another, as in a musical round. Because cepstrum processing is performed and speech is output morpheme by morpheme, little time elapses between the user's output instruction and the start of speech output even when several voice types are output simultaneously, and the speech is output smoothly.
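The bookkeeping of steps S61 to S73 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation of the embodiment: the names `round_schedule` and `emit_and_shift` are hypothetical, a `(voice, morpheme)` tuple stands in for the sound source signal that cepstrum processing would produce, and the final `flush` loop, which drains data still waiting in later fields after the last morpheme, is an added assumption that the flowchart description does not mention. The column offset is taken literally from S66.

```python
def emit_and_shift(table):
    """S69 to S71: collect field 1 of every row, then shift each row left by one."""
    mixed = [row[1] for row in table if 1 in row]
    for i, row in enumerate(table):
        # Drop the field that was just output and move every other field forward.
        table[i] = {col - 1: unit for col, unit in row.items() if col > 1}
    return mixed


def round_schedule(morphemes, voices, first_phrase_len, flush=True):
    """Return, for each output step, the (voice, morpheme) units mixed together."""
    table = [{} for _ in voices]  # one per-morpheme voice data row per voice type
    steps = []
    for morpheme in morphemes:                         # loop over s (S62, S72, S73)
        for q, voice in enumerate(voices, start=1):    # loop over q (S64, S67, S68)
            column = q + first_phrase_len * (q - 1)    # field index as stated in S66
            table[q - 1][column] = (voice, morpheme)   # placeholder for S65 output
        steps.append(emit_and_shift(table))
    # Assumption: keep emitting until the delayed voices have finished.
    while flush and any(table):
        steps.append(emit_and_shift(table))
    return steps


schedule = round_schedule(["a", "b", "c"], ["V1", "V2"], first_phrase_len=1)
for step_number, units in enumerate(schedule, start=1):
    print(step_number, units)
```

In this toy run the second voice first appears in the third step, because the field index from S66 places its data two fields behind the first voice when the first accent phrase contains one morpheme.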

Next, the accent phrase processing will be described with reference to the flowchart of FIG. 16. First, the variable n for counting accent phrases is set to its initial value “1” (S81). The analysis result for the n-th accent phrase is then read from the analysis result storage area 222 (S82). The variable m for counting voice types is set to its initial value “1” (S83), and it is determined whether the m-th voice type is set as a voice to be output (S84). This determination depends on whether “1” is stored in the setting field for the n-th accent phrase of the m-th voice type in the accent phrase mode information storage area 227.

If “1” is stored in the setting field and the voice type is set for output (S84: YES), cepstrum processing is performed on the n-th accent phrase with the m-th voice type, a sound source signal is generated (S85), and the signal is stored in the voice data field for the n-th accent phrase of the m-th voice type in the accent phrase mode information storage area 227 (S86). Then “1” is added to the voice type counting variable m (S87), and the process proceeds to S88. If “1” is not stored in the setting field and the voice type is not set for output (S84: NO), no voice data (sound source signal) needs to be generated, so the process proceeds directly to S87, where “1” is added to m, and then to S88.

In S88, whether all voice types have been processed is determined by whether the value of m exceeds “5” (the number of voice types in this embodiment). If m does not exceed “5”, not all voice types have been processed (S88: NO), so the process returns to S84 and the next voice type is processed. Steps S84 to S88 are repeated, and when m exceeds “5” (S88: YES), all the voice data (sound source signals) stored in the voice data fields for the n-th accent phrase in the accent phrase mode information storage area 227 are added together, the waveform is shaped by level correction, and the result is stored in the synthesized speech storage area 228 (S89) and output as sound (S90).

Then “1” is added to the variable n for counting accent phrases (S91). If n does not exceed the number of accent phrases (S92: NO), the process returns to S82 and the next accent phrase is processed (S82 to S92). Steps S82 to S92 are repeated, and when all accent phrases have been processed (S92: YES), the accent phrase processing ends, control returns to the main process, and the main process also ends.

In this way, voice data is created for the voice types designated for each accent phrase, so that each accent phrase can be output with different voice types and with a different number of voice types. Here too, because cepstrum processing is performed and speech is output accent phrase by accent phrase, little time elapses between the user's output instruction and the start of speech output even when several voice types are output simultaneously, and the speech is output smoothly.
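The loop of S81 to S92 can be sketched in Python as follows. This is a rough illustration rather than the embodiment's code: `synthesize` is a hypothetical stand-in for the cepstrum processing of S85, `enabled[m][n]` models the setting field for the m-th voice type and n-th accent phrase (0-based here), and the level correction of S89 is assumed to be a simple peak rescale, which this passage does not specify.

```python
def mix_with_level_correction(signals, peak=1.0):
    """Add the signals sample by sample, then rescale if the sum exceeds `peak`.

    The peak rescale is an assumed form of level correction, not the patent's.
    """
    if not signals:
        return []
    mixed = [0.0] * max(len(sig) for sig in signals)
    for sig in signals:
        for i, sample in enumerate(sig):
            mixed[i] += sample
    top = max(abs(sample) for sample in mixed)
    if top > peak:
        scale = peak / top
        mixed = [sample * scale for sample in mixed]
    return mixed


def accent_phrase_mode(phrases, enabled, synthesize, peak=1.0):
    """Return one mixed signal per accent phrase (S81 to S92)."""
    output = []
    for n, phrase in enumerate(phrases):             # loop over n (S81, S91, S92)
        signals = []
        for m in range(len(enabled)):                # loop over m (S83, S87, S88)
            if enabled[m][n]:                        # S84: is this voice type set?
                signals.append(synthesize(m, phrase))  # S85 and S86
        output.append(mix_with_level_correction(signals, peak))  # S89
    return output


def toy_synth(m, phrase):
    """Hypothetical synthesizer: constant amplitude per voice, one sample per character."""
    amplitude = 0.5 if m == 0 else 0.25
    return [amplitude] * len(phrase)


mixed = accent_phrase_mode(["ab", "c"], [[1, 0], [1, 1]], toy_synth, peak=0.5)
print(mixed)
```

Building the signal one accent phrase at a time, as here, is what lets output begin before the rest of the text has been synthesized.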

In the above embodiment, the storage area of the nonvolatile memory 30 that stores the acoustic dictionaries corresponds to the “acoustic dictionary storage means”, and the analysis result storage area 222 of the RAM 22 corresponds to the “analysis result storage means”. The text input field 291 of the main screen 290 corresponds to the “sentence input means”. The voice output screen 200, the full-text selection mode screen 210, and the accent phrase mode screen 230 correspond to the “voice type designation means”. The voice output screen 200 and the full-text selection mode screen 210 correspond to the “whole voice type designation means”, and the voice output screen 200 and the accent phrase mode screen 230 correspond to the “accent phrase-specific voice designation means”. The voice output screen 200 and the full-text selection mode screen 210 also correspond to the “part-by-part voice type designation means”, the “multi-part voice type designation means”, and the “order designation means”.

The CPU 21 performing steps S3 and S4 of the main process shown in FIG. 12 corresponds to the “language analysis means”. The CPU 21 performing S11 of the main process shown in FIG. 12, S25 of the simultaneous processing shown in FIG. 13, S44 and S47 of the repetition processing shown in FIG. 14, S65 of the round processing shown in FIG. 15, and S85 of the accent phrase processing shown in FIG. 16 corresponds to both the “acoustic model selection means” and the “speech generation means”. The CPU 21 performing the repetition processing shown in FIG. 14 corresponds to the “repetition speech generation means”, and the CPU 21 performing the round processing shown in FIG. 15 corresponds to the “round speech generation means”. The tag in the text display field 201 of the voice output screen 200 and the mode selection field 211 of the full-text selection mode screen 210 correspond to the “designation method selection means”.

The speech synthesizer and speech synthesis program of the present invention are not limited to the embodiment described above, and various modifications can of course be made without departing from the gist of the invention. In the above embodiment the speech synthesizer is the mobile phone 1, but it need not be a mobile phone and may be another device such as a dedicated speech synthesis terminal or a personal computer. Also, in the above embodiment all processing is completed within the mobile phone 1. However, the device that has the “sentence input means” and the “voice type designation means”, with which the user instructs speech output, and the device that has the “language analysis means”, “analysis result storage means”, “acoustic model selection means”, and “speech generation means”, which generates the voice data to be output, need not be the same device. These devices may be connected over a network such as the Internet or a LAN so that they can exchange data, and together constitute the speech synthesizer of the present invention.

The control of the simultaneous processing, repetition processing, round processing, and accent phrase processing is not limited to that of the above embodiment; any control that achieves the same effect may be used. In the above embodiment the voice type is selected for each accent phrase, but the block is not limited to an accent phrase and may be a sentence delimited by a predetermined symbol (for example, #, *, /, ○, ●, or ◎), a word, a character, a morpheme (part of speech), or the like. Likewise, in the simultaneous processing and the repetition processing, cepstrum processing and speech output are performed accent phrase by accent phrase, but this processing cycle is also not limited to an accent phrase and may be a sentence delimited by a predetermined symbol (for example, #, *, /, ○, ●, or ◎), a word, a character, a morpheme (part of speech), or the like. The predetermined symbols are entered by the user when entering the text to be output as speech.
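As a small illustration of blocks delimited by user-entered symbols, the following Python sketch splits input text at the example symbols listed above. The function name `split_blocks` and the choice to discard empty blocks are assumptions made for the example; the description does not say how the delimiters are parsed.

```python
import re

# Example delimiter symbols taken from the description.
DELIMITERS = "#*/○●◎"


def split_blocks(text, delimiters=DELIMITERS):
    """Split text into blocks at any delimiter symbol, dropping empty blocks."""
    pattern = "[" + re.escape(delimiters) + "]"
    return [block for block in re.split(pattern, text) if block]


print(split_blocks("こんにちは#世界/テスト"))
```

Each returned block could then be treated as the unit for which a voice type is selected or for which one synthesis cycle is run.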

In the above embodiment only one voice type serves as the leading voice type in “repetition”, but it goes without saying that a plurality of voices may also be made designatable as the leading voice type.

In the above embodiment, the round processing follows a musical round: when the first accent phrase of the preceding voice type has been output, output of the next voice type starts, and output continues to the end of the text. However, the method of outputting a plurality of voices staggered in order is not limited to this. For example, after the first accent phrase of the first voice type has been output, the second accent phrase of the first voice type and the first accent phrase of the second voice type may be output with their start positions aligned, with the shorter accent phrase padded by a pause for the missing time. After that output ends, the third accent phrase of the first voice type, the second accent phrase of the second voice type, and the first accent phrase of the third voice type are output with their start positions aligned, the shorter accent phrases again being padded by pauses, and so on, so that the voices are staggered by one accent phrase each. In this case it is efficient to create a sound source signal for each accent phrase and output it, rather than creating a sound source signal for the whole text before output, and real-time speech output becomes possible.
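The phrase-aligned variant described above can be sketched as a scheduling function in Python. This is a minimal sketch under assumptions: `durations[v][p]` is a hypothetical table giving the length of accent phrase p when spoken by voice type v (0-based), all voice types read the same number of accent phrases, and later voices keep going after the first voice has finished, which the description implies but does not spell out.

```python
def staggered_phrase_schedule(durations):
    """Plan output slots in which voice v speaks accent phrase (slot index - v).

    Returns one dict per slot mapping voice index -> (phrase index, pause length),
    where the pause pads a shorter phrase up to the longest phrase in the slot.
    """
    num_voices = len(durations)
    num_phrases = len(durations[0])
    slots = []
    for t in range(num_phrases + num_voices - 1):
        # Voice v enters one slot after voice v - 1 and speaks phrase t - v.
        active = {v: t - v for v in range(num_voices) if 0 <= t - v < num_phrases}
        slot_length = max(durations[v][p] for v, p in active.items())
        slots.append({v: (p, slot_length - durations[v][p]) for v, p in active.items()})
    return slots


plan = staggered_phrase_schedule([[3, 5], [4, 2]])
for slot_number, slot in enumerate(plan, start=1):
    print(slot_number, slot)
```

Because each slot needs only the accent phrases that sound in it, the sound source signals can be generated slot by slot instead of for the whole text up front.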

The speech synthesizer and speech synthesis program of the present invention are applicable to speech synthesizers and speech synthesis programs that synthesize a plurality of voices.

FIG. 1 is an external view of the mobile phone 1.
FIG. 2 is a block diagram showing the electrical configuration of the mobile phone 1.
FIG. 3 is a schematic diagram showing the configuration of the RAM 22.
FIG. 4 is a schematic diagram showing the configuration of the simultaneous information storage area 224 of the RAM 22.
FIG. 5 is a schematic diagram showing the configuration of the repetition information storage area 225 of the RAM 22.
FIG. 6 is a schematic diagram showing the configuration of the round information storage area 226 of the RAM 22.
FIG. 7 is a schematic diagram showing the configuration of the accent phrase mode information storage area 227 of the RAM 22.
FIG. 8 is an image of the main screen 290.
FIG. 9 is an image of the voice output screen 200.
FIG. 10 is an image of the full-text selection mode screen 210.
FIG. 11 is an image of the accent phrase mode screen 230.
FIG. 12 is a flowchart of the main process.
FIG. 13 is a flowchart of the simultaneous processing performed in the main process.
FIG. 14 is a flowchart of the repetition processing performed in the main process.
FIG. 15 is a flowchart of the round processing performed in the main process.
FIG. 16 is a flowchart of the accent phrase processing performed in the main process.

Explanation of symbols

1 Mobile phone
2 Display screen
21 CPU
22 RAM
30 Nonvolatile memory
38 Key input unit
200 Voice output screen
210 Full-text selection mode screen
221 Text storage area
222 Analysis result storage area
223 Parameter information storage area
224 Simultaneous information storage area
225 Repetition information storage area
226 Round information storage area
227 Accent phrase mode information storage area
230 Accent phrase mode screen
290 Main screen
m Voice type counting variable
n Accent phrase counting variable
p Repetition voice type counting variable
q Round order counting variable
s Morpheme counting variable

Claims

A speech synthesizer comprising:
acoustic dictionary storage means for storing, for each voice type, an acoustic dictionary that is a set of acoustic models including at least a phoneme model created from phoneme data obtained by analyzing speech into acoustic parameter sequences and a prosody model created from fundamental frequency data obtained by analyzing speech;
sentence input means for inputting a sentence from which speech is to be generated;
language analysis means for decomposing the sentence input by the sentence input means into words and determining their parts of speech, determining for each accent phrase an accent type indicating the accent position, and determining the reading of the sentence;
analysis result storage means for storing the analysis result produced by the language analysis means;
voice type designation means for designating one or more of the voice types of the acoustic dictionaries stored in the acoustic dictionary storage means;
acoustic model selection means for selecting the acoustic model from the acoustic dictionary based on the analysis result stored in the analysis result storage means and the voice type designated by the voice type designation means; and
speech generation means for generating speech based on the acoustic model selected by the acoustic model selection means.

The speech synthesizer according to claim 1, wherein the speech generation means generates speech for each accent phrase, word, morpheme, or character based on the acoustic model selected by the acoustic model selection means.

The speech synthesizer according to claim 1 or 2, wherein the voice type designation means includes whole voice type designation means for designating one or more of the voice types for the whole of the sentence from which speech is to be generated.

The speech synthesizer according to any one of the above claims, wherein the voice type designation means includes accent phrase-specific voice designation means for designating one or more of the voice types for each accent phrase in the sentence from which speech is to be generated.

The speech synthesizer according to claim 1, wherein the voice type designation means includes part-by-part voice type designation means for designating the voice types of a first part and a second part, respectively, and
the speech generation means includes repetition speech generation means for synthesizing speech such that, for each predetermined block of the sentence from which speech is to be generated, the speech of the voice type designated for the first part is output and then the speech of the voice type designated for the second part is output, so that the second part repeats the speech of the first part.

The speech synthesizer according to claim 1, wherein the voice type designation means includes:
multi-part voice type designation means for designating the voice types of a plurality of parts; and
order designation means for designating the order in which the plurality of parts output speech, and
the speech generation means includes round speech generation means for synthesizing speech such that the speech of a predetermined block of the sentence is first output for the part that is first in the order designated by the order designation means, and each part that is second or later in the order starts its speech output when the preceding part has finished outputting its first block, so that the parts sing in a round in the order designated by the order designation means.

The speech synthesizer according to claim 5 or 6, wherein the block is at least one of a sentence, an accent phrase, a word, or a character delimited by a predetermined symbol.

The speech synthesizer according to claim 3, wherein the voice type designation means includes designation method selection means for selecting one of the accent phrase-specific voice designation means, the whole voice type designation means, the part-by-part voice type designation means, and the multi-part voice type designation means.

A speech synthesis program for causing a computer to function as various processing means of the speech synthesizer according to claim 1.