JP2007127994A

JP2007127994A - Voice synthesizing method, voice synthesizer, and program

Info

Publication number: JP2007127994A
Application number: JP2005322645A
Authority: JP
Inventors: Katsuhiko Kawasaki; 勝彦川崎; Kenichiro Nakagawa; 賢一郎中川; Michio Aizawa; 道雄相澤; Toshiaki Fukada; 俊明深田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2005-11-07
Filing date: 2005-11-07
Publication date: 2007-05-24

Abstract

PROBLEM TO BE SOLVED: To allow a desired word in a text to be read out in a specified voice output mode by referring to a word dictionary in which arbitrary voice output modes are registered, and to draw the user's attention when a specific word is read out so that the user does not miss it. SOLUTION: When text to be output as voice is input from a text capture part 2, a language analysis part 3 performs morphological analysis to divide the text into words, using the inter-word connection information stored in a basic dictionary 8 that holds the default voice output mode of each word. Accent phrases are then formed from the divided words using an accent joining rule, and voice synthesis is performed by reading the voice output mode of each word in each accent phrase from a user dictionary 9, in which an arbitrary voice output mode is stored for each word. The accent phrases are then output from a voice synthesis output part 6. COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a speech synthesis method, a speech synthesis apparatus, and a program for reading out input text in a speech output format registered in a word dictionary, such as speaker, emotion, speed, volume, speech style, voice pitch, or language.

Conventionally, there are tag specifications for controlling how synthesized speech is read, such as the Speech Synthesis Markup Language (http://www.w3.org/TR/speech-synthesis/). These make it possible, for example, to have one occurrence of the word "concentration" read in a male voice and another occurrence read in a female voice. In addition, standard specifications for describing word information, including readings, are being developed, such as the Pronunciation Lexicon Specification (http://www.w3.org/TR/lexicon-reqs/).

Further, the following patent documents relating to speech synthesis processing have been disclosed. Patent Document 1 discloses a technique for changing the prosody of a part of a sentence that is regarded as important. Patent Document 2 discloses a technique in which a prosodic dictionary and a waveform dictionary that take emotion into account are prepared in advance, and when an input sentence is supplied together with a task specification, the sentence is read out with the corresponding emotional speech. Patent Document 3 discloses a technique in which specific words in a word dictionary are not output as speech. Patent Document 4 discloses a technique for extracting an adverb such as "tottemo" ("very") and transforming it according to its degree, for example into "tottemo" or the elongated "tōttemo", before output.
Patent Document 1: Japanese Patent Laid-Open No. 11-231885
Patent Document 2: JP 2001-034282 A
Patent Document 3: JP 2002-221981 A
Patent Document 4: JP 2002-311981 A

However, with the conventional Speech Synthesis Markup Language and other reading tags, a tag must be inserted at every part of the text that is to be read in a particular way, which is laborious. In Patent Documents 1 and 2, prosody and emotion are specified within the document, but they cannot be specified through word dictionary registration. In Patent Document 3, specific words in the word dictionary are suppressed from voice output, but the reading method cannot be specified through word dictionary registration. In Patent Document 4, adverbs are transformed according to their degree before output, but an arbitrary word cannot be output with the volume, speed, prosody, or emotion that the user desires.

The present invention has been made in view of these circumstances, and its object is to provide a speech synthesis method, a speech synthesis apparatus, and a program with the following capability. Namely, by referring to a word dictionary in which arbitrary speech output formats are registered, a desired word in the text can be read out in the specified speech output format, so that the user's attention is drawn when a specific word is read out and the user is kept from missing it.

In order to solve the above problems, a speech synthesis method according to the present invention comprises:
a receiving step of receiving text information;
a dividing step of dividing the text information received in the receiving step into predetermined units;
an acquiring step of acquiring, from a word dictionary in which each word is stored in association with an arbitrary speech output format, the speech output format corresponding to a word that is equal to or contained in each predetermined unit obtained in the dividing step; and
a speech synthesis step of synthesizing and outputting the text information as speech based on the acquired speech output format.
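The four steps of the method can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the implementation described in this document: the names `OutputStyle`, `WORD_DICTIONARY`, `receive`, `divide`, `acquire`, and `synthesize` are hypothetical, whitespace splitting stands in for the dividing step, only exact word matches are looked up, and "synthesis" returns annotated strings instead of audio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputStyle:
    """Hypothetical speech output format attached to a word."""
    speaker: str = "Taro"
    emotion: str = "normal"
    speed: str = "normal"
    volume: str = "normal"


# Hypothetical word dictionary: word -> registered output format.
WORD_DICTIONARY = {"today": OutputStyle("Hanako", "happy", "fast", "loud")}
DEFAULT_STYLE = OutputStyle()


def receive(text: str) -> str:
    """Receiving step: accept the text to be read out."""
    return text.strip()


def divide(text: str) -> list[str]:
    """Dividing step: a whitespace split stands in for real analysis."""
    return text.split()


def acquire(unit: str) -> OutputStyle:
    """Acquiring step: look up the format registered for the unit."""
    return WORD_DICTIONARY.get(unit.lower(), DEFAULT_STYLE)


def synthesize(text: str) -> list[str]:
    """Synthesis step: describe which speaker would read each unit."""
    return [f"{unit}<{acquire(unit).speaker}>" for unit in divide(receive(text))]
```

Calling `synthesize("Today is fine")` tags the registered word with its own speaker and every other word with the default one, without any markup in the input text.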

In addition, in order to solve the above problems, a speech synthesizer according to the present invention comprises:
receiving means for receiving text information;
dividing means for dividing the text information received by the receiving means into predetermined units;
acquiring means for acquiring, from a word dictionary in which each word is stored in association with an arbitrary speech output format, the speech output format corresponding to a word that is equal to or contained in each predetermined unit obtained by the dividing means; and
speech synthesis means for synthesizing and outputting the text information as speech based on the acquired speech output format.

Furthermore, in order to solve the above problems, a program according to the present invention causes a computer to execute the above speech synthesis method.

According to the present invention, by referring to a word dictionary in which arbitrary speech output formats are registered, a desired word in the text can be read out in the specified speech output format. The user's attention can thus be drawn when a specific word is read out, which helps prevent the user from missing it.

Hereinafter, speech synthesis processing by a speech synthesizer according to an embodiment of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a speech synthesizer according to an embodiment of the present invention. As shown in FIG. 1, an input device 1, such as a keyboard or mouse for inputting text, and an output device 7, such as a speaker for outputting synthesized speech, can be attached to and detached from the speech synthesizer main body 12 of this embodiment.

The speech synthesizer main body 12 includes a text capturing unit 2 that captures the text input from the input device 1, and a language analysis unit 3 that performs language analysis on the captured text using the word information and inter-word connection information stored in the basic dictionary 8. It further includes a prosody control unit 4 that generates the prosody of the synthesized speech by assigning readings, accents, and pauses based on the result of the language analysis. It also includes a waveform generation unit 5 that generates a speech waveform by concatenating speech segments recorded in the waveform dictionary 11 according to the generated prosody, and a synthesized speech output unit 6 that synthesizes the generated speech waveform.

The speech synthesizer 12 is also provided with a basic dictionary 8, in which the default notation, reading, accent, part of speech, and conjugation type are recorded for basic words, and with a user dictionary 9, in which the user can freely specify the notation, reading, part of speech, accent, speaker, emotion, reading speed, volume, and effective number of readings for any desired word. It further includes a user dictionary editing unit 10 for editing the information registered in the user dictionary 9, or registering new information, through a GUI.

FIG. 2 is a diagram showing an example of a word registered in the basic dictionary 8 of the speech synthesizer according to this embodiment. In FIG. 2, the reading "きょう" (kyou), the accent "type 1", and the part of speech "noun" are registered for the notation "今日" ("today"). The basic dictionary 8 also records connection information between words. This connection information is used to perform morphological analysis on the input text, dividing it into words and assigning a reading and an accent to each word.

FIG. 3 is a diagram showing an example of the word information registered in the user dictionary, as displayed on the user dictionary editing screen of the user dictionary editing unit 10. As shown in FIG. 3, the word "今日" ("today") is registered in the user dictionary with its properties set as follows: reading "きょう" (kyou), part of speech "noun", accent "type 1", speaker "Taro", emotion "happy", reading speed "fast", volume "loud", and effective count "1 time". Each property can be selected and set from a pull-down menu as shown in FIG. 4. In language analysis, the information in the user dictionary 9 takes priority over the information in the basic dictionary 8.
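The two dictionaries and the priority rule can be modelled roughly as follows. This is a sketch under assumptions: the class names `BasicEntry` and `UserEntry`, the `lookup` helper, and the default property values are hypothetical, and the "天気" entry is an illustrative basic-dictionary record that is not shown in FIG. 2.

```python
from dataclasses import dataclass


@dataclass
class BasicEntry:
    """Fields the basic dictionary is described as holding for a word."""
    notation: str
    reading: str
    accent: str
    part_of_speech: str


@dataclass
class UserEntry(BasicEntry):
    """Adds the user-selectable output properties (defaults are assumed)."""
    speaker: str = "Taro"
    emotion: str = "normal"
    speed: str = "normal"
    volume: str = "normal"
    effective_count: float = float("inf")  # how many times the style applies


basic_dictionary = {
    "今日": BasicEntry("今日", "きょう", "type 1", "noun"),
    "天気": BasicEntry("天気", "てんき", "type 1", "noun"),  # illustrative entry
}
user_dictionary = {
    "今日": UserEntry("今日", "きょう", "type 1", "noun",
                     speaker="Taro", emotion="happy", speed="fast",
                     volume="loud", effective_count=1),
}


def lookup(notation):
    """Return the user entry if present, else the basic entry, else None."""
    return user_dictionary.get(notation) or basic_dictionary.get(notation)
```

Here `lookup("今日")` returns the user entry even though the basic dictionary also contains the word, while `lookup("天気")` falls back to the basic entry.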

FIG. 4 is a diagram showing an example of the property setting screen for a word registered in the user dictionary 9. Here, the remaining properties are set through the GUI for the word with notation "今日" ("today"), reading "きょう" (kyou), and part of speech "noun".

Here, for example, if the speaker is set to "Hanako", the word "今日" in the text is synthesized using Hanako's speech segment data from the waveform dictionary. If the emotion is set to "happy", the prosody control unit generates a prosody corresponding to "happy" for the word "今日" before synthesis. If the reading speed is set to "fast", the word "今日" is read out faster than the other words by a predetermined ratio. Likewise, if the volume is set to "loud", the word "今日" is read out louder than the other words by a predetermined ratio.
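The "predetermined ratio" for speed and volume can be pictured as a per-word scale factor. The document does not give the ratios, so the values in `SPEED_RATIO` and `VOLUME_GAIN`, the setting names "slow" and "quiet", and the function `scale_word` are all assumptions made for illustration.

```python
# Assumed ratios; the source only says "a predetermined ratio".
SPEED_RATIO = {"slow": 0.8, "normal": 1.0, "fast": 1.25}
VOLUME_GAIN = {"quiet": 0.7, "normal": 1.0, "loud": 1.5}


def scale_word(duration_ms, amplitude, speed="normal", volume="normal"):
    """Return (duration, amplitude) after applying a word's speed and volume.

    A faster speed shortens the duration; a louder volume raises the amplitude.
    """
    return duration_ms / SPEED_RATIO[speed], amplitude * VOLUME_GAIN[volume]
```

With these assumed ratios, a word that would normally last 500 ms at amplitude 0.5 becomes 400 ms at amplitude 0.75 when set to "fast" and "loud", while words left at "normal" are unchanged.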

Furthermore, if the effective count is set to "1 time", the word "今日" is read out in the output format specified in the user dictionary 9 only on its first occurrence in the text; on its second and subsequent occurrences it is read out in the default output format. If the effective count is instead specified as "once per sentence", the word is read out in the specified output format only once within each sentence (in this case, there is no need to rewrite the effective count in the user dictionary in step S9 of FIG. 6, described later).
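The two effective-count modes can be sketched as follows. The function name `apply_effective_count` and its parameters are hypothetical, and the input is assumed to be already divided into sentences and words.

```python
def apply_effective_count(sentences, word, limit=1, per_sentence=False):
    """Mark each word as using the 'user' style or the 'default' style.

    sentences: list of sentences, each a list of already-divided words.
    limit: how many occurrences of `word` use the registered style.
    per_sentence: if True, the budget is reset at the start of every
        sentence, so no stored count has to be rewritten.
    """
    remaining = limit
    result = []
    for sentence in sentences:
        if per_sentence:
            remaining = limit
        marked = []
        for w in sentence:
            if w == word and remaining > 0:
                marked.append((w, "user"))
                remaining -= 1
            else:
                marked.append((w, "default"))
        result.append(marked)
    return result
```

With `limit=1`, only the first occurrence in the whole text is marked `"user"`; with `per_sentence=True`, the first occurrence in each sentence is.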

FIG. 5 is a diagram for explaining the outline of the process by which the speech synthesizer according to an embodiment of the present invention synthesizes speech from input text. FIG. 6 is a flowchart for explaining the corresponding processing procedure.

First, in step S1, text is entered using the input device 1, and the text capturing unit 2 takes the text into the speech synthesizer main body 12. Here, as shown in FIG. 5(a), assume that the text "今日は良い天気です" ("The weather is nice today") has been input. Next, in step S2, the language analysis unit 3 performs morphological analysis on the input text using the basic dictionary 8 and the user dictionary 9. As shown in FIG. 5(b), the text is divided into the words "今日", "は", "良い", "天気", and "です" using the conjugation information and connection information of each word.

Then, in step S3, the user dictionary 9 is consulted to check the effective count of the reading style for each word. Here, it is assumed that the word "今日" is specified in the user dictionary to be read in the registered style only once.

In steps S4 to S5, a reading, an accent, and speaker, emotion, speed, and volume information are assigned to each word. Grammatical information such as part of speech and accent combination rules is also assigned to each word, but is omitted from the description below. Here, each word is annotated as follows:
"kyo'u (speaker: Hanako) (emotion: happy) (reading speed: fast) (volume: loud) (effective count: 1)",
"wa (speaker: Taro) (emotion: normal) (reading speed: normal) (volume: normal) (effective count: unlimited)",
"yo'i (speaker: Taro) (emotion: normal) (reading speed: normal) (volume: normal) (effective count: unlimited)",
"te'nki (speaker: Taro) (emotion: normal) (reading speed: normal) (volume: normal) (effective count: unlimited)",
"de'su (speaker: Taro) (emotion: normal) (reading speed: normal) (volume: normal) (effective count: unlimited)"
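The annotation in steps S4 to S5 amounts to merging per-word overrides into a default property set. The sketch below assumes romanized readings as keys and uses the hypothetical names `DEFAULT`, `USER`, `annotate`, and `render`; grammatical information is left out, as in the description.

```python
# Default properties given to words without a user-dictionary entry.
DEFAULT = {"speaker": "Taro", "emotion": "normal", "reading speed": "normal",
           "volume": "normal", "effective count": "unlimited"}

# Hypothetical user dictionary keyed by accented reading.
USER = {"kyo'u": {"speaker": "Hanako", "emotion": "happy",
                  "reading speed": "fast", "volume": "loud",
                  "effective count": "1"}}


def annotate(readings):
    """Attach a full property set to each accented reading."""
    return [(r, {**DEFAULT, **USER.get(r, {})}) for r in readings]


def render(reading, props):
    """Format one annotated word in a parenthesized property notation."""
    return reading + " " + " ".join(f"({k}: {v})" for k, v in props.items())
```

For example, `annotate(["kyo'u", "wa", "yo'i", "te'nki", "de'su"])` yields Hanako's properties for the first word and the defaults for the rest, and `render` turns each pair back into a single annotated line.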

Then, using the accent combination rules, the accent phrases are generated as shown in FIG. 5(c):
"kyo'u (speaker: Hanako) (emotion: happy) (reading speed: fast) (volume: loud) (effective count: 1) wa (speaker: Taro) (emotion: normal) (reading speed: normal) (volume: normal) (effective count: unlimited)",
"yo'i (speaker: Taro) (emotion: normal) (reading speed: normal) (volume: normal) (effective count: unlimited)",
"te'nki desu (speaker: Taro) (emotion: normal) (reading speed: normal) (volume: normal) (effective count: unlimited)"
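The grouping into accent phrases can be imitated with a very simple rule. The actual accent combination rules are not spelled out in this document, so the rule below (a particle or auxiliary verb joins the preceding word) and the names `FUNCTION_POS` and `form_accent_phrases` are assumptions chosen only to reproduce the three phrases of FIG. 5(c).

```python
# Parts of speech assumed to attach to the preceding word.
FUNCTION_POS = {"particle", "auxiliary"}


def form_accent_phrases(words):
    """Group (reading, part_of_speech) pairs into accent phrases.

    Assumed rule: a particle or auxiliary verb joins the phrase of the
    preceding word; any other word starts a new phrase.
    """
    phrases = []
    for reading, pos in words:
        if pos in FUNCTION_POS and phrases:
            phrases[-1].append(reading)
        else:
            phrases.append([reading])
    return phrases


WORDS = [("kyo'u", "noun"), ("wa", "particle"), ("yo'i", "adjective"),
         ("te'nki", "noun"), ("de'su", "auxiliary")]
```

Applied to `WORDS`, the function returns three phrases: "kyo'u" with "wa", "yo'i" alone, and "te'nki" with "de'su". Each word keeps its own output properties inside the phrase.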

Next, in step S6, a prosody corresponding to the accent type and emotion is assigned to the words in each accent phrase, as shown in FIG. 5(d). In step S7, the speech segment data of the speaker designated for each word (that is, "Hanako", "Taro", and so on) is acquired from the waveform dictionary 11, and the waveform generation unit 5 concatenates the speech segment data for each pitch according to the generated prosody. In step S8, each word is output as speech at its designated speed and volume. Finally, in step S9, the effective reading count of the word "今日" in the user dictionary 9 is decremented by one.

In this way, when registering a word in the user dictionary, the speaker, emotion, reading speed, volume, and effective reading count to be used when reading out that word can be specified easily and reflected appropriately in speech synthesis.

<Example 2>
Using an apparatus similar to the speech synthesizer described above, the entire accent phrase containing a word for which a desired reading format is specified may be read out in that format. That is, for the same text, the accent phrases are generated as, for example:
"kyo'uwa (speaker: Hanako) (emotion: happy) (reading speed: fast) (volume: loud) (effective count: 1)",
"yo'i (speaker: Taro) (emotion: normal) (reading speed: normal) (volume: normal) (effective count: unlimited)",
"te'nki desu (speaker: Taro) (emotion: normal) (reading speed: normal) (volume: normal) (effective count: unlimited)"
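The phrase-wide variant of Example 2 can be sketched as propagating a registered style to every word of its accent phrase. The function name `propagate_style` and the representation of a style as a dict (or `None` for words without a user-dictionary entry) are assumptions.

```python
def propagate_style(phrase, default_style):
    """Give every word in an accent phrase the style of its first styled word.

    phrase: list of (reading, style) pairs, where style is a dict, or None
        when the word has no user-dictionary entry.
    default_style: style used when no word in the phrase is styled.
    """
    chosen = next((s for _, s in phrase if s is not None), default_style)
    return [(reading, chosen) for reading, _ in phrase]
```

For the phrase made of "kyo'u" (styled) and "wa" (unstyled), both words come out with Hanako's style, so the particle is read in the same voice as the word it follows.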

<Example 3>
Using an apparatus similar to the speech synthesizer described above, a word for which a desired reading format is specified may be excluded from accent combination and read out as an independent accent phrase. That is, for the same text, the accent phrases are generated as, for example:
"kyo'u (speaker: Hanako) (emotion: happy) (reading speed: fast) (volume: loud) (effective count: 1)",
"wa (speaker: Taro) (emotion: normal) (reading speed: normal) (volume: normal) (effective count: unlimited)",
"yo'i (speaker: Taro) (emotion: normal) (reading speed: normal) (volume: normal) (effective count: unlimited)",
"te'nki desu (speaker: Taro) (emotion: normal) (reading speed: normal) (volume: normal) (effective count: unlimited)"
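Example 3 can be sketched by blocking accent combination around styled words. As before, the combination rule (particles and auxiliary verbs attach to the preceding word) is an assumption rather than the rule used in this document, and `form_phrases_isolating_styled` is a hypothetical name.

```python
def form_phrases_isolating_styled(words, styled,
                                  function_pos=("particle", "auxiliary")):
    """Group words into accent phrases, keeping styled words on their own.

    words: list of (reading, part_of_speech) pairs.
    styled: set of readings that have a registered reading format.
    A function word attaches to the previous phrase only when neither it
    nor the word before it is styled.
    """
    phrases = []
    for reading, pos in words:
        attach = (pos in function_pos and bool(phrases)
                  and phrases[-1][-1] not in styled
                  and reading not in styled)
        if attach:
            phrases[-1].append(reading)
        else:
            phrases.append([reading])
    return phrases
```

With "kyo'u" as the only styled word, the particle "wa" no longer joins it, which gives the four phrases listed in Example 3.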

<Example 4>
In Examples 1 to 3 above, output formats other than speaker, emotion, speed, and volume may also be made configurable in the user dictionary 9 for each word. For example, speech style ("announcer style", "recitation style", "conversational style", etc.), voice pitch ("very high", "high", "normal", "low", "very low", etc.), and language (for example, reading English words "in an American-sounding way") can be used.

<Other embodiments>
Although embodiments have been described in detail above, the present invention can be embodied as, for example, a system, an apparatus, a method, a program, or a storage medium (recording medium). Specifically, it may be applied to a system composed of a plurality of devices, or to an apparatus consisting of a single device.

The present invention also covers the case where a software program that realizes the functions of the above-described embodiments (in the embodiments, a program corresponding to the flowcharts shown in the drawings) is supplied directly or remotely to a system or apparatus, and the invention is achieved by a computer of that system or apparatus reading out and executing the supplied program code.

Accordingly, the program code itself that is installed in a computer in order to realize the functional processing of the present invention on that computer also realizes the present invention. In other words, the present invention includes the computer program itself for realizing the functional processing of the present invention.

In that case, as long as it has the functions of the program, it may take the form of object code, a program executed by an interpreter, script data supplied to an OS, or the like.

Examples of recording media for supplying the program include the following: floppy (registered trademark) disk, hard disk, optical disk, magneto-optical disk, MO, CD-ROM, CD-R, CD-RW, magnetic tape, nonvolatile memory card, ROM, and DVD (DVD-ROM, DVD-R).

As another method of supplying the program, the program can be downloaded from a website on the Internet to a recording medium such as a hard disk using the browser of a client computer. That is, the client connects to the website and downloads from it either the computer program of the present invention itself or a compressed file that includes an automatic installation function. The program can also be supplied by dividing the program code constituting the program of the present invention into a plurality of files and downloading each file from a different website. In other words, a WWW server that allows a plurality of users to download the program files for realizing the functional processing of the present invention on a computer is also included in the present invention.

Alternatively, the program of the present invention may be encrypted, stored on a storage medium such as a CD-ROM, and distributed to users. Users who satisfy predetermined conditions are then allowed to download, from a website via the Internet, key information for decrypting the program. By using that key information, the encrypted program can be executed and installed on a computer.

The functions of the above-described embodiments are realized when the computer executes the program it has read out. In addition, an OS or the like running on the computer may perform part or all of the actual processing based on the instructions of the program, and the functions of the above-described embodiments can also be realized by that processing.

Furthermore, the functions of the above-described embodiments are also realized after the program read from the recording medium has been written into a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer. That is, a CPU or the like provided on the function expansion board or in the function expansion unit performs part or all of the actual processing based on the instructions of the program, and the functions of the above-described embodiments are realized by that processing.

FIG. 1 is a block diagram showing the configuration of a speech synthesizer according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of a word registered in the basic dictionary 8 of the speech synthesizer according to this embodiment.
FIG. 3 is a diagram showing an example of the word information registered in the user dictionary, as displayed on the user dictionary editing screen of the user dictionary editing unit 10.
FIG. 4 is a diagram showing an example of the property setting screen for a word registered in the user dictionary 9.
FIG. 5 is a diagram for explaining the outline of the process by which the speech synthesizer according to an embodiment of the present invention synthesizes speech from input text.
FIG. 6 is a flowchart for explaining the processing procedure by which the speech synthesizer according to an embodiment of the present invention synthesizes speech from input text.

Explanation of symbols

1 Input device
2 Text capturing unit
3 Language analysis unit
4 Prosody control unit
5 Waveform generation unit
6 Synthesized speech output unit
7 Output device
8 Basic dictionary
9 User dictionary
10 User dictionary editing unit
11 Waveform dictionary
12 Speech synthesizer main body

Claims

A speech synthesis method comprising:
a receiving step of receiving text information;
a dividing step of dividing the text information received in the receiving step into predetermined units;
an acquiring step of acquiring, from a word dictionary in which each word is stored in association with an arbitrary speech output format, the speech output format corresponding to a word that is equal to or contained in each predetermined unit obtained in the dividing step; and
a speech synthesis step of synthesizing and outputting the text information as speech based on the acquired speech output format.

The speech synthesis method according to claim 1, wherein the dividing step divides the text information into accent phrase units.

The speech synthesis method according to claim 1, wherein the dividing step divides the text information into predetermined units by performing morphological analysis.

The speech synthesis method according to claim 1, further comprising a registration step of newly registering a speech output format of a word in the word dictionary.

The speech synthesis method according to claim 1, further comprising an editing step of editing a speech output format of words stored in the word dictionary.

The speech synthesis method according to claim 1, further comprising a setting step of setting an effective number of times for the speech output format of a word stored in the word dictionary, wherein, when each word in the text information is synthesized and output, the speech synthesis step performs speech synthesis with reference to the speech output format if the synthesis of that word is within the effective number of times, and performs speech synthesis with reference to a default speech output format if the effective number of times has been exceeded.

The speech synthesis method according to claim 1, wherein the type of speech output format of the word is any one of speaker, emotion, speech style, voice pitch, speed, volume, and language, or a combination of two or more thereof.

The speech synthesis method according to claim 7, wherein the speech synthesis step reads out, for each word in the text information, the waveform data of the speaker set in the speech output format from a waveform dictionary storing waveform data of a plurality of speakers, and synthesizes speech for each word.

The speech synthesis method according to claim 7, wherein the speech synthesis step synthesizes speech for each word in the text information with reference to prosodic data for the emotion set in the speech output format.

The speech synthesis method according to claim 7, wherein the speech synthesis step synthesizes speech for each word in the text information at a speed set in the speech output format.

The speech synthesis method according to claim 7, wherein the speech synthesis step synthesizes each word in the text information at a volume set in the speech output format.

The speech synthesis method according to claim 7, wherein the speech synthesis step synthesizes each word in the text information with a speech style set in the speech output format.

The speech synthesis method according to claim 7, wherein the speech synthesis step synthesizes each word in the accent phrase at a voice pitch set in the speech output format.

The speech synthesis method according to claim 7, wherein the speech synthesis step synthesizes speech for each word in the text information in a language set in the speech output format.

A speech synthesizer comprising:
receiving means for receiving text information;
dividing means for dividing the text information received by the receiving means into predetermined units;
acquiring means for acquiring, from a word dictionary in which each word is stored in association with an arbitrary speech output format, the speech output format corresponding to a word that is equal to or contained in each predetermined unit obtained by the dividing means; and
speech synthesis means for synthesizing and outputting the text information as speech based on the acquired speech output format.

A program for causing a computer to execute the speech synthesis method according to any one of claims 1 to 14.