JP2001265374A

JP2001265374A - Voice synthesizing device and recording medium

Info

Publication number: JP2001265374A
Application number: JP2000071150A
Authority: JP
Inventors: Masaru Otani; 賢大谷; Yumi Tsutsumi; ゆみ堤; Toshiyuki Sano; 敏幸佐野
Original assignee: Omron Corp; Omron Tateisi Electronics Co
Current assignee: Omron Corp
Priority date: 2000-03-14
Filing date: 2000-03-14
Publication date: 2001-09-28
Anticipated expiration: 2020-03-14
Also published as: JP3785892B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice synthesizing device which can generate a voice message with natural voice quality and rhythm (prosody). SOLUTION: For a database selected from the voice synthesis databases 2A, 2B, ..., a database feature quantity calculation part 4 calculates the feature quantities of its voice waveforms, and a Fujisaki model parameter calculation part 5 then calculates the parameters of a Fujisaki model. A target rhythm generation part 8 generates a target rhythm using the parameters of the Fujisaki model. A phoneme connection part 9 converts the input text into synthesized voice by connecting phonemes selected from the selected database 2. A rhythm correction part 11 then compares the rhythm of the synthesized voice with the target rhythm and corrects the synthesized voice so that they match.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech synthesizer and a recording medium for creating a voice message from input text or the like, editing it, and having it spoken.

[0002]

[Background Art] Conventional voice message creation and editing devices use rule-based synthesis or waveform superposition as the speech synthesis method. Synthesized speech created by these methods has natural prosody, but its voice quality (the degree to which it sounds like a human voice) is electronic and unnatural. In addition, these methods make it difficult to change the voice quality of the synthesized speech, so voices cannot be differentiated or individualized.

[0003] To solve the above problems, a speech-unit concatenation type speech synthesis method (Japanese Patent Laid-Open No. Hei 10-49193) has been proposed. In this method, a speaker (model) reads aloud a standard text in the language to be spoken and the reading is recorded. The recorded waveform of this natural speech is divided into speech units (hereinafter referred to as phonemes), and a phoneme component database covering the phonemes of the language is created. When synthesized speech is output, the phoneme components are recombined to match the text to be read. Using this phoneme component database yields synthesized speech close to the natural voice of the person who read the text.

[0004] Speech synthesized by phoneme concatenation has a natural voice quality close to that of a real voice, and synthesized speech can be created for multiple speakers by changing the phoneme component database used at synthesis time. However, this phoneme concatenation type speech synthesis method has the problem that the prosody of the synthesized speech (the lengths of sounds and the positions of accents) becomes unnatural.

[0005] As means for changing the prosody of speech, there are a speech pitch conversion method (Japanese Patent No. 2612867) and a speech rate conversion method (Japanese Patent No. 2612868). With these methods, the prosody can be changed as desired either by fitting the waveform to be corrected to a teacher waveform or by manually manipulating the waveform to be corrected.

[0006] However, the former method compares a student's recorded speech with a teacher's recorded speech and converts the waveform of the student's recording to match the waveform of the teacher's recording. In applications in which arbitrary given text is synthesized and read aloud electronically, it is impossible to prepare a teacher recording or waveform for every case. Furthermore, when synthesized speech is created in real time in combination with speech recognition or the like, it is also impossible to correct the prosody by manipulating the waveform manually.

[0007] This is the current state of speech synthesis: in the speech synthesis used in voice message creation and editing devices, text-to-speech devices, and the like, it has been extremely difficult to achieve both natural voice quality and natural prosody.

[0008] In addition, when phoneme component databases for multiple speakers are used, switching speakers can make the synthesized speech lose its smoothness or sound unnatural, because of the individuality or characteristics of each speaker.

[0009]

[Disclosure of the Invention] The present invention has been made in view of the above prior art, and its object is to provide a speech synthesizer that can create voice messages with natural voice quality and smoothness, and in particular with natural voice quality and prosody.

[0010] A speech synthesizer according to the present invention comprises: a speech synthesis database storing phoneme waveform data for a plurality of speakers; means for extracting features from the phoneme waveform data of each speaker stored in the speech synthesis database; phoneme combining means for selecting and connecting phoneme waveforms corresponding to target characters from the speech synthesis database; and means for correcting the synthesized speech generated by the phoneme combining means, using the features corresponding to the selected speaker.

[0011] Synthesized speech generated by connecting phonemes is characterized by natural voice quality. In addition, this speech synthesizer extracts features corresponding to the speaker and uses them to correct the synthesized speech. Consequently, when synthesized speech is created while switching among multiple speakers, the synthesized speech is corrected according to the features of the selected speaker's phoneme waveforms even when the device switches to a different speaker's speech synthesis database, so synthesized speech can be generated with natural voice quality and smoothness.

[0012] In an embodiment of the present invention, the correcting means corrects the prosody of the synthesized speech using the features corresponding to the selected speaker. Correcting the prosody of the synthesized speech in this way makes it possible to synthesize speech with natural voice quality and prosody for the selected speaker.

[0013] In particular, a desirable embodiment of the present invention comprises: a speech synthesis database storing phoneme waveform data for a plurality of speakers; parameter determining means for determining the parameters of an utterance model from the features of each speaker in the speech synthesis database; means for generating a target prosody for speech synthesis using the parameters determined by the parameter determining means; and phoneme combining means for selecting and connecting phoneme waveforms corresponding to target characters from the speech synthesis database based on the target prosody generated by the target prosody generating means; wherein the prosody of the synthesized speech is compared with the target prosody and adjusted to match it.

[0014] A recording medium according to the present invention comprises: means for extracting features from the phoneme waveform data of a speaker stored in a speech synthesis database; phoneme combining means for selecting and connecting phoneme waveforms corresponding to target characters from the speech synthesis database; and means for correcting the synthesized speech generated by the phoneme combining means, using the features corresponding to the selected speaker. Such a recording medium is usually realized in the form of a CD, DVD, MO, or the like.

[0015] Executing the program stored on such a recording medium extracts features corresponding to the speaker and uses them to correct the synthesized speech produced by phoneme concatenation. Synthesized speech can therefore be generated with natural voice quality and smoothness even when the device switches to a different speaker's speech synthesis database.

[0016] With this recording medium as well, if the correcting means corrects the prosody of the synthesized speech using the features corresponding to the selected speaker, speech can be synthesized with natural voice quality and prosody for the selected speaker.

[0017] In particular, a desirable embodiment of this recording medium comprises: parameter determining means for determining the parameters of an utterance model from the features of a speaker in the speech synthesis database; means for generating a target prosody for speech synthesis using the parameters determined by the parameter determining means; and phoneme combining means for selecting and connecting phoneme waveforms corresponding to target characters from the speech synthesis database based on the target prosody generated by the target prosody generating means; wherein the prosody of the synthesized speech is compared with the target prosody and adjusted to match it.

[0018]

[Embodiments of the Invention] FIG. 1 shows the configuration of a speech synthesizer 1 according to an embodiment of the present invention. The speech synthesizer 1 comprises speech synthesis databases 2A, 2B, ..., a speaker database selection unit 3, a database feature calculation unit 4, a Fujisaki model parameter calculation unit 5, a text reading unit 6, a phoneme string conversion unit 7, a target prosody generation unit 8, a phoneme combining unit 9, a prosody display unit 10, a prosody correction unit 11, a synthesized speech output unit 12, and a storage unit 13. A prosody operation unit 14 is provided as needed. The speech synthesizer 1 is configured as a program that runs on a computer and is stored on a recording medium such as a CD, as a computer system in which the program is stored on a recording medium such as a CD or hard disk, or as a dedicated device. FIG. 2(a) shows the screen display (user interface) on the computer. The display screen (parent window) 21 contains a text display window 22 that displays the input text, a phoneme notation display window 23 that displays the phoneme notation converted from the text, a prosody display window 24, and, within an operation panel 25, a combo box 26 for speaker selection, an utterance button 27, a waveform save button 28, and so on. FIG. 2(b) shows the combo box 26 for speaker selection in its opened state. The speech synthesizer 1 is described below with reference to FIG. 1 and FIGS. 2(a) and 2(b).

[0019] The speech synthesis databases 2A, 2B, ... hold the speech waveforms from which speech is synthesized, the phoneme character data corresponding to each speech waveform and its content, the segment for each phoneme, and the prosodic feature parameters of each phoneme. They are stored in advance on a storage medium such as a hard disk. A separate speech synthesis database 2A, 2B, ... is prepared for each speaker.

[0020] The speaker database selection unit 3 selects the speaker to be used for speech synthesis from the registered speech synthesis databases 2A, 2B, .... As shown in FIG. 2(b), various types of speakers are registered in the combo box 26 for speaker selection. When the combo box on the screen is opened and a speaker is selected, the speaker database selection unit 3 selects that speaker's database from among the speech synthesis databases 2A, 2B, ....

[0021] When one speaker's database is selected or designated from the speech synthesis databases 2A, 2B, ... (hereinafter, the selected speech synthesis database may be referred to as the selected database 2), the database feature calculation unit 4 calculates feature quantities for the speech waveforms of the selected database 2. Specifically, for the voiced sections of the speech waveforms extracted from the selected database 2, it calculates the fundamental frequency of each window by taking the autocorrelation with a fixed window width, and then calculates the mean (fo)mean and the standard deviation (fo)std of the per-window fundamental frequencies as the feature quantities.
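The feature calculation in [0021] can be illustrated with a minimal sketch. The patent does not give the window width, the pitch search range, or whether the standard deviation is the population or sample form, so `window_size`, `f_min`, `f_max`, and the use of `statistics.pstdev` below are assumptions, and the function names are hypothetical.

```python
import statistics


def estimate_f0(window, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) of one window by autocorrelation.

    Returns None when no lag in the search range has a positive correlation.
    The search range f_min..f_max is an assumed value, not from the patent.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(window) - 1)
    best_lag, best_corr = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(window[i] * window[i + lag] for i in range(len(window) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag if best_lag else None


def f0_statistics(voiced_samples, sample_rate, window_size=400):
    """Return (mean, std) of the per-window F0 over one voiced section."""
    f0_values = []
    for start in range(0, len(voiced_samples) - window_size + 1, window_size):
        f0 = estimate_f0(voiced_samples[start:start + window_size], sample_rate)
        if f0 is not None:
            f0_values.append(f0)
    # statistics.mean raises StatisticsError if no window yielded an estimate.
    return statistics.mean(f0_values), statistics.pstdev(f0_values)
```

The raw (unnormalized) autocorrelation is used here only for brevity; a practical pitch tracker would typically normalize it and guard against octave errors.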

[0022] The Fujisaki model parameter calculation unit 5 calculates the Fujisaki model parameters for each speaker using the speech frequency feature quantities of the speaker stored in the selected database 2. The Fujisaki model is a model of the fundamental frequency pattern of speech when a Japanese declarative sentence is read aloud, and is expressed by the following Equation (1). Here, Api and Aaj are coefficients, where the subscript p denotes the phrase component and the subscript a denotes the accent component.

[0023]

[Equation 1]
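The equation itself is not reproduced in this text. For orientation only, the form of the Fujisaki model commonly given in the speech literature is sketched below; it is an assumption that the patent's Equations (1) to (3) follow this form. The command timings T_{0i}, T_{1j}, T_{2j} and the constants α, β, γ are symbols from the literature and do not appear in this text. The baseline term is written here as F_min without a logarithm because Equation (4) of this document defines F_min as a logarithm; the literature usually writes this term as ln F_b.

```latex
\begin{align*}
\ln F_0(t) &= F_{\min}
  + \sum_{i=1}^{I} A_{pi}\, G_p(t - T_{0i})
  + \sum_{j=1}^{J} A_{aj}\,\bigl\{ G_a(t - T_{1j}) - G_a(t - T_{2j}) \bigr\} \\
G_p(t) &= \begin{cases} \alpha^2 t\, e^{-\alpha t}, & t \ge 0 \\ 0, & t < 0 \end{cases} \\
G_a(t) &= \begin{cases} \min\bigl[\, 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \,\bigr], & t \ge 0 \\ 0, & t < 0 \end{cases}
\end{align*}
```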

[0024] The Fujisaki model parameter calculation unit 5 determines the speaker-dependent parameters of this model using the feature quantities of each speaker's speech synthesis database (selected database 2) obtained by the database feature calculation unit 4. In Equation (1) of the Fujisaki model, Api, Aaj, and Fmin are the speaker-dependent parameters. To achieve natural synthesized speech, the values of these parameters must be changed for each speech synthesis database 2A, 2B, ... to suit the characteristics of the speech in the selected database 2. The Fujisaki model parameter calculation unit 5 therefore determines values of the three parameters Api, Aaj, and Fmin that suit the speaker's feature quantities.

[0025] That is, when a speaker is selected, the Fujisaki model parameter calculation unit 5 uses the fundamental frequency mean (fo)mean and standard deviation (fo)std of the selected speech synthesis database 2 to calculate the Fujisaki model parameters suited to that database 2 from the following Equations (4) to (6):

Fmin = ln{(fo)mean − (fo)std}   ... (4)
Api = 0.3{−0.42 ln[(fo)mean] + 0.42 ln[(fo)std]} + 1.1   ... (5)
Aaj = 0.7{−0.42 ln[(fo)mean] + 0.42 ln[(fo)std]} + 1.1   ... (6)

In this way, the characteristics of the speaker of the database selected from the speech synthesis databases 2A, 2B, ... are modeled in the form of a Fujisaki model.
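As a worked illustration of Equations (4) to (6), the following sketch maps a speaker's F0 mean and standard deviation to the three parameters. The function name and the input validation are assumptions added for the example; only the three formulas come from the text.

```python
import math


def fujisaki_speaker_parameters(f0_mean, f0_std):
    """Apply Equations (4)-(6) to the F0 mean and standard deviation (Hz).

    Returns (f_min, a_p, a_a): the log-domain baseline, the phrase
    amplitude Api, and the accent amplitude Aaj.
    """
    if f0_std <= 0 or f0_mean <= f0_std:
        # The logarithms in (4)-(6) are only defined in this range.
        raise ValueError("need 0 < f0_std < f0_mean")
    spread = -0.42 * math.log(f0_mean) + 0.42 * math.log(f0_std)
    f_min = math.log(f0_mean - f0_std)   # Equation (4)
    a_p = 0.3 * spread + 1.1             # Equation (5)
    a_a = 0.7 * spread + 1.1             # Equation (6)
    return f_min, a_p, a_a
```

Because the standard deviation is smaller than the mean, the bracketed term is negative, so both amplitudes come out below 1.1 and grow as the speaker's pitch variation grows relative to the mean.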

[0026] The text reading unit 6 reads the original sentence (text) to be synthesized and spoken into the memory of the computer. The text may be entered from the keyboard of a personal computer or sent over a line such as the Internet; in either case it is read into the computer's memory. The text that has been read is displayed in the text display window 22 as mixed kana-kanji text.

[0027] The mixed kana-kanji text input from the text reading unit 6 is converted by the phoneme string conversion unit 7 into a phoneme string for synthesis and displayed in the phoneme notation display window 23 as accented phoneme notation. When the utterance button 27 on the operation panel 25 is pressed, the speech synthesis processing is executed.

[0028] The target prosody generation unit 8 generates the prosody that serves as the target during speech synthesis by applying the Fujisaki model with the parameters Api, Aaj, and Fmin determined by the Fujisaki model parameter calculation unit 5. The target prosody consists of a speech fundamental frequency pattern and phoneme durations, and the target prosody generation unit 8 accordingly consists of a process that generates the speech fundamental frequency pattern and a process that calculates the duration of each phoneme. In the process that generates the fundamental frequency pattern, the values of the three parameters Api, Aaj, and Fmin determined by the Fujisaki model parameter calculation unit 5 are applied to Equation (1) of the Fujisaki model to generate the speech fundamental frequency pattern used as the target during synthesis. In the process that calculates phoneme durations, the duration of each phoneme to be synthesized is calculated using the mean duration of each phoneme in each speaker's speech database.
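The two processes described in [0028] can be sketched as follows. This assumes the commonly cited Fujisaki response functions; the constants `alpha`, `beta`, `gamma`, the command timings (`phrase_onsets`, `accent_spans`), and the sampling `step` are illustrative assumptions that the text does not specify, and all function names are hypothetical.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase component, Gp(t) (assumed form)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent component, Ga(t), capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def target_f0_contour(f_min, a_p, a_a, phrase_onsets, accent_spans,
                      duration, step=0.01):
    """Return (time_s, f0_hz) pairs for the target fundamental frequency pattern.

    f_min is the log-domain baseline from Equation (4). The same a_p and a_a
    are applied to every phrase and accent command, which is an assumption.
    """
    contour = []
    for k in range(int(round(duration / step)) + 1):
        t = k * step
        log_f0 = f_min
        log_f0 += sum(a_p * phrase_response(t - t0) for t0 in phrase_onsets)
        log_f0 += sum(a_a * (accent_response(t - t1) - accent_response(t - t2))
                      for t1, t2 in accent_spans)
        contour.append((t, math.exp(log_f0)))
    return contour


def target_durations(phonemes, mean_duration):
    """Look up each phoneme's target duration from per-speaker averages."""
    return [mean_duration[p] for p in phonemes]
```

For example, `target_f0_contour(f_min, a_p, a_a, [0.0], [(0.2, 0.5)], 1.0)` would produce a one-second contour with one phrase command at 0 s and one accent command from 0.2 s to 0.5 s.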

[0029] For the text to be synthesized, input from a keyboard or the like, the phoneme combining unit 9 selects phonemes (phoneme waveforms) from the selected database 2 based on the target prosody (fundamental frequency pattern and phoneme durations) generated by the target prosody generation unit 8, and combines them to create synthesized speech. The prosody of the synthesized speech created in this way and the target prosody are displayed graphically in the prosody display window 24 by the prosody display unit 10.
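A minimal sketch of prosody-guided selection and concatenation follows. The patent does not state how closeness to the target prosody is measured, so the cost function, the `duration_weight` value, and the `PhonemeUnit` record are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PhonemeUnit:
    label: str        # phoneme symbol, e.g. "a"
    mean_f0: float    # average F0 of the stored waveform, Hz
    duration: float   # length of the stored waveform, seconds
    samples: list     # waveform samples


def select_unit(candidates, target_f0, target_duration, duration_weight=100.0):
    """Pick the candidate whose F0 and duration are closest to the target."""
    def cost(unit):
        return (abs(unit.mean_f0 - target_f0)
                + duration_weight * abs(unit.duration - target_duration))
    return min(candidates, key=cost)


def synthesize(phonemes, targets, database):
    """Select one unit per phoneme and concatenate the waveforms.

    targets holds one (target_f0, target_duration) pair per phoneme;
    database maps a phoneme label to its list of candidate units.
    """
    selected, waveform = [], []
    for label, (target_f0, target_duration) in zip(phonemes, targets):
        unit = select_unit(database[label], target_f0, target_duration)
        selected.append(unit)
        waveform.extend(unit.samples)
    return selected, waveform
```

Plain concatenation is used here for brevity; smoothing at the joins is outside the scope of this sketch.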

[0030] The prosody correction unit 11 compares the prosody (speech fundamental frequency and phoneme durations) of the synthesized speech generated by the phoneme combining unit 9 with the target prosody calculated from the Fujisaki model. Where they do not match, it lengthens or shortens each phoneme of the synthesized speech waveform so that the waveform is fitted to the target values.
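The text does not say how a phoneme is lengthened or shortened. One simple possibility, shown here purely as an assumption, is to repeat or drop whole pitch periods of a voiced phoneme, which changes its duration without resampling. The sketch covers duration only (not fundamental frequency), and the function names and the `tolerance` parameter are hypothetical.

```python
def match_duration(samples, period, target_length):
    """Lengthen or shorten a voiced phoneme by repeating or dropping pitch periods.

    samples: waveform of one phoneme; period: pitch period in samples;
    target_length: desired length in samples. A trailing partial period
    in the input is discarded.
    """
    periods = [samples[i:i + period]
               for i in range(0, len(samples) - period + 1, period)]
    if not periods:
        return list(samples)
    wanted = max(1, round(target_length / period))
    out = []
    for k in range(wanted):
        # Map output period k onto a source period, spreading repeats evenly.
        src = min(len(periods) - 1, k * len(periods) // wanted)
        out.extend(periods[src])
    return out


def correct_prosody(units, target_lengths, tolerance=0):
    """Adjust each (samples, period) unit whose length differs from its target."""
    corrected = []
    for (samples, period), target in zip(units, target_lengths):
        if abs(len(samples) - target) > tolerance:
            samples = match_duration(samples, period, target)
        corrected.append(samples)
    return corrected
```

A production system would more likely use an overlap-add technique to avoid discontinuities at period boundaries; this sketch only shows the compare-then-adjust structure.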

[0031] The synthesized speech output unit 12 outputs the synthesized speech created in this way. For example, the output synthesized speech is amplified and converted to sound by a speaker or the like, or is saved to the storage unit 13 or a suitable recording medium.

[0032] When the waveform save button 28 on the operation panel 25 is pressed, or automatically, the synthesized speech created by the phoneme combining unit 9 or the synthesized speech corrected by the prosody correction unit 11 is saved to the storage unit 13 as a speech waveform. When the prosody operation unit 14 is provided, the prosody can be changed as desired by manipulating the target prosody displayed in the prosody display window 24 or the prosody of the phonemes actually selected.

[0033] Thus, with this speech synthesizer 1, when synthesized speech is created for multiple speakers, the feature quantities of the selected speaker can be extracted and modeled with the Fujisaki model. After the phonemes extracted from the speech synthesis database are combined to create synthesized speech, the prosody of the synthesized speech can be corrected using the prosody of this model. The speech synthesizer 1 can therefore create voice messages with natural prosody while preserving the voice quality of each speech synthesis database.

[0034]

[Effects of the Invention] According to the speech synthesizer and recording medium of the present invention, smooth synthesized speech with natural voice quality can be obtained. In particular, speech can be produced with natural voice quality and prosody.

[Brief description of the drawings]

FIG. 1 is a diagram showing the configuration of a speech synthesizer according to an embodiment of the present invention.

FIG. 2(a) is a diagram showing the display screen of the speech synthesizer, and FIG. 2(b) is a diagram showing the state in which its combo box for speaker selection is opened.

[Explanation of symbols]

2A, 2B, ...: Speech synthesis databases
3: Speaker database selection unit
4: Database feature calculation unit
5: Fujisaki model parameter calculation unit
8: Target prosody generation unit
9: Phoneme combining unit
11: Prosody correction unit
22: Text display window
23: Phoneme notation display window
24: Prosody display window

Continued from the front page: (72) Inventor: Toshiyuki Sano, c/o Omron Corporation, 10 Hanazono Tsuchido-cho, Ukyo-ku, Kyoto-shi, Kyoto. F-terms (reference): 5D045 AA01 AA09 9A001 DD15 HH18 KK54

Claims

[Claims]

1. A speech synthesizer comprising: a speech synthesis database storing phoneme waveform data for a plurality of speakers; means for extracting features from the phoneme waveform data of each speaker stored in the speech synthesis database; phoneme combining means for selecting and connecting a phoneme waveform corresponding to a target character from the speech synthesis database; and means for correcting the synthesized speech generated by the phoneme combining means using the feature corresponding to the selected speaker.

2. The speech synthesizer according to claim 1, wherein the correction means corrects the prosody of the synthesized speech using the feature according to the selected speaker.

3. A recording medium comprising: means for extracting features from a speaker's phoneme waveform data stored in a speech synthesis database; phoneme combining means for selecting and connecting a phoneme waveform corresponding to a target character from the speech synthesis database; and means for correcting the synthesized speech generated by the phoneme combining means using the feature corresponding to the selected speaker.