JP5949634B2 - Speech synthesis system and speech synthesis method

Info

Publication number: JP5949634B2
Application number: JP2013071951A
Authority: JP
Inventors: 典昭阿瀬見
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2013-03-29
Filing date: 2013-03-29
Publication date: 2016-07-13
Anticipated expiration: 2033-03-29
Also published as: JP2014197072A

Description

本発明は、音声合成システム、及び音声合成方法に関する。 The present invention relates to a speech synthesis system and a speech synthesis method.

従来、周知の音声合成技術を用いて、入力された文章データを読み上げる音声合成装置が知られている（特許文献１参照）。
この特許文献１に記載された音声合成装置では、入力された文章データによって表されたテキストを解析し、その解析結果として属性情報を導出する。そして、属性情報と予め対応付けられた韻律パラメータに、上記解析結果である属性情報を照合し、類似度が基準値以上となる属性情報と対応付けられた韻律パラメータを用いて音声合成を実行する。 Conventionally, speech synthesizers that read input text data aloud using well-known speech synthesis techniques have been known (see Patent Document 1).
The speech synthesizer described in Patent Document 1 analyzes the text represented by the input text data and derives attribute information as the analysis result. It then compares that attribute information against the attribute information associated in advance with prosodic parameters, and performs speech synthesis using the prosodic parameters associated with attribute information whose similarity is equal to or higher than a reference value.

なお、特許文献１に記載された属性情報とは、文の構造を表す情報であり、例えば、モーラ数、アクセント型、品詞などの情報である。 The attribute information described in Patent Document 1 is information representing the structure of a sentence, and is information such as the number of mora, accent type, part of speech, etc., for example.

特開２０００−０５６７８８号公報 JP 2000-056788 A

ところで、音声合成装置においては、音声合成によってテキストを読み上げた合成音に対して、当該テキストの内容に適した表情を付与することが求められている。
しかしながら、特許文献１に記載された音声合成装置では、音声合成に用いる韻律データを、文構造を表す属性情報に従って特定しているため、音声合成によってテキストを読上げた合成音に、当該テキストの内容に適した表情を付与できないという課題がある。 Meanwhile, speech synthesizers are required to give the synthesized speech that reads a text aloud an expression suited to the content of that text.
However, the speech synthesizer described in Patent Document 1 selects the prosodic data used for speech synthesis according to attribute information representing sentence structure. It therefore cannot give the synthesized speech that reads a text aloud an expression suited to the content of that text.

そこで、本発明は、音声合成によって文章データを読上げた合成音を出力する際に、当該合成音に適切な表情を付与することを目的とする。 Accordingly, an object of the present invention is to give an appropriate expression to the synthesized sound that is output when text data is read aloud by speech synthesis.

上記目的を達成するためになされた本発明の音声合成システムは、文章取得手段と、文章解析手段と、音源解析手段と、パラメータ補正手段と、音声合成手段とを備えている。
このうち、文書取得手段は、指定された文章を構成する文字列を表す文章データを取得し、文章解析手段は、文章取得手段で取得された文章データによって表される文章を解析し、当該文章にて出現する各種類の表情の分布度合いを表すテキスト表情分布を導出する。 The speech synthesis system of the present invention made to achieve the above object comprises a sentence acquisition means, a sentence analysis means, a sound source analysis means, a parameter correction means, and a speech synthesis means.
Of these, the sentence acquisition means acquires sentence data representing the character string that constitutes a designated sentence. The sentence analysis means analyzes the sentence represented by the sentence data acquired by the sentence acquisition means and derives a text expression distribution representing the degree of distribution of each type of expression appearing in that sentence.

さらに、複数種類の表情が出現する内容の文章として規定された規定内容文について発声したときの各表情を表す表情データと、規定内容文にて各表情が出現する部分について発声された音の少なくとも一つの音声パラメータとを、表情の種類ごと、かつ、発声者ごとに対応付けたデータを音源データとし、音源解析手段は、音源データが格納された記憶装置から、指定された発声者に対応する音源データである指定音源データを取得して解析し、指定音源データに含まれる音声パラメータによって表される音声にて表出する各種類の表情の分布度合いを表す音源表情分布を導出する。 Further, sound source data is defined as data that associates, for each type of expression and for each speaker, expression data representing each expression produced when a prescribed content sentence (a sentence defined as containing multiple types of expressions) is uttered with at least one voice parameter of the sound uttered for the portion of the prescribed content sentence where that expression appears. The sound source analysis means acquires, from a storage device storing the sound source data, designated sound source data, which is the sound source data corresponding to a designated speaker, analyzes it, and derives a sound source expression distribution representing the degree of distribution of each type of expression conveyed by the voice represented by the voice parameters included in the designated sound source data.

そして、パラメータ補正手段は、文章解析手段にて導出されたテキスト表情分布に、音源解析手段にて導出された音源表情分布が合致するように、指定音源データに含まれる音声パラメータを補正した補正パラメータを導出する。さらに、音声合成手段は、パラメータ補正手段で導出された補正パラメータに基づいて、文章取得手段で取得した文章データによって表される文章の音声合成を実行する。 The parameter correction means derives correction parameters by correcting the voice parameters included in the designated sound source data so that the sound source expression distribution derived by the sound source analysis means matches the text expression distribution derived by the sentence analysis means. The speech synthesis means then performs speech synthesis of the sentence represented by the sentence data acquired by the sentence acquisition means, based on the correction parameters derived by the parameter correction means.

このような音声合成システムによれば、音源表情分布がテキスト表情分布に合致するように、当該音源表情分布に対応する音声パラメータを補正して音声合成を実行するため、指定された文章データに適した表情を、その音声合成による合成音に付与できる。 According to such a speech synthesis system, speech synthesis is performed after correcting the voice parameters corresponding to the sound source expression distribution so that this distribution matches the text expression distribution. An expression suited to the designated sentence data can therefore be given to the synthesized sound.

なお、本発明における「表情」とは、少なくとも、感情や情緒、情景、状況を含む概念である。
本発明の音声合成システムにおける音源解析手段は、表情が中立状態であることを表す表情データと対応付けられた音声パラメータを基準パラメータとし、指定音源データに含まれる音声パラメータによって表される音声にて表出する各表情の強さを、基準パラメータからのベクトルで表した表情差分ベクトルを表情の種類ごとに導出し、全ての表情差分ベクトルのスカラー量の最大値が１となるように、表情差分ベクトルを正規化した結果を音源表情分布として導出しても良い。 The term “expression” in the present invention is a concept that includes at least emotions, moods, scenes, and situations.
The sound source analysis means in the speech synthesis system of the present invention may take, as a reference parameter, the voice parameter associated with expression data indicating that the expression is neutral. For each type of expression, it may derive an expression difference vector that represents, as a vector from the reference parameter, the strength of that expression in the voice represented by the voice parameters included in the designated sound source data. It may then derive, as the sound source expression distribution, the result of normalizing the expression difference vectors so that the maximum of their scalar quantities (magnitudes) is 1.

この場合、本発明におけるパラメータ補正手段では、均一差分導出手段が、表情差分ベクトルそれぞれを音源表情分布にて除した均一差分ベクトルを導出し、表情反映手段が、文章解析手段にて導出されたテキスト表情分布を均一差分導出手段で導出された均一差分ベクトルそれぞれに乗じた結果に、基準パラメータを加えることで、補正パラメータを導出する。 In this case, in the parameter correction means of the present invention, a uniform difference deriving means derives uniform difference vectors by dividing each expression difference vector by the sound source expression distribution. An expression reflecting means then derives the correction parameters by multiplying each uniform difference vector by the text expression distribution derived by the sentence analysis means and adding the reference parameter to the result.

このような音声合成システムによれば、テキスト表情分布に音源表情分布が合致するように補正した補正パラメータを導出することができる。
なお、本発明は、音声合成方法としてなされていても良い。 According to such a speech synthesis system, it is possible to derive a correction parameter corrected so that the sound source expression distribution matches the text expression distribution.
Note that the present invention may be implemented as a speech synthesis method.

この場合、本発明の音声合成方法では、文章データを取得する文章取得手順と、テキスト表情分布を導出する文章解析手順と、音源表情分布を導出する音源解析手順と、補正パラメータを導出するパラメータ補正手順と、補正パラメータに基づいて、文章取得手順で取得した文章データによって表される文章の音声合成を実行する音声合成手順とを備えている必要がある。 In this case, the speech synthesis method of the present invention needs to include a sentence acquisition procedure for acquiring sentence data, a sentence analysis procedure for deriving the text expression distribution, a sound source analysis procedure for deriving the sound source expression distribution, a parameter correction procedure for deriving the correction parameters, and a speech synthesis procedure for performing, based on the correction parameters, speech synthesis of the sentence represented by the sentence data acquired in the sentence acquisition procedure.

このような音声合成方法を実行すれば、請求項１に係る音声合成システムと同様の効果を得ることができる。 By executing such a speech synthesis method, the same effect as the speech synthesis system according to claim 1 can be obtained.

音源合成システムの概略構成を示すブロック図である。 A block diagram showing the schematic configuration of the sound source synthesis system.
音源データ登録処理の処理手順を示すフローチャートである。 A flowchart showing the procedure of the sound source data registration process.
音源合成処理の処理手順を示すフローチャートである。 A flowchart showing the procedure of the sound source synthesis process.
音源合成処理の処理概要を示す説明図である。 An explanatory diagram showing an outline of the sound source synthesis process.
音源合成処理の処理概要を示す説明図である。 An explanatory diagram showing an outline of the sound source synthesis process.

以下に本発明の実施形態を図面と共に説明する。
〈音声合成システム〉
図１に示す音声合成システム１は、ユーザが指定した文章データＷＴの内容を読み上げるシステムであり、情報処理サーバ１０と、少なくとも一つの音声出力端末６０とを備えている。 Embodiments of the present invention will be described below with reference to the drawings.
<Speech synthesis system>
A speech synthesis system 1 shown in FIG. 1 is a system that reads out the contents of text data WT designated by a user, and includes an information processing server 10 and at least one speech output terminal 60.

この音声合成システム１の情報処理サーバ１０は、音声出力端末６０のユーザが指定した文章データＷＴ、及び音声出力端末６０のユーザが指定した音源データＳＤを解析し、文章データＷＴの解析結果に音源データＳＤの解析結果が一致するように当該音源データＳＤを補正する。さらに、音声合成システム１では、その補正された音源データＳＤに基づいて、音声出力端末６０が音声合成を実行して、指定された文章データＷＴに対応する内容の合成音を生成し、音声出力端末６０から出力することで、文章データＷＴの内容を読み上げる。
〈音声出力端末〉
音声出力端末６０は、通信部６１と、情報受付部６２と、表示部６３と、音入力部６４と、音出力部６５と、記憶部６６と、制御部７０とを備えている。本実施形態における音声出力端末６０として、例えば、周知の携帯端末を想定しても良いし、いわゆるパーソナルコンピュータといった周知の情報処理装置を想定しても良い。なお、携帯端末には、周知の電子書籍端末や、携帯電話、タブレット端末などの携帯情報端末を含む。 The information processing server 10 of the speech synthesis system 1 analyzes the text data WT and the sound source data SD designated by the user of the speech output terminal 60, and corrects the sound source data SD so that its analysis result matches the analysis result of the text data WT. Based on the corrected sound source data SD, the speech output terminal 60 then performs speech synthesis to generate a synthesized sound whose content corresponds to the designated text data WT, and reads the text data WT aloud by outputting that sound.
<Audio output terminal>
The audio output terminal 60 includes a communication unit 61, an information receiving unit 62, a display unit 63, a sound input unit 64, a sound output unit 65, a storage unit 66, and a control unit 70. As the audio output terminal 60 in the present embodiment, for example, a known portable terminal may be assumed, or a known information processing apparatus such as a so-called personal computer may be assumed. Note that portable terminals include well-known electronic book terminals, and portable information terminals such as mobile phones and tablet terminals.

通信部６１は、通信網を介して音声出力端末６０が外部との間で情報通信を行う。情報受付部６２は、入力装置（図示せず）を介して入力された情報を受け付ける。表示部６３は、制御部７０からの信号に基づいて画像を表示する。 The communication unit 61 allows the audio output terminal 60 to exchange information with external devices via a communication network. The information receiving unit 62 receives information input via an input device (not shown). The display unit 63 displays an image based on a signal from the control unit 70.

音入力部６４は、音を電気信号に変換して制御部７０に入力する装置であり、例えば、マイクロホンである。音出力部６５は、音を出力する周知の装置であり、例えば、ＰＣＭ音源と、スピーカとを備えている。記憶部６６は、記憶内容を読み書き可能に構成された不揮発性の記憶装置である。記憶部６６には、各種処理プログラムや各種データが記憶される。 The sound input unit 64 is a device that converts sound into an electric signal and inputs the electric signal to the control unit 70, and is, for example, a microphone. The sound output unit 65 is a known device that outputs sound, and includes, for example, a PCM sound source and a speaker. The storage unit 66 is a non-volatile storage device configured to be able to read and write stored contents. The storage unit 66 stores various processing programs and various data.

また、制御部７０は、ＲＯＭ７２、ＲＡＭ７４、ＣＰＵ７６を少なくとも有した周知のコンピュータを中心に構成されている。
すなわち、音声出力端末６０は、情報受付部６２にて受け付けた情報を、通信部６１を介して情報処理サーバ１０に送信し、情報処理サーバ１０にて合成された合成音を受信して音出力部６５から出力する。
〈情報処理サーバ〉
情報処理サーバ１０は、通信部１２と、制御部２０と、記憶部３０とを備え、少なくとも、文章を構成する文字列を表す文章データＷＴと、予め入力された音声の音声特徴量を少なくとも含む音源データＳＤとが格納されたサーバである。 The control unit 70 is configured around a known computer having at least a ROM 72, a RAM 74, and a CPU 76.
That is, the audio output terminal 60 transmits the information received by the information receiving unit 62 to the information processing server 10 via the communication unit 61, receives the synthesized sound generated by the information processing server 10, and outputs it from the sound output unit 65.
<Information processing server>
The information processing server 10 includes a communication unit 12, a control unit 20, and a storage unit 30. It is a server that stores at least sentence data WT, which represents the character strings constituting sentences, and sound source data SD, which contains at least the voice feature quantities of speech input in advance.

通信部１２は、通信網を介して、情報処理サーバ１０が外部との間で通信を行う。本実施形態における通信網とは、例えば、公衆無線通信網やネットワーク回線である。
制御部２０は、電源が切断されても記憶内容を保持する必要がある処理プログラムやデータを格納するＲＯＭ２２と、処理プログラムやデータを一時的に格納するＲＡＭ２４と、ＲＯＭ２２やＲＡＭ２４に記憶された処理プログラムに従って各種処理を実行するＣＰＵ２６とを少なくとも有した周知のコンピュータを中心に構成されている。この制御部２０は、通信部１２や記憶部３０を制御する。 In the communication unit 12, the information processing server 10 communicates with the outside through a communication network. The communication network in this embodiment is, for example, a public wireless communication network or a network line.
The control unit 20 is built around a well-known computer having at least a ROM 22 that stores processing programs and data whose contents must be retained even when the power is turned off, a RAM 24 that temporarily stores processing programs and data, and a CPU 26 that executes various processes according to the processing programs stored in the ROM 22 and the RAM 24. The control unit 20 controls the communication unit 12 and the storage unit 30.

記憶部３０は、記憶内容を読み書き可能に構成された不揮発性の記憶装置である。この記憶装置とは、例えば、ハードディスク装置やフラッシュメモリなどである。記憶部３０には、文章データＷＴと、音源データＳＤとが格納されている。 The storage unit 30 is a non-volatile storage device configured to be able to read and write stored contents. The storage device is, for example, a hard disk device or a flash memory. The storage unit 30 stores text data WT and sound source data SD.

ここでいう文章データＷＴ_kは、例えば、書籍をテキストデータ化したデータであり、書籍ごとに予め用意されている。ここでいう書籍とは、小説などである。また、符号ｋは、「１」以上の整数（自然数）である。 The sentence data WT_k here is, for example, data obtained by converting a book into text data, and is prepared in advance for each book. Books here are novels and the like. The index k is an integer of 1 or more (a natural number).

音源データＳＤは、音声パラメータｓｐｒ_lと、タグデータ（表情データ）ＴＧ_lとを音源ｌ（ｌは、「１」以上の整数）ごとに対応付けたデータである。
音声パラメータｓｐｒは、人が発した音の波形を表す少なくとも一つの特徴量である。この特徴量は、いわゆるフォルマント合成に用いる音声の特徴量であり、発声者ごと、かつ、音素ごとに用意される。音声パラメータｓｐｒにおける特徴量として、発声音声における各音素での基本周波数Ｆ０、メル周波数ケプストラム（ＭＦＣＣ）、音素長、パワー、及びそれらの時間差分を少なくとも備えている。 The sound source data SD is data in which voice parameters spr_l and tag data (expression data) TG_l are associated with each other for each sound source l (l is an integer of 1 or more).
The voice parameter spr is at least one feature quantity representing the waveform of a sound uttered by a person. These feature quantities are those used for so-called formant synthesis, and are prepared for each speaker and for each phoneme. The voice parameter spr includes at least the fundamental frequency F0, the mel-frequency cepstrum (MFCC), the phoneme length, and the power of each phoneme in the uttered speech, together with their time differences.

タグデータＴＧは、音声パラメータｓｐｒによって表される音の性質を表すデータであり、発声者の特徴を表す発声者特徴データと、当該音声が発声されたときの発声者の表情を表す表情データとを少なくとも含む。発声者特徴データには、例えば、発声者の性別、年齢などを含む。また、表情データは、感情や情緒、情景、状況を少なくとも含む表情としての概念を表すデータであり、発声者の表情を推定するために必要な情報を含んでも良い。 The tag data TG is data representing the nature of the sound represented by the voice parameter spr. It includes at least speaker feature data representing the characteristics of the speaker and expression data representing the speaker's expression when the speech was uttered. The speaker feature data includes, for example, the gender and age of the speaker. The expression data represents expression as a concept that includes at least emotions, moods, scenes, and situations, and may include information needed to estimate the speaker's expression.
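The pairing of voice parameters spr with tag data TG described above can be pictured as a small record type. The following is only a minimal sketch: the class names, field names, and expression labels are hypothetical, and the flat list used for spr stands in for whatever feature layout an actual implementation would use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TagData:
    """Tag data TG: speaker feature data plus an expression label (fields are assumed)."""
    speaker_id: str
    gender: str
    age: int
    expression: str  # e.g. "neutral" or "joy"; the label set is an assumption


@dataclass
class SoundSourceEntry:
    """One entry of sound source data SD: voice parameters spr paired with tag data TG."""
    spr: List[float]  # flattened features (F0, MFCC, phoneme length, power, time differences)
    tag: TagData


def index_by_speaker_and_expression(
    entries: List[SoundSourceEntry],
) -> Dict[str, Dict[str, List[float]]]:
    """Build a lookup table: speaker ID -> expression label -> voice parameters."""
    table: Dict[str, Dict[str, List[float]]] = {}
    for entry in entries:
        table.setdefault(entry.tag.speaker_id, {})[entry.tag.expression] = entry.spr
    return table
```

Indexing by speaker and then by expression mirrors the statement that the data is associated per expression type and per speaker.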

これらの音声パラメータｓｐｒとタグデータＴＧとが対応付けられた音源データＳＤは、音源データ登録処理を制御部２０が実行することで生成され、記憶部３０に記憶される。
〈音源登録処理〉
その音源データ登録処理は、起動されると、図２に示すように、文章データＷＴの中で、複数種類の表情が出現する内容の文章として予め規定された規定内容文の文字列を表す発声内容文章データを取得する（Ｓ１１０）。 The sound source data SD in which the sound parameters spr and the tag data TG are associated is generated by the sound source data registration process performed by the control unit 20 and stored in the storage unit 30.
<Sound source registration process>
When the sound source data registration process is started, as shown in FIG. 2, it first acquires utterance content sentence data (S110). This data represents the character string of a prescribed content sentence, that is, a sentence within the sentence data WT that has been defined in advance as containing multiple types of expressions.

続いて、Ｓ１１０にて取得した発声内容文章データに対応する一つの音声波形データを取得する（Ｓ１２０）。この音声波形データは、発声内容文章データによって表される規定内容文について、予め発声された音声波形それぞれを表すデータであり、多様な人物によって予め発声されたものである。 Subsequently, one speech waveform data corresponding to the utterance content sentence data obtained in S110 is obtained (S120). The speech waveform data is data representing each speech waveform uttered in advance with respect to the specified content sentence represented by the utterance content text data, and is uttered in advance by various persons.

さらに、Ｓ１２０にて取得した音声波形データそれぞれから音声パラメータｓｐｒを導出する（Ｓ１３０）。本実施形態のＳ１３０では、基本周波数、メル周波数ケプストラム（ＭＦＣＣ）、パワー、それらの時間差分を、それぞれ、音声パラメータｓｐｒとして導出する。これらの基本周波数、ＭＦＣＣ、パワーの導出方法は、周知であるため、ここでの詳しい説明は省略するが、例えば、基本周波数であれば、時間軸に沿った自己相関、周波数スペクトルの自己相関、またはケプストラム法などの手法を用いて導出すれば良い。また、ＭＦＣＣであれば、時間分析窓ごとに周波数解析（例えば、ＦＦＴ）をした結果について、周波数ごとの大きさを対数化した結果を、さらに、周波数解析することで導出すれば良い。パワーについては、時間分析窓における振幅の二乗した結果を時間方向に積分することで導出すれば良い。 Furthermore, the speech parameter spr is derived from each speech waveform data acquired in S120 (S130). In S130 of the present embodiment, the fundamental frequency, the mel frequency cepstrum (MFCC), the power, and the time difference between them are each derived as a speech parameter spr. Since these fundamental frequency, MFCC, and power derivation methods are well known, detailed description thereof is omitted here. For example, if the fundamental frequency is used, autocorrelation along the time axis, autocorrelation of the frequency spectrum, Alternatively, it may be derived using a method such as a cepstrum method. In the case of the MFCC, the result of frequency analysis (for example, FFT) for each time analysis window may be derived by further frequency analysis of the logarithmic magnitude of each frequency. The power may be derived by integrating the squared result of the amplitude in the time analysis window in the time direction.
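Two of the derivations mentioned for S130 are simple enough to sketch: power as the sum of squared amplitudes over an analysis window, and the fundamental frequency via autocorrelation along the time axis. The function names, the search range of 50 to 500 Hz, and the use of an unnormalized autocorrelation are assumptions for illustration; MFCC extraction is omitted.

```python
import math


def frame_power(frame):
    """Sum of squared amplitudes in one analysis window (a discrete stand-in for the integral)."""
    return sum(x * x for x in frame)


def f0_autocorrelation(frame, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate F0 as sample_rate / lag, where lag maximizes the time-domain autocorrelation."""
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    # Return 0.0 when no positive correlation peak was found (e.g. silence).
    return sample_rate / best_lag if best_lag else 0.0


# Usage: a 200 Hz sine sampled at 8 kHz over a 50 ms window.
sine = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
estimated_f0 = f0_autocorrelation(sine, 8000)
```

A production system would more likely use a windowed, normalized autocorrelation or one of the other methods named in the text.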

続いて、音源データ登録処理では、表情データＴＧを推定する表情データ推定処理を実行する（Ｓ１４０）。この表情データ推定処理では、Ｓ１１０にて取得した発声内容文章データを解析した結果に基づいて、音声波形データによって表現された表情を推定する。 Subsequently, in the sound source data registration processing, facial expression data estimation processing for estimating facial expression data TG is executed (S140). In this facial expression data estimation process, the facial expression expressed by the speech waveform data is estimated based on the result of analyzing the utterance content text data acquired in S110.

ここでいう「発声内容文章データ」の解析とは、例えば、発声内容文章データに対応する文章を形態素解析することで特定した各単語について、単語それぞれに対応する単語表情情報を取得する。ここでいう単語表情情報とは、単語それぞれと、各単語によって表される表情の内容とを予め対応付けた情報であり、単語表情データベースに予め格納されている。そして、取得した単語表情情報に従って、同一内容を表す表情の登場頻度を各表情の内容ごとに集計し、この集計の結果、最も頻度が高い表情の内容を、当該音声波形データによって表された表情として推定すれば良い。 Analysis of the utterance content sentence data here means, for example, acquiring the word expression information corresponding to each word identified by morphological analysis of the sentence corresponding to the utterance content sentence data. The word expression information is information that associates each word in advance with the content of the expression that the word represents, and it is stored in advance in a word expression database. The appearance frequency of expressions with the same content is then counted for each expression content according to the acquired word expression information, and the expression content with the highest frequency is estimated to be the expression conveyed by the speech waveform data.
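The frequency count in S140 can be sketched as follows. The dictionary below is a hypothetical stand-in for the word expression database, the input is assumed to be already split into words by morphological analysis, and the "neutral" fallback for text with no matching words is an assumption not stated in the source.

```python
from collections import Counter

# Hypothetical word expression database: word -> expression label.
WORD_EXPRESSION_DB = {
    "嬉しい": "joy",
    "笑う": "joy",
    "悲しい": "sadness",
    "怒る": "anger",
}


def estimate_expression(words, db=WORD_EXPRESSION_DB, default="neutral"):
    """Return the expression label that occurs most often among the given words."""
    counts = Counter(db[word] for word in words if word in db)
    if not counts:
        return default  # assumed fallback when no word carries an expression
    return counts.most_common(1)[0][0]
```

For example, `estimate_expression(["嬉しい", "笑う", "悲しい"])` counts two "joy" words and one "sadness" word and therefore returns "joy".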

続いて、Ｓ１３０にて導出した音声パラメータｓｐｒと、Ｓ１４０にて推定した表情データＴＧとを対応する音声波形データごとに対応付けることで、音源データＳＤを生成して記憶部３０に格納する音声パラメータ登録を実行する（Ｓ１５０）。なお、本実施形態のＳ１５０にて記憶部３０に格納される音声パラメータｓｐｒと対応付けられるデータは、表情データＴＧに加えて、発声した文章の内容（種類）や、発声者ＩＤ、発声者特徴データを含む（即ち、タグデータＴＧである）。これら発声者ＩＤや発声者特徴データは、情報処理サーバ１０や音声出力端末６０、その他の端末へのログインに用いる情報を発声者ＩＤや発声者特徴データとして取得すれば良い。 Subsequently, voice parameter registration is executed (S150): the voice parameter spr derived in S130 and the expression data TG estimated in S140 are associated with each other for each corresponding piece of speech waveform data, and the resulting sound source data SD is stored in the storage unit 30. In S150 of the present embodiment, the data associated with the voice parameter spr stored in the storage unit 30 includes, in addition to the expression data TG, the content (type) of the uttered sentence, the speaker ID, and the speaker feature data (that is, it is the tag data TG). The speaker ID and speaker feature data may be obtained from the information used to log in to the information processing server 10, the audio output terminal 60, or other terminals.

その後、本音源データ登録処理を終了する。
つまり、本実施形態の音源データ登録処理では、発声内容文章データによって表される規定内容文に対して発声された一つの音声波形データを解析し、音声パラメータｓｐｒを導出する。これと共に、音源データ登録処理では、当該発声内容文章データによって表される規定内容文を解析し、当該音声パラメータｓｐｒにて表現される表情を表す表情データを導出する。 Thereafter, the sound source data registration process is terminated.
That is, in the sound source data registration process of the present embodiment, one speech waveform data uttered with respect to the specified content sentence represented by the utterance content text data is analyzed, and the speech parameter spr is derived. At the same time, in the sound source data registration process, the specified content sentence represented by the utterance content sentence data is analyzed, and facial expression data representing the facial expression represented by the speech parameter spr is derived.

そして、音源データ登録処理では、それらの対応する音声パラメータｓｐｒと表情データとを対応付けることで音源データＳＤを生成し、その音源データＳＤを記憶部３０に記憶する。これにより、記憶部３０には、規定内容文について発声された音声ごとに作成された音源データＳＤが格納される。
〈音声合成処理〉
次に、情報処理サーバ１０の制御部２０が実行する音声合成処理について説明する。 In the sound source data registration process, the sound source data SD is generated by associating the corresponding sound parameters spr with the expression data, and the sound source data SD is stored in the storage unit 30. Thus, the storage unit 30 stores the sound source data SD created for each voice uttered with respect to the specified content sentence.
<Speech synthesis processing>
Next, a speech synthesis process executed by the control unit 20 of the information processing server 10 will be described.

この音声合成処理は、起動されると、図３に示すように、音声出力端末６０にて指定された文章データＷＴを表す文章指定情報を取得する（Ｓ３１０）。続いて、Ｓ３１０にて取得した文章指定情報に対応する文章データ（以下、「指定文章データ」と称す）ＷＴを記憶部３０から取得する（Ｓ３２０）。このＳ３２０にて取得する指定文章データＷＴは、文章を構成する文字列そのもの、即ち、テキストデータである。 When the voice synthesis process is started, as shown in FIG. 3, the text designation information representing the text data WT designated by the voice output terminal 60 is acquired (S310). Subsequently, sentence data (hereinafter referred to as “designated sentence data”) WT corresponding to the sentence designation information obtained in S310 is obtained from the storage unit 30 (S320). The designated sentence data WT acquired in S320 is a character string itself constituting the sentence, that is, text data.

さらに、Ｓ３２０にて取得した指定文章データＷＴをテキスト解析し、指定文章データＷＴによって表される文章中に登場する登場人物ｉと、各登場人物ｉが発声すべきテキストの内容を表す発声テキストとを対応付けた話者テキスト対応データを生成する（Ｓ３３０）。なお、ここでいう登場人物ｉとは、発話者とナレータとを含むものである。例えば、会話文については、文章中にて当該会話文を発声した人物を表す発話者を登場人物ｉとして、地の文についてはナレータを登場人物ｉとして特定する。 Further, the designated sentence data WT acquired in S320 is subjected to text analysis to generate speaker-text correspondence data (S330). This data associates each character i appearing in the text represented by the designated sentence data WT with the utterance text, that is, the text that character i should utter. The characters i here include both speakers and the narrator. For dialogue, for example, the speaker who utters the line in the text is identified as the character i; for narrative passages, the narrator is identified as the character i.

具体的には、Ｓ３３０では、まず、Ｓ３１０にて取得した指定文章データＷＴを、当該指定文章データＷＴによって表される文章中の句読点及び括弧にて分割して、文章を構成する単位区間である発声テキストに切り分ける。そして、その切り分けた発声テキストに対して形態素解析、及び係り受け解析を実行して、当該発声テキストを発声すべき登場人物ｉを特定する。さらに、各発声テキストと、当該発声テキストに対応する登場人物ｉとを対応付けることで、登場人物ｉと発声テキストとを対応付けた話者テキスト対応データを生成する。 Specifically, in S330, the designated sentence data WT acquired in S310 is first split at the punctuation marks and brackets in the text it represents, dividing the text into utterance texts, which are its unit sections. Morphological analysis and dependency analysis are then performed on each utterance text to identify the character i who should utter it. Finally, each utterance text is associated with its corresponding character i, producing the speaker-text correspondence data.
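The splitting step of S330 might look roughly like the sketch below, which cuts text at Japanese corner brackets and at the punctuation marks 、 and 。. It only labels each unit as dialogue or narration; identifying which character speaks a line requires the morphological and dependency analysis described above and is not attempted here. The function name and the two labels are hypothetical.

```python
import re


def split_utterances(text):
    """Split text into (kind, utterance) units at 「」 brackets and at 、 and 。."""
    units = []
    # A 「...」 span is treated as dialogue; any other run of text as narration.
    for match in re.finditer(r"「([^」]*)」|([^「」]+)", text):
        quoted, plain = match.group(1), match.group(2)
        if quoted is not None:
            units.append(("dialogue", quoted))
            continue
        # Split narration after each punctuation mark, keeping the mark with its unit.
        for piece in re.split(r"(?<=[、。])", plain):
            piece = piece.strip()
            if piece:
                units.append(("narration", piece))
    return units
```

For the input `太郎は言った。「おはよう」花子は笑った。` this yields a narration unit, a dialogue unit containing `おはよう`, and a second narration unit.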

なお、形態素解析や係り受け解析は、周知の手法を用いれば良く、例えば、形態素解析であれば、“ＭｅＣａｂ”を用いれば良い。また、係り受け解析であれば、“Ｃａｂｏｃｈａ（「工藤拓，松本裕治，“チャンキングの段階適用による日本語係り受け解析”，情報処理学会論文誌，４３（６），１８３４−１８４２（２００１）」）”を用いれば良い。 For morphological analysis and dependency analysis, a known method may be used. For example, in the case of morphological analysis, “MeCab” may be used. For dependency analysis, “Cabocha (“ Taku Kudo, Yuji Matsumoto, “Japanese Dependency Analysis by Chunking Stage Application” ”, Transactions of Information Processing Society of Japan, 43 (6), 1834-1842 (2001). ")" May be used.

音声合成処理へと戻り、話者テキスト対応データに基づいて、登場人物ｉごとに対応付けられた発声テキストを解析して、各発声テキストに出現する表情を特定する（Ｓ３４０）。このＳ３４０における解析は、上述した単語表情情報に基づいて、発声テキストに含まれる各単語によって表される表情の内容を取得することで実施すれば良い。 Returning to the speech synthesis process, the utterance text associated with each character i is analyzed based on the speaker text correspondence data, and the facial expression appearing in each utterance text is specified (S340). The analysis in S340 may be performed by acquiring the content of the facial expression represented by each word included in the utterance text based on the word facial expression information described above.

続いて、指定文章データＷＴによって表される文章中の登場人物ｉごとに、Ｓ３４０における表情解析の結果を集計し、登場人物ｉごとの表情の分布を表すテキスト表情分布ｔｐｄ（ｉ，ｋ）を導出する（Ｓ３５０）。このＳ３５０にて導出するテキスト表情分布ｔｐｄ（ｉ，ｋ）は、登場人物ｉが発生すべき文章にて出現する各種類の表情ｋの分布度合いを表すものである。 Subsequently, for each character i in the text represented by the designated sentence data WT, the results of the expression analysis in S340 are tallied to derive a text expression distribution tpd(i, k) representing the distribution of expressions for that character (S350). The text expression distribution tpd(i, k) derived in S350 represents the degree of distribution of each type of expression k appearing in the text that character i should utter.
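The tally in S350 can be sketched as a per-character count of expression labels. The source does not say how the "degree of distribution" is scaled, so dividing by the character's total count (a relative frequency) is an assumption, as are the function name and the input format.

```python
from collections import Counter, defaultdict


def text_expression_distribution(labelled_utterances):
    """Compute tpd[i][k] from (character, expression) pairs produced by the S340 analysis."""
    counts = defaultdict(Counter)
    for character, expression in labelled_utterances:
        counts[character][expression] += 1
    tpd = {}
    for character, counter in counts.items():
        total = sum(counter.values())
        # Assumed scaling: share of this character's utterances showing expression k.
        tpd[character] = {k: n / total for k, n in counter.items()}
    return tpd
```

With four utterances for one character, three labelled "joy" and one "anger", the result for that character is `{"joy": 0.75, "anger": 0.25}`.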

さらに、指定文章データＷＴの各登場人物ｉに対して、音声出力端末６０を介して指定された人物（即ち、配役ｊ）を表す配役情報を取得する（Ｓ３６０）。すなわち、配役情報とは、音声出力端末６０を介して指定された人物に対応する発声者特徴データである。 Further, for each character i in the designated sentence data WT, the casting information representing the person (that is, the casting j) designated through the voice output terminal 60 is acquired (S360). That is, the casting information is speaker characteristic data corresponding to a person designated via the voice output terminal 60.

そして、Ｓ３６０にて取得した配役情報によって表される配役ｊそれぞれに対応付けられた音源データＳＤそれぞれを、記憶部３０から取得し、その取得した各音源データＳＤにおける表情の分布を表す音源表情分布ｖｐｄ（ｊ，ｋ）を導出する（Ｓ３７０）。 Then, the sound source data SD associated with each cast member j represented by the casting information acquired in S360 is acquired from the storage unit 30, and a sound source expression distribution vpd(j, k) representing the distribution of expressions in each acquired sound source data SD is derived (S370).

このＳ３７０では、具体的には、まず、表情の内容が中立状態である表情データと対応付けられた音声パラメータｓｐｒ＿ｎ（ｊ，ｋ）を基準とし、下記（１）式に従って、表情差分ベクトルｄｓｐｒ＿ｅ（ｊ，ｋ）を導出する。 Specifically, in S370, the voice parameter spr_n(j, k) associated with expression data whose content is the neutral state is first taken as the reference, and the expression difference vector dspr_e(j, k) is derived according to equation (1).

この表情差分ベクトルｄｓｐｒ＿ｅ（ｊ，ｋ）は、図４（Ａ）に示すように、基準となる音声パラメータ（即ち、基準パラメータ）ｓｐｒ＿ｎ（ｊ，ｋ）から、各表情ｋを内容とする表情データと対応付けられた音声パラメータｓｐｒ＿ｅ（ｊ，ｋ）へのベクトルである。なお、基準パラメータｓｐ＿ｎ（ｊ，ｋ）とは、配役ｊと対応付けられた音声パラメータｓｐｒ＿ｅ（ｊ，ｋ）の中で、表情ｋが中立状態であることを表すタグデータＴＧと対応付けられた音声パラメータである。なお、ここで言う表情ｋが中立状態であることとは、無表情であることを含むものである。 As shown in FIG. 4A, the expression difference vector dspr_e(j, k) is the vector from the reference voice parameter (that is, the reference parameter) spr_n(j, k) to the voice parameter spr_e(j, k) associated with the expression data for each expression k. The reference parameter spr_n(j, k) is, among the voice parameters spr_e(j, k) associated with cast member j, the one associated with tag data TG indicating that the expression k is neutral. A neutral expression k here includes the case of no expression at all.

さらに、Ｓ３７０では、下記（２）式に従って、表情差分ベクトルｄｓｐｒ＿ｅ（ｊ，ｋ）のスカラー量の最大値が「１」となるように正規化することで、音源表情分布ｖｐｄ（ｊ，ｋ）を導出する。ただし、（２）式中の関数ｍａｘは、最大値を返答する関数である。 Further, in S370, the sound source expression distribution vpd(j, k) is derived according to equation (2) by normalizing the expression difference vectors dspr_e(j, k) so that the maximum of their scalar quantities is 1. The function max in equation (2) returns the maximum value.
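Equations (1) and (2) are not reproduced in this text. One reading that is consistent with the surrounding description, offered as a reconstruction rather than as the patent's own notation, takes the "scalar quantity" of a vector to be its Euclidean norm:

```latex
\begin{align}
\mathrm{dspr\_e}(j,k) &= \mathrm{spr\_e}(j,k) - \mathrm{spr\_n}(j,k) \tag{1}\\[4pt]
\mathrm{vpd}(j,k) &= \frac{\lVert \mathrm{dspr\_e}(j,k) \rVert}
                         {\max_{k'} \lVert \mathrm{dspr\_e}(j,k') \rVert} \tag{2}
\end{align}
```

Under this reading, vpd(j, k) is a scalar between 0 and 1 for each expression k, and it equals 1 for the expression whose voice parameters lie farthest from the neutral reference.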

続いて、下記（３）式に従って、表情差分ベクトルｄｓｐｒ＿ｅ（ｊ，ｋ）を音源表情分布ｖｐｄ（ｊ，ｋ）で除して正規化し、均一差分ベクトルｎｄｓｐｒ＿ｅ（ｊ，ｋ）を算出する（Ｓ３８０）。 Subsequently, according to equation (3), the expression difference vector dspr_e(j, k) is normalized by dividing it by the sound source expression distribution vpd(j, k), giving the uniform difference vector ndspr_e(j, k) (S380).

この均一差分ベクトルｎｄｓｐｒ＿ｅ（ｊ，ｋ）は、図４（Ｂ）に示すように、基準パラメータｓｐｒ＿ｎ（ｊ，ｋ）から、各内容の表情ｋを表す表情データと対応付けられた音声パラメータｓｐｒ＿ｅ（ｊ，ｋ）までのスカラー量が均一となるように正規化されている。 As shown in FIG. 4B, the uniform difference vectors ndspr_e(j, k) are normalized so that the scalar quantity from the reference parameter spr_n(j, k) to the voice parameter spr_e(j, k) associated with the expression data for each expression k is the same for every expression.
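Steps S370 and S380 can be sketched in a few functions. This assumes that the "scalar quantity" of a vector is its Euclidean norm, that voice parameters are flat lists of floats keyed by expression label, and that the neutral entry is keyed "neutral"; the function names are hypothetical. Expressions whose difference vector is zero are skipped to avoid dividing by zero, which the source does not discuss.

```python
import math


def norm(vector):
    """Euclidean length, used here as the 'scalar quantity' of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def difference_vectors(spr_by_expression, neutral="neutral"):
    """dspr_e(j, k): vector from the neutral reference to each expression's parameters."""
    base = spr_by_expression[neutral]
    return {
        k: [a - b for a, b in zip(v, base)]
        for k, v in spr_by_expression.items()
        if k != neutral
    }


def sound_source_distribution(dspr):
    """vpd(j, k): magnitudes scaled so that the largest one equals 1."""
    largest = max((norm(v) for v in dspr.values()), default=0.0)
    if largest == 0.0:
        return {k: 0.0 for k in dspr}
    return {k: norm(v) / largest for k, v in dspr.items()}


def uniform_difference_vectors(dspr, vpd):
    """ndspr_e(j, k) = dspr_e(j, k) / vpd(j, k); every result has the same magnitude."""
    return {k: [x / vpd[k] for x in v] for k, v in dspr.items() if vpd[k] > 0.0}
```

Dividing each difference vector by its own vpd value stretches it to the length of the longest one, so all expressions start from an equal strength before the text distribution is applied.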

Further, in the speech synthesis process, according to equation (4) below, the text expression distribution tpd(i, k) is reflected in each item of sound source data SD for each cast j indicated by the casting information, and the correction parameter e_spa(j) is derived (S390). The function vch in equation (4) associates a cast j with the character i who utters each utterance text.

That is, in equation (4), as shown in FIG. 5(A), the uniform difference vector ndspr_e(j, k) is multiplied by the text expression distribution tpd(i, k) to derive an expression-weighted difference vector. Then, as shown in FIG. 5(B), the expression-weighted difference vector is added to the reference parameter spr_n(j, k) to derive the correction parameter e_spa(j).
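The two steps described for S390 can be sketched as below. Equation (4) is not reproduced in this text, so this is an illustration under assumptions: the expression-weighted difference vectors are summed over all expressions k before being added to the reference parameter, vch is modeled as a simple dictionary lookup, and all names and sample values are hypothetical.

```python
def correction_parameter(spr_n, ndspr_e, tpd_i):
    """e_spa(j): reference parameter plus the expression-weighted difference vectors.

    Assumption: contributions of all expressions k are summed.
    """
    e_spa = list(spr_n)
    for k, weight in tpd_i.items():
        vec = ndspr_e.get(k, [0.0] * len(spr_n))
        for d, component in enumerate(vec):
            e_spa[d] += weight * component
    return e_spa

# Hypothetical data: one character i mapped to one cast j by vch.
vch = {"narrator": "cast_a"}
spr_n = {"cast_a": [1.0, 1.0]}
ndspr_e = {"cast_a": {"joy": [3.0, 4.0], "anger": [0.0, 5.0]}}
tpd = {"narrator": {"joy": 0.5, "anger": 0.25}}

i = "narrator"
j = vch[i]
e_spa = correction_parameter(spr_n[j], ndspr_e[j], tpd[i])
```

Because the uniform difference vectors all have the same length, the weight tpd(i, k) alone determines how strongly each expression moves the parameter away from the neutral reference.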

Based on the correction parameter e_spa(j) for each cast j derived in S390, speech synthesis is executed in accordance with the content of the sentences represented by the designated sentence data WT, and synthesized speech is generated (S400). The synthesized speech generated in S400 is then delivered to the voice output terminal 60, which is made to output it (S410).

Thereafter, the speech synthesis process ends.
In other words, in the speech synthesis process, the sentences represented by the designated sentence data WT are analyzed, and a text expression distribution tpd(i, k), which represents the degree of distribution of each type of expression appearing in those sentences, is derived for each character i. The sound source data SD corresponding to each cast j designated via the voice output terminal 60 is then acquired and analyzed, and for each item of sound source data SD, a sound source expression distribution vpd(j, k) is derived, which represents the degree of distribution of each type of expression exhibited in the speech represented by the speech parameters spr included in that sound source data SD.

Further, in the speech synthesis process, the speech parameter spr(j) constituting each sound source expression distribution vpd(j, k) is corrected so that the sound source expression distribution vpd(j, k) of the cast j assigned to each character i matches the text expression distribution tpd(i, k) of that character i, and the correction parameter e_spa(j) is derived. Then, for each sentence represented by the designated sentence data WT, speech synthesis is executed according to the correction parameter e_spa(j) corresponding to that sentence, and the synthesized speech is output from the voice output terminal 60.
[Effects of the Embodiment]
As described above, the speech synthesis system 1 can derive the correction parameter e_spa(j) by correcting the speech parameter spr(j) constituting each sound source expression distribution vpd(j, k) so that each sound source expression distribution vpd(j, k) matches the text expression distribution tpd(i, k).

According to the speech synthesis system 1, because speech synthesis is executed using the derived correction parameter e_spa(j), an expression suited to the designated sentence data WT can be imparted to the synthesized speech.

In particular, according to the speech synthesis system 1, speech synthesis is executed after correcting the speech parameter spr included in the sound source data SD of the cast j designated by the user, so the designated sentence data WT can be read aloud with the most suitable emotion imparted to the voice of the cast j designated by the user.
[Other Embodiments]
An embodiment of the present invention has been described above, but the present invention is not limited to that embodiment and can be implemented in various forms without departing from the gist of the present invention.

That is, a form in which part of the configuration of the above embodiment is omitted, to the extent that the problem can still be solved, is also an embodiment of the present invention. A form configured by appropriately combining the above embodiment with the modifications is also an embodiment of the present invention. Furthermore, every form conceivable without departing from the essence of the invention specified by the wording of the claims is an embodiment of the present invention.

For example, in the above embodiment the correction parameter e_spa(j) is derived based on equation (4), but the derivation of the correction parameter e_spa(j) is not limited to this. That is, the correction parameter e_spa(j) may instead be derived according to equation (5) below, using the content of the expression kmax whose expression distribution value is the largest.

In this case, however, the expression kmax is preferably derived according to equation (6) below.
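Equations (5) and (6) are not reproduced in this text, so the following is only one possible reading of this variant, not the equations themselves. It assumes that kmax is the expression with the largest value in the text expression distribution tpd(i, k), and that only the uniform difference vector for kmax, weighted by that value, is added to the reference parameter. The function names and sample values are hypothetical.

```python
def dominant_expression(tpd_i):
    """kmax: the expression with the largest distribution value (assumed reading of eq. (6))."""
    return max(tpd_i, key=tpd_i.get)

def correction_parameter_kmax(spr_n, ndspr_e, tpd_i):
    """e_spa(j) using only the dominant expression (assumed reading of eq. (5))."""
    kmax = dominant_expression(tpd_i)
    weight = tpd_i[kmax]
    return [n + weight * d for n, d in zip(spr_n, ndspr_e[kmax])]

# Hypothetical values for one character i and its cast j.
spr_n = [1.0, 1.0]
ndspr_e = {"joy": [3.0, 4.0], "anger": [0.0, 5.0]}
tpd_i = {"joy": 0.5, "anger": 0.25}

kmax = dominant_expression(tpd_i)
e_spa = correction_parameter_kmax(spr_n, ndspr_e, tpd_i)
```

Compared with a derivation that mixes every expression, this reading lets the single strongest expression in the text decide the direction of the correction.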

In the speech synthesis process of the above embodiment, S400 and S410 are executed by the information processing server 10, and the synthesized speech generated by the information processing server 10 is output from the voice output terminal 60; however, the device that executes S400 and S410 is not limited to this. For example, S400 and S410 of the speech synthesis process may be executed by the voice output terminal 60.

In other words, as long as the correction parameter e_spa(j) can be derived so that each sound source expression distribution vpd(j, k) matches the text expression distribution tpd(i, k), and synthesized speech can be output by executing speech synthesis for each sentence represented by the designated sentence data WT according to the correction parameter e_spa(j) corresponding to that sentence, each step of the speech synthesis process may be executed by either the information processing server 10 or the voice output terminal 60. The entire speech synthesis process may also be executed by the voice output terminal 60.
[Correspondence between the Embodiment and the Claims]
Finally, the relationship between the description of the above embodiment and the recitations of the claims is explained.

S310 and S320 in the speech synthesis process of the above embodiment correspond to the sentence acquisition means recited in the claims, S330 to S350 correspond to the sentence analysis means, and S360 and S370 correspond to the sound source analysis means. S380 and S390 in the speech synthesis process correspond to the parameter correction means recited in the claims; of these, S380 corresponds to the uniform difference derivation means and S390 corresponds to the expression reflecting means.

S400 and S410 in the speech synthesis process correspond to the speech synthesis means recited in the claims.

DESCRIPTION OF SYMBOLS
1 … Speech synthesis system, 10 … Information processing server, 12 … Communication unit, 20 … Control unit, 22 … ROM, 24 … RAM, 26 … CPU, 30 … Storage unit, 60 … Voice output terminal, 61 … Communication unit, 62 … Information reception unit, 63 … Display unit, 64 … Sound input unit, 65 … Sound output unit, 66 … Storage unit, 70 … Control unit, 72 … ROM, 74 … RAM, 76 … CPU

Claims

A speech synthesis system comprising:
sentence acquisition means for acquiring sentence data representing a character string constituting a designated sentence;
sentence analysis means for analyzing the sentence represented by the sentence data acquired by the sentence acquisition means and deriving a text expression distribution representing the degree of distribution of each type of expression appearing in the sentence;
sound source analysis means for acquiring and analyzing designated sound source data, which is the sound source data corresponding to a designated speaker, from a storage device storing sound source data, the sound source data being data in which expression data representing each expression when a prescribed content sentence, defined as a sentence whose content exhibits multiple types of expression, is uttered is associated, for each expression type and for each speaker, with at least one speech parameter of the sound uttered for the portion of the prescribed content sentence where each expression appears, and for deriving a sound source expression distribution representing the degree of distribution of each type of expression exhibited in the speech represented by the speech parameters included in the designated sound source data;
parameter correction means for deriving a correction parameter by correcting the speech parameters included in the designated sound source data so that the sound source expression distribution derived by the sound source analysis means matches the text expression distribution derived by the sentence analysis means; and
speech synthesis means for performing speech synthesis of the sentence represented by the sentence data acquired by the sentence acquisition means, based on the correction parameter derived by the parameter correction means.

The speech synthesis system according to claim 1, wherein
the sound source analysis means takes, as a reference parameter, the speech parameter associated with the expression data indicating that the expression is in a neutral state, derives, for each expression type, an expression difference vector that represents, as a vector from the reference parameter, the strength of each expression exhibited in the speech represented by the speech parameters included in the designated sound source data, and derives, as the sound source expression distribution, the result of normalizing the expression difference vectors so that the maximum scalar magnitude among all the expression difference vectors is 1, and
the parameter correction means comprises:
uniform difference derivation means for deriving uniform difference vectors obtained by dividing each of the expression difference vectors by the sound source expression distribution; and
expression reflecting means for deriving the correction parameter by adding the reference parameter to the result of multiplying each uniform difference vector derived by the uniform difference derivation means by the text expression distribution derived by the sentence analysis means.

A speech synthesis method comprising:
a sentence acquisition procedure for acquiring sentence data representing a character string constituting a designated sentence;
a sentence analysis procedure for analyzing the sentence represented by the sentence data acquired in the sentence acquisition procedure and deriving a text expression distribution representing the degree of distribution of each type of expression appearing in the sentence;
a sound source analysis procedure for acquiring and analyzing designated sound source data, which is the sound source data corresponding to a designated speaker, from a storage device storing sound source data, the sound source data being data in which expression data representing each expression when a prescribed content sentence, defined as a sentence whose content exhibits multiple types of expression, is uttered is associated, for each expression type and for each speaker, with at least one speech parameter of the sound uttered for the portion of the prescribed content sentence where each expression appears, and for deriving a sound source expression distribution representing the degree of distribution of each type of expression exhibited in the speech represented by the speech parameters included in the designated sound source data;
a parameter correction procedure for deriving a correction parameter by correcting the speech parameters included in the designated sound source data so that the sound source expression distribution derived in the sound source analysis procedure matches the text expression distribution derived in the sentence analysis procedure; and
a speech synthesis procedure for performing speech synthesis of the sentence represented by the sentence data acquired in the sentence acquisition procedure, based on the correction parameter derived in the parameter correction procedure.