JP2007025323A

JP2007025323A - Speech synthesizing method, device, program, and recording medium

Info

Publication number: JP2007025323A
Application number: JP2005208156A
Authority: JP
Inventors: Akihiro Yoshida; 明弘吉田; Hideyuki Mizuno; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-07-19
Filing date: 2005-07-19
Publication date: 2007-02-01
Anticipated expiration: 2025-07-19
Also published as: JP4425192B2

Abstract

PROBLEM TO BE SOLVED: To convert text data into high-quality synthesized speech.
SOLUTION: The number of accent phrases and a prosodic pattern for each accent phrase are obtained from the input text. A speech unit sequence is determined by matching against a database on the basis of the prosodic pattern, and the quality of that speech unit sequence is evaluated. Quality is evaluated with a statistical method such as an SVM, using a plurality of evaluation scales. The speech unit sequence for each accent phrase is fixed when all evaluation scales are satisfied, or when the search for a speech unit sequence has been repeated a preset permissible number of reselections, so that synthesized speech of the best quality is generated.
COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a speech synthesis technique that takes arbitrary text as input and generates and outputs speech matching that text, that is, a technique that performs media conversion from text to speech.

Speech synthesis technology, which can generate the desired speech from text input alone, is used for reading out e-mails and Web articles and for reading out information in telephone voice guidance such as at contact centers, and it helps reduce the labor cost of having a person actually read the text aloud.
However, the quality of synthesized speech still falls short of natural human speech, and improving quality is considered indispensable for further market expansion.
Waveform-concatenation speech synthesis (Patent Document 1), one of the techniques for converting arbitrary text into speech, generates synthesized speech by searching a large speech corpus for fragments of speech waveform of arbitrary length (speech units) and concatenating them.

Because this method covers prosodic variation by selecting speech units from a large speech corpus, it basically performs no signal processing, so the natural, human-voice quality of the synthesized speech is preserved.
The quality of speech synthesized by waveform concatenation depends heavily on the approach used to search for speech units.
A typical search approach first analyzes the input kanji-kana text to create the targets, such as a phoneme sequence and a prosodic pattern. It then searches for the combination of speech waveforms that comes as close as possible to these targets by comprehensively evaluating a plurality of reference parameters.

The most common way to evaluate the reference parameters comprehensively is to compute sub-costs for a plurality of parameters, such as fundamental frequency, prosodic environment, phoneme duration, and number of unit concatenations, and to output the speech unit sequence whose total cost, the weighted sum of those sub-costs, is lowest.
The weight set, the cost of each parameter, and the total cost in this speech unit sequence selection method can be written as follows (equation (1)).
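The equation itself does not survive in this text. Based only on the description above (per-parameter sub-costs combined through a weight set into a total cost), it presumably has a weighted-sum form such as the one below. The symbols $C_{\mathrm{total}}$, $w_i$, $C_i$, and $N$ are assumed notation for illustration and are not taken from the original.

```latex
C_{\mathrm{total}} = \sum_{i=1}^{N} w_i \, C_i
```

Here $C_i$ would be the sub-cost of the $i$-th parameter (for example fundamental frequency or phoneme duration), $w_i$ its weight in the weight set, and $N$ the number of parameters; the sequence with the smallest $C_{\mathrm{total}}$ is selected.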

Because this evaluation method judges the parameters collectively, even when one parameter, for example the fundamental-frequency parameter that strongly affects intonation, is assigned a high cost, the total cost can still come out low if the costs of the other reference parameters are low. As a result, a speech unit sequence that produces incorrect intonation is often selected even though a more appropriate combination of speech units exists in the speech database.
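The masking effect described above can be illustrated with a small numerical sketch. The parameter names, the sub-cost values, and the equal weights are all invented for illustration; only the weighted-sum selection rule comes from the text.

```python
def total_cost(sub_costs, weights):
    """Weighted sum of per-parameter sub-costs (lower is better)."""
    return sum(weights[name] * cost for name, cost in sub_costs.items())


# Hypothetical weight set: every parameter weighted equally.
weights = {"f0": 1.0, "duration": 1.0, "environment": 1.0, "joins": 1.0}

# Candidate A: poor fundamental frequency match (wrong intonation),
# but very low cost on every other parameter.
candidate_a = {"f0": 0.9, "duration": 0.1, "environment": 0.1, "joins": 0.1}

# Candidate B: acceptable on every parameter, none of them excellent.
candidate_b = {"f0": 0.2, "duration": 0.4, "environment": 0.4, "joins": 0.4}

cost_a = total_cost(candidate_a, weights)  # about 1.2
cost_b = total_cost(candidate_b, weights)  # about 1.4

# The lowest total cost wins, so A is chosen despite its high F0 cost.
selected = "a" if cost_a < cost_b else "b"
```

With these made-up numbers, candidate A is selected even though its fundamental-frequency sub-cost is far higher than B's, which is the failure mode the passage describes.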

The weight set used to generate high-quality synthesized speech should be set to values that make such cases as rare as possible. However, the optimal weight values depend on how well the input text matches the speech database: the vocabulary in the input text, the phoneme sequences it contains, how many speech units capable of reproducing the target prosodic pattern the database holds, and so on. Taken to the extreme, the optimal weight set may differ for every input text.
For this reason, methods have been proposed that evaluate the quality of the generated synthesized speech and, if the quality is low, change the synthesis processing method and synthesize again, repeating this sequence until high-quality synthesized speech is produced (Non-Patent Document 1, Patent Document 2).

Non-Patent Document 1 uses machine learning to evaluate whether the synthesized speech contains, at the accent nucleus where the fundamental frequency falls, a fall in fundamental frequency that a human can perceive; if it judges that there is no such fall, it changes the weight set and performs synthesis again. Patent Document 2 judges the quality of the synthesized speech by whether it satisfies a certain reference value.
Patent Document 1: Japanese Patent No. 2761552
Patent Document 2: JP 2004-354644 A
Non-Patent Document 1: Mirai Hasegawa, Hideyuki Mizuno, "Study on segment selection in corpus-based speech synthesis", Proceedings of the Acoustical Society of Japan meeting, 1-7-2, pp. 215-216, Mar. 2004.

Non-Patent Document 1 looks only at a local part, the accent nucleus, and does not judge whether the intonation is correct elsewhere. It also evaluates only a scale related to the fundamental frequency. For these reasons, it cannot evaluate quality correctly.
Patent Document 2 judges the quality of the synthesized speech by whether a certain reference value is satisfied, but it is difficult to evaluate quality accurately simply by checking whether a value is met. In some cases, synthesized speech with no quality problem is resynthesized, or low-quality synthesized speech is output unchanged, so it is difficult to generate high-quality synthesized speech consistently.

Furthermore, in Patent Document 2, when the quality is judged to be low, the synthesis processing method is changed on the basis of control information contained in the input information. Because the change is not based on the quality evaluation result of the synthesized speech actually produced, the predetermined synthesis processing means cannot always generate high-quality synthesized speech.
An object of the present invention is to eliminate these drawbacks and to propose a speech synthesis method, apparatus, program, and recording medium capable of generating high-quality synthesized speech.

The speech synthesizer of the present invention includes a database storing a speech corpus that holds speech data and, in order to convert arbitrary input text into speech, searches for and concatenates speech units of arbitrary length contained in the speech corpus, thereby generating synthesized speech whose content matches the input text.
It is characterized by comprising:
a text analysis processing unit that obtains the reading of the input text by morphological analysis and obtains the number of accent phrases and the tone-combination information needed for prosodic pattern generation;
a prosodic pattern generation unit that generates a target prosodic pattern using the information obtained by the text analysis processing unit;
a processing target accent phrase setting unit that sets the accent phrase to be processed in accordance with the target prosodic pattern;
a weight set setting and setting change unit that, for each accent phrase, sets the weight set to its initial values and changes the set values;
a speech unit search unit that selects speech units on the basis of the generated prosodic pattern and sets a speech unit sequence;
a quality evaluation unit that evaluates the quality of the synthesized speech on each of a plurality of evaluation scales for the speech unit sequence found by that search;
an output speech unit sequence determination unit that detects whether the evaluation results of the quality evaluation unit contain any evaluation scale judged to be of poor quality and, if so, returns processing to the speech unit sequence search via the weight set setting and setting change unit, or, if not, fixes the speech unit sequence found at that point as the speech unit sequence of the accent phrase being processed; and
an input text end determination unit that, each time the output speech unit sequence determination unit fixes the speech unit sequence of an accent phrase, determines whether all accent phrases contained in the input text have been processed, outputs the speech unit sequence of each accent phrase constituting the input text to the synthesized speech processing unit if they have, and otherwise sets the next accent phrase as the accent phrase to be processed and causes the weight set setting and setting change unit, the speech unit sequence search unit, the quality evaluation unit, and the output speech unit sequence determination unit to operate repeatedly, up to a preset permissible number of reselections.

A further feature of the present invention is the addition of: a permissible-reselection-count determination unit that, when the evaluation results of the quality evaluation unit contain an evaluation scale judged to be of poor quality, determines whether the number of executions of the speech unit sequence search unit and the quality evaluation unit has reached the permissible number of reselections and, if it has not, returns processing to the search operation of the speech unit sequence search unit via the weight set change processing of the weight set setting and setting change unit; a signal-processing-required determination unit that, when the permissible-reselection-count determination unit determines that the count has been reached, fixes the speech unit sequence found at that point as the speech unit sequence of the accent phrase currently being processed, together with information indicating that this accent phrase requires signal processing; and a signal-processing-not-required determination unit that, when the quality evaluation unit finds no evaluation scale judged to be of poor quality, fixes the speech unit sequence found at that point as the speech unit sequence of the accent phrase currently being processed, together with information indicating that this accent phrase requires no signal processing.
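The control flow described in the two preceding paragraphs (search, evaluate on every scale, adjust the weight set, and retry up to a permissible number of reselections) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `select_units_for_phrase`, `search`, `evaluators`, and `boost` are hypothetical, and multiplying the weight of each failed scale by `boost` is an invented update rule, since the text says only that the evaluation result is reflected in the weight set.

```python
from dataclasses import dataclass


@dataclass
class PhraseResult:
    units: list                    # speech unit sequence fixed for the phrase
    needs_signal_processing: bool  # True if the retry limit was reached
    attempts: int                  # number of search/evaluation rounds run


def select_units_for_phrase(phrase, search, evaluators, initial_weights,
                            max_reselections, boost=2.0):
    """Search, evaluate on every scale, and retry with adjusted weights.

    search(phrase, weights) returns a unit sequence (hypothetical callback).
    evaluators maps a scale name to is_ok(phrase, units) -> bool; the scale
    names are assumed to match keys of the weight set.
    """
    weights = dict(initial_weights)  # each phrase starts from the initial values
    attempts = 0
    while True:
        units = search(phrase, weights)
        attempts += 1
        failed = [scale for scale, is_ok in evaluators.items()
                  if not is_ok(phrase, units)]
        if not failed:
            # Every scale passed: no signal processing is required.
            return PhraseResult(units, False, attempts)
        if attempts >= max_reselections:
            # Limit reached: keep the current sequence, flag it for signal processing.
            return PhraseResult(units, True, attempts)
        for scale in failed:
            weights[scale] *= boost  # assumed rule: emphasize the failed scales


def select_units_for_text(phrases, search, evaluators, initial_weights,
                          max_reselections):
    """Process the accent phrases in order and collect one result per phrase."""
    return [select_units_for_phrase(p, search, evaluators, initial_weights,
                                    max_reselections)
            for p in phrases]
```

Resetting `weights` from `initial_weights` inside the per-phrase function mirrors the statement that the weight set is returned to its initial values for each accent phrase.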

A further feature of the present invention is that the evaluation means provided in the quality evaluation unit include at least one of an evaluation means for intonation, an evaluation means for speech rate, and an evaluation means for continuity, applied to the speech unit sequence determined on the basis of the prosodic pattern generated from the input text.
A further feature of the present invention is that the weight set set by the weight set setting and setting change unit includes at least some combination of a weight for the parameter relating to the fundamental frequency, a weight for the parameter relating to duration, and a weight for the parameter relating to the gap at concatenation points, for the speech unit sequence determined on the basis of the prosodic pattern generated from the input text.

According to the present invention, taking the accent phrase as the section to be processed makes possible a global rather than a local quality evaluation, that is, an evaluation of quality over a reasonably long range that forms a prosodic unit. In addition, evaluating quality on scales other than the fundamental frequency allows the quality of the synthesized speech as a whole to be evaluated correctly.
According to the present invention, using a statistical method such as an SVM makes it possible to model the nonlinear relationship between a plurality of feature quantities and human correct/incorrect judgments, which enables more accurate quality evaluation and the stable generation of high-quality synthesized speech.

According to the present invention, quality evaluation by a statistical method is performed on a plurality of evaluation scales, the results are reflected in the weight set, and the speech unit sequence is reselected, so that high-quality synthesized speech can be obtained stably.

The method of evaluating synthesized speech quality with a statistical method according to the present invention is preferably used by installing a program that implements it on a computer. The best mode for making this quality evaluation method work is an embodiment in which waveform-concatenation speech synthesis means, which converts arbitrary text into speech by concatenating and combining variable-length speech waveforms, is implemented within the program.
To make the speech synthesizer of the present invention function on a computer, the following units are constructed in a speech synthesizer that has a speech corpus holding a large amount of speech data and that, in order to convert arbitrary input text into speech, searches for and concatenates speech units of arbitrary length contained in the speech corpus to generate synthesized speech whose content matches the input text: a text analysis processing unit that obtains the reading of the input text by morphological analysis and assigns the accent type and tone-combination type, which are the information needed for prosody generation; a prosody generation unit that generates a target prosodic pattern using the information obtained by the text analysis processing unit; a speech unit sequence search unit that comprehensively evaluates a plurality of reference parameters, such as agreement with the prosodic pattern of the synthesized speech, and selects speech units on the basis of the evaluation result; a quality evaluation unit that can evaluate the quality of the generated synthesized speech globally on each of a plurality of evaluation scales, using a statistical method capable of modeling the nonlinear relationship between human quality judgments and a plurality of feature quantities; an output speech unit sequence determination unit that decides from the quality evaluation result whether to adopt the speech unit sequence; a weight set setting and setting change unit that, when the quality is poor, sets a weight set reflecting the evaluation result obtained by the quality evaluation unit, and that sets the weight set to its initial values in order to search for units for the next accent phrase; an input text end determination unit that determines whether the input text has been searched to the end; and a synthesis processing unit that smoothly joins the selected speech unit sequence using signal processing techniques or the like.
This configuration provides the distinctive effect of the present invention, namely that high-quality synthesized speech can be obtained stably.

FIG. 1 shows the overall configuration of a speech synthesizer 100 according to the present invention. The speech synthesis method according to the present invention is basically a waveform-concatenation method, and the speech synthesizer 100 operating by this method comprises a text analysis processing unit 10, a prosody generation unit 20, a speech unit search unit 30, a synthesis processing unit 40, and a speech database 50. The speech database 50 holds a speech corpus containing a large amount of speech data.
In the processing flow of the speech synthesizer 100, the text analysis processing unit 10 performs morphological analysis on the input text TX to be synthesized, thereby obtaining its reading, and assigns the accent type and tone-combination type, which are the information needed for prosody generation.

The prosody generation unit 20 uses this information to generate a target prosodic pattern. The speech unit search unit 30 then searches the speech database 50 for a speech unit sequence whose prosodic sequence matches the input text and for which the prosodic pattern obtained here, the phoneme environment of the speech units, and so on agree as closely as possible. Finally, the synthesis processing unit 40 smoothly joins the selected speech unit sequence to create synthesized speech data OD.
The processing flow up to this point is the same as in conventional speech synthesis methods.
The present invention is characterized by the processing in the speech unit search unit 30, and in particular by the processing of the quality evaluation unit 33 provided in the speech unit search unit 30.

FIG. 2 shows the internal configuration of the speech unit search unit 30, and FIG. 7 shows its processing flow. The other components, namely the text analysis processing unit 10, the prosody generation unit 20, the synthesis processing unit 40, and the speech database 50, are the same as in the prior art, so a detailed description of them is omitted.
The speech unit search unit 30 comprises a weight set setting and setting change unit 31, a speech unit sequence search unit 32, a quality evaluation unit 33, an output speech unit sequence determination unit 34, and an input text end determination unit 35. The target prosodic pattern and input text information ST output from the prosody generation unit 20 are input to the weight set setting and setting change unit 31, and the input text end determination unit 35 outputs speech unit sequence data OS as the output of the speech unit search unit 30.

In the processing flow within the speech unit search unit 30, the speech unit sequence search unit 32 first, as in the conventional method, searches the speech database 50 for a speech unit sequence that is as close as possible to the target prosodic pattern generated by the prosody generation unit 20 and that matches the input text information TX, such as the phoneme sequence obtained from the text analysis processing unit 10.
As shown in equation (1), the speech unit sequence closest to the target prosodic pattern is selected by computing sub-costs for a plurality of parameters, such as fundamental frequency, phoneme environment, phoneme duration, and number of unit concatenations, and evaluating the total cost obtained as their weighted sum. For the fundamental frequency, for example, the degree of agreement of the fundamental frequency pattern and the gap in fundamental frequency at the concatenation points of speech units are computed as costs. The initial values of the weight set used here are tuned in advance so that, on average, high-quality synthesized speech is obtained for a variety of input texts. These initial values are supplied by the weight set setting and setting change unit 31 before the search for speech units begins.
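The two fundamental-frequency sub-costs named above (agreement of the F0 pattern, and the F0 gap at concatenation points) could be computed along the following lines. The text does not define either measure, so the mean absolute difference and the summed absolute jump used here are assumptions chosen for simplicity, and the function names are hypothetical.

```python
def f0_pattern_cost(target_f0, candidate_f0):
    """Mean absolute difference between two equal-length F0 contours (Hz).

    Assumed stand-in for "degree of agreement of the F0 pattern":
    0 means a perfect match, larger values mean a worse match.
    """
    if len(target_f0) != len(candidate_f0) or not target_f0:
        raise ValueError("contours must be non-empty and of equal length")
    return sum(abs(t - c) for t, c in zip(target_f0, candidate_f0)) / len(target_f0)


def f0_join_gap_cost(unit_contours):
    """Sum of absolute F0 jumps at each concatenation point (Hz).

    unit_contours is a list of non-empty F0 contours, one per speech unit,
    in the order in which the units would be concatenated.
    """
    return sum(abs(right[0] - left[-1])
               for left, right in zip(unit_contours, unit_contours[1:]))
```

For example, joining units whose contours are `[100, 110]`, `[130, 120]`, and `[120, 115]` gives a gap of 20 Hz at the first join and 0 Hz at the second, so `f0_join_gap_cost` returns 20.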

Next, the quality evaluation unit 33 evaluates the global quality of the synthesized speech generated from the speech unit sequence just selected. There need not be only one scale of quality to evaluate (quality evaluation scale); the scales include one for the fundamental frequency, which is closely tied to intonation, and one for phoneme duration, which is closely tied to speech rate.
Since the present invention aims to evaluate global quality, the unit over which quality is evaluated should be as long as possible. If the unit is too long, however, the number of parameters to consider becomes too large and evaluation becomes difficult, so quality must be evaluated on unit sequences of an appropriate length.

Because intonation and speech rate are elements that express prosody, when quality is evaluated on these scales the example below takes the accent phrase, which is a prosodic unit, as the unit of quality evaluation.
The quality of the synthesized speech is also evaluated with consideration for a scale of the continuity of the spectrum and fundamental frequency at unit concatenation points, which is closely tied to naturalness, and a scale of the agreement of the prosodic environment of the speech units, which is closely tied to clarity.
Accordingly, the quality evaluation unit 33 evaluates quality for each accent phrase on a plurality of evaluation scales using a statistical method. Machine learning is one example of such a method. This specification describes, as an example, the case of using an SVM (Support Vector Machine), one of many machine learners.

An SVM is a pattern recognition technique that uses a large number of samples as training data. It learns the relationship between the class assigned to each training sample and the multidimensional feature vector extracted from that sample, and by associating regions of the multidimensional feature space with classes it estimates which class a given input feature vector belongs to. The SVM can be expressed as equation (2) and illustrated as in FIG. 6. The following describes the case where the aim is classification into two classes.
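Equation (2) is not reproduced in this text. Given that the description refers to a function f of the feature vector with learned quantities w and b, it presumably corresponds to the standard SVM decision function, shown here in its linear form as an assumption; the bold-vector notation is introduced only for this illustration.

```latex
f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b
```

Here $\mathbf{x}$ is the feature vector, $\mathbf{w}$ a learned vector of the same dimension, and $b$ a learned scalar. In a kernel SVM the inner product would be taken in a transformed feature space, which is one way such a model can capture nonlinear relationships; whether the original equation includes such a transformation cannot be determined from the surviving text.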

Suppose there are two classes, A and B. The values of the vector w and the scalar b are learned from the training samples so that f(x), where x is the feature vector, is positive for class A, negative for class B, and 0 on the class boundary. Learning these values estimates the boundary between the classes in the feature space. The estimation is performed so that the minimum distance between the boundary and the training samples is maximized, that is, so that the boundary lies midway between the two classes (FIG. 6).
A specific example of building an SVM for quality evaluation of the fundamental frequency follows. Synthesized speech prepared in advance, natural human speech, and the like can be used as training data. The class given to each training sample for the fundamental-frequency SVM is either "intonation correct" or "intonation incorrect". The classes are assigned by hand: disregarding all other factors such as speech rate and clarity, an evaluator judges each accent phrase on intonation alone, and the resulting correct/incorrect judgment is assigned as the class.
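The sign convention and the margin criterion described above can be made concrete with a small sketch. It assumes a linear decision function f(x) = w·x + b, and it only evaluates given values of w and b rather than training them; the function names and the one-dimensional toy samples are invented for illustration.

```python
def decision_value(w, b, x):
    """f(x) = w . x + b for a linear decision function."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Class A if f(x) is positive, otherwise class B.

    A point exactly on the boundary (f(x) == 0) falls to B here by convention.
    """
    return "A" if decision_value(w, b, x) > 0 else "B"


def min_margin(w, b, samples):
    """Smallest geometric distance from any sample to the boundary f(x) = 0."""
    norm = sum(wi * wi for wi in w) ** 0.5
    return min(abs(decision_value(w, b, x)) / norm for x in samples)


# One-dimensional toy data: class A at x = 2 and 3, class B at x = 0 and -1.
samples = [(2.0,), (3.0,), (0.0,), (-1.0,)]

centered = min_margin((1.0,), -1.0, samples)    # boundary at x = 1.0
off_center = min_margin((1.0,), -1.5, samples)  # boundary at x = 1.5
```

Both boundaries separate the toy data correctly, but the one at x = 1.0, midway between the nearest samples of each class, has the larger minimum distance (1.0 versus 0.5), so it is the one the margin criterion prefers.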

For the feature quantities used to judge whether the intonation is correct, quality can be evaluated more accurately by using global parameters as well as local quality evaluation parameters. Examples of quality evaluation parameters follow. As shown in FIG. 3, for each of the target prosodic pattern and the prosodic pattern of the synthesized speech, a regression line is fitted to each mora and its slope, the regression coefficient, is obtained, giving a sequence of regression coefficients with one entry per mora. The correlation coefficient between the two sequences is computed and used as a parameter. The correlation coefficient is likewise computed for the Δ components of the regression coefficient sequences and used as a parameter. Other examples include the difference between the regression coefficients of the mora containing the accent nucleus and the following mora, and the gap in fundamental frequency between moras.
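The regression-slope features described above can be sketched in plain Python as follows. Two points are assumptions rather than statements from the text: each mora is represented as a list of F0 samples at equally spaced frames, and the Δ component is taken to be the first difference between consecutive slopes. All function names are hypothetical.

```python
def slope(values):
    """Least-squares slope of values against their frame index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_a * var_b) ** 0.5


def delta(seq):
    """First difference between consecutive elements (assumed meaning of delta)."""
    return [later - earlier for earlier, later in zip(seq, seq[1:])]


def intonation_features(target_moras, synth_moras):
    """Return (slope correlation, delta-slope correlation) for one accent phrase."""
    target_slopes = [slope(m) for m in target_moras]
    synth_slopes = [slope(m) for m in synth_moras]
    return (pearson(target_slopes, synth_slopes),
            pearson(delta(target_slopes), delta(synth_slopes)))
```

Both values lie between -1 and 1: a synthesized contour that rises and falls in step with the target yields values near 1, while one that moves in the opposite direction yields values near -1. Such values would then form part of the feature vector passed to the intonation classifier.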

An SVM that can judge, when a speech unit selection result is input, whether the intonation of the synthesized speech created from those speech units is correct or incorrect is created in advance from a large amount of training data in which the features described above are paired with a class (intonation correct or incorrect).
The quality evaluation unit 33 (FIG. 2), which evaluates the quality of the synthesized speech using the SVM, first extracts the features required as input to the intonation SVM from the speech unit selection result obtained by the speech unit sequence search unit 32 (FIG. 2). These features are then input to the SVM, which yields a judgment of whether the intonation quality of the synthesized speech obtained from the speech unit selection result is good or poor.

As in the speech unit search, one could evaluate each of the above features individually and judge intonation correctness from a total cost obtained as a linear sum of the individual evaluation values. However, because the SVM can model the non-linear relationship between multiple features and human correct/incorrect judgments, it allows more accurate judgments than a total cost based on a simple linear sum.
In addition to intonation, which is derived from the fundamental frequency, quality evaluation using an SVM is performed in the same way for speech rate, derived from phoneme durations; for continuity, derived from the gap values of the fundamental frequency and the spectrum at unit concatenation points; and for intelligibility, derived from the match of the phonemic environment.
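The contrast between a linear-sum cost and a kernel SVM can be illustrated with a toy example. Equation (2) is not reproduced in this part of the description, so the sketch below assumes the standard SVM decision function f(x) = Σ αᵢ yᵢ K(xᵢ, x) + b with an RBF kernel. The support vectors and coefficients are hand-picked for illustration, not trained, and the two binary features are hypothetical.

```python
import math

def rbf_kernel(u, v, gamma=1.0):
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_decision(x, support_vectors, gamma=1.0, bias=0.0):
    """Standard kernel SVM decision value; the sign selects the class."""
    return bias + sum(alpha * y * rbf_kernel(sv, x, gamma)
                      for sv, y, alpha in support_vectors)

def linear_cost(x, weights):
    """Total cost as a plain weighted sum of the feature values."""
    return sum(w * v for w, v in zip(weights, x))

# Toy data: the pattern is "good" (+1) only when both features agree.
# No single threshold on a weighted sum separates these four points.
SUPPORT = [
    ((0.0, 0.0), +1, 1.0),
    ((1.0, 1.0), +1, 1.0),
    ((0.0, 1.0), -1, 1.0),
    ((1.0, 0.0), -1, 1.0),
]

for point in [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]:
    verdict = "good" if svm_decision(point, SUPPORT) > 0 else "poor"
    print(point, verdict, "linear cost:", linear_cost(point, (1.0, 1.0)))
```

With equal weights the two "poor" points receive the same linear cost, which lies between the costs of the two "good" points, so no threshold on that sum can reproduce the labels, whereas the kernel decision value has the right sign for all four.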

FIG. 4 shows an example of the quality evaluation unit 33 using a plurality of quality evaluation scales. This example describes quality evaluation on three scales, namely intonation, speech rate, and continuity, but the number of evaluation scales is not limited to three. Feature extraction means 33A-1, 33B-1, and 33C-1 are provided ahead of the respective quality evaluations. For the intonation evaluation, the correlation coefficient of the regression coefficients, the correlation coefficient of the Δ regression coefficients, the difference in regression coefficients at the accent nucleus, and the like are extracted as features. For the speech rate evaluation, the duration of each mora, the vowel duration of each mora, the average duration of each vowel, and the like are extracted as features. For the continuity evaluation, the gap value of the fundamental frequency, the gap value of the spectrum, the preceding phonemic environment, the following phonemic environment, and the like are extracted as features. The SVMs 33A-2, 33B-2, and 33C-2 perform their evaluations using these features.
A good quality evaluation result output by the SVMs 33A-2, 33B-2, and 33C-2 for the respective scales is indicated by ○, and a poor result by ×.
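The arrangement of one feature extractor and one classifier per scale can be sketched as below. The classifiers here are hypothetical threshold functions standing in for trained SVMs, and the dictionary keys, thresholds, and the convention that a negative score means poor quality are assumptions for the example.

```python
from typing import Callable, Dict, List, Tuple

Features = List[float]
Extractor = Callable[[dict], Features]     # stands in for 33A-1, 33B-1, 33C-1
Classifier = Callable[[Features], float]   # stands in for 33A-2, 33B-2, 33C-2

def evaluate_quality(selection: dict,
                     scales: Dict[str, Tuple[Extractor, Classifier]]) -> Dict[str, float]:
    """Run each scale's extractor and classifier; return one signed score per scale."""
    return {name: classify(extract(selection))
            for name, (extract, classify) in scales.items()}

def poor_scales(scores: Dict[str, float]) -> List[str]:
    """Names of the scales judged poor (assumed convention: negative score = poor)."""
    return [name for name, score in scores.items() if score < 0]

# Hypothetical stand-ins; real classifiers would be SVMs trained per scale.
SCALES = {
    "intonation": (lambda s: [s["slope_corr"]], lambda f: f[0] - 0.5),
    "speech_rate": (lambda s: [s["mean_mora_sec"]], lambda f: 0.05 - abs(f[0] - 0.15)),
    "continuity": (lambda s: [s["f0_gap_hz"]], lambda f: 20.0 - f[0]),
}

selection = {"slope_corr": 0.9, "mean_mora_sec": 0.15, "f0_gap_hz": 35.0}
scores = evaluate_quality(selection, SCALES)
print(poor_scales(scores))
```

Keeping the scales in a dictionary makes it straightforward to add a fourth scale such as intelligibility without changing the evaluation loop.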

The output speech unit sequence determination unit 34 decides, from the quality evaluation result obtained by the quality evaluation unit 33, whether to use the selected speech units or to discard them and reselect.
If any of the quality evaluation scales for an accent phrase is judged to be poor, the output speech unit sequence determination unit 34 shifts processing to a reselection that gives more importance to the weight coefficient of the unit search parameter corresponding to that scale. For example, when the intonation SVM judges the quality to be poor, a fixed amount, such as 50% of the initial value, is added to the weight coefficient of the unit search parameter for the fundamental frequency. Alternatively, the amount to add is determined dynamically from the value output by the SVM, that is, the value of f(x⁻) in equation (2), and the weight coefficient is changed accordingly. When f(x⁻) takes a value indicating the poor-quality class, a large absolute value means the result is far from the good-quality class, so the amount added to the weight coefficient is made large; conversely, a small absolute value means the result is close to the good-quality class, so the amount added is made small.
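The two weight update strategies can be sketched as follows. The fixed variant uses the 50% figure given in the text; the dynamic variant assumes a simple proportional rule with a cap, since the description does not specify how the SVM output maps to the added amount. The names `gain` and `cap` and the weight keys are hypothetical.

```python
def bump_fixed(weights, initial, key, ratio=0.5):
    """Add a fixed fraction of the initial value to one weight coefficient."""
    updated = dict(weights)
    updated[key] += ratio * initial[key]
    return updated

def bump_dynamic(weights, initial, key, score, gain=0.5, cap=2.0):
    """Add an amount that grows with |score| when the SVM output is negative.

    score is the SVM decision value; a negative value is assumed to mean the
    poor-quality class. gain and cap are illustrative tuning constants.
    """
    if score >= 0:
        return dict(weights)  # judged good: leave the weights unchanged
    updated = dict(weights)
    updated[key] += min(gain * abs(score), cap) * initial[key]
    return updated

initial = {"f0": 1.0, "duration": 1.0, "gap": 1.0}
print(bump_fixed(initial, initial, "f0"))
print(bump_dynamic(initial, initial, "f0", score=-2.0))
print(bump_dynamic(initial, initial, "f0", score=-0.2))
```

The cap keeps a single extreme decision value from letting one parameter dominate the search cost, which is a design choice of this sketch rather than something stated in the source.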

These processes are performed by the weight setting units 31-1, 31-2, and 31-3 of the weight set setting and setting change unit 31 (FIG. 5). The speech unit sequence search unit 32 then reselects the units and the quality evaluation unit 33 judges the quality again, and this is repeated until the quality becomes good. If more than one scale is judged to be poor, the weight coefficients corresponding to all of the poor scales are changed. If the quality is still judged to be poor after reselection has been repeated the preset number of times, the speech unit sequence found with the initial weight set is chosen as the output.
When all of the quality evaluation SVMs 33A-2, 33B-2, and 33C-2 in the quality evaluation unit 33 judge the quality of the accent phrase to be good, the output speech unit sequence determination unit 34 decides to output the speech unit sequence representing that accent phrase.

Because quality evaluation is performed per accent phrase, the processing described so far, namely the speech unit search in the speech unit sequence search unit 32, the quality evaluation in the quality evaluation unit 33, the determination of the output speech unit sequence in the output speech unit sequence determination unit 34, and the change of the weight set in the weight set setting and setting change unit 31, is performed one accent phrase at a time from the beginning of the input text.
This series of processes is carried out for each accent phrase until the end of the input text. The input text end determination unit 35 determines whether the input text has ended. If it has not, the weight set setting and setting change unit 31 resets the weight set to its initial values (FIG. 5) and the unit search for the next accent phrase is performed. If it has, the synthesis processing unit 40 (FIG. 1) smoothly concatenates adjacent speech units of the speech unit sequences to be used as synthesized speech and outputs the result as synthesized speech data.

FIG. 7 is a flowchart explaining the operation of the speech unit search unit 30 described above. It shows the basic processing flow of the speech unit search unit 30 used in the speech synthesis method of the present invention.
Text analysis of the input text calculates the number N of accent phrases in the input text, and the reselection allowable number S is set (step S7-1).
The input text end determination unit 35 determines whether processing has been completed for all accent phrases (step S7-2).
The n-th accent phrase is set as the processing target (step S7-3).

The weight set of the weight set setting and setting change unit 31 is set to its initial values (step S7-4).
The count value of the reselection allowable number S is set to 1 (step S7-5).
The speech unit sequence search unit 32 searches for a speech unit sequence (step S7-6).
The quality evaluation unit 33 evaluates quality on each of the plural scales and obtains a result of "good quality" or "poor quality" for each (step S7-7).
The output speech unit sequence determination unit 34 determines whether any scale has been judged to be of poor quality (step S7-8).

If the result of step S7-8 is NO, the speech unit sequence found at that point is determined to be the "speech unit sequence" of the n-th accent phrase (step S7-11).
The accent phrase counter n is incremented by 1 and the process returns to step S7-2 (step S7-12).
Whether processing has been completed for all accent phrases is determined (step S7-2). If it has, the process branches to step S7-13, and the "speech unit sequence" of each accent phrase constituting the input text is output to the synthesis processing unit 40 (FIG. 1).

If processing has not been completed for all accent phrases, steps S7-3 to S7-10 are repeated to search for the speech unit sequence of the next accent phrase.
If step S7-8 determines that there is an evaluation scale with poor quality, the weight set values are changed in the weight set setting change step S7-9, the count value of the reselection allowable number S is incremented by 1 (step S7-10), and the process returns to step S7-6 to continue the search for the speech unit sequence of that accent phrase.
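The flow of FIG. 7 can be sketched as a pair of nested loops. This is an illustrative reading of the flowchart, not the patented code: `search`, `evaluate`, and `adjust_weights` are hypothetical callables supplied by the caller, `max_searches` is assumed to count searches per accent phrase (it must be at least 1), and the fallback to the first candidate follows the statement that the sequence found with the initial weight set is output when the limit is reached.

```python
def select_units_for_text(accent_phrases, search, evaluate, adjust_weights,
                          initial_weights, max_searches):
    """Return one speech unit sequence per accent phrase (steps S7-1 to S7-13).

    search(phrase, weights)      -> candidate unit sequence        (S7-6)
    evaluate(candidate)          -> list of scales judged poor     (S7-7)
    adjust_weights(weights, bad) -> new weight set                 (S7-9)
    """
    results = []
    for phrase in accent_phrases:                   # S7-2, S7-3, S7-12
        weights = dict(initial_weights)             # S7-4: reset per phrase
        first_candidate = None
        chosen = None
        for _ in range(max_searches):               # S7-5, S7-10: count searches
            candidate = search(phrase, weights)     # S7-6
            if first_candidate is None:
                first_candidate = candidate         # found with the initial weights
            poor = evaluate(candidate)              # S7-7
            if not poor:                            # S7-8: no poor scale
                chosen = candidate                  # S7-11
                break
            weights = adjust_weights(weights, poor) # S7-9
        if chosen is None:
            chosen = first_candidate                # limit reached: initial-weight result
        results.append(chosen)
    return results                                  # S7-13: hand over to synthesis
```

Passing the three operations as callables keeps the control flow independent of how the database search and the SVM classifiers are actually implemented.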

FIGS. 8 and 9 show the processing flow of a speech synthesizer that implements the speech synthesis method proposed in claim 2 of the present invention. In this embodiment, as exception handling, the synthesis processing unit 40 manipulates the fundamental frequency and the phoneme durations of the speech waveform, for example by signal processing such as that disclosed in Patent Document 3 (Japanese Patent No. 3557124). When considering combinations of speech units alone yields only low-quality synthesized speech, such signal processing can produce synthesized speech of higher quality than would be obtained without it. The output speech unit sequence determination unit 34 decides whether to perform signal processing according to whether reselection has been repeated the prescribed number of times, and passes the decision to the synthesis processing unit 40.

That is, in the processing flow shown in FIGS. 8 and 9, steps S7-1 to S7-13 correspond to the steps shown in FIG. 7, and this second embodiment is characterized by the addition of steps S8-1, S8-2, and S8-3 shown in FIG. 9.
Step S8-1 is a reselection allowable number determination step. It determines whether the number of executions of the speech unit sequence search step S7-6 and the quality evaluation step S7-7 has reached the reselection allowable number S. If NO, the routine returns to step S7-6 via the weight set setting change step S7-9, and the speech unit sequence search continues.

On the other hand, if the output speech unit sequence determination step S7-8 finds no scale judged to be of poor quality, step S8-2 determines that the n-th accent phrase currently being processed does not need signal processing. Step S7-11 then determines the speech unit sequence found at that point to be the speech unit sequence of this n-th accent phrase, and the speech unit sequence is output to the synthesis processing unit 40 together with information indicating that signal processing is unnecessary.
If the reselection allowable number determination step S8-1 determines that the number of executions of steps S7-7 and S7-8 has reached the reselection allowable number S even though not all quality evaluation results have become good, the process branches to step S8-3. Step S8-3 determines that the n-th accent phrase currently being processed needs signal processing, and the speech unit sequence found at that point is output to the synthesis processing unit 40 together with information indicating that signal processing is necessary.
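The per-phrase logic of the second embodiment can be sketched as below. As in any such sketch, `search`, `evaluate`, and `adjust_weights` are hypothetical callables, and `max_searches` is an assumed name for the reselection allowable number S. Unlike the FIG. 7 fallback, this variant returns the candidate found at the point the limit is reached, together with a flag telling the synthesis stage whether signal processing is needed.

```python
def select_with_signal_flag(phrase, search, evaluate, adjust_weights,
                            initial_weights, max_searches):
    """Return (unit_sequence, needs_signal_processing) for one accent phrase."""
    weights = dict(initial_weights)
    executions = 0
    while True:
        candidate = search(phrase, weights)      # S7-6
        poor = evaluate(candidate)               # S7-7
        executions += 1
        if not poor:                             # S7-8: every scale is good
            return candidate, False              # S8-2: no signal processing
        if executions >= max_searches:           # S8-1: limit reached
            return candidate, True               # S8-3: request signal processing
        weights = adjust_weights(weights, poor)  # S7-9, then back to S7-6
```

A caller would collect these pairs for every accent phrase and let the synthesis stage modify the fundamental frequency or durations only for the phrases flagged `True`.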

That is, the processing flow shown in FIGS. 8 and 9 adds the signal processing proposed in Patent Document 3 to the processing flow shown in FIG. 7, which has the effect that adjacent speech units of the speech unit sequence used as synthesized speech are concatenated smoothly.
The speech synthesis method according to the present invention described above, and the speech synthesizer that operates using this method, are most desirably realized by installing the speech synthesis program proposed in the present invention on a computer and executing it there. The speech synthesis program proposed in the present invention is written in a computer-readable programming language and recorded on a computer-readable recording medium such as a magnetic disk or CD-ROM. It is installed on the computer from such a recording medium or through a communication line, and is then decoded and executed by the CPU of the computer.

In outline, the program consists of routines that implement the components 10 to 40 shown in FIG. 1. The part that particularly characterizes the present invention is the configuration of the speech unit search unit 30 shown in FIG. 2, the details of which lie in the processing procedures shown in FIG. 7 or in FIGS. 8 and 9.

The speech synthesis method and apparatus according to the present invention can be used as a voice guidance device for various kinds of equipment.

A block diagram for explaining an outline of the present invention.
A block diagram for explaining details of the speech unit search unit shown in FIG. 1.
A diagram showing the regression lines of the target prosody and the regression lines of the prosody of the synthesized speech.
A block diagram for explaining details of a quality evaluation unit that can be used in the speech unit search unit shown in FIG. 2.
A block diagram for explaining an example of the configuration of the weight set setting unit shown in FIG. 2.
A flowchart for explaining an example of the processing flow of the speech unit search unit shown in FIG. 4.
A flowchart for explaining an example in which a signal processing function is added to the processing flow shown in FIG. 2.
A flowchart for explaining an example in which a signal processing function is added to the processing flow shown in FIG. 7.
A flowchart for explaining the continuation of the processing flow shown in FIG. 8.

Explanation of symbols

TX: input text
OD: synthesized speech data
10: text analysis processing unit
20: prosody generation unit
30: speech unit search unit
31: weight set setting and setting change unit
32: speech unit sequence search unit
33: quality evaluation unit
34: output speech unit sequence determination unit
35: input text end determination unit
40: synthesis processing unit
50: speech database

Claims

1. In a speech synthesis method that has a database storing a speech corpus holding speech data and that, in order to convert arbitrary input text into speech, searches for and concatenates speech units of arbitrary length contained in the speech corpus to generate synthesized speech whose content matches the input text,
A text analysis processing step for obtaining a reading by morphological analysis of the input text and obtaining the number of accent phrases and tone combination information necessary for generating a prosodic pattern;
A prosodic pattern generation step for generating a target prosodic pattern using the information obtained in the text analysis processing step;
A processing target accent phrase setting step for setting a processing target accent phrase according to a target prosodic pattern;
A weight set setting and setting change step of performing, for each accent phrase, a process of setting the weight set to initial values and a process of changing the weight set values;
A speech segment search step of selecting a speech segment based on the prosodic pattern generated in the prosodic pattern generation step and setting a speech segment sequence;
A quality evaluation step of performing a quality evaluation of the synthesized speech for each of a plurality of evaluation scales with respect to the speech unit sequence searched in the speech unit search step;
An output speech unit sequence determination step of detecting whether the evaluation result of the quality evaluation step includes an evaluation scale judged to be poor, returning, if one is present, to the speech unit sequence search step via the setting change process of the weight set setting and setting change step, and determining, if none is present, the speech unit sequence found at that point to be the speech unit sequence of the processing target accent phrase;
An input text end determination step of determining, each time the speech unit sequence of an accent phrase is determined in the output speech unit sequence determination step, whether processing has been completed for all accent phrases contained in the input text, outputting, if processing is determined to be completed, the speech unit sequence of each accent phrase constituting the input text to a synthesized speech processing step, and, if processing is determined not to be completed, setting the next accent phrase as the processing target accent phrase and causing the weight set setting and setting change step, the speech unit sequence search step, the quality evaluation step, and the output speech unit sequence determination step to be executed repeatedly up to a preset reselection allowable number of times;
A speech synthesis method comprising the above steps.

2. The speech synthesis method according to claim 1, characterized by the addition of:
A reselection allowable number determination step of determining, when the evaluation result of the quality evaluation step includes an evaluation scale judged to be of poor quality, whether the number of executions of the speech unit sequence search step and the quality evaluation step has reached the reselection allowable number, and, if it has not, returning to the speech unit sequence search step via the weight set setting change step;
A signal processing necessity determination step of determining, when the reselection allowable number determination step determines that the number has been reached, the speech unit sequence found at that point to be the speech unit sequence of the accent phrase currently being processed, together with information indicating that signal processing is necessary for that accent phrase; and
A signal processing unnecessary determination step of determining, when the quality evaluation step finds no evaluation scale judged to be of poor quality, the speech unit sequence found at that point to be the speech unit sequence of the accent phrase currently being processed, together with information indicating that signal processing is unnecessary for that accent phrase.

3. The speech synthesis method according to claim 1, wherein the evaluation scales used in the quality evaluation step include at least one of an evaluation scale for intonation, an evaluation scale for speech rate, and an evaluation scale for continuity.

4. The speech synthesis method according to claim 1, wherein the weight set set in the weight set setting and setting change step comprises a combination of at least any of a parameter weight for the fundamental frequency, a parameter weight for duration, and a parameter weight for the gap at concatenation points.

5. In a speech synthesizer that has a database storing a speech corpus holding speech data and that, in order to convert arbitrary input text into speech, searches for and concatenates speech units of arbitrary length contained in the speech corpus to generate synthesized speech whose content matches the input text,
A text analysis processing unit that obtains readings by morphological analysis of input text and obtains the number of accent phrases and tone combination information necessary for prosody pattern generation;
Using the information obtained by the text analysis processing unit, a prosody pattern generation unit that generates a target prosody pattern,
A processing target accent phrase setting unit for setting an accent phrase to be processed according to a target prosodic pattern;
A weight set setting and setting change unit for performing a process of setting a weight set to an initial value for each accent phrase and a process of changing a weight set value;
A speech segment search unit that selects a speech unit based on the prosodic pattern generated by the prosody pattern generation unit;
A quality evaluation unit that evaluates the quality of the synthesized speech for each of a plurality of evaluation scales with respect to the speech unit sequence searched by the speech unit search unit;
An output speech unit sequence determination unit that detects whether the evaluation result of the quality evaluation unit includes an evaluation scale judged to be poor, that, if one is present, returns to the speech unit sequence search process of the search unit via the setting change process of the weight set setting and setting change unit, and that, if none is present, determines the speech unit sequence found at that point to be the speech unit sequence of the processing target accent phrase;
An input text end determination unit that determines, each time the output speech unit sequence determination unit determines the speech unit sequence of an accent phrase, whether processing has been completed for all accent phrases contained in the input text, that, if processing is determined to be completed, outputs the speech unit sequence of each accent phrase constituting the input text to a synthesized speech processing unit, and that, if processing is determined not to be completed, sets the next accent phrase as the processing target accent phrase and causes the weight set setting and setting change unit, the speech unit sequence search unit, the quality evaluation unit, and the output speech unit sequence determination unit to operate repeatedly up to a preset reselection allowable number of times;
A speech synthesizer comprising the above units.

6. The speech synthesizer according to claim 5, characterized by the addition of:
A reselection allowable number determination unit that determines, when the evaluation result of the quality evaluation unit includes an evaluation scale judged to be of poor quality, whether the number of executions of the speech unit sequence search unit and the quality evaluation unit has reached the reselection allowable number, and that, if it has not, returns to the search operation of the speech unit sequence search unit via the weight set setting change process of the weight set setting change unit;
A signal processing necessity determination unit that, when the reselection allowable number determination unit determines that the number has been reached, determines the speech unit sequence found at that point to be the speech unit sequence of the accent phrase currently being processed, together with information indicating that signal processing is necessary for that accent phrase; and
A signal processing unnecessary determination unit that, when the quality evaluation unit finds no evaluation scale judged to be of poor quality, determines the speech unit sequence found at that point to be the speech unit sequence of the accent phrase currently being processed, together with information indicating that signal processing is unnecessary for that accent phrase.

7. The speech synthesizer according to claim 5, wherein the evaluation means provided in the quality evaluation unit include at least one of an evaluation means for the intonation, an evaluation means for the speech rate, and an evaluation means for the continuity of the speech unit sequence determined on the basis of the prosodic pattern generated from the input text.

8. The speech synthesizer according to claim 5, wherein the weight set set by the weight set setting and setting change unit comprises, for the speech unit sequence determined on the basis of the prosodic pattern generated from the input text, a combination of at least any of a parameter weight for the fundamental frequency, a parameter weight for duration, and a parameter weight for the gap at concatenation points.

9. A speech synthesis program that is written in a computer-readable program language and causes the computer to function as the speech synthesizer according to claim 5.

10. A computer-readable recording medium on which at least the speech synthesis program according to claim 9 is recorded.