JP5712818B2

JP5712818B2 - Speech synthesis apparatus, sound quality correction method and program

Info

Publication number: JP5712818B2
Application number: JP2011146033A
Authority: JP
Inventors: 野田　拓也
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-06-30
Filing date: 2011-06-30
Publication date: 2015-05-07
Anticipated expiration: 2031-06-30
Also published as: JP2013011828A

Description

The present invention relates to a speech synthesizer and a sound quality correction method.

Speech synthesis technology analyzes text input by a user, converts it into synthesized speech, and reads it aloud. For example, a known technique analyzes the input text to create a target prosody, then generates synthesized speech by selecting waveforms appropriate to the target prosody from a waveform dictionary, which holds the information for assembling the output speech, and concatenating them. When no suitable waveform is available for selection, an existing waveform is modified to fit the intended use, or a new waveform is generated, with the aim of producing natural-sounding speech.

To improve the quality of synthesized speech, a known method provides prediction means that predicts the quality of the synthesized speech and presents to the user any phoneme segments whose quality the prediction means judges to be outside a set range. In another example, segment selection is performed by a primary selection unit, which selects a plurality of candidate synthesis units using a predetermined evaluation criterion, and a secondary selection unit, which picks from those candidates the segments that reduce distortion at segment joins. In that example, alternative segment processing is additionally performed when it is determined that a certain level of quality cannot be obtained at synthesis time.

In some cases, whether the prosody of the synthesized speech is natural is judged separately for each of several parameters, such as speech rate and intonation, and either signal processing or a renewed search of the speech segment sequence is carried out according to the result.

A method is also known in which the user listens to the synthesized speech and indicates the portions of poor quality, and the sound quality is corrected by penalizing the phoneme segments corresponding to those portions and re-synthesizing.

JP-A-1-284898
JP 2006-313176 A
JP-A-8-263095
Japanese Patent No. 3423276
JP 2008-139631 A

The quality of current synthesized speech, as described above, falls short of human speech. For example, because quality degradation occurs at scattered points in synthesized speech, many users want a simple way to correct it.

However, when prediction means presents to the user the phoneme segments whose quality is outside the set range, the presented correction positions may not match the user's perception, so the sound quality may not be corrected the way the user intends. When phonemes or the like are reselected to improve the quality of synthesized speech, simply re-synthesizing with the reselected phonemes does not guarantee that the newly chosen phoneme or phoneme sequence has the sound quality the user wants, so the sound quality may remain uncorrected.

On the other hand, when the user points out the poor-quality portions, it is difficult for the user to specify the correction position precisely, and a vague specification may cause sound quality correction to be applied to portions the user did not intend.

In view of the above problems, the present invention provides a speech synthesizer, a sound quality correction method, and a program capable of correcting synthesized speech to match the user's subjective perception.

A speech synthesizer according to one aspect includes a synthesized speech acquisition unit, a range acquisition unit, a degradation cost calculation unit, a degradation type determination unit, a degradation type acquisition unit, a correction information generation unit, and a speech synthesis unit. The synthesized speech acquisition unit acquires synthesized speech. The range acquisition unit acquires a first range of the speech in which correction is to be performed. The degradation cost calculation unit calculates at least one degradation cost, which is information corresponding to sound quality degradation in the first range. The degradation type determination unit determines, based on the at least one degradation cost, at least one correction candidate degradation type that corresponds to the nature of the sound quality degradation and is a candidate for correction in the first range. The degradation type acquisition unit acquires a correction degradation type selected for correction from among the determined correction candidate degradation types. The correction information generation unit generates correction information for correcting the speech based on the correction degradation type. The speech synthesis unit re-synthesizes the speech based on the correction information.

A speech correction method according to another aspect acquires synthesized speech, acquires a first range of the speech in which correction is to be performed, and calculates at least one degradation cost, which is information corresponding to sound quality degradation in the first range. The method also determines, based on the at least one degradation cost, at least one correction candidate degradation type that corresponds to the nature of the sound quality degradation and is a candidate for correction in the first range. The method further acquires a correction degradation type selected for correction from among the determined correction candidate degradation types, generates correction information for correcting the speech based on the correction degradation type, and re-synthesizes the speech based on the correction information.

A program for causing a computer to perform the method according to the present invention described above, when executed by that computer, provides the same operations and effects as the method, and therefore also solves the problems described above.

According to the speech synthesizer, speech correction method, and program of the aspects described above, synthesized speech can be corrected to match the user's subjective perception.

A block diagram showing the configuration of the speech synthesizer according to the first embodiment.
A block diagram showing the functions of the speech synthesizer according to the first embodiment.
A diagram showing the segment selection penalty according to the first embodiment.
A diagram showing the segment connection penalty according to the first embodiment.
A diagram explaining the "phoneme environment" according to the first embodiment.
A diagram showing the degradation cost functions and degradation types according to the first embodiment.
A flowchart showing the degradation type display operation of the speech synthesizer according to the first embodiment.
A diagram showing a display example of the phonetic text and continuous phoneme sequence according to the first embodiment.
A flowchart showing the degradation type determination operation according to the first embodiment.
A diagram showing an example of degradation positions according to the first embodiment.
A flowchart showing the degradation type selection operation according to the first embodiment.
A flowchart showing the degradation type determination operation according to a modification of the first embodiment.
A block diagram showing the functions of the speech synthesizer according to the second embodiment.
A flowchart showing the degradation range designation operation according to the second embodiment.
A flowchart showing the degradation type designation operation according to the second embodiment.
A block diagram showing an example of the hardware configuration of a standard computer.

(First embodiment)
The speech synthesizer according to the first embodiment is described below with reference to the drawings. In this specification, a phoneme is a unit corresponding to an individual vowel, moraic nasal (hatsuon), geminate consonant (sokuon), or consonant. A mora is a unit corresponding to one sound of Japanese speech, such as a vowel, a moraic nasal, a geminate consonant, or a "consonant + vowel" pair.

First, the configuration of the speech synthesizer 1 according to the first embodiment is described with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the speech synthesizer according to the first embodiment, and FIG. 2 is a block diagram showing its functions.

As shown in FIG. 1, the speech synthesizer 1 includes an input unit 3, a storage unit 5, a speech synthesis unit 7, a sound quality correction determination unit 9, and an output unit 11, which are connected to one another by a bus 13. The input unit 3 is a device, such as a keyboard, a mouse, or a touch panel, that, when operated by a user of the speech synthesizer 1, acquires the various inputs associated with the operation and sends the acquired input information to the speech synthesis unit 7 or the sound quality correction determination unit 9.

The storage unit 5 is, for example, a readable semiconductor memory, a semiconductor memory that can be written and read at any time, or a portable storage medium that can be written and read at any time. The storage unit 5 stores a program for controlling the basic operation of the speech synthesizer 1 and the information needed to execute the speech synthesis described later, and it also serves as a work area as needed during speech synthesis processing.

The speech synthesis unit 7, for example, generates a target prosody for speech synthesis from the phonetic text that the user inputs via the input unit 3, selects appropriate speech segments according to the target prosody, and generates selection connection information for connecting the selected segments to one another. The speech synthesis unit 7 then performs waveform processing based on the generated selection connection information to produce synthesized speech.

The sound quality correction determination unit 9 acquires the synthesized speech generated by the speech synthesis unit 7, acquires the sound quality degradation range that the user has input as the range in which they perceive degradation, and presents to the user, for that range, degradation types corresponding to the nature of the degradation. Examples of degradation types include "poor sound (sound quality)", "poor intonation", and "poor articulation". When the user selects, from the presented degradation types, the correction degradation type to be corrected, the sound quality correction determination unit 9 acquires the selected correction degradation type and outputs correction information for performing a correction appropriate to it.

The output unit 11 comprises a speaker that outputs the speech synthesized by the speech synthesis unit 7 and the speech re-synthesized based on the correction information output from the sound quality correction determination unit 9, together with a display device or the like that displays the degradation types determined by the sound quality correction determination unit 9. The bus 13 is a communication path that connects the above devices to one another and carries data between them.

Next, the configuration of the speech synthesizer 1 is described in more detail with reference to FIG. 2. As shown in FIG. 2, the speech synthesis unit 7 has the functions of a language processing unit 23, a prosody generation unit 25, a segment selection/connection unit 27, and a waveform processing unit 29. The storage unit 5 stores a language dictionary 31, a prosody dictionary 33, and a waveform dictionary 35. The sound quality correction determination unit 9 has the functions of a range acquisition unit 41, a degradation cost calculation unit 43, a degradation type determination unit 45, a degradation position acquisition unit 47, and a correction means determination unit 49. The sound quality correction User Interface (UI) unit 21 collectively represents the interface functions of the input unit 3 and the output unit 11.

The speech synthesis unit 7 and the sound quality correction determination unit 9 can be implemented, for example, by storing programs for their respective functions in a nonvolatile storage device in advance and having an arithmetic unit (not shown) read and execute those programs.

The sound quality correction UI unit 21 is an interface device for exchanging information between the user and the speech synthesizer 1. It receives from the user the input text to be synthesized and presents the phonetic text and the synthesized speech waveform on, for example, the screen of a display device (not shown). The sound quality correction UI unit 21 also outputs the synthesized speech as audio, and when the user perceives sound quality degradation in it, acquires the sound quality degradation range that the user specifies on the synthesized speech waveform or the phonetic text as the range they perceive to be degraded. Furthermore, the unit is configured to present the degradation types determined within the user-specified range, for example in a pop-up menu, so that the user can select the degradation type that matches the degradation they perceived.

The language processing unit 23 of the speech synthesis unit 7 analyzes the input text received via the sound quality correction UI unit 21, performing part-of-speech decomposition and similar analysis with reference to the language dictionary 31 stored in the storage unit 5, and generates phonetic text. The input text is the sentence or other text to be synthesized, for example "今日の天気は晴れです。" ("Today's weather is sunny."). The phonetic text expresses the reading of the input text in katakana and marks accent positions and the like with symbols; for example, it is a character string such as "キョ’ーノ／テ’ンキハ／ハレ’デス．". Here, "’" marks an accent position, "／" marks an accent phrase boundary, and "．" marks the end of the sentence (full stop). An accent phrase is a unit delimited at the phrase (bunsetsu) boundaries that make up one accent, or a finer unit. The example above consists of three accent phrases, each of which coincides with a phrase boundary unit. The language dictionary 31 is information that records word readings, parts of speech, grammar, and accent information.
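The phonetic text notation described above can be illustrated with a minimal Python sketch. The function name `parse_phonetic_text` and the returned dictionary layout are hypothetical and do not come from the patent; the sketch only assumes the three symbols named in the text (accent mark, accent phrase boundary, sentence end) and accepts both their full-width and ASCII forms.

```python
# Hypothetical illustration: split phonetic text into accent phrases.
# Full-width symbols are normalized to ASCII before parsing.
NORMALIZE = str.maketrans({"\u2019": "'", "\uff0f": "/", "\uff0e": "."})


def parse_phonetic_text(text):
    """Return (phrases, ends_sentence) for a phonetic text string.

    Each phrase is a dict with the katakana reading (symbols removed) and
    'accent_after', the number of characters preceding the accent mark,
    or None when the phrase carries no accent mark.
    """
    normalized = text.translate(NORMALIZE)
    ends_sentence = normalized.endswith(".")
    body = normalized.rstrip(".")
    phrases = []
    for chunk in body.split("/"):
        position = chunk.find("'")
        phrases.append({
            "reading": chunk.replace("'", ""),
            "accent_after": position if position >= 0 else None,
        })
    return phrases, ends_sentence


phrases, ends_sentence = parse_phonetic_text("キョ'ーノ/テ'ンキハ/ハレ'デス.")
for phrase in phrases:
    print(phrase["reading"], phrase["accent_after"])
```

For the example sentence this yields three accent phrases, consistent with the description above.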

The prosody generation unit 25 further breaks the generated phonetic text down into phonemes and, referring to the prosody dictionary 33, generates the target prosody that speech synthesis aims for. Here, prosody is information that represents, for example, the length, pitch, and loudness of sounds as changes in frequency and amplitude over time. The prosody dictionary 33 stores prosody information obtained by extracting, from a large volume of speech data from narrators, announcers, and the like, variations of rhythm formed by the strength, length, and pitch of sounds, and processing them statistically. The target prosody is the prosody aimed for when generating synthesized speech from the input text: a rhythm of appropriate changes in strength, length, and pitch, generated from the prosody information stored in the prosody dictionary 33 on the basis of the phonetic text described above.

The segment selection/connection unit 27 selects appropriate speech segments from the waveform dictionary 35 according to the generated target prosody and generates selection connection information for connecting them to one another. The waveform dictionary 35 stores recorded narrator speech waveforms together with various phoneme information (phoneme names, phoneme lengths, pitch frequencies, and so on, corresponding to the speech waveforms). Hereinafter, a segment refers to one group selected by segment selection, consisting of either a phoneme sequence (one or more consecutive phonemes) or a mora sequence (one or more consecutive moras). Depending on the selection, a segment may be a single phoneme or N phonemes.

The segment selection/connection unit 27 selects from the waveform dictionary 35 the segments that are as close as possible to the target prosody generated by the prosody generation unit 25 (segment selection) and then, from among the selected segments, selects those whose connection boundaries with one another yield natural prosody (segment connection). When selecting segments, it calculates the segment selection penalty and segment connection penalty described later, and makes the selection so that the segment selection penalty evaluation value PP and the segment connection penalty evaluation value PC calculated from them are as small as possible.

The segment selection/connection unit 27 outputs the selection connection information, together with the calculated segment selection penalties and segment connection penalties, to the range acquisition unit 41 as intermediate information. The segment selection/connection unit 27 also receives correction information from the correction means determination unit 49 described later, performs segment selection again for re-synthesis of the speech, and regenerates the segment connection information.

The waveform processing unit 29 reads the selected segment waveforms from the waveform dictionary 35 based on the selection connection information generated by the segment selection/connection unit 27, connects them to one another by waveform processing, and generates synthesized speech.

The range acquisition unit 41 of the sound quality correction determination unit 9 acquires the range in which sound quality correction is to be performed, based on the sound quality degradation range specified by the user and on the intermediate information generated by the language processing unit 23, the prosody generation unit 25, and the segment selection/connection unit 27 of the speech synthesis unit 7. That is, the range acquisition unit 41 acquires the range of the synthesized speech (hereinafter, the target phoneme range) that corresponds to the sound quality degradation range the user specified via the sound quality correction UI unit 21. At this point, the range acquisition unit 41 may be configured to determine, as the re-synthesis range in which the speech is re-synthesized, a span delimited in phoneme units, phoneme sequence units, mora units, or mora sequence units that completely contains the sound quality degradation range.
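One way to picture how a user-specified degradation range is widened to unit boundaries is the following minimal Python sketch. The function name `resynthesis_range`, the tuple layout of `units`, and the millisecond time values are all assumptions made for illustration; the patent does not specify a data format.

```python
# Hypothetical illustration: expand a user-selected time range so that it
# completely contains every unit (phoneme or mora) it overlaps.
def resynthesis_range(units, selection_start, selection_end):
    """Return (start, end, labels) of the units overlapping the selection.

    units: list of (label, start, end) tuples in time order, with 'end'
    exclusive. Returns None when the selection overlaps no unit.
    """
    covered = [
        unit for unit in units
        if unit[2] > selection_start and unit[1] < selection_end
    ]
    if not covered:
        return None
    labels = [unit[0] for unit in covered]
    return covered[0][1], covered[-1][2], labels


# Assumed unit boundaries in milliseconds, for illustration only.
units = [("k", 0, 50), ("yo", 50, 120), ("o", 120, 200), ("n", 200, 250)]
print(resynthesis_range(units, 70, 130))
```

A selection that starts and ends in the middle of units is widened to the enclosing unit boundaries, so the re-synthesis range never cuts a unit in half.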

The degradation cost calculation unit 43 calculates each degradation cost in the target phoneme range corresponding to the sound quality degradation range. A degradation cost and its degradation cost function are defined by a combination of at least one item that includes a segment selection penalty or a segment connection penalty, each of which is part of the intermediate information of the speech synthesis unit 7. The degradation cost calculation unit 43 calculates the degradation cost from the degradation cost function. The degradation cost is preferably calculated for each phoneme or each mora in the sound quality degradation range described above, but it may instead be calculated for each composite unit (phoneme sequence unit or mora sequence unit). Hereinafter, phoneme units, mora units, phoneme sequence units, and mora sequence units are collectively called segment units. The degradation cost is described in detail later.

The segment selection penalty quantifies, as a penalty, the difference between the target prosody (pitch, phoneme length, and so on) and acoustic features (timbre and so on) and the prosody and acoustic features of the speech segment selected from the waveform dictionary 35. The segment connection penalty quantifies, as a penalty, the difference in prosody and acoustic features across the boundary when selected segments are connected to one another. For both the segment selection penalty and the segment connection penalty, the larger the difference, the further the result departs from the target prosody and acoustic features, and the more the sound quality degrades. Defining combinations of these penalties as the components of a degradation cost function yields degradation costs suited to determining the degradation type, which makes it possible to improve the accuracy of that determination. The segment selection penalty and the segment connection penalty are described in detail later.
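To make the idea of "quantifying a difference as a penalty" concrete, here is a minimal Python sketch. The relative-difference formula, the weights, the `Prosody` class, and both function names are assumptions chosen for illustration; the patent text above does not give the actual penalty formulas.

```python
from dataclasses import dataclass


@dataclass
class Prosody:
    """Assumed minimal prosody description of one segment."""
    pitch_hz: float
    duration_ms: float


def selection_penalty(target, candidate, w_pitch=1.0, w_duration=1.0):
    """Weighted relative difference between target and candidate prosody."""
    pitch_term = abs(candidate.pitch_hz - target.pitch_hz) / target.pitch_hz
    duration_term = (
        abs(candidate.duration_ms - target.duration_ms) / target.duration_ms
    )
    return w_pitch * pitch_term + w_duration * duration_term


def connection_penalty(left_end_pitch_hz, right_start_pitch_hz):
    """Relative pitch jump across the boundary between two segments."""
    return abs(right_start_pitch_hz - left_end_pitch_hz) / left_end_pitch_hz


target = Prosody(pitch_hz=200.0, duration_ms=100.0)
candidates = [Prosody(220.0, 90.0), Prosody(204.0, 98.0)]
best = min(candidates, key=lambda c: selection_penalty(target, c))
print(best)
```

In this sketch a larger penalty means a larger departure from the target, so choosing the candidate with the smallest penalty corresponds to choosing the segment closest to the target prosody.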

The degradation type determination unit 45 determines the type of sound quality degradation based on the calculated degradation costs. It classifies degradation types by the nature of the degradation, for example "sound (sound quality)", "intonation", and "articulation". Each degradation type is classified based on at least one degradation cost. The degradation type determination unit 45 determines as degradation types that are candidates for sound quality correction those degradation types that correspond to a degradation cost ranked at or above a predetermined rank when the costs are ordered from highest to lowest, or those that exceed a predetermined threshold. Hereinafter, a degradation type that is a candidate for sound quality correction is called a correction candidate degradation type, and a degradation cost ranked at or above the predetermined rank in descending order of value is called a top-ranked degradation cost. The degradation type determination unit 45 presents the determined correction candidate degradation types to the user via the sound quality correction UI unit 21.

In this way, the degradation type determination unit 45 is configured to determine, as a correction candidate degradation type, any degradation type that includes at least one top-ranked degradation cost. This is because a high degradation cost is likely to be a cause of sound quality degradation, so a degradation type that includes a top-ranked degradation cost is likely to be the cause of the degradation.
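The two selection rules described above (rank-based and threshold-based) can be sketched in Python as follows. The function name `correction_candidates`, the list-of-pairs input format, the English type labels, and the cost values are hypothetical; the mapping from degradation costs to degradation types is simply assumed to be given.

```python
# Hypothetical illustration: pick correction candidate degradation types
# from (degradation_type, degradation_cost) pairs.
def correction_candidates(costs, top_n=2, threshold=None):
    """Return candidate degradation types, highest cost first.

    If 'threshold' is given, keep types with a cost above it; otherwise
    keep the types behind the 'top_n' highest costs. A type that appears
    through several costs is listed once.
    """
    ranked = sorted(costs, key=lambda item: item[1], reverse=True)
    if threshold is not None:
        picked = [item for item in ranked if item[1] > threshold]
    else:
        picked = ranked[:top_n]
    candidates = []
    for degradation_type, _cost in picked:
        if degradation_type not in candidates:
            candidates.append(degradation_type)
    return candidates


# Assumed cost values, for illustration only.
costs = [
    ("sound", 0.8),
    ("intonation", 0.3),
    ("articulation", 0.6),
    ("sound", 0.5),
]
print(correction_candidates(costs, top_n=2))
print(correction_candidates(costs, threshold=0.7))
```

Either rule yields a short list that could be shown to the user, for example in the pop-up menu mentioned earlier.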

The degradation type determination unit 45 may also present all degradation types to the user via the sound quality correction UI unit 21, regardless of whether they are correction candidates. This prevents the user from being left without any means of correcting the sound quality when the degradation type they perceived is not among the options. When the user selects a degradation type that is not a correction candidate, the apparatus is preferably configured to launch, for example, a manual correction UI (not shown) best suited to correcting that degradation type.

For each correction candidate degradation type determined to be a candidate for sound quality correction, the degradation position acquisition unit 47 acquires and holds the degradation cost value of that degradation type and its degradation position. On the sound quality correction UI unit 21, the user selects from the presented correction candidate degradation types the one that matches their own perception. The degradation type selected by the user and acquired by the speech synthesizer 1 is hereinafter called the correction degradation type.

However, the degradation position acquisition unit 47 may be configured to acquire and hold the degradation costs and degradation positions of all degradation types, regardless of whether each type is a correction candidate. With this configuration, even when a degradation type could not be determined to be a correction candidate, if the user selects it as the type of degradation they perceive, treating that type and its corresponding degradation costs as correction candidates makes sound quality correction possible.

修正手段決定部４９は、音質修正ＵＩ部２１を介して利用者が選択した修正劣化種別を取得し、劣化位置取得部４７から劣化種別に対応した劣化コストと劣化位置を取得し、範囲取得部４１から、音声の再合成を行う再合成範囲を取得する。修正手段決定部４９は、再合成範囲と、修正劣化種別に対応する各劣化コストとに基づいて音質劣化を改善するため、劣化コストに対応した素片選択ペナルティ、素片接続ペナルティを改善するための修正情報を生成し、素片選択接続部２７へ出力する。修正情報は、後述する素片選択ペナルティ評価値ＰＰおよび素片接続ペナルティ評価値ＰＣを含む。 The correction means determination unit 49 acquires the corrected deterioration type selected by the user via the sound quality correction UI unit 21, acquires the deterioration cost and deterioration position corresponding to that deterioration type from the deterioration position acquisition unit 47, and acquires from the range acquisition unit 41 the re-synthesis range over which speech is re-synthesized. To reduce the sound quality deterioration, the correction means determination unit 49 generates, based on the re-synthesis range and each deterioration cost corresponding to the corrected deterioration type, correction information for improving the segment selection penalties and segment connection penalties corresponding to those deterioration costs, and outputs it to the segment selection/connection unit 27. The correction information includes the segment selection penalty evaluation value PP and the segment connection penalty evaluation value PC, which are described later.

以上の構成により、音声合成装置１は、利用者が指定した音質劣化範囲から音質劣化位置とその音質劣化種別を判定し、音質劣化種別を利用者へ提示する。利用者は提示された劣化種別の中から、自らの感覚に合った劣化種別を選択し、音声合成装置１は、劣化種別に応じた音質修正を行う。 With the above configuration, the speech synthesizer 1 determines the sound quality deterioration position and the sound quality deterioration type from the sound quality deterioration range designated by the user, and presents the sound quality deterioration type to the user. The user selects a degradation type that suits his / her sense from the presented degradation types, and the speech synthesizer 1 performs sound quality correction according to the degradation type.

以下、図３、図４を参照しながら、素片選択ペナルティ、素片選択ペナルティ評価値ＰＰ、素片接続ペナルティおよび素片接続ペナルティ評価値ＰＣについて説明する。図３は、素片選択ペナルティを示す図である。図３に示すように、素片選択ペナルティとは、合成する音声の目標韻律（ピッチ周波数、音素長、音響特徴量など）と、波形辞書３５から選択する素片の韻律の差異を定量化してペナルティとして与えることで、より目標韻律に近い素片を選び、合成音声の音質劣化を抑えるための変数である。なお、ペナルティとは、音声の劣化度合いに対応した参照値を意味する。 Hereinafter, the segment selection penalty, the segment selection penalty evaluation value PP, the segment connection penalty, and the segment connection penalty evaluation value PC will be described with reference to FIGS. 3 and 4. FIG. 3 is a diagram showing the segment selection penalty. As shown in FIG. 3, the segment selection penalty is a variable that quantifies, as a penalty, the difference between the target prosody of the speech to be synthesized (pitch frequency, phoneme length, acoustic features, and so on) and the prosody of a segment selected from the waveform dictionary 35, so that segments closer to the target prosody are chosen and sound quality degradation of the synthesized speech is suppressed. Here, a penalty means a reference value corresponding to the degree of speech degradation.

図３の表７５に示すように、素片選択ペナルティとしては、ピッチペナルティＰＰａａ、ＰＰａｂ、音素長ペナルティＰＰｂａ、ＰＰｂｂ、音色ペナルティＰＰｃａが例示されている。ピッチペナルティＰＰａａ、ＰＰａｂは、ピッチ周波数に関するペナルティである。ピッチペナルティＰＰａａは、目標韻律におけるピッチ周波数（以下、目標ピッチ周波数という。以下同様）が波形辞書３５から選択した素片のピッチ周波数（以下、波形辞書素片ピッチ周波数という。以下同様）以上である場合である。ピッチペナルティＰＰａｂは、目標ピッチ周波数が波形辞書素片ピッチ周波数未満である場合である。 As shown in Table 75 of FIG. 3, examples of the segment selection penalty include pitch penalties PPaa and PPab, phoneme length penalties PPba and PPbb, and a timbre penalty PPca. Pitch penalties PPaa and PPab are penalties regarding the pitch frequency. The pitch penalty PPaa applies when the pitch frequency in the target prosody (hereinafter, the target pitch frequency; the same naming applies below) is greater than or equal to the pitch frequency of the segment selected from the waveform dictionary 35 (hereinafter, the waveform dictionary segment pitch frequency; the same naming applies below). The pitch penalty PPab applies when the target pitch frequency is less than the waveform dictionary segment pitch frequency.

音素長ペナルティＰＰｂａ、ＰＰｂｂは、音素長に関するペナルティである。音素長ペナルティＰＰｂａは、目標音素長が波形辞書素片音素長以上の場合であり、音素長ペナルティＰＰｂｂは、目標音素長が波形辞書素片音素長未満の場合である。音色ペナルティＰＰｃａは、音色（音響特徴量）に関するペナルティである。音色ペナルティＰＰｃａは、目標音色が、波形辞書素片音色と有意な差がある場合である。 Phoneme length penalties PPba and PPbb are penalties for phoneme length. The phoneme length penalty PPba is the case where the target phoneme length is greater than or equal to the waveform dictionary segment phoneme length, and the phoneme length penalty PPbb is the case where the target phoneme length is less than the waveform dictionary segment phoneme length. The timbre penalty PPca is a penalty related to the timbre (acoustic feature amount). The timbre penalty PPca is when the target timbre is significantly different from the waveform dictionary fragment timbre.

以下、素片選択ペナルティの定義の例を以下に示す。
ピッチペナルティＰＰａａ＝Σ（目標ピッチ周波数／波形辞書素片ピッチ周波数）・・・（式Ａ１−１）
式Ａ１−１において、Σは、対象音素範囲が単独音素である場合のピッチ比率の和を示す。
ピッチペナルティＰＰａａ＝Σ（Σ（目標ピッチ周波数／波形辞書素片ピッチ周波数））・・・（式Ｂ１−１）
式Ｂ１−１において、一つ目のΣは、対象音素範囲における全ての音素についての和を示し、二つ目のΣは、対象音素範囲における各音素の各ピッチ比率の和を示す。なお、式Ａ１−１、式Ｂ１−１においては、目標ピッチ周波数≧波形辞書素片ピッチ周波数について合算する。 An example of the definition of the segment selection penalty is shown below.
Pitch penalty PPaa = Σ (target pitch frequency / waveform dictionary segment pitch frequency) (Formula A1-1)
In Formula A1-1, Σ denotes the sum of the pitch ratios when the target phoneme range is a single phoneme.
Pitch penalty PPaa = Σ (Σ (target pitch frequency / waveform dictionary segment pitch frequency)) (Formula B1-1)
In Formula B1-1, the first Σ denotes the sum over all phonemes in the target phoneme range, and the second Σ denotes the sum of the pitch ratios within each phoneme in that range. In Formulas A1-1 and B1-1, only the terms for which target pitch frequency ≥ waveform dictionary segment pitch frequency are summed.

ピッチペナルティＰＰａｂ=Σ（波形辞書素片ピッチ周波数／目標ピッチ周波数）・・・（式Ａ１−２）
ピッチペナルティＰＰａｂ=Σ（Σ（波形辞書素片ピッチ周波数／目標ピッチ周波数））・・・（式Ｂ１−２）
ここで、Σについては、式Ａ１−２は式Ａ１−１と同様であり、式Ｂ１−２は、式Ｂ１−１と同様である。なお、式Ａ１−２、式Ｂ１−２においては、目標ピッチ周波数＜波形辞書素片ピッチ周波数について合算する。 Pitch penalty PPab = Σ (waveform dictionary segment pitch frequency / target pitch frequency) (Formula A1-2)
Pitch penalty PPab = Σ (Σ (waveform dictionary segment pitch frequency / target pitch frequency)) (Formula B1-2)
Here, Σ in Formula A1-2 has the same meaning as in Formula A1-1, and Σ in Formula B1-2 has the same meaning as in Formula B1-1. In Formulas A1-2 and B1-2, only the terms for which target pitch frequency < waveform dictionary segment pitch frequency are summed.

ピッチペナルティＰＰａａが高値であれば、「目標ピッチ周波数≧波形辞書素片ピッチ周波数」の傾向が強く、目標とする声の高さより合成音声が低い声である傾向が強いことを示す。また、ピッチペナルティＰＰａｂが高値であれば、「目標ピッチ周波数＜波形辞書素片ピッチ周波数」の傾向が強く、目標の声の高さよりも合成音声が高い声である傾向が強いことを示す。このように、ピッチペナルティＰＰａａ、ＰＰａｂのいずれに関しても、値が高い場合には、抑揚やアクセントが不自然な合成音声となる傾向が強い。 A high pitch penalty PPaa indicates a strong tendency toward “target pitch frequency ≥ waveform dictionary segment pitch frequency”, that is, the synthesized voice tends to be lower-pitched than the target. Likewise, a high pitch penalty PPab indicates a strong tendency toward “target pitch frequency < waveform dictionary segment pitch frequency”, that is, the synthesized voice tends to be higher-pitched than the target. For both PPaa and PPab, a high value means the synthesized speech is likely to have unnatural intonation or accent.

上記のように、（式Ａ１−１）、（式Ａ１−２）は、音素単位でピッチペナルティを積算し、１音素単位のピッチペナルティを算出する式である。（式Ｂ１−１）、（式Ｂ１−２）は、複数音素のピッチペナルティを積算し、音素列単位のピッチペナルティを算出する式である。両式のいずれも、目標と波形辞書素片の抑揚差を反映したペナルティとなり、抑揚の不自然さを知る手がかりとなる。算出された音素または音素列におけるピッチペナルティＰＰａａ、ＰＰａｂは、全て例えば記憶部５に保持され、必要に応じて読み出し可能であることが好ましい。 As described above, (Equation A1-1) and (Equation A1-2) are equations for accumulating the pitch penalty in units of phonemes and calculating the pitch penalty in units of phonemes. (Equation B1-1) and (Equation B1-2) are equations for accumulating the pitch penalties of a plurality of phonemes and calculating a pitch penalty for each phoneme string. Both of these formulas are penalties that reflect the inflection difference between the target and the waveform dictionary segment, and are clues to know the unnaturalness of the inflection. It is preferable that the pitch penalties PPaa and PPab in the calculated phonemes or phoneme strings are all stored in, for example, the storage unit 5 and can be read out as necessary.

また、ピッチペナルティＰＰａａ、ＰＰａｂを音素単位で算出する代わりにモーラ単位で算出するようにしてもよく、音素列単位で算出する代わりにモーラ列単位で算出するようにしてもよい。 Further, the pitch penalties PPaa and PPab may be calculated in units of mora instead of being calculated in units of phonemes, and may be calculated in units of mora sequences instead of being calculated in units of phoneme sequences.
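
As an illustration only, Formulas A1-1, A1-2, B1-1 and B1-2 can be sketched in Python as follows. The function names, the representation of a phoneme as paired lists of pitch samples in Hz, and the assumption that all frequencies are positive are hypothetical and are not part of the described apparatus.

```python
from typing import Iterable, Sequence, Tuple

# (target pitch samples in Hz, waveform-dictionary segment pitch samples in Hz)
PitchPair = Tuple[Sequence[float], Sequence[float]]


def pitch_penalties(target_hz: Sequence[float],
                    unit_hz: Sequence[float]) -> Tuple[float, float]:
    """Return (PPaa, PPab) for a single phoneme (Formulas A1-1 and A1-2)."""
    if len(target_hz) != len(unit_hz):
        raise ValueError("pitch sequences must have the same length")
    ppaa = 0.0  # sum of target/unit over samples where target >= unit
    ppab = 0.0  # sum of unit/target over samples where target < unit
    for t, u in zip(target_hz, unit_hz):
        if t >= u:
            ppaa += t / u
        else:
            ppab += u / t
    return ppaa, ppab


def pitch_penalties_sequence(phonemes: Iterable[PitchPair]) -> Tuple[float, float]:
    """Return (PPaa, PPab) summed over a phoneme sequence (Formulas B1-1 and B1-2)."""
    ppaa_total = 0.0
    ppab_total = 0.0
    for target_hz, unit_hz in phonemes:
        ppaa, ppab = pitch_penalties(target_hz, unit_hz)
        ppaa_total += ppaa
        ppab_total += ppab
    return ppaa_total, ppab_total
```

Because the formulas are written as ratios, a sample whose target pitch equals the dictionary pitch still contributes 1 to PPaa in this sketch.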

「音素長ペナルティ」は、目標音素長と波形辞書の音素長の差異をペナルティとして定量化したもので、例えば下式で定義される。
音素長ペナルティＰＰｂａ＝目標音素長／波形辞書音素長・・・（式Ａ２−１）
音素長ペナルティＰＰｂａ＝Σ（目標音素長／波形辞書音素長）・・・（式Ｂ２−１）
Σは、対象音素範囲における各音素の音素長比率の和を示す。ここでは、目標音素長≧波形辞書音素長である場合について合算する。
音素長ペナルティＰＰｂｂ＝波形辞書音素長／目標音素長・・・（式Ａ２−２）
音素長ペナルティＰＰｂｂ＝Σ（波形辞書音素長／目標音素長）・・・（式Ｂ２−２）
Σは、対象音素範囲における各音素の音素長比率の和を示す。ここでは、目標音素長＜波形辞書音素長である場合について合算する。 The “phoneme length penalty” quantifies, as a penalty, the difference between the target phoneme length and the phoneme length in the waveform dictionary, and is defined, for example, by the following formulas.
Phoneme length penalty PPba = target phoneme length / waveform dictionary phoneme length (Formula A2-1)
Phoneme length penalty PPba = Σ (target phoneme length / waveform dictionary phoneme length) (Formula B2-1)
Σ denotes the sum of the phoneme length ratios of the phonemes in the target phoneme range. Only the cases where target phoneme length ≥ waveform dictionary phoneme length are summed.
Phoneme length penalty PPbb = waveform dictionary phoneme length / target phoneme length (Formula A2-2)
Phoneme length penalty PPbb = Σ (waveform dictionary phoneme length / target phoneme length) (Formula B2-2)
Σ denotes the sum of the phoneme length ratios of the phonemes in the target phoneme range. Only the cases where target phoneme length < waveform dictionary phoneme length are summed.

ここで、音素長ペナルティＰＰｂａが高値であれば「目標音素長≧波形辞書音素長」の傾向が強く、目標より早い（慌しい）合成音声である傾向が強いことを示す。音素長ペナルティＰＰｂｂが高値であれば「目標音素長＜波形辞書音素長」の傾向が強く、目標より遅い（たどたどしい）合成音声である傾向が強いことを示す。いずれの場合も、滑舌が不自然な合成音声となる傾向が強い。 Here, a high phoneme length penalty PPba indicates a strong tendency toward “target phoneme length ≥ waveform dictionary phoneme length”, that is, the synthesized speech tends to be faster (hurried) than the target. A high phoneme length penalty PPbb indicates a strong tendency toward “target phoneme length < waveform dictionary phoneme length”, that is, the synthesized speech tends to be slower (faltering) than the target. In either case, the synthesized speech is likely to have unnatural articulation.

なお、（式Ａ２−１）、（式Ａ２−２）は、音素単位で音素長ペナルティを積算し、１音素単位の音素長ペナルティを算出する。（式Ｂ２−１）、（式Ｂ２−２）は、複数音素の音素長ペナルティを積算し、音素列単位の音素長ペナルティを算出する。両式のいずれも、目標と波形辞書素片の滑舌を反映したペナルティとなり、滑舌の不自然さを知る手がかりとなる。算出された音素または音素列における音素長ペナルティＰＰｂａ、ＰＰｂｂは、全て例えば記憶部５に保持され、必要に応じて読み出し可能であることが好ましい。 Note that (Formula A2-1) and (Formula A2-2) calculate the phoneme length penalty for a single phoneme, while (Formula B2-1) and (Formula B2-2) accumulate the phoneme length penalties of a plurality of phonemes to calculate the phoneme length penalty for a phoneme sequence. Both forms yield a penalty that reflects the articulation of the target and of the waveform dictionary segments, and provide a clue to unnatural articulation. It is preferable that all the calculated phoneme length penalties PPba and PPbb for phonemes or phoneme sequences are held in, for example, the storage unit 5 and can be read out as necessary.

また、音素長ペナルティＰＰｂａ、ＰＰｂｂを音素単位で算出する代わりにモーラ単位で算出するようにしてもよく、音素列単位で算出する代わりにモーラ列単位で算出するようにしてもよい。 Further, the phoneme length penalties PPba and PPbb may be calculated in units of mora instead of being calculated in units of phonemes, or may be calculated in units of mora sequences instead of being calculated in units of phoneme sequences.

「音色ペナルティ」は、目標音色と波形辞書音素の音色との差異をペナルティとして定量化したもので、例えば下式で定義すればよい。
音色ペナルティＰＰｃａ=ｓｑｒｔ（Σ（目標ＭＦＣＣ（ｎ）−波形辞書ＭＦＣＣ（ｎ））^２）・・・（式Ａ３−１）
式Ａ３−１において、Σは、対象音素範囲が単独音素である場合の音素音色の二乗和平均を示す。 The “tone color penalty” is obtained by quantifying the difference between the target tone color and the tone color of the waveform dictionary phoneme as a penalty, and may be defined by the following equation, for example.
Tone penalty PPca = sqrt (Σ (target MFCC (n) −waveform dictionary MFCC (n)) ² ) (Formula A3-1)
In Expression A3-1, Σ represents the mean square sum of phoneme timbres when the target phoneme range is a single phoneme.

音色ペナルティＰＰｃａ=Σ（ｓｑｒｔ（Σ（目標ＭＦＣＣ（ｎ）−波形辞書ＭＦＣＣ（ｎ））^２））・・・（式Ｂ３−１）
式Ｂ３−１において、一つ目のΣは、対象音素範囲における全ての音素についての和を示し、二つ目のΣは、対象音素範囲が単独音素である場合の各音素音色の二乗和平均を示す。 Tone penalty PPca = Σ (sqrt (Σ (target MFCC (n) −waveform dictionary MFCC (n)) ² )) (formula B3-1)
In Formula B3-1, the first Σ represents the sum of all phonemes in the target phoneme range, and the second Σ is the mean square sum of each phoneme tone color when the target phoneme range is a single phoneme. Indicates.

音色ペナルティＰＰｃａでは、音色を表現する音響特徴量として、メル周波数ケプストラム係数（ＭｅｌＦｒｅｑｕｅｎｃｙＣｅｐｓｔｒｕｍＣｏｅｆｆｉｃｉｅｎｃｙ：ＭＦＣＣ）が利用される。式Ａ３−１、式Ｂ３−１は、Ｎ次元のＭＦＣＣを音色として採用した例である。目標のＭＦＣＣは、予め目的別かつ音素別に平均的なＭＦＣＣを算出しておけばよく、目標ＭＦＣＣと対象音素ＭＦＣＣのＮ次の二乗和平均を音色ペナルティと定義すればよい。なお、目的別とは、例えば、通常音声合成と、感情（喜・怒・哀・楽など）を表現した感情音声合成のような目的が異なる音声合成のことをいう。例えば、通常音声合成と感情音声合成とでは、同じ音素でも音色が異なるため、その目的別に、音素別の平均的なＭＦＣＣを予め算出しておくことが好ましい。 In the timbre penalty PPca, a Mel Frequency Cepstrum Coefficient (MFCC) is used as an acoustic feature amount expressing a timbre. Expressions A3-1 and B3-1 are examples in which N-dimensional MFCC is adopted as a timbre. For the target MFCC, an average MFCC may be calculated in advance for each purpose and for each phoneme, and an N-order square sum of the target MFCC and the target phoneme MFCC may be defined as a timbre penalty. Note that “by purpose” means, for example, normal speech synthesis and speech synthesis with different purposes such as emotional speech synthesis expressing emotions (joy, anger, sorrow, comfort, etc.). For example, normal voice synthesis and emotional voice synthesis have different tone colors even for the same phoneme, and it is preferable to calculate an average MFCC for each phoneme for each purpose.

ここで、音色ペナルティＰＰｃａが高値であれば、波形辞書音素と目標の音色が異なる傾向が強いため、不自然な音（音質）となる傾向が強い。なお、式Ａ３−１は、音素単位で音色ペナルティを積算し、１音素単位の音色ペナルティを算出する。式Ｂ３−１は、複数音素の音色ペナルティを積算し、音素列単位の音色ペナルティを算出する。両式のいずれも、目標と波形辞書素片の音色を反映したペナルティとなり、音質の不自然さを知る手がかりとなる。算出された音素または音素列における音色ペナルティＰＰｃａは、全て例えば記憶部５に保持され、必要に応じて読み出し可能であることが好ましい。また、音色ペナルティＰＰｃａを音素単位で算出する代わりに、モーラ単位で算出してもよく、音素列単位で算出する代わりにモーラ列単位で算出することにしてもよい。 Here, if the timbre penalty PPca is a high value, the waveform dictionary phoneme and the target timbre tend to be different from each other, and thus there is a strong tendency to be unnatural sound (sound quality). In Formula A3-1, the timbre penalties are integrated in units of phonemes, and the timbre penalty in units of one phoneme is calculated. Formula B3-1 integrates the timbre penalties of a plurality of phonemes and calculates the timbre penalty for each phoneme string. Both of these formulas are penalties that reflect the timbre of the target and the waveform dictionary fragment, and are clues to know the unnaturalness of the sound quality. It is preferable that all the timbre penalties PPca in the calculated phonemes or phoneme strings are held in, for example, the storage unit 5 and can be read out as necessary. The timbre penalty PPca may be calculated in units of mora instead of being calculated in units of phonemes, and may be calculated in units of mora sequences instead of being calculated in units of phoneme sequences.
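
A minimal sketch of Formulas A3-1 and B3-1 is given below, assuming that each phoneme is represented by one N-dimensional MFCC vector and that the square is applied to each per-dimension difference before summing. The function names are hypothetical. The code follows the formulas as written, that is, the square root of the sum of squared differences, with no division by N.

```python
import math
from typing import Iterable, Sequence, Tuple

# (target MFCC vector, waveform-dictionary phoneme MFCC vector)
MfccPair = Tuple[Sequence[float], Sequence[float]]


def timbre_penalty(target_mfcc: Sequence[float],
                   unit_mfcc: Sequence[float]) -> float:
    """PPca for a single phoneme (Formula A3-1): Euclidean MFCC distance."""
    if len(target_mfcc) != len(unit_mfcc):
        raise ValueError("MFCC vectors must have the same dimension")
    squared = sum((t - u) ** 2 for t, u in zip(target_mfcc, unit_mfcc))
    return math.sqrt(squared)


def timbre_penalty_sequence(phonemes: Iterable[MfccPair]) -> float:
    """PPca for a phoneme sequence (Formula B3-1): sum of per-phoneme distances."""
    return sum(timbre_penalty(t, u) for t, u in phonemes)
```

The same distance could also serve for the timbre change penalty at a connection boundary by passing the MFCC vectors on either side of the boundary instead of target and dictionary vectors.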

次に、図４を参照しながら、素片接続ペナルティについて説明する。図４は、素片接続ペナルティを示す図である。素片接続ペナルティとは、素片選択ペナルティで選択した素片列（音素列）同士を接続する際に、その接続境界における音響的特徴の差異を定量化してペナルティとして与えることで、より接続境界の音響特徴量が合う接続を選び、合成音声の音質劣化を抑えるための変数である。 Next, the segment connection penalty will be described with reference to FIG. 4. FIG. 4 is a diagram showing the segment connection penalty. The segment connection penalty is a variable that, when the segment sequences (phoneme sequences) chosen using the segment selection penalty are connected to each other, quantifies as a penalty the difference in acoustic features at the connection boundary, so that connections whose acoustic features match better at the boundary are chosen and sound quality degradation of the synthesized speech is suppressed.

「ピッチ変化ペナルティ」は、波形辞書から選択された素片列（音素列）同士を接続する場合に、接続境界前後のピッチ周波数の差異をペナルティとして定量化したもので、例えば下式で定義する。 The “pitch change penalty” is obtained by quantifying the difference in pitch frequency before and after the connection boundary as a penalty when connecting segment strings (phoneme strings) selected from the waveform dictionary. .

ピッチ変化ペナルティＰＣａａ＝境界前ピッチ周波数／境界後ピッチ周波数・・・（式Ａ４−１）
ただし、境界前ピッチ周波数≧境界後ピッチ周波数となる場合について合算している。 Pitch change penalty PCaa = Pitch frequency before boundary / Pitch frequency after boundary (formula A4-1)
However, the cases where the pitch frequency before boundary ≧ the pitch frequency after boundary are added up.

ピッチ変化ペナルティＰＣａｂ=境界後ピッチ周波数／境界前ピッチ周波数・・・（式Ａ４−２）
ただし、境界前ピッチ周波数≦境界後ピッチ周波数となる場合について合算している。 Pitch change penalty PCab = post-boundary pitch frequency / pre-boundary pitch frequency (formula A4-2)
However, the case where the pitch frequency before boundary ≦ the pitch frequency after boundary is added up.

ここで、ピッチ変化ペナルティＰＣａａの値が高値である程「境界前ピッチ周波数≧境界後ピッチ周波数」の傾向が強く、境界前から境界後に向かって、声の高さが急に低くなる合成音声となる。ピッチ変化ペナルティＰＣａｂの値が高値である程「境界前ピッチ周波数≦境界後ピッチ周波数」の傾向が強く、境界前から境界後に向かって、声の高さが高くなる合成音声となる。いずれの場合も、抑揚やアクセントが不自然な合成音声となる傾向が強い。 Here, the higher the value of the pitch change penalty PCaa, the stronger the tendency toward “pre-boundary pitch frequency ≥ post-boundary pitch frequency”, giving synthesized speech whose pitch drops abruptly across the boundary. The higher the value of the pitch change penalty PCab, the stronger the tendency toward “pre-boundary pitch frequency ≤ post-boundary pitch frequency”, giving synthesized speech whose pitch rises across the boundary. In either case, the synthesized speech is likely to have unnatural intonation or accent.

「音素環境ペナルティ」は、波形辞書から選択された素片列（音素列）同士を接続する場合、接続境界前後の音素環境の差異をペナルティとして定量化したもので、例えば下式で定義する。
音素環境ペナルティＰＣｂａ＝境界前音素環境スコア・・・（式Ａ５−１）
音素環境ペナルティＰＣｂｂ＝境界後音素環境スコア・・・（式Ａ５−２） The “phoneme environment penalty” is obtained by quantifying the difference in phoneme environment before and after the connection boundary when connecting segment strings (phoneme strings) selected from the waveform dictionary, and is defined by the following equation, for example.
Phoneme environment penalty PCba = Pre-boundary phoneme environment score (formula A5-1)
Phoneme environment penalty PCbb = phoneme environment score after boundary (formula A5-2)

以下、音素環境ペナルティＰＣｂａ、ＰＣｂｂについて、図５を参照しながら説明する。図５は、「音素環境」について説明する図である。図５に示すように、例えば、接続境界３６２より前の境界前音素列３５０が「庵」で、表音テキスト３５２が「イオリ」、合成音素列３５４が「家」で、表音テキスト３５６が「イエ」、境界後音素列３５８が「上」で、表音テキスト３６２が「ウエ」であるとする。 Hereinafter, the phoneme environment penalties PCba and PCbb will be described with reference to FIG. 5. FIG. 5 is a diagram explaining the “phoneme environment”. As shown in FIG. 5, suppose, for example, that the pre-boundary phoneme string 350 before the connection boundary 362 is “庵” with phonetic text 352 “イオリ” (iori), the synthesized phoneme string 354 is “家” with phonetic text 356 “イエ” (ie), and the post-boundary phoneme string 358 is “上” with phonetic text 362 “ウエ” (ue).

上記の例では、音素列「庵」の先頭音素「イ」と、音素列「上」の終端音素「エ」を素片接続して「家(イエ)」を合成するが、このような場合、音質が劣化する傾向が強い。すなわち、通常「イエ」という発声は、「イ」の発声部分と、「エ」の発声部分と、「イからエ」へ変化する“渡り”と呼ばれる発声部分に分かれる。ところが、「庵」は「イからオ」に変化する「イ」、「上」は「ウからエ」に変化する「エ」であり、いずれも渡りの部分が、合成音素「家（イエ）」の渡りと全く一致しないため、渡り部分の音色が不自然となって音質劣化となる傾向が強い。 In the above example, “家” (イエ, ie) is synthesized by connecting the leading phoneme “イ” (i) of the phoneme string “庵” and the final phoneme “エ” (e) of the phoneme string “上”, but in such a case the sound quality tends to deteriorate. That is, the utterance “イエ” normally consists of the portion where “イ” is uttered, the portion where “エ” is uttered, and a portion called the “transition” (渡り) in which the sound changes from “イ” to “エ”. However, the “イ” taken from “庵” is one that changes from “イ” to “オ” (o), and the “エ” taken from “上” is one that changes from “ウ” (u) to “エ”. Neither transition matches the transition of the synthesized “家” (イエ) at all, so the timbre of the transition portion becomes unnatural and the sound quality tends to deteriorate.

したがって、境界接続においては、少なくとも境界前後の２音素の一致（音素環境の一致）が重要であり、不一致の場合に音質劣化となりやすいため、不一致の場合は、音素環境スコアをペナルティとして採用する。よって、音素環境スコアは、合成音素と境界前または境界後の音素とが一致する場合はスコア０と定義し、不一致の場合は、合成音素と波形辞書音素の種類に応じて正スコアを加点するように構成する。よって、上記のように、音素環境ペナルティＰＣｂａ、ＰＣｂｂの値が高値である程、接続境界において不自然な音質の合成音声となる傾向が強い。 Therefore, in boundary connection, at least two phonemes before and after the boundary are important (phoneme environment match), and if they do not match, the sound quality is likely to deteriorate. Therefore, if they do not match, the phoneme environment score is adopted as a penalty. Therefore, the phoneme environment score is defined as score 0 when the synthesized phoneme and the phoneme before or after the boundary match, and in the case of mismatch, a positive score is added according to the type of the synthesized phoneme and the waveform dictionary phoneme. Configure as follows. Therefore, as described above, the higher the value of the phoneme environment penalty PCba, PCbb, the stronger the tendency to become unnatural sound quality synthesized speech at the connection boundary.
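
The scoring rule described above could be sketched as follows. This is only an illustration: the function name, the lookup table of mismatch scores, its numeric values, and the fallback score are all hypothetical, since the description only states that a match scores 0 and a mismatch adds a positive score that depends on the phoneme types.

```python
from typing import Dict, Tuple


def phoneme_environment_score(required_neighbor: str,
                              actual_neighbor: str,
                              mismatch_scores: Dict[Tuple[str, str], float],
                              default_score: float = 1.0) -> float:
    """Return 0 when the neighboring phoneme matches, else a positive score."""
    if required_neighbor == actual_neighbor:
        return 0.0
    # Positive score chosen by the (required, actual) phoneme pair; the table
    # contents are an assumption made for this sketch.
    return mismatch_scores.get((required_neighbor, actual_neighbor), default_score)


# FIG. 5 example: synthesizing "イエ" from the "イ" of "イオリ" and the "エ" of "ウエ".
scores = {("エ", "オ"): 2.0, ("イ", "ウ"): 1.5}  # hypothetical values

# Pre-boundary segment "イ": it should be followed by "エ" but was recorded before "オ".
pcba = phoneme_environment_score("エ", "オ", scores)
# Post-boundary segment "エ": it should be preceded by "イ" but was recorded after "ウ".
pcbb = phoneme_environment_score("イ", "ウ", scores)
```

In this sketch both PCba and PCbb are positive for the FIG. 5 example, reflecting that neither segment was recorded in the phoneme environment required by the synthesized string.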

「音色変化ペナルティ」は、波形辞書３５から選択された素片列（音素列）同士を接続する場合、接続境界前後の音色の差異をペナルティとして定量化したもので、例えば下式で定義する。
音色変化ペナルティＰＣｃａ＝ｓｑｒｔ（Σ（境界前置音素ＭＦＣＣ（ｎ）−境界後置音素ＭＦＣＣ（ｎ））^２）・・・（式Ａ６−１）
音色を表現する音響特徴量として、前述の「音色ペナルティ」で説明したメル周波数ケプストラム係数（ＭＦＣＣ）を利用し、境界前音素と境界後音素の音色差を音色変化ペナルティとして表現する。 The “timbre change penalty” is obtained by quantifying a difference in timbre before and after the connection boundary as a penalty when connecting segment strings (phoneme strings) selected from the waveform dictionary 35, and is defined by the following equation, for example.
Tone change penalty PCca = sqrt (Σ (pre-boundary phoneme MFCC (n) −post-boundary phoneme MFCC (n)) ² ) (Formula A6-1)
The mel frequency cepstrum coefficient (MFCC) described in the above-mentioned “timbre penalty” is used as the acoustic feature amount representing the timbre, and the timbre difference between the pre-boundary phoneme and the post-boundary phoneme is represented as a timbre change penalty.

例えば、長母音「アー」を「ア」と「ア」の音素境界で接続して合成する場合には、様々な音色の「ア」があるため、音色の異なる「ア」同士を接続すると、不自然な音質の音声となる。すなわち、音色変化ペナルティＰＣｃａの値が高値である程、接続境界において不自然な音（音質）の合成音声となる。 For example, when the long vowel “アー” (ā) is synthesized by connecting “ア” (a) and “ア” at a phoneme boundary, there are instances of “ア” with various timbres, so connecting two instances of “ア” with different timbres produces speech with unnatural sound quality. That is, the higher the value of the timbre change penalty PCca, the more unnatural the sound (sound quality) of the synthesized speech at the connection boundary.

また、上記合成音声が「イエ」の場合のように、前置と後置の音素が異なる素片を用いた場合は、「イ」の後半の渡り部分と「エ」の前半の渡り部分のメル周波数ケプストラムを対象にして式Ａ６−１による音色変化ペナルティＰＣｃａを利用すれば、異なる音素同士であっても、その渡り部分の音色差に応じてペナルティを与えることが可能となる。 In addition, when segments whose preceding and following phonemes differ are used, as in the case of the synthesized speech “イエ” (ie) above, applying the timbre change penalty PCca of Formula A6-1 to the mel-frequency cepstrum of the transition portion in the latter half of “イ” (i) and of the transition portion in the first half of “エ” (e) makes it possible to assign a penalty according to the timbre difference in the transition portion, even between different phonemes.

なお、上記式Ａ６−１は、Ｎ次元のＭＦＣＣを音色として採用した例である。目標のＭＦＣＣは、予め目的別かつ音素別に平均的なＭＦＣＣを算出しておけばよく、目標ＭＦＣＣと対象音素ＭＦＣＣのＮ次の二乗和平均を音色ペナルティと定義すればよい。
上記素片接続ペナルティの各算出値は、全て例えば記憶部５に記憶され、必要に応じて読み出し可能であることが好ましい。 The formula A6-1 is an example in which N-dimensional MFCC is adopted as a timbre. For the target MFCC, an average MFCC may be calculated in advance for each purpose and for each phoneme, and an N-order square sum of the target MFCC and the target phoneme MFCC may be defined as a timbre penalty.
It is preferable that each calculated value of the unit connection penalty is stored in, for example, the storage unit 5 and can be read out as necessary.

素片選択接続部２７は、算出された各素片選択ペナルティおよび素片接続ペナルティにより算出される、以下の素片選択ペナルティ評価値ＰＰおよび素片接続ペナルティ評価値ＰＣが最小になるように、素片の選択および接続を行う。
素片選択ペナルティ評価値ＰＰ＝α１×ＰＰａａ＋α２×ＰＰａｂ＋α３×ＰＰｂａ＋α４×ＰＰｂｂ＋α５×ＰＰｃａ・・・式Ｃ１
素片接続ペナルティ評価値ＰＣ＝β１×ＰＣａａ＋β２×ＰＣａｂ＋β３×ＰＣｂａ＋β４×ＰＣｂｂ＋β５×ＰＣｃａ・・・式Ｃ２
ここで、αｌ、βｌ（ｌ＝１、２、・・・）は、算出された各素片選択ペナルティまたは素片接続ペナルティの音声合成に対する影響度に対応する重み係数である。 The segment selection/connection unit 27 selects and connects segments so that the following segment selection penalty evaluation value PP and segment connection penalty evaluation value PC, which are computed from the calculated segment selection penalties and segment connection penalties, are minimized.
Segment selection penalty evaluation value PP = α1 × PPaa + α2 × PPab + α3 × PPba + α4 × PPbb + α5 × PPca (Formula C1)
Segment connection penalty evaluation value PC = β1 × PCaa + β2 × PCab + β3 × PCba + β4 × PCbb + β5 × PCca (Formula C2)
Here, αl and βl (l = 1, 2, ...) are weighting coefficients corresponding to the degree of influence of each calculated segment selection penalty or segment connection penalty on speech synthesis.
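
Formulas C1 and C2 are weighted sums, and choosing among candidates could be sketched as below. The description does not state how PP and PC are combined when minimizing, so this sketch assumes, purely for illustration, that the candidate with the smallest PP + PC is chosen. The names `evaluation_value`, `select_candidate` and the `Candidate` tuple layout are hypothetical.

```python
from typing import Sequence, Tuple

# (label, [PPaa, PPab, PPba, PPbb, PPca], [PCaa, PCab, PCba, PCbb, PCca])
Candidate = Tuple[str, Sequence[float], Sequence[float]]


def evaluation_value(weights: Sequence[float], penalties: Sequence[float]) -> float:
    """Weighted sum of penalties, as in Formula C1 (PP) or Formula C2 (PC)."""
    if len(weights) != len(penalties):
        raise ValueError("weights and penalties must have the same length")
    return sum(w * p for w, p in zip(weights, penalties))


def select_candidate(candidates: Sequence[Candidate],
                     alpha: Sequence[float],
                     beta: Sequence[float]) -> Candidate:
    """Pick the candidate with the smallest PP + PC (combination rule assumed)."""
    return min(
        candidates,
        key=lambda c: evaluation_value(alpha, c[1]) + evaluation_value(beta, c[2]),
    )
```

Raising an individual weight such as α2 makes candidates with a large PPab less likely to be chosen, which is one way the correction information could steer re-synthesis.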

次に、図６を参照しながら、「音質の劣化種別と劣化コストの対応付け」および「劣化コスト関数」の説明を以下に示す。図６は、劣化コスト関数および劣化種別を示す図である。なお、図６における係数ｗ１０〜ｗ６７は、劣化コスト関数で構成される劣化コストを同一レンジで対比するための正規化係数であり、Ｗ００〜Ｗ２７は、劣化コストで構成される各劣化種別を同一レンジで対比するための正規化係数である。 Next, referring to FIG. 6, explanations of “correspondence between sound quality deterioration type and deterioration cost” and “deterioration cost function” will be given below. FIG. 6 is a diagram illustrating a deterioration cost function and a deterioration type. Note that the coefficients w10 to w67 in FIG. 6 are normalization coefficients for comparing the degradation costs constituted by the degradation cost function in the same range, and W00 to W27 are the same for each degradation type constituted by the degradation costs. This is a normalization coefficient for comparison by range.

まず、劣化種別の「音（音質）」（以下、単に「音」と記載する）について説明する。音声合成に精通しない一般利用者は、音質劣化を分類することができないケースが多く、抑揚の不自然さや、滑舌の不自然さ、誤アクセントなどを全て区別が不可能な音質の劣化と捕らえることが多い。このため、劣化種別の「音」では、素片選択ペナルティと素片接続ペナルティの全てを含んだ劣化コスト、およびその劣化コスト関数を定義することが好ましい。 First, the deterioration type “sound (sound quality)” (hereinafter simply “sound”) will be described. General users who are not familiar with speech synthesis often cannot classify sound quality degradation, and tend to perceive unnatural intonation, unnatural articulation, incorrect accents, and the like all as one indistinguishable degradation of sound quality. For this reason, for the deterioration type “sound”, it is preferable to define degradation costs that cover all of the segment selection penalties and segment connection penalties, together with their degradation cost functions.

例えば、図６に示すように、「音」が音質劣化の劣化種別であるとする場合に、劣化コストＣ００〜Ｃ０５を以下のように定義することにより算出する。
Ｃ００＝Ｗ００×（ｗ１０×ＰＰａａ＋ｗ１１×ＰＰａｂ）・・・式（１−１）
なお、この場合、判別される劣化原因は「ピッチ差」である。
Ｃ０１＝Ｗ０１×（ｗ２０×ＰＰｂａ＋ｗ２１×ＰＰｂｂ）・・・式（２−１）
なお、この場合、判別される劣化原因は「音素長差」である。
Ｃ０２＝Ｗ０２×（ｗ３０×ＰＰｃａ）・・・式（３−１）
なお、この場合、判別される劣化原因は「音素音色差」である。
Ｃ０３＝Ｗ０３×（ｗ４０×ＰＣａａ＋ｗ４１×ＰＣａｂ）・・・式（４−１）
なお、この場合、判別される劣化原因は「境界ピッチ差」である。
Ｃ０４＝Ｗ０４×（ｗ５０×ＰＣｂａ＋ｗ５１×ＰＣｂｂ）・・・式（５−１）
なお、この場合、判別される劣化原因は「音素環境差」である。
Ｃ０５＝Ｗ０５×（ｗ６０×ＰＣｃａ）・・・式（６−１）
なお、この場合、判別される劣化原因は「境界音色差」である。 For example, as shown in FIG. 6, when “sound” is a deterioration type of sound quality deterioration, the deterioration costs C00 to C05 are calculated by defining them as follows.
C00 = W00 × (w10 × PPaa + w11 × PPab) Formula (1-1)
In this case, the identified cause of deterioration is “pitch difference”.
C01 = W01 × (w20 × PPba + w21 × PPbb) Equation (2-1)
In this case, the identified cause of deterioration is “phoneme length difference”.
C02 = W02 × (w30 × PPca) (formula (3-1))
In this case, the identified deterioration cause is “phoneme tone color difference”.
C03 = W03 × (w40 × PCaa + w41 × PCab) Formula (4-1)
In this case, the identified cause of deterioration is “boundary pitch difference”.
C04 = W04 × (w50 × PCba + w51 × PCbb) Equation (5-1)
In this case, the identified cause of deterioration is “phoneme environment difference”.
C05 = W05 × (w60 × PCca) Expression (6-1)
In this case, the identified cause of deterioration is “boundary tone color difference”.
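
The six degradation costs of Formulas (1-1) to (6-1) could be computed together as in the following sketch. The dictionary-based interface and the English cause labels used as keys are assumptions made for illustration; the coefficient names follow FIG. 6.

```python
from typing import Dict


def sound_degradation_costs(p: Dict[str, float],
                            w: Dict[str, float],
                            W: Dict[str, float]) -> Dict[str, float]:
    """Return C00..C05 keyed by the degradation cause each one identifies.

    p holds the penalties (PPaa ... PCca), w the cost-function normalization
    coefficients (w10 ... w60) and W the per-type coefficients (W00 ... W05).
    """
    return {
        "pitch difference":
            W["W00"] * (w["w10"] * p["PPaa"] + w["w11"] * p["PPab"]),   # (1-1)
        "phoneme length difference":
            W["W01"] * (w["w20"] * p["PPba"] + w["w21"] * p["PPbb"]),   # (2-1)
        "phoneme timbre difference":
            W["W02"] * (w["w30"] * p["PPca"]),                          # (3-1)
        "boundary pitch difference":
            W["W03"] * (w["w40"] * p["PCaa"] + w["w41"] * p["PCab"]),   # (4-1)
        "phoneme environment difference":
            W["W04"] * (w["w50"] * p["PCba"] + w["w51"] * p["PCbb"]),   # (5-1)
        "boundary timbre difference":
            W["W05"] * (w["w60"] * p["PCca"]),                          # (6-1)
    }
```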

ここで上述のように、音声合成に精通しない一般利用者は、音質劣化を分類することができないケースが多く、抑揚の不自然さや、滑舌の不自然さ、誤アクセントなどを全て区別が不可能な音質の劣化と捕らえることが多い。このため、利用者が劣化種別の「音」を選択した場合は、劣化コストＣ００〜Ｃ０５までの各種コスト値に応じて、劣化種別を「音」以外の「抑揚」や「滑舌」に自動修正するように構成することが好ましい。このため、劣化コストＣ００〜Ｃ０５までの各種コスト値に応じて、予め劣化種別の修正方法を指定しておくようにしてもよい。 As noted above, general users who are not familiar with speech synthesis often cannot classify sound quality degradation, and tend to perceive unnatural intonation, unnatural articulation, incorrect accents, and the like all as one indistinguishable degradation of sound quality. For this reason, when the user selects the deterioration type “sound”, it is preferable that the deterioration type be automatically corrected to a type other than “sound”, such as “intonation” or “articulation”, according to the values of the degradation costs C00 to C05. To this end, a method of correcting the deterioration type may be specified in advance according to the values of the degradation costs C00 to C05.
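
One possible way to pre-specify such an automatic correction is sketched below. The rule used here, namely taking the cause with the largest degradation cost and mapping it to a type through a fixed table, is an assumption made for illustration and is not stated in the description. The table `CAUSE_TO_TYPE` and the function `refine_sound_type` are hypothetical names; “intonation” stands for 抑揚 and “articulation” for 滑舌.

```python
from typing import Dict

# Hypothetical mapping from the dominant degradation cause to a deterioration type.
CAUSE_TO_TYPE: Dict[str, str] = {
    "pitch difference": "intonation",
    "boundary pitch difference": "intonation",
    "boundary timbre difference": "intonation",
    "phoneme length difference": "articulation",
}


def refine_sound_type(costs: Dict[str, float]) -> str:
    """Map a user's generic "sound" selection to a more specific type.

    costs maps each degradation cause to its cost value (C00 to C05).
    Causes without an entry in CAUSE_TO_TYPE leave the type as "sound".
    """
    if not costs:
        return "sound"
    dominant_cause = max(costs, key=costs.get)
    return CAUSE_TO_TYPE.get(dominant_cause, "sound")
```

A threshold on the dominant cost, or a comparison of several costs, could equally be used; the description leaves the concrete rule open.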

更に、不自然な音の１つとして、誤アクセントが存在する。誤アクセントは自動修正が困難であるため、利用者により劣化種別詳細として「誤アクセント」が選択された場合は、強制的に手動のアクセント修正ＵＩへ遷移するように構成することが好ましい。 Furthermore, false accents exist as one of unnatural sounds. Since it is difficult to automatically correct an erroneous accent, it is preferable that the user is forced to transit to a manual accent correction UI when “false accent” is selected as the deterioration type detail by the user.

次に、劣化種別８２欄の「抑揚」について説明する。抑揚の音質劣化は、主に以下の２種に起因する。
抑揚例１）目標に比べて合成音声の声の高さが高すぎたり低すぎたり、またはそれらが混在している場合
抑揚例２）接続境界の音色が急激に不自然に変化する場合 Next, “intonation” in the degradation type column 82 will be described. Degradation of intonation is mainly due to the following two causes.
Intonation example 1) The pitch of the synthesized speech is too high or too low compared with the target, or both occur together.
Intonation example 2) The timbre at a connection boundary changes abruptly and unnaturally.

抑揚例１）は、素片選択ペナルティにおけるピッチペナルティＰＰａａ、ＰＰａｂの少なくともいずれかが大きい場合として表される。すなわち、合成音声が目標から外れた声の高さとなるため、抑揚が不自然となる。更に、素片接続ペナルティのピッチ変化ペナルティＰＣａａが大きい場合としても表される。すなわち、接続境界の前後でピッチが大きく変化してしまうため、抑揚が不自然となる。 Intonation example 1) is expressed as the case where at least one of the pitch penalties PPaa and PPab among the segment selection penalties is large. That is, the pitch of the synthesized speech deviates from the target, so the intonation becomes unnatural. It is also expressed as the case where the pitch change penalty PCaa among the segment connection penalties is large. That is, the pitch changes greatly across the connection boundary, so the intonation becomes unnatural.

抑揚例２）は、素片接続ペナルティの音色変化ペナルティＰＣｃａが大きい場合として現される。すなわち、境界前後の音色が不自然に変化するため、聴感上で抑揚が急激に変化したように感じる傾向となる。 Intonation example 2) appears when the timbre change penalty PCca of the segment connection penalty is large. That is, since the timbre before and after the boundary changes unnaturally, it tends to feel as if the inflection has changed abruptly on hearing.

以上の理由から、素片選択ペナルティの「ピッチペナルティＰＰａａ、ＰＰａｂ」、素片接続ペナルティの「ピッチ変化ペナルティＰＣａａ」「音色変化ペナルティＰＣｃａ」を用いて、図６に示すように劣化種別と、その劣化コスト、劣化コスト関数を定義する。 For the above reason, using the “pitch penalty PPaa, PPab” of the segment selection penalty and the “pitch change penalty PCaa” and “timbre change penalty PCca” of the segment connection penalty, as shown in FIG. Define degradation cost and degradation cost function.

まず、例えば、図６の種別詳細欄８４に示すように、音質劣化の劣化種別詳細として「抑揚が大きい」と感じる場合は、目標より高い声で合成された場合であり、ピッチペナルティＰＰａｂに関係する。よって、劣化コストＣ１０とその劣化関数は、図６の式（１−２）に示すように、ピッチペナルティＰＰａｂの関数として以下のように定義する。
Ｃ１０＝Ｗ１０×（ｗ１２×ＰＰａｂ）・・・式（１−２）
なお、このとき判別される劣化原因は「ピッチ差」となる。 First, as shown for example in the type detail column 84 of FIG. 6, when the degradation type detail perceived is “intonation is large”, the speech has been synthesized at a higher pitch than the target, which relates to the pitch penalty PPab. Therefore, the degradation cost C10 and its degradation function are defined as a function of the pitch penalty PPab, as shown in Formula (1-2) of FIG. 6.
C10 = W10 × (w12 × PPab) (1-2)
The cause of deterioration determined at this time is “pitch difference”.

更に、接続境界の前後で不自然に声の高さが変化した場合、目標よりも声が高く変化しても、逆に低く変化しても、相対的に抑揚が大きく変化したと感じる傾向にある。よって、図６の式（４−２）に示すように、ピッチ変化ペナルティＰＣａａ、ＰＣａｂの関数として、劣化コストＣ１１とその劣化関数を定義する。
Ｃ１１＝Ｗ１１×（ｗ４２×ＰＣａａ＋ｗ４３×ＰＣａｂ）・・・式（４−２）
なお、このとき判別される劣化原因は、「境界ピッチ差」となる。 Furthermore, when the pitch changes unnaturally across the connection boundary, the listener tends to perceive a relatively large change in intonation, whether the pitch shifts above or below the target. Therefore, as shown in the equation (4-2) in FIG. 6, the degradation cost C11 and its degradation function are defined as a function of the pitch change penalties PCaa and PCab.
C11 = W11 × (w42 × PCaa + w43 × PCab) Formula (4-2)
The cause of deterioration determined at this time is “boundary pitch difference”.

また、接続境界前後の音色差が大きい場合は、抑揚が急激に変化することで、聴感上、抑揚が大きく感じる傾向にある。よって、図６の式（６−２）に示すように、音色変化ペナルティＰＣｃａの関数として、劣化コストＣ１２とその劣化関数を定義する。
Ｃ１２＝Ｗ１２×（ｗ６１×ＰＣｃａ）・・・式（６−２）
なお、このとき判別される劣化原因は「境界音色差」となる。 In addition, when the tone color difference before and after the connection boundary is large, the inflection changes abruptly, so that the intonation tends to feel large in terms of hearing. Therefore, as shown in the equation (6-2) in FIG. 6, the degradation cost C12 and its degradation function are defined as a function of the timbre change penalty PCca.
C12 = W12 × (w61 × PCca) Formula (6-2)
The cause of deterioration determined at this time is “boundary tone color difference”.

次に、音質劣化の劣化種別詳細で「抑揚が小さい」と感じる場合は、ピッチペナルティにおいて、目標より低い声で合成された場合である。よって、図５の式（１−３）に示すように、ピッチペナルティＰＰａａの関数として劣化コストＣ１３とその劣化関数を定義する。
Ｃ１３＝Ｗ１３×（ｗ１３×ＰＰａａ）・・・式（１−３）
なお、このとき判別される劣化原因は「ピッチ差」となる。 Next, when the degradation type detail is perceived as “intonation is small”, the speech was synthesized at a lower pitch than the target, which relates to the pitch penalty. Therefore, as shown in the equation (1-3) in FIG. 5, the degradation cost C13 and its degradation function are defined as a function of the pitch penalty PPaa.
C13 = W13 × (w13 × PPaa) Formula (1-3)
The cause of deterioration determined at this time is “pitch difference”.
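As an illustration only, the four intonation degradation costs defined above can be sketched in Python. The dictionary layout, the function name intonation_costs, and the default weight of 1.0 are assumptions made for this sketch; the specification does not give values for the weights W10 to W13 or w12, w13, w42, w43, and w61.

```python
def intonation_costs(penalties, weights=None):
    """Evaluate C10 to C13 from penalty values for one target phoneme range.

    penalties: dict with keys "PPaa", "PPab", "PCaa", "PCab", "PCca".
    weights:   optional dict of weight values; any missing weight is
               treated as 1.0 (an assumption of this sketch).
    """
    weights = weights or {}

    def wt(name):
        return weights.get(name, 1.0)

    p = penalties
    return {
        # Formula (1-2): "intonation is large", cause "pitch difference"
        "C10": wt("W10") * (wt("w12") * p["PPab"]),
        # Formula (4-2): cause "boundary pitch difference"
        "C11": wt("W11") * (wt("w42") * p["PCaa"] + wt("w43") * p["PCab"]),
        # Formula (6-2): cause "boundary tone color difference"
        "C12": wt("W12") * (wt("w61") * p["PCca"]),
        # Formula (1-3): "intonation is small", cause "pitch difference"
        "C13": wt("W13") * (wt("w13") * p["PPaa"]),
    }
```

Keeping the weights in a separate dictionary reflects that the same penalty (for example PCca) can feed several degradation costs with different weights.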

更に、不自然な抑揚の１つとして、誤アクセントが存在する。誤アクセントは自動修正が困難であるため、利用者により劣化種別詳細として「誤アクセント」が選択された場合は、強制的に手動のアクセント修正ＵＩへ遷移するように構成することが好ましい。 Furthermore, false accents exist as one of the unnatural inflections. Since it is difficult to automatically correct an erroneous accent, it is preferable that the user is forced to transit to a manual accent correction UI when “false accent” is selected as the deterioration type detail by the user.

次に、劣化種別の「滑舌」について説明する。滑舌の音質劣化は、主に以下の２種に起因する。
滑舌１）目標に比べて合成音声の声が慌しすぎたり、たどたどしすぎたり、またはそれらが混在している場合
滑舌２）接続境界の音色が不自然に変化する場合 Next, the degradation type “smooth tongue” will be described. Sound quality degradation of the smooth tongue is mainly caused by the following two factors.
Smooth tongue 1) The synthesized speech is too hurried or too halting compared with the target, or both are mixed.
Smooth tongue 2) The timbre at the connection boundary changes unnaturally.

滑舌１）は、素片選択ペナルティにおける音素長ペナルティＰＰｂａ、ＰＰｂｂのいずれか少なくとも一方が大きい場合である。このような場合、合成音声が慌しくなったり、たどたどしくなったりするため、滑舌が不自然となる。 Smooth tongue 1) is the case where at least one of the phoneme length penalties PPba and PPbb of the segment selection penalty is large. In such a case, the synthesized speech becomes hurried or halting, so the smooth tongue becomes unnatural.

滑舌２）は、素片接続ペナルティの音素環境ペナルティＰＣｂａ、ＰＣｂｂのいずれか少なくとも一方が大きい場合である。このような場合、素片同士の接続境界の音の渡りが不自然となって音色の変化に滑らかさが無くなり、滑舌が不自然となる。更に、音色変化ペナルティＰＣｃａが大きい場合もある。このような場合、境界前後の音色が不自然に変化するため、音色の変化に滑らかさが無くなり、滑舌が不自然となる。 The smooth tongue 2) is a case where at least one of the phoneme environment penalties PCba and PCbb of the segment connection penalty is large. In such a case, the transition of the sound at the connection boundary between the segments is unnatural, the smoothness of the timbre changes is lost, and the smooth tongue is unnatural. Furthermore, the timbre change penalty PCca may be large. In such a case, since the timbre before and after the boundary changes unnaturally, the change in timbre is not smooth, and the smooth tongue becomes unnatural.

よって、素片選択ペナルティの「音素長ペナルティＰＰｂａ、ＰＰｂｂ」、素片接続ペナルティの「音素環境ペナルティＰＣｂａ、ＰＣｂｂ」「音色変化ペナルティＰＣｃａ」を用いて、図３に示すように劣化種別と、その劣化コスト、劣化コスト関数を定義する。 Therefore, the “phoneme length penalties PPba, PPbb” of the segment selection penalty and the “phoneme environment penalties PCba, PCbb” and “timbre change penalty PCca” of the segment connection penalty are used to define the degradation types, their degradation costs, and the degradation cost functions, as shown in FIG. 3.

まず、音質劣化の種別詳細として「滑舌が慌しい」と感じる場合は、目標より短い音素長で合成された場合である。よって、図６の式（２−２）に示すように、音素長ペナルティＰＰｂａを用いて、劣化コストＣ２０とその劣化関数を定義する。
Ｃ２０＝Ｗ２０×（ｗ２２×ＰＰｂａ）・・・式（２−２）
なお、このとき判別される劣化原因は、「音素長差」となる。 First, when the degradation type detail is perceived as “smooth tongue is hurried”, the speech was synthesized with a phoneme length shorter than the target. Therefore, as shown in the equation (2-2) in FIG. 6, the degradation cost C20 and its degradation function are defined using the phoneme length penalty PPba.
C20 = W20 × (w22 × PPba) Formula (2-2)
The cause of deterioration determined at this time is “phoneme length difference”.

更に、接続境界の前後で音素環境が異なる場合、境界前後の音のつながりの滑らかさが無くなり、慌しく感じたり、たどたどしく感じたりする傾向にある。よって、図６の式（５−２）に示すように、音素環境ペナルティＰＣｂａ、ＰＣｂｂを用いて劣化コストＣ２１とその劣化関数を定義する。
Ｃ２１＝Ｗ２１×（ｗ５２×ＰＣｂａ＋ｗ５３×ＰＣｂｂ）・・・式（５−２）
なお、このとき判別される劣化原因は、「音素環境差」となる。 Furthermore, when the phoneme environment differs across the connection boundary, the sounds no longer connect smoothly across the boundary, and the speech tends to sound hurried or halting. Therefore, as shown in the equation (5-2) in FIG. 6, the degradation cost C21 and its degradation function are defined using the phoneme environment penalties PCba and PCbb.
C21 = W21 × (w52 × PCba + w53 × PCbb) Formula (5-2)
Note that the cause of deterioration determined at this time is “phoneme environment difference”.

また、接続境界前後の音色差が大きい場合も、境界前後の音のつながりの滑らかさが無くなり、慌しく感じたり、たどたどしく感じたりする傾向にある。よって、図６の式（６−３）に示すように、音色変化ペナルティＰＣｃａを用いて劣化コストＣ２２とその劣化関数を定義する。
Ｃ２２＝Ｗ２２×（ｗ６２×ＰＣｃａ）・・・式（６−３）
なお、このとき判別される劣化原因は、「境界音色差」となる。 Likewise, when the timbre difference across the connection boundary is large, the sounds no longer connect smoothly across the boundary, and the speech tends to sound hurried or halting. Therefore, as shown in the equation (6-3) in FIG. 6, the degradation cost C22 and its degradation function are defined using the timbre change penalty PCca.
C22 = W22 × (w62 × PCca) Formula (6-3)
The cause of deterioration determined at this time is “boundary tone color difference”.

次に、音質劣化の劣化種別詳細で「滑舌がたどたどしい」と感じる場合は、目標より長い音素長で合成された場合である。よって、図６の式（２−３）に示すように、音素長ペナルティＰＰｂｂを用いて劣化コストＣ２３とその劣化関数を定義する。
Ｃ２３＝Ｗ２３×（ｗ２３×ＰＰｂｂ）・・・式（２−３）
なお、このとき判別される劣化原因は、「音素長差」である。 Next, when the degradation type detail is perceived as “smooth tongue is halting”, the speech was synthesized with a phoneme length longer than the target. Therefore, as shown in the equation (2-3) in FIG. 6, the degradation cost C23 and its degradation function are defined using the phoneme length penalty PPbb.
C23 = W23 × (w23 × PPbb) Formula (2-3)
The cause of deterioration determined at this time is “phoneme length difference”.

更に、接続境界の前後で音素環境が異なる場合、境界前後の音のつながりの滑らかさが無くなり、たどたどしく感じたり、慌しく感じたりする傾向にある。よって、図６の式（５−３）に示すように、音素環境ペナルティＰＣｂａ、ＰＣｂｂを用いて劣化コストＣ２４とその劣化関数を定義する。
Ｃ２４＝Ｗ２４×（ｗ５４×ＰＣｂａ＋ｗ５５×ＰＣｂｂ）・・・式（５−３）
なお、このとき判別される劣化原因は、「音素環境差」である。 Furthermore, when the phoneme environment differs across the connection boundary, the sounds no longer connect smoothly across the boundary, and the speech tends to sound halting or hurried. Therefore, as shown in the equation (5-3) in FIG. 6, the degradation cost C24 and its degradation function are defined using the phoneme environment penalties PCba and PCbb.
C24 = W24 × (w54 × PCba + w55 × PCbb) Formula (5-3)
Note that the cause of deterioration determined at this time is “phoneme environment difference”.

また、接続境界前後の音色差が大きい場合も、境界前後の音のつながりの滑らかさが無くなり、たどたどしく感じたり、慌しく感じたりする傾向にある。よって、図６の式（６−４）に示すように、音色変化ペナルティＰＣｃａを用いて劣化コストＣ２５とその劣化関数を定義する。
Ｃ２５＝Ｗ２５×（ｗ６３×ＰＣｃａ）・・・式（６−４）
なお、このとき判別される劣化原因は「境界音色差」となる。 Likewise, when the timbre difference across the connection boundary is large, the sounds no longer connect smoothly across the boundary, and the speech tends to sound halting or hurried. Therefore, as shown in the equation (6-4) in FIG. 6, the degradation cost C25 and its degradation function are defined using the timbre change penalty PCca.
C25 = W25 × (w63 × PCca) Formula (6-4)
The cause of deterioration determined at this time is “boundary tone color difference”.
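The six “smooth tongue” degradation costs share one shape (an outer weight times a weighted sum of penalties), so they can be sketched as a lookup table. This is a minimal sketch under assumptions: the table name SMOOTH_TONGUE_COSTS, the function name, the English labels “hurried” and “halting” (renderings of 慌しい and たどたどしい), and the default weight of 1.0 are illustrative and not taken from the specification.

```python
# cost name -> (outer weight, [(inner weight, penalty), ...], type detail, cause)
SMOOTH_TONGUE_COSTS = {
    "C20": ("W20", [("w22", "PPba")], "hurried", "phoneme length difference"),
    "C21": ("W21", [("w52", "PCba"), ("w53", "PCbb")], "hurried",
            "phoneme environment difference"),
    "C22": ("W22", [("w62", "PCca")], "hurried", "boundary tone color difference"),
    "C23": ("W23", [("w23", "PPbb")], "halting", "phoneme length difference"),
    "C24": ("W24", [("w54", "PCba"), ("w55", "PCbb")], "halting",
            "phoneme environment difference"),
    "C25": ("W25", [("w63", "PCca")], "halting", "boundary tone color difference"),
}


def smooth_tongue_costs(penalties, weights=None):
    """Return {cost name: (value, type detail, cause)} for C20 to C25.

    Missing weights default to 1.0 and missing penalties to 0.0
    (both are assumptions of this sketch).
    """
    weights = weights or {}
    result = {}
    for name, (outer, terms, detail, cause) in SMOOTH_TONGUE_COSTS.items():
        inner = sum(weights.get(w, 1.0) * penalties.get(p, 0.0) for w, p in terms)
        result[name] = (weights.get(outer, 1.0) * inner, detail, cause)
    return result
```

Carrying the cause label alongside each value keeps the link between a high cost and the degradation cause that the text associates with it.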

利用者により劣化種別詳細として「誤アクセント」が選択された場合は、「抑揚」の場合と同様に、強制的に手動のアクセント修正ＵＩへ遷移するように構成することが好ましい。 When “false accent” is selected as the degradation type details by the user, it is preferable to force the transition to the manual accent correction UI as in the case of “inflection”.

以上の構成により、音声合成時の中間情報に含まれる素片選択ペナルティ、素片接続ペナルティが音質劣化を量的に表す情報である点を利用し、各ペナルティの組み合わせを劣化コストおよび劣化コスト関数として定義する。これにより、劣化コストは、劣化種別を正しく判定するのに適した指標となり、劣化種別の判定精度を向上させることができる。更に、劣化種別は、劣化コストと１対１で対応するものではなく、複数の劣化コストの組み合わせとしているので、劣化種別の判定精度を向上させることが可能となる。 With the above configuration, the fact that the segment selection penalty and the segment connection penalty included in the intermediate information of speech synthesis quantitatively represent sound quality degradation is used to define combinations of the penalties as degradation costs and degradation cost functions. The degradation cost thereby becomes an index suited to determining the degradation type correctly, and the determination accuracy of the degradation type can be improved. Furthermore, a degradation type does not correspond one-to-one to a degradation cost but to a combination of a plurality of degradation costs, which also improves the determination accuracy of the degradation type.

以下、図７から図１１を参照しながら、第１の実施の形態による音声合成装置１の動作について説明する。図７は、音声合成装置１における劣化種別表示の動作を示すフローチャート、図８は、表音テキストおよび連続音素列の表示例を示す図である。 The operation of the speech synthesizer 1 according to the first embodiment will be described below with reference to FIGS. 7 to 11. FIG. 7 is a flowchart showing the operation of displaying the degradation type in the speech synthesizer 1. FIG. 8 is a diagram showing a display example of the phonetic text and the continuous phoneme string.

図７に示すように、まず、音声合成装置１では、利用者により、音質修正ＵＩ部２１を介して日本語テキストが入力テキストとして入力される。音声合成部７の言語処理部２３は、入力テキストを変換することにより、表音テキストを取得する（Ｓ１０１）。表音テキストは、日本語テキストの読みをカタカナ表記したものであり、更には音声合成に必要なアクセント情報等の付加情報が存在する場合もある。 As shown in FIG. 7, first, in the speech synthesizer 1, the user inputs Japanese text as input text via the sound quality correction UI unit 21. The language processing unit 23 of the speech synthesis unit 7 acquires the phonetic text by converting the input text (S101). The phonetic text is the reading of the Japanese text written in katakana, and may also carry additional information needed for speech synthesis, such as accent information.

また、音質修正ＵＩ部２１は、音声合成部７で合成された音声を取得し、利用者が、合成音声を再生して、音声が劣化していると感じる範囲を音質劣化範囲として指定するための画面を表示する（Ｓ１０２）。 In addition, the sound quality correction UI unit 21 acquires the speech synthesized by the speech synthesis unit 7 and displays a screen on which the user plays back the synthesized speech and designates the range perceived as degraded as the sound quality degradation range (S102).

音声合成部７では、韻律生成部２５が読みテキスト５４に従って目標韻律を生成し、素片選択接続部２７が、目標韻律に応じて波形辞書３５から選択した音声素片および素片同士の接続に関する選択接続情報を生成する。波形処理部２９では、選択接続情報に基づいて、波形辞書３５から選択された音声素片に対して波形処理を行い、例えば図８の合成音声波形７０を生成し、再生する。 In the speech synthesis unit 7, the prosody generation unit 25 generates a target prosody according to the reading text 54, and the segment selection connection unit 27 generates selected connection information about the speech segments selected from the waveform dictionary 35 according to the target prosody and about the connections between those segments. The waveform processing unit 29 performs waveform processing on the speech segments selected from the waveform dictionary 35 based on the selected connection information, and generates and plays back, for example, the synthesized speech waveform 70 of FIG. 8.

この際に、音質修正ＵＩ部２１は、図８に示すように、入力テキスト５２、読みテキスト５４、合成音声波形７０等を不図示の表示部に表示例５０のように提示する。表示例５０のように、入力テキスト５２が「今日の天気は晴れです。」である場合、その読みテキスト５４は「キョーノ／テンキワ／ハレデス．」等と表わされる。また、合成音声波形７０は、横軸が時間、縦軸は振幅として表される。 At this time, as shown in FIG. 8, the sound quality correction UI unit 21 presents the input text 52, the reading text 54, the synthesized speech waveform 70, and the like on a display unit (not shown) as in the display example 50. When the input text 52 is “Today's weather is sunny” as in the display example 50, the reading text 54 is expressed as “Kyono / Tenkiwa / Haledes.”. The synthesized speech waveform 70 is represented as time on the horizontal axis and amplitude on the vertical axis.

ここで表示例５０は、合成音声を再生させるための再生ボタン６２、停止させるための停止ボタン６４、合成音声を保存するための保存ボタン６６を表示するように構成されている。 Here, the display example 50 is configured to display a reproduction button 62 for reproducing the synthesized voice, a stop button 64 for stopping, and a save button 66 for saving the synthesized voice.

利用者が出力された合成音声に音質劣化を感じた場合、利用者は、表示部に提示された読みテキスト５４、または合成音声波形７０の中から、音質劣化範囲５６を指定する。図８では、読みテキスト５４上で、音質劣化範囲５６を指定した例を示している。音質劣化範囲５６は、「キワ／ハレ」として指定されている。 When the user perceives sound quality degradation in the output synthesized speech, the user designates the sound quality degradation range 56 in the reading text 54 or the synthesized speech waveform 70 presented on the display unit. FIG. 8 shows an example in which the sound quality degradation range 56 is designated on the reading text 54. The sound quality degradation range 56 is designated as “Kiwa / Hare”.

すなわち、図７に戻って、範囲取得部４１は、読みテキスト５４において、利用者が指定した音質劣化範囲５６を取得する（Ｓ１０３）。このとき例えば、音声合成装置１において、図示せぬマウス装置等により利用者が「キワ／ハレ」の部分を選択することにより、範囲取得部４１は、音質修正ＵＩ部２１を介して音質劣化範囲５６を取得する。 That is, returning to FIG. 7, the range acquisition unit 41 acquires the sound quality degradation range 56 designated by the user in the reading text 54 (S103). For example, when the user selects the “Kiwa / Hare” portion in the speech synthesizer 1 with a mouse device or the like (not shown), the range acquisition unit 41 acquires the sound quality degradation range 56 via the sound quality correction UI unit 21.

また、範囲取得部４１は、利用者が指定した音質劣化範囲５６と、音声合成部７からの中間情報に基づいて、利用者指定の音質劣化範囲５６に一致する合成音声の対象音素の範囲を取得する。 In addition, based on the sound quality degradation range 56 designated by the user and the intermediate information from the speech synthesis unit 7, the range acquisition unit 41 acquires the range of target phonemes of the synthesized speech that matches the user-designated sound quality degradation range 56.

劣化コスト算出部４３は、素片選択接続部２７から中間情報を取得し（Ｓ１０４）、取得した中間情報と音質劣化範囲５６に基づき、音質劣化範囲５６に相当する対象音素範囲における劣化コストを算出する（Ｓ１０５）。劣化コストは、上述のように音声合成部７の中間情報の１つである素片選択ペナルティや素片接続のペナルティに基づき算出する。このとき、算出された劣化コストは、例えば、記憶部５に格納される。 The deterioration cost calculation unit 43 acquires intermediate information from the segment selection connection unit 27 (S104) and, based on the acquired intermediate information and the sound quality degradation range 56, calculates the degradation costs in the target phoneme range corresponding to the sound quality degradation range 56 (S105). As described above, the degradation costs are calculated from the segment selection penalty and the segment connection penalty, which are items of the intermediate information of the speech synthesis unit 7. The calculated degradation costs are stored, for example, in the storage unit 5.

ところで、劣化コストと劣化種別については、音声合成方式によって、劣化種別と劣化コストの対応付けと、劣化コスト関数の定義方法が異なるため、以下に第１から第３の３種類の音声合成方式を簡単に説明する。 Incidentally, the association between degradation types and degradation costs and the way the degradation cost functions are defined differ depending on the speech synthesis method, so the first to third speech synthesis methods are briefly described below.

第１の音声合成方式は、ピッチ変換や音素長変換等の波形処理によって波形辞書から選択した素片を目標韻律に合わせ、合成音声を生成する波形変換方式である。波形辞書から選択した素片に対し波形変換処理を行うので、波形辞書規模は比較的小規模で済み、目標韻律通りの合成音声が生成できる利点がある。しかし、波形処理で音質が劣化することがある。 The first speech synthesis method is a waveform conversion method for generating synthesized speech by matching a segment selected from a waveform dictionary with a target prosody by waveform processing such as pitch conversion or phoneme length conversion. Since the waveform conversion process is performed on the segment selected from the waveform dictionary, the waveform dictionary size is relatively small, and there is an advantage that synthesized speech according to the target prosody can be generated. However, the sound quality may be deteriorated by the waveform processing.

第２の音声合成方式は、波形変換処理を行わず、目標韻律に近い素片を波形辞書から選択し、そのままつなぎ合わせる波形無変換方式である。波形変換処理を行わないので肉声感の高い合成音声が生成できる利点があるが、必ずしも目標韻律通りとならずに韻律が破綻する欠点を持つ。また、波形辞書を大規模化することで韻律破綻を抑制しているが、完全ではない。 The second speech synthesis method is a waveform non-conversion method in which a segment close to the target prosody is selected from the waveform dictionary without being subjected to waveform conversion processing and connected as it is. Since waveform conversion processing is not performed, there is an advantage that a synthesized voice with a high real voice can be generated, but there is a disadvantage that the prosody breaks down without necessarily following the target prosody. Moreover, the prosody breakdown is suppressed by increasing the waveform dictionary, but it is not perfect.

第３の音声合成方式は、波形変換方式と波形無変換方式の混在方式である。波形無変換方式がベースとなるが、韻律が破綻する箇所に限って波形変換方式を利用することで、必要最小限の波形変換により、肉声感を保ちつつ韻律が安定した合成音声を生成できる利点があるが、肉声感の高い部分と肉声感を損なった部分が混在し、違和感が高くなることがある。 The third speech synthesis method is a hybrid of the waveform conversion method and the waveform non-conversion method. It is based on the waveform non-conversion method but applies the waveform conversion method only where the prosody breaks down. This has the advantage that, with the minimum necessary waveform conversion, synthesized speech with stable prosody can be generated while the natural voice quality is preserved. However, portions with a strong natural voice quality are mixed with portions where it has been lost, which can make the result sound more incongruous.

第１の実施の形態においては、上記第１から第３の音声合成方式のいずれの方式においても、劣化種別と劣化コストの対応付けと、劣化コスト関数の定義方法を調整することで劣化種別に応じた音質修正が可能である。ただし、本明細書中では説明の便宜上、第２の音声合成方式である波形無変換方式に限定して、図３から図６に示した「各種ペナルティ」と、図６に示した「劣化種別と劣化コストの対応付け」と「劣化コスト関数の定義」を説明した。 In the first embodiment, sound quality correction according to the degradation type is possible with any of the first to third speech synthesis methods by adjusting the association between degradation types and degradation costs and the way the degradation cost functions are defined. For convenience of explanation, however, this specification has described the “various penalties” shown in FIGS. 3 to 6 and the “association between degradation types and degradation costs” and “definition of the degradation cost functions” shown in FIG. 6 only for the second speech synthesis method, the waveform non-conversion method.

図７に戻って、劣化コスト算出部４３が、音質劣化範囲に相当する対象音素範囲において、各劣化コストを算出すると、劣化種別判定部４５は、対象音素範囲における音質劣化を劣化コストに基づいて「音、抑揚、滑舌」の劣化種別を判定する（Ｓ１０６）。 Returning to FIG. 7, when the deterioration cost calculation unit 43 calculates each deterioration cost in the target phoneme range corresponding to the sound quality deterioration range, the deterioration type determination unit 45 determines the sound quality deterioration in the target phoneme range based on the deterioration cost. The deterioration type of “sound, intonation, smooth tongue” is determined (S106).

ここで、図９を参照しながら、劣化種別判定の動作について説明する。図９は、音声合成装置１における劣化種別判定の動作を示すフローチャートである。図９に示すように、劣化種別判定部４５は、所定個数Ｍ個の劣化コスト上位ＨＣ（ｍ）（ｍ＝０〜Ｍ−１）を初期化する（Ｓ１２１）。なお、整数Ｍは、劣化種別を選択する際の修正候補劣化種別を決定するために参照する、劣化コストＣｎの数である。劣化コストＣｎとは、図６に示した劣化コストＣ００〜Ｃ２５の総称である。ここで、ｎ＝００〜２５である。 Here, the operation of determining the deterioration type will be described with reference to FIG. FIG. 9 is a flowchart showing the operation of determining the degradation type in the speech synthesizer 1. As shown in FIG. 9, the deterioration type determination unit 45 initializes a predetermined number M of deterioration cost upper HC (m) (m = 0 to M−1) (S121). The integer M is the number of deterioration costs Cn that is referred to in order to determine a correction candidate deterioration type when selecting a deterioration type. The deterioration cost Cn is a generic name of the deterioration costs C00 to C25 shown in FIG. Here, n = 00-25.

劣化種別判定部４５は、劣化コスト算出部４３が算出した劣化コストＣｎを取得する（Ｓ１２２）。劣化種別判定部４５は、劣化コストＣｎ＞劣化コスト上位ＨＣ（ｍ）であるか否か判別する（Ｓ１２３）。劣化種別判定部４５は、劣化コストＣｎが劣化コスト上位ＨＣ（ｍ）を超えた場合（Ｓ１２３：ＹＥＳ）、Ｓ１２４に処理を進め、劣化コスト上位ＨＣ（ｍ）＝劣化コストＣｎとする。劣化種別判定部４５は、劣化コストＣｎ≦劣化コスト上位ＨＣであれば（Ｓ１２３：ＮＯ）、Ｓ１２５に処理を進める。劣化種別判定部４５は、Ｓ１２３、Ｓ１２４をｍ＝０〜Ｍ−１について繰り返し実行することにより、劣化コストＣｎをソート処理し、例えば記憶部５に登録する。 The degradation type determination unit 45 acquires the degradation cost Cn calculated by the degradation cost calculation unit 43 (S122). The deterioration type determination unit 45 determines whether or not deterioration cost Cn> deterioration cost upper HC (m) (S123). When the deterioration cost Cn exceeds the deterioration cost upper HC (m) (S123: YES), the deterioration type determination unit 45 proceeds to S124, and sets the deterioration cost upper HC (m) = deterioration cost Cn. If the degradation cost Cn ≦ deterioration cost upper HC (S123: NO), the degradation type determination unit 45 advances the process to S125. The degradation type determination unit 45 sorts the degradation costs Cn by repeatedly executing S123 and S124 for m = 0 to M−1 and registers them in the storage unit 5, for example.

劣化種別判定部４５は、次の劣化コストＣｎが存在するか否か判別し（Ｓ１２５）、次の劣化コストＣｎが存在する間（Ｓ１２５：ＹＥＳ）、劣化コストＣｎを劣化コストＣｎの値が高い順にソートを繰り返す。この結果、劣化種別判定部４５は、Ｍ個のソートされた劣化コスト上位ＨＣ（ｍ）を得る。 The degradation type determination unit 45 determines whether or not the next degradation cost Cn exists (S125). While the next degradation cost Cn exists (S125: YES), the degradation cost Cn has a high value of the degradation cost Cn. Repeat sorting in order. As a result, the degradation type determination unit 45 obtains M sorted degradation cost high rank HC (m).
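The outcome of S121 to S125, namely the M largest degradation costs in descending order, can be sketched as follows. This is not the flowchart's step-by-step comparison: the sketch swaps in Python's standard heapq.nlargest to obtain the same kind of result, and the function name and the sample values are assumptions made for illustration.

```python
import heapq


def top_degradation_costs(costs, m):
    """Return up to m (cost name, value) pairs, largest value first.

    costs: dict mapping a cost name such as "C20" to its computed value.
    """
    return heapq.nlargest(m, costs.items(), key=lambda item: item[1])


# Invented values, for illustration only.
example = {"C00": 0.1, "C10": 0.7, "C20": 0.9, "C05": 0.4}
top_two = top_degradation_costs(example, 2)
```

When fewer than M degradation costs exist, the result simply contains all of them, which matches the text's reference to “M or fewer” registered costs.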

劣化種別判定部４５は、未処理の劣化コストＣｎが存在しなくなると（Ｓ１２５：ＮＯ）、Ｍ個以下の劣化コスト上位ＨＣ（ｍ）に登録された劣化コストＣｎから、劣化種別を選定する。すなわち、劣化種別判定部４５は、ｍ＝０とし（Ｓ１２６）、ｍ＜Ｍであれば（Ｓ１２７：ＹＥＳ）、０≦ｎ≦０５であるか否か判別する（Ｓ１２８）。０≦ｎ≦０５である場合には（Ｓ１２８：ＹＥＳ）、劣化種別判定部４５は、劣化種別は「音」とし、種別詳細は表示せず（Ｓ１２９）、ｍ＝ｍ＋１（Ｓ１３０）に更新してＳ１２７に戻る。 When there is no unprocessed degradation cost Cn (S125: NO), the degradation type determination unit 45 selects a degradation type from the degradation costs Cn registered in the M or less degradation cost upper HC (m). That is, the degradation type determination unit 45 sets m = 0 (S126), and if m <M (S127: YES), determines whether 0 ≦ n ≦ 05 is satisfied (S128). If 0 ≦ n ≦ 05 (S128: YES), the deterioration type determination unit 45 sets the deterioration type to “sound”, does not display the type details (S129), and updates m = m + 1 (S130). And return to S127.

０≦ｎ≦０５でない場合には（Ｓ１２８：ＮＯ）、Ｓ１３１に進み、劣化種別判定部４５は、１０≦ｎ≦１２であるか否か判別する。１０≦ｎ≦１２である場合には（Ｓ１３１：ＹＥＳ）、劣化種別判定部４５は、劣化種別は「抑揚」とし、種別詳細は「大」とし（Ｓ１３２）、ｍ＝ｍ＋１（Ｓ１３０）に更新してＳ１２７に戻る。 When 0 ≦ n ≦ 05 is not satisfied (S128: NO), the process proceeds to S131, and the deterioration type determination unit 45 determines whether 10 ≦ n ≦ 12. When 10 ≦ n ≦ 12 is satisfied (S131: YES), the deterioration type determination unit 45 sets the deterioration type to “intonation”, sets the type details to “large” (S132), and updates m = m + 1 (S130). Then, the process returns to S127.

１０≦ｎ≦１２でない場合には（Ｓ１３１：ＮＯ）、Ｓ１３３に進み、劣化種別判定部４５は、ｎ＝１３であるか否か判別する。ｎ＝１３である場合には（Ｓ１３３：ＹＥＳ）、劣化種別判定部４５は、劣化種別は「抑揚」とし、種別詳細は「小」とし（Ｓ１３４）、ｍ＝ｍ＋１（Ｓ１３０）に更新してＳ１２７に戻る。 When 10 ≦ n ≦ 12 is not satisfied (S131: NO), the process proceeds to S133, and the deterioration type determination unit 45 determines whether n = 13. When n = 13 (S133: YES), the deterioration type determination unit 45 sets the deterioration type to “intonation” and the type details to “small” (S134), updates m = m + 1 (S130), and returns to S127.

ｎ＝１３でない場合には（Ｓ１３３：ＮＯ）、Ｓ１３５に進み、劣化種別判定部４５は、２０≦ｎ≦２２であるか否か判別する。２０≦ｎ≦２２である場合には（Ｓ１３５：ＹＥＳ）、劣化種別判定部４５は、劣化種別は「滑舌」とし、種別詳細は「慌しい」とし（Ｓ１３６）、ｍ＝ｍ＋１（Ｓ１３０）に更新してＳ１２７に戻る。 When n = 13 is not satisfied (S133: NO), the process proceeds to S135, and the deterioration type determination unit 45 determines whether 20 ≦ n ≦ 22. When 20 ≦ n ≦ 22 is satisfied (S135: YES), the deterioration type determination unit 45 sets the deterioration type to “smooth tongue” and the type details to “hurried” (S136), updates m = m + 1 (S130), and returns to S127.

２０≦ｎ≦２２でない場合には（Ｓ１３５：ＮＯ）、Ｓ１３７に進み、劣化種別判定部４５は、２３≦ｎ≦２５であるか否か判別する。２３≦ｎ≦２５である場合には（Ｓ１３７：ＹＥＳ）、劣化種別判定部４５は、劣化種別は「滑舌」とし、種別詳細は「たどたどしい」とし（Ｓ１３８）、ｍ＝ｍ＋１（Ｓ１３０）に更新してＳ１２７に戻る。 When 20 ≦ n ≦ 22 is not satisfied (S135: NO), the process proceeds to S137, and the deterioration type determination unit 45 determines whether 23 ≦ n ≦ 25. When 23 ≦ n ≦ 25 is satisfied (S137: YES), the deterioration type determination unit 45 sets the deterioration type to “smooth tongue” and the type details to “halting” (S138), updates m = m + 1 (S130), and returns to S127.

２３≦ｎ≦２５でない場合には、劣化種別判定部４５は、エラーを出力する。劣化種別判定部４５は、ｍ≧Ｍとなった場合には、図７のＳ１０６に戻る。 When 23 ≦ n ≦ 25 is not satisfied, the degradation type determination unit 45 outputs an error. When m ≧ M, that is, when the condition m < M of S127 no longer holds, the degradation type determination unit 45 returns to S106 in FIG. 7.
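The branch sequence S128 to S138 maps the index n of a degradation cost Cn to a degradation type and a type detail. A minimal Python sketch follows; the function name classify_cost, the parsing of names such as "C20", and the English labels (“hurried” for 慌しい, “halting” for たどたどしい) are assumptions of this sketch, not part of the specification.

```python
def classify_cost(name):
    """Map a cost name such as "C20" to (degradation type, type detail).

    Follows the ranges checked in S128 to S138; any other index is
    treated as an error, as in the flowchart.
    """
    n = int(name[1:])  # "C05" -> 5, "C20" -> 20
    if 0 <= n <= 5:
        return ("sound", None)  # type detail is not displayed
    if 10 <= n <= 12:
        return ("intonation", "large")
    if n == 13:
        return ("intonation", "small")
    if 20 <= n <= 22:
        return ("smooth tongue", "hurried")
    if 23 <= n <= 25:
        return ("smooth tongue", "halting")
    raise ValueError(f"unexpected degradation cost index: {name}")
```

Applying this function to each of the top M degradation costs yields the set of candidate degradation types to present to the user.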

上述のように、各劣化種別は、少なくとも１つの劣化コストに基づいて分類される。劣化種別判定部４５は、分類された劣化種別に含まれる劣化コストの値が、他の劣化コストに比べて高値であるか否かを、本実施形態においては、劣化コスト上位ＨＣ（ｍ）を抽出することにより判別している。劣化コストＣｎが高値であると判別されると、すなわち、劣化コスト上位ＨＣ（ｍ）に対応する劣化コストＣｎを、劣化種別判定部４５は、音質修正候補の劣化種別と判定する。 As described above, each degradation type is classified based on at least one degradation cost. In this embodiment, the degradation type determination unit 45 determines whether the degradation costs belonging to a classified degradation type are high compared with the other degradation costs by extracting the deterioration cost upper HC (m). When a degradation cost Cn is determined to be high, that is, when it corresponds to one of the deterioration cost upper HC (m), the degradation type determination unit 45 determines the degradation type to which that cost belongs to be a candidate for sound quality correction.

図７の処理に戻って、劣化位置取得部４７は、音質修正候補と判定された劣化種別について、劣化種別の劣化コスト値と、その劣化位置を取得および保持する（Ｓ１０７）。図１０は、劣化位置の例を示す図である。図１０に示すように、音素列１６０は、読みテキスト５４において指定された音質劣化範囲５６に対応する再合成範囲の一例である。音質劣化範囲５６の中の「キワ／」は、音素列１６２の「Ｋ|Ｉ|Ｗ|Ａ」に対応し、「ハレ」は、音素列１６３の「Ｈ|Ａ|Ｒ|Ｅ」に対応している。ここで、「|」は音素境界を示す。 Returning to the processing of FIG. 7, the deterioration position acquisition unit 47 acquires and holds, for each degradation type determined to be a sound quality correction candidate, the degradation cost value of that type and its deterioration position (S107). FIG. 10 is a diagram illustrating an example of the deterioration position. As shown in FIG. 10, the phoneme string 160 is an example of the re-synthesis range corresponding to the sound quality degradation range 56 designated in the reading text 54. “Kiwa /” in the sound quality degradation range 56 corresponds to “K | I | W | A” of the phoneme string 162, and “Hare” corresponds to “H | A | R | E” of the phoneme string 163. Here, “|” indicates a phoneme boundary.

例えば、図９の劣化種別判定処理において、劣化コスト上位ＨＣ（ｍ）として劣化コストＣ２０が判別されているとする。このとき、劣化位置取得部４７は、劣化コストＣ２０が音素長ペナルティＰＰｂａの関数であることを判別し、さらに、算出された音素長ペナルティＰＰｂａを記憶部５で参照し、値が高値の箇所を劣化位置として特定する。劣化位置は、選択された素片に対応することが好ましい。図１０においては、音素列「ＷＡ」が、劣化コストＣ２０に対応する劣化位置１６５と取得される。 For example, it is assumed that the deterioration cost C20 is determined as the higher deterioration cost HC (m) in the deterioration type determination process of FIG. At this time, the degradation position acquisition unit 47 determines that the degradation cost C20 is a function of the phoneme length penalty PPba, and further refers to the calculated phoneme length penalty PPba in the storage unit 5 to determine a location where the value is high. Identified as a degraded position. The degradation position preferably corresponds to the selected segment. In FIG. 10, the phoneme string “WA” is acquired as the deterioration position 165 corresponding to the deterioration cost C20.

別の例として、劣化コスト上位ＨＣ（ｍ）として劣化コストＣ０５が判別されているとする。このとき、劣化位置取得部４７は、劣化コストＣ０５が音色変化ペナルティＰＣｃａの関数であることを判別し、さらに、算出された音色変化ペナルティＰＣｃａを記憶部５で参照し、値が高値の箇所を劣化位置として特定する。この例では、図１０における音素列「ＷＡ−ＨＡ」が、劣化コストＣ０５に対応する劣化位置１６７と取得される。 As another example, it is assumed that the degradation cost C05 is determined as the higher degradation cost HC (m). At this time, the degradation position acquisition unit 47 determines that the degradation cost C05 is a function of the timbre change penalty PCca, and further refers to the calculated timbre change penalty PCca in the storage unit 5 to determine a location where the value is high. Identified as a degraded position. In this example, the phoneme string “WA-HA” in FIG. 10 is acquired as the deterioration position 167 corresponding to the deterioration cost C05.
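Locating the deterioration position amounts to finding where the relevant penalty is highest. The sketch below is illustrative only: the segmentation into KI, WA, HA, RE, the penalty values, and both function names are invented here to reproduce the deterioration positions 165 (“WA”) and 167 (“WA-HA”) of FIG. 10, and the stored penalty format in the storage unit 5 is not described in the text.

```python
def worst_segment(segments, selection_penalties):
    """Return the segment with the largest selection penalty (e.g. PPba).

    selection_penalties[k] is the penalty of segments[k].
    """
    k = max(range(len(segments)), key=lambda i: selection_penalties[i])
    return segments[k]


def worst_boundary(segments, connection_penalties):
    """Return "left-right" for the boundary with the largest connection
    penalty (e.g. PCca).

    connection_penalties[k] belongs to the boundary between segments[k]
    and segments[k + 1], so it has len(segments) - 1 entries.
    """
    k = max(range(len(segments) - 1), key=lambda i: connection_penalties[i])
    return f"{segments[k]}-{segments[k + 1]}"


# Hypothetical segmentation and penalty values.
segments = ["KI", "WA", "HA", "RE"]
position_165 = worst_segment(segments, [0.1, 0.8, 0.2, 0.1])
position_167 = worst_boundary(segments, [0.2, 0.9, 0.1])
```

Selection penalties attach to a single segment, whereas connection penalties attach to a boundary between two segments, which is why the two helper functions return different kinds of positions.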

音声合成装置１は、音質修正ＵＩ部２１を介して、修正候補の劣化種別を利用者に提示する。図８は、「音」と「抑揚」と「滑舌」および「誤アクセント」の４種が修正候補の劣化種別５８として提示された例である。図８では、さらに、劣化種別詳細６０として「慌しい」と「たどたどしい」が選択肢として表示されている。以上で、劣化種別表示の処理を終了する。 The speech synthesizer 1 presents the degradation types of the correction candidates to the user via the sound quality correction UI unit 21. FIG. 8 is an example in which the four types “sound”, “intonation”, “smooth tongue”, and “false accent” are presented as the degradation types 58 of the correction candidates. In FIG. 8, “hurried” and “halting” are further displayed as options in the degradation type details 60. This completes the degradation type display process.

次に、図１１を参照しながら、劣化種別選択の動作について説明する。図１１は、劣化種別選択の動作を示すフローチャートである。利用者は、音質修正ＵＩ部２１を介して、提示された劣化種別５８およびその種別詳細６０等の中から、自らの感覚に合った劣化種別を選択する。図１１に示すように、修正手段決定部４９は、音質修正ＵＩ部２１を介して利用者が選択した劣化種別を取得し、劣化位置取得部４７から選択された劣化種別に対応した劣化コストと劣化位置を取得する（Ｓ１５１）。また、修正手段決定部４９は、範囲取得部４１から再合成範囲を受け取り、再合成する範囲と、劣化種別の各劣化コストに基づいて音質劣化を改善するため、劣化コストに対応した素片選択ペナルティ、素片接続ペナルティを改善するための修正情報を生成し、素片選択接続部２７へ出力する（Ｓ１５２）。 Next, the degradation type selection operation will be described with reference to FIG. 11. FIG. 11 is a flowchart showing the operation of selecting the degradation type. Via the sound quality correction UI unit 21, the user selects the degradation type that matches his or her own perception from the presented degradation types 58, their type details 60, and so on. As shown in FIG. 11, the correction means determination unit 49 acquires the degradation type selected by the user via the sound quality correction UI unit 21, and acquires from the deterioration position acquisition unit 47 the degradation cost and deterioration position corresponding to the selected degradation type (S151). The correction means determination unit 49 also receives the re-synthesis range from the range acquisition unit 41. To improve the sound quality degradation based on the re-synthesis range and each degradation cost of the degradation type, it generates correction information for improving the segment selection penalty and segment connection penalty corresponding to the degradation cost, and outputs it to the segment selection connection unit 27 (S152).

修正情報とは、選択された劣化種別または劣化種別詳細に応じた劣化コストを表す劣化関数に含まれる各素片選択ペナルティの、式Ｃ１における係数αｌ（ｌ＝１、２、・・・）、または各素片接続ペナルティの式Ｃ２における係数βｌ（ｌ＝１、２、・・・）である。 The correction information is a coefficient αl (l = 1, 2,...) In the expression C1 of each element selection penalty included in the deterioration function indicating the deterioration cost corresponding to the selected deterioration type or deterioration type details, Alternatively, the coefficient βl (l = 1, 2,...) In the equation C2 of each unit connection penalty.

図８に戻って、利用者が、音質劣化範囲５６において、劣化種別５８として「音質が不自然」であること、「滑舌が不自然」であることを選択し、劣化種別詳細６０で「慌しい」と選択したとする。修正手段決定部４９は、選択された劣化種別「滑舌（慌しい）」に対応する劣化コストが劣化コストＣ２０であり、「音」に対応する劣化コストが、劣化コストＣ０５であると判別したとする。このとき、修正手段決定部４９は、劣化コストＣ２０に音素長ペナルティＰＰｂａが対応していると判別する。また、劣化コストＣ０５に音色変化ペナルティＰＣｃａが対応していると判別する。 Returning to FIG. 8, suppose that the user selects both “sound quality is unnatural” and “smooth tongue is unnatural” as the degradation types 58 for the sound quality degradation range 56, and selects “hurried” in the degradation type details 60. Suppose also that the correction means determination unit 49 determines that the degradation cost corresponding to the selected degradation type “smooth tongue (hurried)” is the degradation cost C20 and that the degradation cost corresponding to “sound” is the degradation cost C05. In this case, the correction means determination unit 49 determines that the phoneme length penalty PPba corresponds to the degradation cost C20 and that the timbre change penalty PCca corresponds to the degradation cost C05.

そして、修正手段決定部４９は、図１０の例で、音素列１６２において、音素長ペナルティＰＰｂａに対応する係数α３の重みを加重し、音素列１６３において、音色変化ペナルティＰＣｃａに対応する係数β５の重みを加重した修正情報を生成する。 Then, in the example of FIG. 10, the correction means determination unit 49 weights the weight of the coefficient α3 corresponding to the phoneme length penalty PPba in the phoneme string 162, and the coefficient β5 corresponding to the timbre change penalty PCca in the phoneme string 163. Correction information weighted with weights is generated.
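
The pairing described above can be pictured as a small lookup followed by a weight update. The following Python sketch is illustrative only: the table `COST_TO_COEFFICIENT`, the function names, and the multiplier `boost` are assumptions introduced here, and the text does not state by how much a coefficient is increased.

```python
# Hypothetical lookup: degradation cost ID -> coefficient of the penalty it relies on.
# Only the two pairs named in the text are listed.
COST_TO_COEFFICIENT = {
    "C20": ("alpha", 3),  # phoneme length penalty PPba
    "C05": ("beta", 5),   # timbre change penalty PCca
}


def make_correction_info(findings, boost=2.0):
    """Build correction information from (phoneme_string_id, cost_id) pairs.

    Returns {(phoneme_string_id, kind, index): multiplier}. The multiplier
    value is an assumption made for this sketch.
    """
    info = {}
    for string_id, cost_id in findings:
        kind, index = COST_TO_COEFFICIENT[cost_id]
        info[(string_id, kind, index)] = boost
    return info


def apply_correction(coefficients, info):
    """Return a copy of the coefficient table with the multipliers applied."""
    updated = dict(coefficients)
    for key, multiplier in info.items():
        updated[key] = updated[key] * multiplier
    return updated


# Phoneme string 162 gets a heavier alpha3, phoneme string 163 a heavier beta5.
coefficients = {(162, "alpha", 3): 1.0, (162, "alpha", 1): 1.0, (163, "beta", 5): 1.0}
info = make_correction_info([(162, "C20"), (163, "C05")])
updated = apply_correction(coefficients, info)
```

Keeping the correction information separate from the coefficient table mirrors the division of roles in the text, where the correction means determination unit 49 produces the information and the segment selection/connection unit 27 consumes it.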

素片選択接続部２７は、修正情報を受け取ると、再合成範囲の素片選択および素片接続を再度行い、選択接続情報を再度生成する。ここで、再合成範囲は、図１０に示すように、少なくとも音質劣化範囲５６を含む再合成範囲１６９として設定されるようにしてもよい。すなわち、範囲取得部４１は、利用者が指定した音質劣化範囲５６の全てを含んだ音素単位、音素列単位、モーラ単位、モーラ列単位のいずれかの区切りを、再合成範囲として決定するように構成する。 On receiving the correction information, the segment selection/connection unit 27 performs segment selection and segment connection again for the re-synthesis range and regenerates the selection connection information. As shown in FIG. 10, the re-synthesis range may be set as a re-synthesis range 169 that includes at least the sound quality degradation range 56. That is, the range acquisition unit 41 is configured to determine, as the re-synthesis range, a span delimited in phoneme units, phoneme string units, mora units, or mora string units that contains the entire sound quality degradation range 56 designated by the user.

図１０の例では、読みテキスト５４「キョーノ／テンキワ／ハレデス．」に対し、利用者が「キワ／ハレ」の部分を音質劣化範囲５６として選択している。このとき、利用者の指定範囲は音素列１６２と音素列１６３を足し合わせた「Ｋ|Ｉ|Ｗ|Ａ|Ｈ|Ａ|Ｒ|Ｅ」を再合成範囲１６８として、音声合成部７は再合成を行っても良い。また、利用者の指定範囲を含む音素列単位「Ｋ|Ｉ|Ｗ|Ａ」と音素列単位「Ｈ|Ａ|Ｒ|Ｅ|Ｄ|Ｅ|Ｓ|Ｕ」を再合成範囲１６９としてもよい。再合成範囲１６９は、２つの音素列単位がそれぞれ、合成音声の２つの素片単位と一致する例であり、再合成範囲を素片単位で区切ることで、利用者が指定した範囲外に音質劣化があった場合でも、音質改善効果を得ることが可能となる。 In the example of FIG. 10, for the reading text 54 "Kyōno / Tenkiwa / Haredesu.", the user has selected the portion "kiwa / hare" as the sound quality degradation range 56. In this case, the speech synthesis unit 7 may perform re-synthesis with the user's designated range "K|I|W|A|H|A|R|E", which is the phoneme string 162 and the phoneme string 163 combined, as the re-synthesis range 168. Alternatively, the phoneme string unit "K|I|W|A" and the phoneme string unit "H|A|R|E|D|E|S|U", which contain the user's designated range, may be used as the re-synthesis range 169. The re-synthesis range 169 is an example in which the two phoneme string units each coincide with a segment unit of the synthesized speech. Delimiting the re-synthesis range by segment units makes it possible to obtain a sound quality improvement even when there is degradation outside the range designated by the user.
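
The difference between the re-synthesis ranges 168 and 169 can be reproduced with a few lines of Python. This is a minimal sketch, not the patented procedure: the function name `expand_to_units` is hypothetical, and the segmentation of the first two units ("Kyōno", "Ten") is assumed here for illustration, since the text only gives the units "K|I|W|A" and "H|A|R|E|D|E|S|U".

```python
def expand_to_units(units, start, end):
    """Return (first, last) indices of the units overlapping positions [start, end).

    `units` is a list of phoneme lists; positions count phonemes from the
    start of the utterance.
    """
    if start >= end:
        raise ValueError("empty selection")
    covered = []
    pos = 0
    for i, unit in enumerate(units):
        unit_start, unit_end = pos, pos + len(unit)
        # Two half-open intervals overlap when each starts before the other ends.
        if unit_start < end and start < unit_end:
            covered.append(i)
        pos = unit_end
    if not covered:
        raise ValueError("selection lies outside the utterance")
    return covered[0], covered[-1]


# Assumed segmentation of the reading text into segment units.
units = [
    list("KYOONO"),
    list("TEN"),
    list("KIWA"),
    list("HAREDESU"),
]
flat = [p for unit in units for p in unit]

# The user marks "kiwa / hare": phoneme positions 9 to 16 inclusive.
start, end = 9, 17
range_168 = flat[start:end]                      # exactly what the user marked
first, last = expand_to_units(units, start, end)
range_169 = [p for unit in units[first:last + 1] for p in unit]  # whole units
```

Here `range_168` is the user's selection as given, while `range_169` extends to the end of the last unit touched, so the trailing "D|E|S|U" is re-synthesized as well.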

図１１に戻って、波形処理部２９は、再生成された選択接続情報をもとに音声を再合成する。すなわち、再合成範囲１６８のみ、または再合成範囲１６９のみ、もしくは再合成範囲１６８または再合成範囲１６９を含む入力テキスト５２に対応する音声全てを再合成して合成音声の音声波形を再生成する（Ｓ１５３）。このように、音声合成装置１は、利用者が選択した劣化種別に対応する各素片選択ペナルティまたは素片接続ペナルティに対応する重みを加重する修正情報を生成する。そして、音質劣化の原因となっている対象の素片選択ペナルティや素片接続ペナルティに加重した素片選択ペナルティＰＰおよび素片接続ペナルティＰＣが最小となるように、上記の再合成範囲について音声を再合成することにより、再合成時の音声を修正する。 Returning to FIG. 11, the waveform processing unit 29 re-synthesizes the speech based on the regenerated selection connection information. That is, it regenerates the speech waveform by re-synthesizing only the re-synthesis range 168, only the re-synthesis range 169, or all of the speech corresponding to the input text 52 that contains the re-synthesis range 168 or 169 (S153). In this way, the speech synthesizer 1 generates correction information that increases the weights of the segment selection penalties or segment connection penalties corresponding to the degradation type selected by the user. The speech is then re-synthesized over the re-synthesis range so that the segment selection penalty PP and the segment connection penalty PC, with the added weight on the penalties that are causing the degradation, are minimized, thereby correcting the speech.

以上詳細に説明したように、音声合成装置１は、利用者が指定した音質劣化範囲５６における劣化コストを算出することにより、音質劣化位置とその音質劣化種別を判定し、音質劣化種別を利用者へ提示する。利用者は提示された劣化種別の中から、自らの感覚に合った劣化種別を選択する。音声合成装置１は、選択された劣化種別を取得し、取得した劣化種別に対応した素片選択ペナルティまたは素片接続ペナルティに対応する重み係数を更新した修正情報を生成し、修正情報に基づき音声の再合成を行うことにより、音声を修正する。 As described in detail above, the speech synthesizer 1 calculates the degradation costs in the sound quality degradation range 56 designated by the user, thereby determines the degradation position and its degradation type, and presents the degradation type to the user. The user selects, from the presented degradation types, the one that matches his or her own perception. The speech synthesizer 1 acquires the selected degradation type, generates correction information in which the weight coefficients of the segment selection penalties or segment connection penalties corresponding to that degradation type are updated, and corrects the speech by re-synthesizing it based on the correction information.

以上の構成により、音声合成装置１においては、利用者が指定した音質劣化範囲から音質劣化位置とその音質劣化種別を判定し、音質劣化種別を利用者へ提示する。よって、利用者は提示された劣化種別の中から、自らの感覚に合った劣化種別を選択することができ、劣化種別に応じた音質修正が可能となる。 With the above configuration, the speech synthesizer 1 determines the sound quality deterioration position and the sound quality deterioration type from the sound quality deterioration range designated by the user, and presents the sound quality deterioration type to the user. Therefore, the user can select a degradation type that suits his / her sense from the presented degradation types, and sound quality correction according to the degradation type is possible.

上述の音声合成技術では、波形辞書から素片を選択する際、および選択した素片を接続する際に、目標韻律に合わない素片選択にはペナルティを大きく、音質劣化しやすい素片接続にもペナルティを大きく与える。そして、素片選択ペナルティ評価値ＰＰ、素片接続ペナルティ評価値ＰＣが最小となる素片選択、素片接続を行うことで、合成音声の音質向上を図っている。したがって、上述のように利用者が選択した劣化種別に対応した素片選択ペナルティ、または素片接続ペナルティに対するペナルティの重みに荷重して再合成することで、より目標韻律に近い音声となる素片選択と素片接続が行われる。これにより、利用者が指定した劣化種別に対応した音質劣化を改善することができる。 In the speech synthesis technology described above, when segments are selected from the waveform dictionary and when the selected segments are connected, a large penalty is given to a segment selection that does not match the target prosody, and a large penalty is likewise given to a segment connection that is prone to sound quality degradation. The sound quality of the synthesized speech is improved by performing the segment selection and segment connection that minimize the segment selection penalty evaluation value PP and the segment connection penalty evaluation value PC. Therefore, re-synthesizing with added weight on the segment selection penalty or segment connection penalty corresponding to the degradation type selected by the user, as described above, leads to a segment selection and segment connection that yield speech closer to the target prosody. This makes it possible to improve the sound quality degradation corresponding to the degradation type designated by the user.
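
The effect of adding weight to one penalty can be seen in a toy selection between two candidate segments. The sketch below assumes that the evaluation value is a weighted sum of individual penalties; the exact form of expressions C1 and C2 is not reproduced in this passage, so that form, the penalty names, and all numeric values are assumptions. Only segment selection is modelled; segment connection penalties are left out for brevity.

```python
def weighted_penalty(penalties, weights):
    """Weighted sum of the individual penalties (assumed form of the evaluation value)."""
    return sum(weights[name] * value for name, value in penalties.items())


def select_segment(candidates, weights):
    """Pick the candidate whose weighted penalty is smallest."""
    return min(candidates, key=lambda c: weighted_penalty(c["penalties"], weights))


# Two hypothetical candidates for the same target phoneme string.
candidates = [
    {"id": "a", "penalties": {"pitch": 0.1, "phoneme_length": 0.4}},
    {"id": "b", "penalties": {"pitch": 0.4, "phoneme_length": 0.2}},
]

# With equal weights, candidate "a" wins (0.5 against 0.6).
before = select_segment(candidates, {"pitch": 1.0, "phoneme_length": 1.0})

# After the phoneme length weight is raised, candidate "b" wins (1.0 against 1.3).
after = select_segment(candidates, {"pitch": 1.0, "phoneme_length": 3.0})
```

Raising a single weight changes which candidate is cheapest without touching the candidates themselves, which is why re-synthesis with updated coefficients can yield a different and better matching segment.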

また、音声合成時の中間情報の一つである素片選択ペナルティ、素片接続ペナルティは、音質劣化に関与するペナルティであり、各ペナルティを用いて劣化コストを定義する。これにより、劣化種別を判定するのに適した劣化コストとなり、劣化種別の判定精度を向上することができる。これらの劣化種別は、各劣化コストと１対１で対応するものではなく、複数の劣化コストの組み合わせに応じて劣化種別が正しく判定される。よって、劣化種別の判定精度を向上させることが可能となる。 A segment selection penalty and a segment connection penalty, which are one of the intermediate information at the time of speech synthesis, are penalties related to sound quality degradation, and the degradation cost is defined using each penalty. Thereby, the deterioration cost is suitable for determining the deterioration type, and the determination accuracy of the deterioration type can be improved. These deterioration types do not correspond to each deterioration cost on a one-to-one basis, and the deterioration type is correctly determined according to a combination of a plurality of deterioration costs. Therefore, it is possible to improve the determination accuracy of the deterioration type.

修正候補劣化種別は、音質劣化原因となっている可能性の高い劣化コスト上位から劣化種別を選定することで、劣化種別の判定精度を向上させることができる。すなわち、劣化コストが高いものは音質劣化原因となっている可能性が高く、このような劣化コスト上位を含む劣化種別も音質劣化原因である可能性が高いため、劣化種別の判定精度を向上させることが可能となる。 As the correction candidate deterioration type, it is possible to improve the determination accuracy of the deterioration type by selecting the deterioration type from the higher deterioration cost that is likely to cause the sound quality deterioration. In other words, those with a high deterioration cost are likely to be the cause of sound quality deterioration, and deterioration types including such a high deterioration cost are also likely to be a cause of sound quality deterioration, so the determination accuracy of the deterioration type is improved. It becomes possible.

なお、音質劣化範囲を指定する場合には、利用者は音質劣化を感じる範囲を含むおよその範囲で指定すればよく、音質劣化が生じている箇所に限定して範囲指定する必要はない。また、範囲取得部４１は、利用者が指定した音質劣化範囲５６を完全に含み、音素単位、またはモーラ単位を区切りとした範囲を音質劣化範囲として取得すればよい。このとき、音素単位、モーラ単位の代替として、合成音声の素片単位に一致した音素列単位、モーラ列単位としてもよい。音質劣化範囲を取得する際には、範囲取得部４１は、利用者が指定した音質劣化範囲５６と、音声合成部７で得られた読みテキスト５４、または音声合成部７の中間情報を照合することが好ましい。 When designating the sound quality degradation range, the user only needs to designate an approximate range that contains the portion where degradation is perceived; the range need not be limited to the exact location where the degradation occurs. The range acquisition unit 41 may acquire, as the sound quality degradation range, a range delimited in phoneme units or mora units that completely contains the sound quality degradation range 56 designated by the user. Instead of phoneme units or mora units, phoneme string units or mora string units that coincide with the segment units of the synthesized speech may be used. When acquiring the sound quality degradation range, the range acquisition unit 41 preferably collates the sound quality degradation range 56 designated by the user against the reading text 54 obtained by the speech synthesis unit 7 or against the intermediate information of the speech synthesis unit 7.

また、音声合成は、１音素または１モーラ単位で合成するとは限らず、連続した複数音素（音素列）または複数モーラ（モーラ列）が混在して合成される。したがって、上記で取得した音質劣化範囲が、必ずしも合成音声の素片単位の境界と一致するとは限らない。よって、音質劣化範囲を完全に含んだ素片単位の区切りを再合成範囲と設定することで、音質劣化の生じていない合成音声部分の音質を保持しつつ、指定範囲外に音質劣化があった場合でも、必要最小限範囲の再合成によって音質改善を図ることができる。 In addition, the speech synthesis is not necessarily performed in units of one phoneme or one mora, but is synthesized by mixing a plurality of continuous phonemes (phoneme strings) or a plurality of mora (mora strings). Therefore, the sound quality degradation range acquired above does not necessarily coincide with the boundary of the unit unit of the synthesized speech. Therefore, by setting the segment unit segment that completely includes the sound quality degradation range as the resynthesis range, while maintaining the sound quality of the synthesized speech part where sound quality degradation has not occurred, there was sound quality degradation outside the specified range Even in this case, the sound quality can be improved by re-synthesis in the minimum necessary range.

劣化種別が取得する際の方法は、上記に限定されない。例えば、劣化コストの高い順に劣化種別を利用者へ提示してもよく、更には、音質劣化の可能性の低い劣化種別を提示しない、または、提示しても再合成時に反映されない非アクティブで提示するように構成してもよい。 The method for acquiring the degradation type is not limited to the above. For example, the degradation type may be presented to the user in descending order of degradation cost, and furthermore, the degradation type with a low possibility of sound quality degradation is not presented, or it is presented inactive that is not reflected at the time of resynthesis even if presented. You may comprise.
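
The presentation options mentioned above (descending order of degradation cost, with unlikely types either hidden or shown as inactive) could be arranged as follows. This is a hedged sketch: the function `build_presentation`, the cut-off `min_cost`, and the use of a low cost as the sign of a "low possibility of degradation" are assumptions, as are the English labels and sample values.

```python
def build_presentation(costs, type_of, min_cost, hide_low=True):
    """Return [(degradation_type, active)] in descending order of cost.

    Each type is ranked by the largest cost mapped to it. Types whose cost is
    below `min_cost` are dropped when `hide_low` is True and otherwise kept
    but flagged inactive, meaning a selection would not affect re-synthesis.
    """
    best = {}
    for cost_id, value in costs.items():
        name = type_of[cost_id]
        best[name] = max(best.get(name, float("-inf")), value)
    ordered = sorted(best.items(), key=lambda item: item[1], reverse=True)
    result = []
    for name, value in ordered:
        active = value >= min_cost
        if active or not hide_low:
            result.append((name, active))
    return result


costs = {"C05": 0.9, "C12": 0.1, "C20": 0.7}
type_of = {"C05": "sound", "C12": "intonation", "C20": "articulation"}

hidden = build_presentation(costs, type_of, min_cost=0.3, hide_low=True)
greyed = build_presentation(costs, type_of, min_cost=0.3, hide_low=False)
```

With `hide_low=True` the "intonation" entry disappears from the list; with `hide_low=False` it stays at the bottom with `active` set to `False`.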

（第１の実施の形態の変形例） (Modification of the first embodiment)
以下、図１２を参照しながら、第１の実施の形態の変形例による音声合成装置１について説明する。本変形例は、第１の実施の形態による音声合成装置１における劣化種別判定に関する変形例である。よって本変形例において、第１の実施の形態による音声合成装置１の構成および動作と同一の部分は詳細な説明を省略し、劣化種別判定動作に関してのみ説明する。 Hereinafter, the speech synthesizer 1 according to a modification of the first embodiment will be described with reference to FIG. 12. This modification concerns the degradation type determination in the speech synthesizer 1 according to the first embodiment. Detailed description of the parts that are the same as the configuration and operation of the speech synthesizer 1 according to the first embodiment is therefore omitted, and only the degradation type determination operation is described.

図１２は、本変形例における劣化種別判定の動作を示すフローチャートである。本変形例では、修正候補の劣化種別を予め定められた閾値を超えたか否かにより判別する。図１２に示すように、劣化種別判定部４５は、劣化コスト算出部４３が算出した劣化コストＣｎを取得する（Ｓ１８１）。続いて、劣化種別判定部４５は、まずｎ＝０と設定する（Ｓ１８２）。劣化種別判定部４５は、ｎ＜２６であるか否か判別し、ｎ＜２６であれば、Ｃ００〜Ｃ２５のそれぞれについて、予め設定されたそれぞれの閾値Ｓ００〜Ｓ２５よりも値が高値であるか否か判別する（Ｓ１８３）。そして、Ｓ１８３において、閾値よりも高値であると判別された劣化コストＣｎについて、Ｓ１８４以下の判別を行う。 FIG. 12 is a flowchart showing the operation of determining the deterioration type in the present modification. In this modification, determination is made based on whether or not the deterioration type of the correction candidate exceeds a predetermined threshold. As illustrated in FIG. 12, the deterioration type determination unit 45 acquires the deterioration cost Cn calculated by the deterioration cost calculation unit 43 (S181). Subsequently, the degradation type determination unit 45 first sets n = 0 (S182). The degradation type determination unit 45 determines whether or not n <26. If n <26, whether each of C00 to C25 is higher than the preset thresholds S00 to S25. It is determined whether or not (S183). Then, in S183, the deterioration cost Cn determined to be higher than the threshold value is determined in S184 and the subsequent steps.

Ｓ１８４では、劣化種別判定部４５は、劣化コストＣｎ＜閾値Ｓｎであるか否か判別し、Ｃｎ＜閾値Ｓｎでない場合には（Ｓ１８４：ＮＯ）、ｎ＝ｎ＋１とし（Ｓ１８５）、Ｓ１８３に戻る。Ｃｎ＜閾値Ｓｎである場合には（Ｓ１８４：ＹＥＳ）、処理はＳ１８６に進む。 In S184, the deterioration type determination unit 45 determines whether or not the deterioration cost Cn <threshold Sn, and if Cn <threshold Sn is not satisfied (S184: NO), n = n + 1 (S185), and the process returns to S183. If Cn <threshold Sn (S184: YES), the process proceeds to S186.

劣化種別判定部４５は、０≦ｎ≦０５であるか否か判別する（Ｓ１８６）。０≦ｎ≦０５である場合には（Ｓ１８６：ＹＥＳ）、劣化種別判定部４５は、劣化種別は「音」とし、種別詳細は表示せず（Ｓ１８７）、ｎ＝ｎ＋１（Ｓ１８５）に更新してＳ１８３に戻る。 The degradation type determination unit 45 determines whether 0 ≦ n ≦ 05 (S186). If so (S186: YES), it sets the degradation type to "sound", displays no type detail (S187), updates n = n + 1 (S185), and returns to S183.

０≦ｎ≦０５でない場合には（Ｓ１８６：ＮＯ）、Ｓ１８８に進み、劣化種別判定部４５は、１０≦ｎ≦１２であるか否か判別する。１０≦ｎ≦１２である場合には（Ｓ１８８：ＹＥＳ）、劣化種別判定部４５は、劣化種別は「抑揚」とし、種別詳細は「大」とし（Ｓ１８９）、ｎ＝ｎ＋１（Ｓ１８５）に更新してＳ１８３に戻る。 When 0 ≦ n ≦ 05 is not satisfied (S186: NO), the process proceeds to S188, and the degradation type determination unit 45 determines whether 10 ≦ n ≦ 12. When 10 ≦ n ≦ 12 is satisfied (S188: YES), the deterioration type determination unit 45 sets the deterioration type as “intonation”, sets the type details as “large” (S189), and updates n = n + 1 (S185). Then, the process returns to S183.

１０≦ｎ≦１２でない場合には（Ｓ１８８：ＮＯ）、Ｓ１９０に進み、劣化種別判定部４５は、ｎ＝１３であるか否か判別する。ｎ＝１３である場合には（Ｓ１９０：ＹＥＳ）、劣化種別判定部４５は、劣化種別を「抑揚」とし、種別詳細は「小」とし（Ｓ１９１）、ｎ＝ｎ＋１（Ｓ１８５）に更新してＳ１８３に戻る。 If 10 ≦ n ≦ 12 does not hold (S188: NO), the process proceeds to S190, where the degradation type determination unit 45 determines whether n = 13. If so (S190: YES), it sets the degradation type to "intonation" and the type detail to "small" (S191), updates n = n + 1 (S185), and returns to S183.

ｎ＝１３でない場合には（Ｓ１９０：ＮＯ）、Ｓ１９２に進み、劣化種別判定部４５は、２０≦ｎ≦２２であるか否か判別する。２０≦ｎ≦２２である場合には（Ｓ１９２：ＹＥＳ）、劣化種別判定部４５は、劣化種別を「滑舌」とし、種別詳細は「慌しい」とし（Ｓ１９３）、ｎ＝ｎ＋１（Ｓ１８５）に更新してＳ１８３に戻る。 If n is not 13 (S190: NO), the process proceeds to S192, where the degradation type determination unit 45 determines whether 20 ≦ n ≦ 22. If so (S192: YES), it sets the degradation type to "articulation" and the type detail to "hurried" (S193), updates n = n + 1 (S185), and returns to S183.

２０≦ｎ≦２２でない場合には（Ｓ１９２：ＮＯ）、Ｓ１９４に進み、劣化種別判定部４５は、劣化種別を「滑舌」とし、種別詳細は「たどたどしい」とし、ｎ＝ｎ＋１（Ｓ１８５）に更新してＳ１８３に戻る。Ｓ１８３においてｎ＜２６でない場合には、劣化種別判定部４５は、処理を図７のＳ１０６に戻す。 If 20 ≦ n ≦ 22 does not hold (S192: NO), the process proceeds to S194, where the degradation type determination unit 45 sets the degradation type to "articulation" and the type detail to "halting", updates n = n + 1 (S185), and returns to S183. If n < 26 does not hold in S183, the degradation type determination unit 45 returns the process to S106 of FIG. 7.

例えば、劣化コスト算出部４３が算出した劣化コストＣ２０とＣ０５が所定の閾値を超えた例では、劣化種別５８として「音質が不自然」であること、「滑舌が不自然」であることを選択可能に表示し、劣化種別詳細６０で「慌しい」を選択可能に表示することができる。 For example, when the degradation costs C20 and C05 calculated by the degradation cost calculation unit 43 exceed their predetermined thresholds, "sound quality is unnatural" and "articulation is unnatural" can be displayed as selectable options for the degradation type 58, and "hurried" can be displayed as a selectable option in the degradation type details 60.

以上説明したように、本変形例によれば、劣化種別判定部４５は、劣化コスト算出部４３で算出した劣化コストＣｎが、予め劣化コスト毎に設定された所定閾値Ｓｎ（ｎ：００〜２５）を超えた場合、修正候補劣化種別として劣化コストＣｎの劣化種別を選定する。 As described above, according to this modification, the degradation type determination unit 45 uses the predetermined threshold Sn (n: 00 to 25) in which the degradation cost Cn calculated by the degradation cost calculation unit 43 is set in advance for each degradation cost. ), The deterioration type of the deterioration cost Cn is selected as the correction candidate deterioration type.

よって、本変形例によれば、所定の閾値を設定することで、音質劣化が生じていないにも関わらず、劣化種別を判定して利用者へ誤提示してしまうリスクを回避することができる。 Therefore, according to the present modification, by setting a predetermined threshold value, it is possible to avoid the risk of erroneously presenting to the user by determining the deterioration type even though the sound quality deterioration has not occurred. .
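
The threshold-based determination of FIG. 12 can be condensed into a short Python sketch. It follows the summary statement that a degradation cost exceeding its threshold yields a correction candidate, and it reuses the index ranges given for S186 to S194. The function names and the English labels ("sound", "intonation", "articulation", "hurried", "halting") are renderings chosen here, and the sample costs and thresholds are made up.

```python
def classify(n):
    """Map a degradation cost index n to (degradation type, type detail)."""
    if 0 <= n <= 5:
        return ("sound", None)              # S187: no type detail is shown
    if 10 <= n <= 12:
        return ("intonation", "large")      # S189
    if n == 13:
        return ("intonation", "small")      # S191
    if 20 <= n <= 22:
        return ("articulation", "hurried")  # S193
    return ("articulation", "halting")      # S194: every remaining index


def candidate_types(costs, thresholds):
    """Collect the types whose cost C_n exceeds its own threshold S_n."""
    found = []
    for n in range(26):                     # n = 00 .. 25
        if costs[n] > thresholds[n]:
            label = classify(n)
            if label not in found:          # present each type only once
                found.append(label)
    return found


# Only C05 and C20 exceed their thresholds in this made-up example.
costs = [0.0] * 26
costs[5] = 1.0
costs[20] = 1.0
thresholds = [0.5] * 26
result = candidate_types(costs, thresholds)
```

With these inputs the candidates are "sound" and "articulation (hurried)", which matches the C20 and C05 example given for this modification.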

さらに、本変形例において各劣化コストが所定の閾値を超えない場合、第１の実施の形態による図９に記載の動作に切り替えて、各劣化コストから所定個数の劣化コスト上位を選定し、各劣化コスト上位の劣化種別を選定するように構成するようにしてもよい。このような構成により、全ての劣化コストがそれぞれの所定の閾値を超えない場合に「劣化種別無し」となってしまい、利用者の音質劣化を修正できなくなる問題を回避し、劣化コスト上位を含む劣化種別を利用者へ提示することで、音質劣化改善が可能となる。 Furthermore, when each deterioration cost does not exceed a predetermined threshold value in this modification, the operation is switched to the operation shown in FIG. 9 according to the first embodiment, and a predetermined number of deterioration cost upper ranks are selected from each deterioration cost. You may make it comprise so that the degradation type of degradation cost high rank may be selected. Such a configuration avoids the problem that “no degradation type” occurs when all degradation costs do not exceed the respective predetermined thresholds, and the sound quality degradation of the user cannot be corrected, and includes higher degradation costs. By presenting the degradation type to the user, the sound quality degradation can be improved.
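
The fallback described above, in which the top-ranked costs are used when no cost exceeds its threshold, can be sketched as below. The function name `select_costs` and the default of two top-ranked costs are assumptions; the text only speaks of "a predetermined number".

```python
def select_costs(costs, thresholds, top_n=2):
    """Return the indices of the degradation costs to use for type determination.

    Costs above their threshold are preferred. When none qualifies, the
    `top_n` largest costs are returned so that the result is never empty.
    """
    over = [n for n, cost in enumerate(costs) if cost > thresholds[n]]
    if over:
        return over
    ranked = sorted(range(len(costs)), key=lambda n: costs[n], reverse=True)
    return ranked[:top_n]


thresholds = [0.5, 0.5, 0.5]
with_hit = select_costs([0.1, 0.6, 0.3], thresholds)     # threshold path
without_hit = select_costs([0.1, 0.4, 0.3], thresholds)  # fallback path
```

The fallback path guarantees that at least one degradation type can be offered, which is the situation the text calls avoiding "no degradation type".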

また、劣化コスト算出部４３が算出した各劣化コストが所定の閾値を超えない場合、全ての劣化種別を利用者へ提示し、修正手段決定部４９が、音質修正ＵＩ部２１を介して、利用者が選択した劣化種別に対応した手動修正ＵＩを起動するように構成することもできる。例えば、誤アクセントなどの自動修正が困難な劣化種別の場合も、アクセント手動修正ＵＩを起動するように構成するようにしてもよい。 Further, when none of the degradation costs calculated by the degradation cost calculation unit 43 exceeds its predetermined threshold, all degradation types may be presented to the user, and the correction means determination unit 49 may be configured to launch, via the sound quality correction UI unit 21, the manual correction UI corresponding to the degradation type selected by the user. For example, for a degradation type that is difficult to correct automatically, such as a wrong accent, the accent manual correction UI may be launched.

通常、判定した音質の劣化種別が利用者の感じる劣化種別と一致しない場合があるため、音質の劣化種別は全て提示することが好ましいが、この場合には、利用者が選択した劣化種別の音質劣化を自動修正することは困難である。しかし、利用者選択の劣化種別に応じた手動修正ＵＩを起動することで、利用者自身が起動する手動修正ＵＩの選択を行わなくても最適な手動修正ＵＩが自動起動し、手動操作により音質修正を可能とすることができる。 Normally, it is preferable to present all the sound quality deterioration types because the determined sound quality deterioration type may not match the deterioration type felt by the user, but in this case, the sound quality of the deterioration type selected by the user It is difficult to automatically correct the deterioration. However, by starting the manual correction UI corresponding to the degradation type selected by the user, the optimum manual correction UI is automatically started without selecting the manual correction UI to be started by the user himself. Modifications can be made possible.

（第２の実施の形態） (Second Embodiment)
以下、図１３〜図１５を参照しながら、第２の実施の形態による音声合成装置について説明する。なお、第２の実施の形態において、第１の実施の形態による音声合成装置１と同一の構成および動作については、重複説明を省略する。 The speech synthesizer according to the second embodiment will be described below with reference to FIGS. 13 to 15. In the second embodiment, redundant description of the configuration and operation that are the same as those of the speech synthesizer 1 according to the first embodiment is omitted.

図１３は、第２の実施の形態による音声合成装置の機能を示すブロック図、図１４は、第２の実施の形態による劣化範囲指定の動作を示すフローチャート、図１５は、第２の実施の形態による劣化種別指定の動作を示すフローチャートである。第２の実施の形態による音声合成装置は、第１の実施の形態による音声合成装置１における音声修正決定部９に代えて音声修正決定部２３０を有する構成である。以下、第２の実施の形態による音声合成装置を、音声合成装置２００ということにする。 FIG. 13 is a block diagram showing the functions of the speech synthesizer according to the second embodiment, FIG. 14 is a flowchart showing the degradation range designation operation according to the second embodiment, and FIG. 15 is a flowchart showing the degradation type designation operation according to the second embodiment. The speech synthesizer according to the second embodiment has a speech correction determination unit 230 in place of the speech correction determination unit 9 of the speech synthesizer 1 according to the first embodiment. Hereinafter, the speech synthesizer according to the second embodiment is referred to as the speech synthesizer 200.

図１３に示すように、音声合成装置２００の音声修正決定部２３０は、音声修正決定部９の構成に加えて、修正履歴データベース２３２を有している。修正履歴データベース２３２は、利用者が選択した劣化種別を格納する記憶装置である。 As shown in FIG. 13, the speech correction determination unit 230 of the speech synthesizer 200 has a correction history database 232 in addition to the configuration of the speech correction determination unit 9. The correction history database 232 is a storage device that stores the degradation type selected by the user.

図１４に示すように、まず、範囲取得部４１は、利用者により指定された音質劣化範囲５６を取得する（Ｓ２０１）。範囲取得部４１は、取得した音質劣化範囲５６に基づき、上述のように再合成範囲を、例えば図１０の再合成範囲１６９などと決定するとともに、再合成回数ｉを初期化する（Ｓ２０２）。劣化コスト算出部４３は、音質劣化範囲５６について上述の方法で劣化コストを算出する（Ｓ２０３）。 As shown in FIG. 14, first, the range acquisition unit 41 acquires the sound quality degradation range 56 designated by the user (S201). Based on the acquired sound quality degradation range 56, the range acquisition unit 41 determines the resynthesis range as, for example, the resynthesis range 169 in FIG. 10 as described above, and initializes the resynthesis number i (S202). The deterioration cost calculation unit 43 calculates the deterioration cost for the sound quality deterioration range 56 by the method described above (S203).

劣化種別判定部４５は、算出された劣化コストに基づき、例えば、第１の実施形態またはその変形例に記載した方法で劣化種別を判定する。また、劣化位置取得部４７は、判定された劣化種別に対応する劣化位置を取得する（Ｓ２０４）。さらに、劣化種別判定部４５は、音質修正ＵＩ部２１を介して劣化種別を提示する（Ｓ２０５）。 The deterioration type determination unit 45 determines the deterioration type based on the calculated deterioration cost, for example, by the method described in the first embodiment or its modification. Further, the deterioration position acquisition unit 47 acquires a deterioration position corresponding to the determined deterioration type (S204). Further, the degradation type determination unit 45 presents the degradation type via the sound quality correction UI unit 21 (S205).

図１５に示すように、修正手段決定部４９は、音質修正ＵＩ部２１を介して、劣化種別が「誤アクセント」であると取得されたか否かを判別する（Ｓ２１１）。「誤アクセント」であると取得された場合は（Ｓ２１１：ＹＥＳ）、手動修正ＵＩを起動する（Ｓ２１６）。 As shown in FIG. 15, the correction means determination unit 49 determines whether or not the deterioration type is acquired as “false accent” via the sound quality correction UI unit 21 (S211). When it is acquired that it is “false accent” (S211: YES), the manual correction UI is activated (S216).

「誤アクセント」であると取得されない場合には（Ｓ２１１：ＮＯ）、修正手段決定部４９は、選択された劣化種別に対応する素片選択ペナルティまたは素片接続ペナルティの加重を行う修正情報を生成する（Ｓ２１２）。音声合成部７は、生成された修正情報に基づき音声を再合成するが、その際、修正手段決定部４９は、再合成回数ｉ＝ｉ＋１とインクリメントし、対象の劣化種別と再合成回数ｉを修正履歴データベース２３２に格納する（Ｓ２１３）。 If it is not acquired as “false accent” (S211: NO), the correction means determination unit 49 generates correction information for weighting the segment selection penalty or segment connection penalty corresponding to the selected degradation type. (S212). The speech synthesizer 7 re-synthesizes the speech based on the generated correction information. At this time, the correction means determination unit 49 increments the re-synthesis number i = i + 1, and sets the target degradation type and the re-synthesis number i. It is stored in the correction history database 232 (S213).

劣化種別判定部４５は、全劣化コストＣｎがそれぞれ予め定められた閾値未満であるか否かを判別する（Ｓ２１４）。一つでも閾値以上の劣化コストＣｎがある場合には（Ｓ２１４：ＮＯ）、修正手段決定部４９は、劣化コストＣｎが閾値未満でない劣化種別について、再合成がＮ回以下であるか否か判別する（Ｓ２１５）。再合成がＮ回以下の場合には（Ｓ２１５：ＹＥＳ）、処理はＳ２０３に戻る。再合成がＮ回を超えた場合には（Ｓ２１５：ＮＯ）、手動修正ＵＩを起動する。手動修正ＵＩにより、再合成がＮ回を越えた劣化種別に対応付けて記憶された修正情報を生成し、生成された修正情報に基づき音声を修正する。このとき、修正情報は、対応する劣化種別に関連付けて例えば記憶部５に格納することが好ましい。 The degradation type determination unit 45 determines whether every degradation cost Cn is less than its predetermined threshold (S214). If even one degradation cost Cn is at or above its threshold (S214: NO), the correction means determination unit 49 determines, for the degradation type whose degradation cost Cn is not below the threshold, whether re-synthesis has been performed N times or fewer (S215). If so (S215: YES), the process returns to S203. If re-synthesis has exceeded N times (S215: NO), the manual correction UI is launched. Through the manual correction UI, correction information to be stored in association with the degradation type whose re-synthesis exceeded N times is generated, and the speech is corrected based on the generated correction information. The correction information is preferably stored, for example in the storage unit 5, in association with the corresponding degradation type.

Ｓ２１４において、全劣化コストＣｎがすべて閾値未満であれば（Ｓ２１４：ＹＥＳ）、修正手段決定部４９は、音質修正ＵＩ部２１を介して利用者に修正終了確認画面を提示し（Ｓ２１７）、音質修正終了か否かを判別する（Ｓ２１８）。音質修正終了が確認されない場合には（Ｓ２１８：ＮＯ）、手動修正ＵＩを起動し、音質修正終了が確認された場合には（Ｓ２１８：ＹＥＳ）、音声合成処理を終了する。 If all the degradation costs Cn are less than the threshold value in S214 (S214: YES), the correction means determination unit 49 presents a correction end confirmation screen to the user via the sound quality correction UI unit 21 (S217), and the sound quality is determined. It is determined whether or not the correction is completed (S218). When the end of the sound quality correction is not confirmed (S218: NO), the manual correction UI is activated, and when the end of the sound quality correction is confirmed (S218: YES), the speech synthesis process is ended.
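
The control flow of FIG. 15 (automatic correction repeated up to N times, then a hand-over to the manual correction UI) can be summarised in a compact loop. This is a simplified sketch under stated assumptions: the callbacks `choose_type` and `resynthesize` stand in for the user's selection and for the speech synthesis unit, a plain dictionary stands in for the correction history database 232, and the returned strings are labels invented for this example.

```python
def correction_loop(costs, thresholds, choose_type, resynthesize, max_retries):
    """Run automatic correction until the costs are acceptable or retries run out.

    Returns (outcome, history). `outcome` is "confirm_end" when every cost is
    below its threshold (S214: YES) and "manual_ui" when the manual correction
    UI should be launched. `history` counts re-syntheses per degradation type.
    """
    history = {}
    while True:
        selected = choose_type(costs)                      # user selection
        if selected == "wrong accent":                     # S211: YES
            return "manual_ui", history                    # S216
        history[selected] = history.get(selected, 0) + 1   # S213: i = i + 1
        costs = resynthesize(selected, costs)              # S212, then new costs
        if all(c < t for c, t in zip(costs, thresholds)):  # S214
            return "confirm_end", history                  # S217
        if history[selected] > max_retries:                # S215: NO
            return "manual_ui", history


def always_articulation(costs):
    """Stand-in for a user who keeps selecting the same degradation type."""
    return "articulation"


# Case 1: each re-synthesis halves the cost, so two passes are enough.
outcome_ok, history_ok = correction_loop(
    [0.8], [0.3], always_articulation,
    lambda selected, costs: [c / 2 for c in costs], max_retries=3)

# Case 2: re-synthesis never helps, so the loop gives up after N = 2 is exceeded.
outcome_stuck, history_stuck = correction_loop(
    [0.8], [0.3], always_articulation,
    lambda selected, costs: costs, max_retries=2)
```

The second case shows why the retry count is tracked per degradation type: once the same type has been tried more than N times without bringing the costs under their thresholds, further automatic attempts are abandoned.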

以上説明したように、第２の実施の形態による音声合成装置２００によれば、修正履歴データベース２３２に同一劣化種別に関する再合成回数ｉを格納する。修正手段決定部４９は、利用者が同一の劣化種別を所定回数選択した場合、利用者が望む音質の自動修正が困難と判断し、音質修正ＵＩ部２１を介して、利用者が選択した劣化種別に対応した手動修正ＵＩを起動するように構成する。 As described above, in the speech synthesizer 200 according to the second embodiment, the re-synthesis count i for the same degradation type is stored in the correction history database 232. When the user has selected the same degradation type a predetermined number of times, the correction means determination unit 49 judges that automatic correction to the sound quality the user wants is difficult, and launches, via the sound quality correction UI unit 21, the manual correction UI corresponding to the degradation type selected by the user.

このように、利用者が選択した劣化種別の音質劣化を自動修正しても、利用者希望の音質にならないケースが連続した場合、不必要に自動修正を繰り返すより最適な手動修正ＵＩによって利用者が手動操作することで音質修正が可能となる。 Thus, when automatic correction of the degradation type selected by the user repeatedly fails to produce the sound quality the user wants, the user can correct the sound quality by manual operation through the most suitable manual correction UI, rather than repeating automatic correction unnecessarily.

なお、第２の実施の形態において、修正劣化種別として選択された回数がいずれの劣化種別も所定のＮ回を超えない場合に、最も選択された回数の多い劣化種別に関連付けて記憶された修正情報により音声を修正するようにしてもよい。 In the second embodiment, when the number of times selected as the correction deterioration type does not exceed the predetermined N times, the correction stored in association with the most frequently selected deterioration type You may make it correct an audio | voice with information.

（第２の実施の形態による変形例） (Modification according to the second embodiment)
以下、図１３を参照しながら、第２の実施の形態の変形例による音声合成装置２００について説明する。本変形例は、第２の実施の形態による音声合成装置２００における再合成実行判断処理に関する変形例である。よって本変形例において、第２の実施の形態による音声合成装置２００の構成および動作と同一の部分は詳細な説明を省略し、再合成実行判断に関してのみ説明する。 Hereinafter, a speech synthesizer 200 according to a modification of the second embodiment will be described with reference to FIG. 13. This modification concerns the re-synthesis execution determination process in the speech synthesizer 200 according to the second embodiment. Detailed description of the parts that are the same as the configuration and operation of the speech synthesizer 200 according to the second embodiment is therefore omitted, and only the re-synthesis execution determination is described.

本変形例において、修正手段決定部４９は、利用者が音質修正を行った際の劣化種別の選択頻度を積算し、例えばその積算値が所定の閾値を超えた劣化種別を修正履歴データベース２３２に登録するように構成する。更に、修正手段決定部４９は、通常の音声合成処理を行う際に、以下のように修正情報を生成する。すなわち、修正手段決定部４９は、修正履歴データベース２３２の劣化種別に対応する劣化コストの劣化コスト関数を構成する素片選択ペナルティまたは素片接続ペナルティの各種ペナルティに対し、ペナルティの重みを強化する修正情報を生成する。音声合成部７の素片選択接続部２７は、この修正情報に基づいて合成音声を生成する。 In this modification, the correction means determination unit 49 integrates the selection frequency of the deterioration type when the user performs sound quality correction, and, for example, the deterioration type whose integrated value exceeds a predetermined threshold is stored in the correction history database 232. Configure to register. Furthermore, the correction means determination unit 49 generates correction information as follows when performing normal speech synthesis processing. That is, the correction means determination unit 49 corrects the penalty weight with respect to various unit selection penalties or various unit connection penalties constituting the deterioration cost function of the deterioration cost corresponding to the deterioration type of the correction history database 232. Generate information. The segment selection connection unit 27 of the speech synthesizer 7 generates a synthesized speech based on this correction information.

以上説明したように、本変形例によれば、修正履歴データベース２３２に格納された修正履歴を参照することにより、利用者が選択した音質の劣化種別の頻度が高いものが、利用者の音質修正の傾向と判断される。そして、音声合成時に、予め対象の劣化種別の音質劣化を抑えるために、劣化種別の劣化コストを算出する劣化コスト関数の構成要素である素片選択ペナルティ、素片接続ペナルティを改善する音声合成処理を行う。これにより、利用者の音質劣化修正の負担を軽減することが可能となる。 As described above, according to this modification, the correction history stored in the correction history database 232 is referenced, and the degradation types that the user has selected frequently are taken to represent the user's sound quality correction tendency. At synthesis time, in order to suppress degradation of the target degradation type in advance, speech synthesis is performed so as to improve the segment selection penalties and segment connection penalties that are components of the degradation cost function used to calculate the degradation cost of that degradation type. This reduces the user's burden of correcting sound quality degradation.
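
The history-driven behaviour of this modification can be sketched as a counter plus a weight adjustment applied before ordinary synthesis. The class `CorrectionHistory`, the function `initial_weights`, the registration threshold, the multiplier `boost`, and the mapping from degradation types to penalties are all assumptions made for illustration.

```python
from collections import Counter


class CorrectionHistory:
    """In-memory stand-in for the correction history database 232."""

    def __init__(self, register_threshold):
        self.register_threshold = register_threshold
        self.counts = Counter()
        self.registered = set()

    def record(self, degradation_type):
        """Accumulate one selection and register the type once it is frequent."""
        self.counts[degradation_type] += 1
        if self.counts[degradation_type] > self.register_threshold:
            self.registered.add(degradation_type)


def initial_weights(base_weights, history, penalties_of, boost=1.5):
    """Strengthen, before synthesis, the penalties behind the registered types."""
    weights = dict(base_weights)
    for degradation_type in history.registered:
        for name in penalties_of[degradation_type]:
            weights[name] *= boost
    return weights


# Hypothetical mapping from degradation type to the penalties in its cost function.
penalties_of = {"articulation": ["phoneme_length"], "sound": ["timbre_change"]}

history = CorrectionHistory(register_threshold=2)
for choice in ["articulation", "sound", "articulation", "articulation"]:
    history.record(choice)

weights = initial_weights(
    {"phoneme_length": 1.0, "timbre_change": 1.0}, history, penalties_of)
```

Only "articulation" was selected more than twice, so only the phoneme length weight is raised for the next ordinary synthesis run.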

一般に、音質劣化の感じ方は個人差が強いため、利用者個人の音質修正の傾向を把握し、合成音声生成時に予め利用者嗜好の音質修正を施した合成音声を生成することで、利用者の音質劣化修正の負担を軽減することが可能となる。さらに、利用者が選択した劣化種別の音質劣化を自動修正しても、利用者希望の音質にならないケースが連続した場合、劣化種別に応じた音質自動修正のみでなく、最適な手動修正ＵＩを起動させることで、利用者は手動操作により音質修正が可能となる。 In general, the perception of sound quality degradation varies greatly between individuals. Grasping an individual user's sound quality correction tendency and generating synthesized speech to which the user's preferred corrections have been applied in advance therefore reduces the user's burden of correcting sound quality degradation. Furthermore, when automatic correction of the degradation type selected by the user repeatedly fails to produce the sound quality the user wants, launching the most suitable manual correction UI, in addition to automatic correction according to the degradation type, allows the user to correct the sound quality by manual operation.

上記第１の実施の形態、第２の実施の形態およびそれぞれの変形例において、修正手段決定部４９は、修正情報生成部、取得回数係数部の一例であり、素片選択ペナルティおよび素片接続ペナルティは、劣化情報の一例である。特に、素片選択ペナルティは、第１の劣化情報の一例であり、素片接続ペナルティは、第２の劣化情報の一例である。 In the first embodiment, the second embodiment, and their respective modifications, the correction means determination unit 49 is an example of the correction information generation unit and of the acquisition number counting unit, and the segment selection penalty and the segment connection penalty are examples of deterioration information. In particular, the segment selection penalty is an example of the first deterioration information, and the segment connection penalty is an example of the second deterioration information.

さらに、ピッチペナルティＰＰａａ、ＰＰａｂは、第１の差分情報の一例であり、音素長ペナルティＰＰｂａ、ＰＰｂｂは、第２の差分情報の一例であり、音色ペナルティＰＰｃａは、第３の差分情報の一例である。また、ピッチ変化ペナルティＰＣａａ、ＰＣａｂは、第４の差分情報の一例であり、音素環境ペナルティＰＣｂａ、ＰＣｂｂは、第５の差分情報の一例であり、音色変化ペナルティＰＣｃａは、第６の差分情報の一例である。第１の範囲は、音質劣化範囲５６は、第１の範囲の一例であり、再合成範囲１６８、１６９は、第２の範囲の一例である。 Furthermore, the pitch penalties PPaa and PPab are examples of the first difference information, the phoneme length penalties PPba and PPbb are examples of the second difference information, and the timbre penalty PPca is an example of the third difference information. Likewise, the pitch change penalties PCaa and PCab are examples of the fourth difference information, the phoneme environment penalties PCba and PCbb are examples of the fifth difference information, and the timbre change penalty PCca is an example of the sixth difference information. The sound quality deterioration range 56 is an example of the first range, and the resynthesis ranges 168 and 169 are examples of the second range.

ここで、上記第１または第２の実施の形態およびそれぞれの変形例による音声合成装置の動作をコンピュータに行わせるために共通に適用されるコンピュータの例について説明する。図１６は、標準的なコンピュータのハードウエア構成の一例を示すブロック図である。図１６に示すように、コンピュータ３００は、ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ（ＣＰＵ）３０２、メモリ３０４、入力装置３０６、出力装置３０８、外部記憶装置３１２、媒体駆動装置３１４、ネットワーク接続装置等がバス３１０を介して接続されている。 Here, an example of a computer that can be used in common to perform the operations of the speech synthesizer according to the first or second embodiment and their respective modifications is described. FIG. 16 is a block diagram showing an example of the hardware configuration of a standard computer. As shown in FIG. 16, in the computer 300, a Central Processing Unit (CPU) 302, a memory 304, an input device 306, an output device 308, an external storage device 312, a medium driving device 314, a network connection device, and the like are connected via a bus 310.

ＣＰＵ３０２は、コンピュータ３００全体の動作を制御する演算処理装置である。メモリ３０４は、コンピュータ３００の動作を制御するプログラムを予め記憶したり、プログラムを実行する際に必要に応じて作業領域として使用したりするための記憶部である。メモリ３０４は、例えばＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ（ＲＡＭ）、ＲｅａｄＯｎｌｙＭｅｍｏｒｙ（ＲＯＭ）等である。入力装置３０６は、コンピュータの使用者により操作されると、その操作内容に対応付けられている使用者からの各種情報の入力を取得し、取得した入力情報をＣＰＵ３０２に送付する装置であり、例えばキーボード装置、マウス装置などである。出力装置３０８は、コンピュータ３００による処理結果を出力する装置であり、表示装置などが含まれる。例えば表示装置は、ＣＰＵ３０２により送付される表示データに応じてテキストや画像を表示する。 The CPU 302 is an arithmetic processing unit that controls the operation of the entire computer 300. The memory 304 is a storage unit that stores in advance a program for controlling the operation of the computer 300 and that is used as a work area as needed when the program is executed. The memory 304 is, for example, a Random Access Memory (RAM), a Read Only Memory (ROM), or the like. The input device 306 is a device that, when operated by the user of the computer, acquires the various information input by the user that is associated with the operation and sends the acquired input information to the CPU 302; it is, for example, a keyboard device or a mouse device. The output device 308 is a device that outputs the results of processing by the computer 300 and includes a display device and the like. For example, the display device displays text and images according to display data sent by the CPU 302.

外部記憶装置３１２は、例えば、ハードディスクなどの記憶装置であり、ＣＰＵ３０２により実行される各種制御プログラムや、取得したデータ等を記憶しておく装置である。媒体駆動装置３１４は、可搬記録媒体３１６に書き込みおよび読み出しを行うための装置である。ＣＰＵ３０２は、可搬型記録媒体３１６に記録されている所定の制御プログラムを、記録媒体駆動装置３１４を介して読み出して実行することによって、各種の制御処理を行うようにすることもできる。可搬記録媒体３１６は、例えばＣｏｎｐａｃｔＤｉｓｃ（ＣＤ）−ＲＯＭ、ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｃ（ＤＶＤ）、ＵｎｉｖｅｒｓａｌＳｅｒｉａｌＢｕｓ（ＵＳＢ）メモリ等である。ネットワーク接続装置３１８は、有線または無線により外部との間で行われる各種データの授受の管理を行うインタフェース装置である。バス３１０は、上記各装置等を互いに接続し、データのやり取りを行う通信経路である。 The external storage device 312 is a storage device such as a hard disk, and stores the various control programs executed by the CPU 302, acquired data, and the like. The medium driving device 314 is a device for writing to and reading from the portable recording medium 316. The CPU 302 can also perform various control processes by reading out, via the medium driving device 314, a predetermined control program recorded on the portable recording medium 316 and executing it. The portable recording medium 316 is, for example, a Compact Disc (CD)-ROM, a Digital Versatile Disc (DVD), a Universal Serial Bus (USB) memory, or the like. The network connection device 318 is an interface device that manages the exchange of various data with external devices over a wired or wireless connection. The bus 310 is a communication path that connects the above devices to one another and carries data between them.

上記第１または第２の実施の形態およびそれぞれの変形例による音声合成装置をコンピュータに実行させるプログラムは、例えば外部記憶装置３１２に記憶させる。ＣＰＵ３０２は、外部記憶装置３１２からプログラムを読み出し、コンピュータ３００に音声合成の動作を行なわせる。このとき、まず、音声合成の処理をＣＰＵ３０２に行わせるための制御プログラムを作成して外部記憶装置３１２に記憶させておく。そして、入力装置３０６から所定の指示をＣＰＵ３０２に与えて、この制御プログラムを外部記憶装置３１２から読み出させて実行させるようにする。また、このプログラムは、可搬記録媒体３１６に記憶するようにしてもよい。 A program that causes a computer to perform the processing of the speech synthesizer according to the first or second embodiment and their respective modifications is stored in, for example, the external storage device 312. The CPU 302 reads the program from the external storage device 312 and causes the computer 300 to perform the speech synthesis operations. To do so, a control program for causing the CPU 302 to perform the speech synthesis processing is first created and stored in the external storage device 312. Then, a predetermined instruction is given to the CPU 302 from the input device 306 so that the control program is read from the external storage device 312 and executed. This program may also be stored on the portable recording medium 316.

なお、本発明は、以上に述べた実施の形態および変形例に限定されるものではなく、本発明の要旨を逸脱しない範囲内で種々の構成または実施形態を採ることができる。例えば、劣化位置取得部４７は、修正候補劣化種別に対応する劣化位置を取得するようにしたが、少なくとも修正劣化種別に対応する劣化位置を取得すればよい。また、修正情報は、劣化種別と劣化位置とに基づき生成する例について説明したが、例えば、音質劣化範囲が非常に短く設定されている例などでは、必ずしも劣化位置に基づき生成しなくてもよい。 The present invention is not limited to the embodiments and modifications described above, and various configurations or embodiments can be adopted without departing from the gist of the present invention. For example, although the deterioration position acquisition unit 47 is described as acquiring the deterioration positions corresponding to the correction candidate deterioration types, it only needs to acquire at least the deterioration position corresponding to the correction deterioration type. In addition, although an example has been described in which the correction information is generated based on the deterioration type and the deterioration position, the correction information does not necessarily have to be generated based on the deterioration position, for example when the sound quality deterioration range is set to be very short.

Description of Symbols
１音声合成装置 1 Speech synthesizer
３入力部 3 Input unit
５記憶部 5 Storage unit
７音声合成部 7 Speech synthesis unit
９音声修正決定部 9 Speech correction determination unit
１１出力部 11 Output unit
２１音質修正ＵＩ部 21 Sound quality correction UI unit
２３言語処理部 23 Language processing unit
２５韻律生成部 25 Prosody generation unit
２７素片選択接続部 27 Segment selection connection unit
２９波形処理部 29 Waveform processing unit
３１言語辞書 31 Language dictionary
３３韻律辞書 33 Prosody dictionary
３５波形辞書 35 Waveform dictionary
４１範囲取得部 41 Range acquisition unit
４３劣化コスト算出部 43 Deterioration cost calculation unit
４５劣化種別判定部 45 Deterioration type determination unit
４７劣化位置取得部 47 Deterioration position acquisition unit
４９修正手段決定部 49 Correction means determination unit

Claims

A synthesized voice acquisition unit for acquiring synthesized voice;
A range acquisition unit for acquiring a first range for performing correction in the voice;
A degradation cost calculation unit that calculates at least one degradation cost that is information according to sound quality degradation in the first range;
A deterioration type determining unit that determines and presents at least one correction candidate deterioration type according to the nature of the sound quality deterioration that is a candidate for correction in the first range based on the at least one deterioration cost;
A deterioration type acquiring unit that acquires a correction deterioration type selected by the user to perform correction from the correction candidate deterioration types determined and presented;
A correction information generating unit for generating correction information for correcting the sound based on the correction deterioration type;
A speech synthesis unit that re-synthesizes the speech based on the correction information;
A speech synthesizer characterized by comprising:

The degradation cost calculation unit
The speech synthesizer according to claim 1, wherein the deterioration cost is calculated based on deterioration information corresponding to the sound quality deterioration calculated for each of a phoneme, a phoneme string, a mora, and a mora string of the speech in the first range.

A deterioration position acquisition unit that acquires a deterioration position corresponding to the acquired correction deterioration type;
Further comprising
The correction information generation unit
The speech synthesis apparatus according to claim 2, wherein correction information for correcting the voice is generated based on the correction deterioration type and the deterioration position.

The degradation position acquisition unit
The speech synthesis apparatus according to claim 3, wherein the deterioration position is acquired based on the deterioration information.

The speech synthesis unit has
A prosody generation unit that generates a target prosody to be targeted when the speech is synthesized; and
A segment selection connection unit that selects segments for synthesizing the speech based on the target prosody and connects the selected segments to each other,
and
The deterioration information
The speech synthesizer according to any one of claims 2 to 4, wherein the deterioration information includes at least one of first deterioration information corresponding to a difference between a selected segment and the target prosody corresponding to that segment, and second deterioration information corresponding to a difference, at a connection portion where the segments are connected, from the target prosody corresponding to those segments.

The first deterioration information
The speech synthesizer according to claim 5, wherein the first deterioration information includes at least one of first ratio information corresponding to a pitch frequency ratio between the target prosody and the selected segment, second ratio information corresponding to a phoneme length ratio between the target prosody and the selected segment, and first difference information corresponding to a timbre difference between the target prosody and the selected segment.

The second deterioration information
The speech synthesizer according to claim 5 or 6, wherein the second deterioration information includes at least one of third ratio information corresponding to the pitch frequency ratio before and after the connection portion, second difference information corresponding to the phoneme environment difference between the segments before and after the connection portion, and third difference information corresponding to the timbre difference before and after the connection portion.

The deterioration type determination unit
2. The deterioration type according to at least one of the at least one deterioration cost, which is equal to or higher than a predetermined order in descending order, is determined as the correction candidate deterioration type. Item 8. The speech synthesizer according to any one of Items 7.

The deterioration type determination unit
8. The deterioration type according to the deterioration cost that exceeds a predetermined first threshold according to the deterioration cost is determined as the correction candidate deterioration type. The speech synthesizer described.

The speech synthesis unit
The speech synthesizer according to any one of claims 1 to 9, wherein the re-synthesis of the speech is performed on a second range set in relation to the first range.

When it is determined that at least one deterioration cost corresponding to the correction deterioration type acquired by the deterioration type acquiring unit does not exceed a predetermined second threshold set according to that deterioration cost,
The correction information generation unit
The speech synthesizer according to claim 10, wherein correction information stored in association with the correction deterioration type corresponding to the deterioration cost not exceeding the second threshold is generated as the correction information for correcting the speech.

An acquisition number counting unit that counts the number of times each correction deterioration type is acquired by the deterioration type acquiring unit;
Further comprising
The correction information generation unit
When any of the number of times exceeds a predetermined number of times, the correction information stored in association with the correction deterioration type acquired exceeding the predetermined number of times is generated as correction information for correcting the sound. The speech synthesizer according to any one of claims 1 to 10.

An acquisition number counting unit that counts the number of times each correction deterioration type is acquired by the deterioration type acquiring unit;
Further comprising
The correction information generation unit
Analyzing the number of times, and generating correction information stored in association with a correction deterioration type whose number of times is larger than other correction deterioration types as correction information for correcting the sound,
The speech synthesizer according to any one of claims 1 to 10, wherein:

Acquiring synthesized speech;
Acquiring a first range in the speech in which correction is to be performed;
Calculating at least one deterioration cost, which is information corresponding to sound quality deterioration in the first range;
Determining and presenting, based on the at least one deterioration cost, at least one correction candidate deterioration type corresponding to the nature of the sound quality deterioration that is a candidate for correction in the first range;
Acquiring a correction deterioration type selected by the user to perform correction from the correction candidate deterioration types determined and presented;
Generating correction information for correcting the speech based on the correction deterioration type;
Re-synthesizing the speech based on the correction information;
A sound quality correction method characterized by the above.

Acquiring synthesized speech;
Acquiring a first range in the speech in which correction is to be performed;
Calculating at least one deterioration cost, which is information corresponding to sound quality deterioration in the first range;
Determining and presenting, based on the at least one deterioration cost, at least one correction candidate deterioration type corresponding to the nature of the sound quality deterioration that is a candidate for correction in the first range;
Acquiring a correction deterioration type selected by the user to perform correction from the correction candidate deterioration types determined and presented;
Generating correction information for correcting the speech based on the correction deterioration type;
Re-synthesizing the speech based on the correction information;
A program that causes a computer to execute the above processing.