JP6400526B2

JP6400526B2 - Speech synthesis apparatus, method thereof, and program

Info

Publication number: JP6400526B2
Application number: JP2015103692A
Authority: JP
Inventors: 水野 秀之 (Hideyuki Mizuno); 井島 勇祐 (Yusuke Ijima)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-05-21
Filing date: 2015-05-21
Publication date: 2018-10-03
Anticipated expiration: 2035-05-21
Also published as: JP2016218281A

Description

The present invention relates to speech synthesis technology, and in particular to technology for improving the quality of synthesized speech.

With the recent development of statistical speech synthesis technology, it has become possible to generate high-quality synthesized speech. For example, with the development of HMM (hidden Markov model) speech synthesis (see, for example, Non-Patent Document 1), training on the speech data of an arbitrary speaker makes it possible to generate synthesized speech with that speaker's voice quality and tone.

Various methods have also been proposed for improving the quality of synthesized speech (see, for example, Non-Patent Document 2). However, statistical speech synthesis cannot capture the full variety of phenomena in real speech, so the attainable quality improvement is limited. Methods have therefore been proposed that improve synthesized speech by exploiting the quality of the original speech. For example, Non-Patent Document 3 proposes combining unit-concatenation speech synthesis (see, for example, Patent Document 1), which preserves the quality of the original speech, with HMM speech synthesis: the spectrum produced by unit-concatenation synthesis is used for the high-frequency part and the spectrum produced by HMM synthesis is used for the low-frequency part.

Patent Document 1: Japanese Patent No. 2761552

Non-Patent Document 1: H. Zen, K. Tokuda, T. Masuko, T. Kobayashi and T. Kitamura, “A Hidden semi-Markov model-based speech synthesis system,” IEICE Trans. Inf. and Syst., vol. E90-D, no. 5, pp. 825-834, 2007.
Non-Patent Document 2: T. Toda and K. Tokuda, “A speech parameter generation algorithm considering global variance for HMM-based speech synthesis,” IEICE Trans. Inf. and Syst., vol. E90-D, no. 5, pp. 816-824, 2007.
Non-Patent Document 3: T. Inoue, S. Hara, M. Abe, “A Hybrid Text-to-Speech Based on Sub-Band Approach,” in Proc. of APSIPA, DOI: 10.1109/APSIPA.2014.7041575, 2014.

In the method of Non-Patent Document 3, the boundary frequency at which the high-frequency and low-frequency parts are joined strongly affects the quality of the synthesized speech. Lowering the boundary frequency increases the share of the synthesized speech taken by the unit-concatenation spectrum and decreases the share taken by the HMM spectrum. The lower the boundary frequency, therefore, the more natural (closer to a real human voice) the synthesized speech sounds, but the more often audible artifacts arise from distortion at the joins between adjacent phonemes (units). Conversely, raising the boundary frequency decreases the share of the unit-concatenation spectrum and increases the share of the HMM spectrum, so artifacts caused by distortion at phoneme joins become less frequent but naturalness declines. The quality of the synthesized speech therefore cannot be improved without considering both the artifacts caused by distortion at phoneme joins and the gain in naturalness obtained by introducing unit-concatenation synthesis.

The appropriate boundary frequency in light of these two factors depends on the phonemes being concatenated. In the method of Non-Patent Document 3, however, a single uniform boundary frequency is used, so a sufficient quality improvement is not always obtained. In addition, the boundary frequency has conventionally had to be set manually.

An object of the present invention is to provide a technique that automatically determines an appropriate boundary frequency and thereby improves the quality of synthesized speech.

A boundary frequency is determined based on the distance between the spectral features before and after a unit boundary of the phonemes concatenated according to a text. A mixed spectrum is then obtained by mixing two components: the component, on the high-frequency side corresponding to the boundary frequency, of the spectrum of a first synthesized speech obtained by unit concatenation according to the text; and the component, on the low-frequency side corresponding to the boundary frequency, of the spectrum of a second synthesized speech obtained by applying an acoustic model for speech synthesis to the text.

As a result, an appropriate boundary frequency is determined automatically, and the quality of the synthesized speech can be improved.

FIG. 1 is a block diagram illustrating the speech synthesis apparatus of the embodiment.
FIG. 2A is a block diagram illustrating the boundary frequency determination unit of the embodiment. FIG. 2B is a block diagram illustrating the waveform generation processing unit of the embodiment.
FIG. 3 is a diagram illustrating the speech synthesis method of the embodiment.
FIG. 4A is a flowchart illustrating the boundary frequency determination method of the embodiment. FIG. 4B is a conceptual diagram illustrating the boundary frequency determination method of the embodiment.
FIG. 5 is a conceptual diagram illustrating the waveform generation processing of the embodiment.
FIG. 6 is a flowchart illustrating the boundary frequency determination method of the embodiment.
FIGS. 7A and 7B are conceptual diagrams illustrating the boundary frequency determination method of the embodiment.

Embodiments of the present invention will be described below.
[Overview]
An overview of the embodiments is given first. In the embodiments, a boundary frequency is determined based on the distance between the spectral features before and after a unit boundary of the phonemes concatenated according to an input “text”. A mixed spectrum is then obtained by mixing two components: the component, on the high-frequency side corresponding to the boundary frequency, of the spectrum of a first synthesized speech obtained by unit concatenation according to the “text”; and the component, on the low-frequency side corresponding to the boundary frequency, of the spectrum of a second synthesized speech obtained by applying an “acoustic model” for speech synthesis to the “text”. An example of the “acoustic model” is a probabilistic model such as an HMM.
The two sides may be delimited in several ways. The band above the boundary frequency may be the “high-frequency side corresponding to the boundary frequency” and all other frequencies the “low-frequency side corresponding to the boundary frequency”. Alternatively, the band at or above the boundary frequency may be the high-frequency side and all other frequencies the low-frequency side. The band above the boundary frequency may be the high-frequency side and the band below the boundary frequency the low-frequency side. The two sides may also be delimited at a frequency obtained by adding a constant or a variable to, or subtracting it from, the boundary frequency. The high-frequency-side band and the low-frequency-side band may partially overlap.
Because the boundary frequency is determined based on the distance between the spectral features before and after the unit boundary, the setting of the boundary frequency can be automated, and the band occupied by the unit-concatenation spectrum in the synthesized speech can be adjusted according to the magnitude of the distortion at the phoneme join. As a result, the quality of the synthesized speech can be improved while taking into account both the artifacts caused by distortion at phoneme joins and the gain in naturalness obtained by introducing unit-concatenation synthesis.
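The mixing described in the overview can be illustrated with a minimal sketch. It assumes both spectra are magnitude values on the same linearly spaced frequency bins, and it uses the convention in which bins at or above the boundary frequency form the high-frequency side. The function name `mix_spectra` and its parameters are hypothetical, and the hard per-bin switch stands in for whatever filtering an actual implementation would use.

```python
from typing import List


def mix_spectra(unit_spec: List[float], hmm_spec: List[float],
                bin_hz: float, boundary_hz: float) -> List[float]:
    """Take bins at or above boundary_hz from unit_spec and the rest from hmm_spec.

    Assumption: bin k sits at frequency k * bin_hz (linear spacing).
    """
    if len(unit_spec) != len(hmm_spec):
        raise ValueError("spectra must have the same number of bins")
    mixed = []
    for k, (u, h) in enumerate(zip(unit_spec, hmm_spec)):
        freq = k * bin_hz
        # High-frequency side: unit-concatenation spectrum.
        # Low-frequency side: spectrum from the acoustic model (e.g. HMM).
        mixed.append(u if freq >= boundary_hz else h)
    return mixed


unit = [1.0, 1.0, 1.0, 1.0, 1.0]  # toy unit-concatenation spectrum
hmm = [0.0, 0.0, 0.0, 0.0, 0.0]   # toy HMM spectrum
mixed = mix_spectra(unit, hmm, bin_hz=1000.0, boundary_hz=2000.0)
```

With these toy inputs the bins at 0 Hz and 1000 Hz come from the HMM spectrum and the bins at 2000 Hz and above come from the unit-concatenation spectrum. Lowering `boundary_hz` hands more bins to the unit-concatenation spectrum, which is the trade-off the overview describes.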

Depending on the types of the phonemes concatenated according to the “text”, one of the following may be selected: (a) a first method, which determines a boundary frequency within a “predetermined frequency section” based on the distance between spectral features; (b) a second method, which uses a frequency at or above the upper limit of the “predetermined frequency section” as the boundary frequency; or (c) a third method, which uses a frequency at or below the lower limit of the “predetermined frequency section” as the boundary frequency. This makes it possible to select an appropriate boundary frequency according to the types of the concatenated phonemes and to improve the quality of the synthesized speech.

For example, let each of the phonemes concatenated according to the “text” be a “current phoneme”. When distortion between the “current phoneme” and the “preceding phoneme” immediately before it has a large effect on sound quality, and distortion between the “current phoneme” and the “following phoneme” immediately after it has a small effect on sound quality, the “first method” is selected to determine the boundary frequency for the “current phoneme”. For example, the “first method” is selected for the “current phoneme” when the “current phoneme” is a vowel or a voiced consonant, the “preceding phoneme” is a vowel or a voiced consonant, and the “following phoneme” is an unvoiced consonant. In these examples, the “first method” determines the boundary frequency for the “current phoneme” based on the distance between the spectral features of the “current phoneme” and the “preceding phoneme”. This improves the quality of the synthesized speech corresponding to the “current phoneme” while taking into account both the artifacts caused by distortion between the “current phoneme” and the “preceding phoneme” and the gain in naturalness obtained by introducing unit-concatenation synthesis.

For example, when distortion between the “current phoneme” and the “following phoneme” has a large effect on sound quality, the “second method” is selected to determine the boundary frequency for the “current phoneme”. For example, the “second method” is selected for the “current phoneme” when the “current phoneme” is a vowel or a voiced consonant and the “following phoneme” immediately after it is a vowel or a voiced consonant. This raises the boundary frequency, reduces the share of the unit-concatenation spectrum in the synthesized speech corresponding to the “current phoneme”, and suppresses artifacts caused by distortion between the “current phoneme” and the “following phoneme”.

For example, when distortion between the “current phoneme” and the “preceding phoneme” has a small effect on sound quality, and distortion between the “current phoneme” and the “following phoneme” also has a small effect on sound quality, the “third method” is selected for the “current phoneme”. For example, the “third method” is selected when the “current phoneme” is a vowel or a voiced consonant and both the “preceding phoneme” and the “following phoneme” are unvoiced consonants, and/or when the “current phoneme” is an unvoiced consonant. When distortion has such a small effect on sound quality, the boundary frequency is lowered so that the share of the unit-concatenation spectrum in the synthesized speech corresponding to the “current phoneme” increases. This improves the naturalness of the synthesized speech while keeping distortion-induced artifacts in check.
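The example selection rules for the three methods can be condensed into a small decision function. The sketch below is illustrative only: the names `Method` and `select_method` are hypothetical, and each phoneme is reduced to a single flag that is true for a vowel or a voiced consonant and false for an unvoiced consonant. Treating a missing neighbour (for instance at the start or end of an utterance) as unvoiced is an assumption that the text does not state.

```python
from enum import Enum


class Method(Enum):
    FIRST = 1   # search for a boundary inside the predetermined frequency section
    SECOND = 2  # boundary at or above the section's upper limit
    THIRD = 3   # boundary at or below the section's lower limit


def select_method(prev_voiced: bool, cur_voiced: bool, next_voiced: bool) -> Method:
    """Pick a method for the current phoneme from the voicing of its context.

    'voiced' means vowel or voiced consonant; False means unvoiced consonant.
    """
    if not cur_voiced:
        return Method.THIRD   # current phoneme is an unvoiced consonant
    if next_voiced:
        return Method.SECOND  # distortion toward the following phoneme matters
    if prev_voiced:
        return Method.FIRST   # only the join with the preceding phoneme matters
    return Method.THIRD       # both neighbours are unvoiced consonants
```

For a voiced current phoneme between a voiced predecessor and an unvoiced successor, `select_method(True, True, False)` yields `Method.FIRST`, matching the first example in the text.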

In the “first method”, the “distance between spectral features” described above may be obtained for a “first determination band”, which is a partial band of the predetermined frequency section. If the distance in the “first determination band” is less than an “allowable limit value” (threshold), a frequency corresponding to the “first determination band” is taken as the boundary frequency. Otherwise, a band higher in frequency than the “first determination band” is taken as a “second determination band”, and the same processing is executed again with the “second determination band” treated as the “first determination band”. In other words, the distance is calculated band by band starting from the low-frequency side, and as soon as it falls below the “allowable limit value”, the frequency corresponding to the current “first determination band” may be taken as the boundary frequency. Examples of the “frequency corresponding to the first determination band” are the lower limit frequency, the upper limit frequency, or the center frequency of the “first determination band”. Examples of the “second determination band” are a band whose lower limit frequency is the upper limit frequency of the “first determination band” or a frequency adjacent to it, or a band whose lower limit frequency is some other frequency higher than the upper limit frequency of the “first determination band”. An example of the initial “first determination band” is a band whose lower limit is the lower limit frequency of the predetermined frequency section.
In terms of human auditory characteristics, the higher the frequency, the smaller the effect that artifacts caused by distortion at the join between adjacent phonemes have on the quality of the synthesized speech. Consequently, if the distance in the “first determination band” is less than the “allowable limit value” and the distortion between phonemes is therefore small, artifacts caused by inter-phoneme distortion in bands above the “first determination band” also tend to have little effect on quality. Taking the frequency corresponding to the “first determination band” whose distance fell below the “allowable limit value” as the boundary frequency thus improves the naturalness of the synthesized speech while limiting the effect of distortion-induced artifacts. Furthermore, repeating the above processing while raising the “first determination band” until the distance falls below the “allowable limit value” yields the lowest boundary frequency at which the effect of distortion-induced artifacts can still be suppressed. As a result, the share of the unit-concatenation spectrum in the synthesized speech is made as large as possible, and naturalness is improved, while the effect of distortion-induced artifacts is limited.
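The band-by-band search of the first method can be sketched as a simple loop. The sketch assumes the bands are given by their edge frequencies, that the per-band distance is supplied by a caller-provided function, and that the lower limit frequency of the accepted band is used as the boundary frequency (one of the options listed in the text). The name `find_boundary` is hypothetical, and returning the section's upper limit when no band passes is an assumption, since the passage above does not say what happens in that case.

```python
from typing import Callable, Sequence


def find_boundary(band_edges: Sequence[float],
                  distance_in_band: Callable[[int], float],
                  limits: Sequence[float]) -> float:
    """Scan bands from low to high and stop at the first acceptable one.

    band_edges = [B_0, ..., B_N]; band n (1-based) spans B_{n-1} to B_n.
    limits[n - 1] is the allowable limit value for band n.
    """
    n_bands = len(band_edges) - 1
    for n in range(1, n_bands + 1):
        if distance_in_band(n) < limits[n - 1]:
            # Lower limit frequency of the first band whose distance is acceptable.
            return band_edges[n - 1]
    # Assumed fallback: no band was acceptable, so use the upper limit B_N.
    return band_edges[-1]


edges = [200.0, 1000.0, 2000.0, 3000.0]  # toy section with three bands
toy_distance = {1: 5.0, 2: 1.5, 3: 0.5}  # toy distance per band
boundary = find_boundary(edges, toy_distance.__getitem__, [2.0, 2.0, 2.0])
```

In this toy case band 1 fails the test (5.0 is not below 2.0) and band 2 passes, so the boundary frequency becomes 1000 Hz, the lowest frequency from which the distortion is judged acceptable.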

The “allowable limit value” may be defined separately for each “first determination band” (that is, for each frequency corresponding to a first determination band), or may be the same for all “first determination bands”. When it is defined per band, the “allowable limit value” may be monotonically non-decreasing with respect to the frequency corresponding to the “first determination band”. For example, the higher the frequency corresponding to the “first determination band”, the larger the “allowable limit value” may be. The larger the “allowable limit value”, the more often the distance in the “first determination band” falls below it, and the more often a low frequency is selected as the boundary frequency. At the same time, the higher the frequency, the smaller the effect that artifacts caused by distortion at the join between adjacent phonemes have on the quality of the synthesized speech. A non-decreasing “allowable limit value” therefore makes the share of the unit-concatenation spectrum in the synthesized speech as large as possible, and improves naturalness, while limiting the effect of distortion-induced artifacts.
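A per-band allowable limit value that is non-decreasing in frequency could be generated as follows. This is only a sketch: the linear ramp, the function names `linear_limits` and `is_non_decreasing`, and the numeric values are all assumptions chosen for illustration, and any non-decreasing schedule (including a constant one) satisfies the condition in the text.

```python
from typing import List, Sequence


def linear_limits(n_bands: int, lowest: float, highest: float) -> List[float]:
    """Return n_bands limit values rising linearly from lowest to highest."""
    if n_bands < 2:
        raise ValueError("need at least two bands")
    step = (highest - lowest) / (n_bands - 1)
    return [lowest + n * step for n in range(n_bands)]


def is_non_decreasing(values: Sequence[float]) -> bool:
    """True when each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


limits = linear_limits(4, 1.0, 2.5)
```

A uniform schedule such as `[2.0, 2.0, 2.0]` also passes `is_non_decreasing`, which corresponds to the case where one limit value is shared by all bands.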

The “second method” and the “third method” may determine the boundary frequency based on a “degree of change”, namely how much the fundamental frequency obtained by applying the “acoustic model” to the “text” differs from the fundamental frequency of the phonemes concatenated according to the “text”. The boundary frequency selected by the “second method” and the “third method” may be monotonically non-decreasing with respect to the magnitude of the “degree of change”. For example, in the “second method”, the upper limit of the “predetermined frequency section” may be used as the boundary frequency when the “degree of change” is within a predetermined range, and a frequency above that upper limit (for example, the Nyquist frequency) may be used otherwise. In the “third method”, a frequency below the lower limit of the “predetermined frequency section” (for example, 0 Hz) may be used as the boundary frequency when the “degree of change” is within a predetermined range, and the lower limit of the “predetermined frequency section” may be used otherwise. In general, the larger the “degree of change”, the lower the quality of the synthesized speech obtained by unit-concatenation synthesis. Determining the boundary frequency based on the “degree of change” therefore improves the quality of the synthesized speech.
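The second and third methods with a degree-of-change test can be sketched as below. The text does not define how the degree of change is measured, so the absolute log2 ratio of the two fundamental frequencies used here is an assumption, as is expressing the "predetermined range" as a single maximum value. All function and parameter names are hypothetical.

```python
import math


def f0_change_degree(unit_f0_hz: float, target_f0_hz: float) -> float:
    """Assumed measure: size of the F0 change in octaves (absolute log2 ratio)."""
    return abs(math.log2(target_f0_hz / unit_f0_hz))


def second_method(change: float, max_change: float,
                  upper_hz: float, nyquist_hz: float) -> float:
    """Upper limit of the section for a small change, Nyquist frequency otherwise."""
    return upper_hz if change <= max_change else nyquist_hz


def third_method(change: float, max_change: float, lower_hz: float) -> float:
    """0 Hz for a small change, lower limit of the section otherwise."""
    return 0.0 if change <= max_change else lower_hz


change = f0_change_degree(unit_f0_hz=100.0, target_f0_hz=200.0)  # one octave
```

In both functions a larger degree of change never lowers the returned boundary frequency, which matches the non-decreasing relationship described in the text: the more the unit's fundamental frequency has to be altered, the smaller the band given to the unit-concatenation spectrum.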

[First Embodiment]
A first embodiment will be described.
<Configuration>
As illustrated in FIG. 1, the speech synthesis apparatus 1 of this embodiment includes an input unit 11, a speech corpus storage unit 121, a speech database (DB) storage unit 122, an acoustic model storage unit 123, a speech database (DB) construction unit 131, an acoustic model generation unit 132, a unit-concatenation speech synthesis unit 133, an HMM speech synthesis unit 134, a boundary frequency determination unit 135, a spectrum mixing processing unit 136, and a waveform generation processing unit 137. As illustrated in FIG. 2A, the boundary frequency determination unit 135 of this embodiment includes a determination unit 1352. As illustrated in FIG. 2B, the spectrum mixing processing unit 136 of this embodiment includes a high-pass filter 1361, a low-pass filter 1362, and a mixing unit 1363. The speech synthesis apparatus of the embodiment is, for example, an apparatus configured by having a general-purpose or dedicated computer execute a predetermined program, the computer being equipped with a processor (hardware processor) such as a CPU (central processing unit) and memory such as RAM (random-access memory) and ROM (read-only memory). The computer may have a single processor and memory or a plurality of processors and memories. The program may be installed on the computer or recorded in ROM or the like in advance. Some or all of the processing units may be implemented with electronic circuitry that realizes the processing functions without using a program, rather than with circuitry such as a CPU that realizes a functional configuration by loading a program. The electronic circuitry constituting a single apparatus may include a plurality of CPUs.

<Preprocessing>
As preprocessing, a speech corpus is stored in the speech corpus storage unit 121. The speech DB construction unit 131 uses the speech corpus stored in the speech corpus storage unit 121 to generate a speech DB that can be used for unit-concatenation speech synthesis. For example, the speech DB construction unit 131 can generate the speech DB using known methods such as those described in Japanese Patent No. 2761552 and Japanese Patent No. 4430960. The generated speech DB is stored in the speech DB storage unit 122. The acoustic model generation unit 132 uses the speech corpus stored in the speech corpus storage unit 121 as training data to generate an acoustic model for speech synthesis. An example of the acoustic model is an acoustic model for HMM speech synthesis, which can be generated using a known method such as the one described in Non-Patent Document 1. The acoustic model of this embodiment models the HMM spectrum (frequency spectrum), the fundamental frequency, and so on. The generated acoustic model is stored in the acoustic model storage unit 123.

<Speech synthesis processing>
Next, the speech synthesis processing of this embodiment will be described with reference to FIG. 3. Text representing the sentence to be synthesized is input to the input unit 11. The text is passed to the unit-concatenation speech synthesis unit 133 and the HMM speech synthesis unit 134. The HMM speech synthesis unit 134 applies the text (for example, the result of text analysis of the text) to the acoustic model stored in the acoustic model storage unit 123, and obtains and outputs an HMM spectrum, a fundamental frequency, and so on. The HMM spectrum, the fundamental frequency, and so on are sent to the spectrum mixing processing unit 136, and the fundamental frequency is also sent to the unit-concatenation speech synthesis unit 133 (step S134: HMM speech synthesis processing). The unit-concatenation speech synthesis unit 133 takes the text and the fundamental frequency as input, performs unit concatenation (unit-concatenation speech synthesis) according to the text using the speech DB stored in the speech DB storage unit 122, and obtains and outputs a unit spectrum (the spectrum of the first synthesized speech obtained by unit concatenation according to the text) and the corresponding phoneme sequence with time information. The “phoneme sequence with time information” is the sequence of concatenated phonemes with time information attached to each phoneme. The time information of a phoneme is, for example, information representing the position of the phoneme on the time axis (for example, its start time or end time) relative to a predetermined time (for example, the start time of the phoneme sequence). The unit spectrum and the corresponding phoneme sequence with time information are sent to the boundary frequency determination unit 135, and the unit spectrum is also sent to the spectrum mixing processing unit 136 (step S133: unit-concatenation speech synthesis processing).
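The order of the steps and the data passed between the units can be summarized in a short driver function. This is a structural sketch only: the four callables stand in for the HMM speech synthesis unit 134, the unit-concatenation speech synthesis unit 133, the boundary frequency determination unit 135, and the spectrum mixing processing unit 136, and their signatures are assumptions made for illustration.

```python
from typing import Any, Callable, List, Tuple

Spectrum = List[float]


def synthesize_mixed_spectrum(
    text: str,
    hmm_synth: Callable[[str], Tuple[Spectrum, float]],
    unit_synth: Callable[[str, float], Tuple[Spectrum, Any]],
    decide_boundary: Callable[[Spectrum, Any], Any],
    mix: Callable[[Spectrum, Spectrum, Any], Spectrum],
) -> Spectrum:
    """Run the steps in the order described: S134, then S133, then S135, then mixing."""
    # Step S134: HMM synthesis yields the HMM spectrum and the fundamental frequency.
    hmm_spec, f0 = hmm_synth(text)
    # Step S133: unit concatenation uses the text and the fundamental frequency from S134.
    unit_spec, timed_phonemes = unit_synth(text, f0)
    # Step S135: the boundary frequency is decided from the unit spectrum and timing.
    boundary = decide_boundary(unit_spec, timed_phonemes)
    # Mixing: both spectra and the boundary go to the spectrum mixing stage.
    return mix(unit_spec, hmm_spec, boundary)
```

The point the sketch makes explicit is the dependency between the two synthesizers: the unit-concatenation step cannot start until the HMM step has produced the fundamental frequency.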

The boundary frequency determination unit 135 takes the unit spectrum and the phoneme sequence with time information as input, determines a boundary frequency B based on the distance between the spectral features before and after the unit boundaries of the phonemes concatenated according to the text, as described above, and outputs it (step S135: boundary frequency determination processing).
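One possible form of a band-limited spectral distance is sketched below. The passage above does not specify the spectral feature or the distance measure, so the Euclidean distance over linearly spaced bins used here is purely an assumption, and the name `band_distance` is hypothetical.

```python
import math
from typing import Sequence


def band_distance(before: Sequence[float], after: Sequence[float],
                  bin_hz: float, low_hz: float, high_hz: float) -> float:
    """Euclidean distance over bins whose frequency lies in [low_hz, high_hz).

    before / after: spectral features just before and just after the unit boundary.
    Assumption: bin k sits at frequency k * bin_hz.
    """
    total = 0.0
    for k, (x, y) in enumerate(zip(before, after)):
        if low_hz <= k * bin_hz < high_hz:
            total += (x - y) ** 2
    return math.sqrt(total)


before = [0.0, 0.0, 3.0, 4.0, 0.0]
after = [0.0, 0.0, 0.0, 0.0, 9.0]
d = band_distance(before, after, bin_hz=1000.0, low_hz=2000.0, high_hz=4000.0)
```

Only the bins at 2000 Hz and 3000 Hz fall inside the band, so the large mismatch at 4000 Hz does not contribute to `d`.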

<< Example of details of step S135 >>
The details of step S135 are illustrated using FIGS. 4A and 4B. In this embodiment, a boundary frequency is determined for each unit-connected phoneme. Let T_i be the phoneme boundary between the current phoneme (the i-th phoneme), whose boundary frequency is to be determined, and the preceding phoneme (the (i-1)-th phoneme) immediately before it. The "predetermined frequency section" in the time sections of calculation section length T before and after T_i is divided into N bands (frequency bands) b_1, ..., b_N, for example 10 bands. Here T is a positive value, for example 20 msec, and N is an integer of 2 or more, for example N = 10. The upper limit B_N and the lower limit B_0 of the "predetermined frequency section" are determined in advance. An example of the upper limit B_N is the upper limit frequency of the second formant of vowels (for example, 3 kHz), and an example of the lower limit B_0 is the lower limit frequency of the first formant of vowels (for example, 200 Hz). The upper limit of each band b_n is denoted B_n. In this embodiment, a band b_n with a larger n has a higher frequency, and the upper limit B_{n'-1} of band b_{n'-1} coincides with the lower limit of band b_{n'}. The bands b_1, ..., b_N are preferably obtained by dividing the "predetermined frequency section" at equal intervals on the mel scale (mel frequency), but they may be divided by other criteria, for example at equal intervals on a linear scale (linear frequency).
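The band division described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and the Hz-to-mel conversion uses the common formula m = 2595 log10(1 + f/700), which is an assumption because the text does not say which mel formula is intended. The defaults (200 Hz, 3 kHz, N = 10) are the example values given in the text.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    # Common mel formula (assumed here; the source does not specify one).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def band_edges(b0_hz: float = 200.0, bn_hz: float = 3000.0,
               n_bands: int = 10, scale: str = "mel") -> list:
    """Return [B_0, B_1, ..., B_N]; band b_n spans B_{n-1} to B_n."""
    if n_bands < 2:
        raise ValueError("N must be an integer of 2 or more")
    if not b0_hz < bn_hz:
        raise ValueError("B_0 must be below B_N")
    if scale == "mel":
        lo, hi = hz_to_mel(b0_hz), hz_to_mel(bn_hz)
        edges = [mel_to_hz(lo + (hi - lo) * k / n_bands)
                 for k in range(n_bands + 1)]
    elif scale == "linear":
        edges = [b0_hz + (bn_hz - b0_hz) * k / n_bands
                 for k in range(n_bands + 1)]
    else:
        raise ValueError("scale must be 'mel' or 'linear'")
    # Pin the end points so rounding error cannot move B_0 or B_N.
    edges[0], edges[-1] = b0_hz, bn_hz
    return edges
```

With mel spacing the lower bands come out narrower in Hz than the upper ones, so the search in step S135 has finer resolution near the first formant.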

The determination unit 1352 (FIG. 2A) of the boundary frequency determination unit 135 initializes n := 1, where "n := 1" means that n is set to 1 (1 is assigned to n) (step S1352a). The determination unit 1352 calculates the average value S_{n,i-1} of the spectral feature in band b_n (a first determination band, which is a part of the predetermined frequency section) over the time section from T_i - T to T_i of the preceding phoneme, and the average value S_{n,i} of the spectral feature in band b_n over the time section from T_i to T_i + T of the current phoneme. For example, let s_{n,i-1,t} be the spectral feature of the (i-1)-th phoneme (preceding phoneme) in band b_n at discrete time t, and let s_{n,i,t} be the spectral feature of the i-th phoneme (current phoneme) in band b_n at discrete time t. Then S_{n,i-1} is the average of s_{n,i-1,t} over the discrete times t from T_i - T to T_i, and S_{n,i} is the average of s_{n,i,t} over the discrete times t from T_i to T_i + T.

Examples of the spectral feature include a power spectrum, mel-cepstral coefficients, and cepstral coefficients (step S1352b).
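The two averages computed in step S1352b can be written out as below. This is a sketch under the assumption of a plain arithmetic mean over the discrete times in each section; the frame counts M_{i-1} and M_i are notation introduced here for illustration and do not appear in the original text.

```latex
S_{n,i-1} = \frac{1}{M_{i-1}} \sum_{t = T_i - T}^{T_i} s_{n,i-1,t},
\qquad
S_{n,i} = \frac{1}{M_i} \sum_{t = T_i}^{T_i + T} s_{n,i,t}
```

Here M_{i-1} and M_i denote the number of discrete times t in the sections [T_i - T, T_i] and [T_i, T_i + T], respectively.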

Next, the determination unit 1352 calculates the distance D_{n,i} between S_{n,i-1} and S_{n,i} (the distance between the spectral features in the first determination band b_n). When S_{n,i-1} and S_{n,i} are scalars, D_{n,i} is the absolute value of their difference; when they are vectors, D_{n,i} is the norm of their difference (step S1352c). The determination unit 1352 compares the allowable limit value L_n for band b_n with the distance D_{n,i} and determines whether L_n > D_{n,i}. The allowable limit value L_n is determined in advance, for example by listening experiments. It may be set separately for each band b_n or may be the same for all bands b_1, ..., b_N. When it is set per band, L_n may be monotonically non-decreasing with respect to the frequency corresponding to band b_n (for example, B_n or B_{n-1}) (step S1352d).

If the distance D_{n,i} is less than the allowable limit value L_n (L_n > D_{n,i}), the determination unit 1352 sets a frequency corresponding to band b_n as the boundary frequency B. For example, B = B_n, B = B_{n-1}, or B = (B_{n-1} + B_n)/2 may be used (step S1352g). If D_{n,i} is not less than L_n, the determination unit 1352 determines whether n = N (step S1352e). If n is not equal to N, the determination unit 1352 sets n := n + 1 (step S1352f) and executes the processing from step S1352b onward. That is, if the distance D_{n,i} between the spectral features in the first determination band (band b_n) is not less than L_n, the determination unit 1352 takes band b_{n+1}, which is higher in frequency than the first determination band (in this example, the next band on the high-frequency side), as a second determination band and repeats the processing with the second determination band as the first determination band. If n = N, step S1352g is executed and a frequency corresponding to band b_N (for example, the upper limit B_N) is set as the boundary frequency B (step S1352g). The processing of steps S1352a to S1352g is executed with each unit-connected phoneme as the current phoneme. This processing makes it possible to determine the boundary frequency B for each phoneme based on a quantitative criterion.
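The loop over steps S1352a to S1352g can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, the vector distance is taken to be the Euclidean norm of the difference (the text only says "norm"), and B = B_n is chosen among the three options the text allows.

```python
import math


def distance(s_prev, s_cur) -> float:
    """D_{n,i}: |difference| for scalars, Euclidean norm for vectors (assumed)."""
    if isinstance(s_prev, (int, float)):
        return abs(s_prev - s_cur)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s_prev, s_cur)))


def determine_boundary_frequency(S_prev, S_cur, limits, upper_edges) -> float:
    """Search bands b_1..b_N from low to high frequency.

    S_prev[n-1], S_cur[n-1]: averaged features S_{n,i-1}, S_{n,i} in band b_n
    limits[n-1]:             allowable limit value L_n
    upper_edges[n-1]:        upper limit B_n of band b_n
    """
    N = len(upper_edges)
    if not (len(S_prev) == len(S_cur) == len(limits) == N):
        raise ValueError("all inputs must have one entry per band")
    for n in range(1, N + 1):                       # S1352a / S1352f
        d = distance(S_prev[n - 1], S_cur[n - 1])   # S1352c
        if limits[n - 1] > d:                       # S1352d: L_n > D_{n,i}
            return upper_edges[n - 1]               # S1352g with B = B_n
    return upper_edges[-1]                          # n = N reached: B = B_N
```

For example, with three bands whose upper limits are 500, 1500, and 3000 Hz and a uniform limit of 0.5, features that differ by 2.0 in the first band and 0.2 in the second give B = 1500 Hz.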

The boundary frequency B output from the boundary frequency determination unit 135 is sent to the spectrum mixing processing unit 136. For each phoneme, the spectrum mixing processing unit 136 obtains a mixed spectrum by mixing the high-frequency component, relative to the boundary frequency B, of the input segment spectrum (the spectrum of the first synthesized speech obtained by unit connection according to the text) with the low-frequency component, relative to the boundary frequency B, of the HMM spectrum (the spectrum of the second synthesized speech obtained by applying the acoustic model for speech synthesis to the text). In the example of FIG. 5, the segment spectrum and the boundary frequency B are input to the high-pass filter 1361. The high-pass filter 1361 cuts the low-frequency part of the segment spectrum at or below the boundary frequency B by high-pass filtering and outputs the resulting high-frequency component of the segment spectrum. The HMM spectrum and the boundary frequency B are input to the low-pass filter 1362. The low-pass filter 1362 cuts the high-frequency part of the HMM spectrum at or above the boundary frequency B by low-pass filtering and outputs the resulting low-frequency component of the HMM spectrum. The mixing unit 1363 receives the high-frequency component of the segment spectrum and the low-frequency component of the HMM spectrum as input, mixes (combines) them, and outputs the resulting mixed spectrum. All of these processes are performed for each phoneme. The mixed spectrum is sent together with the fundamental frequency to the waveform generation processing unit 137 (step S136: spectrum mixing process).
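The mixing in step S136 can be sketched for one frame as follows. This is a simplified illustration, not the patent's filter design: it assumes ideal (brick-wall) filters applied bin by bin, the function name and the list-based spectrum representation are hypothetical, and the bin exactly at B is assigned to the segment side, which is a choice made only for this sketch.

```python
def mix_spectra(segment_spec, hmm_spec, freqs_hz, boundary_hz):
    """Combine the high band of the segment spectrum with the low band of
    the HMM spectrum for one frame.

    segment_spec, hmm_spec: spectral values per frequency bin
    freqs_hz:               centre frequency of each bin
    boundary_hz:            boundary frequency B for the current phoneme
    """
    if not (len(segment_spec) == len(hmm_spec) == len(freqs_hz)):
        raise ValueError("spectra and frequency axis must have equal length")
    mixed = []
    for f, seg, hmm in zip(freqs_hz, segment_spec, hmm_spec):
        high = seg if f >= boundary_hz else 0.0   # role of high-pass filter 1361
        low = hmm if f < boundary_hz else 0.0     # role of low-pass filter 1362
        mixed.append(high + low)                  # role of mixing unit 1363
    return mixed
```

With B = 0 Hz the result is the segment spectrum alone, and with B above the highest bin it is the HMM spectrum alone, which matches the two extremes used by methods c and b in the second embodiment.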

The waveform generation processing unit 137 uses the input mixed spectrum and fundamental frequency to generate and output a time-domain waveform (synthesized speech) corresponding to the mixed spectrum. For this processing, for example, the method described in Reference 1 (H. Kawahara, "STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds", Acoustical Science and Technology, Vol. 27, No. 6, pp. 349-353, 2006) can be used (step S137: waveform generation process).

[Second Embodiment]
This embodiment is a modification of the first embodiment. In this embodiment, one of the following is selected according to the type of the phonemes unit-connected according to the text: (a) a first method that determines a boundary frequency within the predetermined frequency section based on the distance between spectral features, (b) a second method that sets a frequency at or above the upper limit of the frequency section as the boundary frequency, or (c) a third method that sets a frequency at or below the lower limit of the frequency section as the boundary frequency. The following description focuses on the differences from what has already been described; for matters already described, the same reference numbers are reused and the description is omitted.

<Configuration>
As illustrated in FIG. 1, the speech synthesizer 2 of this embodiment includes an input unit 11, a speech corpus storage unit 212, a speech DB storage unit 122, an acoustic model storage unit 123, a speech DB construction unit 131, an acoustic model generation unit 132, a unit connection type speech synthesis unit 133, an HMM speech synthesis unit 134, a boundary frequency determination unit 235, a spectrum mixing processing unit 136, and a waveform generation processing unit 137. As illustrated in FIG. 2A, the boundary frequency determination unit 235 of this embodiment includes a determination method selection unit 2351 and a determination unit 2352.

<Preprocessing>
The same as in the first embodiment.

<Speech synthesis processing>
The difference from the first embodiment is that, instead of the boundary frequency determination unit 135 performing the boundary frequency determination process of step S135, the boundary frequency determination unit 235 performs the boundary frequency determination process of step S235. Everything else is as described in the first embodiment. Only the boundary frequency determination process of step S235 is described below.

<< Step S235 >>
The boundary frequency determination unit 235 receives the segment spectrum and the phoneme sequence with time information as input and, as described above, determines and outputs the boundary frequency B based on the distance between the spectral features before and after the segment boundary of the phonemes unit-connected according to the text (step S235: boundary frequency determination process). The processing of step S235 of this embodiment is described using FIG. 6.

First, the determination method selection unit 2351 (FIG. 2A) of the boundary frequency determination unit 235 selects, for each phoneme, one of method a (the first method, which determines the boundary frequency based on the distance between spectral features), method b (the second method), or method c (the third method), according to the type of the unit-connected phonemes represented by the phoneme sequence with time information (the type of the phoneme itself and the combination of the types of the adjacent phonemes) (see FIG. 7B).

Method a determines the boundary frequency B within the "predetermined frequency section" (the section from the lower limit B_0 to the upper limit B_N) and is the method described in the first embodiment. As described above, method a determines the boundary frequency B for the "current phoneme", which is each of the phonemes unit-connected according to the text, based on the distance between the spectral features of the "current phoneme" and the "preceding phoneme" immediately before it (for example, FIG. 4A). That is, method a suits a "current phoneme" for which the unit connection type speech synthesis method is unsuitable when the distortion between the "current phoneme" and the "preceding phoneme" is large but applicable when that distortion is small. The determination method selection unit 2351 may select method a for such a "current phoneme".

Method b sets a frequency at or above the upper limit B_N of the "predetermined frequency section" as the boundary frequency B. The upper limit of the boundary frequency B selected by method b is, for example, the Nyquist frequency. In this case, a boundary frequency B between the upper limit B_N and the Nyquist frequency is selected. The boundary frequency B selected by method b is equal to or higher than the boundary frequency selected by method a. That is, method b suits a "current phoneme" that is basically unsuitable for the unit connection type speech synthesis method, and a "current phoneme" that has such a relationship with its adjacent phonemes. The determination method selection unit 2351 may select method b for such a "current phoneme".

Method c sets a frequency at or below the lower limit B_0 of the "predetermined frequency section" as the boundary frequency B. The lower limit of the boundary frequency B selected by method c is, for example, 0 Hz. In this case, a boundary frequency B between 0 Hz and the lower limit B_0 is selected. The boundary frequency B selected by method c is equal to or lower than the boundary frequency selected by method a. That is, method c suits a "current phoneme" that is basically suitable for the unit connection type speech synthesis method, and a "current phoneme" that has such a relationship with its adjacent phonemes. The determination method selection unit 2351 may select method c for such a "current phoneme".

<< Example 1 of the method selection >>
An example of how the method is selected is given below.
(a) The determination method selection unit 2351 selects method a for the "current phoneme" when the distortion between the "current phoneme" and the "preceding phoneme" immediately before it has a large influence on sound quality and the distortion between the "current phoneme" and the "succeeding phoneme" immediately after it has a small influence on sound quality.
(b) The determination method selection unit 2351 selects method b for the "current phoneme" when the distortion between the "current phoneme" and the "succeeding phoneme" immediately after it has a large influence on sound quality.
(c) The determination method selection unit 2351 selects method c for the "current phoneme" when the distortion between the "current phoneme" and the "preceding phoneme" immediately before it has a small influence on sound quality and the distortion between the "current phoneme" and the "succeeding phoneme" immediately after it has a small influence on sound quality.

<< Example 2 of the method selection >>
A more specific example of how the method is selected is given below (FIG. 7A).
(a) The determination method selection unit 2351 selects method a for the "current phoneme" when the "current phoneme" is a vowel or voiced consonant, the "preceding phoneme" immediately before it is a vowel or voiced consonant, and the "succeeding phoneme" immediately after it is an unvoiced consonant.
(b) The determination method selection unit 2351 selects method b for the "current phoneme" when the "current phoneme" is a vowel or voiced consonant and the "succeeding phoneme" immediately after it is a vowel or voiced consonant.
(c) The determination method selection unit 2351 selects method c for the "current phoneme" when the "current phoneme" is a vowel or voiced consonant and both the "preceding phoneme" immediately before it and the "succeeding phoneme" immediately after it are unvoiced consonants, and/or when the "current phoneme" is an unvoiced consonant.
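The rules (a) to (c) of Example 2 can be sketched as a small decision function. This is an illustration only: the function name and the class labels "vowel", "voiced_consonant", and "unvoiced_consonant" are hypothetical, and the sketch assumes every phoneme (including the neighbours) carries one of these three labels; how pauses or utterance edges are classified is not stated in the text and is not handled here.

```python
VOICED = {"vowel", "voiced_consonant"}
KNOWN = VOICED | {"unvoiced_consonant"}


def select_method(prev_type: str, cur_type: str, next_type: str) -> str:
    """Return 'a', 'b', or 'c' for the current phoneme."""
    for t in (prev_type, cur_type, next_type):
        if t not in KNOWN:
            raise ValueError(f"unknown phoneme class: {t!r}")
    if cur_type not in VOICED:
        return "c"          # rule (c): current phoneme is an unvoiced consonant
    if next_type in VOICED:
        return "b"          # rule (b): voiced current and voiced successor
    if prev_type in VOICED:
        return "a"          # rule (a): voiced predecessor, unvoiced successor
    return "c"              # rule (c): both neighbours are unvoiced
```

Rule (b) does not mention the preceding phoneme, so the sketch checks the succeeding phoneme before the preceding one; with three classes the rules then cover every combination without overlap.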

Switching the boundary frequency determination method in this way makes it possible to determine an optimal boundary frequency based on the phoneme type. The information specifying the selected method, the segment spectrum, and the phoneme sequence with time information are sent to the determination unit 2352 (step S2351).

When information specifying method a is sent to the determination unit 2352, the determination unit 2352 determines and outputs the boundary frequency B corresponding to the "current phoneme" by the processing described for step S135 of the first embodiment (step S2352a). When information specifying method b is sent to the determination unit 2352, the determination unit 2352 determines and outputs a frequency at or above the upper limit B_N (for example, the Nyquist frequency) as the boundary frequency B corresponding to the "current phoneme" (step S2352b). When information specifying method c is sent to the determination unit 2352, the determination unit 2352 determines and outputs a frequency at or below the lower limit B_0 (for example, 0 Hz) as the boundary frequency B corresponding to the "current phoneme" (step S2352c). The subsequent processing is the same as in the first embodiment.

[Third Embodiment]
This embodiment is a modification of the second embodiment. In this embodiment, method b (the second method) and method c (the third method) determine the boundary frequency B based on the degree of change of the fundamental frequency of the "first synthesized speech" relative to the fundamental frequency of the phonemes unit-connected according to the text.

<Configuration>
As illustrated in FIG. 1, the speech synthesizer 3 of this embodiment includes an input unit 11, a speech corpus storage unit 212, a speech DB storage unit 122, an acoustic model storage unit 123, a speech DB construction unit 131, an acoustic model generation unit 132, a unit connection type speech synthesis unit 133, an HMM speech synthesis unit 134, a boundary frequency determination unit 335, a spectrum mixing processing unit 136, and a waveform generation processing unit 137. As illustrated in FIG. 2A, the boundary frequency determination unit 335 of this embodiment includes a determination method selection unit 2351 and a determination unit 3352.

<Speech synthesis processing>
The difference from the second embodiment is that, instead of the boundary frequency determination unit 235 performing the boundary frequency determination process of step S235, the boundary frequency determination unit 335 performs the boundary frequency determination process of step S335. Step S335 differs from step S235 in that the determination unit 3352, instead of the determination unit 2352, performs the processing described below. Everything else is as described in the first and second embodiments. Only the processing of the determination unit 3352 is described below.

In this embodiment, the information specifying the method selected by the determination method selection unit 2351, the segment spectrum, and the phoneme sequence with time information are sent to the determination unit 3352. In addition, the fundamental frequency output from the HMM speech synthesis unit 134 in step S134 is also sent to the determination unit 3352. When information specifying method a is sent to the determination unit 3352, the determination unit 3352 determines and outputs the boundary frequency B corresponding to the "current phoneme" by the processing described for step S135 of the first embodiment (step S2352a).

When information specifying method b is sent to the determination unit 3352, the determination unit 3352 determines whether the ratio F_0org/F_0syn (the degree of change of the fundamental frequency) falls outside a predetermined range (for example, 0.5 or more and 2.0 or less), where F_0org is the average fundamental frequency of the phoneme in the phoneme section of the "current phoneme" and F_0syn is the average fundamental frequency of the sound synthesized by the HMM speech synthesis unit 134 in that phoneme section. The average value F_0syn is determined based on the fundamental frequency output from the HMM speech synthesis unit 134 (step S3352ba). If the ratio F_0org/F_0syn is within the predetermined range, the determination unit 3352 outputs the boundary frequency B = B_N (step S3352bb). If the ratio F_0org/F_0syn is not within the predetermined range, the determination unit 3352 determines and outputs the boundary frequency B by step S2352b described above (step S2352b).

When information specifying method c is sent to the determination unit 3352, the determination unit 3352 likewise determines whether the ratio F_0org/F_0syn falls outside the predetermined range (step S3352ca). If the ratio F_0org/F_0syn is within the predetermined range, the determination unit 3352 determines and outputs the boundary frequency B by step S2352c described above (step S2352c). If the ratio F_0org/F_0syn is not within the predetermined range, the determination unit 3352 outputs the boundary frequency B = B_0 (step S3352cb). The subsequent processing is the same as in the first embodiment.
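The handling of methods b and c in this embodiment can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function and parameter names are hypothetical, the range 0.5 to 2.0 is the example value from the text, and steps S2352b and S2352c are assumed to return the Nyquist frequency and 0 Hz respectively, which are the example values given for those steps.

```python
def boundary_by_f0_ratio(method: str, f0_org_mean: float, f0_syn_mean: float,
                         b0_hz: float, bn_hz: float, nyquist_hz: float,
                         ratio_range=(0.5, 2.0)) -> float:
    """Boundary frequency B for method 'b' or 'c' of the third embodiment."""
    if method not in ("b", "c"):
        raise ValueError("only methods 'b' and 'c' use the F0 ratio")
    if f0_syn_mean <= 0.0 or f0_org_mean <= 0.0:
        raise ValueError("average fundamental frequencies must be positive")
    ratio = f0_org_mean / f0_syn_mean                   # F_0org / F_0syn
    within = ratio_range[0] <= ratio <= ratio_range[1]  # S3352ba / S3352ca
    if method == "b":
        # Within range: B = B_N; otherwise fall back to S2352b
        # (Nyquist frequency assumed).
        return bn_hz if within else nyquist_hz
    # Method c, within range: S2352c (0 Hz assumed); otherwise B = B_0.
    return 0.0 if within else b0_hz
```

In both methods a large F0 modification moves B upward, so less of the unit-connected spectrum is used when the stored segment had to be shifted far from its original pitch.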

[Modification of Third Embodiment]
In the third embodiment, B = B_N is used when method b is selected and the ratio F_0org/F_0syn is within the predetermined range, and the Nyquist frequency is used as B when method b is selected and the ratio F_0org/F_0syn is not within the predetermined range. However, when method b is selected, a frequency from B_N to the Nyquist frequency may be set as B according to the ratio F_0org/F_0syn under some other criterion. For example, within the range from B_N to the Nyquist frequency, the closer the ratio F_0org/F_0syn is to a predetermined value (for example, 1), the closer to B_N the frequency set as B may be.

In the third embodiment, B = 0 is used when method c is selected and the ratio F_0org/F_0syn is within the predetermined range, and B = B_0 is used when method c is selected and the ratio F_0org/F_0syn is not within the predetermined range. However, when method c is selected, a frequency from 0 to B_0 may be set as B according to the ratio F_0org/F_0syn under some other criterion. For example, within the range from 0 to B_0, the closer the ratio F_0org/F_0syn is to a predetermined value (for example, 1), the closer to 0 the frequency set as B may be.
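One possible way to realise such a graded choice of B is sketched below. The text only requires that B approach B_N (method b) or 0 (method c) as the ratio approaches the predetermined value, so everything else here is an assumption made for illustration: the function name, the use of |log2(ratio)| as the deviation measure, the linear interpolation, and the clipping at one octave.

```python
import math


def graded_boundary(method: str, ratio: float, b0_hz: float, bn_hz: float,
                    nyquist_hz: float, max_log2_dev: float = 1.0) -> float:
    """Interpolate B from the deviation of F_0org/F_0syn from 1 (assumed rule)."""
    if ratio <= 0.0:
        raise ValueError("ratio must be positive")
    # 0.0 when ratio == 1, 1.0 when the ratio is max_log2_dev octaves away
    # or more.
    dev = min(abs(math.log2(ratio)), max_log2_dev) / max_log2_dev
    if method == "b":
        near, far = bn_hz, nyquist_hz   # B_N at ratio 1, Nyquist when far off
    elif method == "c":
        near, far = 0.0, b0_hz          # 0 Hz at ratio 1, B_0 when far off
    else:
        raise ValueError("only methods 'b' and 'c' are graded")
    return near + dev * (far - near)
```

With the default clipping, ratios of 0.5 and 2.0 reproduce the out-of-range values of the third embodiment, and a ratio of 1 reproduces its in-range values.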

In the third embodiment, the ratio F_0org/F_0syn is used as the "degree of change of the fundamental frequency", but another index may be used instead. For example, the ratio F_0syn/F_0org, the difference |F_0org - F_0syn|, the squared error {(F_0org)^2 - (F_0syn)^2}, or the like may be used as the "degree of change of the fundamental frequency".

In the third embodiment, the boundary frequency B is set to the lower limit B_0 when the ratio F_0org/F_0syn falls outside the predetermined range in step S3352ca; in this case, the boundary frequency B may instead be set to the Nyquist frequency.

[Features]
As described above, in each embodiment the boundary frequency can be set appropriately according to a quantitative measure based on the continuity between adjacent phonemes, the phoneme type, the degree of change of the fundamental frequency, and the like. This makes it possible to exploit the respective strengths of the HMM speech synthesis method and the unit connection type speech synthesis method and to generate synthesized speech that carries the natural-voice quality of the phoneme spectra while suppressing abnormal sounds at the phoneme connections.

In each embodiment, the boundary frequency is determined quantitatively based on a preset allowable limit value, the phoneme type, the degree of change of the fundamental frequency, and the like. Therefore, the boundary frequency does not have to be set by hand every time the speaker or the text to be synthesized changes, and it can be determined automatically.

[Other modifications]
The various processes described above need not be executed only in time series in the order described. They may also be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as needed. Needless to say, other changes may be made as appropriate without departing from the spirit of the present invention.

When the above configuration is implemented by a computer, the processing content of the functions that each apparatus should have is described by a program. Executing this program on a computer implements the above processing functions on the computer. The program describing this processing content can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such a recording medium include a magnetic recording device, an optical disc, a magneto-optical recording medium, and a semiconductor memory.

This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it. Further, each time the program is transferred from the server computer to this computer, the computer may sequentially execute processing according to the received program. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are implemented only through execution instructions and acquisition of results.

In the above embodiments, the processing functions of the apparatus are implemented by executing a predetermined program on a computer, but at least some of these processing functions may be implemented in hardware.

1, 2, 3: Speech synthesizer

Claims

A speech synthesizer comprising:
a boundary frequency determination unit that determines a boundary frequency based on a distance between spectral feature amounts before and after a boundary between phoneme segments connected according to text; and
a spectrum mixing processing unit that obtains a mixed spectrum by mixing a high-frequency-side component, corresponding to the boundary frequency, of the spectrum of a first synthesized speech obtained by segment connection according to the text with a low-frequency-side component, corresponding to the boundary frequency, of the spectrum of a second synthesized speech obtained by applying an acoustic model for speech synthesis to the text.

The speech synthesizer according to claim 1,
wherein the boundary frequency determination unit selects, according to the type of the phoneme whose segments are connected according to the text, one of:
(a) a first method that determines the boundary frequency within a predetermined frequency section based on the distance of the spectral feature amount; (b) a second method that uses a frequency equal to or higher than the upper limit value of the frequency section as the boundary frequency; and (c) a third method that uses a frequency equal to or lower than the lower limit value of the frequency section as the boundary frequency.

The speech synthesizer according to claim 2,
wherein, in the second method and the third method, the boundary frequency is based on a degree of change of the fundamental frequency obtained by applying the acoustic model to the text relative to the fundamental frequency of the phoneme whose segments are connected according to the text.

The speech synthesizer according to claim 2 or 3,
wherein, in the first method, the boundary frequency for a current phoneme, which is a phoneme whose segments are connected according to the text, is determined based on the distance of the spectral feature amount between the current phoneme and the preceding phoneme immediately before the current phoneme, and
the boundary frequency determination unit:
selects the first method for the current phoneme when distortion between the current phoneme and the preceding phoneme immediately before it has a large effect on sound quality and distortion between the current phoneme and the subsequent phoneme immediately after it has a small effect on sound quality;
selects the second method for the current phoneme when distortion between the current phoneme and the subsequent phoneme immediately after it has a large effect on sound quality; and
selects the third method for the current phoneme when distortion between the current phoneme and the preceding phoneme immediately before it has a small effect on sound quality and distortion between the current phoneme and the subsequent phoneme immediately after it has a small effect on sound quality.

The speech synthesizer according to any one of claims 1 to 4,
wherein the boundary frequency determination unit
obtains the distance of the spectral feature amount in a first determination band, which is a partial band of a predetermined frequency section; when the distance of the spectral feature amount in the first determination band is less than an allowable limit value, sets the frequency corresponding to the first determination band as the boundary frequency; and when the distance of the spectral feature amount in the first determination band is not less than the allowable limit value, performs the processing again with a second determination band, which is a band of higher frequency than the first determination band, taken as the first determination band.

The speech synthesizer according to claim 5,
wherein the allowable limit value is monotonically increasing in the broad sense (monotonically non-decreasing) with respect to the frequency corresponding to the first determination band.

A speech synthesis method comprising:
a boundary frequency determination step of determining a boundary frequency based on a distance between spectral feature amounts before and after a boundary between phoneme segments connected according to text; and
a spectrum mixing processing step of obtaining a mixed spectrum by mixing a high-frequency-side component, corresponding to the boundary frequency, of the spectrum of a first synthesized speech obtained by segment connection according to the text with a low-frequency-side component, corresponding to the boundary frequency, of the spectrum of a second synthesized speech obtained by applying an acoustic model for speech synthesis to the text.

A program for causing a computer to function as the speech synthesizer according to claim 1.