JP5301376B2

JP5301376B2 - Speech synthesis apparatus and program

Info

Publication number: JP5301376B2
Application number: JP2009158626A
Authority: JP
Inventors: 礼子田高; 徹都木; 信正清山
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2009-07-03
Filing date: 2009-07-03
Publication date: 2013-09-25
Anticipated expiration: 2029-07-03
Also published as: JP2011013534A

Description

The present invention relates to a speech synthesizer that generates synthesized speech. In particular, it relates to a speech synthesizer that generates synthesized speech using the speech of another speaker who is different from the target speaker.

When synthesized speech is composed from the speech units of a target speaker and no speech unit with an appropriate voice quality exists for that speaker, one option is to convert a speech unit before using it. Patent Document 1 describes voice quality conversion.

However, when the amount of conversion is large, the sound quality of the speech unit itself may deteriorate. When the target speaker's speech units lack variation, the shortage could be compensated and the variation expanded if speech units of other speakers could be used on an equal footing with the target speaker's speech units, within a range that does not make the target speaker's voice quality sound unnatural. Patent Document 2 describes a technique for performing speech synthesis using a database of speech units of other speakers.

On the other hand, when speech units of another speaker are used, the synthesized speech is composed by placing the other speaker's speech units within a sentence built from the target speaker's speech units, so it is desirable that the result not sound unnatural as a whole. Non-Patent Document 1 reports that in such a case, when the duration of the other speaker's speech unit is short, or when the fundamental frequency of the portion occupied by the other speaker's speech unit is low and the portion is prosodically inconspicuous, listeners are unlikely to notice the other speaker, and speech that is natural as a whole is obtained.

Patent Document 1: JP 2007-148172 A
Patent Document 2: JP 2007-025042 A

Non-Patent Document 1: 田高礼子, 世木寛之, 清山信正, 都木徹 (Reiko Tadaka, Hiroyuki Seki, Nobumasa Kiyama, Toru Toki), "On the naturalness and spectral features of speech partially substituted with phonemes of a different speaker", IEICE Technical Report, SP (Speech), March 2008, vol. 107, no. 551, pp. 123-128

When synthesized speech is composed using speech units of another speaker different from the target speaker, the unnaturalness of the speech as a whole is expected to be conspicuous in some cases and inconspicuous in others, depending on the characteristics of the speech units used. However, selecting other-speaker speech units by exhaustive trial and error is an inefficient way to obtain synthesized speech. What is needed is a way to decide efficiently, according to predetermined conditions, whether a given other-speaker speech unit can be used without sounding out of place.

The present invention has been made to solve the above problem. Its object is to provide a speech synthesizer and a program that can efficiently create high-quality synthesized speech by using, within synthesized speech composed mainly of the target speaker's speech units, speech units of other speakers that do not make the target speaker's voice quality sound unnatural.

[1] To solve the above problem, a speech synthesizer according to one aspect of the present invention comprises: a speech database storage unit that stores speech units of a target speaker and of other speakers; a notation data storage unit that stores notation data corresponding to the target synthesized speech; a phoneme feature fitness estimation unit that calculates, based on the feature values of each of a plurality of speech units, a phoneme feature fitness between those speech units; and a speech unit selection unit that selects, from the speech database storage unit, candidate speech units for composing the target synthesized speech based on the notation data acquired from the notation data storage unit, determines, for each other-speaker speech unit among the selected candidates, whether to adopt that candidate based on the phoneme feature fitness calculated by the phoneme feature fitness estimation unit, and outputs the synthesized speech composed of the speech units adopted as a result.
The phoneme feature fitness estimation unit calculates either the phoneme feature fitness between the other-speaker speech unit and an arbitrary speech unit of the target speaker, or the phoneme feature fitness between the other-speaker speech unit and the speech units, other than that other-speaker speech unit, among the candidates selected by the speech unit selection unit.
With this configuration, synthesized speech that adopts speech units with a high fitness is output, based on the calculated phoneme feature fitness.
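The adoption decision in [1] can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the fitness formula (a value in (0, 1] that falls as the feature values diverge), the threshold of 0.5, the use of the mean feature of the remaining candidates as the comparison value, and all names such as `Unit`, `feature_fitness`, and `adopt_candidates` are hypothetical. The text only requires that a fitness be computed from the feature values and used for the decision.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Unit:
    unit_id: str      # speech unit identifier
    speaker_id: str   # speaker who uttered the unit
    feature: float    # one scalar feature value (for example a spectral tilt)


def feature_fitness(a: float, b: float) -> float:
    """Hypothetical fitness: 1.0 for identical features, approaching 0 as they diverge."""
    return 1.0 / (1.0 + abs(a - b))


def adopt_candidates(candidates, target_speaker, threshold=0.5):
    """Return (adopted, rejected) lists of candidate units.

    Target-speaker units are always adopted. Each other-speaker unit is
    compared with the mean feature of the remaining candidates, one way of
    realising the second comparison option in [1] (an assumption).
    """
    adopted, rejected = [], []
    for i, unit in enumerate(candidates):
        if unit.speaker_id == target_speaker:
            adopted.append(unit)
            continue
        others = [u.feature for j, u in enumerate(candidates) if j != i]
        # With no other candidate to compare against, keep the unit.
        fitness = feature_fitness(unit.feature, mean(others)) if others else 1.0
        (adopted if fitness >= threshold else rejected).append(unit)
    return adopted, rejected
```

A rejected unit would then be replaced by a reselected target-speaker unit; that step is left out here to keep the sketch short.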

[2] A speech synthesizer according to one aspect of the present invention is the above speech synthesizer further comprising a phoneme environment estimation unit that calculates a phoneme environment fitness based at least on the phoneme type of a speech unit and the phoneme types of the speech units preceding and following it in the synthesized speech, wherein the speech unit selection unit causes the phoneme environment estimation unit to calculate the phoneme environment fitness for each other-speaker speech unit among the candidates, and determines whether to adopt that candidate based also on the phoneme environment fitness.
With this configuration, the preceding and following phoneme environment of a speech unit that is a candidate for adoption is determined, and a phoneme environment fitness is calculated from the result. Synthesized speech that adopts speech units with a high phoneme environment fitness is then output.
The phoneme environment estimation unit may additionally calculate the phoneme environment fitness based on phoneme type discrimination.
The phoneme environment estimation unit may also additionally calculate the phoneme environment fitness based on prosodic environment discrimination.

[3] A speech synthesizer according to one aspect of the present invention is the above speech synthesizer, wherein the phoneme feature fitness estimation unit uses, as the feature value, any one of the spectral tilt of the speech unit, the first-order FFT cepstrum coefficient, or a feature value representing the characteristics of the vocal cord sound source.
Here, an FFT cepstrum coefficient is a cepstrum coefficient obtained using the FFT (fast Fourier transform). With this configuration, the degree of attenuation of the formants from the low-frequency range to the high-frequency range is used as the feature value, and synthesized speech that adopts speech units with a high fitness is output.
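The two spectral features named in [3] can be illustrated with a short standard-library sketch. The definitions used here are assumptions for illustration: spectral tilt is taken as the least-squares slope of the log magnitude spectrum against frequency, and the first-order cepstrum coefficient as the value at quefrency 1 of the inverse DFT of the log magnitude spectrum. The function names are hypothetical, and a naive DFT stands in for the FFT to avoid external dependencies.

```python
import cmath
import math


def log_magnitude_spectrum(frame):
    """Natural-log magnitude of a naive DFT, bins 0 .. N/2 inclusive."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(math.log(abs(s) + 1e-12))  # small offset avoids log(0)
    return spectrum


def spectral_tilt(log_mag, sample_rate, n):
    """Least-squares slope of log magnitude against frequency in Hz."""
    freqs = [k * sample_rate / n for k in range(len(log_mag))]
    mean_f = sum(freqs) / len(freqs)
    mean_v = sum(log_mag) / len(log_mag)
    num = sum((f - mean_f) * (v - mean_v) for f, v in zip(freqs, log_mag))
    den = sum((f - mean_f) ** 2 for f in freqs)
    return num / den


def first_cepstral_coefficient(log_mag, n):
    """c[1] of the real cepstrum from the half spectrum of a real frame.

    Assumes n is even and len(log_mag) == n // 2 + 1.
    """
    half = n // 2
    # Inverse DFT of the symmetric log spectrum, evaluated at quefrency 1.
    total = log_mag[0] + log_mag[half] * math.cos(math.pi)
    total += 2 * sum(log_mag[k] * math.cos(2 * math.pi * k / n) for k in range(1, half))
    return total / n
```

Both values summarise how quickly spectral energy falls from low to high frequencies, which is the property the paragraph above describes as formant attenuation.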

[4] A speech synthesizer according to one aspect of the present invention is the above speech synthesizer, wherein the phoneme feature fitness estimation unit uses, as the feature value, the frequency of the spectral centroid within a predetermined frequency band of the speech spectrum of the speech unit.
With this configuration, the spectral centroid in a predetermined frequency band (for example, a low-frequency band) is used as the feature value, and synthesized speech that adopts speech units with a high fitness is output.
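A band-limited spectral centroid of the kind named in [4] can be computed as a weighted mean frequency. The sketch below is illustrative only: weighting by magnitude (rather than power), the default band of 0 to 2000 Hz, and the function name are assumptions, since the text leaves the band and the weighting open.

```python
def band_spectral_centroid(freqs_hz, magnitudes, low_hz=0.0, high_hz=2000.0):
    """Magnitude-weighted mean frequency over bins with low_hz <= f <= high_hz."""
    pairs = [(f, m) for f, m in zip(freqs_hz, magnitudes) if low_hz <= f <= high_hz]
    total = sum(m for _, m in pairs)
    if total == 0:
        raise ValueError("no spectral energy inside the requested band")
    return sum(f * m for f, m in pairs) / total
```

Restricting the band keeps strong high-frequency components from dominating the value, so two units can be compared on their low-frequency balance alone.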

[5] A speech synthesizer according to one aspect of the present invention is the above speech synthesizer, wherein the phoneme feature fitness estimation unit uses the formant frequencies and formant bandwidths of the speech unit as the feature values.
With this configuration, formant frequencies and formant bandwidths are used as the feature values, and synthesized speech that adopts speech units with a high fitness is output.

[6] A speech synthesizer according to one aspect of the present invention is the above speech synthesizer further comprising an other-speaker ratio setting storage unit that stores a set value for the ratio of the number of other-speaker speech units, wherein the speech unit selection unit adopts the other-speaker speech units that rank highest in the calculated phoneme environment fitness, so that the ratio of other-speaker speech units among the speech units composing the synthesized speech is equal to or less than the set value read from the other-speaker ratio setting storage unit, and replaces the remaining other-speaker speech units with target-speaker speech units reselected from the speech database storage unit.
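The ratio limit in [6] amounts to keeping only the best-ranked other-speaker units. The following sketch assumes that each replaced unit is swapped one-for-one with a target-speaker unit, so the total number of units does not change; the tuple layout and the function name are hypothetical.

```python
import math


def cap_other_speaker_units(units, target_speaker, max_ratio):
    """Decide which other-speaker units to keep under a ratio limit.

    units: list of (unit_id, speaker_id, environment_fitness) in utterance order.
    Returns (kept_ids, replace_ids): other-speaker units to keep, and
    other-speaker units to swap for reselected target-speaker units.
    """
    others = [u for u in units if u[1] != target_speaker]
    # Largest count of other-speaker units whose share stays <= max_ratio.
    limit = math.floor(max_ratio * len(units) + 1e-9)
    ranked = sorted(others, key=lambda u: u[2], reverse=True)
    kept = [u[0] for u in ranked[:limit]]
    replaced = [u[0] for u in ranked[limit:]]
    return kept, replaced
```

For example, with five units and a set value of 0.4, at most two other-speaker units are kept and any others are marked for replacement.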

[7] A computer program according to one aspect of the present invention causes a computer, which comprises a speech database storage unit that stores speech units of a target speaker and of other speakers and a notation data storage unit that stores notation data corresponding to the target synthesized speech, to execute: a phoneme feature fitness estimation step of calculating, based on the feature values of each of a plurality of speech units, a phoneme feature fitness between those speech units; and a speech unit selection step of selecting, from the speech database storage unit, candidate speech units for composing the target synthesized speech based on the notation data acquired from the notation data storage unit, determining, for each other-speaker speech unit among the selected candidates, whether to adopt that candidate based on the phoneme feature fitness calculated in the phoneme feature fitness estimation step, and outputting the synthesized speech composed of the speech units adopted as a result. The phoneme feature fitness estimation step calculates either the phoneme feature fitness between the other-speaker speech unit and an arbitrary speech unit of the target speaker, or the phoneme feature fitness between the other-speaker speech unit and the speech units, other than that other-speaker speech unit, among the candidates selected in the speech unit selection step.

According to the present invention, speech units of other speakers that can be used alongside the target speaker's speech units without sounding out of place can be selected and used to compensate for a shortage of target-speaker speech units and to expand their variation, which improves the quality of the synthesized speech. In addition, all or part of the process of selecting such other-speaker speech units can be performed automatically, which makes the selection more efficient.

FIG. 1 is a block diagram showing the functional configuration of a speech synthesizer according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing the structure and example data of the speech databases (target speaker speech database, other-speaker speech database) according to the embodiment.
FIG. 3 is a schematic diagram showing the structure and example data of the synthesized speech storage unit according to the embodiment.
FIG. 4 is a graph for explaining spectral tilt, one of the feature values used by the phoneme feature fitness estimation unit 113 according to the embodiment.
FIG. 5 is a graph for explaining the low-band spectral centroid, one of the feature values used by the phoneme feature fitness estimation unit 113 according to the embodiment.
FIG. 6 is a flowchart showing the processing procedure of the speech synthesizer as a whole according to the embodiment.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the functional configuration of the speech synthesizer according to the present embodiment. As shown in the figure, the speech synthesizer 10 comprises a speech unit selection unit 110, a phoneme environment fitness estimation unit 112, a phoneme feature fitness estimation unit 113, an other-speaker speech unit location designation unit 120, a speech database storage unit 130, a comparison phoneme designation unit 140, a synthesized speech storage unit 150, a text storage unit 155 (notation data storage unit), an other-speaker ratio setting storage unit 160, and a default setting storage unit 170.
The speech database storage unit 130 includes a target speaker speech database 131 and an other-speaker speech database 132.

The speech unit selection unit 110, the phoneme environment fitness estimation unit 112, the phoneme feature fitness estimation unit 113, the other-speaker speech unit location designation unit 120, and the comparison phoneme designation unit 140 are implemented as information processing devices using electronic circuits or the like. The speech database storage unit 130, the synthesized speech storage unit 150, the text storage unit 155, the other-speaker ratio setting storage unit 160, and the default setting storage unit 170 are implemented using, for example, a magnetic disk device or a semiconductor memory.
The functions of these units are as described below.

The speech unit selection unit 110 selects, from the speech database storage unit 130, candidate speech units for composing the target synthesized speech, based on the notation data (hiragana text, mixed kanji and kana text, or phoneme label notation) stored in the text storage unit 155. When the notation data is stored as hiragana text or as mixed kanji and kana text, the speech unit selection unit 110 converts it to phoneme label notation as appropriate before selecting candidates from the speech units stored in the speech database storage unit 130. Among the candidate speech units for composing the synthesized speech, the speech unit selection unit 110 causes the phoneme feature fitness estimation unit 113 to calculate the phoneme feature fitness between each other-speaker speech unit and the speech unit it is to be compared with, and determines whether to adopt that candidate based on the phoneme feature fitness. The speech unit selection unit 110 also causes the phoneme environment estimation unit to calculate the phoneme environment fitness for each other-speaker speech unit among the candidates, and determines whether to adopt that candidate based also on the phoneme environment fitness. The speech unit selection unit 110 then outputs synthesized speech composed of the speech units adopted as a result. In addition, the speech unit selection unit 110 adopts the other-speaker speech units that rank highest in the calculated phoneme environment fitness, so that the ratio of other-speaker speech units among the speech units composing the synthesized speech is equal to or less than the set value read from the other-speaker ratio setting storage unit 160, and replaces the remaining other-speaker speech units with reselected target-speaker speech units.

The phoneme environment fitness estimation unit 112 calculates a phoneme environment fitness based at least on the phoneme type of a speech unit and the phoneme types of the speech units preceding and following it in the synthesized speech. The phoneme environment fitness estimation unit 112 may additionally base the phoneme environment fitness on discrimination of the phoneme type of that phoneme and on discrimination of the prosodic environment.
Here, the phoneme type is a category determined by any one of, or a combination of, the following: (1) whether the phoneme is a vowel or a consonant, (2) whether the phoneme is voiced or unvoiced, and (3) the manner of articulation of the phoneme (for example, a nasal consonant).
The phoneme feature fitness estimation unit 113 calculates, based on the feature values of each of a plurality of given speech units, the phoneme feature fitness between those speech units. The feature values of a speech unit include the spectral tilt, the low-band spectral centroid, the formant frequencies, and the formant bandwidths, which are described in detail later.
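The three phoneme type criteria listed for unit 112 can be represented as a small lookup. The table below is a partial, illustrative set of Japanese phoneme labels chosen for this sketch (including the simplification that every vowel is voiced); it is not taken from the patent. How the fitness itself is derived from these types is not specified at this point in the text, so the code stops at classification.

```python
VOWELS = {"a", "i", "u", "e", "o"}
VOICED_CONSONANTS = {"b", "d", "g", "z", "m", "n", "r", "w", "y"}
UNVOICED_CONSONANTS = {"k", "s", "t", "h", "p"}
MANNER = {  # partial, illustrative table of articulation manners
    "m": "nasal", "n": "nasal",
    "k": "plosive", "t": "plosive", "p": "plosive",
    "b": "plosive", "d": "plosive", "g": "plosive",
    "s": "fricative", "z": "fricative", "h": "fricative",
    "r": "flap", "w": "approximant", "y": "approximant",
}


def phoneme_type(phoneme):
    """Return (vowel_or_consonant, voicing, manner) for a phoneme label."""
    if phoneme in VOWELS:
        return ("vowel", "voiced", "vowel")
    if phoneme in VOICED_CONSONANTS:
        return ("consonant", "voiced", MANNER[phoneme])
    if phoneme in UNVOICED_CONSONANTS:
        return ("consonant", "unvoiced", MANNER[phoneme])
    raise KeyError(f"phoneme not in the illustrative table: {phoneme!r}")


def environment_types(prev_phoneme, center, next_phoneme):
    """Types of a unit's phoneme and of its two neighbours."""
    return tuple(phoneme_type(p) for p in (prev_phoneme, center, next_phoneme))
```

A fitness function could then score a candidate by looking up the triple returned by `environment_types`, for instance favouring positions flanked by unvoiced consonants; any such scoring rule would be a further assumption.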

The other-speaker speech unit location designation unit 120 designates, based on user input or the like, the locations in the synthesized speech where other-speaker speech units are to be used and the range of the speech database from which speech units are selected.
The speech database storage unit 130 stores speech units of the target speaker and of other speakers.
The comparison phoneme designation unit 140 designates the speech unit that the phoneme feature fitness estimation unit compares against, that is, the reference speech unit.

The synthesized speech storage unit 150 stores data on the plurality of speech units that compose the synthesized speech. When the speech unit selection unit 110 selects or rejects a speech unit, the synthesized speech storage unit 150 is rewritten accordingly.
The text storage unit 155 stores the text of the notation data corresponding to the target synthesized speech. This notation data is, for example, Japanese hiragana. Data in a form other than hiragana, such as phoneme label notation, may also be used as the sequence of labels corresponding to phonemes.
The other-speaker ratio setting storage unit 160 stores the set value for the ratio of the number of other-speaker speech units.
The default setting storage unit 170 stores default setting values. The default setting values are, for example, the locations in the synthesized speech where other-speaker speech units are used and the range of the speech database from which speech units are selected.

A phoneme is the smallest unit of sound used to distinguish meaning in a language. In Japanese, for example, vowels such as "a", "i", "u", "e", and "o" and consonants such as "k", "s", "t", "n", "h", and "m" each correspond to a phoneme.
A speech unit is a building block of synthesized speech: a short piece of speech data prepared in advance. A speech unit may correspond to a single phoneme or to a sequence of several phonemes.

The fundamental frequency is the frequency of the lowest periodic frequency component of the speech signal.
A formant is a peak in the frequency spectrum of speech. These peaks are called the first formant, the second formant, the third formant, and so on, in order from the lowest frequency. The pattern of formant frequencies is an element that characterizes a phoneme.
The target speaker is the speaker targeted when composing the synthesized speech. The main speaker of the speech units composing the created synthesized speech is the target speaker.
An other speaker is a speaker different from the target speaker. The speech synthesizer 10 according to the present embodiment creates synthesized speech that listeners perceive as the target speaker's speech as a whole, even though speech units of other speakers are mixed into part of it.

FIG. 2 is a schematic diagram showing the data structure and example data of the speech databases (the target speaker speech database 131 and the other-speaker speech database 132) stored in the speech database storage unit 130. As shown in the figure, the speech database is tabular data with the following fields: speaker identification information, speech unit identification information, phoneme label notation, triphone, speech signal data, spectral feature value, and fundamental frequency information.
There is one row of data per speech unit, and the primary key is the speech unit identification information.
The speaker identification information is data that uniquely identifies a speaker.

The speech unit identification information is information that uniquely identifies a speech unit.
The phoneme label notation is data that writes the pronunciation of the speech unit in Roman letters. An uppercase "Q" represents a geminate consonant (sokuon), and a ":" (colon) represents a long vowel. For example, the phoneme label notation "hoQkaido:" in the row with speech unit identification information "B0001" represents the pronunciation "ほっかいどー" (Hokkaido).
The triphone is a notation that represents the phoneme environment. For example, the triphone "a-o+i" in the row with speech unit identification information "B0009" indicates that the phoneme preceding the phoneme "o" of that speech unit is "a" and the following phoneme is "i". The label "sil" in a triphone represents silence. Thus the triphone "o-i+sil" in the row with speech unit identification information "B0007" indicates that the phoneme preceding the phoneme "i" of that speech unit is "o" and that it is followed by silence. Comparing, for example, the rows with speech unit identification information "B0004" and "B0005", the speaker identification information "A001" and the phoneme label notation "a" are the same, but the triphones differ. In other words, the speech database storage unit 130 stores speech units separately for each phoneme environment.
The speech signal data is data representing the speech signal of the speech unit itself. It is expressed, for example, as time-series sound pressure level data or as frequency spectrum data over a predetermined short period.
The spectral feature value is data representing the feature value of the speech unit; for example, MFCCs (Mel-Frequency Cepstrum Coefficients) are used.
The fundamental frequency information is data representing the fundamental frequency of the speech unit; a representative value of the fundamental frequency in the speech unit, or a time series of fundamental frequency values, is used. Alternatively, the fundamental frequency range may be binarized into H (High) and L (Low), and the time series of "H" and "L" values may be used as the fundamental frequency information.
The spectral feature value and the fundamental frequency are used when selecting speech units, as described later.
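The triphone notation described above is simple to split mechanically. The sketch below assumes ASCII "-" and "+" as separators and uses hypothetical function names; the triphone values passed to the second function in the usage note are invented for illustration, since the text does not give the triphones of rows B0004 and B0005.

```python
def parse_triphone(triphone):
    """Split 'prev-center+next' into its three phoneme labels."""
    left, sep, rest = triphone.partition("-")
    if not sep:
        raise ValueError(f"missing '-' in triphone: {triphone!r}")
    center, sep, right = rest.partition("+")
    if not sep:
        raise ValueError(f"missing '+' in triphone: {triphone!r}")
    return left, center, right


def same_phoneme_different_context(triphone_a, triphone_b):
    """True when two units share the centre phoneme but differ in environment."""
    a, b = parse_triphone(triphone_a), parse_triphone(triphone_b)
    return a[1] == b[1] and a != b
```

For instance, `parse_triphone("o-i+sil")` yields the preceding phoneme "o", the centre phoneme "i", and the following label "sil", and `same_phoneme_different_context("k-a+i", "sil-a+k")` is true because both units are "a" in different environments.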

The target speaker speech database 131 and the other-speaker speech database 132 may be stored in separate database tables or in a common database table. In either case, because the speaker identification information is held in the data, target-speaker speech units can be distinguished from other-speaker speech units by comparing the speaker identification information of the target speaker with the speaker identification information in the database.
In addition to the data fields shown in the figure, fields such as an audio file number and time information may be held in the table in the speech database storage unit. The audio file number is a number that uniquely identifies an audio file that is stored outside the table and holds the speech signal data and the like. The time information is the time information of the phoneme in question within the speech signal data (the start and end points of the phoneme, expressed as times relative to the beginning).
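The common-table arrangement can be modelled with one record type and a filter on the speaker identifier. This is a sketch only: the class and field names are hypothetical, the signal, spectral feature, and fundamental frequency fields are omitted for brevity, and expressing the time information in seconds is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SpeechUnitRow:
    speaker_id: str                      # speaker identification information
    unit_id: str                         # speech unit identification information (primary key)
    phoneme_label: str                   # e.g. "to:kyo:"
    triphone: Optional[str] = None       # e.g. "a-o+i"
    audio_file_no: Optional[int] = None  # optional: number of an external audio file
    time_span: Optional[Tuple[float, float]] = None  # optional: (start, end), assumed seconds


def split_by_speaker(rows: List[SpeechUnitRow], target_speaker_id: str):
    """Separate target-speaker rows from other-speaker rows in a shared table."""
    target = [r for r in rows if r.speaker_id == target_speaker_id]
    others = [r for r in rows if r.speaker_id != target_speaker_id]
    return target, others
```

Because the split depends only on `speaker_id`, the same function works whether the rows come from one shared table or from two tables concatenated together.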

FIG. 3 is a schematic diagram showing the data structure and example data of the synthesized speech data stored in the synthesized speech storage unit 150. As illustrated, this data is in tabular form and has the following items: synthesized speech identification information, order, speaker identification information, speech unit identification information, phoneme label notation, and speech signal data. The data contains one row for each speech unit that makes up a synthesized speech.

The synthesized speech identification information is data that uniquely identifies a synthesized speech.
The order is a value indicating the position of a speech unit within a given synthesized speech.
The speaker identification information is data that uniquely identifies the speaker of the speech unit, and is the same as the speaker identification information stored in the speech database storage unit 130.
The speech unit identification information is data that uniquely identifies a speech unit.
The phoneme label notation is data representing the phoneme label of the speech unit, and is the same as the phoneme label notation stored in the speech database storage unit 130.
The speech signal data is data representing the speech signal itself of the speech unit, and is the same as the speech signal data stored in the speech database storage unit 130.
In the example data shown in the figure, the first speech unit of the synthesized speech identified by the synthesized speech identification information “C0001” has the speaker identification information “A001”, the speech unit identification information “B0002”, and the phoneme label notation “to:kyo:”. The second speech unit of the same synthesized speech has the speaker identification information “A002”, the speech unit identification information “B0777”, and the phoneme label notation “kara”. The same applies to the third and subsequent speech units.
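The row-per-unit layout can be illustrated with a short Python sketch that stores the two example rows given above and recovers the unit sequence of one synthesized speech from the order column. The class and function names are hypothetical, and the speech signal data column is omitted.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SynthRow:
    synth_id: str       # synthesized speech identification information
    order: int          # position of the unit within the synthesized speech
    speaker_id: str     # speaker identification information
    unit_id: str        # speech unit identification information
    phoneme_label: str  # phoneme label notation


# The two rows described in the example data, deliberately stored out of order.
rows = [
    SynthRow("C0001", 2, "A002", "B0777", "kara"),
    SynthRow("C0001", 1, "A001", "B0002", "to:kyo:"),
]


def units_in_order(table: Sequence[SynthRow], synth_id: str) -> List[SynthRow]:
    """Return the rows of one synthesized speech sorted by the order column."""
    return sorted((r for r in table if r.synth_id == synth_id), key=lambda r: r.order)


sequence = units_in_order(rows, "C0001")
```

Sorting by the order column also gives direct access to the units immediately before and after a given unit, which is what the evaluation of the preceding and following phoneme environment relies on.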

Next, estimation of the suitability of speech units will be described. The speech synthesizer 10 uses two measures of speech unit suitability: the phoneme environment suitability and the phoneme feature suitability.

The phoneme environment suitability estimation unit 112 estimates the phoneme environment suitability. To do so, it determines the phoneme type, the preceding and following phoneme environment, and the prosodic environment.

In determining the phoneme type, the phoneme environment suitability estimation unit 112 determines (1) whether the phoneme is a vowel or a consonant, (2) whether it is voiced or unvoiced, and (3) its articulation method.
(1) Vowel/consonant determination
The phoneme environment suitability estimation unit 112 stores a predetermined index for vowels and for consonants. Depending on whether the speech unit under evaluation is a vowel or a consonant, it uses the corresponding index value in calculating the phoneme environment suitability of that speech unit. Consonants have a higher phoneme environment suitability (fit more easily) than vowels.
(2) Voiced/unvoiced determination
The phoneme environment suitability estimation unit 112 stores a predetermined index for voiced sounds and for unvoiced sounds. Depending on whether the speech unit under evaluation is voiced or unvoiced, it uses the corresponding index value in calculating the phoneme environment suitability of that speech unit. Unvoiced sounds have a higher phoneme environment suitability (fit more easily) than voiced sounds.
(3) Articulation method determination
The articulation method is the way in which the speech organs form a closure or constriction in the vocal tract. For example, the nasal sounds “m” and “n” share an articulation method, as do “p”, “t”, and “k”, which are produced by closing the mouth and then releasing the closure in a burst. The phoneme environment suitability estimation unit 112 stores a predetermined index for each articulation method. Depending on which articulation method the speech unit under evaluation belongs to, it uses the corresponding index value in calculating the phoneme environment suitability of that speech unit.

In determining the preceding and following phoneme environment, the phoneme environment suitability estimation unit 112 stores in advance indices of the preceding and following phoneme environment for each phoneme type and each speaker. Based on the order data stored in the synthesized speech storage unit 150, it reads the data of the speech units immediately before and after the target speech unit and determines the types of the preceding and following phonemes and their speakers. It then uses the index values corresponding to the determined phoneme types and speakers in calculating the phoneme environment suitability.
The suitability based on the preceding and following phoneme environment, calculated in this way, represents how well the unit connects with the sounds before and after it.

In determining the prosodic environment, the phoneme environment suitability estimation unit 112 stores in advance indices corresponding to the phoneme duration and to the relative height of the phoneme's fundamental frequency. It uses the index values corresponding to the duration of the target phoneme and to the relative height of its fundamental frequency in calculating the phoneme environment suitability.
The shorter the phoneme duration, the higher the phoneme environment suitability (the more easily the unit fits). If the phoneme duration exceeds a predetermined threshold, the phoneme environment suitability becomes extremely low (the unit cannot be used). Likewise, the lower the fundamental frequency of the phoneme, the higher the phoneme environment suitability (the more easily the unit fits).

For a speech unit containing multiple phonemes, the phoneme environment suitability estimation unit 112 calculates the phoneme environment suitability for each of those phonemes.
The phoneme environment suitability estimation unit 112 then calculates the phoneme environment suitability as the weighted sum of the index values obtained above.
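The weighted-sum calculation can be sketched as below. All numeric index values, the 150 ms duration limit, and the equal weights are invented for illustration; the description only states that the indices and weights are predetermined, that consonants and unvoiced sounds score higher, that shorter phonemes score higher, and that a phoneme longer than a threshold cannot be used.

```python
import math
from typing import Sequence

# Hypothetical index tables (the source does not give concrete values).
VOWEL_CONSONANT_INDEX = {"vowel": 0.2, "consonant": 0.8}  # consonants fit more easily
VOICING_INDEX = {"voiced": 0.3, "unvoiced": 0.7}          # unvoiced sounds fit more easily


def duration_index(duration_ms: float, limit_ms: float = 150.0) -> float:
    """Shorter phonemes score higher; beyond the (assumed) limit the unit is unusable."""
    if duration_ms > limit_ms:
        return -math.inf
    return 1.0 - duration_ms / limit_ms


def environment_suitability(index_values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the individual index values."""
    if len(index_values) != len(weights):
        raise ValueError("one weight is required per index value")
    return sum(w * v for w, v in zip(weights, index_values))


# A short unvoiced consonant versus a long voiced vowel, with equal weights.
short_consonant = environment_suitability(
    [VOWEL_CONSONANT_INDEX["consonant"], VOICING_INDEX["unvoiced"], duration_index(30.0)],
    [1.0, 1.0, 1.0],
)
long_vowel = environment_suitability(
    [VOWEL_CONSONANT_INDEX["vowel"], VOICING_INDEX["voiced"], duration_index(120.0)],
    [1.0, 1.0, 1.0],
)
```

Returning negative infinity for an over-long phoneme is one way to model “cannot be used”: any positive weight then drives the whole sum to negative infinity, so the unit ranks last.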

The phoneme feature suitability estimation unit 113 estimates the phoneme feature suitability. To do so, it compares spectral features. The spectral features used here include the spectral tilt, the first-order FFT cepstral coefficient (C1), features representing the (spectral) characteristics of the vocal cord sound source, the spectral centroid of the low frequency band of the spectrum (low-band spectral centroid), the formant frequencies, and the formant bandwidths.

FIG. 4 is a graph of a speech spectral envelope, used to explain the spectral tilt obtained by the phoneme feature suitability estimation unit 113.
In the figure, the horizontal axis is frequency (in hertz) and the vertical axis is intensity (in decibels). The peaks of the illustrated spectral envelope are the points P_1, P_2, P_3, ... in order of increasing frequency. The coordinates of P_1, P_2, and P_3 in the graph are (f_1, m_1), (f_2, m_2), and (f_3, m_3), respectively. The frequencies f_1, f_2, and f_3 are the first, second, and third formant frequencies, respectively.
The spectral tilt is the slope of the straight line connecting two predetermined peaks among these peaks.
To calculate the spectral tilt, the phoneme feature suitability estimation unit 113 obtains the envelope of the phoneme's frequency spectrum, finds the peaks of that envelope, and computes the slope of the straight line connecting the first and third peaks counted from the low-frequency side. This slope is the spectral tilt, a feature that represents the degree of attenuation from low to high frequencies.
That is, the phoneme feature suitability estimation unit 113 calculates the spectral tilt g by equation (1):
g = (m_3 - m_1) / (f_3 - f_1)   (1)
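Since the spectral tilt is defined as the slope of the line through the first and third envelope peaks, it reduces to g = (m_3 - m_1) / (f_3 - f_1). The sketch below assumes the envelope peaks have already been found and are passed in as (frequency, level) pairs; envelope estimation and peak picking are outside its scope, and the function name and sample values are hypothetical.

```python
from typing import Sequence, Tuple


def spectral_tilt(peaks: Sequence[Tuple[float, float]]) -> float:
    """Slope (dB per Hz) of the line joining the 1st and 3rd envelope peaks.

    peaks holds (frequency_hz, level_db) pairs; they are sorted by frequency
    here so that the caller does not need to pre-order them.
    """
    ordered = sorted(peaks)
    if len(ordered) < 3:
        raise ValueError("at least three envelope peaks are required")
    (f1, m1), (f3, m3) = ordered[0], ordered[2]
    return (m3 - m1) / (f3 - f1)


# Made-up peak positions: levels fall by 20 dB between 500 Hz and 2500 Hz.
g = spectral_tilt([(500.0, 60.0), (1500.0, 50.0), (2500.0, 40.0)])
```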

Alternatively, the first-order FFT cepstral coefficient may be used as a feature that approximates the spectral tilt.

The spectral tilt is affected by the characteristics of the vocal cord sound source and by the radiation characteristics of speech production. Because the radiation characteristics can be regarded as nearly constant, the spectral tilt can be said to vary with the characteristics of the vocal cord sound source. Other features that are influenced by those characteristics may therefore be used in place of the spectral tilt. Specifically, any of the following can be used as a feature representing the characteristics of the vocal cord sound source: the proportion of noise components in the mid-to-high frequency range, the power difference (in decibels) between the first and second harmonics obtained from the FFT spectrum, or the power difference (in decibels) between the first harmonic and the peak near F3 obtained from the FFT spectrum.

FIG. 5 is a graph of a speech spectral envelope, used to explain the low-band spectral centroid obtained by the phoneme feature suitability estimation unit 113.
In this figure as well, the horizontal axis is frequency (in hertz) and the vertical axis is intensity (in decibels). L denotes the low frequency band of the spectrum; the range of band L is predetermined.
The phoneme feature suitability estimation unit 113 calculates the low-band spectral centroid f_W (the frequency of the spectral centroid) by equation (2).

In equation (2), m(f) denotes the intensity of the speech spectrum at frequency f.
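Equation (2) itself is not reproduced in this text. A common definition of a spectral centroid restricted to a band L, consistent with the role of m(f) described above, is f_W = Σ f·m(f) / Σ m(f) with both sums taken over the frequencies in L. The sketch below implements that assumed form; the function name, the sampled-spectrum representation, and the use of non-negative (linear-scale) intensities are all assumptions.

```python
from typing import Sequence, Tuple


def low_band_centroid(
    spectrum: Sequence[Tuple[float, float]], band: Tuple[float, float]
) -> float:
    """Intensity-weighted mean frequency over the band [lo, hi].

    spectrum holds (frequency_hz, intensity) samples; intensities are assumed
    to be non-negative so that they can act as weights.
    """
    lo, hi = band
    in_band = [(f, m) for f, m in spectrum if lo <= f <= hi]
    total = sum(m for _, m in in_band)
    if total == 0:
        raise ValueError("no spectral energy inside the band")
    return sum(f * m for f, m in in_band) / total


# Made-up samples; the 4000 Hz component lies outside band L and is ignored.
f_w = low_band_centroid(
    [(100.0, 1.0), (200.0, 3.0), (300.0, 1.0), (4000.0, 10.0)], band=(0.0, 1000.0)
)
```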

The formant frequencies are the frequencies of the peaks (formants) in the speech spectrum, and the formant bandwidths are the bandwidths of those formants.

Using the features described above, the phoneme feature suitability estimation unit 113 calculates the suitability M between two phonemes (phoneme 1 and phoneme 2) by equation (3).

In equation (3), t_{1,i} is the i-th feature (a scalar or a vector) of phoneme 1, and t_{2,i} is the i-th feature (a scalar or a vector) of phoneme 2. d(t_{1,i}, t_{2,i}) is a scalar value determined by the distance between the two features. w_i is the weight for the i-th feature; its value is predetermined and stored in the phoneme feature suitability estimation unit 113.
As a concrete example, d(t_{1,i}, t_{2,i}) may simply be the distance between the two features. In that case, the suitability M is calculated by equation (4).
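Equations (3) and (4) are not reproduced in this text. One reading consistent with the symbol definitions above is a weighted sum, M = Σ_i w_i · d(t_{1,i}, t_{2,i}), with d taken as the plain distance in the simple case. The sketch below implements that assumed reading using the Euclidean distance; the function names and sample values are hypothetical. With a raw distance, a smaller result means a closer match, so an implementation that compares M against a lower threshold would presumably map the distance to a value that grows with similarity.

```python
import math
from typing import Callable, Sequence, Union

Feature = Union[float, Sequence[float]]


def euclidean(a: Feature, b: Feature) -> float:
    """Euclidean distance; a scalar is treated as a one-element vector."""
    va = list(a) if isinstance(a, (list, tuple)) else [a]
    vb = list(b) if isinstance(b, (list, tuple)) else [b]
    if len(va) != len(vb):
        raise ValueError("features must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))


def weighted_feature_score(
    feats1: Sequence[Feature],
    feats2: Sequence[Feature],
    weights: Sequence[float],
    d: Callable[[Feature, Feature], float] = euclidean,
) -> float:
    """Assumed form of the comparison: sum over i of w_i * d(t_1i, t_2i)."""
    if not (len(feats1) == len(feats2) == len(weights)):
        raise ValueError("feature lists and weights must have equal length")
    return sum(w * d(t1, t2) for w, t1, t2 in zip(weights, feats1, feats2))


# Feature 0: spectral tilt (scalar). Feature 1: two formant frequencies (vector).
score = weighted_feature_score(
    [-0.01, [500.0, 1500.0]], [-0.01, [503.0, 1504.0]], weights=[2.0, 0.5]
)
```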

For a speech unit containing multiple phonemes, the phoneme feature suitability estimation unit 113 calculates the phoneme feature suitability for each of those phonemes.
Instead of having the phoneme feature suitability estimation unit 113 calculate the above features each time a comparison is made, the features of the speech units may be calculated in advance and stored in the speech database storage unit 130, and read out when a comparison is made.

FIG. 6 is a flowchart showing the speech synthesis procedure performed by the speech synthesizer 10. The processing procedure of the speech synthesizer 10 is described below with reference to this flowchart.

Before the processing of this flowchart begins, predetermined data is stored in the speech database storage unit 130, the text storage unit 155, the other-speaker ratio setting storage unit 160, and the default setting storage unit 170.
The speech database storage unit 130 holds speech units of multiple speakers, accumulated in advance.
The text storage unit 155 stores text that represents the target speech to be synthesized. Concrete examples of such text data are “とーきょーからよこはまへいきます” (hiragana text), “東京から横浜へ行きます” (mixed kanji and kana text), and “to:kyokarayokohamaeikimasu” (phoneme label notation), each meaning “I will go from Tokyo to Yokohama”.
The other-speaker ratio setting storage unit 160 stores, as a setting, the ratio of the number of other-speaker speech units to the total number of speech units in the synthesized speech (for example, “15%”).
The default setting storage unit 170 stores, as settings, the locations in the synthesized speech where other-speaker speech units are used (for example, “all locations”) and the range of the speech database from which speech units are selected, that is, the range of speakers eligible for selection (for example, “all speakers”).

First, in step S1, the other-speaker speech unit location designation unit 120 accepts, based on the user's operation (input), a designation of the locations in the synthesized speech where other-speaker speech units are to be used and of the range of the speech database from which speech units are selected, and passes this information to the speech unit selection unit 110. The locations where other-speaker speech units are used are designated in units of individual speech units. For example, the other-speaker speech unit location designation unit 120 treats the synthesized speech identification information and the order stored in the synthesized speech storage unit 150 as a pair, and accepts the designation of locations as one or more such pairs. As the range of the speech database, it accepts information indicating either “both the target speaker speech database and the other-speaker speech database” or “the other-speaker speech database only”. When only the speech units of one or more specific other speakers are to be selected, it can also accept a designation of the speaker identification information of those speakers.
The user may also designate “all locations” as the locations where other-speaker speech units are used, and “all speakers” as the range of the speech database (the range of speakers eligible for selection).

In this step, instead of making a specific designation, the user may specify that the default settings be used. In that case, the speech unit selection unit 110 reads the settings from the default setting storage unit 170 and uses them.
For example, if the default setting storage unit 170 stores “all locations” as the locations where other-speaker speech units are used and “all speakers” as the range of the speech database (the range of speakers eligible for selection), the speech unit selection unit 110 uses those settings.

When the locations where other-speaker speech units are used are set to “all locations”, whether by the user's designation or by the value held in the default setting storage unit 170 (that is, when no specific locations are designated), the locations are determined by the method described later: other-speaker speech units are adopted in descending order of suitability, so that the proportion of other-speaker speech units does not exceed the ratio stored in the other-speaker ratio setting storage unit 160.

Next, in step S2, the speech unit selection unit 110 reads the ratio value from the other-speaker ratio setting storage unit 160 and checks whether the proportion of locations using other-speaker speech units in the synthesized speech is within the set range. Specifically, based on the information passed from the other-speaker speech unit location designation unit 120, the speech unit selection unit 110 calculates the value (number of speech units in the synthesized speech designated to use other-speaker speech units) / (total number of speech units in the synthesized speech) and checks whether this value is less than or equal to the ratio value read from the other-speaker ratio setting storage unit 160. If the calculated value is less than or equal to the set ratio (step S2: YES), the process proceeds to step S3. If the calculated value is greater than the set ratio (step S2: NO), the process returns to step S1 to accept a new designation from the user.
If no specific locations for other-speaker speech units were designated in step S1, the result of this determination is always “YES”.
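The check in step S2 amounts to comparing a simple quotient with the stored ratio. A minimal sketch, with a hypothetical function name and the example setting of 15%:

```python
def within_other_speaker_ratio(
    num_other_units: int, num_total_units: int, max_ratio: float
) -> bool:
    """True if (other-speaker units / all units) does not exceed max_ratio."""
    if num_total_units <= 0:
        raise ValueError("the synthesized speech must contain at least one unit")
    return num_other_units / num_total_units <= max_ratio


# With a 15% setting and 20 units in total, 3 designated locations pass the
# check (step S2: YES) and 4 do not (step S2: NO, back to step S1).
ok = within_other_speaker_ratio(3, 20, 0.15)
too_many = within_other_speaker_ratio(4, 20, 0.15)
```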

Next, in step S3, the speech unit selection unit 110 selects the necessary speech units from the target speaker speech database 131 or the other-speaker speech database 132 in the speech database storage unit 130. If only specific locations (one or more) were designated in the earlier step, speech units are selected only for those locations. If specific speakers (one or more) were designated in the earlier step, the selection is made from among the speech units of those speakers in the relevant database. The selection itself uses a conventional method: it matches phoneme labels and the like, together with the spectral features and fundamental frequencies held in the speech database storage unit 130.

Next, in step S4, the speech unit selection unit 110 counts the selected other-speaker speech units and determines whether their proportion is within the ratio set in the other-speaker ratio setting storage unit 160. If it is within the set range (step S4: YES), the process jumps to step S6; if it exceeds the set range (step S4: NO), the process proceeds to step S5.
The result of this determination can be “NO” only when no specific locations were designated in step S1 and “all locations” was specified, either by the user or by the value held in the default setting storage unit 170.

In step S5, the phoneme environment suitability estimation unit 112 estimates the phoneme environment suitability of the other-speaker speech units and ranks the speech units selected above in order of suitability. Other-speaker speech units that rank too low to fall within the ratio set in the other-speaker ratio setting storage unit 160 are rejected, and the speech unit selection unit 110 reselects replacement target speaker speech units from the target speaker speech database 131. In other words, the speech unit selection unit replaces the lower-ranked other-speaker speech units with the reselected target speaker speech units. As noted above, the reselection method itself is a conventional one.
In short, based on the setting stored in the other-speaker ratio setting storage unit 160, the phoneme environment suitability estimation unit 112 adopts only the speech units with the highest suitability.
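The ranking and rejection in step S5 can be sketched as follows. The function name, the dictionary keyed by position, and the small epsilon used to guard the floating-point product are assumptions; the reselection of target speaker units for the rejected positions is left to the caller.

```python
import math
from typing import Dict, Set, Tuple


def keep_best_other_speaker_units(
    suitability_by_position: Dict[int, float], total_units: int, max_ratio: float
) -> Tuple[Set[int], Set[int]]:
    """Split other-speaker positions into (kept, rejected) by suitability rank.

    At most floor(total_units * max_ratio) positions are kept, starting from
    the highest phoneme environment suitability.
    """
    allowed = math.floor(total_units * max_ratio + 1e-9)
    ranked = sorted(
        suitability_by_position, key=suitability_by_position.get, reverse=True
    )
    return set(ranked[:allowed]), set(ranked[allowed:])


# Three other-speaker units in a 10-unit utterance with an assumed 20% limit:
# the two best-fitting positions are kept, the third goes back for reselection.
kept, rejected = keep_best_other_speaker_units({2: 0.9, 5: 0.4, 7: 0.7}, 10, 0.2)
```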

Next, in step S6, the phoneme feature suitability estimation unit 113 compares the features of each speech unit selected for the synthesized speech with the features of the speech unit designated by the comparison phoneme designation unit 140, and estimates the suitability of the selected speech unit. The speech unit that the comparison phoneme designation unit 140 designates for comparison is either an arbitrary phoneme of the target speaker or one of the speech units selected by the speech unit selection unit 110 other than the unit currently being compared.

In step S7, the speech unit selection unit 110 determines whether any speech unit with low phoneme feature suitability remains among the selected speech units. In other words, the determination is whether the phoneme feature suitability estimation unit 113 has finished estimating the phoneme feature suitability for all other-speaker speech units to be compared and any of them has a suitability below a predetermined threshold. If such a speech unit remains (step S7: YES), the process proceeds to step S8; if none remains (step S7: NO), the process proceeds to step S9.

In step S8, the speech unit selection unit 110 rejects the other-speaker speech units whose phoneme feature suitability, as estimated by the phoneme feature suitability estimation unit 113, is low, and reselects other speech units. As noted above, the reselection method itself is a conventional one. The process then returns to step S6.
That is, the selection of speech units is repeated until the phoneme feature suitability of every speech unit exceeds the threshold.
As a result, the speech unit selection unit 110 adopts speech units with high phoneme feature suitability.
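The loop formed by steps S6 to S8 can be sketched as below. The suitability function and the reselection function are passed in as callables because the description leaves the reselection method to conventional techniques; the round limit is an added safeguard that the source does not mention, and all names are hypothetical.

```python
from typing import Callable, Dict


def refine_selection(
    selected: Dict[int, str],
    suitability_of: Callable[[int, str], float],
    reselect: Callable[[int, str], str],
    threshold: float,
    max_rounds: int = 100,
) -> Dict[int, str]:
    """Repeat S6 (estimate), S7 (check) and S8 (reject and reselect).

    selected maps a position in the utterance to the chosen unit id.
    """
    current = dict(selected)
    for _ in range(max_rounds):
        low = [pos for pos, unit in current.items()
               if suitability_of(pos, unit) < threshold]
        if not low:                  # step S7: NO, ready for output (step S9)
            return current
        for pos in low:              # step S8: replace the poorly fitting units
            current[pos] = reselect(pos, current[pos])
    raise RuntimeError("no acceptable selection found within max_rounds")


# Toy data: position 1 needs two reselections before a unit clears 0.5.
candidates = {0: ["a1"], 1: ["x9", "x3", "a7"]}
scores = {"a1": 0.9, "x9": 0.2, "x3": 0.4, "a7": 0.8}


def next_candidate(pos: int, unit: str) -> str:
    options = candidates[pos]
    return options[options.index(unit) + 1]


final = refine_selection(
    {0: "a1", 1: "x9"}, lambda pos, unit: scores[unit], next_candidate, 0.5
)
```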

In step S9, the speech unit selection unit 110 outputs the synthesized speech composed of the selected (adopted) speech units, and the processing of the flowchart ends.

Part or all of the speech synthesizer 10 in the embodiment described above may be implemented by a computer. In that case, a program for realizing these functions may be recorded on a computer-readable recording medium, and the functions may be realized by having a computer system read and execute the program recorded on the medium. The “computer system” here includes an OS and hardware such as peripheral devices. The “computer-readable recording medium” refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into a computer system. The “computer-readable recording medium” may also include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, as well as a medium that holds the program for a certain period, such as the volatile memory inside a computer system acting as the server or client in that case. The program may realize only part of the functions described above, or may realize them in combination with a program already recorded in the computer system.

The embodiment has been described above, but the present invention can also be implemented in the following modification.
For example, the speech unit selection unit may decide whether to adopt or reject a speech unit using only some of the suitability measures described above (the various phoneme environment suitabilities and the various phoneme feature suitabilities), rather than all of them.
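The modification above can be sketched in Python as follows. This is a minimal illustration only. The function name `adopt_candidate`, the score names, the assumption that each suitability is a number where higher means better, and the rule that every enabled measure must reach a common threshold are all hypothetical and are not taken from the embodiment.

```python
from typing import Dict, Iterable


def adopt_candidate(
    scores: Dict[str, float],
    enabled: Iterable[str],
    threshold: float = 0.5,
) -> bool:
    """Decide adoption using only the enabled subset of suitability scores.

    Assumption (not from the source): a candidate is adopted when every
    enabled suitability score is at or above `threshold`.
    """
    enabled = list(enabled)
    missing = [name for name in enabled if name not in scores]
    if missing:
        raise KeyError(f"missing suitability scores: {missing}")
    # Scores that are not enabled are simply ignored.
    return all(scores[name] >= threshold for name in enabled)


# Hypothetical scores for one other-speaker speech unit candidate.
scores = {"phoneme_environment": 0.8, "spectral_slope": 0.3, "formant": 0.9}
print(adopt_candidate(scores, ["phoneme_environment", "formant"]))
print(adopt_candidate(scores, scores.keys()))
```

With these hypothetical values the candidate is adopted when only the phoneme environment and formant measures are enabled, and rejected when all three measures are used.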

The embodiment of the present invention has been described in detail above with reference to the drawings, but the specific configuration is not limited to this embodiment and includes designs and the like within a scope that does not depart from the gist of the present invention.

The present invention can be used to efficiently generate high-quality synthesized speech free of unnaturalness. For example, it can be used for broadcasting such as television and radio, and for providing information by voice.

Description of symbols
10 ... Speech synthesizer
110 ... Speech unit selection unit
112 ... Phoneme environment suitability estimation unit
113 ... Phoneme feature suitability estimation unit
120 ... Other-speaker speech unit location designation unit
130 ... Speech database storage unit
131 ... Target speaker speech database
132 ... Other-speaker speech database
140 ... Comparison phoneme designation unit
150 ... Synthesized speech storage unit
155 ... Text storage unit (notation data storage unit)
160 ... Other-speaker ratio setting storage unit
170 ... Default setting storage unit

Claims

A speech synthesizer comprising:
a speech database storage unit that stores speech units of a target speaker and of other speakers;
a notation data storage unit that stores notation data corresponding to target synthesized speech;
a phoneme feature suitability estimation unit that calculates a phoneme feature suitability between a plurality of speech units based on feature values of the plurality of speech units; and
a speech unit selection unit that, based on the notation data acquired from the notation data storage unit, selects from the speech database storage unit candidate speech units constituting the target synthesized speech, determines, for an other-speaker speech unit among the selected candidates, whether to adopt that candidate based on the phoneme feature suitability calculated by the phoneme feature suitability estimation unit, and outputs the synthesized speech composed of the speech units adopted as a result,
wherein the phoneme feature suitability estimation unit calculates the phoneme feature suitability between the other-speaker speech unit and an arbitrary speech unit of the target speaker, or the phoneme feature suitability between the other-speaker speech unit and a speech unit other than the other-speaker speech unit among the candidate speech units selected by the speech unit selection unit.

The speech synthesizer according to claim 1, further comprising a phoneme environment suitability estimation unit that calculates a phoneme environment suitability based on at least the phoneme type of a speech unit and the phoneme types of the speech units preceding and following that speech unit in the synthesized speech,
wherein the speech unit selection unit causes the phoneme environment suitability estimation unit to calculate the phoneme environment suitability for an other-speaker speech unit among the candidate speech units, and determines whether to adopt that candidate based on the phoneme environment suitability.

The speech synthesizer according to claim 1 or 2, wherein the phoneme feature suitability estimation unit uses, as the feature value, any one of the spectral slope of the speech unit, the first-order coefficient of the FFT cepstrum coefficients, or a feature value representing the characteristics of the vocal cord sound source.
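As an informal illustration of one feature named in this claim, the spectral slope can be estimated by fitting a straight line to the spectrum level over frequency. The sketch below assumes a least-squares fit on levels expressed in dB; that choice, the function name `spectral_slope_db_per_hz`, and the sample values are assumptions for illustration and are not the definition used in the embodiment.

```python
from typing import Sequence


def spectral_slope_db_per_hz(
    freqs_hz: Sequence[float], level_db: Sequence[float]
) -> float:
    """Least-squares slope of spectrum level (dB) against frequency (Hz)."""
    n = len(freqs_hz)
    if n != len(level_db) or n < 2:
        raise ValueError("need two or more (frequency, level) pairs")
    mean_f = sum(freqs_hz) / n
    mean_l = sum(level_db) / n
    # Sum of squared deviations in frequency; zero means all bins coincide.
    sxx = sum((f - mean_f) ** 2 for f in freqs_hz)
    if sxx == 0:
        raise ValueError("frequencies must not all be equal")
    sxy = sum((f - mean_f) * (l - mean_l) for f, l in zip(freqs_hz, level_db))
    return sxy / sxx


# Hypothetical spectrum that falls by 6 dB per 1000 Hz.
slope = spectral_slope_db_per_hz([0, 1000, 2000, 3000], [0, -6, -12, -18])
print(slope)
```

Two speech units could then be compared through the difference of their slopes, although how such a difference is mapped to a suitability value is not specified in this section.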

The speech synthesizer according to any one of claims 1 to 3, wherein the phoneme feature suitability estimation unit uses, as the feature value, the frequency of the spectral centroid within a predetermined frequency band of the speech spectrum of the speech unit.
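As an informal illustration of the feature named in this claim, the spectral centroid within a band is the weighted mean of the frequencies in that band. The sketch below weights each frequency bin by its magnitude; whether magnitude or power is used as the weight, the function name `band_spectral_centroid`, and the sample values are assumptions for illustration.

```python
from typing import Sequence


def band_spectral_centroid(
    freqs_hz: Sequence[float],
    magnitudes: Sequence[float],
    low_hz: float,
    high_hz: float,
) -> float:
    """Magnitude-weighted mean frequency over bins with low_hz <= f <= high_hz."""
    if len(freqs_hz) != len(magnitudes):
        raise ValueError("freqs_hz and magnitudes must have the same length")
    weighted_sum = 0.0
    total_weight = 0.0
    for f, m in zip(freqs_hz, magnitudes):
        if low_hz <= f <= high_hz:
            weighted_sum += f * m
            total_weight += m
    if total_weight == 0.0:
        raise ValueError("no spectral energy inside the band")
    return weighted_sum / total_weight


# Hypothetical 4-bin spectrum; the 4000 Hz bin lies outside the band.
centroid = band_spectral_centroid(
    [1000, 2000, 3000, 4000], [1, 1, 2, 5], low_hz=1000, high_hz=3000
)
print(centroid)  # 2250.0
```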

The speech synthesizer according to any one of claims 1 to 4, wherein the phoneme feature suitability estimation unit uses, as the feature value, the formant frequency and the formant bandwidth of the speech unit.

The speech synthesizer according to claim 2, further comprising an other-speaker ratio setting storage unit that stores a set value for the ratio of the number of other-speaker speech units,
wherein the speech unit selection unit adopts the other-speaker speech units having a high calculated phoneme environment suitability so that the ratio of other-speaker speech units among the speech units constituting the synthesized speech is equal to or less than the set value read from the other-speaker ratio setting storage unit, and replaces the remaining other-speaker speech units with target-speaker speech units reselected from the speech database storage unit.
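As an informal illustration of the ratio limit in this claim, the sketch below keeps the other-speaker units with the highest phoneme environment suitability up to the allowed count and replaces the rest. The `Unit` class, the function `cap_other_speaker_ratio`, the rounding down of the allowed count, and the `reselect_target` callback standing in for reselection from the speech database are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Unit:
    phoneme: str
    speaker: str  # "target" or "other"
    env_suitability: float = 0.0


def cap_other_speaker_ratio(
    units: List[Unit],
    max_ratio: float,
    reselect_target: Callable[[Unit], Unit],
) -> List[Unit]:
    """Keep the best other-speaker units within max_ratio; replace the rest."""
    # Allowed number of other-speaker units, rounded down (assumption).
    # The small epsilon guards against floating-point error in the product.
    limit = math.floor(max_ratio * len(units) + 1e-9)
    other_idx = [i for i, u in enumerate(units) if u.speaker == "other"]
    ranked = sorted(
        other_idx, key=lambda i: units[i].env_suitability, reverse=True
    )
    keep = set(ranked[:limit])
    return [
        u if (u.speaker != "other" or i in keep) else reselect_target(u)
        for i, u in enumerate(units)
    ]


# Hypothetical sequence: three other-speaker units, at most half allowed.
units = [
    Unit("a", "target"),
    Unit("k", "other", 0.9),
    Unit("i", "other", 0.4),
    Unit("s", "other", 0.7),
]
result = cap_other_speaker_ratio(
    units, 0.5, lambda u: Unit(u.phoneme, "target")
)
print([u.speaker for u in result])  # ['target', 'other', 'target', 'other']
```

Here the unit with suitability 0.4 is the one replaced, because only two of the four units may come from other speakers.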

A program for causing a computer comprising
a speech database storage unit that stores speech units of a target speaker and of other speakers, and
a notation data storage unit that stores notation data corresponding to target synthesized speech,
to execute:
a phoneme feature suitability estimation process of calculating a phoneme feature suitability between a plurality of speech units based on feature values of the plurality of speech units; and
a speech unit selection process of selecting from the speech database storage unit, based on the notation data acquired from the notation data storage unit, candidate speech units constituting the target synthesized speech, determining, for an other-speaker speech unit among the selected candidates, whether to adopt that candidate based on the phoneme feature suitability calculated by the phoneme feature suitability estimation process, and outputting the synthesized speech composed of the speech units adopted as a result,
wherein the phoneme feature suitability estimation process calculates the phoneme feature suitability between the other-speaker speech unit and an arbitrary speech unit of the target speaker, or the phoneme feature suitability between the other-speaker speech unit and a speech unit other than the other-speaker speech unit among the candidate speech units selected by the speech unit selection process.