JP6047952B2

JP6047952B2 - Speech synthesis apparatus and speech synthesis method

Info

Publication number: JP6047952B2
Application number: JP2012148193A
Authority: JP
Inventors: 慶二郎才野
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2011-07-29
Filing date: 2012-07-02
Publication date: 2016-12-21
Anticipated expiration: 2032-07-02
Also published as: JP2013050706A

Description

The present invention relates to a technique for synthesizing voice, such as speech or singing, from speech segments.

Segment-concatenation speech synthesizers, which synthesize a desired voice by connecting a plurality of speech segments to one another, have been proposed in the past. Patent Document 1 discloses a technique for controlling the clarity of a synthesized sound (how wide the virtual speaker's mouth is open) by using each speech segment only partially.

For example, as shown in FIG. 16, consider a speech segment V[w-a] that includes a phoneme section S1 of the consonant (semivowel) phoneme /w/ and a phoneme section S2 of the vowel phoneme /a/. The phoneme section S2 is divided into a steady section EB, in which the waveform of the phoneme /a/ is held steady, and a transition section EA, in which the waveform transitions from the phoneme /w/ of the immediately preceding phoneme section S1 to the phoneme /a/. In the technique of Patent Document 1, a boundary time point (phoneme segmentation boundary) TB is variably set within the transition section EA, and an audio signal of the desired duration is generated by taking the part of the speech segment V up to the boundary time point TB and repeatedly appending the last frame of that part to it. The position of the boundary time point TB is variably set according to a variable specified by the user (hereinafter the "clarity variable"). With this configuration, it is possible to synthesize a voice with lower clarity (that is, a voice uttered without the speaker opening the mouth fully) than when the whole of the transition section EA is used.
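The frame-repetition step described above can be sketched as follows. This is a minimal illustration, not the implementation of Patent Document 1: the function name `extend_with_last_frame`, the use of a frame index for the boundary time point TB, and the truncation behavior when the target is shorter than the kept part are all assumptions made for the example.

```python
from typing import List, Sequence


def extend_with_last_frame(frames: Sequence, boundary_index: int, target_length: int) -> List:
    """Keep frames[0..boundary_index] and repeat the last kept frame.

    boundary_index stands in for the boundary time point TB (assumed to be
    expressed as a frame index). The kept part is padded with copies of its
    final frame until it contains target_length frames.
    """
    if not 0 <= boundary_index < len(frames):
        raise ValueError("boundary_index must point at an existing frame")
    if target_length < 1:
        raise ValueError("target_length must be at least 1")
    kept = list(frames[: boundary_index + 1])
    if target_length <= len(kept):
        # Assumption: a target shorter than the kept part is simply truncated.
        return kept[:target_length]
    return kept + [kept[-1]] * (target_length - len(kept))


# Frames of a hypothetical segment V[w-a]; TB falls on the frame "a0".
segment = ["w0", "w1", "a0", "a1", "a2"]
stretched = extend_with_last_frame(segment, boundary_index=2, target_length=6)
```

Here `stretched` holds the two /w/ frames followed by four copies of `"a0"`, so the later, more open frames `"a1"` and `"a2"` never reach the output.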

Patent Document 1: Japanese Patent No. 4265501

The technique of Patent Document 1 assumes a configuration in which the boundary time point TB within the transition section EA is moved linearly along the time axis according to the clarity variable specified by the user. In other words, the length of time from the start point of the transition section EA to the boundary time point TB is proportional to the clarity variable specified by the user.

Graph g in FIG. 16, on the other hand, schematically shows the temporal transition of the timbre (how wide the speaker's mouth is open) perceived by a listener of the synthesized sound. As shown in FIG. 16, the timbre of the speech segment V transitions from the timbre of the phoneme /w/ of the phoneme section S1 to the timbre of the phoneme /a/ of the phoneme section S2 between the start point and the end point of the transition section EA. As graph g shows, the timbre perceived by the listener changes nonlinearly with elapsed time. That is, near the start point of the transition section EA the timbre changes markedly after only a short time, whereas near the end point of the transition section EA the timbre hardly changes as time passes.

Therefore, when the boundary time point TB is moved linearly according to the user-specified clarity variable as described above, a small change in the clarity variable changes the timbre of the synthesized sound markedly while TB lies near the start point of the transition section EA, whereas the same change in the clarity variable hardly changes the timbre while TB lies near the end point of the transition section EA. Because the change in the clarity variable and the change in the timbre of the synthesized sound do not match perceptually, it is difficult for the user to set the clarity variable so that the synthesized sound has the desired timbre. In view of these circumstances, an object of the present invention is to make the change in the clarity variable specified by the user consistent with the change in the timbre of the synthesized sound.
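The mismatch described above can be made concrete with a small numerical sketch. The saturating exponential used for `perceived_timbre` is only a hypothetical stand-in for graph g (the source gives the curve schematically, not as a formula), and the steepness `k` and the step size are likewise assumptions.

```python
import math


def perceived_timbre(t: float, k: float = 5.0) -> float:
    """Hypothetical stand-in for graph g.

    t is the normalized position in the transition section EA (0 = start,
    1 = end). The result runs from 0 (timbre of /w/) to 1 (timbre of /a/),
    rising quickly at first and flattening toward the end.
    """
    return (1.0 - math.exp(-k * t)) / (1.0 - math.exp(-k))


def linear_boundary(alpha: float) -> float:
    """Linear mapping: the boundary position is proportional to the clarity variable."""
    return alpha


step = 0.1  # the same change in the clarity variable, applied at two places
change_near_start = perceived_timbre(linear_boundary(0.0 + step)) - perceived_timbre(linear_boundary(0.0))
change_near_end = perceived_timbre(linear_boundary(1.0)) - perceived_timbre(linear_boundary(1.0 - step))
```

Under these assumptions the same step in the clarity variable changes the perceived timbre far more near the start of EA than near its end, which is the perceptual inconsistency the invention sets out to remove.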

The means adopted by the present invention to solve the above problem are described below. To make the invention easier to understand, the following description notes in parentheses the correspondence between elements of the invention and elements of the embodiments described later, but this is not intended to limit the scope of the invention to the examples given in the embodiments.

The speech synthesis apparatus of the present invention comprises: segment selection means (for example, the segment selection unit 34) for sequentially selecting speech segments (for example, speech segments V) each including a transition section (for example, the transition section EA) in which the speech waveform changes over time; variable setting means (for example, the variable setting unit 33) for variably setting a clarity variable (for example, the clarity variable α) in accordance with an instruction from a user; boundary setting means for setting a boundary time point (for example, the boundary time point TB) within the transition section, the boundary setting means variably setting the position of the boundary time point according to the clarity variable set by the variable setting means such that the amount by which the boundary time point moves when the clarity variable changes by a predetermined amount from a first value (for example, the change amount Δβ1 for the numerical value α1) differs from the amount by which the boundary time point moves when the clarity variable changes by the predetermined amount from a second value different from the first value (for example, the change amount Δβ2 for the numerical value α2); and synthesis processing means (for example, the synthesis processing unit 46) for generating an audio signal using the section of the speech segment selected by the segment selection means that precedes the boundary time point (for example, the application section W). In this configuration, the position of the boundary time point changes nonlinearly with respect to the clarity variable specified by the user. Therefore, even when the timbre that a listener perceives from a speech segment changes nonlinearly with elapsed time, the change in the clarity variable specified by the user can be made consistent with the change in the timbre of the synthesized sound.
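A nonlinear mapping of this kind can be sketched as follows. The specific curve is an assumption chosen for illustration: it inverts a hypothetical saturating-exponential timbre curve, so that equal steps of the clarity variable produce roughly equal steps of perceived timbre. The source only requires that the two movement amounts differ; the function names and the steepness `k` are not taken from it.

```python
import math


def boundary_variable(alpha: float, k: float = 5.0) -> float:
    """Map the clarity variable alpha in [0, 1] to a boundary position in [0, 1].

    Assumption: perceived timbre follows (1 - exp(-k t)) / (1 - exp(-k)),
    and this function is its inverse.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return -math.log(1.0 - alpha * (1.0 - math.exp(-k))) / k


def boundary_shift(alpha_start: float, delta: float, k: float = 5.0) -> float:
    """Movement of the boundary when alpha changes from alpha_start by delta."""
    return boundary_variable(alpha_start + delta, k) - boundary_variable(alpha_start, k)


# The same predetermined change (0.1) applied at a first and a second value.
delta_beta_1 = boundary_shift(0.1, 0.1)  # starting from alpha1 = 0.1
delta_beta_2 = boundary_shift(0.8, 0.1)  # starting from alpha2 = 0.8
```

With this curve the boundary moves only slightly for small values of the clarity variable, where the timbre is changing quickly, and moves much further for large values, where the timbre has nearly settled.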

In a preferred aspect of the present invention, the boundary setting means sets the position of the boundary time point according to the clarity variable such that the relationship between the clarity variable and the position of the boundary time point is common to a plurality of speech segments. In this aspect, because the relationship between the clarity variable and the position of the boundary time point is common to the plurality of speech segments, the setting of the boundary time point by the boundary setting means is simplified. A specific example of this aspect is described later as, for example, the first embodiment.

In a preferred aspect of the present invention, each of the plurality of speech segments includes a first phoneme section (for example, the phoneme section S1) and a second phoneme section (for example, the phoneme section S2) corresponding to different phonemes, the second phoneme section includes the transition section, and the boundary setting means sets the position of the boundary time point according to the clarity variable such that the relationship between the clarity variable and the position of the boundary time point differs according to the type (for example, the type C) of the phoneme of the first phoneme section of the speech segment. In this aspect, because the relationship between the clarity variable and the position of the boundary time point is set individually according to the type of the phoneme of the first phoneme section, the change in the clarity variable can be made consistent with the change in the timbre of the synthesized sound even when the temporal transition of the timbre of a speech segment differs according to the type of the phoneme of the first phoneme section. A specific example of this aspect is described later as, for example, the second embodiment.
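One simple way to picture a type-dependent relationship is a lookup table of curve parameters keyed by the type of the first phoneme. The sketch below is purely illustrative: the power-law curve, the type names, and every exponent value are hypothetical and do not come from the source.

```python
# Hypothetical curve exponents per phoneme type C of the first phoneme section.
CURVE_EXPONENT_BY_TYPE = {
    "semivowel": 2.0,
    "nasal": 1.5,
    "plosive": 3.0,
}
DEFAULT_EXPONENT = 2.0  # assumed fallback for types missing from the table


def boundary_variable(alpha: float, first_phoneme_type: str) -> float:
    """Map alpha in [0, 1] to a boundary position in [0, 1] with a per-type curve."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    exponent = CURVE_EXPONENT_BY_TYPE.get(first_phoneme_type, DEFAULT_EXPONENT)
    return alpha ** exponent


beta_semivowel = boundary_variable(0.5, "semivowel")
beta_plosive = boundary_variable(0.5, "plosive")
```

The same clarity value then yields different boundary positions for, say, a segment that begins with a semivowel and one that begins with a plosive.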

A speech synthesis apparatus according to a preferred aspect of the present invention comprises index calculation means (for example, the index calculation unit 60) for calculating, for each of a plurality of frames in the transition section, a timbre index value (for example, the timbre index value Y[m]) indicating the difference in timbre between that frame and a reference frame, and the boundary setting means sets, as the boundary time point, the time point at which the timbre index value takes a value corresponding to the clarity variable in the temporal transition of the timbre index value. In this aspect, because the time point at which the timbre index value takes a value corresponding to the clarity variable is set as the boundary time point, the change in the clarity variable can be made consistent with the change in the timbre of the synthesized sound regardless of the characteristics of each speech segment. In this aspect, when the timbre index value takes the value corresponding to the clarity variable at a plurality of time points in its temporal transition, the boundary setting means may set the rearmost of those time points as the boundary time point; a long section of the speech segment is then used to synthesize the audio signal, which has the advantage that a perceptually natural synthesized sound can be generated. A specific example of this aspect is described later as, for example, the sixth embodiment.
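The selection of the rearmost matching time point can be sketched as a search over the per-frame index values. The function name `rearmost_crossing`, the rule that a crossing between two frames is attributed to the later frame, and the assumption that the target is already expressed on the same scale as Y[m] are all choices made for this example rather than details given in the source.

```python
from typing import Optional, Sequence


def rearmost_crossing(y: Sequence[float], target: float) -> Optional[int]:
    """Return the index of the last frame at which y reaches target.

    A frame counts as a match if its value equals target, or if the values of
    the previous frame and this frame lie on opposite sides of target.
    Returns None when the trajectory never reaches target.
    """
    last = None
    for m, value in enumerate(y):
        if value == target:
            last = m
        elif m > 0 and (y[m - 1] - target) * (value - target) < 0:
            last = m
    return last


# A non-monotonic timbre index trajectory Y[m] that passes 0.6 three times.
timbre_index = [0.0, 0.4, 0.7, 0.5, 0.65, 0.9, 1.0]
boundary_frame = rearmost_crossing(timbre_index, target=0.6)
```

In this trajectory the value 0.6 is passed at frames 2, 3 and 4; choosing frame 4 keeps the longest possible part of the segment in use.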

A speech synthesis apparatus according to a preferred aspect of the present invention comprises limit setting means (for example, the limit setting unit 42) for setting a limit time point within the transition section of the speech segment selected by the segment selection means, and the boundary setting means sets the boundary time point within a variable section (for example, the variable section EC) extending from the limit time point set by the limit setting means to the end point of the transition section. In this aspect, the boundary time point is set after the limit time point set within the transition section. That is, a time point in the transition section at which the influence of the phoneme of the first phoneme section still remains excessively is never set as the boundary time point. Therefore, even though the audio signal is generated by repeating the frame corresponding to the boundary time point, a synthesized sound that gives a perceptually natural impression can be generated.
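A minimal sketch of confining the boundary to the variable section EC is shown below. It assumes that positions are frame indices and that a boundary variable in [0, 1] is rescaled linearly onto EC; the source states only that the boundary lies between the limit time point and the end of the transition section, so the rescaling rule and all names are assumptions.

```python
def boundary_frame_in_variable_section(beta: float, ga: int, limit: int, gb: int) -> int:
    """Place the boundary inside EC = [limit, gb] instead of all of EA = [ga, gb].

    ga, limit and gb are frame indices of the start of the transition section,
    the limit time point and the end of the transition section. beta = 0 maps
    to the limit time point and beta = 1 to the end of the transition section.
    """
    if not ga <= limit <= gb:
        raise ValueError("expected ga <= limit <= gb")
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return limit + round(beta * (gb - limit))


earliest = boundary_frame_in_variable_section(0.0, ga=2, limit=4, gb=10)
middle = boundary_frame_in_variable_section(0.5, ga=2, limit=4, gb=10)
latest = boundary_frame_in_variable_section(1.0, ga=2, limit=4, gb=10)
```

Whatever value the boundary variable takes, frames 2 and 3, which lie before the limit time point, can never be chosen as the frame to repeat.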

In a preferred example of the configuration comprising the limit setting means, each of the plurality of speech segments includes a first phoneme section and a second phoneme section corresponding to different phonemes, the second phoneme section includes the transition section, and the limit setting means sets, as the limit time point, a time point within the transition section of the second phoneme section that depends on the type of the phoneme of the first phoneme section. The part of the transition section in which the influence of the phoneme of the first phoneme section remains excessively tends to differ according to the type of that phoneme. In this aspect, because the position of the limit time point within the transition section is variably set according to the type of the phoneme of the first phoneme section, a position appropriate to that type can be set as the limit time point.

A speech synthesis apparatus according to a preferred example of the configuration comprising the limit setting means comprises second index calculation means (for example, the index calculation unit 48) for calculating, for each of a plurality of frames in the transition section, an index value of the naturalness of the voice obtained when that frame is repeated, and the limit setting means sets the limit time point according to the index value of each frame. In this aspect, because the limit time point is set according to the index value of each frame in the transition section, an appropriate limit time point can be set according to the characteristics of the speech segment. Furthermore, in a configuration in which the second index calculation means (for example, the index calculation unit 48) calculates, as the index values, a first index value corresponding to the volume of each frame in the transition section and a second index value corresponding to the intensity of the non-harmonic component of each frame in the transition section, and the limit setting means sets, as the limit time point, a time point in the transition section at which the volume indicated by the first index value exceeds a predetermined value and the intensity of the non-harmonic component indicated by the second index value falls below a predetermined value, the limit time point can be set at an appropriate position in the transition section when, for example, the phoneme of the first phoneme section is an unvoiced consonant (for example, a plosive, an affricate, or a fricative).
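The two-threshold rule for the limit time point can be sketched as a scan over per-frame values. Choosing the first frame that satisfies both conditions, the function name, and the numeric thresholds are assumptions made for this illustration; the source does not say which qualifying time point is selected or how the predetermined values are chosen.

```python
from typing import Optional, Sequence


def find_limit_frame(
    volume: Sequence[float],
    nonharmonic: Sequence[float],
    volume_threshold: float,
    nonharmonic_threshold: float,
) -> Optional[int]:
    """Return the first frame that is loud enough and sufficiently harmonic.

    volume[m] plays the role of the first index value and nonharmonic[m] the
    role of the second index value for frame m of the transition section.
    Returns None if no frame satisfies both conditions.
    """
    for m, (vol, noise) in enumerate(zip(volume, nonharmonic)):
        if vol > volume_threshold and noise < nonharmonic_threshold:
            return m
    return None


# Hypothetical values after an unvoiced consonant: volume rises, noise decays.
volume = [0.1, 0.3, 0.6, 0.8]
nonharmonic = [0.9, 0.7, 0.5, 0.2]
limit_frame = find_limit_frame(volume, nonharmonic, volume_threshold=0.5, nonharmonic_threshold=0.4)
```

Frame 2 is already loud enough but still too noisy, so the limit falls on frame 3, where both conditions hold.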

The speech synthesis apparatus according to each of the above aspects may be realized by hardware (an electronic circuit) such as a DSP (Digital Signal Processor) dedicated to speech synthesis, or by cooperation between a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) and a program. The program of the present invention (for example, the program PGM) causes a computer to execute: a segment selection process of sequentially selecting speech segments each including a transition section in which the speech waveform changes over time; a variable setting process of variably setting a clarity variable in accordance with an instruction from a user; a boundary setting process of setting a boundary time point within the transition section, in which the position of the boundary time point is variably set according to the clarity variable set in the variable setting process such that the amount by which the boundary time point moves when the clarity variable changes by a predetermined amount from a first value differs from the amount by which the boundary time point moves when the clarity variable changes by the predetermined amount from a second value different from the first value; and a synthesis process of generating an audio signal using the section of the speech segment selected in the segment selection process that precedes the boundary time point. This program achieves the same operation and effects as the speech synthesis apparatus of the present invention. The program of the present invention may be provided to a user stored on a computer-readable recording medium and installed on a computer, or may be provided from a server device by distribution over a communication network and installed on a computer.

FIG. 1 is a block diagram of a speech synthesis apparatus according to a first embodiment of the present invention.
FIG. 2 is a schematic diagram of a segment group stored in a storage device.
FIG. 3 is an explanatory diagram of the relationship between the waveform of a speech segment and unit data.
FIG. 4 is a schematic diagram of an editing screen.
FIG. 5 is a block diagram of a speech synthesis unit.
FIG. 6 is a graph showing the relationship between the clarity variable and the boundary variable.
FIG. 7 is a schematic diagram of segment data in a second embodiment.
FIG. 8 is a block diagram of a speech synthesis unit in a third embodiment.
FIG. 9 is an explanatory diagram of the relationship between the waveform of a speech segment and unit data in the third embodiment.
FIG. 10 is an explanatory diagram of a method of calculating the boundary variable in the third embodiment.
FIG. 11 is a block diagram of a speech synthesis unit in a fifth embodiment.
FIG. 12 is an explanatory diagram of index values.
FIG. 13 is a block diagram of a speech synthesis unit in a sixth embodiment.
FIG. 14 is an explanatory diagram of the operation of the sixth embodiment.
FIG. 15 is an explanatory diagram of another operation of the sixth embodiment.
FIG. 16 is an explanatory diagram of the background art.

<First Embodiment>
FIG. 1 is a block diagram of a speech synthesis apparatus 100 according to the first embodiment of the present invention. The speech synthesis apparatus 100 is a signal processing apparatus that generates voice such as speech or singing by segment-concatenation speech synthesis and, as shown in FIG. 1, is realized by a computer system comprising an arithmetic processing device 12, a storage device 14, a display device 22, an input device 24, and a sound emitting device 26.

The arithmetic processing device 12 (CPU) executes the program PGM stored in the storage device 14 and thereby realizes a plurality of functions (a display control unit 32, a variable setting unit 33, a segment selection unit 34, and a speech synthesis unit 36) for generating an audio signal VOUT representing the waveform of the synthesized sound. A configuration in which the functions of the arithmetic processing device 12 are distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (DSP) realizes some of the functions, may also be adopted.

The display device 22 (for example, a liquid crystal display) displays images as instructed by the arithmetic processing device 12. The input device 24 is a device (for example, a mouse or a keyboard) that receives instructions from the user. The sound emitting device 26 (for example, headphones or a loudspeaker) emits sound waves corresponding to the audio signal VOUT generated by the arithmetic processing device 12.

The storage device 14 stores the program PGM executed by the arithmetic processing device 12 and various data used by the arithmetic processing device 12 (a segment group QA and synthesis information QB). A known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of several types of recording media, is adopted as the storage device 14.

As shown in FIG. 2, the segment group QA stored in the storage device 14 is a set (a speech synthesis library) of a plurality of items of segment data D corresponding to different speech segments V. The first embodiment assumes that each speech segment V is a diphone (phoneme chain) formed by connecting two phoneme sections S (S1, S2) corresponding to different phonemes. The phoneme section S2 follows the phoneme section S1. In the following, silence is treated as a consonant phoneme for convenience.

FIG. 3 is a waveform diagram of one speech segment V. FIG. 3 illustrates the waveform of a speech segment V[w-a] in which a phoneme section S1 of the consonant (semivowel) phoneme /w/ is followed by a phoneme section S2 of the vowel phoneme /a/. The phoneme boundary GA in FIG. 3 is the boundary between the phoneme section S1 and the phoneme section S2. The phoneme section S2 of the vowel is divided at the state boundary GB into a transition section EA and a steady section EB.

The steady section EB in FIG. 3 is the section in which the waveform of the phoneme /a/ of the phoneme section S2 is held steady. The transition section EA, on the other hand, is the section in which the waveform (timbre) of the speech segment V transitions from the waveform of the phoneme /w/ of the immediately preceding phoneme section S1 to the waveform of the phoneme /a/ of the phoneme section S2. That is, the shape of the mouth of the speaker of the speech segment V starts to change at the phoneme boundary GA from the shape corresponding to the phoneme /w/ of the phoneme section S1, changes over time from the phoneme boundary GA to the state boundary GB toward the shape corresponding to the phoneme /a/ of the phoneme section S2, reaches the shape corresponding to the phoneme /a/ at the state boundary GB, and is held steady thereafter.

Graph g in FIG. 3 schematically shows how the timbre perceived by a listener of the synthesized sound changes over time from the phoneme /w/ of the phoneme section S1 to the phoneme /a/ of the phoneme section S2 between the start point (phoneme boundary GA) and the end point (state boundary GB) of the transition section EA. As graph g shows, the timbre perceived by the listener (how wide the speaker's mouth is open) changes nonlinearly with elapsed time. That is, the amount of timbre change per predetermined unit time differs from one time point to another within the transition section EA. Specifically, near the start point (phoneme boundary GA) of the transition section EA the timbre changes markedly after only a short time, whereas near the end point (state boundary GB) of the transition section EA the timbre hardly changes as time passes.

As shown in FIG. 2, the segment data D of each speech segment V includes section information DB and a plurality of items of unit data U. The section information DB specifies the phoneme boundary GA and the state boundary GB within the speech segment V. Each item of unit data U specifies the frequency spectrum of the voice in one of the frames obtained by dividing the speech segment V (the phoneme section S1 and the phoneme section S2) on the time axis.
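The layout of the segment data D can be pictured with a few small data classes. This is only a sketch under stated assumptions: the class and field names are invented, the boundaries GA and GB are assumed to be stored as frame indices (GA as the first frame of S2, GB as the first frame of the steady section EB), and the spectrum is assumed to be a plain list of magnitudes. The source does not specify any of these representations.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UnitData:
    """Unit data U: the frequency spectrum of one frame (representation assumed)."""
    spectrum: List[float]


@dataclass
class SectionInfo:
    """Section information DB, with boundaries held as frame indices (assumed)."""
    phoneme_boundary: int  # GA: first frame of phoneme section S2
    state_boundary: int    # GB: first frame of steady section EB


@dataclass
class SegmentData:
    """Segment data D of one speech segment V."""
    name: str
    section_info: SectionInfo
    units: List[UnitData] = field(default_factory=list)

    def transition_frames(self) -> List[UnitData]:
        """Frames of the transition section EA, from GA up to (not including) GB."""
        info = self.section_info
        return self.units[info.phoneme_boundary:info.state_boundary]


# A toy segment V[w-a]: 2 frames of /w/, 3 transition frames, 1 steady frame.
segment = SegmentData(
    name="w-a",
    section_info=SectionInfo(phoneme_boundary=2, state_boundary=5),
    units=[UnitData(spectrum=[float(i)]) for i in range(6)],
)
transition = segment.transition_frames()
```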

As shown in FIG. 1, the synthesis information (score data) QB stored in the storage device 14 specifies, in time series for each note of the synthesized sound, a pronunciation character X1, a sounding period X2, a pitch X3, and a clarity variable α. The pronunciation character X1 is, for example, a character string of lyrics when a singing voice is synthesized. The clarity variable α is a variable indicating the degree to which the synthesized sound is perceived as clear. The wider a speaker opens the mouth when speaking, the clearer the voice is perceived to be. The clarity variable α can therefore also be described as a variable indicating how wide the mouth of the virtual speaker of the synthesized sound is open.

The display control unit 32 of the arithmetic processing device 12 in FIG. 1 causes the display device 22 to display the editing screen 50 of FIG. 4, which the user views in order to generate and edit the synthesis information QB. The editing screen 50 is divided into a first area 51 and a second area 52. A time axis (horizontal axis) and a pitch axis (vertical axis) are set in the first area 51, and note indicators 54, each representing one note of the synthesized sound, are placed in accordance with instructions given by the user through the input device 24. The pitch X3 of the synthesis information QB is set according to the position of each note indicator 54 on the pitch axis, and the sounding period X2 is set according to its position and size on the time axis. The character that the user specifies for each note indicator 54 is set as the pronunciation character X1 of the synthesis information QB.

In the second area 52, a variable indicator 56 corresponding to each note indicator 54 is placed on the same time axis as the first area 51. Each variable indicator 56 is an image (a bar graph) whose vertical length d represents the magnitude of the clarity variable α. By operating the input device 24 as appropriate, the user can change the length d of each variable indicator 56 within the range from 0 to 127 inclusive.

The variable setting unit 33 in FIG. 1 variably sets the clarity variable α in the synthesis information QB in accordance with instructions given by the user through the input device 24. Specifically, the variable setting unit 33 sets the clarity variable α within the range from 0 to 1 inclusive so that it is proportional to the length d of the variable indicator 56 specified by the user. That is, the clarity variable α is calculated by normalizing the value set by the user through the variable indicator 56 so that its maximum is 1 (for example, by dividing it by 127).
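The normalization described above amounts to a single division, sketched below. The function name and the decision to reject out-of-range input with an exception are assumptions made for the example; the range 0 to 127 and the division by 127 come from the description.

```python
def clarity_from_indicator(d: int, d_max: int = 127) -> float:
    """Normalize the indicator length d (0..d_max) to a clarity variable in [0, 1]."""
    if not 0 <= d <= d_max:
        raise ValueError(f"d must lie in the range 0 to {d_max}")
    return d / d_max


alpha_min = clarity_from_indicator(0)
alpha_mid = clarity_from_indicator(64)
alpha_max = clarity_from_indicator(127)
```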

The segment selection unit 34 in FIG. 1 sequentially selects, from the segment group QA, the speech segment V corresponding to each phonetic character X1 that the synthesis information QB specifies in time series. The speech synthesis unit 36 generates the audio signal VOUT using the segment data D of the speech segments V sequentially selected by the segment selection unit 34. In outline, the speech synthesis unit 36 expands or contracts the segment data D on the time axis according to the sound generation period X2 specified by the synthesis information QB, converts the frequency spectrum indicated by each unit data U after the expansion or contraction into a time waveform, adjusts the waveform to the pitch X3 of the synthesis information QB, and connects the results to one another to generate the audio signal VOUT.

FIG. 5 is a block diagram of the speech synthesis unit 36. As shown in FIG. 5, the speech synthesis unit 36 of the first embodiment includes a boundary setting unit 44 and a synthesis processing unit 46. As shown in FIG. 3, when the phoneme section S2 of the speech segment V selected by the segment selection unit 34 corresponds to a phoneme that can be sustained in time, such as a vowel, a fricative, or a nasal, the boundary setting unit 44 sets a boundary time TB within the transition section EA of the phoneme section S2. Specifically, the boundary setting unit 44 sets a variable β (hereinafter the "boundary variable") that specifies the position of the boundary time TB on the time axis within the transition section EA. The boundary variable β is variably set according to the articulation variable α specified by the synthesis information QB, within the range (0 ≤ β ≤ 1) from a minimum value of 0, which specifies the start point of the transition section EA (phoneme boundary GA) as the boundary time TB, to a maximum value of 1, which specifies the end point of the transition section EA (state boundary GB) as the boundary time TB.

FIG. 6 is a graph showing the relationship between the articulation variable α and the boundary variable β. The boundary setting unit 44 of the first embodiment calculates the square of the articulation variable α as the boundary variable β (β = α²). Therefore, as shown in FIG. 6, the change Δβ1 in the boundary variable β (that is, the amount by which the boundary time TB moves) when the articulation variable α increases from a value α1 by a predetermined amount δ differs from the change Δβ2 in the boundary variable β when the articulation variable α increases from a value α2 (α2 > α1) by the same amount δ. Specifically, Δβ2 exceeds Δβ1. That is, the closer the articulation variable α is to the maximum value 1, the larger the movement of the boundary time TB for a given change in α. As understood from the above, the boundary variable β (the position of the boundary time TB) changes nonlinearly with the articulation variable α specified by the user. In the first embodiment, the relationship between the articulation variable α and the boundary variable β is common to all types of speech segments V. As shown in FIG. 3, the section W of the speech segment V that precedes the boundary time TB set by the boundary setting unit 44 (the section from the start point of the phoneme section S1 to the boundary time TB) is hereinafter referred to as the "applied section".
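To make the nonlinearity concrete, the following minimal sketch evaluates β = α² and the change in β for a fixed step δ at two starting values. The function names `boundary_variable` and `boundary_shift` are hypothetical and chosen only for this illustration.

```python
def boundary_variable(alpha: float) -> float:
    """Boundary variable beta = alpha squared, as in the first embodiment."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha ** 2


def boundary_shift(alpha: float, delta: float) -> float:
    """Change in beta when alpha increases from alpha to alpha + delta."""
    return boundary_variable(alpha + delta) - boundary_variable(alpha)


# Same step delta at a small and a large starting value of alpha.
delta_beta_1 = boundary_shift(0.2, 0.1)  # alpha1 = 0.2
delta_beta_2 = boundary_shift(0.6, 0.1)  # alpha2 = 0.6 > alpha1
```

Because the square function is convex on [0, 1], `delta_beta_2` is larger than `delta_beta_1`, which matches the relation Δβ2 > Δβ1 stated in the text.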

The synthesis processing unit 46 in FIG. 5 generates the audio signal VOUT using the applied section W of the speech segment V selected by the segment selection unit 34. Specifically, as shown in FIG. 3, the synthesis processing unit 46 appends a unit data group Z2 to a unit data group Z1. The unit data group Z1 consists of the unit data U of the segment data D that lie within the applied section W, and the unit data group Z2 is formed by repeatedly arranging the single unit data U located last in the applied section W (the hatched portion in FIG. 3). The number of unit data U in the unit data group Z2 is variably set so that the total length of the unit data groups Z1 and Z2 equals a target length corresponding to the sound generation period X2.
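The construction of the unit data groups Z1 and Z2 can be sketched as follows. The sketch assumes that lengths are counted in frames and that each unit data U occupies one frame; the function name `build_unit_sequence` is hypothetical. When the target length is not longer than Z1, the sketch simply adds no repetitions and leaves any trimming to the caller.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_unit_sequence(applied: Sequence[T], target_len: int) -> List[T]:
    """Return Z1 followed by Z2.

    applied    -- unit data inside the applied section W (this is Z1)
    target_len -- target length in frames derived from the period X2
    """
    if not applied:
        raise ValueError("the applied section must contain unit data")
    z1 = list(applied)
    # Z2 repeats the last unit data of W until the target length is reached.
    n_repeat = max(target_len - len(z1), 0)
    z2 = [z1[-1]] * n_repeat
    return z1 + z2
```

For example, an applied section of three unit data and a target length of six frames yields the three original unit data followed by three copies of the last one.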

The synthesis processing unit 46 converts the frequency spectrum indicated by each unit data U of the unit data groups Z1 and Z2 into a time waveform, adjusts it to the pitch X3 specified by the synthesis information QB, and connects the waveforms of successive frames to one another to generate the audio signal VOUT. When the articulation variable α is set to the maximum value 1 and the target length corresponding to the sound generation period X2 is less than a predetermined value (for example, the time length of the speech segment V), the synthesis processing unit 46 generates the audio signal VOUT by removing unit data U from the end of the speech segment V, which includes the steady section EB, until the target length is reached (that is, the unit data group Z2 is not added).

As described above, the boundary time TB is set before the state boundary GB at which the speech waveform in the phoneme section S2 reaches a steady state (that is, before the speaker's mouth is fully open), and the applied section W of the speech segment V preceding the boundary time TB is used to generate the audio signal VOUT. It is therefore possible to generate a synthesized sound that seems to have been uttered without the speaker fully opening the mouth.

In the first embodiment, moreover, the boundary variable β (the position of the boundary time TB on the time axis) changes nonlinearly with the articulation variable α set according to the user's instruction. Specifically, the closer the articulation variable α is to the maximum value 1, the larger the change in the boundary variable β. Meanwhile, the timbre that a listener perceives for the speech within the transition section EA changes nonlinearly with the position within the transition section EA. Specifically, the closer the position is to the end point of the transition section EA, the smaller the change in timbre. Therefore, compared with a configuration in which the boundary time TB is moved linearly with the articulation variable α, the first embodiment can reduce the sense of mismatch between a change in the articulation variable α specified by the user and the resulting change in the timbre of the synthesized sound. For example, when the user changes the articulation variable α from the maximum value 1 to half that value, the timbre of the synthesized sound changes as if the degree of mouth opening had been halved.

<Second Embodiment>
A second embodiment of the present invention is described below. In the first embodiment, the relationship between the articulation variable α and the boundary variable β was common to all types of speech segments V. However, the way the timbre (the degree of mouth opening) changes over time within the transition section EA depends on the type of the speech segment V (in particular, on the type of the phoneme in the phoneme section S1). In view of this tendency, the second embodiment changes the relationship between the articulation variable α and the boundary variable β according to the type of the phoneme in the phoneme section S1 of the speech segment V. For elements in each of the embodiments illustrated below whose operation and function are equivalent to those in the first embodiment, the reference signs used in the above description are reused and detailed descriptions are omitted as appropriate.

FIG. 7 is a schematic diagram of the segment data D in the second embodiment. The segment data D of each speech segment V in the second embodiment includes classification information DA in addition to the section information DB and the plurality of unit data U, which are the same as in the first embodiment. The classification information DA specifies the classification of each phoneme constituting the speech segment V. For example, as shown in FIG. 7, classifications such as vowel (/a/, /i/, /u/), plosive (/t/, /k/, /p/), affricate (/ts/), nasal (/m/, /n/), liquid (/r/), fricative (/s/, /f/), semivowel (/w/, /y/), and silence (/Sil/) are specified by the classification information DA for each of the phoneme sections S1 and S2 of the speech segment V.

As shown in FIG. 7, the phonemes are divided into a plurality of types C (C1 to C3). Specifically, consonant phonemes are assigned to the types C according to their degree of voicing. In the case of Japanese phonemes, for example, consonants with a high degree of voicing (for example, voiced consonants) are classified as type C1; these include phonemes rich in harmonic components, such as semivowels (/w/, /y/), nasals (/m/, /n/), and liquids (/r/), and phonemes rich in inharmonic components, such as the voiced fricative (/z/) and the voiced plosive (/d/). Consonants with a low degree of voicing, namely unvoiced consonants such as plosives (/t/, /k/, /p/), affricates (/ts/), and fricatives (/s/, /f/), are classified as type C2. Silence (/Sil/) is classified as type C3. Vowels (/a/, /i/, /u/) are classified as type C2.

The boundary setting unit 44 of the second embodiment calculates the boundary variable β from the articulation variable α so that the relationship between the articulation variable α specified by the user and the boundary variable β (the position of the boundary time TB) differs according to the type C of the phoneme in the phoneme section S1 of the speech segment V. Specifically, the function used to calculate the boundary variable β from the articulation variable α differs between the case where the classification of the phoneme in the phoneme section S1 specified by the classification information DA belongs to type C1 and the case where it belongs to type C2.

The second embodiment also achieves the same effects as the first embodiment. In addition, because the relationship between the articulation variable α and the boundary variable β changes according to the type C (classification) of the phoneme in the phoneme section S1, the second embodiment can reduce, for each speech segment V, the sense of mismatch between a change in the articulation variable α specified by the user and the change in the timbre of the synthesized sound, even when the temporal transition of the timbre (the degree of mouth opening) within the transition section EA differs according to the type C of the phoneme in the phoneme section S1.

<Third Embodiment>
In the transition section EA of a speech segment V, the waveform changes from that of the phoneme in the phoneme section S1 to that of the phoneme in the phoneme section S2. That is, the influence of the phoneme in the immediately preceding phoneme section S1 remains in the part of the transition section EA near the phoneme boundary GA. Therefore, if the boundary time TB is set at a position in the transition section EA close to the phoneme boundary GA, unit data U that still carries the influence of the phoneme in the phoneme section S1 is repeated within the unit data group Z2, which should properly reflect only the phoneme of the phoneme section S2, and the synthesized sound may become unnatural. Against this background, the third embodiment restricts the position of the boundary time TB set within the transition section EA so that the influence of the phoneme in the immediately preceding phoneme section S1 is sufficiently reduced in the unit data U repeated in the unit data group Z2 (that is, the last unit data U of the applied section W, corresponding to the boundary time TB).

FIG. 8 is a block diagram of the speech synthesis unit 36 in the third embodiment. As shown in FIG. 8, the speech synthesis unit 36 of the third embodiment is configured by adding a limit setting unit 42 to the speech synthesis unit 36 of the first embodiment. As shown in FIG. 9, the limit setting unit 42 sets a limit time TA within the transition section EA of the phoneme section S2 that corresponds to a vowel phoneme in the speech segment V selected by the segment selection unit 34. The limit time TA is the time that precedes the end point of the transition section EA (state boundary GB) by a time (R × EA) corresponding to a ratio R of the transition section EA. Limit information QC specifying the ratio R is stored in the storage device 14 in advance, and the limit setting unit 42 sets the limit time TA according to the limit information QC acquired from the storage device 14. The ratio R is selected within the range from 0 to 1 inclusive so that the influence of the phoneme in the phoneme section S1 immediately before the transition section EA is sufficiently reduced at the limit time TA.
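The placement of the limit time TA can be illustrated with a short sketch. It assumes that the boundaries GA and GB are given as times on a common axis (milliseconds in the example); the function name `limit_time` and the sample values are hypothetical.

```python
def limit_time(ga: float, gb: float, ratio: float) -> float:
    """Return TA, the time that precedes GB by ratio * (GB - GA)."""
    if gb < ga:
        raise ValueError("GB must not precede GA")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio R must lie in [0, 1]")
    transition_length = gb - ga  # length of the transition section EA
    return gb - ratio * transition_length


# Hypothetical transition section from 100 ms to 200 ms with R = 0.8.
ta = limit_time(100.0, 200.0, 0.8)
```

Under these assumptions TA falls 80 ms before GB, so R = 1 places TA at the phoneme boundary GA and R = 0 places it at the state boundary GB.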

The boundary setting unit 44 in FIG. 8 variably sets a boundary variable γ, which specifies the position of the boundary time TB within the transition section EA, according to the articulation variable α set by the variable setting unit 33. Like the boundary variable β of the first embodiment, the boundary variable γ specifies the position of the boundary time TB on the time axis with the start point of the transition section EA (phoneme boundary GA) as the minimum value 0. As shown in FIG. 9, the boundary setting unit 44 calculates the boundary variable γ so that the boundary time TB is located within the section EC (hereinafter the "fluctuation section") of the transition section EA in the phoneme section S2 that extends from the limit time TA set by the limit setting unit 42 to the state boundary GB.

FIG. 10 is an explanatory diagram of the boundary variable γ. Part (A) of FIG. 10 illustrates the boundary time TB specified within the transition section EA by the boundary variable β of the first embodiment, which changes nonlinearly with the articulation variable α. Part (B) of FIG. 10 illustrates the fluctuation section EC delimited by the limit time TA of the third embodiment. Assuming that the total time length of the transition section EA is 1, the transition section EA is divided, as shown in part (B) of FIG. 10, into a section of length (1 − R) preceding the limit time TA and the fluctuation section EC of length R following the limit time TA.

Let the boundary time TB indicated by the boundary variable γ of the third embodiment be the time that is later than the limit time TA by a time tx. Assume now that the position of the boundary time TB indicated by the boundary variable β relative to the whole transition section EA (part (A)) is equivalent to the position of the boundary time TB indicated by the boundary variable γ relative to the fluctuation section EC within the transition section EA (part (B)), that is, 1 : β = R : tx. The time tx is then expressed as the product of the boundary variable β and the ratio R (βR). Considering also that the part of the transition section EA preceding the limit time TA has the length (1 − R) as described above, the boundary variable γ, which specifies the boundary time TB with the start point GA of the transition section EA as the minimum value 0, is expressed by the following formula (1).
γ = (1 − R) + βR
  = 1 − (1 − β)R  …… (1)

The boundary setting unit 44 of the third embodiment calculates the boundary variable β from the articulation variable α in the same way as in the first embodiment (β = α²), and calculates the boundary variable γ by applying formula (1) to the boundary variable β and the ratio R indicated by the limit information QC. The boundary variable γ therefore changes nonlinearly with the articulation variable α, as the boundary variable β of the first embodiment does. Furthermore, because the boundary variable β takes a value from 0 to 1 inclusive, the boundary variable γ takes a value from (1 − R) to 1 inclusive. That is, the boundary variable γ specifies a boundary time TB that changes nonlinearly with the articulation variable α within the fluctuation section EC, which extends from the limit time TA, located a time (1 − R) after the start point GA of the transition section EA, to the end point GB of the transition section EA.
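Formula (1) combined with β = α² can be checked numerically with the sketch below. The function names are hypothetical, and the conversion to an absolute time assumes that GA and GB are given as times on a common axis, which the text does not spell out in this form.

```python
def boundary_variable_gamma(alpha: float, ratio: float) -> float:
    """gamma = 1 - (1 - beta) * R with beta = alpha ** 2 (formula (1))."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio R must lie in [0, 1]")
    beta = alpha ** 2
    return 1.0 - (1.0 - beta) * ratio


def boundary_time(ga: float, gb: float, alpha: float, ratio: float) -> float:
    """Map gamma onto the time axis: GA at gamma = 0, GB at gamma = 1."""
    return ga + boundary_variable_gamma(alpha, ratio) * (gb - ga)
```

With R = 0.8, for instance, α = 0 gives γ = 0.2 (the limit time TA), α = 0.5 gives γ = 0.4, and α = 1 gives γ = 1 (the state boundary GB), so γ stays within the range from (1 − R) to 1 as stated.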

The third embodiment also achieves the same effects as the first embodiment. In addition, because the position of the boundary time TB is restricted to the limit time TA, which is located after the phoneme boundary GA, or later, the influence of the phoneme in the immediately preceding phoneme section S1 is sufficiently reduced in the unit data U that is repeated in the unit data group Z2 used to generate the audio signal VOUT (that is, the unit data U corresponding to the boundary time TB). This has the advantage that perceptually natural speech can be synthesized.

<Fourth Embodiment>
The position of the first unit data U within the transition section EA of the phoneme section S2 at which the influence of the phoneme in the phoneme section S1 is sufficiently reduced (that is, the position of the earliest unit data U, among the plurality of unit data U in the phoneme section S2, that does not make the synthesized sound perceptually unnatural even when repeated) tends to differ according to the type C of the phoneme in the immediately preceding phoneme section S1. For example, when the phoneme in the phoneme section S1 belongs to type C1, which covers phonemes with a high degree of voicing such as semivowels, the synthesized sound does not become particularly unnatural even if unit data U near the phoneme boundary GA in the phoneme section S2 is repeated. In contrast, when the phoneme in the phoneme section S1 belongs to type C2, which covers phonemes such as plosives that are rich in inharmonic components (noise components) and small in amplitude, repeating unit data U near the phoneme boundary GA in the phoneme section S2 makes the unnaturalness of the synthesized sound caused by the phoneme in the phoneme section S1 clearly perceptible. In view of this tendency, the fourth embodiment changes the position of the limit time TA relative to the transition section EA in the phoneme section S2 according to the type C of the phoneme in the immediately preceding phoneme section S1. The configuration of the fourth embodiment is otherwise the same as that of the third embodiment.

The storage device 14 of the fourth embodiment stores limit information QC that defines the position of the limit time TA. The limit information QC specifies the ratio R (R1 to R3) of the fluctuation section EC to the transition section EA for each phoneme type C (C1 to C3). The limit setting unit 42 sets the limit time TA within the transition section EA of the speech segment V selected by the segment selection unit 34 according to the ratio R that the limit information QC indicates for the type C of the phoneme in the phoneme section S1 of that speech segment V. As in the second embodiment, each segment data D in the storage device 14 includes classification information DA, and the type C of the phoneme section S1 is identified from the classification information DA of the segment data D.

Specifically, when the phoneme in the phoneme section S1 belongs to type C1 (consonants with a high degree of voicing), the time that precedes the end point GB of the transition section EA by a time (R1 × EA) corresponding to the ratio R1 is set as the limit time TA. Similarly, when the phoneme in the phoneme section S1 belongs to type C2 (unvoiced consonants or vowels), the time that precedes the end point GB of the transition section EA by a time (R2 × EA) corresponding to the ratio R2 is set as the limit time TA, and when the phoneme in the phoneme section S1 belongs to type C3 (silence), the time that precedes the end point GB of the transition section EA by a time (R3 × EA) is set as the limit time TA. As in the third embodiment, the boundary time TB corresponding to the articulation variable α is set within the fluctuation section EC, which extends from the limit time TA to the end point GB within the transition section EA.

The ratio R for each phoneme type C is selected experimentally or statistically so that it specifies the position of the earliest unit data U, among the plurality of unit data U in the phoneme section S2, whose repetition does not produce a synthesized sound made unnatural by the phoneme in the phoneme section S1. For example, when the phoneme in the phoneme section S1 belongs to type C1, repeating unit data U near the phoneme boundary GA in the phoneme section S2 does not make the synthesized sound particularly unnatural, whereas when the phoneme in the phoneme section S1 belongs to type C2, repeating unit data U near the phoneme boundary GA makes the unnaturalness caused by the phoneme in the phoneme section S1 apparent. In view of this tendency, the ratio R1 is set to a value greater than the ratios R2 and R3. Therefore, assuming that the transition sections EA have the same time length, the limit time TA when the phoneme in the phoneme section S1 belongs to type C2 or type C3 is later than the limit time TA when it belongs to type C1. Specifically, the ratio R1 is set to about 0.8 (80%), the ratio R2 to about 0.61 (61%), and the ratio R3 to about 0.5 (50%).
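The per-type ratios can be pictured as a small lookup table. The sketch below is illustrative only: the dictionary `RATIO_BY_TYPE` and the function `limit_position` are hypothetical names, the ratio values are the approximate figures quoted in the text, and positions are expressed as fractions of the transition section EA measured from the phoneme boundary GA.

```python
# Approximate ratios R1 to R3 quoted in the text, keyed by phoneme type C.
RATIO_BY_TYPE = {"C1": 0.8, "C2": 0.61, "C3": 0.5}


def limit_position(phoneme_type: str) -> float:
    """Relative position of TA within EA (0 = GA, 1 = GB), i.e. 1 - R."""
    try:
        ratio = RATIO_BY_TYPE[phoneme_type]
    except KeyError:
        raise ValueError(f"unknown phoneme type: {phoneme_type}") from None
    return 1.0 - ratio
```

With these values the limit time TA for type C1 lies earliest in the transition section and the one for type C3 lies latest, which agrees with the ordering R1 > R2, R3 described above.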

The fourth embodiment also achieves the same effects as the third embodiment. In the third embodiment, in which the limit time TA is set according to a common ratio R regardless of the phoneme in the phoneme section S1, suppressing the unnaturalness of the synthesized sound caused by the phoneme in the phoneme section S1 for every phoneme requires the limit time TA to be set toward the end of the transition section EA. The fluctuation section EC is therefore limited to a short time, and it may not be possible to lower the clarity (the degree of mouth opening) of the synthesized sound sufficiently. Conversely, if a time near the phoneme boundary GA is chosen as the limit time TA in the third embodiment in order to lower the clarity of the synthesized sound sufficiently, the synthesized sound becomes unnatural because of the phoneme in the phoneme section S1. In the fourth embodiment, the position of the limit time TA relative to the transition section EA (the ratio R) is set according to the type C of the phoneme in the immediately preceding phoneme section S1. This has the advantage that a sufficient range of variation in the clarity of the synthesized sound (that is, a sufficient reduction in clarity) can be secured while the unnaturalness of the synthesized sound caused by the phoneme in the phoneme section S1 is also reduced.

In the fourth embodiment, the ratio R of the fluctuation section EC to the transition section EA is set individually for each type C, but the method of making the limit time TA differ for each type C of the phoneme in the phoneme section S1 may be changed as appropriate. For example, a configuration may be adopted in which the limit information QC stored in the storage device 14 specifies, for each phoneme type C, the time length (number of frames) of the period from the start point GA of the transition section EA to the limit time TA or of the period from the limit time TA to the end point GB.

<Fifth Embodiment>
In the third and fourth embodiments, the limit setting unit 42 sets the limit time TA within the transition section EA using the limit information QC stored in the storage device 14 in advance. In the fifth embodiment, the limit setting unit 42 sets the limit time TA using the result of analyzing the speech segment V.

FIG. 11 is a block diagram of the speech synthesis unit 36 in the fifth embodiment. As shown in FIG. 11, the speech synthesis unit 36 of the fifth embodiment is configured by adding an index calculation unit 48 to the speech synthesis unit 36 of the third embodiment (FIG. 8). For each of the plurality of frames within the transition section EA of the phoneme section S2 of the speech segment V selected by the segment selection unit 34, the index calculation unit 48 calculates an index value K that serves as a measure of the perceptual naturalness of the synthesized sound generated by repeating the single unit data U of that frame.

１個の単位データＵを反復した場合に合成音が聴感的に不自然な音声となる典型的なフレームは、有声音と比較して音量が小さいフレームや、調和成分（基音成分および各倍音成分）に対する非調和成分の強度が高いフレームである。具体的には、破裂音や破擦音等の音素の音素区間Ｓ1の直後に位置する遷移区間ＥA内の前方のフレームの単位データＵを反復した場合に合成音は聴感的に不自然な音声となる。以上の傾向を考慮して、指標算定部４８は、各フレームの音量に関する指標値Ｋ1と、各フレームの非調和成分の強度に関する指標値Ｋ2とを、素片選択部３４が選択した音声素片Ｖの遷移区間ＥA内のフレーム毎に指標値Ｋとして算定する。 Frames for which repeating a single unit data U typically yields a perceptually unnatural synthesized sound are frames whose volume is low compared with voiced sound, and frames in which the inharmonic components are strong relative to the harmonic components (the fundamental component and each overtone component). Specifically, the synthesized sound becomes perceptually unnatural when the unit data U of an early frame is repeated in a transition section EA located immediately after the phoneme section S1 of a phoneme such as a plosive or an affricate. In view of this tendency, the index calculation unit 48 calculates, as the index value K for each frame in the transition section EA of the speech unit V selected by the unit selection unit 34, an index value K1 relating to the volume of the frame and an index value K2 relating to the intensity of the inharmonic components of the frame.

各フレームの指標値Ｋ1は、例えば、所定の音量Ａ0に対するそのフレームの音量Ａの比（Ｋ1＝Ａ／Ａ0）として算定される。所定の音量Ａ0は、例えば遷移区間ＥA内の最後のフレームの音量（遷移区間ＥA内の最大値である可能性が高い）である。したがって、遷移区間ＥA内で音量Ａが大きいフレーム（すなわち、単位データＵを反復した合成音が聴感的に自然な音声となる可能性が高いフレーム）ほど、指標値Ｋ1は大きい数値となる。 The index value K1 of each frame is calculated, for example, as the ratio of the volume A of the frame to the predetermined volume A0 (K1 = A / A0). The predetermined volume A0 is, for example, the volume of the last frame in the transition section EA (highly likely to be the maximum value in the transition section EA). Therefore, the index value K1 becomes a larger numerical value as the frame has a louder volume A within the transition section EA (that is, the frame having a higher possibility that the synthesized sound obtained by repeating the unit data U becomes a perceptually natural voice).

各フレームの指標値Ｋ2は、そのフレームの音声成分から非調和成分を低減または除去した場合の平均パワーＰSに対するそのフレームの平均パワーＰの比（Ｋ2＝Ｐ／ＰS）として算定される。図１２には、遷移区間ＥA内の１個のフレームの単位データＵで指定された周波数スペクトルＳP1が図示されている。周波数スペクトルＳP1は、各調波周波数Ｆn（基本周波数および各倍音周波数）にて強度がピークとなる調和成分に加えて各調波周波数の間に存在する非調和成分を含んで構成される。 The index value K2 of each frame is calculated as the ratio (K2 = P / PS) of the average power P of the frame to the average power PS obtained when the inharmonic components are reduced or removed from the sound of the frame. FIG. 12 shows the frequency spectrum SP1 specified by the unit data U of one frame in the transition section EA. The frequency spectrum SP1 contains harmonic components whose intensity peaks at each harmonic frequency Fn (the fundamental frequency and each overtone frequency) and, in addition, inharmonic components lying between the harmonic frequencies.

図１２には、周波数スペクトルＳP1から非調和成分を除去した周波数スペクトルＳP2（斜線部）が併記されている。周波数スペクトルＳP2は、周波数スペクトルＳP1の各調波周波数Ｆnに所定の調波成分Ｈを配置し、各調波成分Ｈの強度を周波数スペクトルＳP1の包絡線ＥNVに合致するように調整したスペクトルである。指標算定部４８は、周波数スペクトルＳP2の平均パワーＰSに対する周波数スペクトルＳP1の平均パワーＰの比を指標値Ｋ2としてフレーム毎に算定する。したがって、調和成分に対する非調和成分の強度が低いフレーム（すなわち、単位データＵを反復した合成音が聴感的に自然な音声となる可能性が高いフレーム）ほど、指標値Ｋ2は小さい数値となる。 FIG. 12 also shows a frequency spectrum SP2 (hatched portion) obtained by removing the inharmonic components from the frequency spectrum SP1. The frequency spectrum SP2 is a spectrum in which a predetermined harmonic component H is placed at each harmonic frequency Fn of the frequency spectrum SP1 and the intensity of each harmonic component H is adjusted to match the envelope ENV of the frequency spectrum SP1. The index calculation unit 48 calculates, for each frame, the ratio of the average power P of the frequency spectrum SP1 to the average power PS of the frequency spectrum SP2 as the index value K2. Consequently, the lower the intensity of the inharmonic components relative to the harmonic components in a frame (that is, the more likely it is that repeating the unit data U yields a perceptually natural synthesized sound), the smaller the index value K2.
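The two index values can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the text: each frame is given as a per-frame volume and a list of power values per frequency bin, the harmonic bins are already known, and the spectrum SP2 is approximated simply by keeping the harmonic bins and zeroing the rest (the embodiment instead places harmonic components H scaled to the envelope ENV). The names `volume_index`, `inharmonic_index`, and `harmonic_bins` are hypothetical.

```python
def volume_index(volumes):
    """Return K1 = A / A0 for every frame of a transition section.

    A0 is taken as the volume of the last frame, as in the example in the text.
    """
    a0 = volumes[-1]
    if a0 <= 0:
        raise ValueError("reference volume A0 must be positive")
    return [a / a0 for a in volumes]


def inharmonic_index(power_spectrum, harmonic_bins):
    """Return K2 = P / PS for one frame.

    P is the mean power over all bins (spectrum SP1). PS is the mean power
    when only the harmonic bins are kept, a simplified stand-in for SP2.
    """
    n = len(power_spectrum)
    p = sum(power_spectrum) / n
    ps = sum(power_spectrum[b] for b in set(harmonic_bins)) / n
    if ps <= 0:
        raise ValueError("harmonic power PS must be positive")
    return p / ps
```

Under this simplification K2 equals 1 for a purely harmonic frame and grows as more power lies between the harmonic frequencies, which matches the direction described above.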

第５実施形態の限界設定部４２は、遷移区間ＥA内の各フレームの指標値Ｋ（Ｋ1，Ｋ2）に応じて限界時点ＴAを設定する。すなわち、限界設定部４２は、遷移区間ＥA内の複数のフレームのうち指標値Ｋが示す合成音の自然性が目標値を上回る最先のフレームの時点を限界時点ＴAとして設定する。 The limit setting unit 42 of the fifth embodiment sets the limit time TA according to the index value K (K1, K2) of each frame in the transition section EA. That is, the limit setting unit 42 sets the time point of the earliest frame in which the naturalness of the synthesized sound indicated by the index value K among the plurality of frames in the transition section EA exceeds the target value as the limit time point TA.

具体的には、指標算定部４８は、遷移区間ＥAの先頭から順次にフレームを選択してそのフレームの指標値Ｋ1と指標値Ｋ2とを算定し、限界設定部４２は、指標値Ｋ1が所定の閾値Ｋth1を上回るか否か（すなわち音量が目標値を上回るか否か）および指標値Ｋ2が所定の閾値Ｋth2を下回るか否か（すなわち調和成分に対する非調和成分の強度が目標値を下回るか否か）を判定する。限界設定部４２は、指標値Ｋ1の判定と指標値Ｋ2の判定との双方の結果が肯定となる最先のフレームの時点を限界時点ＴAとして設定する。すなわち、調和成分に対する非調和成分の強度が充分に低くて音量が大きい時点（単位データＵの反復で生成される合成音が聴感的に自然な音声となる時点）が限界時点ＴAとして設定される。したがって、第５実施形態においても結果的には、第４実施形態と同様に、音素区間Ｓ1の音素の種別Ｃに応じた時点が限界時点ＴAとして設定される。例えば、音素区間Ｓ1の音素が種別Ｃ2に属する場合の限界時点ＴAは、音素区間Ｓ1の音素が種別Ｃ1に属する場合の限界時点ＴAよりも時間的に遅い時点となる。境界設定部４４や合成処理部４６の動作は第３実施形態や４実施形態と同様である。 Specifically, the index calculation unit 48 selects frames sequentially from the beginning of the transition section EA and calculates the index value K1 and the index value K2 of each selected frame, and the limit setting unit 42 determines whether the index value K1 exceeds a predetermined threshold Kth1 (that is, whether the volume exceeds its target value) and whether the index value K2 falls below a predetermined threshold Kth2 (that is, whether the intensity of the inharmonic components relative to the harmonic components falls below its target value). The limit setting unit 42 sets, as the limit time TA, the time of the earliest frame for which both determinations are affirmative. In other words, the time at which the inharmonic components are sufficiently weak relative to the harmonic components and the volume is high (the time at which repeating the unit data U yields a perceptually natural synthesized sound) is set as the limit time TA. Consequently, in the fifth embodiment as well, the limit time TA ends up at a time that depends on the type C of the phoneme in the phoneme section S1, as in the fourth embodiment. For example, the limit time TA when the phoneme in the phoneme section S1 belongs to type C2 is later than the limit time TA when that phoneme belongs to type C1. The operations of the boundary setting unit 44 and the synthesis processing unit 46 are the same as in the third and fourth embodiments.
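The frame scan described above can be sketched as follows. This is a minimal illustration, assuming the index values K1 and K2 are already available as per-frame lists; the function name `find_limit_frame` is hypothetical, and returning `None` when no frame qualifies is an assumption, since the text does not say what happens in that case.

```python
def find_limit_frame(k1, k2, kth1, kth2):
    """Return the index of the earliest frame with K1 > Kth1 and K2 < Kth2.

    Frames are scanned from the beginning of the transition section EA.
    Returns None if no frame satisfies both conditions (an assumption;
    the source text does not define this case).
    """
    for index, (volume_value, inharmonic_value) in enumerate(zip(k1, k2)):
        if volume_value > kth1 and inharmonic_value < kth2:
            return index
    return None
```

For example, with K1 = [0.1, 0.4, 0.8, 1.0], K2 = [3.0, 1.8, 1.2, 1.1], Kth1 = 0.5 and Kth2 = 1.5, the third frame (index 2) is the first to pass both tests.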

第５実施形態においても第３実施形態と同様の効果が実現される。また、第５実施形態では、遷移区間ＥA内のフレーム毎に算定された指標値Ｋ（Ｋ1，Ｋ2）に応じて限界時点ＴAが設定されるから、限界時点ＴAを規定する限界情報ＱCが事前に用意された第１実施形態や第２実施形態と比較して、各音声素片Ｖの特性に応じた適切な限界時点ＴAを設定できるという利点がある。 The fifth embodiment achieves the same effects as the third embodiment. In addition, because the limit time TA in the fifth embodiment is set according to the index value K (K1, K2) calculated for each frame in the transition section EA, there is an advantage over the first and second embodiments, in which the limit information QC defining the limit time TA is prepared in advance: an appropriate limit time TA can be set according to the characteristics of each speech unit V.

＜第６実施形態＞ <Sixth Embodiment>
図１３は、第６実施形態における音声合成部３６のブロック図である。図１３に示すように、第６実施形態の音声合成部３６は、第１実施形態の音声合成部３６に指標算定部６０を追加した構成である。指標算定部６０は、図１４に示すように、素片選択部３４が選択した音声素片Ｖの遷移区間ＥA内の複数（Ｍ個）のフレームの各々について音色指標値Ｙ[m]（ｍ＝１〜Ｍ）を算定する。音色指標値Ｙ[m]は、遷移区間ＥA内の第ｍ番目のフレームと遷移区間ＥA内の特定のフレーム（以下「基準フレーム」という）との音色の相違を示す尺度である。以下の説明では、遷移区間ＥA内の最後（第Ｍ番目）に位置する１個のフレームを基準フレームとして選定した場合を例示する。 FIG. 13 is a block diagram of the speech synthesis unit 36 in the sixth embodiment. As shown in FIG. 13, the speech synthesis unit 36 of the sixth embodiment is the speech synthesis unit 36 of the first embodiment with an index calculation unit 60 added. As shown in FIG. 14, the index calculation unit 60 calculates a timbre index value Y[m] (m = 1 to M) for each of a plurality of (M) frames in the transition section EA of the speech unit V selected by the unit selection unit 34. The timbre index value Y[m] is a measure of the difference in timbre between the m-th frame in the transition section EA and a specific frame in the transition section EA (hereinafter referred to as the "reference frame"). The following description takes as an example the case where the last (M-th) frame in the transition section EA is selected as the reference frame.

第６実施形態において有声音の音素に対応する１個の単位データＵは、図１４に示すように、各フレームの周波数スペクトルを示すデータに加えて包絡形状データＥを含んで構成される。包絡形状データＥは、音声の周波数スペクトルの包絡線の形状的な特徴を示す複数の変数で構成される。第１実施形態の包絡形状データＥは、例えば励起波形エンベロープｅ1と胸部レゾナンスｅ2と声道レゾナンスｅ3と差分スペクトルｅ4とを含むＥｐＲ（Excitation plus Resonance）パラメータを要素とするベクトルであり、公知のＳＭＳ（Spectral Modeling Synthesis）分析で生成される。なお、ＥｐＲパラメータやＳＭＳ分析については、例えば特許第３７１１８８０号公報や特開２００７−２２６１７４号公報にも開示されている。 In the sixth embodiment, as shown in FIG. 14, each unit data U corresponding to a voiced phoneme includes envelope shape data E in addition to data indicating the frequency spectrum of the frame. The envelope shape data E consists of a plurality of variables indicating the shape characteristics of the envelope of the frequency spectrum of the speech. The envelope shape data E of the first embodiment is, for example, a vector whose elements are EpR (Excitation plus Resonance) parameters including an excitation waveform envelope e1, a chest resonance e2, a vocal tract resonance e3, and a difference spectrum e4, and is generated by known SMS (Spectral Modeling Synthesis) analysis. EpR parameters and SMS analysis are also disclosed in, for example, Japanese Patent No. 3711880 and Japanese Patent Application Laid-Open No. 2007-226174.

励起波形エンベロープ（Excitation Curve）ｅ1は、声帯振動のスペクトルの包絡線を近似する変数である。胸部レゾナンス（Chest Resonance）ｅ2は、胸部共鳴特性を近似する所定個のレゾナンス（帯域通過フィルタ）の帯域幅と中心周波数と振幅値とを指定する。声道レゾナンス（Vocal Tract Resonance）ｅ3は、声道共鳴特性を近似する複数のレゾナンスの各々について帯域幅と中心周波数と振幅値とを指定する。差分スペクトルｅ4は、励起波形エンベロープｅ1と胸部レゾナンスｅ2と声道レゾナンスｅ3とで近似されるスペクトルと音声のスペクトルとの差分（誤差）を意味する。 The excitation waveform envelope (Excitation Curve) e1 is a variable that approximates the envelope of the spectrum of vocal cord vibration. Chest resonance e2 designates the bandwidth, center frequency, and amplitude value of a predetermined number of resonances (bandpass filters) that approximate the chest resonance characteristics. Vocal Tract Resonance (e3) designates a bandwidth, a center frequency, and an amplitude value for each of a plurality of resonances that approximate the vocal tract resonance characteristics. The difference spectrum e4 means the difference (error) between the spectrum approximated by the excitation waveform envelope e1, the chest resonance e2 and the vocal tract resonance e3 and the spectrum of the voice.

包絡形状データＥは、各フレームの音色を示す変数として利用可能である。そこで、指標算定部６０は、各フレームの包絡形状データＥと基準フレームの包絡形状データＥとに応じて両者間の相違を示す音色指標値Ｙ[m]を算定する。具体的には、指標算定部６０は、以下の数式(2)の演算で音色指標値Ｙ[m]を算定する。 The envelope shape data E can be used as a variable indicating the timbre of each frame. The index calculation unit 60 therefore calculates, from the envelope shape data E of each frame and the envelope shape data E of the reference frame, a timbre index value Y[m] indicating the difference between the two. Specifically, the index calculation unit 60 calculates the timbre index value Y[m] by the following formula (2).
Y[m] = D{E(M), E(m)} / D{E(M), E(1)} ……(2)
数式(2)の演算子Ｄ{Ｅ(m1)，Ｅ(m2)}は、遷移区間ＥA内の第ｍ1番目のフレームの包絡形状データＥ(m1)と第ｍ2番目のフレームの包絡形状データＥ(m2)との距離（例えば各包絡形状データＥが示すベクトル間のユークリッド距離）を意味する。すなわち、数式(2)の分子の距離Ｄ{Ｅ(M)，Ｅ(m)}は、基準フレームの包絡形状データＥ(M)と遷移区間ＥAの第ｍ番目のフレームの包絡形状データＥ(m)との距離であり、数式(2)の分母の距離Ｄ{Ｅ(M)，Ｅ(1)}は、基準フレームの包絡形状データＥ(M)と遷移区間ＥAの最初のフレームの包絡形状データＥ(1)との距離である。 The operator D{E(m1), E(m2)} in formula (2) denotes the distance between the envelope shape data E(m1) of the m1-th frame and the envelope shape data E(m2) of the m2-th frame in the transition section EA (for example, the Euclidean distance between the vectors represented by the respective envelope shape data E). That is, the distance D{E(M), E(m)} in the numerator of formula (2) is the distance between the envelope shape data E(M) of the reference frame and the envelope shape data E(m) of the m-th frame in the transition section EA, and the distance D{E(M), E(1)} in the denominator is the distance between the envelope shape data E(M) of the reference frame and the envelope shape data E(1) of the first frame in the transition section EA.

音声素片Ｖの音色は、遷移区間ＥAの始点ＧAから終点ＧBにかけて音素区間Ｓ1の音素の音色から音素区間Ｓ2の音素の音色に経時的に変化するから、距離Ｄ{Ｅ(M)，Ｅ(m)}は、遷移区間ＥAの最初のフレームで最大値Ｄ{Ｅ(M)，Ｅ(1)}となり、概略的には時間の経過とともに減少して最後のフレーム（基準フレーム）で最小値０となる。数式(2)の距離Ｄ{Ｅ(M)，Ｅ(1)}による除算は、音色指標値Ｙ[m]を０以上かつ１以下の範囲内の数値に正規化する演算を意味する。すなわち、音色指標値Ｙ[1]〜Ｙ[M]は、図１４に示すように、遷移区間ＥAの最初のフレームにて最大値１となり、概略的な傾向として時間の経過とともに減少して最後のフレーム（基準フレーム）にて最小値０となる。 Since the timbre of the speech segment V changes over time from the timbre of the phoneme segment S1 to the timbre of the phoneme segment S2 from the start point GA to the end point GB of the transition segment EA, the distance D {E (M), E (m)} is the maximum value D {E (M), E (1)} in the first frame of the transition section EA, and generally decreases with the passage of time and is the minimum in the last frame (reference frame). The value is 0. The division by the distance D {E (M), E (1)} in Equation (2) means an operation for normalizing the timbre index value Y [m] to a numerical value in the range of 0 to 1 inclusive. That is, as shown in FIG. 14, the timbre index values Y [1] to Y [M] have a maximum value of 1 in the first frame of the transition section EA, and decrease as time passes as a general trend. The minimum value becomes 0 in the frame (reference frame).
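Formula (2) can be illustrated with a short Python sketch. It assumes that the envelope shape data E of each frame has already been flattened into a plain numeric vector of equal length (the actual EpR parameter layout is not specified here) and that D is the Euclidean distance, which the text gives only as an example. The function name `timbre_index` is hypothetical.

```python
import math


def timbre_index(envelopes):
    """Return Y[m] = D{E(M), E(m)} / D{E(M), E(1)} for every frame.

    `envelopes` holds one numeric vector per frame of the transition
    section EA; the last vector is the reference frame E(M).
    """
    reference = envelopes[-1]
    denominator = math.dist(reference, envelopes[0])
    if denominator == 0:
        raise ValueError("first and reference frames have identical envelopes")
    return [math.dist(reference, e) / denominator for e in envelopes]
```

By construction the first frame yields 1 and the reference frame yields 0; intermediate frames can exceed 1 only if their envelope lies farther from the reference frame than the first frame does.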

以上の説明から理解されるように、音色指標値Ｙ[m]は、受聴者が知覚する音色の変化（発声者の口の開き具合の変化）に対して線形に変化する変数として機能する。例えば、遷移区間ＥAのうち直前の音素区間Ｓ1の音素と音素区間Ｓ2の音素との中間の音色となるフレーム（例えば発声者が口を半分だけ開いた状態のフレーム）では音色指標値Ｙ[m]は０.５となり、直前の音素区間Ｓ1の音素の音色と音素区間Ｓ2の音素の音色とを２：８の割合で混合した音色のフレーム（例えば発声者が口を２割だけ開いた状態のフレーム）では音色指標値Ｙ[m]は０.２となる。 As understood from the above description, the timbre index value Y[m] functions as a variable that changes linearly with the change in timbre perceived by the listener (the change in how far the speaker's mouth is open). For example, the timbre index value Y[m] is 0.5 in a frame of the transition section EA whose timbre is midway between that of the phoneme in the immediately preceding phoneme section S1 and that of the phoneme in the phoneme section S2 (for example, a frame in which the speaker's mouth is half open), and is 0.2 in a frame whose timbre mixes the timbre of the phoneme in the phoneme section S1 and the timbre of the phoneme in the phoneme section S2 at a ratio of 2:8 (for example, a frame in which the speaker's mouth is only 20% open).

第６実施形態の境界設定部４４は、遷移区間ＥA内の音色指標値Ｙ[m]の時間的な遷移において、音色指標値Ｙ[m]が、明瞭度変数αに線形に対応した数値となる時点を、境界時点ＴBとして選定する。具体的には、境界設定部４４は、明瞭度変数αに応じて０から１までの範囲内で線形に変化する数値（１−α）を算定し、図１４に示すように、遷移区間ＥA内で音色指標値Ｙ[m]が数値（１−α）に合致する時点ｔyを境界時点ＴBとして設定する。例えば明瞭度変数αが０.５である場合には、遷移区間ＥAのうち音色指標値Ｙ[m]が０.５となる時点（フレーム）が境界時点ＴBとして設定され、明瞭度変数αが０.２である場合には、遷移区間ＥAのうち音色指標値Ｙ[m]が０.８となる時点が境界時点ＴBとして設定される。 The boundary setting unit 44 of the sixth embodiment selects, as the boundary time TB, the time in the temporal transition of the timbre index value Y[m] within the transition section EA at which Y[m] takes a value that corresponds linearly to the intelligibility variable α. Specifically, the boundary setting unit 44 calculates a value (1−α) that changes linearly within the range from 0 to 1 according to the intelligibility variable α and, as shown in FIG. 14, sets the time ty at which the timbre index value Y[m] matches the value (1−α) within the transition section EA as the boundary time TB. For example, when the intelligibility variable α is 0.5, the time (frame) in the transition section EA at which the timbre index value Y[m] is 0.5 is set as the boundary time TB, and when α is 0.2, the time at which Y[m] is 0.8 is set as the boundary time TB.

ところで、図１４では音色指標値Ｙ[m]が時間経過とともに単調減少する場合を例示したが、図１５に示すように、音声素片Ｖによっては音色指標値Ｙ[m]が単調減少しない可能性がある。音声素片Ｖが遷移区間ＥA内で増加および減少する場合、遷移区間ＥA内の複数の時点（フレーム）ｔyにて音色指標値Ｙ[m]が明瞭度変数αに応じた数値（１−α）に合致し得る。以上のように音色指標値Ｙ[m]が複数の時点ｔyにて明瞭度変数αに応じた数値（１−α）となる場合、境界設定部４４は、複数の時点ｔyのうち最も後方の時点を境界時点ＴBとして選択する。 FIG. 14 illustrates the case where the timbre index value Y[m] decreases monotonically with time, but, as shown in FIG. 15, the timbre index value Y[m] may not decrease monotonically for some speech units V. When the timbre index value Y[m] of a speech unit V both increases and decreases within the transition section EA, it can match the value (1−α) corresponding to the intelligibility variable α at a plurality of time points (frames) ty within the transition section EA. When the timbre index value Y[m] takes the value (1−α) at a plurality of time points ty in this way, the boundary setting unit 44 selects the rearmost of those time points as the boundary time TB.
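The selection rule, including the tie-break for non-monotonic curves, can be sketched as below. This is an illustration under assumptions: Y[m] is given as a list of per-frame values, α lies in the range 0 to 1, and, because sampled values rarely equal (1−α) exactly, a "match" is taken to be the last pair of adjacent frames between which the curve crosses the target, with the nearer of the two frames returned. The function name `select_boundary_frame` and this crossing rule are hypothetical.

```python
def select_boundary_frame(y, alpha):
    """Return the frame index used as the boundary time TB.

    Scans from the end of the transition section so that, when Y[m]
    crosses the target (1 - alpha) several times, the rearmost crossing
    is chosen. Of the two frames around that crossing, the one whose
    value is nearer the target is returned (an assumption).
    """
    if not y:
        raise ValueError("y must contain at least one frame")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is assumed to lie in [0, 1]")
    target = 1.0 - alpha
    for m in range(len(y) - 2, -1, -1):
        before = y[m] - target
        after = y[m + 1] - target
        if before * after <= 0:  # the curve reaches or crosses the target here
            return m + 1 if abs(after) <= abs(before) else m
    return len(y) - 1  # fallback when no crossing exists
```

With Y = [1.0, 0.25, 0.75, 0.25, 0.0] and α = 0.5 the curve crosses 0.5 three times; the sketch returns index 3, the frame at the rearmost crossing, rather than a frame near the first crossing.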

第６実施形態においても、音声素片Ｖのうち境界時点ＴBの前方の適用区間Ｗが音声信号ＶOUTの生成に利用されるから、第１実施形態と同様に、発声者が口を充分に開かずに発声したような合成音を生成することが可能である。また、第６実施形態では、遷移区間ＥA内の音色の変化（発声者の口の開き具合の変化）に対して線形に変化する音色指標値Ｙ[m]が算定され、利用者から指示された明瞭度変数αに応じた数値（１−α）に音色指標値Ｙ[m]が合致する時点が境界時点ＴBとして設定される。したがって、第１実施形態と同様に、利用者が指定する明瞭度変数αの変化と合成音の音色変化との不整合感を低減することが可能である。具体的には、利用者が明瞭度変数αを最大値１から半分に変化させた場合には、口の開き具合を半分に変化させたように合成音の音色が変化する。 In the sixth embodiment as well, the applicable section W of the speech unit V preceding the boundary time TB is used to generate the speech signal VOUT, so, as in the first embodiment, it is possible to generate a synthesized sound that seems to be uttered without the speaker opening the mouth sufficiently. In addition, in the sixth embodiment, a timbre index value Y[m] that changes linearly with the change in timbre within the transition section EA (the change in how far the speaker's mouth is open) is calculated, and the time at which Y[m] matches the value (1−α) corresponding to the intelligibility variable α specified by the user is set as the boundary time TB. Therefore, as in the first embodiment, the sense of mismatch between a change in the user-specified intelligibility variable α and the resulting change in the timbre of the synthesized sound can be reduced. Specifically, when the user changes the intelligibility variable α from its maximum value 1 to half that value, the timbre of the synthesized sound changes as if the mouth opening had been halved.

しかも、第６実施形態では、音声素片Ｖ毎に算定された音色指標値Ｙ[m]の時間遷移に応じて境界時点ＴBが設定されるから、明瞭度変数αと境界時点ＴBの位置との関係が事前に確定された第１実施形態から第５実施形態と比較して、利用者が指定する明瞭度変数αの変化と合成音の音色変化との不整合感が低減されるように各音声素片Ｖの特性に応じた適切な境界時点ＴBを設定できるという利点がある。また、第６実施形態では、時間軸上の複数の時点ｔyにて音色指標値Ｙ[m]が明瞭度変数αに応じた数値（１−α）となる場合に、複数の時点ｔyのうち最も後方の時点が境界時点ＴBとして選択される。以上の構成では、例えば複数の時点ｔyのうち最も前方の時点を境界時点ＴBとして選択する構成と比較して、音声素片Ｖ内の多くの単位データＵが単位データ群Ｚ1として音声信号ＶOUTの合成に使用され、１個の単位データＵを反復する単位データ群Ｚ2の時間長は相対的に短縮される。したがって、聴感的に自然な合成音を生成できるという利点がある。 Moreover, in the sixth embodiment the boundary time TB is set according to the temporal transition of the timbre index value Y[m] calculated for each speech unit V. Compared with the first to fifth embodiments, in which the relationship between the intelligibility variable α and the position of the boundary time TB is fixed in advance, this has the advantage that an appropriate boundary time TB can be set according to the characteristics of each speech unit V, so that the sense of mismatch between a change in the user-specified intelligibility variable α and the change in the timbre of the synthesized sound is reduced. In addition, in the sixth embodiment, when the timbre index value Y[m] takes the value (1−α) corresponding to the intelligibility variable α at a plurality of time points ty on the time axis, the rearmost of those time points is selected as the boundary time TB. Compared with, for example, a configuration that selects the foremost of the time points ty as the boundary time TB, more of the unit data U in the speech unit V are used as the unit data group Z1 to synthesize the speech signal VOUT, and the time length of the unit data group Z2, which repeats a single unit data U, is relatively shortened. This has the advantage that a perceptually natural synthesized sound can be generated.

なお、第３実施形態から第５実施形態の限界設定部４２を第６実施形態に追加することも可能である。すなわち、限界設定部４２が設定した限界時点ＴAから遷移区間ＥAの終点ＧBまでの変動区間ＥC内の各フレームについて音色指標値Ｙ[m]が算定され、音色指標値Ｙ[m]が明瞭度変数αに応じた数値（１−α）となる変動区間ＥC内の時点ｔyが境界時点ＴBとして選定される。 The limit setting unit 42 of the third to fifth embodiments may also be added to the sixth embodiment. In that case, the timbre index value Y[m] is calculated for each frame in the fluctuation section EC from the limit time TA set by the limit setting unit 42 to the end point GB of the transition section EA, and the time ty within the fluctuation section EC at which Y[m] takes the value (1−α) corresponding to the intelligibility variable α is selected as the boundary time TB.

＜変形例＞ <Modifications>
以上の各形態は多様に変形され得る。具体的な変形の態様を以下に例示する。以下の例示から任意に選択された２以上の態様を適宜に併合することも可能である。 Each of the above embodiments may be modified in various ways. Specific modifications are exemplified below. Two or more modes arbitrarily selected from the following examples may be combined as appropriate.

（１）明瞭度変数αと境界時点ＴB（境界変数β，γ）との関係を各素片データＤにて指定することで音声素片Ｖ毎に個別に設定することも可能である。また、限界時点ＴAを指定する限界情報ＱCを各素片データＤに含ませることで限界時点ＴAを音声素片Ｖ毎に個別に指定することも可能である。 (1) By specifying the relationship between the articulation variable α and the boundary time TB (boundary variables β, γ) in each segment data D, it is also possible to set each speech segment V individually. It is also possible to individually specify the limit time TA for each speech unit V by including the limit information QC specifying the limit time TA in each unit data D.

（２）第３実施形態および第４実施形態では、遷移区間ＥAのうち限界時点ＴAから状態境界ＧBまでの時間の割合Ｒを指定したが、遷移区間ＥAのうち音素境界ＧAから限界時点ＴAまでの時間の割合Ｒを指定することも可能である。 (2) In the third embodiment and the fourth embodiment, the ratio R of the time from the limit time TA to the state boundary GB in the transition section EA is specified, but from the phoneme boundary GA to the limit time TA in the transition section EA. It is also possible to specify the time ratio R.

（３）前述の各形態では単位データＵの反復で伸長音（母音の定常的な伸ばし音）が生成されるから、各音声素片Ｖの音素区間Ｓ2のうち定常区間ＥBを省略することも可能である。 (3) In each of the above embodiments, a sustained sound (a steadily prolonged vowel) is generated by repeating the unit data U, so the stationary section EB in the phoneme section S2 of each speech unit V may be omitted.
定常区間ＥBを省略した構成によれば、素片群ＱAのデータ量を削減できるという利点がある。ただし、単位データＵの反復で生成される伸長音は実際に収録された伸長音と比較して不自然な音声となる場合があるから、定常区間ＥBを含むように音声素片Ｖの素片データＤを生成し、発音期間Ｘ2が短い場合には定常区間ＥBを含む音声素片Ｖをそのまま合成音の生成に使用する前述の各形態の構成が好適である。以上の例示から理解されるように、遷移区間ＥAは、音声素片Ｖのうち音素区間Ｓ2の一部（定常区間ＥB以外）または全部の区間を意味する。 Omitting the stationary section EB has the advantage of reducing the data amount of the unit group QA. However, a sustained sound generated by repeating the unit data U may sound unnatural compared with an actually recorded sustained sound. The configuration of the above embodiments is therefore preferable: the unit data D of each speech unit V is generated so as to include the stationary section EB, and when the sound generation period X2 is short, the speech unit V including the stationary section EB is used as it is to generate the synthesized sound. As understood from the above examples, the transition section EA means part (other than the stationary section EB) or all of the phoneme section S2 of the speech unit V.

（４）各音素の種別Ｃは適宜に変更される。例えば、相前後する母音の音素の間に無音区間が介在するように音声素片Ｖが生成（収録）された場合に母音を種別Ｃ1に分類すると、第４実施形態では、無音区間が過度に伸長されて不自然な合成音となり得るため、前述の例示のように母音を種別Ｃ2に分類した構成が好適である。しかし、無音区間の伸長が発生しない場合（例えば相前後する母音の音素の間に無音区間が存在しない場合）や特段の問題とならない場合には、母音の音素を種別Ｃ1に分類することも可能である。 (4) The type C of each phoneme may be changed as appropriate. For example, if the speech units V were generated (recorded) with a silent interval between successive vowel phonemes and vowels are classified as type C1, the silent interval may be stretched excessively in the fourth embodiment and the synthesized sound may become unnatural; the configuration exemplified above, in which vowels are classified as type C2, is therefore preferable. However, when the silent interval is not stretched (for example, when there is no silent interval between successive vowel phonemes) or when the stretching poses no particular problem, vowel phonemes may be classified as type C1.

（５）前述の各形態では、利用者からの指示に応じて明瞭度変数αを制御したが、明瞭度変数αの設定の方法は任意である。例えば、合成音に指定されたテンポに応じて明瞭度変数αを制御することも可能である。例えば、合成音に指定されたテンポが所定の基準値（例えば１２０ＢＰＭ（Beat Per Minute）等の一般的な数値）に設定された場合に明瞭度変数αを最大値１２７に設定し、基準値からテンポが離れる（例えば基準値を超えて上昇する）ほど明瞭度変数αを低下させる構成が好適である。また、例えばテンポと発音期間Ｘ2とに応じて各合成音の実際の継続長を算定し、継続長が所定の閾値（例えば音声素片Ｖで収録された音素の継続長の所定倍の時間）を下回る場合に明瞭度変数αを低下させることも可能である。以上のようにテンポや継続長に応じて明瞭度変数αを制御する構成によれば、早口で発音するほど明瞭度が低下するという実際の発音の傾向を合成音にて再現することが可能である。 (5) In each of the above embodiments, the intelligibility variable α is controlled according to an instruction from the user, but it may be set by any method. For example, the intelligibility variable α may be controlled according to the tempo specified for the synthesized sound. A suitable configuration sets α to its maximum value 127 when the specified tempo equals a predetermined reference value (for example, a typical value such as 120 BPM (beats per minute)) and lowers α the further the tempo departs from the reference value (for example, the further it rises above it). Alternatively, the actual duration of each synthesized sound may be calculated from the tempo and the sound generation period X2, and α lowered when the duration falls below a predetermined threshold (for example, a predetermined multiple of the duration of the phoneme recorded in the speech unit V). Controlling the intelligibility variable α according to tempo or duration in this way makes it possible to reproduce in the synthesized sound the tendency of real speech for intelligibility to decrease the faster one speaks.
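One way the tempo-dependent control in modification (5) might look is sketched below. The text specifies only that α takes its maximum value (127) at the reference tempo (for example 120 BPM) and decreases as the tempo departs from it; the linear falloff, the `span_bpm` parameter, and the function name `alpha_from_tempo` are assumptions made purely for illustration.

```python
def alpha_from_tempo(tempo_bpm, reference_bpm=120.0, span_bpm=120.0, alpha_max=127.0):
    """Map a tempo to an intelligibility value alpha.

    alpha equals alpha_max at the reference tempo and falls linearly to 0
    once the tempo is span_bpm away from it (the linear shape and span_bpm
    are hypothetical; the source only states that alpha decreases).
    """
    if span_bpm <= 0:
        raise ValueError("span_bpm must be positive")
    deviation = abs(tempo_bpm - reference_bpm) / span_bpm
    return alpha_max * max(0.0, 1.0 - deviation)
```

With the default parameters, a tempo of 180 BPM gives α = 63.5, half of the maximum, and tempos of 240 BPM or more give α = 0.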

（６）前述の各形態のように単位データＵの反復のみで生成された合成音は人工的で不自然な音声と知覚される可能性がある。そこで、実際の発声音から抽出された変動成分（伸長音のうち時間的に微細に変動する揺れ成分）を、単位データＵの時系列から生成された音声に付加する構成も好適である。 (6) There is a possibility that the synthesized sound generated only by repeating the unit data U as in each of the above-described forms is perceived as an artificial and unnatural sound. Therefore, it is also preferable to add a fluctuation component extracted from an actual utterance sound (a fluctuation component that fluctuates in time among the extended sounds) to the voice generated from the time series of the unit data U.

（７）第５実施形態における指標値Ｋは適宜に変更される。例えば、音量の指標値Ｋ1および非調和成分の指標値Ｋ2の一方のみを利用して限界時点ＴAを設定する構成や、指標値Ｋ1および指標値Ｋ2以外の指標値Ｋを利用して限界時点ＴAを設定する構成も採用され得る。また、指標値Ｋ1や指標値Ｋ2の算定方法も適宜に変更される。例えば、前述の例示では、音量が大きいほど指標値Ｋ1が大きい数値となり、非調和成分の強度が低いほど指標値Ｋ2が小さい数値となる場合を例示したが、音量が大きいほど指標値Ｋ1が小さい数値となり、非調和成分の強度が低いほど指標値Ｋ2が大きい数値となるように指標値Ｋ1および指標値Ｋ2を算定することも可能である。 (7) The index value K in the fifth embodiment is changed as appropriate. For example, the limit time TA is set using only one of the volume index value K1 and the anharmonic component index value K2, or the limit time TA is set using an index value K other than the index value K1 and the index value K2. A configuration for setting the value can also be adopted. In addition, the calculation method of the index value K1 and the index value K2 is appropriately changed. For example, in the above-described example, the index value K1 has a larger numerical value as the volume increases, and the index value K2 decreases as the inharmonic component intensity decreases. However, the index value K1 decreases as the volume increases. It is also possible to calculate the index value K1 and the index value K2 so that the index value K2 becomes larger as the intensity of the anharmonic component is lower.

（８）第６実施形態では、単位データＵが指定するＥｐＲパラメータを音色の情報として利用して音色指標値Ｙ[m]を算定したが、音声素片Ｖの音色を示す情報はＥｐＲパラメータに限定されない。例えば、単位データＵが示す周波数スペクトルから算定されるケプストラムをＥｐＲの代わりに音色の情報として利用して音色指標値Ｙ[m]を算定することも可能である。 (8) In the sixth embodiment, the timbre index value Y [m] is calculated using the EpR parameter specified by the unit data U as timbre information. However, the information indicating the timbre of the speech segment V is used as the EpR parameter. It is not limited. For example, the timbre index value Y [m] can be calculated using a cepstrum calculated from the frequency spectrum indicated by the unit data U as timbre information instead of EpR.

（９）前述の各形態では、素片群ＱAを記憶する記憶装置１４が音声合成装置１００に搭載された構成を例示したが、音声合成装置１００とは独立した外部装置（例えばサーバ装置）が素片群ＱAを保持する構成も採用される。音声合成装置１００（素片選択部３４）は、例えば通信網を介して外部装置から音声素片Ｖ（素片データＤ）を取得して音声信号ＶOUTを生成する。同様に、音声合成装置１００から独立した外部装置に合成情報ＱBを保持することも可能である。以上の説明から理解されるように、素片データＤや合成情報ＱBを記憶する要素（前述の各形態における記憶装置１４）は音声合成装置１００の必須の要素ではない。 (9) In each of the above-described embodiments, the configuration in which the storage device 14 that stores the unit group QA is mounted on the speech synthesizer 100 is exemplified. However, an external device (for example, a server device) independent of the speech synthesizer 100 is provided. A configuration for holding the element group QA is also employed. The speech synthesizer 100 (unit selection unit 34) obtains a speech unit V (unit data D) from an external device via, for example, a communication network, and generates a speech signal VOUT. Similarly, the synthesis information QB can be held in an external device independent of the speech synthesizer 100. As can be understood from the above description, the element for storing the segment data D and the synthesis information QB (the storage device 14 in each of the above embodiments) is not an essential element of the speech synthesizer 100.

１００……音声合成装置、１２……演算処理装置、１４……記憶装置、２２……表示装置、２４……入力装置、２６……放音装置、３２……表示制御部、３３……変数設定部、３４……素片選択部、３６……音声合成部、４２……限界設定部、４４……境界設定部、４６……合成処理部、４８，６０……指標算定部。 DESCRIPTION OF SYMBOLS 100: speech synthesizer, 12: arithmetic processing unit, 14: storage device, 22: display device, 24: input device, 26: sound emitting device, 32: display control unit, 33: variable setting unit, 34: unit selection unit, 36: speech synthesis unit, 42: limit setting unit, 44: boundary setting unit, 46: synthesis processing unit, 48, 60: index calculation unit.

Claims

A speech synthesizer comprising:
unit selection means for sequentially selecting speech units each including a transition section in which a speech waveform changes over time;
variable setting means for variably setting an intelligibility variable in accordance with an instruction from a user;
boundary setting means for setting a boundary time point in the transition section, the boundary setting means variably setting the position of the boundary time point according to the intelligibility variable set by the variable setting means so that the larger the intelligibility variable, the closer the boundary time point is to the end point of the transition section, and so that the larger the intelligibility variable, the larger the amount of movement of the boundary time point with respect to a change in the intelligibility variable; and
synthesis processing means for generating a speech signal using a section preceding the boundary time point of the speech unit selected by the unit selection means.

The speech synthesizer according to claim 1, wherein the boundary setting means sets the position of the boundary time point according to the intelligibility variable so that the relationship between the intelligibility variable and the position of the boundary time point is common to a plurality of speech units.

The speech synthesizer according to claim 1, wherein each of a plurality of speech units includes a first phoneme section and a second phoneme section corresponding to different phonemes, the second phoneme section including the transition section, and
the boundary setting means sets the position of the boundary time point according to the clarity variable so that the relationship between the clarity variable and the position of the boundary time point differs according to the phoneme type of the first phoneme section of the speech unit.

The speech synthesizer according to claim 1, further comprising index calculation means for calculating, for each of a plurality of frames in the transition section, a timbre index value indicating the difference in timbre between that frame and a reference frame,
wherein the boundary setting means sets, as the boundary time point, the time point at which the timbre index value reaches a numerical value corresponding to the clarity variable in the temporal transition of the timbre index value.

The speech synthesizer according to claim 4, wherein, when the timbre index value reaches the numerical value corresponding to the clarity variable at a plurality of time points in the temporal transition of the timbre index value, the boundary setting means sets the rearmost (latest) of those time points as the boundary time point.
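The index-based boundary setting of the two preceding claims can be sketched as follows. This is an illustration under stated assumptions rather than the method of the embodiments: each frame is assumed to be a feature vector (for example, spectral envelope samples), the timbre index value is assumed to be the Euclidean distance from the reference frame, the index is treated as reaching the target when it equals the target at a frame or passes through it between two consecutive frames, and the names `timbre_index`, `crossing_frames`, `boundary_frame`, and `target` are hypothetical. How `target` is derived from the clarity variable is left open here.

```python
import math


def timbre_index(frames, reference):
    """Assumed timbre index: Euclidean distance of each frame from the reference frame."""
    return [math.dist(frame, reference) for frame in frames]


def crossing_frames(index, target):
    """Frame numbers at which the index reaches the target value.

    A frame counts when its value equals the target, or when the target lies
    strictly between the previous frame's value and its own.
    """
    hits = []
    for i, value in enumerate(index):
        if value == target:
            hits.append(i)
        elif i > 0 and (index[i - 1] - target) * (value - target) < 0:
            hits.append(i)
    return hits


def boundary_frame(index, target):
    """Rearmost frame at which the index reaches the target, or None if it never does."""
    hits = crossing_frames(index, target)
    return hits[-1] if hits else None
```

For a non-monotonic index such as `[0.0, 0.4, 0.7, 0.5, 0.8, 1.0]` and a target of `0.6`, `crossing_frames` returns `[2, 3, 4]` and `boundary_frame` returns `4`, the rearmost candidate.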

  A speech synthesis method in which a computer system:
  sequentially selects speech units each including a transition section in which a speech waveform changes over time;
  variably sets a clarity variable in accordance with an instruction from a user;
  sets a boundary time point in the transition section; and
  generates a speech signal using the section preceding the boundary time point of each selected speech unit,
  wherein, in setting the boundary time point, the position of the boundary time point is variably set according to the set clarity variable so that the larger the clarity variable is, the closer the boundary time point is to the end point of the transition section, and the larger the clarity variable is, the larger the amount of movement of the boundary time point with respect to a change in the clarity variable.