JP5915264B2

JP5915264B2 - Speech synthesizer

Info

Publication number: JP5915264B2
Application number: JP2012046505A
Authority: JP
Inventors: 嘉山　啓
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2011-03-11
Filing date: 2012-03-02
Publication date: 2016-05-11
Anticipated expiration: 2032-03-02
Also published as: JP2012208479A

Description

The present invention relates to a technique for synthesizing speech (spoken or singing voice) by using speech segments.

Segment-concatenation speech synthesis, in which a sound designated as a target of speech synthesis (hereinafter, the "synthesis target sound") is generated by connecting a plurality of speech waveforms recorded in advance, has been proposed in the past. For example, in the technique of Patent Document 1, a speech waveform (segment data) recorded in advance for each speech segment is stored in a storage device, and the speech signal of the synthesis target sound is generated by sequentially selecting from the storage device the speech waveforms corresponding to the pronounced characters (for example, lyrics) of the synthesis target sound and connecting them to one another.

Patent Document 1: JP 2007-240564 A

In the technique of Patent Document 1, when a time length longer than the speech waveform stored in the storage device is designated as the duration of the synthesis target sound, the speech signal is generated by repeating (looping) that speech waveform. Consequently, a regular change in characteristics (for example, a change in amplitude or period) whose cycle equals the time length of the speech waveform occurs, and the sound quality perceived by the listener deteriorates. This problem is solved if each speech waveform is made long enough that repetition becomes unnecessary, but storing speech waveforms of long duration requires an enormous storage capacity. In view of these circumstances, an object of the present invention is to prevent the deterioration in sound quality caused by repetition of a speech waveform while reducing the storage capacity required for speech synthesis.

The means adopted by the present invention to solve the above problem will now be described. To facilitate understanding of the present invention, the following description notes in parentheses the correspondence between elements of the present invention and elements of the embodiments described later, but this is not intended to limit the scope of the present invention to the examples of the embodiments.

The speech synthesizer of the present invention comprises waveform storage means (for example, storage device 12) that stores a plurality of unit waveforms (for example, unit waveforms u[m]) extracted from mutually different positions on the time axis of a voiced speech waveform (for example, speech waveform Vb), and waveform generation means (for example, speech synthesis unit 28) that generates a composite waveform (for example, composite waveform C[n]) by arranging each of the plurality of unit waveforms on the time axis. In this configuration, the composite waveform is generated by arranging on the time axis unit waveforms extracted from different positions of the speech waveform, so the deterioration in sound quality caused by waveform repetition can be prevented in comparison with the configuration of Patent Document 1, which repeats a speech waveform. In addition, because only the unit waveforms extracted from the speech waveform are stored in the waveform storage means, the required storage capacity is smaller than in a configuration that stores the entire section of the speech waveform.

In a preferred aspect of the present invention, the waveform generation means generates, for each of a plurality of processing periods (for example, processing periods R[n]), a composite waveform (for example, composite waveform C[n]) by adding a first waveform series (for example, first waveform series Sa[n]), in which a plurality of copies of a first unit waveform (for example, first unit waveform Ua[n]) selected from the plurality of unit waveforms are arranged so that their intensity increases over time within the processing period, and a second waveform series (for example, second waveform series Sb[n]), in which a plurality of copies of a second unit waveform (for example, second unit waveform Ub[n]) that differs from the first unit waveform are arranged so that their intensity decreases over time within the processing period. In this aspect, the composite waveform is generated by adding (cross-fading) the first waveform series and the second waveform series, so the effect that periodicity of characteristic changes is hard to perceive in the segment waveform (for example, segment waveform Q) formed by arranging the composite waveforms of the processing periods is especially pronounced. The first unit waveform and the second unit waveform need not differ in every processing period on the time axis; a configuration in which some processing periods use the same unit waveform as both the first and the second unit waveform is also within the scope of the present invention. That is, the "plurality of processing periods" in this aspect means those processing periods, among all the processing periods on the time axis, in which the first unit waveform and the second unit waveform differ.
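The cross-fade within one processing period can be sketched as follows. This is a minimal illustration, not the embodiment itself: the function name `composite_waveform`, the `repeats` parameter, and the choice of a linear, per-sample gain ramp are all assumptions made for the example; the text above only requires that one series increase in intensity while the other decreases.

```python
def composite_waveform(first_unit, second_unit, repeats):
    """Cross-fade two unit waveforms over one processing period.

    first_unit is repeated with a gain that rises from 0 toward 1,
    second_unit is repeated with a gain that falls from 1 toward 0,
    and the two series are added sample by sample.
    The linear ramp is an assumption of this sketch.
    """
    if len(first_unit) != len(second_unit):
        raise ValueError("unit waveforms must share a common length")
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    period = len(first_unit)
    total = period * repeats
    out = []
    for i in range(total):
        fade_in = i / total        # gain applied to the first waveform series
        fade_out = 1.0 - fade_in   # gain applied to the second waveform series
        k = i % period             # position inside the repeated unit waveform
        out.append(fade_in * first_unit[k] + fade_out * second_unit[k])
    return out
```

Because the two gains always sum to 1, a period in which both unit waveforms happen to be identical simply reproduces that unit waveform, which matches the remark that such periods are permitted.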

In a specific example of the aspect that generates the composite waveform by adding the first waveform series and the second waveform series, the first unit waveform of one processing period among the plurality of processing periods and the second unit waveform of the processing period immediately following it are the same unit waveform. In this aspect, a unit waveform is shared between successive processing periods (the first unit waveform of one period is reused as the second unit waveform of the next), so regular characteristic changes from one processing period to the next in the segment waveform can be suppressed in comparison with a configuration in which both the first and the second unit waveform are changed in every processing period.

In a specific example of the aspect that generates the composite waveform by adding the first waveform series and the second waveform series, the waveform generation means selects the first unit waveform at random from the plurality of unit waveforms for each processing period. Because the first unit waveform is selected at random for each processing period, periodic characteristic changes from one processing period to the next in the segment waveform can be suppressed.
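The two selection rules described above (carrying the first unit waveform of one processing period over as the second unit waveform of the next, and choosing the first unit waveform at random) can be combined as in the following sketch. The function name `choose_unit_pairs`, the use of zero-based indices, and the random choice of the very first second unit waveform are assumptions made for illustration.

```python
import random


def choose_unit_pairs(num_units, num_periods, rng=None):
    """Return one (first_index, second_index) pair per processing period.

    The second unit waveform of each period is the first unit waveform of
    the preceding period; the first unit waveform is drawn at random from
    the remaining unit waveforms so that the two always differ.
    """
    if num_units < 2:
        raise ValueError("at least two unit waveforms are needed")
    rng = rng or random.Random()
    pairs = []
    second = rng.randrange(num_units)  # assumption: arbitrary starting unit
    for _ in range(num_periods):
        candidates = [m for m in range(num_units) if m != second]
        first = rng.choice(candidates)
        pairs.append((first, second))
        second = first  # carried over to the next processing period
    return pairs
```

Passing a seeded `random.Random` instance makes the sequence reproducible, which is convenient when comparing synthesized output between runs.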

In a specific example of the aspect that generates the composite waveform by adding the first waveform series and the second waveform series, the waveform generation means makes the time length of one processing period among the plurality of processing periods differ from the time length of another processing period. Because the time lengths of the processing periods can differ, periodic characteristic changes in the segment waveform can be suppressed in comparison with a configuration in which all processing periods have the same time length. This effect is especially pronounced when the time length of each of the plurality of processing periods is set at random.
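One way to obtain processing periods of differing, randomly chosen lengths is sketched below. Lengths are counted in numbers of unit waveforms; the function name `random_period_lengths` and the bounds `min_units` and `max_units` are hypothetical parameters introduced for this example.

```python
import random


def random_period_lengths(total_units, min_units, max_units, rng=None):
    """Split total_units unit-waveform slots into randomly sized periods.

    Each processing period receives between min_units and max_units slots;
    the final period is shortened if fewer slots remain.
    """
    if not 1 <= min_units <= max_units:
        raise ValueError("require 1 <= min_units <= max_units")
    rng = rng or random.Random()
    lengths = []
    remaining = total_units
    while remaining > 0:
        n = min(rng.randint(min_units, max_units), remaining)
        lengths.append(n)
        remaining -= n
    return lengths
```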

In a preferred aspect of the present invention, each of the plurality of unit waveforms corresponds to one period of the speech waveform. Because unit waveforms corresponding to one period of the speech waveform are used to generate the composite waveform, the effect of both reducing storage capacity and suppressing periodicity of characteristic changes is especially pronounced.

In a preferred aspect of the present invention, the peak-to-peak value of the intensity (amplitude) is common to the plurality of unit waveforms. Because the unit waveforms share a common peak-to-peak value, fluctuation in the amplitude of the composite waveform generated from them is suppressed. This has the advantage that natural speech whose amplitude is kept steady can be generated.
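Equalizing the peak-to-peak value can be illustrated with the following sketch. The function name `normalize_peak_to_peak` and the default of using the mean peak-to-peak value as the common target are assumptions; the text above only states that the value is common to all unit waveforms.

```python
def normalize_peak_to_peak(units, target=None):
    """Scale each unit waveform so that all share one peak-to-peak value.

    units is a list of sample lists. If target is None, the mean of the
    original peak-to-peak values is used (an assumption of this sketch).
    """
    spans = [max(u) - min(u) for u in units]
    if any(s == 0 for s in spans):
        raise ValueError("a flat unit waveform cannot be normalized")
    if target is None:
        target = sum(spans) / len(spans)
    return [[x * target / s for x in u] for u, s in zip(units, spans)]
```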

In a preferred aspect of the present invention, the time length is common to the plurality of unit waveforms. Because the unit waveforms share a common time length, fluctuation in the period of the composite waveform generated from them is suppressed. This has the advantage that natural speech whose period is kept steady can be generated.

In a preferred aspect of the present invention, the phase of each of the plurality of unit waveforms is adjusted so that the cross-correlation function between the unit waveforms is maximized. Because the phases are adjusted in this way, cancellation between the first unit waveform and the second unit waveform is suppressed, and a segment waveform that sounds natural can be generated.
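A phase adjustment of this kind can be sketched by searching for the circular shift that maximizes the cross-correlation with a reference unit waveform. Treating each unit waveform as exactly one period (so that a circular shift is meaningful), aligning every waveform to a single reference, and the names `circular_xcorr` and `align_phase` are assumptions of this example.

```python
def circular_xcorr(a, b, shift):
    """Cross-correlation of a with b circularly advanced by shift samples."""
    n = len(a)
    return sum(a[i] * b[(i + shift) % n] for i in range(n))


def align_phase(reference, unit):
    """Circularly shift unit so that its correlation with reference is maximal."""
    n = len(reference)
    if len(unit) != n:
        raise ValueError("unit waveforms must share a common length")
    best = max(range(n), key=lambda s: circular_xcorr(reference, unit, s))
    return [unit[(i + best) % n] for i in range(n)]
```

Aligning the phases before cross-fading matters because two out-of-phase waveforms partially cancel when added, which would produce a dip in level in the middle of each processing period.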

The speech synthesizer according to each of the above aspects may be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to speech synthesis, or by cooperation between a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) and a program. The program of the present invention (for example, program PGM1) causes a computer having waveform storage means, which stores a plurality of unit waveforms extracted from mutually different positions on the time axis of a voiced speech waveform, to execute a waveform generation process that generates a composite waveform by arranging each of the plurality of unit waveforms on the time axis. The waveform generation process is, for example, a process that generates, for each of a plurality of processing periods, a composite waveform by adding a first waveform series, in which a plurality of copies of a first unit waveform selected from the plurality of unit waveforms are arranged so that their intensity increases over time within the processing period, and a second waveform series, in which a plurality of copies of a second unit waveform that differs from the first unit waveform are arranged so that their intensity decreases over time within the processing period. This program achieves the same operation and effects as the speech synthesizer of the present invention. The program of the present invention may be provided to a user stored on a computer-readable recording medium and installed on a computer, or may be provided from a server device by distribution over a communication network and installed on a computer.

The present invention may also be implemented as a speech processing device that generates the plurality of unit waveforms used by the speech synthesizer according to each of the above aspects. The speech processing device of the present invention comprises waveform extraction means (for example, waveform extraction unit 62) that extracts a plurality of unit waveforms from mutually different positions on the time axis of a voiced speech waveform, and waveform correction means (for example, waveform correction unit 64) that corrects the plurality of unit waveforms extracted by the waveform extraction means so that their acoustic characteristics approach one another.

In a preferred aspect of the present invention, the waveform correction means includes period correction means (for example, period correction unit 74) that adjusts the time length of each of the plurality of unit waveforms to a common predetermined length. Because the period of each unit waveform is adjusted to the common predetermined length, fluctuation in the period of the composite waveform is suppressed. This has the advantage that natural speech whose period is kept steady can be generated.
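Adjusting a unit waveform to a common length amounts to stretching or compressing it on the time axis. The sketch below does this by linear interpolation; the interpolation method and the name `resample_to_length` are assumptions, since the text above does not specify how the stretching is performed.

```python
def resample_to_length(unit, target_len):
    """Stretch or compress a unit waveform to target_len samples.

    Uses linear interpolation between neighbouring samples, keeping the
    first and last samples fixed (an assumption of this sketch).
    """
    n = len(unit)
    if n < 2 or target_len < 2:
        raise ValueError("need at least two samples on both sides")
    out = []
    for j in range(target_len):
        pos = j * (n - 1) / (target_len - 1)  # fractional source position
        i = min(int(pos), n - 2)              # left neighbour index
        frac = pos - i
        out.append((1.0 - frac) * unit[i] + frac * unit[i + 1])
    return out
```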

In a preferred aspect of the present invention, the period correction means includes index calculation means (for example, index calculation unit 742) that calculates, for each of a plurality of mutually different candidate lengths, a distortion index value indicating the degree of distortion of the unit waveforms when each unit waveform is stretched or compressed on the time axis to that candidate length, and correction processing means (for example, correction processing unit 744) that selects as the predetermined length the candidate length for which the degree of distortion indicated by the distortion index value is smallest and adjusts the time length of each of the plurality of unit waveforms to the predetermined length. Because the corrected predetermined length is chosen so that distortion of the unit waveforms is suppressed, unit waveforms that faithfully reflect the acoustic characteristics of the speech waveform can be generated.
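The selection among candidate lengths can be sketched as a minimization over a distortion index. The passage above does not define the index, so the measure used here (the sum of relative length changes over all unit waveforms) is purely a placeholder assumption, as are the names `distortion_index` and `choose_common_length`.

```python
def distortion_index(lengths, candidate):
    """Placeholder distortion measure (assumption, not from the source).

    Sums the relative amount by which each unit waveform would have to be
    stretched or compressed to reach the candidate length.
    """
    return sum(abs(candidate - n) / n for n in lengths)


def choose_common_length(lengths, candidates):
    """Return the candidate length with the smallest distortion index."""
    return min(candidates, key=lambda c: distortion_index(lengths, c))
```

For unit waveforms of 100, 102, and 104 samples and the same three values as candidates, this placeholder measure selects 102, the length that requires the least overall stretching.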

A speech processing device according to a preferred aspect of the present invention comprises distortion correction means (for example, distortion correction unit 78) that corrects the amplitude of each unit waveform so that the amplitude increases as the predetermined length becomes longer relative to the time length of the unit waveform as extracted by the waveform extraction means. Because the amplitude variation of the unit waveforms caused by the correction performed by the period correction means is compensated, the effect that unit waveforms faithfully reflecting the acoustic characteristics of the speech waveform can be generated is especially pronounced.
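The amplitude compensation can be sketched as a gain that grows with the ratio of the common length to the original length. The passage above only states that the amplitude increases as the predetermined length becomes relatively longer; the strictly proportional gain and the name `correct_amplitude` are assumptions of this example.

```python
def correct_amplitude(unit, original_len, common_len):
    """Scale a unit waveform by common_len / original_len.

    A waveform that was stretched (common_len > original_len) is amplified,
    and one that was compressed is attenuated. The linear relationship is
    an assumption of this sketch.
    """
    if original_len <= 0 or common_len <= 0:
        raise ValueError("lengths must be positive")
    gain = common_len / original_len
    return [x * gain for x in unit]
```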

The speech processing device described above may be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to speech processing, or by cooperation between a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) and a program. The program of the present invention (for example, program PGM2) is a program for generating a plurality of unit waveforms used for speech synthesis, and causes a computer to execute a waveform extraction process that extracts a plurality of unit waveforms from mutually different positions on the time axis of a voiced speech waveform, and a waveform correction process that corrects the plurality of unit waveforms extracted by the waveform extraction process so that their acoustic characteristics approach one another. This program achieves the same operation and effects as the speech processing device of the present invention. The program of the present invention may be provided stored on a computer-readable recording medium and installed on a computer, or may be provided by distribution over a communication network and installed on a computer.

FIG. 1 is a block diagram of a speech synthesizer according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram of the segment data of a variable segment and a stationary segment.
FIG. 3 is a schematic diagram of an editing screen and a time series of speech segments.
FIG. 4 is a flowchart of the operation of the speech synthesis unit.
FIG. 5 is a flowchart of the waveform generation process that generates the segment waveform of a stationary segment.
FIG. 6 is an explanatory diagram of the waveform generation process.
FIG. 7 is a block diagram of a speech processing device according to a second embodiment.
FIG. 8 is an explanatory diagram of the operation of the amplitude correction unit.
FIG. 9 is an explanatory diagram of the operation of the period correction unit.
FIG. 10 is an explanatory diagram of the operation of the phase correction unit.
FIG. 11 is a block diagram of the period correction unit in a third embodiment.
FIG. 12 is an explanatory diagram of the operation of the period correction unit in the third embodiment.
FIG. 13 is a block diagram of the waveform correction unit in a fourth embodiment.
FIG. 14 is a block diagram of the speech synthesis unit in a fifth embodiment.
FIG. 15 is an explanatory diagram of the waveform generation process that generates the segment waveform of a stationary segment in a modification.

<A: First Embodiment>
FIG. 1 is a block diagram of a speech synthesizer 100 according to the first embodiment of the present invention. The speech synthesizer 100 is a speech processing device that generates a synthesis target sound, such as a singing voice or a spoken voice, by segment-concatenation speech synthesis. As shown in FIG. 1, it is realized by a computer system comprising an arithmetic processing device 10, a storage device 12, an input device 14, a display device 16, and a sound emitting device 18.

By executing the program PGM1 stored in the storage device 12, the arithmetic processing device 10 (CPU) realizes a plurality of functions for generating the speech signal SOUT of the synthesis target sound (display control unit 22, information generation unit 24, segment selection unit 26, and speech synthesis unit 28). The speech signal SOUT is an acoustic signal representing the waveform of the synthesis target sound. A configuration in which the functions of the arithmetic processing device 10 are distributed over a plurality of integrated circuits, or a configuration in which a dedicated electronic circuit (DSP) realizes the functions, may also be adopted.

The storage device 12 stores the program PGM1 executed by the arithmetic processing device 10 and various information used by the arithmetic processing device 10 (segment group G and synthesis information Z). A known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of several types of recording media, is used as the storage device 12.

The segment group G is a set of a plurality of segment data W (a speech synthesis library). Each item of segment data W is a sample series representing the time-domain waveform of a speech segment and is used as material for speech synthesis. A speech segment is either a single phoneme, which corresponds to the minimum unit of linguistic meaning, or a phoneme chain (for example, a diphone or a triphone) in which a plurality of phonemes are concatenated. In the following, silence is treated as one phoneme (symbol #) for convenience.

Speech segments are classified into stationary segments, whose acoustic characteristics are steady, and variable segments, whose acoustic characteristics change over time. A typical stationary segment is a speech segment of a voiced sound (voiced vowel or voiced consonant) consisting of one phoneme. Typical variable segments are a speech segment of an unvoiced sound (unvoiced consonant) consisting of one phoneme, and a speech segment (phoneme chain) that consists of a plurality of phonemes (voiced or unvoiced) and includes the transition between phonemes.

Part (A) of FIG. 2 shows the speech waveform (envelope) Va of a variable segment, and part (B) of FIG. 2 shows the speech waveform (envelope) Vb of a stationary segment. As shown in part (A) of FIG. 2, for a speech segment classified as a variable segment, the sample series covering the entire section of the speech waveform Va produced when a specific speaker utters that speech segment is stored in the storage device 12 as the segment data W. For a speech segment classified as a stationary segment, on the other hand, as shown in part (B) of FIG. 2, the set of sample series of M unit waveforms u[1] to u[M] (three in the following example), extracted from mutually different positions on the time axis of the speech waveform Vb produced when a specific speaker utters that speech segment (a section in which the acoustic characteristics are kept steady), is stored in the storage device 12 as the segment data W. In the first embodiment, each unit waveform u[m] (m = 1 to M) corresponding to one stationary segment is a section of time length T0 corresponding to one period (for example, on the order of several milliseconds) of the temporally continuous voiced speech waveform Vb. The M unit waveforms u[1] to u[M] are similar to one another in acoustic characteristics to the extent that a listener perceives them as the same speech segment. However, because they are extracted from different time points of the speech waveform Vb, the acoustic characteristics (waveforms) of the M unit waveforms u[1] to u[M] differ from one another within the range of the variation (fluctuation) in acoustic characteristics that occurs when one speech segment is uttered continuously.
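The difference between storing the whole waveform and storing only M unit waveforms can be illustrated with the sketch below. The function name `extract_unit_waveforms` and the assumption that the start positions and the period length in samples are already known (for example, from a separate pitch analysis) are hypothetical.

```python
def extract_unit_waveforms(voiced_waveform, start_positions, period_len):
    """Cut one-period unit waveforms out of a steady voiced waveform.

    voiced_waveform: sample list for the steady section (Vb in the text).
    start_positions: sample indices at which each unit waveform begins.
    period_len: number of samples in one period (T0 in the text).
    """
    units = []
    for start in start_positions:
        end = start + period_len
        if start < 0 or end > len(voiced_waveform):
            raise ValueError("unit waveform falls outside the waveform")
        units.append(list(voiced_waveform[start:end]))
    return units
```

With three unit waveforms of one period each, the stored data is only three periods long regardless of how long the recorded steady section was.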

The synthesis information Z stored in the storage device 12 of FIG. 1 is information (score data) that designates the synthesis target sound as a time series. As shown in FIG. 1, the synthesis information Z designates a pitch Zb, a sounding time Zc, a duration Zd, and a volume Ze for each of the plurality of speech segments Za that make up the synthesis target sound. In addition to (or instead of) the information exemplified above, information such as volume (Volume) and velocity (Velocity) may also be designated by the synthesis information Z.
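The synthesis information Z can be pictured as a list of records, one per speech segment. In the sketch below, the class name `SegmentEntry`, the field names, and the units suggested in the comments are assumptions; the source only names the five items Za to Ze.

```python
from dataclasses import dataclass


@dataclass
class SegmentEntry:
    """One entry of the synthesis information Z (field units are assumed)."""
    segment: str     # Za: speech segment, e.g. "m-a"
    pitch: float     # Zb: pitch, e.g. a note number
    onset: float     # Zc: sounding time, e.g. in seconds
    duration: float  # Zd: duration, e.g. in seconds
    volume: float    # Ze: volume, e.g. in the range 0 to 1


def in_time_order(entries):
    """Return the entries sorted by sounding time Zc."""
    return sorted(entries, key=lambda e: e.onset)
```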

The input device 14 is a device (for example, a mouse or a keyboard) that accepts instructions from the user. The display device 16 (for example, a liquid crystal display) displays images as instructed by the arithmetic processing device 10. The sound emitting device 18 (for example, a loudspeaker or headphones) emits sound waves corresponding to the speech signal SOUT generated by the arithmetic processing device 10.

The display control unit 22 of FIG. 1 causes the display device 16 to display the editing screen 40 shown in part (A) of FIG. 3, which the user views in order to generate and edit the synthesis information Z. As shown in part (A) of FIG. 3, the editing screen 40 is an image (of the staff-paper or piano-roll type) on which a time axis (horizontal axis) and a pitch axis (vertical axis) that intersect each other are set. By operating the input device 14 as appropriate while referring to the editing screen 40, the user can instruct the speech synthesizer 100 to place note images 42, each of which is a graphic representation of a synthesis target sound, to change the position and size of each note image 42, and to designate a pronounced character (for example, a syllable of the lyrics) for each synthesis target sound. The format of the editing screen 40 is arbitrary. For example, a list of the numerical values of the items of the synthesis information Z (speech segment Za, pitch Zb, sounding time Zc, duration Zd, volume Ze) may be displayed as the editing screen 40.

The information generation unit 24 of FIG. 1 generates or updates the synthesis information Z in accordance with the user's instructions on the editing screen 40. Specifically, the information generation unit 24 sets each speech segment Za of the synthesis information Z according to the pronounced character designated for the note image 42. For example, the pronounced character "ま [ma]" illustrated in part (A) of FIG. 3 is converted into four speech segments Za, namely [#-m], [m-a], [a], and [a-#] (#: silence), as shown in part (B) of FIG. 3. Although diphones are used in this example, the pronounced character "ま [ma]" is converted into two speech segments Za, [m] and [a], when monophones are used, and into two speech segments Za, [#-m-a] and [a-#], when triphones are used. The information generation unit 24 also sets each pitch Zb according to the position of the note image 42 on the pitch axis, sets the sounding time Zc of each speech segment Za according to the position of the note image 42 on the time axis, and sets the duration Zd according to the length of the note image 42 on the time axis. The volume Ze is likewise set according to the user's instructions.
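The diphone conversion in the "ma" example can be reproduced with the following sketch. The function name `to_diphone_segments` and, in particular, the rule that a sustained single-phoneme segment is inserted only after a vowel are assumptions inferred from that one example; the source does not state the general rule.

```python
SILENCE = "#"


def to_diphone_segments(phonemes, sustained=("a", "i", "u", "e", "o")):
    """Convert a phoneme list into diphone-style speech segments.

    A transition segment "x-y" is emitted for each adjacent pair, with
    silence added at both ends. After a transition into a phoneme listed
    in sustained, that phoneme is also emitted as a stationary segment
    (assumption: vowels only, as in the "ma" example).
    """
    seq = [SILENCE] + list(phonemes) + [SILENCE]
    segments = []
    for prev, cur in zip(seq, seq[1:]):
        segments.append(f"{prev}-{cur}")
        if cur in sustained:
            segments.append(cur)
    return segments
```

Calling `to_diphone_segments(["m", "a"])` yields `["#-m", "m-a", "a", "a-#"]`, the four segments listed for the character "ま".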

The segment selection unit 26 sequentially selects, from the segment group G in the storage device 12, the segment data W corresponding to each speech segment Za specified by the synthesis information Z, at the time corresponding to the sounding time Zc of that speech segment Za. The speech synthesis unit 28 generates the speech signal SOUT using the segment data W selected by the segment selection unit 26. Specifically, for each speech segment whose segment data W has been selected by the segment selection unit 26 (hereinafter, "selected segment"), the speech synthesis unit 28 generates from the segment data W a segment waveform Q adjusted to the pitch Zb, duration Zd, and volume Ze that the synthesis information Z specifies for that selected segment, and generates the speech signal SOUT by concatenating successive segment waveforms Q. FIG. 4 is a flowchart of the process by which the speech synthesis unit 28 generates a segment waveform Q. The process of FIG. 4 is executed each time the segment selection unit 26 selects segment data W.

When the segment selection unit 26 selects segment data W, the speech synthesis unit 28 determines whether the selected segment is a stationary segment (SA1). Any method may be used to distinguish stationary segments from variable segments. For example, information indicating the type of the speech segment (stationary or variable) may be added to the segment data W in advance, and the speech synthesis unit 28 may refer to that information to tell the two apart. If the selected segment is a variable segment (SA1: NO), the speech synthesis unit 28 generates the segment waveform Q of the selected segment by adjusting the segment data W selected by the segment selection unit 26 (the speech waveform Va in part (A) of FIG. 2) to the pitch Zb, duration Zd, and volume Ze that the synthesis information Z specifies for the selected segment (SA2).

If, on the other hand, the selected segment is a stationary segment (SA1: YES), the speech synthesis unit 28 executes a process that generates the segment waveform Q by selectively arranging on the time axis the M unit waveforms u[1] to u[M] contained in the segment data W of the selected segment (hereinafter, "waveform generation process") (SA3).

FIG. 5 is a flowchart of the waveform generation process (process SA3 in FIG. 4), and FIG. 6 is an explanatory diagram of that process. On starting the process of FIG. 5, the speech synthesis unit 28 divides the duration Zd that the synthesis information Z specifies for the selected segment into N processing periods R[1] to R[N], as shown in FIG. 6 (SB1). The time length Lr[n] of each processing period R[n] (n = 1 to N) is set randomly. However, each time length Lr[n] is an integer multiple of the time length T0 of the unit waveform u[m], and the N time lengths Lr[1] to Lr[N] sum to the duration Zd (Lr[1] + Lr[2] + ... + Lr[N] = Zd).

In the first embodiment, the time length Lr[n] is defined as the sum of a reference length L0 and a variation length d[n] (Lr[n] = L0 + d[n]). The speech synthesis unit 28 sets each of the N variation lengths d[n] randomly within a predetermined range and adds each variation length d[n] to the predetermined reference length L0 to obtain the time length Lr[n] of the processing period R[n]. The time lengths Lr[n] of the processing periods R[n] can therefore differ from one another. The number N of processing periods R[n] varies according to the duration Zd.
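
A minimal sketch of step SB1 follows, assuming that all lengths are expressed in samples and that Zd is a multiple of T0. The text does not say how the sum is made to match Zd exactly; here the last period is simply shortened to absorb the remainder, which is an assumption. The function and parameter names are hypothetical.

```python
import random

def split_duration(zd, t0, l0, d_max, rng=random):
    """Split duration zd into period lengths Lr[n] = L0 + d[n].

    Every length is a multiple of t0; d[n] is drawn uniformly from
    [-d_max, +d_max] in steps of t0. The final period is clipped so that
    the lengths sum to zd (an assumption, not stated in the source).
    """
    if zd <= 0 or zd % t0 != 0:
        raise ValueError("zd must be a positive multiple of t0")
    base_units = l0 // t0
    max_dev = d_max // t0
    remaining = zd // t0            # duration left, counted in unit waveforms
    lengths = []
    while remaining > 0:
        units = base_units + rng.randint(-max_dev, max_dev)
        units = min(max(1, units), remaining)   # at least one unit waveform
        lengths.append(units * t0)
        remaining -= units
    return lengths

lengths = split_duration(zd=4800, t0=100, l0=800, d_max=300)
print(lengths, sum(lengths))
```

The number of periods N is simply `len(lengths)`, so it grows with Zd as the text describes.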

As shown in FIG. 6, the speech synthesis unit 28 generates a synthesized waveform C[n] of time length Lr[n] for each processing period R[n] by selectively arranging on the time axis the M unit waveforms u[1] to u[M] contained in the segment data W of the selected segment (SB2 to SB6). The waveform obtained by concatenating the N synthesized waveforms C[n] is used as the segment waveform Q in generating the speech signal SOUT. FIG. 6 schematically shows the change over time of the intensity (amplitude or power) of each unit waveform u[m].

The speech synthesis unit 28 initializes to 1 the variable n that designates one processing period R[n] (SB2). The speech synthesis unit 28 then selects two different unit waveforms u[m] from among the M unit waveforms u[1] to u[M] contained in the segment data W of the selected segment as a first unit waveform Ua[n] and a second unit waveform Ub[n] (SB3).

Specifically, the speech synthesis unit 28 selects the first unit waveform Ua[n-1] of the immediately preceding processing period R[n-1] as the second unit waveform Ub[n] of the current processing period R[n], and randomly selects the first unit waveform Ua[n] of the processing period R[n] from the (M-1) unit waveforms that remain when the second unit waveform Ub[n] is excluded from the M unit waveforms u[1] to u[M]. For the first processing period R[1], any one of the M unit waveforms u[1] to u[M] (for example, one selected randomly or in a fixed manner from the M) is selected as the second unit waveform Ub[1].
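
The selection rule of step SB3 can be sketched as below. Indices are 0-based here, and the function name is hypothetical; choosing Ub[1] at random is one of the two options the text allows.

```python
import random

def choose_unit_pairs(m, n_periods, rng=random):
    """Return [(a, b), ...], the indices of Ua[n] and Ub[n] for each period."""
    if m < 2:
        raise ValueError("at least two unit waveforms are required")
    pairs = []
    b = rng.randrange(m)                      # Ub[1]: any unit waveform
    for _ in range(n_periods):
        # Ua[n] is drawn from the (M-1) waveforms other than Ub[n].
        a = rng.choice([i for i in range(m) if i != b])
        pairs.append((a, b))
        b = a                                 # Ua[n] becomes Ub[n+1]
    return pairs

print(choose_unit_pairs(m=3, n_periods=5))
```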

For example, as shown in FIG. 6, in the first processing period R[1] within the duration Zd, the unit waveform u[3] is selected as the first unit waveform Ua[1] and the unit waveform u[2] is selected as the second unit waveform Ub[1]. In the next processing period R[2], the unit waveform u[1] is selected as the new first unit waveform Ua[2], and the unit waveform u[3] is carried over from the processing period R[1] as the second unit waveform Ub[2]. In the processing period R[3], the unit waveform u[2] is selected as the new first unit waveform Ua[3], and the unit waveform u[1] is carried over from the processing period R[2] as the second unit waveform Ub[3].

Having selected the first unit waveform Ua[n] and the second unit waveform Ub[n] of the processing period R[n] in this way, the speech synthesis unit 28 generates the synthesized waveform C[n] of the processing period R[n] by crossfading a first waveform series Sa[n], in which a plurality of first unit waveforms Ua[n] are arranged, and a second waveform series Sb[n], in which a plurality of second unit waveforms Ub[n] are arranged, as shown in FIG. 6 (SB4). Specifically, the first waveform series Sa[n] is a time series of as many first unit waveforms Ua[n] as span the time length Lr[n] of the processing period R[n] (Lr[n]/T0 waveforms), arranged with their intensity (amplitude) adjusted so that it increases over time. The second waveform series Sb[n] is a time series of as many second unit waveforms Ub[n] as span the time length Lr[n] of the processing period R[n] (Lr[n]/T0 waveforms), arranged with their intensity (amplitude) adjusted so that it decreases over time. The speech synthesis unit 28 generates the synthesized waveform C[n] by adding the first waveform series Sa[n] and the second waveform series Sb[n].
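
The crossfade of step SB4 can be sketched as follows. The text only requires that the intensity of Sa[n] rise and that of Sb[n] fall over the period; the per-sample linear gain used here is an assumption, as are the function and parameter names.

```python
def crossfade_period(ua, ub, repeats):
    """Build C[n] from `repeats` copies of ua (fading in) and ub (fading out)."""
    if len(ua) != len(ub):
        raise ValueError("unit waveforms must share the time length T0")
    t0 = len(ua)
    total = t0 * repeats                      # Lr[n] in samples
    out = []
    for i in range(total):
        w = i / (total - 1) if total > 1 else 1.0   # linear gain, 0 -> 1
        sa = w * ua[i % t0]                   # first waveform series Sa[n]
        sb = (1.0 - w) * ub[i % t0]           # second waveform series Sb[n]
        out.append(sa + sb)
    return out

c = crossfade_period([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], repeats=3)
print(len(c), c[0], c[-1])  # 12 0.0 1.0
```

Because Ub[n+1] equals Ua[n], each period ends on the waveform with which the next period begins, so consecutive synthesized waveforms join without a jump in timbre.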

The speech synthesis unit 28 determines whether synthesized waveforms C[n] (C[1] to C[N]) have been generated for all N processing periods R[1] to R[N] (SB5). If the result of process SB5 is negative, the speech synthesis unit 28 adds 1 to the variable n (SB6) and generates the synthesized waveform C[n] by executing processes SB3 to SB5 for the processing period R[n] corresponding to the updated variable n (that is, the processing period R[n] immediately following the processing period R[n-1] for which the synthesized waveform C[n-1] was just generated).

When generation of the N synthesized waveforms C[1] to C[N] is complete after repeating the above processing (SB5: YES), the speech synthesis unit 28 generates a segment waveform Q0 by arranging the N synthesized waveforms C[1] to C[N] on the time axis (SB7). The speech synthesis unit 28 then generates the segment waveform Q by adjusting the segment waveform Q0 generated in process SB7 to the pitch Zb and volume Ze that the synthesis information Z specifies for the selected segment (SB8). As understood from the above, a segment waveform Q having the pitch Zb and volume Ze and spanning the duration Zd that the synthesis information Z specifies for the selected segment is generated for the selected segment. As described above, the speech signal SOUT is generated by concatenating the segment waveforms Q generated in process SA2 for variable segments and the segment waveforms Q generated in the waveform generation process SA3 (process SB8) for stationary segments.

As understood from the above, in the first embodiment the synthesized waveform C[n] is generated by selecting and arranging, as appropriate, the M unit waveforms u[1] to u[M] extracted from different positions on the time axis of the speech waveform Vb. Compared with a configuration that repeats a single speech waveform Vb when generating a stationary phoneme (for example, the configuration of Patent Document 1), this has the advantage that the periodicity of the characteristic changes that arise in the speech signal SOUT from repetition of the speech waveform Vb is harder for the listener to perceive (that is, a speech signal SOUT of high sound quality can be generated).

In the first embodiment in particular, the synthesized waveform C[n] is generated by crossfading the first waveform series Sa[n] and the second waveform series Sb[n]. Compared with a configuration that generates the synthesized waveform C[n] merely by selectively arranging a plurality of unit waveforms u[m], the effect of making the periodicity of characteristic changes in the segment waveform Q hard to perceive is therefore especially pronounced. In addition, because the processing periods R[n] can be set to different time lengths Lr[n], this effect is especially pronounced compared with a configuration in which the N processing periods R[1] to R[N] are set to equal time lengths. Furthermore, the unit waveform u[m] selected as the first unit waveform Ua[n-1] in the processing period R[n-1] is carried over as the second unit waveform Ub[n] in the immediately following processing period R[n]. Compared with a configuration in which both the first unit waveform Ua[n] and the second unit waveform Ub[n] are selected independently of what was selected in the immediately preceding processing period R[n-1], this has the advantage of reducing the periodicity of characteristic changes in the segment waveform Q.

In the first embodiment, moreover, only a plurality of portions extracted from the speech waveform Vb (the unit waveforms u[m]) are stored in the storage device 12. Compared with a configuration that stores the entire speech waveform Vb in the storage device 12, this has the advantage of reducing the storage capacity required of the storage device 12. Because each unit waveform u[m] stored in the storage device 12 is a single period of the speech waveform Vb, the reduction in storage capacity is especially pronounced. Portable devices such as mobile phones and portable information terminals are more tightly constrained in storage capacity than, for example, stationary information processing apparatuses, so the first embodiment, which allows the storage capacity to be reduced, is particularly effective when the speech synthesizer 100 is installed in such a portable device.

<B: Second Embodiment>
FIG. 7 is a block diagram of a speech processing apparatus 200 according to the second embodiment of the present invention. The speech processing apparatus 200 generates the M unit waveforms u[1] to u[M] that the speech synthesizer 100 of the first embodiment uses to generate the segment waveform Q of a stationary phoneme.

As shown in FIG. 7, the speech processing apparatus 200 is implemented as a computer system comprising an arithmetic processing device 50 and a storage device 52. The storage device 52 stores a program PGM2 executed by the arithmetic processing device 50 and various kinds of information that the arithmetic processing device 50 uses. For example, the speech waveform Vb that serves as the source material for the M unit waveforms u[1] to u[M] is stored in the storage device 52. The speech waveform Vb is a sample series representing speech in which a voiced speech segment is uttered continuously over time. For example, a speech waveform Vb picked up by a sound collecting device (not shown) connected to the speech processing apparatus 200, or a speech waveform Vb supplied from a recording medium such as an optical disc or from a communication network such as the Internet, is stored in the storage device 52. For convenience, the following description refers to only one speech waveform Vb. In practice, a plurality of speech waveforms Vb corresponding to different speech segments are stored in the storage device 52, and the generation of unit waveforms u[m] illustrated below is executed for each speech waveform Vb in turn.

By executing the program PGM2 stored in the storage device 52, the arithmetic processing device 50 implements a plurality of functions (a waveform extraction unit 62 and a waveform correction unit 64) for generating the M unit waveforms u[1] to u[M] from the speech waveform Vb. A configuration in which the functions of the arithmetic processing device 50 are distributed over a plurality of integrated circuits, or in which a dedicated electronic circuit (DSP) implements each function, may also be adopted.

FIG. 8 shows the speech waveform Vb (its envelope) stored in the storage device 52. As shown in FIG. 8, the waveform extraction unit 62 extracts M unit waveforms x[1] to x[M] (three in the following example) from different positions on the time axis of the speech waveform Vb stored in the storage device 52. Each unit waveform x[m] is a section corresponding to one period of the speech waveform Vb. Any known technique may be used to extract the unit waveforms x[m].

Even when a speaker utters a single speech segment continuously, the acoustic characteristics (amplitude and period) of the actual speech waveform Vb vary over time, so the acoustic characteristics of the unit waveforms x[m] extracted from the speech waveform Vb can differ. The waveform correction unit 64 in FIG. 7 generates the M unit waveforms u[1] to u[M] by correcting (normalizing) the acoustic characteristics of the unit waveforms x[m] so that they resemble one another. As shown in FIG. 7, the waveform correction unit 64 includes an amplitude correction unit 72, a period correction unit 74, and a phase correction unit 76.

As shown in FIG. 8, the peak-to-peak value A[m] of the intensity (amplitude) of each unit waveform x[m] can differ because the amplitude of the speech waveform Vb varies over time. The peak-to-peak value A[m] is the difference between the maximum and minimum values of the intensity of the unit waveform x[m] (the total amplitude). The amplitude correction unit 72 generates unit waveforms yA[m] (yA[1] to yA[M]) by correcting each unit waveform x[m] (for example, by stretching or compressing it in the amplitude direction) so that its peak-to-peak value A[m] is adjusted to a predetermined value A0. Any correction method may be used by the amplitude correction unit 72; for example, a suitable method is to multiply the unit waveform x[m] by a correction value equal to the ratio of the predetermined value A0 to the peak-to-peak value A[m] (A0/A[m]).
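
The amplitude correction by the ratio A0/A[m] can be sketched as follows, assuming each unit waveform is held as a list of samples. The function name is hypothetical.

```python
def normalize_peak_to_peak(x, a0):
    """Scale unit waveform x so that its peak-to-peak value becomes a0."""
    a = max(x) - min(x)                 # peak-to-peak value A[m]
    if a == 0:
        raise ValueError("a constant waveform cannot be normalized")
    gain = a0 / a                       # correction value A0 / A[m]
    return [gain * v for v in x]

y = normalize_peak_to_peak([0.0, 2.0, -1.0, 0.5], a0=1.0)
print(max(y) - min(y))
```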

Likewise, the time length T[m] of each unit waveform x[m] (one period of the speech waveform Vb) can differ because the period of the speech waveform Vb varies over time. The period correction unit 74 in FIG. 7 generates unit waveforms yB[m] (yB[1] to yB[M]) by correcting each unit waveform yA[m] output by the amplitude correction unit 72 so that its period T[m] is adjusted to a predetermined value T0. Any correction method may be used by the period correction unit 74; for example, the method illustrated below is suitable.

Part (A) of FIG. 9 is a waveform diagram of a unit waveform yA[m] after correction by the amplitude correction unit 72. First, as illustrated in part (B) of FIG. 9, the period correction unit 74 generates unit waveforms yA'[m] (yA'[1] to yA'[M]) of time length T'[m] by stretching or compressing each unit waveform yA[m] on the time axis. The time length T'[m] is set to the time length that is an integer multiple of the sampling period of the speech waveform Vb and is closest to the time length T[m] of the unit waveform yA[m] (for example, the integer part of the time length T[m]). Each unit waveform yA'[m] is generated so that its intensity (signal value) is zero at the start point ts and the end point te. Second, as illustrated in part (C) of FIG. 9, the period correction unit 74 generates unit waveforms yB[m] (yB[1] to yB[M]) of time length T0 by stretching or compressing each unit waveform yA'[m] on the time axis. The time length T0 is set, for example, to the mode of the time lengths T'[m] of the unit waveforms yA'[m] (and is therefore an integer multiple of the sampling period).
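
The second step (stretching every yA'[m] to the common length T0) can be sketched as follows. The sketch assumes that the waveforms already have integer sample lengths T'[m], and it uses linear interpolation for the stretching, which the text does not specify. Function names are hypothetical.

```python
from statistics import mode

def resample_linear(y, new_len):
    """Stretch or compress y to new_len samples by linear interpolation."""
    if new_len == 1 or len(y) == 1:
        return [y[0]] * new_len
    scale = (len(y) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * scale                       # position in the source waveform
        k = min(int(pos), len(y) - 2)
        frac = pos - k
        out.append((1.0 - frac) * y[k] + frac * y[k + 1])
    return out

def equalize_periods(waves):
    """Pick T0 as the mode of the lengths T'[m] and resample every waveform to it."""
    t0 = mode([len(w) for w in waves])
    return t0, [resample_linear(w, t0) for w in waves]

t0, ys = equalize_periods([[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0], [0.0, 2.0, 4.0]])
print(t0, [len(y) for y in ys])
```

Linear interpolation preserves the first and last samples, so a waveform that starts and ends at zero still does so after stretching.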

The above processing normalizes (makes common) the peak-to-peak value A0 and the time length T0 of the M unit waveforms yB[1] to yB[M]. However, depending on the position on the time axis of each unit waveform x[m] that the waveform extraction unit 62 extracted from the speech waveform Vb as one period, the correlation between the unit waveforms yB[m] may be low. For example, in the unit waveform yB[1] in part (A) of FIG. 10 a local maximum (peak) arrives immediately after the start point, whereas in the unit waveform yB[2] in part (B) of FIG. 10 a local minimum (dip) arrives immediately after the start point. The phase correction unit 76 in FIG. 7 generates the unit waveforms u[m] (u[1] to u[M]) by correcting the phase of each unit waveform yB[m] so that the correlation between the M unit waveforms yB[1] to yB[M] output by the period correction unit 74 increases.

The phase correction unit 76 selects one unit waveform yB[m] from among the M unit waveforms yB[1] to yB[M] output by the period correction unit 74 as a reference waveform yREF. FIG. 10 illustrates the case in which the unit waveform yB[1] shown in part (A) is the reference waveform yREF. For each of the (M-1) unit waveforms yB[m] other than the reference waveform yREF, the phase correction unit 76 calculates the cross-correlation function Fm(τ) with the reference waveform yREF. The variable τ is the time difference (shift amount) of the unit waveform yB[m] relative to the reference waveform yREF. As illustrated in part (C) of FIG. 10, the phase correction unit 76 generates the unit waveform u[m] by moving the start point ts of the unit waveform yB[m] along the time axis by the value of τ at which the cross-correlation function Fm(τ) is maximized (that is, by phase-shifting the unit waveform yB[m]). As shown in part (C) of FIG. 10, the section of the unit waveform yB[m] that precedes the moved start point ts is appended to the end of the unit waveform yB[m]. Alternatively, the waveform extraction unit 62 may extract two periods of the speech waveform Vb as each unit waveform x[m], and the phase correction unit 76 may extract as the unit waveform u[m] the one period that begins at the point reached when the time τ maximizing the cross-correlation function Fm(τ) has elapsed from the start point ts of the unit waveform yB[m].
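
The phase correction can be sketched as follows, assuming that the cross-correlation is evaluated circularly over one period (consistent with appending the section before the new start point to the end of the waveform). The function names are hypothetical.

```python
import math

def best_circular_shift(ref, y):
    """Return the shift tau (in samples) that maximizes the circular cross-correlation."""
    t0 = len(ref)

    def corr(tau):
        return sum(ref[i] * y[(i + tau) % t0] for i in range(t0))

    return max(range(t0), key=corr)

def align_phase(ref, y):
    """Move the start point of y by tau; the part before it is appended to the end."""
    tau = best_circular_shift(ref, y)
    return y[tau:] + y[:tau]

ref = [math.sin(2 * math.pi * i / 16) for i in range(16)]
shifted = ref[-5:] + ref[:-5]             # same period, different start point
print(best_circular_shift(ref, shifted))  # 5
print(align_phase(ref, shifted) == ref)   # True
```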

As understood from the above, the M unit waveforms u[1] to u[M] used in the first embodiment share a common peak-to-peak value A0 and time length T0, and their phases are adjusted so that the cross-correlation function Fm(τ) is maximized. The M unit waveforms u[1] to u[M] generated by the waveform correction unit 64 are stored in the storage device 52 as shown in FIG. 7 and are transferred to the storage device 12 of the speech synthesizer 100 of the first embodiment via, for example, a communication network or a portable recording medium.

In the second embodiment, the peak-to-peak values of the M unit waveforms u[1] to u[M] are adjusted to the predetermined value A0. Compared with a configuration in which the peak-to-peak value differs from one unit waveform u[m] to another, amplitude fluctuation in the synthesized waveform C[n] (segment waveform Q) generated from the unit waveforms u[m] is therefore suppressed. Likewise, the time lengths of the M unit waveforms u[1] to u[M] are adjusted to the predetermined value T0, so fluctuation of the period (pitch) in the synthesized waveform C[n] generated from the unit waveforms u[m] is suppressed compared with a configuration in which the time lengths of the unit waveforms u[m] differ. It is thus possible to generate speech that sounds natural in the sections of the synthesis target sound that correspond to stationary segments (stationary portions), where amplitude and period vary little.

If the correlation between the unit waveforms u[m] is low, the first unit waveform Ua[n] and the second unit waveform Ub[n] may cancel each other when the first waveform series Sa[n] and the second waveform series Sb[n] are added (crossfaded), and the reproduced sound of the synthesized waveform C[n] may sound unnatural. In the second embodiment, the phase of each unit waveform u[m] is adjusted so that the cross-correlation function Fm(τ) is maximized, so speech that sounds natural can be generated.

The order of processing by the elements of the waveform correction unit 64 may be changed as appropriate. For example, a configuration in which the amplitude correction unit 72 corrects the amplitude after the period correction unit 74 has corrected the period may be adopted. Elements of the waveform correction unit 64 may also be omitted as appropriate. That is, the waveform correction unit 64 is comprehended as an element that includes at least one of the amplitude correction unit 72, the period correction unit 74, and the phase correction unit 76.

<C: Third Embodiment>
As described in the second embodiment, the period correction unit 74 adjusts the period T[m] of each unit waveform yA[m] to the predetermined value T0. The third embodiment is a specific example of the period correction unit 74 that focuses on how the time length of each unit waveform yB[m] (the predetermined length T0) is chosen. FIG. 11 is a block diagram of the period correction unit 74 of the third embodiment, and FIG. 12 is an explanatory diagram of its operation. As shown in FIG. 11, the period correction unit 74 of the third embodiment includes an index calculation unit 742 and a correction processing unit 744.

As shown in FIG. 12, the index calculation unit 742 calculates a distortion index value D[k] (k = 1 to K) for each of a plurality (K) of different candidate lengths X[1] to X[K]. Each candidate length X[k] is a time length that is a candidate for the predetermined length T0 and is set to an integer multiple of the sampling period of the speech waveform Vb. For example, the candidate length X[1] is set to the time length T'[1] of the unit waveform yA'[1] described in the second embodiment, the candidate length X[2] is set to the time length T'[2] of the unit waveform yA'[2], and the candidate length X[3] is set to the time length T'[3] of the unit waveform yA'[3] (K = M = 3). The distortion index value D[k] is an index of the degree of distortion of the unit waveforms yA[m] on the time axis (the degree to which each unit waveform yA[m] is deformed by the stretching or compression) when each of the M unit waveforms yA[1] to yA[M] is stretched or compressed from its initial period T[m] to the common candidate length X[k]. Assuming three unit waveforms yA[m] as in FIG. 12 (M = 3), the distortion index value D[k] is calculated, for example, by the following equation (1).
D[k] = |T[1] - X[k]|/X[k] + |T[2] - X[k]|/X[k] + |T[3] - X[k]|/X[k] ... (1)
As can be understood from equation (1), the larger the difference between the period T[m] of each unit waveform yA[m] and the candidate length X[k] (that is, the larger the deformation of the waveform when it is stretched or compressed to the candidate length X[k]), the larger the distortion index value D[k].

As shown in FIG. 12, the correction processing unit 744 in FIG. 11 selects, as the predetermined length T0, the candidate length X[k] that minimizes the degree of distortion expressed by the distortion index value D[k] among the K candidate lengths X[1] to X[K] (that is, the candidate length X[k] corresponding to the smallest distortion index value D[k]). It then generates the unit waveforms yB[m] by adjusting the time length (period) T[m] of each unit waveform yA[m], as corrected by the amplitude correction unit 72, to the common predetermined length T0. The method of expanding and contracting each unit waveform yA[m] is the same as in the second embodiment.
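The selection performed by the index calculation unit 742 and the correction processing unit 744 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names and the period values are hypothetical, periods are expressed in samples, and equation (1) is generalized from three terms to any number M of unit waveforms.

```python
import math


def distortion_index(periods, candidate):
    """Equation (1) generalized to M waveforms: sum of |T[m] - X[k]| / X[k]."""
    return sum(abs(t - candidate) / candidate for t in periods)


def select_common_length(periods, candidates):
    """Return the candidate X[k] whose distortion index D[k] is smallest."""
    return min(candidates, key=lambda x: distortion_index(periods, x))


# Hypothetical periods T[1..3] in samples; these values are not from the source.
periods = [100.4, 102.7, 101.2]
# Candidates X[k]: each period with its fractional part truncated (K = M = 3).
candidates = [math.floor(t) for t in periods]
t0 = select_common_length(periods, candidates)
print(candidates, t0)
```

With these illustrative values the middle-sized candidate is chosen, because it requires the least total relative stretching of the three waveforms.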

As described above, in the third embodiment the predetermined length T0 of the adjusted unit waveforms yB[m] is set variably so that the degree of expansion and contraction of the M unit waveforms yA[1] to yA[M] (the distortion index value D[k]) is minimized. This has the advantage of reducing the difference between each unit waveform yA[m] before correction by the period correction unit 74 and the corresponding unit waveform yB[m] after correction (that is, the deviation from the acoustic characteristics of the speech waveform Vb).

In the second embodiment, the time length T'[m] of each unit waveform yA'[m] is calculated by truncating the fractional part of the period T[m] of each unit waveform yA[m], but it is also possible to calculate the time length T'[m] by rounding up the fractional part of the period T[m]. Accordingly, in the third embodiment, as illustrated below, both the time length Ta'[m] obtained by truncating the fractional part of the period T[m] of each unit waveform yA[m] and the time length Tb'[m] obtained by rounding up that fractional part can be used as candidate lengths X[k].

For example, the candidate length X[1] is set to the time length Ta'[1] obtained by truncating the fractional part of the period T[1] of the unit waveform yA[1], and the candidate length X[2] is set to the time length Tb'[1] obtained by rounding up the fractional part of the period T[1]. The candidate length X[3] is set to the time length Ta'[2] obtained by truncating the fractional part of the period T[2] of the unit waveform yA[2], and the candidate length X[4] is set to the time length Tb'[2] obtained by rounding up the fractional part of the period T[2]. Similarly, the candidate length X[5] is set to the time length Ta'[3] obtained by truncating the fractional part of the period T[3] of the unit waveform yA[3], and the candidate length X[6] is set to the time length Tb'[3] obtained by rounding up the fractional part of the period T[3]. That is, six candidate lengths X[1] to X[6] are set, one for each combination of a unit waveform yA[m] and truncation or rounding up of its period T[m].

The index calculation unit 742 calculates the distortion index value D[k] (D[1] to D[6]) for each candidate length X[k] using equation (1) above, and the correction processing unit 744 adopts, as the adjusted predetermined length T0, the candidate length X[k] with the smallest distortion index value D[k] among the six candidate lengths X[1] to X[6]. This configuration achieves the same effects as the third embodiment.
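The six-candidate variant can be sketched as follows. The period values and function names are hypothetical and chosen only to show a case where a rounded-up length wins; nothing here is taken from the embodiment beyond equation (1) and the truncate/round-up rule.

```python
import math


def distortion_index(periods, candidate):
    """Equation (1): sum of |T[m] - X[k]| / X[k] over all unit waveforms."""
    return sum(abs(t - candidate) / candidate for t in periods)


def floor_ceil_candidates(periods):
    """Return [Ta'[1], Tb'[1], Ta'[2], Tb'[2], ...] for the given periods."""
    out = []
    for t in periods:
        out.append(math.floor(t))  # Ta'[m]: fractional part truncated
        out.append(math.ceil(t))   # Tb'[m]: fractional part rounded up
    return out


# Hypothetical periods in samples, all just below 101.
periods = [100.9, 100.8, 100.7]
candidates = floor_ceil_candidates(periods)
t0 = min(candidates, key=lambda x: distortion_index(periods, x))
print(candidates, t0)
```

In this example every truncated length is 100, so the smaller distortion index of 101 is only reachable because the rounded-up lengths are also offered as candidates.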

The method of calculating each distortion index value D[k] may be changed as appropriate. For example, equation (1) above takes the absolute value |T[m] − X[k]| of the difference between the period T[m] and the candidate length X[k] so that each term is non-negative, but each term can instead be made non-negative by squaring the ratio of that difference to the candidate length X[k], as in the following equation (2).
D[k] = {(T[1] − X[k]) / X[k]}² + {(T[2] − X[k]) / X[k]}² + {(T[3] − X[k]) / X[k]}² … (2)
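The two index definitions can select different lengths, because squaring weights a single large deviation more heavily. The sketch below compares equations (1) and (2) on hypothetical data. Using every integer length from 100 to 106 as the candidate set is an assumption made for illustration; the text itself only requires candidates to be integral multiples of the sampling period.

```python
def d_abs(periods, x):
    """Equation (1): sum of absolute relative differences."""
    return sum(abs(t - x) / x for t in periods)


def d_sq(periods, x):
    """Equation (2): sum of squared relative differences."""
    return sum(((t - x) / x) ** 2 for t in periods)


# Hypothetical periods in samples: two equal periods and one outlier.
periods = [100.0, 100.0, 106.0]
# Assumed candidate set: all integer lengths from 100 to 106 samples.
candidates = list(range(100, 107))

best_abs = min(candidates, key=lambda x: d_abs(periods, x))
best_sq = min(candidates, key=lambda x: d_sq(periods, x))
print(best_abs, best_sq)
```

Here the absolute-value index settles on the length shared by the majority of waveforms, while the squared index moves toward the outlier to reduce its large individual term.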

<D: Fourth Embodiment>
FIG. 13 is a block diagram of the waveform correction unit 64 in the fourth embodiment. As shown in FIG. 13, the waveform correction unit 64 of the fourth embodiment adds a distortion correction unit 78 to the elements exemplified in the preceding embodiments (the amplitude correction unit 72, the period correction unit 74, and the phase correction unit 76).

When the period correction unit 74 expands or contracts the period T[m] of each unit waveform yA[m] to the time length T0, the peak-to-peak value A[m] of each unit waveform yB[m] can deviate from the peak-to-peak value A0 obtained immediately after correction by the amplitude correction unit 72 (before correction by the period correction unit 74), depending on the degree of expansion or contraction on the time axis. In other words, distortion arises in each unit waveform yB[m] after correction by the period correction unit 74. Specifically, the longer the time length T0 of the corrected unit waveform yB[m] relative to the period T[m] of the uncorrected unit waveform yA[m] (the greater the degree of expansion), the smaller the peak-to-peak value A[m] of the unit waveform yB[m] becomes relative to the peak-to-peak value A0. Conversely, the shorter the time length T0 relative to the period T[m] (the greater the degree of contraction), the larger the peak-to-peak value A[m] becomes relative to the peak-to-peak value A0. In view of this tendency, the distortion correction unit 78 of the fourth embodiment corrects the waveform distortion described above by adjusting the peak-to-peak value A[m] of each unit waveform yB[m] corrected by the period correction unit 74.

Specifically, the distortion correction unit 78 applies the ratio of the time length T0 to the initial period T[m] of the unit waveform yA[m] (T0 / T[m]) as a correction value to the peak-to-peak value A[m] of the unit waveform yB[m] corrected by the period correction unit 74 (typically by multiplication). As understood from the above, the longer the time length T0 of the corrected unit waveform yB[m] relative to the period T[m] of the uncorrected unit waveform yA[m] (the greater the degree of expansion by the period correction unit 74), the larger the value to which the peak-to-peak value A[m] of the unit waveform yB[m] is corrected by the distortion correction unit 78. This has the advantage of suppressing the waveform distortion caused by the correction performed by the period correction unit 74. The processing in which the phase correction unit 76 corrects each unit waveform yB[m] corrected by the distortion correction unit 78 to generate each unit waveform u[m] is the same as in the second embodiment.
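The multiplication by T0 / T[m] can be sketched as a simple gain applied to every sample, which scales the peak-to-peak value by the same factor. This is an illustrative assumption about how the correction value is applied; the function names and sample values are hypothetical, and the earlier time-stretching step that produced yB[m] is outside the sketch.

```python
def peak_to_peak(waveform):
    """Difference between the largest and smallest sample values."""
    return max(waveform) - min(waveform)


def correct_distortion(yb, original_period, t0):
    """Scale a stretched unit waveform yB[m] by the ratio T0 / T[m]."""
    gain = t0 / original_period
    return [gain * sample for sample in yb]


# Hypothetical stretched waveform of length T0 = 8 samples whose
# original period T[m] was 7.6 samples (so it was slightly expanded).
yb = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
corrected = correct_distortion(yb, original_period=7.6, t0=8)
print(peak_to_peak(yb), peak_to_peak(corrected))
```

Because the waveform in this example was expanded (T0 > T[m]), the gain exceeds 1 and the peak-to-peak value is raised; for a contracted waveform the same function lowers it.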

In the fourth embodiment described above, the peak-to-peak value A[m] of each unit waveform yB[m] is corrected according to the degree of expansion or contraction of the unit waveform yA[m] by the period correction unit 74, which has the advantage that unit waveforms u[m] faithfully reflecting the acoustic characteristics of the speech waveform Vb can be generated. The method of selecting the predetermined length T0 in the fourth embodiment is arbitrary; for example, the third embodiment described above, in which the time length T0 is set according to the distortion index value D[k], is suitably adopted.

<C: Fifth Embodiment>
In the fifth embodiment, the speech synthesis unit 28 of the first embodiment is replaced with the speech synthesis unit 28A of FIG. 14. As shown in FIG. 14, the speech synthesis unit 28A includes a synthesis processing unit 82, an anharmonic component generation unit 84, a filter unit 86, and a synthesis unit 88. The synthesis processing unit 82 operates in the same manner as the speech synthesis unit 28 of the first embodiment to generate an audio signal HA. The audio signal HA corresponds to the audio signal SOUT of the first embodiment and is rich in harmonic components (the fundamental component and overtone components) corresponding to the pitch Zb and volume Ze specified by the synthesis information Z. The reproduced sound of an audio signal HA this rich in harmonic components may give an artificial impression. In the fifth embodiment, therefore, the audio signal SOUT is generated by adding an anharmonic component HB to the audio signal HA.

The anharmonic component generation unit 84 generates an anharmonic component H0. The anharmonic component H0 is a noise component such as white noise or pink noise. The filter unit 86 generates the anharmonic component HB from the anharmonic component H0. For example, a comb filter that selectively passes the band components of the anharmonic component H0 other than the harmonic frequencies (the fundamental frequency and each overtone frequency) corresponding to the pitch Zb is suitable as the filter unit 86. The synthesis unit 88 generates the audio signal SOUT by adding the audio signal HA generated by the synthesis processing unit 82 and the anharmonic component HB generated by the filter unit 86.
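One possible realization of this signal path is sketched below using only the Python standard library. The specific filter is an assumption: a feed-forward comb y[n] = (x[n] − x[n − N]) / 2 with N = fs / f0 samples has zeros at every multiple of f0 (and also at 0 Hz), so it attenuates the harmonic frequencies and passes the bands between them. The function names, the noise level, and the seed are hypothetical.

```python
import random


def comb_notch(x, delay):
    """Feed-forward comb y[n] = (x[n] - x[n - delay]) / 2.

    Zeros fall at multiples of fs / delay, i.e. at 0 Hz and at each
    harmonic of f0 = fs / delay; the bands in between are passed.
    """
    return [0.5 * (x[n] - (x[n - delay] if n >= delay else 0.0))
            for n in range(len(x))]


def add_anharmonic(ha, f0, fs, level=0.05, seed=0):
    """Return SOUT = HA + level * HB, with HB = comb-filtered white noise H0."""
    rng = random.Random(seed)
    h0 = [rng.uniform(-1.0, 1.0) for _ in ha]  # white noise H0 (assumed uniform)
    delay = round(fs / f0)                     # one pitch period in samples
    hb = comb_notch(h0, delay)                 # anharmonic component HB
    return [a + level * b for a, b in zip(ha, hb)]
```

For example, `add_anharmonic(ha, f0=441.0, fs=44100)` uses a delay of 100 samples. Rounding the delay to an integer means the notches only approximate the harmonics when fs / f0 is not an integer, and a time-varying pitch Zb would need the delay to be updated over time; both points are left out of this sketch.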

In the fifth embodiment described above, the anharmonic component HB is added to the audio signal HA generated by the synthesis processing unit 82. Compared with a configuration in which the audio signal HA alone is output as the audio signal SOUT, this has the advantage of producing speech that sounds more natural to the listener. A configuration in which the filter unit 86 of FIG. 14 is omitted (that is, the anharmonic component H0 is added directly to the audio signal HA) may also be adopted.

<D: Modification>
Each of the above embodiments can be modified in various ways. Specific modifications are exemplified below. Two or more modes arbitrarily selected from the following examples may be combined as appropriate.

(1) Modification 1
The method of generating the composite waveform C[n] using the M unit waveforms u[1] to u[M] may be changed as appropriate. For example, a configuration may be adopted in which unit waveforms u[m] sequentially selected from the M unit waveforms u[1] to u[M] are arranged on the time axis to generate the composite waveform C[n]. As understood from the above description, the speech synthesis unit 28 of the first embodiment is an example of an element (waveform generating means) that generates the audio signal SOUT by arranging the M unit waveforms u[1] to u[M] on the time axis.

In each of the above embodiments, the processing periods R[n] are contiguous on the time axis, but as shown in FIG. 15, a holding period E[n] in which a plurality of unit waveforms u[m] are arranged may be inserted between the processing period R[n] and the immediately following processing period R[n+1]. In the holding period E[n], a plurality of copies of the first unit waveform Ua[n] selected in the immediately preceding processing period R[n] are arranged without any change in intensity. The time length Le[n] of each holding period E[n] may be set, for example, at random, like the time length Lr[n] of the processing period R[n], but may also be set to a common fixed value. As understood from the example of FIG. 15, a configuration in which successive processing periods R[n] are contiguous on the time axis is not essential to the present invention.

(2) Modification 2
The method of setting the processing periods R[n] to mutually different time lengths Lr[n] may be changed as appropriate. For example, the time lengths Lr[1] to Lr[N] of the processing periods R[n] can be made to differ by calculating the time length Lr[n+1] by adding a predetermined value to, or subtracting it from, the time length Lr[n]. In the first embodiment, the variation length d[n] of the time length Lr[n] is set to a random number, but a configuration in which the time length Lr[n] itself is a random number may also be adopted. It is, however, also possible to set the time lengths Lr[1] to Lr[N] to the same length.
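Two of the options named in this modification can be sketched as follows. The unit of length (a count of unit waveforms per processing period), the numeric values, and the function names are assumptions made for illustration only.

```python
import random


def lengths_by_step(first, step, count):
    """Lr[n+1] = Lr[n] + step; a negative step gives the subtraction case."""
    return [first + n * step for n in range(count)]


def lengths_random(low, high, count, seed=0):
    """Each Lr[n] is itself drawn at random from [low, high] (inclusive)."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(count)]


stepped = lengths_by_step(first=20, step=3, count=4)
randomized = lengths_random(low=15, high=30, count=4)
print(stepped, randomized)
```

The stepped schedule guarantees that all lengths differ, whereas the random schedule can repeat a length by chance; which behavior is preferable is not stated in the source.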

(3) Modification 3
The method of selecting the first unit waveform Ua[n] and the second unit waveform Ub[n] for each processing period R[n] is arbitrary. For example, a configuration may be adopted in which the M unit waveforms u[1] to u[M] are selected in order, one per processing period R[n], as the first unit waveform Ua[n]. In the first embodiment, the unit waveform u[m] selected as the first unit waveform Ua[n-1] in the processing period R[n-1] is carried over as the second unit waveform Ub[n] in the immediately following processing period R[n], but both the first unit waveform Ua[n] and the second unit waveform Ub[n] can also be selected independently for each processing period R[n].
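The in-order selection and the carry-over rule can be sketched together with the cross-fade that forms a composite waveform. Several details are assumptions: indices are zero-based, the second unit waveform of the very first period is taken to be the last stored waveform, and the intensity is changed linearly in steps of one unit waveform. The function names are hypothetical.

```python
def select_in_order(num_waveforms, num_periods):
    """Return (first, second) index pairs, one per processing period R[n].

    The first unit waveform cycles through the stored waveforms in order;
    the second is the first unit waveform of the preceding period.
    """
    pairs = []
    previous = num_waveforms - 1  # assumed choice for the initial period
    for n in range(num_periods):
        first = n % num_waveforms
        pairs.append((first, previous))
        previous = first
    return pairs


def crossfade_period(ua, ub, repeats):
    """Add a rising series of ua and a falling series of ub, sample by sample."""
    out = []
    for r in range(repeats):
        w = (r + 1) / repeats  # weight of ua rises; weight of ub (1 - w) falls
        out.extend(w * a + (1.0 - w) * b for a, b in zip(ua, ub))
    return out


pairs = select_in_order(num_waveforms=3, num_periods=4)
composite = crossfade_period([1.0, 1.0], [0.0, 0.0], repeats=4)
print(pairs, composite)
```

With at least two stored waveforms, this ordering never pairs a waveform with itself, which keeps the first and second unit waveforms different within each period.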

(4) Modification 4
In the second embodiment, the speech processing apparatus 200 is separate from the speech synthesizer 100, but the functions of the speech processing apparatus 200 that generate the M unit waveforms u[1] to u[M] from the speech waveform Vb (the waveform extraction unit 62 and the waveform correction unit 64) may also be installed in the speech synthesizer 100.

DESCRIPTION OF SYMBOLS: 100 ... speech synthesizer, 200 ... speech processing apparatus, 10, 50 ... arithmetic processing device, 12, 52 ... storage device, 14 ... input device, 16 ... display device, 18 ... sound emitting device, 22 ... display control unit, 24 ... information generation unit, 26 ... segment selection unit, 28 ... speech synthesis unit, 40 ... editing screen, 42 ... note image, 62 ... waveform extraction unit, 64 ... waveform correction unit, 72 ... amplitude correction unit, 74 ... period correction unit, 76 ... phase correction unit, 82 ... synthesis processing unit, 84 ... anharmonic component generation unit, 86 ... filter unit, 88 ... synthesis unit, u[m] (u[1] to u[M]) ... unit waveform, Ua[n] ... first unit waveform, Ub[n] ... second unit waveform, Sa[n] ... first waveform series, Sb[n] ... second waveform series, C[n] ... composite waveform, Q ... segment waveform, Va, Vb ... speech waveform, SOUT ... audio signal.

Claims

A speech synthesizer comprising:
waveform storage means for storing a plurality of unit waveforms extracted from different positions on the time axis of a voiced sound waveform; and
waveform generating means for generating, for each of a plurality of processing periods, a composite waveform by adding a first waveform series, in which a plurality of first unit waveforms selected from the plurality of unit waveforms are arranged so that their intensity increases with time within the processing period, and a second waveform series, in which a plurality of second unit waveforms, selected from the plurality of unit waveforms and different from the first unit waveform, are arranged so that their intensity decreases with time within the processing period.

The speech synthesizer according to claim 1, wherein each of the plurality of unit waveforms corresponds to one cycle of the speech waveform.

The speech synthesizer according to claim 1, wherein the peak-to-peak value and the time length are common to the plurality of unit waveforms, and the phase of each of the plurality of unit waveforms is adjusted so that a cross-correlation function between the unit waveforms is maximized.

The speech synthesizer according to any one of claims 1 to 3, wherein the first unit waveform of one processing period among the plurality of processing periods and the second unit waveform of the processing period immediately following that processing period are the same unit waveform.

The speech synthesizer according to any one of claims 1 to 4, wherein the waveform generating means selects the first unit waveform at random from the plurality of unit waveforms for each processing period, and the time length of each of the plurality of processing periods is set at random.

A speech processing device for generating the plurality of unit waveforms used in the speech synthesizer according to claim 1, the speech processing device comprising:
waveform extraction means for extracting a plurality of unit waveforms from different positions on the time axis of a voiced sound waveform; and
waveform correcting means for correcting the plurality of unit waveforms extracted by the waveform extraction means so that the acoustic characteristics of the unit waveforms approach one another.