JP2005523478A

JP2005523478A - How to synthesize speech

Info

Publication number: JP2005523478A
Application number: JP2003586870A
Authority: JP
Inventors: Ercan F. Gigi
Original assignee: Koninklijke Philips Electronics NV
Current assignee: Koninklijke Philips NV
Priority date: 2002-04-19
Filing date: 2003-04-01
Publication date: 2005-08-04
Anticipated expiration: 2023-04-01
Also published as: US20050131679A1; EP1500080A1; JP4451665B2; DE60316678D1; CN100508025C; AU2003215851A1; DE60316678T2; ATE374990T1; EP1500080B1; CN1647152A; WO2003090205A1; US7822599B2

Abstract

The present invention relates to a method for analyzing speech, the method comprising the steps of: a) inputting a speech signal, b) obtaining the first harmonic of the speech signal, c) determining the phase difference Δφ between the speech signal and the first harmonic.

Description

The present invention relates to the field of speech analysis and synthesis, and in particular, without limitation, to the field of text-to-speech synthesis.

The function of a text-to-speech (TTS) synthesis system is to synthesize speech from generic text in a given language. Today, TTS systems are in practical use in many applications, such as access to databases through a telephone network or aid to handicapped people. One method of synthesizing speech is to concatenate elements of a recorded set of speech subunits, such as demi-syllables or polyphones. The majority of successful commercial systems use the concatenation of polyphones. Polyphones comprise groups of two (diphones), three (triphones) or more phones, and can be determined from nonsense words by segmenting the desired grouping of phones at stable spectral regions. In concatenation-based synthesis, preserving the transition between two adjacent phones is crucial to the quality of the synthesized speech. With polyphones as the basic subunits, the transition between two adjacent phones is preserved in the recorded subunits, and the concatenation is carried out between similar phones.

Before synthesis, however, the phones must have their duration and pitch modified in order to satisfy the prosodic constraints of the new words that contain them. This processing is necessary to avoid producing monotonous-sounding synthesized speech. In a TTS system, this function is performed by a prosodic module. To allow duration and pitch modifications in the recorded subunits, many concatenation-based TTS systems use the time-domain pitch-synchronous overlap-add (TD-PSOLA) model of synthesis (E. Moulines and F. Charpentier, "Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Commun., vol. 9, pp. 453-467, 1990).

In the TD-PSOLA model, the speech signal is first submitted to a pitch marking algorithm. This algorithm assigns marks at the peaks of the signal in voiced segments and assigns marks 10 ms apart in unvoiced segments. Synthesis is performed by superposing Hanning-windowed segments that are centered on the pitch marks and extend from the previous pitch mark to the next one. Duration is modified by deleting or repeating some of the windowed segments. The pitch period is modified by increasing or decreasing the overlap between windowed segments.

Despite its success in many commercial TTS systems, the synthetic speech produced by the TD-PSOLA model of synthesis has some drawbacks, mainly under large prosodic variations, as outlined below.
1. Pitch modifications introduce a duration modification that needs to be compensated appropriately.
2. Duration modification can only be carried out in a quantized manner, with a resolution of one pitch period (α = ..., 1/2, 2/3, 3/4, ..., 4/3, 3/2, 2/1, ...).
3. When the duration of unvoiced portions is lengthened, the repetition of segments can introduce "metallic" artifacts (the synthesized speech sounds metallic).

A hybrid model for concatenation-based text-to-speech synthesis is described in "Hybrid model for text-to-speech synthesis" by Fabio Violaro and Olivier Boeffard, IEEE Transactions on Speech and Audio Processing, vol. 6, no. 5, September 1998.

The speech signal is submitted to a pitch-synchronous analysis and decomposed into a harmonic component, with a variable maximum frequency, plus a noise component. The harmonic component is modeled as a sum of sinusoids whose frequencies are multiples of the pitch. The noise component is modeled as a random excitation applied to an LPC filter. In unvoiced segments, the harmonic component is set to zero. When the pitch is modified, a new set of harmonic parameters is evaluated by resampling the spectral envelope at the new harmonic frequencies. For the synthesis of the harmonic component in the presence of duration and/or pitch modifications, a phase correction is introduced into the harmonic parameters.
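The harmonic part of such a model can be illustrated with a short sketch. This is not the cited authors' implementation: the function name, the cosine convention and the example amplitudes are assumptions chosen for illustration, and the noise component (random excitation through an LPC filter) is left out.

```python
import math

def harmonic_component(f0_hz, amplitudes, phases, sample_rate_hz, num_samples):
    """Sum of sinusoids at integer multiples of the pitch f0_hz.

    amplitudes[k-1] and phases[k-1] belong to the k-th harmonic.
    """
    out = []
    for n in range(num_samples):
        t = n / sample_rate_hz
        value = 0.0
        for k, (amp, phase) in enumerate(zip(amplitudes, phases), start=1):
            value += amp * math.cos(2.0 * math.pi * k * f0_hz * t + phase)
        out.append(value)
    return out

# Two harmonics of a 100 Hz pitch sampled at 8 kHz (illustrative values only).
frame = harmonic_component(100.0, [1.0, 0.5], [0.0, 0.0], 8000.0, 80)
```

Changing the pitch in this representation means evaluating the same sum at a new `f0_hz`, with amplitudes read from the spectral envelope at the new harmonic frequencies.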

A variety of other so-called "overlap and add" methods are known, for example PIOLA (Pitch Inflected OverLap and Add) [P. Meyer, H. W. Ruhl, R. Kruger, M. Kugler, L. L. M. Vogten, A. Dirksen, and K. Belhoula, "PHRITTS: A text-to-speech synthesizer for the German language," Eurospeech '93, pp. 877-890, Berlin, 1993] and PICOLA (Pointer Interval Controlled OverLap and Add) [Morita, "A study on speech expansion and contraction on time axis," Master's thesis, Nagoya University, Japan (1987)]. These methods differ from each other in the way they mark the pitch period positions.

None of these methods gives satisfactory results when used as a mixer for two different waveforms. The problem is phase mismatch. The phases of the harmonics are affected by the recording equipment, room acoustics, distance to the microphone, vowel color, co-articulation effects and so on. Some of these factors, such as the recording environment, can be kept unchanged, but others, such as co-articulation effects, are very difficult (if not impossible) to control. As a result, when pitch period positions are marked without taking the phase information into account, the synthesis quality suffers from phase mismatches.

Other methods, such as MBR-PSOLA (Multi-Band Resynthesis Pitch-Synchronous OverLap and Add) [T. Dutoit and H. Leich, "MBR-PSOLA: Text-to-speech synthesis based on an MBE re-synthesis of the segments database," Speech Communication, 1993], regenerate the phase information to avoid phase mismatches. However, this involves an extra analysis-synthesis operation that reduces the naturalness of the generated speech. The synthesis often sounds mechanical.

U.S. Pat. No. 5,787,398 shows an apparatus for synthesizing speech by varying the pitch. One disadvantage of this approach is that phase distortion occurs, because the pitch marks are centered on the excitation peaks and the measured excitation peaks do not necessarily have a synchronous phase.

The pitch of the synthesized speech signal is varied by separating the speech signal into a spectral component and an excitation component. The excitation component is multiplied by a series of overlapping window functions which, in the case of voiced speech, are synchronous with pitch timing mark information corresponding at least approximately to the instants of vocal excitation. This separates it into windowed speech segments, which are added together again after a controllable time shift has been applied. The spectral and excitation components are then recombined. The multiplication uses at least two windows per pitch period, each with a duration of less than one pitch period.

U.S. Pat. No. 5,081,681 shows a class of methods and related techniques for determining the phase of each harmonic from the fundamental frequency of voiced speech. Applications include speech coding, speech enhancement and time-scale modification of speech. The basic approach is to recreate the phase signals from the fundamental frequency and voiced/unvoiced information, and to add a random component to the recreated phase signal in order to improve the quality of the synthesized speech.

U.S. Pat. No. 5,081,681 thus describes a method of phase synthesis for speech processing. Because the phase is synthetic, the result does not sound natural: many aspects of the human voice and of the surrounding acoustics are ignored by the synthesis.

The present invention provides a method for the analysis of speech, in particular natural speech. The method according to the invention is based on the discovery that the phase difference between a speech signal, in particular a diphone speech signal, and the first harmonic of that speech signal is a speaker-dependent parameter that is essentially constant across different diphones.

In a preferred embodiment of the invention, this phase difference is obtained by determining a maximum of the speech signal and determining phase zero, that is, the positive zero crossing of the first harmonic. The difference between the phase of the maximum and phase zero is the speaker-dependent phase difference parameter.

In one application, this parameter serves as the basis for determining a window function, such as a raised cosine or a triangular window. Preferably, the window function is centered on the phase angle given by the zero phase of the first harmonic plus the phase difference. Preferably, the window function has its maximum at that phase angle. For example, the window function is chosen to be symmetric with respect to that phase angle.

For speech synthesis, diphone samples are windowed by means of the window function, with the window function and the diphone sample to be windowed offset by the phase difference.

The diphone samples windowed in this way are concatenated. The natural phase information is thereby preserved, so that the result of the speech synthesis sounds quasi-natural.

According to a preferred embodiment of the invention, control information indicating the diphones and a pitch contour is provided. Such control information can be provided, for example, by the language processing module of a text-to-speech system.

A particular advantage of the invention over other time-domain overlap-and-add methods is that the pitch period (or pitch pulse) positions are synchronized by the phase of the first harmonic.

The pitch information can be obtained by low-pass filtering the original speech signal to isolate its first harmonic and using the positive zero crossings as indicators of zero phase. In this way, phase discontinuity artifacts are avoided without changing the original phase information.

Applications of the speech synthesis method and speech synthesis apparatus of the invention include telecommunication services, language education, aids for handicapped persons, talking books and toys, vocal monitoring, multimedia, and man-machine communication.

The following preferred embodiments of the invention are described in more detail with reference to the drawings.

The flowchart of FIG. 1 illustrates a method for speech analysis according to the invention. In step 101, natural speech is input. A known training sequence of nonsense words can be used for the natural speech input. In step 102, diphones are extracted from the natural speech. The diphones are cut out of the natural speech and consist of the transition from one phoneme to another.

In the next step 103, at least one of the diphones is low-pass filtered to obtain the first harmonic of the diphone. This first harmonic is a speaker-dependent characteristic that can be kept constant during the recordings.

In step 104, the phase difference between the first harmonic and the diphone is determined. This phase difference is a speaker-specific speech parameter. The parameter is useful for speech synthesis, as explained in more detail with reference to FIGS. 3 to 10.

FIG. 2 illustrates one way of determining the phase difference between the first harmonic and the diphone (see step 104 of FIG. 1). A sound wave 201 obtained from natural speech forms the basis for the analysis. The sound wave 201 is low-pass filtered with a cutoff frequency of about 150 Hz in order to obtain the first harmonic 202 of the sound wave 201. The positive zero crossings of the first harmonic 202 define the phase angle zero. As shown in FIG. 2, the first harmonic 202 covers 19 consecutive complete periods. In the example considered here, the duration of the periods increases slightly from period 1 to period 19. For one of the periods, the local maximum of the sound wave 201 within that period is determined.
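The filtering and zero-crossing step can be sketched as follows. The patent does not specify a filter design, so the Hann-windowed sinc filter, its length, and all function names below are assumptions. The taps are symmetric and applied centered, so the filter itself adds no phase shift to the first harmonic.

```python
import math

def lowpass_fir(signal, cutoff_hz, sample_rate_hz, half_width=200):
    """Zero-delay low-pass filter with Hann-windowed sinc taps (assumed design)."""
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    taps = []
    for k in range(-half_width, half_width + 1):
        sinc = 2.0 * fc if k == 0 else math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        hann = 0.5 * (1.0 + math.cos(math.pi * k / (half_width + 1)))
        taps.append(sinc * hann)
    gain = sum(taps)
    taps = [tap / gain for tap in taps]  # unity gain at 0 Hz
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for j, tap in enumerate(taps):
            idx = n + j - half_width
            if 0 <= idx < len(signal):  # samples outside the signal count as zero
                acc += tap * signal[idx]
        out.append(acc)
    return out

def positive_zero_crossings(signal):
    """Fractional sample positions where the signal rises through zero."""
    crossings = []
    for n in range(1, len(signal)):
        a, b = signal[n - 1], signal[n]
        if a < 0.0 <= b:
            crossings.append((n - 1) + a / (a - b))  # linear interpolation
    return crossings

# Toy input: a 100 Hz fundamental plus a 300 Hz harmonic, sampled at 8 kHz.
sample_rate = 8000.0
wave = [math.sin(2.0 * math.pi * 100.0 * n / sample_rate)
        + 0.8 * math.sin(2.0 * math.pi * 300.0 * n / sample_rate + 1.0)
        for n in range(1600)]
first_harmonic = lowpass_fir(wave, 150.0, sample_rate)
zero_phase_positions = positive_zero_crossings(first_harmonic)
```

Away from the signal edges, where the filter sees fewer samples, the detected positions fall one pitch period (80 samples in this toy case) apart.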

For example, the local maximum of the sound wave 201 within period 1 is the maximum 203. In FIG. 2, the phase of the maximum 203 within period 1 is denoted φ_max. The difference Δφ between φ_max and the zero phase φ_0 of period 1 is the speaker-dependent speech parameter. In the example considered here, this phase difference is about 0.3π. Note that this phase difference is approximately constant regardless of which maximum is used to determine it. For this measurement, however, it is preferable to choose a period with a distinctive maximum-energy location. For example, if the maximum 204 within period 9 is used to perform the analysis, the resulting phase difference is approximately the same as for period 1.
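As a minimal sketch of this measurement, the phase difference (written Δφ here) can be computed from the position of the waveform maximum inside one period of the first harmonic. The function name and the toy waveform are assumptions, and the period boundaries are supplied directly instead of being measured from a filtered signal.

```python
import math

def phase_difference(signal, period_start, period_end):
    """Phase of the waveform maximum relative to phase zero of one period.

    period_start and period_end are two consecutive positive zero crossings of
    the first harmonic, in samples. The result is in radians in [0, 2*pi).
    """
    candidates = range(math.ceil(period_start), math.ceil(period_end))
    peak = max(candidates, key=lambda n: signal[n])
    return 2.0 * math.pi * (peak - period_start) / (period_end - period_start)

# Toy waveform with an 80-sample period and a pulse 12 samples after phase zero.
period = 80
wave = [math.sin(2.0 * math.pi * n / period)
        + 2.0 * math.exp(-(((n % period) - 12) / 3.0) ** 2)
        for n in range(3 * period)]
delta_phi = phase_difference(wave, 80.0, 160.0)  # 12/80 of a period, i.e. 0.3*pi
```

The toy pulse position was chosen so that the result matches the 0.3π of the example in the text. With real speech the value would come from the recorded speaker.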

FIG. 3 illustrates an application of the speech synthesis method of the invention. In step 301, diphones obtained from natural speech are windowed by a window function that has its maximum at φ_0 + Δφ. For example, a raised cosine centered on the phase φ_0 + Δφ can be chosen.

In this way, pitch bells of the diphones are provided in step 302. In step 303, speech information is input. This can be information obtained from natural speech or from a text-to-speech system, for example from the language processing module of such a system.

Pitch bells are selected in accordance with the speech information. For example, the speech information contains information on the diphones and on the pitch contour to be synthesized. In this case the pitch bells are selected accordingly in step 304, so that the concatenation of the pitch bells in step 305 results in the desired speech output in step 306.

An application of the method of FIG. 3 is shown by way of example in FIG. 4. FIG. 4 shows a sound wave 401 composed of a number of diphones. The analysis explained above with reference to FIGS. 1 and 2 is applied to the sound wave 401 in order to obtain the zero phase φ_0 for each of the pitch intervals. As in the example of FIG. 2, the zero phase φ_0 is offset from the phase φ_max of the maximum within the pitch interval by an approximately constant phase angle Δφ.

A raised cosine 402 is used to window the sound wave 401. The raised cosine 402 is centered on the phase φ_0 + Δφ. Windowing the sound wave 401 with the raised cosine 402 yields successive pitch bells 403. In this way the diphone waveforms of the sound wave 401 are split into such successive pitch bells 403. Each pitch bell 403 is obtained from two neighboring periods by means of a raised cosine centered on the phase φ_0 + Δφ. The advantage of a raised cosine over a rectangular function is that the edges are smooth. Note that this operation is reversible: overlapping and adding all of the pitch bells 403 in the same order reproduces the original sound wave 401.
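The reversibility of the windowing can be checked with a small sketch. The function names, the center positions and the test waveform are assumptions. In this sketch each bell rises from the previous center to its own center and falls to the next one, so the falling half of one window and the rising half of the next always add up to one, also when the period lengths differ.

```python
import math

def extract_pitch_bells(signal, centers):
    """Window the signal with raised cosines centered on the given sample indices.

    Each bell spans from the previous center to the next center, so it covers
    two neighboring periods. Returns a list of (start_index, samples) pairs.
    """
    bells = []
    for i in range(1, len(centers) - 1):
        left, mid, right = centers[i - 1], centers[i], centers[i + 1]
        samples = []
        for n in range(left, right + 1):
            if n <= mid:
                w = 0.5 * (1.0 - math.cos(math.pi * (n - left) / (mid - left)))
            else:
                w = 0.5 * (1.0 + math.cos(math.pi * (n - mid) / (right - mid)))
            samples.append(w * signal[n])
        bells.append((left, samples))
    return bells

def overlap_add(bells, length):
    """Add the bells back at their original positions."""
    out = [0.0] * length
    for start, samples in bells:
        for offset, value in enumerate(samples):
            out[start + offset] += value
    return out

# Hypothetical window centers (sample indices at phase zero plus the phase difference).
centers = [0, 70, 145, 225, 300, 380]
wave = [math.sin(0.07 * n) + 0.3 * math.sin(0.31 * n) for n in range(381)]
bells = extract_pitch_bells(wave, centers)
rebuilt = overlap_add(bells, len(wave))
```

Between the second and the second-to-last center, `rebuilt` equals `wave` up to rounding error. Outside that range only one window contributes, so the edges are tapered.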

The duration of the sound wave 401 can be changed by repeating or skipping pitch bells 403, and/or its pitch can be changed by moving the pitch bells 403 closer together or further apart. The sound wave 404 is synthesized in this way: the same pitch bell 403 is repeated at a pitch higher than the original pitch in order to raise the original pitch of the sound wave 401. Note that the phases remain intact as a result of this overlap operation, because the earlier windowing operation was performed taking the characteristic phase difference Δφ into account. In this way, the pitch bells 403 can be used as building blocks to synthesize quasi-natural speech.
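Moving the bells closer together can be sketched as an overlap-add at new center positions. The helper name and the toy triangular bell are assumptions made for illustration; real pitch bells would be raised-cosine-windowed speech samples.

```python
def place_bells(bells, new_centers, length):
    """Overlap-add pitch bells so that each bell's maximum lands on a new center.

    bells is a list of (center_offset, samples) pairs, where center_offset is
    the index of the window maximum inside samples.
    """
    out = [0.0] * length
    for (center_offset, samples), center in zip(bells, new_centers):
        start = center - center_offset
        for offset, value in enumerate(samples):
            n = start + offset
            if 0 <= n < length:
                out[n] += value
    return out

bell = (2, [0.0, 0.5, 1.0, 0.5, 0.0])  # toy bell with its maximum at offset 2
original = place_bells([bell] * 3, [2, 4, 6], 9)         # spacing of 2 samples
higher_pitch = place_bells([bell] * 4, [2, 3, 4, 5], 9)  # spacing of 1 sample
```

Halving the spacing doubles the pitch. Keeping roughly the same duration then takes more bells, which is why the same bell is repeated.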

FIG. 5 illustrates one application for the processing of natural speech. In step 501, natural speech of a known speaker is input. This corresponds to the input of a sound wave 401 as shown in FIG. 4. The natural speech is windowed by the raised cosine 402 (see FIG. 4) or by another suitable window function centered with respect to the phase φ_0 + Δφ.

In this way, the natural speech is decomposed into pitch bells (see pitch bells 403 of FIG. 4), which are provided in step 503.

In step 504, the pitch bells provided in step 503 are used as "building blocks" for speech synthesis. One way of processing is to leave the pitch bells themselves unchanged but to omit certain pitch bells or to repeat certain pitch bells. For example, if every fourth pitch bell is omitted, the speed of the speech increases by 25% without otherwise changing the sound of the speech. Likewise, the speech rate can be decreased by repeating certain pitch bells.
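Omitting and repeating pitch bells amounts to simple list manipulation, as this sketch shows. The function names are assumptions, and integers stand in for the pitch bells.

```python
def omit_every_nth(bells, n):
    """Drop the n-th, 2n-th, ... pitch bell and keep the rest in order."""
    return [bell for index, bell in enumerate(bells, start=1) if index % n != 0]

def repeat_every_nth(bells, n):
    """Use the n-th, 2n-th, ... pitch bell twice and keep the rest in order."""
    out = []
    for index, bell in enumerate(bells, start=1):
        out.append(bell)
        if index % n == 0:
            out.append(bell)
    return out

bells = list(range(1, 13))           # stand-ins for 12 pitch bells
faster = omit_every_nth(bells, 4)    # 9 bells remain
slower = repeat_every_nth(bells, 4)  # 15 bells
```

With every fourth bell omitted, three quarters of the bells remain, so at an unchanged bell spacing the output is a quarter shorter while each remaining bell is untouched.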

Alternatively or in addition, the distance between the pitch bells is modified in order to raise or lower the pitch.

In step 505, the processed pitch bells are overlapped in order to produce a synthetic speech waveform that sounds quasi-natural.

FIG. 6 illustrates another application of the invention. In step 601, speech information is provided. The speech information comprises phonemes, the durations of the phonemes and pitch information. Such speech information can be generated from text by a state-of-the-art text-to-speech processing system.

In step 602, the diphones are extracted from the speech information provided in step 601. In step 603, the required positions of the diphones on the time axis and the pitch contour are determined on the basis of the information provided in step 601.

In step 604, pitch bells are selected in accordance with the timing and pitch requirements determined in step 603. In step 605, the selected pitch bells are concatenated to provide a quasi-natural speech output.

This procedure is further illustrated by the example shown in FIGS. 7 to 9.

FIG. 7 shows the phonetic transcription of the sentence "HELLO WORLD!". The first column 701 of the transcription contains the phonemes in SAMPA standard notation. The second column 702 gives the duration of the individual phonemes in milliseconds. The third column 703 contains pitch information. A pitch movement is denoted by two numbers: the position, as a percentage of the phoneme duration, and the pitch frequency in Hz.
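A transcription with these three columns could be read as in the sketch below. The whitespace-separated line layout and the sample values in the second example line are assumptions, since the content of FIG. 7 is not reproduced in this text.

```python
def parse_transcription_line(line):
    """Split one line into (phoneme, duration_ms, [(position_percent, pitch_hz), ...]).

    Assumed layout: phoneme, duration in ms, then zero or more pairs of
    position (percent of the phoneme duration) and pitch (Hz).
    """
    fields = line.split()
    phoneme = fields[0]
    duration_ms = float(fields[1])
    numbers = [float(x) for x in fields[2:]]
    pitch_points = list(zip(numbers[0::2], numbers[1::2]))
    return phoneme, duration_ms, pitch_points

no_pitch = parse_transcription_line("h 55")
with_pitch = parse_transcription_line("@ 60 0 133 100 139")  # hypothetical values
```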

The synthesis starts with a search in a previously generated database of diphones. The diphones are cut out of real speech and consist of the transition from one phoneme to another. All possible phoneme combinations for a given language have to be stored in this database, together with some extra information such as the phoneme boundaries. If there are multiple databases for different speakers, the choice of a particular speaker can be an extra input to the synthesizer.

FIG. 8 shows the diphones for the sentence "HELLO WORLD!", that is, all phoneme transitions in column 701 of FIG. 7.
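Listing the phoneme transitions is a pairing of neighbors, as in this sketch. The phoneme sequence is an assumed SAMPA-style rendering of the sentence with "_" for silence, and the naming of a diphone as "a-b" is likewise an assumption, because FIGS. 7 and 8 are not reproduced in this text.

```python
def diphone_names(phonemes):
    """Pair every phoneme with its successor to name the transitions."""
    return [first + "-" + second for first, second in zip(phonemes, phonemes[1:])]

# Assumed phoneme sequence for "HELLO WORLD!"; "_" marks silence.
phonemes = ["_", "h", "@", "l", "@U", "w", "3:", "l", "d", "_"]
needed = diphone_names(phonemes)
```

A sequence of N phonemes yields N - 1 diphones, each of which has to be found in the diphone database.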

FIG. 9 shows the result of calculating the positions of the phoneme boundaries, the diphone boundaries and the pitch period positions to be synthesized. The phoneme boundaries are calculated by adding up the phoneme durations. For example, the phoneme "h" starts after 100 ms of silence. The phoneme "schwa" starts after 155 ms = 100 ms + 55 ms, and so on.
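The boundary calculation is a running sum of the durations, sketched below. The first two durations (100 ms of silence, 55 ms for "h") follow from the example in the text. The third value is a placeholder.

```python
from itertools import accumulate

def phoneme_start_times(durations_ms):
    """Start time of each phoneme: the sum of all earlier durations."""
    return [0] + list(accumulate(durations_ms))[:-1]

# Silence, "h", schwa; the 60 ms for the schwa is a placeholder value.
starts = phoneme_start_times([100, 55, 60])
```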

The diphone boundaries are retrieved from the database as a percentage of the phoneme duration. Both the individual phoneme positions and the diphone boundaries are shown in the upper diagram 901 of FIG. 9, in which the starting points of the diphones are indicated. The starting points are calculated on the basis of the phoneme durations given in column 702 and the percentages of the phoneme duration given in column 703.

Diagram 902 of FIG. 9 shows the pitch contour of "HELLO WORLD!". The pitch contour is determined on the basis of the pitch information contained in column 703 (see FIG. 7). For example, if the current pitch position is at 0.25 seconds, the pitch period lies at 50% of the first "l" phoneme. The corresponding pitch lies between 133 Hz and 139 Hz and can be calculated with a simple linear equation.

The next pitch position is then 0.2500 + 1/135.5 = 0.2574 seconds. It is also possible to use a non-linear function, such as the ERB-rate scale, for this calculation. ERB (equivalent rectangular bandwidth) is a scale derived from psychoacoustic measurements (Glasberg and Moore (1990)) that gives a better representation by taking the masking properties of the human ear into account. The conversion from frequency to ERB rate is expressed in terms of f, the frequency in kHz.

The idea is that a pitch change on the ERB-rate scale is perceived by the human ear as a linear change.
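The pitch calculation can be sketched as follows. The two equations are not reproduced in this text, so the sketch rests on assumptions: `fraction` is a hypothetical parameter for the relative position between two pitch targets, and the ERB-rate conversion is the commonly cited Glasberg and Moore form, 21.4 · log10(4.37 · f + 1) with f in kHz, which may differ in detail from the formula in the original document. The 135.5 Hz used for the next pitch position is taken from the text as given.

```python
import math

def interpolate_linear(f_start_hz, f_end_hz, fraction):
    """Pitch at a relative position (0..1) between two pitch targets."""
    return f_start_hz + fraction * (f_end_hz - f_start_hz)

def hz_to_erb_rate(f_hz):
    """ERB-rate value for a frequency (assumed Glasberg and Moore form)."""
    f_khz = f_hz / 1000.0
    return 21.4 * math.log10(4.37 * f_khz + 1.0)

def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return 1000.0 * (10.0 ** (erb_rate / 21.4) - 1.0) / 4.37

def interpolate_erb(f_start_hz, f_end_hz, fraction):
    """Interpolate linearly on the ERB-rate scale instead of in Hz."""
    erb = interpolate_linear(hz_to_erb_rate(f_start_hz), hz_to_erb_rate(f_end_hz), fraction)
    return erb_rate_to_hz(erb)

def next_pitch_position(t_seconds, pitch_hz):
    """The next pitch mark lies one pitch period after the current one."""
    return t_seconds + 1.0 / pitch_hz

t_next = next_pitch_position(0.2500, 135.5)  # about 0.2574 s, as in the text
```

Between 133 Hz and 139 Hz the two interpolation schemes differ only slightly. The difference grows with the size of the pitch movement.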

Note that unvoiced regions are also marked with pitch period positions, even though unvoiced portions have no pitch.

The varying pitch is given by the pitch contour in diagram 902 and is also illustrated in diagram 901 by the vertical lines 903, which have varying spacing. The greater the distance between two lines 903, the lower the pitch. The phoneme, diphone and pitch information given in diagrams 901 and 902 is the specification for the speech to be synthesized. Diphone samples, that is, pitch bells (see pitch bells 403 of FIG. 4), are taken from a diphone database. For each diphone, a number of such pitch bells for that diphone are concatenated. The number of pitch bells corresponds to the duration of the diphone, and the spacing between the pitch bells corresponds to the required pitch frequency as given by the pitch contour in diagram 902.
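The relation between the pitch contour, the spacing of the marks and the number of pitch bells can be sketched by stepping through a time span one pitch period at a time. The function name and both example contours are assumptions.

```python
def pitch_mark_positions(start_s, end_s, pitch_at):
    """Pitch mark times in [start_s, end_s), each one period after the last.

    pitch_at is a function returning the desired pitch in Hz at a given time.
    """
    marks = []
    t = start_s
    while t < end_s:
        marks.append(t)
        t += 1.0 / pitch_at(t)
    return marks

# Constant 100 Hz over 45 ms: one mark every 10 ms.
steady = pitch_mark_positions(0.0, 0.045, lambda t: 100.0)

# Falling contour (hypothetical): the marks move further apart as the pitch drops.
falling = pitch_mark_positions(0.0, 0.1, lambda t: 140.0 - 100.0 * t)
```

The number of marks inside a diphone's time span is the number of pitch bells to place there, and the mark times are the positions of the bell centers.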

The result of concatenating all the pitch bells is quasi-natural synthesized speech. This is because the invention prevents phase-related discontinuities at the diphone boundaries. This contrasts with the prior art, in which such discontinuities are unavoidable because of phase mismatches between the pitch periods.

The prosody (pitch and duration) is also correct, since the durations on both sides of each diphone have been adjusted properly. The pitch also matches the desired pitch contour function.

FIG. 10 shows an apparatus 950, such as a personal computer, that has been programmed to implement the invention. The apparatus 950 has a speech analysis module 951, which serves to determine the characteristic phase difference Δφ. For this purpose the speech analysis module 951 has a memory 952 for storing one diphone speech wave. A single diphone is sufficient to obtain the constant phase difference Δφ.

The speech analysis module 951 further has a low-pass filter module 953. The low-pass filter module 953 has a cutoff frequency of about 150 Hz, or another suitable cutoff frequency, in order to filter out the first harmonic of the diphone stored in the memory 952.

Module 954 of device 950 determines the distance between the position of maximum energy within a given period of the diphone and the zero-phase position of its first harmonic; this distance is converted into the phase difference Δφ. As shown in the example of FIG. 2, this can be done by determining the phase difference between the zero phase given by the positive zero crossing of the first harmonic and the maximum of the diphone within that period of the harmonic.
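The measurement performed by module 954 can be illustrated with a short sketch. It assumes that the first harmonic has already been extracted and is sample-aligned with the diphone signal, and it uses the largest sample value in the period as a stand-in for the position of maximum energy. The function name `phase_difference` is hypothetical.

```python
import math


def phase_difference(signal, first_harmonic):
    """Return (phase in radians, peak index) for the first full period.

    Phase zero is taken at a positive-going zero crossing of the first
    harmonic; the phase is the offset of the signal maximum within the
    period that starts there.
    """
    crossings = [n for n in range(1, len(first_harmonic))
                 if first_harmonic[n - 1] < 0.0 <= first_harmonic[n]]
    if len(crossings) < 2:
        raise ValueError("need at least one full period of the first harmonic")
    start, end = crossings[0], crossings[1]
    period = end - start
    # Assumption: the sample with the largest value marks the maximum.
    peak = max(range(start, end), key=lambda n: signal[n])
    return 2.0 * math.pi * (peak - start) / period, peak
```

The distance `peak - start` is the quantity measured in samples; dividing by the period length and multiplying by 2π converts it into a phase angle.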

As a result of the speech analysis, speech analysis module 951 provides the characteristic phase difference Δφ and thus, for every diphone in the database, the period positions on which, for example, a raised cosine window is centered to obtain the pitch bells. The phase difference Δφ is stored in storage unit 955.
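Centering a raised cosine window on a period position might look like the sketch below. The window length of two pitch periods is an assumption borrowed from common overlap-add practice and is not stated in this passage; the function names are hypothetical.

```python
import math


def raised_cosine(length):
    """Raised cosine (Hann-shaped) window with zeros at both ends."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def pitch_bell(signal, centre, period):
    """Cut one pitch bell out of `signal`, centered on sample `centre`.

    Assumption: the window spans one period on each side of the centre.
    Samples outside the signal are treated as zero.
    """
    length = 2 * period + 1
    window = raised_cosine(length)
    bell = []
    for k in range(length):
        n = centre - period + k
        sample = signal[n] if 0 <= n < len(signal) else 0.0
        bell.append(window[k] * sample)
    return bell
```

In this sketch `centre` would be the period position derived from Δφ, so that every bell in the database is cut at the same phase angle.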

Device 950 also has a speech synthesis module 956. Speech synthesis module 956 has a storage unit 957 for storing pitch bells, i.e. diphone samples windowed by a window function, as also shown in FIG. 2. Note that storage unit 957 need not hold pitch bells. Entire diphones can be stored together with their period position information, or the diphones can be monotonized to a constant pitch. In either case the pitch bells can be obtained from the database by applying the window function in the synthesis module.

Module 958 selects pitch bells and adapts them to the required pitch. It does so on the basis of control information supplied to module 958.

Module 959 concatenates the pitch bells selected in module 958 so that module 960 can provide the speech output.
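The concatenation performed by module 959 can be pictured as an overlap-add of the selected pitch bells at their target positions. The sketch below assumes overlap-add, which this passage does not spell out, and the function name `overlap_add` is hypothetical.

```python
def overlap_add(bells, centres, length):
    """Sum pitch bells into one output buffer of `length` samples.

    bells   -- list of windowed sample sequences (pitch bells)
    centres -- output sample index on which each bell is centered
    """
    out = [0.0] * length
    for bell, centre in zip(bells, centres):
        half = len(bell) // 2
        for k, value in enumerate(bell):
            n = centre - half + k
            if 0 <= n < length:
                # Overlapping bells add up sample by sample.
                out[n] += value
    return out


# Usage: two short bells placed two samples apart overlap at index 3.
speech = overlap_add([[0.5, 1.0, 0.5], [0.5, 1.0, 0.5]], [2, 4], 7)
```

Because every bell was cut at the same phase angle, neighbouring bells from different diphones line up when they are summed, which is how the phase-related discontinuities at diphone boundaries are avoided.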

FIG. 1 shows a flowchart of a method for determining the phase difference between a diphone and its first harmonic.
FIG. 2 shows a signal diagram illustrating an example application of the method of FIG. 1.
FIG. 3 shows an embodiment of the method of the invention for synthesizing speech.
FIG. 4 shows an example application of the method of FIG. 3.
FIG. 5 shows an application of the invention to the processing of natural speech.
FIG. 6 shows an application of the invention to text-to-speech synthesis.
FIG. 7 is an example of a file containing phonetic information.
FIG. 8 is an example of a file containing diphone information extracted from the file of FIG. 7.
FIG. 9 shows the result of processing the files of FIG. 7 and FIG. 8.
FIG. 10 shows a block diagram of a speech analysis and synthesis apparatus according to the invention.

Explanation of symbols

Sound wave 201
First harmonic 202
Maximum 203
Maximum 204
Sound wave 401
Raised cosine 402
Pitch bell 403
Sound wave 404
Column 701
Column 702
Column 703
Drawing 901
Drawing 902
Device 950
Speech analysis module 951
Storage unit 952
Low-pass filter module 953
Module 954
Storage unit 955
Speech synthesis module 956
Storage unit 957
Module 958
Module 959
Module 960

Claims

1. A method for speech analysis, the method comprising the steps of:
- inputting a speech signal;
- obtaining a first harmonic of the speech signal;
- determining a phase difference between the speech signal and the first harmonic.

2. A method according to claim 1, wherein determining the phase difference comprises:
- determining the position of a maximum of the speech signal;
- determining the phase difference between the maximum and a phase zero of the first harmonic of the speech signal.

3. A method according to claim 1 or 2, wherein the speech signal is a diphone signal.

4. A method of synthesizing speech, the method comprising the steps of:
- selecting diphone samples windowed by a window function centered on a phase angle determined by the phase difference between a speech signal and the first harmonic of the speech signal;
- concatenating the windowed and selected diphone samples.

5. A method according to claim 4, wherein the speech signal is a diphone signal.

6. A method according to claim 4 or 5, wherein the window function is a raised cosine or a triangular window.

7. A method according to claim 4, 5 or 6, further comprising the step of inputting information that represents diphones and a pitch contour, the information forming the basis for the selection of the windowed diphone samples.

8. A method according to any one of claims 4 to 7, wherein the information is provided by a language processing module of a text-to-speech system.

9. A method according to any one of claims 4 to 8, further comprising the steps of:
- inputting speech;
- windowing the speech with the window function to obtain windowed diphone samples.

10. A computer program for carrying out the method according to any one of claims 1 to 9.

11. A speech analysis apparatus comprising:
- means for inputting a speech signal;
- means for obtaining the first harmonic of the speech signal;
- means for determining a phase difference between the speech signal and the first harmonic.

12. The speech analysis apparatus according to claim 11, wherein the means for determining the phase difference determines a maximum of the speech signal and a phase zero of the speech signal in order to determine the phase difference between the maximum of the speech signal and the phase zero.

13. The speech analysis apparatus according to claim 11 or 12, wherein the speech signal is a diphone signal.

14. A speech synthesis apparatus comprising:
- means for selecting diphone samples windowed by a window function centered on a phase angle determined by the phase difference between a speech signal and the first harmonic of the speech signal;
- means for concatenating the windowed and selected diphone samples.

15. The speech synthesis apparatus according to claim 14, wherein the speech signal is a diphone signal.

16. The speech synthesis apparatus according to claim 14 or 15, wherein the window function is a raised cosine or a triangular window.

17. The speech synthesis apparatus according to claim 14, 15 or 16, further comprising means for inputting information representing diphones and a pitch contour, wherein the means for selecting the windowed diphone samples selects on the basis of the information.

18. A text-to-speech system comprising:
- language processing means for providing information representing diphones and a pitch contour;
- speech synthesis means comprising means for selecting, on the basis of the information, diphone samples windowed by a window function centered on a phase angle determined by the phase difference between a speech signal and the first harmonic of the speech signal, and means for concatenating the windowed and selected diphone samples.

19. The text-to-speech system according to claim 18, wherein the window function is a raised cosine or a triangular window.

20. A speech processing system comprising:
- means for inputting a signal comprising a natural speech signal;
- means for windowing the natural speech signal with a window function centered on a phase angle determined by the phase difference between the speech signal and the first harmonic of the speech signal, to provide windowed diphone samples;
- means for processing the windowed diphone samples;
- means for concatenating the windowed and selected diphone samples.