JP2002515608A

JP2002515608A - Method and apparatus for determining spectral speech characteristics of uttered expressions

Info

Publication number: JP2002515608A
Application number: JP2000548866A
Authority: JP
Inventors: Martin Holzapfel (ホルツアプフェル マーティン)
Original assignee: Siemens AG
Current assignee: Siemens AG
Priority date: 1998-05-11
Filing date: 1999-05-03
Publication date: 2002-05-28
Also published as: ATE214831T1; ES2175988T3; DE59901018D1; EP1078354A1; EP1078354B1; WO1999059134A1

Abstract

According to the invention, spectral speech features of a natural-language expression are determined by digitizing the expression and subjecting it to a wavelet transform. The speaker-specific features are obtained from the different transform steps of the wavelet transform. In speech synthesis, these features can be compared with those of other expressions in order to generate a synthetic speech signal that sounds continuous to the human ear. Alternatively, the features can be modified in a targeted manner to counteract perceptual dissonance.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] The present invention relates to a method and an apparatus for determining spectral speech features in an uttered expression.

[0002] In concatenative speech synthesis, individual sounds from a speech data bank are assembled. To obtain a speech flow that sounds natural to the human ear, discontinuities must be avoided at the points where the sounds are joined (concatenation points). A sound here is, for example, a phoneme of a language or a group of several phonemes.

[0003] The wavelet transform is known from [1]. In the wavelet transform, the wavelet filters guarantee that the high-pass component and the low-pass component of each successive transform step together fully reconstruct the signal of the current transform step. From one transform step to the next, the resolution of the high-pass or low-pass component is reduced (technical term: "subsampling"). In particular, because of this subsampling, the number of transform steps is finite.
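The reconstruction guarantee can be illustrated with a minimal sketch of a single transform step. The text does not name a filter pair at this point; the Haar averaging and differencing pair and the function names `analyze_step` and `synthesize_step` are assumptions chosen only for illustration.

```python
def analyze_step(signal):
    """One transform step: split into subsampled low-pass and high-pass parts."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = list(zip(signal[0::2], signal[1::2]))
    low = [(a + b) / 2 for a, b in pairs]    # local average, half the resolution
    high = [(a - b) / 2 for a, b in pairs]   # local difference, half the resolution
    return low, high


def synthesize_step(low, high):
    """Inverse step: rebuild the signal from its low-pass and high-pass parts."""
    signal = []
    for l, h in zip(low, high):
        signal.extend([l + h, l - h])
    return signal


signal = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 8.0]
low, high = analyze_step(signal)
restored = synthesize_step(low, high)   # identical to the input signal
```

Each output has half as many values as the input, which is the subsampling the paragraph describes, and the two outputs together are sufficient to restore the input.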

[0004] The object of the invention is to provide a method and an apparatus for determining spectral speech features, so that, for example, a natural-sounding synthetic speech output can be obtained.

[0005] This object is achieved by the features set out in the characterizing part of claim 1.

[0006] Within the scope of the invention, a method is specified for determining spectral speech features of an uttered expression. For this purpose, the uttered expression is digitized and subjected to a wavelet transform. Speaker-specific features are determined on the basis of the different transform steps of the wavelet transform.

[0007] It is particularly advantageous here that the wavelet transform splits the expression by means of a high-pass filter and a low-pass filter, and that the different high-pass or low-pass components of the different transform steps contain speaker-specific features.

[0008] The individual high-pass or low-pass components of the different transform steps represent specific speaker-specific features, and both the high-pass and the low-pass component of each transform step can be modified, that is, each feature can be modified independently of the others. When the original signal is reassembled from the high-pass and low-pass components of the individual transform steps in the inverse wavelet transform, it is guaranteed that only the desired feature has been changed. A predetermined property of the expression can thus be modified without affecting the rest of the expression.
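The independence of the components can be sketched as follows. This is an illustration under stated assumptions, not the filter bank of the invention: it uses a real-valued Haar cascade, and the names `analyze` and `synthesize` are hypothetical. One high-pass component is modified, the signal is reassembled, and a second analysis shows that the remaining components are unchanged.

```python
def analyze(signal):
    """Cascade: keep splitting the low-pass part until one value (DC) remains."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    low, highs = list(signal), []
    while len(low) > 1:
        pairs = list(zip(low[0::2], low[1::2]))
        highs.append([(a - b) / 2 for a, b in pairs])  # high-pass of this step
        low = [(a + b) / 2 for a, b in pairs]          # low-pass fed to next step
    return low[0], highs


def synthesize(dc, highs):
    """Inverse cascade: rebuild the signal from DC and all high-pass parts."""
    low = [dc]
    for high in reversed(highs):
        rebuilt = []
        for l, h in zip(low, high):
            rebuilt.extend([l + h, l - h])
        low = rebuilt
    return low


signal = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 8.0]
dc, highs = analyze(signal)

# Change only the high-pass component of the second step (halve it).
modified = [list(h) for h in highs]
modified[1] = [0.5 * v for v in modified[1]]
new_signal = synthesize(dc, modified)

# Re-analysing the modified signal shows that only that component changed.
new_dc, new_highs = analyze(new_signal)
```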

[0009] One embodiment is characterized in that, before the wavelet transform, the expression is windowed, that is, a predetermined number of sample values is cut out, and transformed into the frequency domain. A fast Fourier transform (FFT), for example, is used for this purpose.
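A minimal sketch of this preprocessing step, using NumPy. The Hann window, the frame length of 512 samples, the use of the magnitude spectrum, and the function name `short_time_spectrum` are assumptions made for illustration; the text only states that a block of samples is cut out and transformed, for example by an FFT.

```python
import numpy as np


def short_time_spectrum(samples, start, frame_len=512):
    """Cut out one frame, window it, and return its magnitude spectrum."""
    frame = np.asarray(samples[start:start + frame_len], dtype=float)
    if frame.size != frame_len:
        raise ValueError("not enough samples for a full frame")
    windowed = frame * np.hanning(frame_len)   # taper the frame edges
    spectrum = np.fft.rfft(windowed)           # FFT of a real-valued frame
    return np.abs(spectrum)[:frame_len // 2]   # drop the Nyquist bin


# Test tone that falls exactly on frequency bin 32 of a 512-sample frame.
n = np.arange(2048)
samples = np.sin(2 * np.pi * 32 * n / 512)
magnitudes = short_time_spectrum(samples, start=256)
```

With a 512-sample frame the result has 256 values, so its length is a power of two and can be halved repeatedly by a cascade of transform steps.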

[0010] Another embodiment is characterized in that the high-pass component of a transform step is split into a real part and an imaginary part. The high-pass component of the wavelet transform corresponds to the difference signal between the current low-pass component and the low-pass component of the preceding transform step.

[0011] A development is characterized in particular in that the number of transform steps to be performed in the wavelet transform is determined by the condition that the last transform step, consisting of low-pass filters connected in series, contains the DC component of the expression. In this case the signal can be represented completely by its wavelet coefficients. This corresponds to the information of the signal segment being transformed completely into wavelet space.

[0012] In particular if, for example, only the low-pass component is transformed further at each step (by high-pass and low-pass filters), a difference signal remains as the high-pass component of the transform step, as explained above. If the difference signals (high-pass components) are accumulated over the transform steps, the high-pass component accumulated at the last transform step yields the information of the uttered expression without its DC component.

[0013] In an additional development, the following speaker-specific features can be identified:

[0014] a) Fundamental frequency: The fundamental frequency of the expression is identified from the oscillation of the high-pass component in the first or second transform step of the wavelet transform. The fundamental frequency indicates whether the speaker is male or female.

[0015] b) Shape of the spectral envelope: The spectral envelope contains information about the transfer function of the vocal tract during articulation. In voiced regions, the formants dominate the spectral envelope. The high-pass components at the higher transform steps of the wavelet transform contain this spectral envelope.

[0016] c) Spectral tilt (hoarseness, Rauchigkeit): The hoarseness of the voice is identified as a negative slope in the curve of the penultimate low-pass component.

[0017] The speaker-specific features a) to c) are extremely important in speech synthesis. As mentioned at the outset, when concatenative speech synthesis draws on a large number of actually spoken expressions, example sounds are cut out of these expressions and later assembled into new words (synthesized speech). Discontinuities between the assembled sounds are disadvantageous here, because the human ear perceives them as unnatural. To counteract these discontinuities, it is advantageous to determine the perceptually relevant parameters directly and, where appropriate, to compare them and/or adapt them to one another.

[0018] This can be done by direct manipulation, in which a speech sound is adapted in at least one of its speaker-specific features so that it is not perceived as disturbing in the acoustic context of the concatenated sounds. Alternatively, the selection of a suitable sound can be tuned so that the speaker-specific features of the sounds to be joined match one another as well as possible, for example so that the sounds have the same or a similar hoarseness.

[0019] An advantage of the invention is that the spectral envelope reflects the speaker's articulatory tract (Artikulationstrakt) and does not rely on formants, as a pole-position model does, for example. Furthermore, as a non-parametric representation, the wavelet transform loses no data, so the expression can always be reconstructed completely. The data resulting from the individual transform steps of the wavelet transform are linearly independent of one another, can therefore be modified separately, and can later be reassembled (without loss) into the modified expression.

[0020] The invention further provides an apparatus for determining spectral speech features that has a processor unit. The processor unit is configured such that the expression can be digitized. The expression is then subjected to a wavelet transform, and speaker-specific features are determined from the different transform steps.

[0021] This apparatus is particularly suitable for carrying out the method of the invention or one of the developments described above.

[0022] Developments of the invention are set out in the dependent claims.

[0023] Embodiments of the invention are described in detail below with reference to the drawings.

[0024] In the drawings:
FIG. 1 shows a wavelet function;
FIG. 2 shows the wavelet function divided into a real part and an imaginary part;
FIG. 3 shows a cascaded filter structure representing the transform steps of the wavelet transform;
FIG. 4 shows low-pass and high-pass components of different transform steps; and
FIG. 5 shows the steps of concatenative speech synthesis.

[0025] FIG. 1 shows a wavelet function defined by the following equation:

[0026] [Equation 1]

[0027] Here f denotes the frequency, σ the standard deviation, and c a predetermined normalization constant.

[0028] The standard deviation σ is determined, for example, by the predefinable position of the sideband minimum 101 in FIG. 1.

[0029] FIG. 2 shows a wavelet function whose real part is given by equation (1) and whose imaginary part is the Hilbert transform H of that real part. The complex wavelet function is therefore given by the following equation:

[0030] [Equation 2]

[0031] [External character 1]

[0032] FIG. 3 shows the cascaded application of the wavelet transform. A signal 301 is filtered by a high-pass filter HP1 302 and a low-pass filter TP1 305. Subsampling is performed here, for example, so that the number of values to be stored is reduced for each filter. The inverse wavelet transform guarantees that the original signal 301 can be reconstructed from the low-pass component TP1 305 and the high-pass component HP1 304.

[0033] In the high-pass filter HP1 302, filtering is performed separately for the real part Re1 303 and the imaginary part Im1 304.

[0034] The signal 310 after the low-pass filter TP1 305 is filtered again by a high-pass filter HP2 306 and a low-pass filter TP2 309. The high-pass filter HP2 306 likewise comprises a real part Re2 307 and an imaginary part Im2 308. After the second transform step, the signal is filtered again, and so on.

[0035] Starting from a short-time (FFT-transformed) spectrum with 256 values, eight transform steps are performed (subsampling rate: 1/2) until the signal from the last low-pass filter TP8 equals the DC component.
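The step count follows from halving 256 values until one remains (256 = 2^8). The sketch below makes this concrete under simplifying assumptions: it uses a real-valued Haar averaging pair instead of the complex wavelet of FIG. 2, and the names `haar_step` and `cascade` are hypothetical.

```python
def haar_step(values):
    """Split a sequence into subsampled low-pass and high-pass parts."""
    pairs = list(zip(values[0::2], values[1::2]))
    low = [(a + b) / 2 for a, b in pairs]
    high = [(a - b) / 2 for a, b in pairs]
    return low, high


def cascade(spectrum):
    """Feed each low-pass output into the next step until one value is left."""
    n = len(spectrum)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    low, lows, highs = list(spectrum), [], []
    while len(low) > 1:
        low, high = haar_step(low)
        lows.append(low)     # TP1, TP2, ...
        highs.append(high)   # HP1, HP2, ...
    return lows, highs


spectrum = [float(i % 7) for i in range(256)]   # placeholder spectrum values
lows, highs = cascade(spectrum)
```

With this averaging filter, the single value left in the eighth low-pass output is the mean of the 256 input values, that is, their DC component.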

[0036] FIG. 4 shows different transform steps of the wavelet transform, separated into low-pass components (FIGS. 4A, 4C and 4E) and high-pass components (FIGS. 4B, 4D and 4F).

[0037] The fundamental frequency of the uttered expression can be seen in the high-pass component of FIG. 4B. Besides the variations in amplitude, a clearly dominant periodicity can be identified in the wavelet-filtered spectrum. This is the speaker's fundamental frequency. By means of this fundamental frequency, predefined expressions can be adapted to one another during speech synthesis, or matching expressions can be found in a data bank of predefined expressions.
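One way to read off such a periodicity automatically is an autocorrelation over the high-pass component, sketched below. This is an assumption, not a procedure given in the text: the function `dominant_period`, the synthetic input, and the sampling rate and FFT length used to convert bins to hertz are all hypothetical.

```python
import math


def dominant_period(values, min_lag=2):
    """Return the lag (in bins) with the strongest autocorrelation."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    best_lag, best_score = None, float("-inf")
    for lag in range(min_lag, n // 2 + 1):
        score = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical high-pass component: a ripple repeating every 8 frequency bins.
high_pass = [math.cos(2 * math.pi * k / 8) for k in range(64)]
period_bins = dominant_period(high_pass)

bin_width_hz = 16000 / 512            # assumed sampling rate / assumed FFT length
f0_hz = period_bins * bin_width_hz    # spacing of the ripple in Hz
```

The conversion relies on the ripple in a voiced spectrum being spaced at the fundamental frequency, so the period in bins times the bin width gives an estimate in hertz.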

[0038] In the low-pass component of FIG. 4C, the formants of the speech signal segment (whose length corresponds to approximately twice the fundamental frequency) appear as pronounced minima and maxima. These formants represent the resonance frequencies of the speaker's vocal tract. Because the formants can be displayed clearly, suitable speech units can be adapted and/or selected in concatenative speech synthesis.

[0039] The hoarseness of the voice can be determined from the low-pass component of the penultimate transform step (TP7 when the original signal has 256 frequency values). The drop in the curve between the maximum Mx and the minimum Mi characterizes the degree of hoarseness.
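The drop between Mx and Mi can be expressed as a single number, for example as the slope of the line connecting them. The sketch below is one possible reading; the function name `spectral_tilt` and the sample values are hypothetical.

```python
def spectral_tilt(low_pass):
    """Slope of the curve from its maximum Mx to its minimum Mi."""
    indices = range(len(low_pass))
    i_max = max(indices, key=low_pass.__getitem__)
    i_min = min(indices, key=low_pass.__getitem__)
    if i_max == i_min:
        return 0.0  # flat curve: no tilt
    return (low_pass[i_min] - low_pass[i_max]) / (i_min - i_max)


tp7 = [4.0, 1.0]            # hypothetical penultimate low-pass component
tilt = spectral_tilt(tp7)   # negative when the curve falls towards higher bins
```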

[0040] The three speaker-specific features named above are thus identified and can be modified as appropriate for speech synthesis. It is particularly important here that, in the inverse wavelet transform, manipulating an individual speaker-specific feature changes only that feature and leaves the other perceptually relevant parameters unchanged. The fundamental frequency can thus be adjusted in a targeted manner without altering the hoarseness of the voice.

[0041] Another application is the selection of a suitable sound part for concatenation with another sound part, where the two sound parts were originally recorded from different speakers in different contexts. Determining the spectral speech features makes it possible to find suitable sound parts to join, because these features provide known evaluation criteria by which sound parts can be compared with one another and matching sound parts can thus be selected automatically according to fixed rules.

[0042] FIG. 5 shows the steps of concatenative speech synthesis. A data bank is built from a predefined set of speech naturally spoken by various speakers, in which the sound parts of the naturally spoken speech are identified and stored. Numerous samples thus arise for the various sound parts of the speech, and the data bank gives access to them. A sound part is, for example, a single speech sound, a phoneme, or a sequence of such phonemes. The smaller the sound part, the greater the possibilities when assembling new words. German has a fixed inventory of about 40 phonemes, which suffice to synthesize almost all words of the language. Different acoustic contexts must be taken into account here, depending on the word in which each phoneme occurs. What matters is that the individual phonemes are fitted into the acoustic context in such a way that discontinuities, which human hearing perceives as unnatural and obviously "synthetic", are avoided. As stated above, the sound parts come from different speakers and therefore have different speaker-specific features. To synthesize an expression that sounds as natural as possible, it is important to minimize these discontinuities. This can be done by adapting the identifiable and modifiable speaker-specific features, or by selecting matching sound parts from the data bank. Here, too, the speaker-specific features are an important aid to selection.

[0043] FIG. 5 shows two example sounds, A 507 and B 508, each comprising individual sound parts 505 and 506, respectively. Sounds A 507 and B 508 each come from an uttered expression, and sound A 507 differs clearly from sound B 508. A dividing line 509 marks the point at which sound A 507 and sound B 508 are to be joined. In this case, the first three sound parts of sound A 507 are to be concatenated with the last three sound parts of sound B 508.

[0044] The successive sound parts along the dividing line 509 must be stretched or compressed in time (see arrow 503) to avoid a perceptible discontinuity at the transition 509.

[0045] In one variant, the sounds divided along the dividing line 509 transition abruptly. This produces the discontinuities described above, which human hearing perceives as disturbing. By contrast, a sound C can be assembled by taking into account the sound parts within a transition region 501 or 502, with the spectral distance measure between the mutually assignable sound parts being adapted within the respective transition region 501 or 502 (a gradual transition between the sound parts). The distance measure used is, for example, the Euclidean distance in wavelet space between the coefficients relevant to this region.
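The distance measure can be sketched in a few lines. The coefficient vectors, the unit names, and the helper functions `spectral_distance` and `best_match` are hypothetical; the text only names the Euclidean distance between the relevant wavelet coefficients as an example.

```python
import math


def spectral_distance(coeffs_a, coeffs_b):
    """Euclidean distance between two wavelet coefficient vectors."""
    if len(coeffs_a) != len(coeffs_b):
        raise ValueError("coefficient vectors must have equal length")
    return math.dist(coeffs_a, coeffs_b)


def best_match(target, candidates):
    """Pick the candidate whose coefficients lie closest to the target."""
    return min(candidates, key=lambda name: spectral_distance(target, candidates[name]))


target = [0.0, 0.0, 1.0]            # coefficients of the sound part to be joined
candidates = {
    "unit_1": [3.0, 4.0, 1.0],
    "unit_2": [0.0, 1.0, 1.0],
}
closest = best_match(target, candidates)
```

The same measure serves both uses named in the description: choosing the best-matching sound part from the data bank, and checking how far two sound parts still differ while they are being adapted to each other.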

[0046] References
[1] I. Daubechies: "Ten Lectures on Wavelets", SIAM, 1992, ISBN 0-89871-274-2, Chapter 5.1, pp. 129-137.

[Brief description of the drawings]

[FIG. 1] A diagram showing a wavelet function.

[FIG. 2] A diagram showing the wavelet function divided into a real part and an imaginary part.

[FIG. 3] A diagram showing a cascaded filter structure representing the transform steps of the wavelet transform.

[FIG. 4] A diagram showing low-pass and high-pass components of different transform steps.

[FIG. 5] A diagram showing the steps of concatenative speech synthesis.

[Procedural amendment] Submission of translation of amendment under Article 34 of the Patent Cooperation Treaty

[Submission date] July 7, 2000

[Amendment 1]

[Document to be amended] Description

[Item to be amended] 0003

[Method of amendment] Change

[Content of amendment]

[0003] The wavelet transform is known from [1]. In the wavelet transform, the wavelet filters guarantee that the high-pass component and the low-pass component of each successive transform step together fully reconstruct the signal of the current transform step. From one transform step to the next, the resolution of the high-pass or low-pass component is reduced (technical term: "subsampling"). In particular, because of this subsampling, the number of transform steps is finite. US-A-5528725 describes a speech recognition method that uses a wavelet transform. EP-A-0519802 describes a speech synthesis method in which speaker-specific features are adapted with regard to a natural-sounding sequence of speech sounds.

Claims

[Claims]

1. A method for determining spectral speech features of an uttered expression, in which: a) the expression is digitized; b) the digitized expression is subjected to a wavelet transform; and c) speaker-specific features are determined on the basis of different transform steps of the wavelet transform.

2. The method according to claim 1, in which, before the wavelet transform, the digitized expression is windowed and transformed into the frequency domain.

3. The method according to claim 2, in which the transformation into the frequency domain is performed by a fast Fourier transform.

4. The method according to claim 1, in which a low-pass component and a high-pass component of the signal to be transformed are determined in each transform step of the wavelet transform.

5. The method according to claim 1, in which the high-pass component is split into a real part and an imaginary part.

6. The method according to any one of the preceding claims, in which the wavelet transform comprises a plurality of transform steps, the last transform step supplying the DC component of the expression through repeated low-pass filtering corresponding to the number of transform steps.

7. The method according to any one of the preceding claims, in which the speaker-specific features are determined as: a) the fundamental frequency of the uttered expression, b) the spectral envelope, and c) the hoarseness of the uttered expression.

8. Use of the method according to claim 1, in which individual speaker-specific features are adapted with regard to a natural-sounding sequence of speech sounds.

9. Use of the described method for speech synthesis, in which a speech sound that guarantees a natural-sounding sequence of speech sounds is selected from a predefined data set on the basis of individual spectral speech features.

10. An apparatus for determining spectral speech features of an uttered expression, comprising a processor unit configured to perform the following steps: a) digitizing the expression; b) subjecting the digitized expression to a wavelet transform; and c) determining speaker-specific features on the basis of different transform steps of the wavelet transform.