JP2612869B2

JP2612869B2 - Voice conversion method

Info

Publication number: JP2612869B2
Application number: JP62250708A
Authority: JP
Inventors: 徹都木; 尚夫桑原; 哲夫梅田
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1987-10-06
Filing date: 1987-10-06
Publication date: 1997-05-21
Anticipated expiration: 2012-05-21
Also published as: JPH0193796A

Description

DETAILED DESCRIPTION OF THE INVENTION

[Industrial Application Field] The present invention relates to speech information processing for handling the human voice in broadcasting, film, music, and similar fields. More specifically, it relates to a voice quality conversion method that changes the individuality of a voice, enhances its clarity, or alters its resonance to produce special effects.

[Summary of the Invention] The present invention relates to a technique that temporarily records a human voice, changes its quality, and outputs it again as speech. After the input speech is A/D converted, linear prediction coefficients are first calculated for each voiced section. The formant frequencies and bandwidths are changed as desired on the basis of these coefficients, and new linear prediction coefficients corresponding to the changed formant frequencies and bandwidths are obtained, which changes the spectral envelope. Next, the original speech is transformed into the frequency domain by a Fourier transform, reshaped as desired using the spectral envelopes before and after the formant change, returned to the time domain by an inverse Fourier transform, and D/A converted. In this way, the method can change the individuality of the original speech or improve its intelligibility while preserving its naturalness as speech.

[Prior Art] Conventionally, at broadcast sites and the like, analog filters have been used to remove specific frequency bands of speech in order to erase individuality, and skilled engineers have used graphic equalizers to boost or attenuate the energy in specific frequency bands in order to correct voice quality.

In recent years, digital methods have also been developed. If a speech waveform written at a sampling frequency F is read out at a sampling frequency F×R, all spectral information, including the pitch frequency, is scaled by a factor of R in frequency. If the waveform is then thinned out or repeated using a suitable time window and period, the voice quality can be converted while the speaking rate stays the same as in the original speech. Such devices are called "harmonizers" and the like, and have begun to come into general use as sound effect devices.

In addition, a method that controls voice quality and clarity by filtering adapted to a speech production model, using a digital filter based on linear predictive analysis, has been proposed by the present applicant, for example in Japanese Patent Application No. Sho 61-206777.

[Problems to Be Solved by the Invention] However, methods using analog filters or graphic equalizers in principle only increase or decrease the energy in specific frequency bands of the speech and cannot fundamentally change the voice quality. That is, some of the features contained in the original speech are merely emphasized or suppressed in level, so individuality cannot be changed decisively and clarity cannot be improved.

In devices such as the "harmonizer", all spectral information including the pitch moves along the frequency axis, so the pitch of the voice and its individuality change greatly. However, such a change is mechanical and differs from the spectral differences that exist between the voices of actual individuals. The output speech therefore tends to sound non-human. Moreover, the pitch frequency of the original speech cannot be preserved.

Furthermore, the method using a digital filter based on linear predictive analysis is adapted to a speech production model and can therefore control voice quality and clarity fundamentally, but it has had sound quality problems caused by factors such as filter stability.

An object of the present invention is therefore to eliminate the conventional problems described above and to provide a voice quality conversion method that, by combining analysis adapted to a speech production model with the fast Fourier transform, can perform high-quality voice conversion that exploits the characteristic features of speech.

[Means for Solving the Problems] To that end, the present invention is characterized by the following steps:

- extracting voiced sections from the input speech;
- in each voiced section, calculating the formant frequencies and bandwidths in each short-time interval defined by an analysis window width and a sliding period of that window, and calculating a spectral envelope by obtaining linear prediction coefficients in each short-time interval;
- applying a Fourier transform to the voiced section to convert it into the frequency domain;
- obtaining the time trajectories of the formant frequencies and changing the formant frequency or bandwidth at each time point of the trajectories;
- calculating a spectral envelope from the changed formant frequencies and bandwidths;
- taking as a change component the quotient of the spectral envelope calculated after the formant frequency change divided by the spectral envelope calculated before the change;
- multiplying the frequency components obtained by the Fourier transform by the change component and applying spectral changes other than the formant frequency change;
- returning the waveform to the time domain by an inverse Fourier transform; and
- connecting it to unvoiced sections, silent sections, or the preceding and following voiced sections to form a new speech waveform.

[Operation] With the above configuration, the frequency spectral envelope is changed using the fast Fourier transform, which makes it possible to convert the voice quality of speech.

[Embodiment] The present invention is described in detail below on the basis of the embodiment shown in the drawings.

FIG. 1 is a block diagram of a voice quality conversion system according to one embodiment of the present invention. In the figure, 2 denotes an analysis unit, 4 a formant frequency control unit, and 6 a spectrum control unit. Each unit is implemented in a computer, and the voice quality conversion processing is executed using memory such as ROM, RAM, and a memory disk.

The A/D-converted, sampled speech waveform is input to the analysis unit 2, which distinguishes sounded from silent sections and voiced from unvoiced sounds, and obtains the resonance frequencies of the voiced sounds.

Next, the formant frequency control unit 4 obtains the formant frequencies from the resonance frequencies obtained in the analysis unit 2 and applies the desired changes.

The spectrum control unit 6 changes the spectral envelope according to the formant frequencies changed by the formant frequency control unit 4.

When this series of voice quality conversion steps for a voiced sound is finished, the unvoiced and silent sections are connected and processing moves on to the next voiced section. The finally synthesized speech waveform is D/A converted to give the output speech.

The processing in each unit is described in detail below with reference to the flowchart shown in FIG. 2.

Speech that has been A/D converted with 12-bit resolution at a sampling frequency of 15 kHz is first processed in the analysis unit 2, where step S1 distinguishes sounded sections from silent sections on the basis of the presence or absence of speech power. Next, in step S2, PARCOR analysis and zero-crossing analysis are applied to the samples of the sounded sections to distinguish unvoiced consonant sections from voiced sections. This is done by referring to the first-order PARCOR coefficient to examine the proportion of high-frequency components in the input, and by examining the number of zero crossings. The energy of unvoiced consonants extends into the high-frequency region, so unvoiced consonants and voiced sounds can be told apart by the proportion of high-frequency components and by the zero-crossing count, which increases with frequency. Both PARCOR analysis and zero-crossing analysis are used in order to make the decision reliable.
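The two features used in step S2 can be sketched as follows. This is a minimal illustration, not the patent's procedure: the first-order PARCOR coefficient is taken here as the normalized lag-1 autocorrelation r(1)/r(0) (sign conventions vary), and the function names, thresholds, and test tones are assumptions chosen for the example.

```python
import math

def zero_crossings(x):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(x, x[1:]) if (a >= 0) != (b >= 0))

def first_parcor(x):
    """First-order coefficient, taken here as r(1)/r(0) (sign conventions vary)."""
    r0 = sum(v * v for v in x)
    r1 = sum(a * b for a, b in zip(x, x[1:]))
    return r1 / r0 if r0 > 0 else 0.0

def looks_voiced(x, max_crossings, min_k1):
    """Hypothetical rule: both features must indicate little high-frequency energy."""
    return zero_crossings(x) <= max_crossings and first_parcor(x) >= min_k1

fs = 15000   # sampling frequency from the text [Hz]
n = 300      # 20 ms of samples
low = [math.sin(2 * math.pi * 200 * i / fs + 0.3) for i in range(n)]    # voiced-like tone
high = [math.sin(2 * math.pi * 4000 * i / fs + 0.3) for i in range(n)]  # high-frequency tone
print(zero_crossings(low), zero_crossings(high))
print(looks_voiced(low, 50, 0.5), looks_voiced(high, 50, 0.5))
```

A signal dominated by low frequencies has few zero crossings and a lag-1 correlation near 1, whereas high-frequency energy raises the crossing count and pulls the correlation down, which is why requiring both makes the decision more robust.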

The durations of the silent sections and the waveforms of the unvoiced consonant sections determined in steps S1 and S2 are stored as they are in RAM, on a memory disk, or the like in steps S14 and S15, respectively.

Next, in step S3, linear prediction analysis is performed by passing the samples of the speech waveform in the voiced section through a so-called vocal tract inverse filter based on a speech production model. That is, a window about 20 msec wide is first applied, and a correlation function is computed from the windowed samples, from which the linear prediction coefficients $\alpha_1$ to $\alpha_p$ are calculated in step S4. Here $p$ is the order of the linear prediction; about $p = 14$ is used for male voices and about $p = 10$ for female voices. Then, in step S5, with $\alpha_1$ to $\alpha_p$ as coefficients, the roots $z_1$ to $z_p$ of the complex variable $z$ that satisfy equation (1) are found.

$$1 + \alpha_1 z^{-1} + \alpha_2 z^{-2} + \dots + \alpha_p z^{-p} = 0 \tag{1}$$

The roots $z_1$ to $z_p$ include complex-conjugate pairs, and each resonance is represented by one such pair. Therefore, only for the roots $z_i$ with a positive imaginary part, the resonance frequency $F_i$ and its bandwidth $B_i$ are obtained from equations (2) and (3) and recorded, together with the linear prediction coefficients, in RAM, on a memory disk, or the like.

$$F_i = \frac{F_s}{2\pi}\,\arg(z_i) \quad \text{[Hz]} \tag{2}$$

$$B_i = \frac{F_s}{\pi}\,\bigl|\log |z_i|\bigr| \quad \text{[Hz]} \tag{3}$$

where $F_s$ is the sampling frequency of the speech.
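Steps S3 to S5 and equations (1) to (3) can be sketched with NumPy as below. This is a minimal sketch under stated assumptions: the "correlation function" is taken to be the autocorrelation of a Hamming-windowed frame, the coefficients are solved with the Levinson-Durbin recursion, and the function names are illustrative rather than taken from the patent.

```python
import numpy as np

def lpc_autocorr(frame, p):
    """Levinson-Durbin on the autocorrelation of a windowed frame.

    Returns [1, a1, ..., ap] with A(z) = 1 + a1 z^-1 + ... + ap z^-p,
    matching the sign convention of equation (1).
    """
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        # Reflection coefficient for order i
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        # Order update: a_j += k * a_(i-j) for j = 1..i
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= 1.0 - k * k
    return a

def resonances(a, fs):
    """Equations (2) and (3) applied to the roots with positive imaginary part."""
    roots = np.roots(a)              # roots of z^p * A(z), i.e. solutions of equation (1)
    roots = roots[roots.imag > 0]
    freqs = fs / (2 * np.pi) * np.angle(roots)
    bws = fs / np.pi * np.abs(np.log(np.abs(roots)))
    order = np.argsort(freqs)
    return freqs[order], bws[order]

fs = 15000
t = np.arange(300)                   # one 20 ms frame at 15 kHz
frame = np.sin(2 * np.pi * 1000 * t / fs + 0.3)
f, b = resonances(lpc_autocorr(frame, 2), fs)
print(f, b)
```

For real speech the order would be the $p$ given in the text (about 10 to 14), and the frame would be advanced through the voiced section as described next.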

This series of operations is repeated, sliding the analysis start position later by about 10 msec each time, until the speech section ends.

In the formant frequency control unit 4, step S6 derives the time trajectories of the formant frequencies from the series of resonance frequencies $F_i$ obtained in the analysis unit 2, taking their bandwidths and continuity into account. In general, roots that give non-formant frequencies have wider bandwidths than roots that give formants. The formants are called the first formant, the second formant, and so on in order of increasing frequency. The first to third formants are important for the phonemic quality of vowels and voiced consonants, so their trajectories are determined with particular accuracy.

Next, in step S7, the desired changes are applied to the formant frequency trajectories obtained in step S6 to determine the new formant frequencies and bandwidths.

For example, to enhance clarity, it is effective to emphasize the movement along the time axis of the formant frequencies of the first to third formants, as shown in FIG. 3.

To change individuality, it is effective to shift all formant frequencies uniformly, as shown in FIG. 4. As for bandwidth, narrowing it gives a crisper-sounding voice, and widening it gives a smoother-sounding voice.
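The kinds of trajectory edits described for step S7 might look like the sketch below. The text does not give formulas for them, so everything here is an assumption made for illustration: the shift is additive in Hz, the movement emphasis expands each formant's deviation from its own time average, and the function names and numbers are hypothetical.

```python
def shift_formants(track, shift_hz):
    """Add the same offset to every formant frequency at every time point."""
    return [[f + shift_hz for f in frame] for frame in track]

def scale_bandwidths(track, factor):
    """Narrow (factor < 1) or widen (factor > 1) every bandwidth."""
    return [[b * factor for b in frame] for frame in track]

def emphasize_movement(track, gain):
    """Hypothetical emphasis: expand each formant's deviation from its time average."""
    n = len(track)
    means = [sum(frame[i] for frame in track) / n for i in range(len(track[0]))]
    return [[m + gain * (f - m) for f, m in zip(frame, means)] for frame in track]

# Two time points, three formants each [Hz] (made-up values)
track = [[500, 1500, 2500], [600, 1400, 2500]]
print(shift_formants(track, 100))
print(emphasize_movement(track, 2.0))
```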

Once the new formant frequency trajectories and bandwidths have been determined, step S8 calculates the new linear prediction coefficients at each time point as follows.

Let $F_i'$ denote the new resonance frequencies and $B_i'$ their bandwidths, covering the changed formants, the unchanged formants, and the resonance frequencies that were not recognized as formants. For each pair $F_i'$, $B_i'$, a new root $z_i'$ is obtained from equation (4), taking into account that in general $|z_i'| < 1$.

$$z_i' = \exp\!\left(-\frac{\pi B_i'}{F_s} + j\,\frac{2\pi F_i'}{F_s}\right) \tag{4}$$

In addition to these $z_i'$, their complex conjugates are included, together with any roots of equation (1) whose imaginary part is zero. A new polynomial is then formed from all $p$ roots $z_i'$ as in equation (5).

$$(1 - z_1' z^{-1})(1 - z_2' z^{-1}) \cdots (1 - z_p' z^{-1}) = 1 + \alpha_1' z^{-1} + \alpha_2' z^{-2} + \dots + \alpha_p' z^{-p} \tag{5}$$

The coefficients $\alpha_1'$ to $\alpha_p'$ on the right-hand side of equation (5) are the new linear prediction coefficients.
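Equations (4) and (5) amount to rebuilding a polynomial from its roots, which can be sketched with `numpy.poly`. The function names and the example resonances are illustrative assumptions; `real_roots` stands for any roots of equation (1) with zero imaginary part that are carried over.

```python
import numpy as np

def root_from_resonance(f_hz, bw_hz, fs):
    """Equation (4): z' = exp(-pi*B'/Fs + j*2*pi*F'/Fs), which always has |z'| < 1."""
    return np.exp(-np.pi * bw_hz / fs + 2j * np.pi * f_hz / fs)

def coefficients_from_resonances(freqs, bws, fs, real_roots=()):
    """Equation (5): expand prod(1 - z_i' z^-1) into [1, a1', ..., ap']."""
    upper = [root_from_resonance(f, b, fs) for f, b in zip(freqs, bws)]
    roots = upper + [np.conj(z) for z in upper] + list(real_roots)
    # np.poly gives the coefficients of prod(z - r_i); dividing by z^p yields equation (5)
    return np.real(np.poly(roots))

fs = 15000
new_coeffs = coefficients_from_resonances([500, 1500], [60, 90], fs)
print(new_coeffs)
```

Because every root produced by equation (4) lies inside the unit circle for a positive bandwidth, the resulting all-pole model stays stable regardless of how the formants were moved.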

In the spectrum control unit 6, on the basis of the results obtained in the formant frequency control unit 4, the spectral envelope of the original speech at each time point is changed to the desired spectral envelope.

Let $M$ be the number of samples corresponding to the window width of the linear prediction analysis in the analysis unit 2, and $L$ the number of samples corresponding to the sliding period of the analysis window. In this example, $M = 300$ and $L = 150$.

First, as shown in FIG. 5, the sum of squares $P_S$ is computed over the $2L$ samples of the original speech from point $q$ to point $q + 2L - 1$. Then, in steps S16 and S17, the linear prediction coefficients $\alpha_1$ to $\alpha_p$ obtained by the analysis unit 2 for this analysis window, and the coefficients $\alpha_1'$ to $\alpha_p'$ obtained by changing them in the formant frequency control unit, are used in equations (6) and (7) to obtain the spectral envelope $H(k)$ of the original speech and the formant-modified spectral envelope, written here as $\hat{H}(k)$. (Equations (6) and (7) are not reproduced in this text.)

Here $N$ is a power of 2 larger than $M$ and is set to 512. $H(k)$ is a physical quantity that carries much of the phonemic quality and individuality of the original speech, whereas the modified envelope $\hat{H}(k)$ is one in which that phonemic quality and individuality have been emphasized, suppressed, or changed.

Note that the modified envelope $\hat{H}(k)$ can also be calculated directly from the $p$ roots $z_i'$ obtained in the formant frequency control unit, using equation (8). (Equation (8) is not reproduced in this text.)

However, because equations (6) and (7) have a form to which the FFT (fast Fourier transform) algorithm can be applied, using equation (8) is disadvantageous in terms of computation time.
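Equations (6) to (8) do not survive in this text, so the sketch below assumes the standard all-pole form: the envelope is the reciprocal magnitude of $A(z)$ evaluated at $N$ equally spaced points on the unit circle. Under that assumption, the FFT route (zero-padding the coefficient vector to $N$ points) and the direct product over roots give the same values, which illustrates why the FFT form is preferred for speed. The function names are illustrative, and whether the patent uses magnitude or power is not known from this text.

```python
import numpy as np

def envelope_fft(a, n_fft):
    """Assumed form of equations (6)/(7): 1/|A| from the zero-padded FFT of [1, a1..ap]."""
    return 1.0 / np.abs(np.fft.fft(a, n_fft))

def envelope_from_roots(roots, n_fft):
    """Assumed form of equation (8): 1 / prod_i |1 - z_i e^{-jw}| on the same bins."""
    w = np.exp(-2j * np.pi * np.arange(n_fft) / n_fft)
    return 1.0 / np.abs(np.prod(1.0 - np.outer(roots, w), axis=0))

fs, n_fft = 15000, 512
z = np.exp(-np.pi * 60 / fs + 2j * np.pi * 1000 / fs)   # one resonance at 1000 Hz
roots = np.array([z, np.conj(z)])
a = np.real(np.poly(roots))
print(np.allclose(envelope_fft(a, n_fft), envelope_from_roots(roots, n_fft)))
```

The root-based version costs on the order of $pN$ complex multiplications per frame, while the FFT version costs on the order of $N \log N$ and reuses the same routine as the speech transform itself.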

Next, in steps S9 and S10, the $N$ samples from point $q + L - N/2$ to point $q + L + N/2 - 1$ are newly denoted $x(1)$ to $x(N)$ and multiplied by time-window coefficients as in equation (9) to give $y(1)$ to $y(N)$.

$$y(m) = w(m)\,x(m), \qquad m = 1, \dots, N \tag{9}$$

where, with $T = N/2 - L + 1$ and $T' = N/2 + L$,

$$w(m) = \begin{cases} 0.5\,\{1 - \cos(\pi m / T)\}, & 1 \le m \le T \\ 1, & T < m < T' \\ 0.5\,[1 - \cos\{\pi (m - T') / T\}], & T' \le m \le N \end{cases}$$

An $N$-point fast Fourier transform is applied to $y(m)$ to obtain $Y(k)$ in the frequency domain, and in step S11 the magnitude of $Y(k)$ is changed using equation (10). The phase component is left unchanged.
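The window of equation (9) can be sketched as below for the values $N = 512$ and $L = 150$ given in the text. One caveat: as printed, the third branch has the same sign as the first, which would make the window drop from 1 to 0 at $m = T'$ and then rise again. The sketch assumes a falling cosine taper (a plus sign) in that branch so that the window is continuous and decays toward the end of the frame; that choice is an assumption, not something the text confirms.

```python
import math

def taper_window(n_fft, hop):
    """Window of equation (9) for m = 1..N, returned as a 0-based list.

    The flat section covers the 2L samples that are kept after the inverse transform.
    """
    t = n_fft // 2 - hop + 1     # T
    t2 = n_fft // 2 + hop        # T'
    w = []
    for m in range(1, n_fft + 1):
        if m <= t:
            w.append(0.5 * (1 - math.cos(math.pi * m / t)))          # rising taper
        elif m < t2:
            w.append(1.0)                                            # flat section
        else:
            # Assumption: falling taper (plus sign), unlike the branch as printed
            w.append(0.5 * (1 + math.cos(math.pi * (m - t2) / t)))
    return w

w = taper_window(512, 150)
print(len(w), w[0], w[255], w[-1])
```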

$$\hat{Y}(k) = A(k)\,\frac{\hat{H}(k)}{H(k)}\,Y(k), \qquad k = 1, \dots, N \tag{10}$$

Here $\hat{Y}(k)$ denotes the modified spectrum and $\hat{H}(k)$ the formant-modified spectral envelope. $A(k)$ changes the spectral envelope through factors other than the formant change based on the linear prediction coefficients. It is a real array with values between 0 and 1 and satisfies $A(k) = A(N - k + 2)$. For example, equation (11) gives a characteristic that emphasizes high frequencies.

$$A(k) = 1.4\,\frac{k - 1}{N} + 0.3, \qquad k = 1, \dots, N/2 + 1 \tag{11}$$

In step S12, the modified spectrum of equation (10) is converted by an inverse fast Fourier transform into a time-domain waveform, written here as $\hat{y}(1)$ to $\hat{y}(N)$. For the $2L$ samples $N/2 - L + 1$ to $N/2 + L$ of these $N$ points, the sum of squares $P_S'$ is computed, and the gain is adjusted as in equation (12) so that it equals the $P_S$ obtained earlier, that is, so that the loudness of the speech is unchanged. The gain-adjusted samples are written here as $\tilde{y}(m)$:

$$\tilde{y}(m) = \left(\frac{P_S}{P_S'}\right)^{1/2} \hat{y}(m), \qquad m = N/2 - L + 1, \dots, N/2 + L \tag{12}$$

A Hanning or triangular window that is 0 at both ends and 1 at the center is then applied to $\tilde{y}(N/2 - L + 1)$ to $\tilde{y}(N/2 + L)$, and the resulting waveform is temporarily stored in RAM, on a memory disk, or the like. This windowing reduces end effects when the waveforms are joined.
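Steps S11 and S12 can be sketched with NumPy as follows. The sketch assumes the envelopes are supplied as length-$N$ real arrays that are symmetric in the same way as $A(k)$, so that the inverse transform is real; the function names are illustrative, and the final Hanning or triangular window is left out here.

```python
import numpy as np

def emphasis_curve(n_fft):
    """Equation (11) on bins k = 1..N/2+1, mirrored so that A(k) = A(N-k+2)."""
    half = 1.4 * np.arange(n_fft // 2 + 1) / n_fft + 0.3
    return np.concatenate([half, half[-2:0:-1]])

def modify_frame(x, window, h_old, h_new, a_curve, hop):
    """Equations (10) and (12): scale the spectrum, invert, and match the energy."""
    n_fft = len(x)
    lo, hi = n_fft // 2 - hop, n_fft // 2 + hop   # 0-based slice for m = N/2-L+1 .. N/2+L
    p_s = np.sum(x[lo:hi] ** 2)                   # P_S of the original 2L samples
    spec = np.fft.fft(window * x)
    gain = a_curve * h_new / h_old                # real and positive, so phase is untouched
    out = np.real(np.fft.ifft(gain * spec))
    seg = out[lo:hi]
    return seg * np.sqrt(p_s / np.sum(seg ** 2))  # sum of squares now equals P_S

n_fft, hop = 512, 150
rng = np.random.default_rng(0)
x = rng.standard_normal(n_fft)                    # stand-in for one frame of speech
flat = np.ones(n_fft)
segment = modify_frame(x, flat, flat, flat, emphasis_curve(n_fft), hop)
print(segment.shape)
```

Because the multiplier in equation (10) is a positive real number in every bin, only the magnitude of $Y(k)$ changes, which is how the method keeps the original phase and hence the pitch structure of the speech.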

Next, point $q$ is shifted $L$ points later and the same series of processing is performed. Then, as shown in FIG. 6, the first $L$ points of the new $2L$-sample data are overlapped with and added to the last $L$ points of the immediately preceding processed frame, and this is repeated frame after frame.

The same operation is repeated until the voiced section ends, which yields a continuous speech waveform with a modified spectral envelope.
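The frame-by-frame overlap-add can be sketched as below. It uses a periodic Hann window of length $2L$, chosen here (as an assumption) because its two overlapping halves sum exactly to one at a hop of $L$; the text only asks for a Hanning or triangular window that is 0 at the ends and 1 at the center. The function name and the constant test frames are illustrative.

```python
import math

def overlap_add(frames, hop):
    """Window each 2L-sample frame and add it into the output at multiples of L samples."""
    size = 2 * hop
    # Periodic Hann: window[n] + window[n + hop] == 1 for 0 <= n < hop
    window = [0.5 * (1 - math.cos(math.pi * n / hop)) for n in range(size)]
    out = [0.0] * (hop * (len(frames) + 1))
    for i, frame in enumerate(frames):
        for n in range(size):
            out[i * hop + n] += window[n] * frame[n]
    return out

hop = 150
frames = [[1.0] * (2 * hop) for _ in range(4)]   # four frames of a constant signal
result = overlap_add(frames, hop)
print(len(result), min(result[hop:4 * hop]), max(result[hop:4 * hop]))
```

Wherever two frames overlap, a constant input is reproduced unchanged, so the windowing smooths frame boundaries without modulating the level.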

When the processing of one voiced section is finished, in step S13 it is connected to the preceding and following unvoiced or silent sections, and processing moves on to the next voiced section. The finally synthesized speech is D/A converted to give the output speech.

In this embodiment, linear prediction analysis is used to detect voiced sections and to extract formants, but the effect of the present invention is the same if they are obtained by other methods.

[Effects of the Invention] As described above, according to the present invention, voice quality can be converted by changing the frequency spectral envelope at each time point on the basis of a speech production model.

The spectral envelope is changed using the fast Fourier transform, which gives better sound quality than the conventional method using a digital filter. In addition, because the spectrum can be controlled freely in the frequency domain, spectral control beyond the notion of formants is also possible. While the pitch frequency of the original speech is preserved, not only can individuality be controlled and clarity improved, but various impressions of the voice can also be controlled.

[Brief description of the drawings]

FIG. 1 is a block diagram of a system according to one embodiment of the present invention.
FIG. 2 is a flowchart showing one embodiment of the present invention.
FIG. 3 is a diagram illustrating changes in formant frequency along the time axis in the embodiment.
FIG. 4 is a diagram illustrating a uniform change in formant frequency along the time axis in the embodiment.
FIG. 5 is a waveform diagram illustrating the processing section in the embodiment.
FIG. 6 is a waveform diagram illustrating the overlap-adding of waveforms in the embodiment.
2: analysis unit; 4: formant frequency control unit; 6: spectrum control unit.

Continuation of the front page
(56) References: 桑原, 都木, "Voice quality conversion by analysis-synthesis and its application to improving hoarse voice," IEICE Technical Report SP86-57, pp. 45-52.

Claims

(57) [Claims]

1. A voice quality conversion method characterized by: extracting voiced sections from input speech; in each voiced section, calculating the formant frequencies and bandwidths in each short-time interval defined by an analysis window width and a sliding period of that window, and calculating a spectral envelope by obtaining linear prediction coefficients in each short-time interval; applying a Fourier transform to the voiced section to convert it into the frequency domain; obtaining time trajectories of the formant frequencies; changing the formant frequency or bandwidth at each time point of the time trajectories; calculating a spectral envelope based on the changed formant frequencies and bandwidths; taking as a change component the quotient of the spectral envelope calculated after the formant frequency change divided by the spectral envelope calculated before the formant frequency change; multiplying the frequency components obtained by the Fourier transform by the change component and applying spectral changes other than the formant frequency change; then returning the waveform to the time domain by an inverse Fourier transform; and connecting it to unvoiced sections, silent sections, or the preceding and following voiced sections to form a new speech waveform.