JP2007249009A

JP2007249009A - Sound signal analysis method and sound signal synthesis method

Info

Publication number: JP2007249009A
Application number: JP2006074939A
Authority: JP
Inventors: Hitoshi Ito; 仁伊藤; Masafumi Yano; 雅文矢野
Original assignee: Tohoku University NUC
Current assignee: Tohoku University NUC
Priority date: 2006-03-17
Filing date: 2006-03-17
Publication date: 2007-09-27
Anticipated expiration: 2026-03-17
Also published as: JP4469986B2

Abstract

PROBLEM TO BE SOLVED: To provide novel sound analysis/synthesis techniques that improve the parameter estimation accuracy for time-varying sinusoid signals, thereby greatly improving the accuracy of sound analysis and, in turn, of sound re-synthesis.

SOLUTION: In the sound signal analysis method, the phase function of the first harmonic component of a time-varying input sound signal is estimated by local rate-of-change coding; the time axis of the signal is then converted into this phase axis, and the converted signal is analyzed again by local rate-of-change coding. The instantaneous amplitudes, instantaneous frequencies, and instantaneous phases of all components of the input signal are thus output as parameter functions from which the signal can be re-synthesized. In the sound signal synthesis method, the sound signal is re-synthesized using the parameter functions obtained by the sound signal analysis method.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a method for analyzing sound signals such as speech and musical instrument sounds with high accuracy, and to a method for re-synthesizing the analyzed sound signal into a target form suited to a desired purpose.

It is known that the frequency response of the voiced portions of speech, and of many musical instrument sounds, has a line-spectrum structure. In the utterance of voiced speech, for example, the vibration period of the vocal folds and the shape of the vocal tract change over time. These changes appear as changes in the fundamental frequency (F0) and the amplitude envelope of the generated speech. Accurate acoustic analysis of voiced speech and instrument sounds therefore requires a technique that accurately analyzes the time-varying pattern of the overall spectral shape.

Regarding the progress of technical development in this field, Fant (1960) proposed the source-filter theory of speech production, according to which the source signal generated by vocal fold vibration is modulated according to the shape of the vocal tract as it passes through, producing a wide variety of voiced speech. The source signal produced by vocal fold vibration has a harmonic structure with energy at integer multiples of the fundamental frequency (F0), and F0 changes from moment to moment with the speed of vocal fold vibration. The strength of the modulation applied to each harmonic component differs from frequency to frequency, reflecting the characteristics of the vocal tract shape, and also changes as the vocal tract changes over time.

A voiced sound signal with these characteristics can be expressed as a sum of sinusoids whose instantaneous amplitudes and frequencies vary smoothly. This approach is called Sinusoidal Modeling and remains an active research topic (for example, Non-Patent Document 1). In Sinusoidal Modeling, the voiced sound signal is expressed by Equation 1, where x(t) is the voiced sound signal, K is the number of sinusoids, and ξk(t), ψk(t), and ηk are the instantaneous amplitude, instantaneous frequency, and initial phase of the k-th sinusoid, respectively.
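
The image of Equation 1 is not reproduced in this text. As a hedged reconstruction based only on the symbol definitions above, it would plausibly take the usual sinusoidal-model form below. The integral form of the phase, its lower limit 0, and the reading of ψk as an instantaneous angular frequency are assumptions, not taken from the source.

```latex
x(t) = \sum_{k=1}^{K} \xi_k(t)\,\cos\!\left(\int_{0}^{t} \psi_k(\tau)\,d\tau + \eta_k\right)
```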

As Equation 1 makes clear, if the parameters (ξk(t), ψk(t), ηk) of all sinusoids are given, the voiced sound signal x(t) is easily obtained from them. The inverse transformation, however, that is, estimating the parameters of each sinusoid component from the voiced sound signal x(t) itself, is an ill-posed problem and cannot be solved without imposing some constraint.

In Non-Patent Document 1 (1986), McAulay and Quatieri performed frequency analysis of voiced speech at fixed time intervals and estimated the parameters under the constraint that the instantaneous amplitude and frequency of each sinusoid component can be regarded as stationary in the vicinity of each analysis time. The peak amplitudes and frequencies corresponding to the harmonic components are first detected in the spectrum at each time, and the amplitude and frequency trajectories of the sinusoid components are then computed by linking these peaks on the basis of temporal continuity. Taking a single naturally spoken utterance as an example, they showed that this idea works well; however, because the system contains many heuristics, there is no guarantee that equivalent results are obtained for voiced speech in general, for example when the speaker's gender differs.

Furthermore, the constraint that amplitude and frequency are stationary near the analysis time degrades estimation accuracy at times when these parameters change rapidly. To address this problem, Hermus (2005) proposed Exponential Sinusoidal Modeling (ESM), which computes the spectrum using a Gaussian window function (Non-Patent Document 2). He evaluated the parameter estimation performance of ESM using the NMR (Noise-to-Masking Ratio), an accuracy measure based on auditory masking patterns. Although ESM does improve the NMR, the signal re-synthesized from the estimated parameters differs considerably from the original speech; indeed, his listening experiments showed that the re-synthesized signal and the original sound are easily discriminated.

One of the strengths of Sinusoidal Modeling is that, once the parameters of the sinusoid components contained in a signal have been estimated, the waveform itself is easily re-synthesized from them. Parameter estimation performance can therefore be evaluated quantitatively by computing the difference between the input signal and the re-synthesized signal. To date, however, no study has been reported that evaluates the performance of sinusoidal modeling on the basis of such a strict measure.

Non-Patent Document 1: McAulay, R.J. and Quatieri, T.F., "Speech Analysis/Synthesis Based on a Sinusoidal Representation", IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-34(4), pp. 744-754, 1986
Non-Patent Document 2: Hermus, K., "Perceptual audio modeling with exponentially damped sinusoids", Signal Processing 85, pp. 163-176, 2005

The object of the present invention is therefore to provide a novel sound analysis/synthesis technique that greatly improves the parameter estimation accuracy for time-varying sinusoid signals and thereby greatly improves the accuracy of sound analysis and, in turn, of sound re-synthesis. More specifically, the object is to propose a new method, referred to in this specification as local rate-of-change coding (Local Vector Coding: LVC), to provide a high-accuracy sound analysis technique, and to provide a technique by which sound can be freely re-synthesized on the basis of that analysis.

To solve the above problem, the sound signal analysis method according to the present invention is characterized in that, for an input sound signal having time-varying characteristics, the phase function of the first harmonic component is estimated by local rate-of-change coding, the time axis of the signal is then converted into this phase axis, and the converted signal is analyzed again by local rate-of-change coding, whereby the instantaneous amplitude, instantaneous frequency, and instantaneous phase of all components of the input signal are output as parameter functions from which the signal can be re-synthesized. The sound signals to be analyzed include not only speech signals but also musical instrument sound signals.

The sound signal synthesis method according to the present invention is characterized in that a sound signal is re-synthesized using the parameter functions obtained by the sound signal analysis method described above. Re-synthesis may of course reproduce the original sound signal faithfully; it is also possible to convert the sound quality into another sound quality, either from the re-synthesized sound signal or in the course of re-synthesis (for example, morphing from a male voice to a female voice).

With the sound signal analysis method and sound signal synthesis method according to the present invention, a sound signal can be analyzed with an accuracy unattainable with the prior art, and the analyzed sound signal can be freely re-synthesized into a target form suited to a desired purpose.

Embodiments of the present invention are described in detail below with reference to the drawings, together with the course of development that led to the completion of the invention.
In the research leading to the present invention, the performance of sinusoidal modeling was evaluated by taking, as the measure of parameter estimation accuracy for time-varying sinusoid signals, the energy ratio of the input signal to the residual signal, where the residual is the difference between the re-synthesized signal and the input signal (S/R: Signal per Residual ratio). The description below focuses on the analysis of voiced speech, but the present invention is also applicable to the analysis of musical instrument sounds.
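
The S/R measure described above can be sketched in a few lines. This is a minimal illustration rather than the inventors' code: the function name `signal_per_residual_db` and the decibel scaling are assumptions, since the text defines only an energy ratio.

```python
import numpy as np

def signal_per_residual_db(x, y):
    """Energy ratio of the input x to the residual (x - y), in dB.

    The dB scaling is an assumption; the source only speaks of an energy ratio.
    A perfect re-synthesis (zero residual) yields an infinite ratio.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residual = x - y
    return 10.0 * np.log10(np.sum(x**2) / np.sum(residual**2))

# Toy check: a re-synthesis that is 10 % too quiet leaves a residual of 0.1 * x.
t = np.arange(1600) / 16000.0
x = np.cos(2.0 * np.pi * 100.0 * t)
y = 0.9 * x
sr_db = signal_per_residual_db(x, y)
```

In the toy check the residual energy is one hundredth of the input energy, so the ratio is about 20 dB.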

(1) Fundamental frequency (F0) detection (Pitch Determination)
The fundamental frequency (F0) of voiced speech corresponds to the frequency of the first harmonic component and is the main factor determining the perceived pitch of the voice. F0 is also known to be a cue for the expression of a speaker's emotion and for identifying individual speakers, and it can therefore be regarded as a widely applicable feature in the acoustic analysis of speech. Sinusoidal modeling normally estimates the parameters of all components contained in a voiced sound signal, so F0, being a parameter of the first harmonic component, is necessarily estimated as well. The present invention therefore also considers sinusoidal modeling as an F0 detection method.

Methods for detecting F0 from a speech signal are called pitch determination algorithms (PDAs) and have been studied extensively. They fall broadly into three groups (Hess, 1983). The first group consists of frequency-domain methods, which detect F0 using the harmonic structure that appears in the speech spectrum; the cepstrum method is representative (Noll, 1966). The second group consists of time-domain methods, which detect F0 using the quasi-periodicity of voiced speech; the autocorrelation function method is representative (Sondhi, 1968). The third group consists of time-frequency-domain methods, which combine the two: the speech is divided into multiple frequency channels with band-pass filters, and F0 is detected by evaluating the periodicity of the signal in each channel.

Early PDA research used a single clean voice as input. In recent years, studies of PDAs for multiple voices or speech in noise have been reported (Shimamura and Kobayashi, 2001; Wu et al., 2003). Quantitative evaluation of PDA performance on naturally spoken speech requires a reference F0 against which detection errors are computed. Even in these recent studies, the reference F0 is computed from the clean single voice, using the cepstrum or autocorrelation, before it is mixed with noise or other voices. In this computation, however, the double-pitch problem, in which an integer multiple of the true F0 is detected, and jumps in the value at times when F0 changes rapidly are frequently observed, so the experimenter must correct these values manually (Wu et al., 2003). The fact pointed out by Hess (1983) at the time, that "a method for automatically detecting F0 even from clean, single-voice speech has not yet been perfected", must therefore be said to remain true today, more than twenty years later.

One factor that makes F0 detection in naturally spoken speech difficult is, as in sinusoidal modeling, the time variation of the signal parameters. In speech whose F0 changes rapidly, the harmonic structure of the spectrum is distorted, so the cepstrum and autocorrelation function methods, which assume stationarity near the analysis time, cannot detect F0 accurately. In contrast, the local rate-of-change coding (LVC) proposed in the present invention is characterized by its ability to analyze time-varying signals accurately and is expected to be effective as a new F0 detection method. Below, comparative experiments against existing methods are carried out to evaluate quantitatively the F0 detection performance of LVC and the effectiveness of the method.

(2) Properties of time-varying sinusoid signals
In sinusoidal modeling of voiced speech, handling the time variation of parameters such as amplitude and phase is an important issue. Here, a sinusoid signal whose instantaneous amplitude and phase vary with time as simple functions is taken as an example, and its properties are explained theoretically.

(2-1) Quadratic-Parameter Sinusoid signal
The parameters of the sinusoid components contained in voiced speech exhibit complex time-varying patterns. Focusing on one such component, expanding its instantaneous log amplitude A(t) and instantaneous phase P(t) in a Taylor series around time tc and truncating after the second-order terms gives Equations 2 and 3.

Here a₀, a₁, a₂, p₀, p₁, and p₂ are constants: a₀ corresponds to the instantaneous log amplitude, a₁ to the rate of change of the amplitude, a₂ to the acceleration of the amplitude change, p₀ to the instantaneous phase, p₁ to the angular frequency, and p₂ to the rate of change of the angular frequency. For example, a₁ = a₂ = p₂ = 0 denotes a pure tone of angular frequency p₁. For simplicity, the case tc = 0 is considered below. The complex signal s(t) generated from these parameters is given by Equation 4.

Here α₀, α₁, and α₂ are complex constants. The real part of this complex signal corresponds to the approximation of the component signal near time tc. The instantaneous amplitude and phase of naturally spoken speech follow patterns more complex than a quadratic function, but it is useful to examine the properties of this signal as the simplest time-varying signal. Below, s(t) is called a Quadratic-Parameter Sinusoid (QPS).
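
The images of Equations 2 to 4 are not reproduced in this text. A hedged reconstruction from the definitions above is given below. It assumes that the second-order coefficients a₂ and p₂ absorb the Taylor factor 1/2 and that the complex constants are αᵢ = aᵢ + j pᵢ; neither convention can be confirmed from the surviving text.

```latex
\begin{align}
A(t) &= a_0 + a_1 (t - t_c) + a_2 (t - t_c)^2 \\
P(t) &= p_0 + p_1 (t - t_c) + p_2 (t - t_c)^2 \\
s(t) &= \exp\bigl(A(t) + jP(t)\bigr)
      = \exp\bigl(\alpha_0 + \alpha_1 t + \alpha_2 t^2\bigr),
      \qquad \alpha_i = a_i + j p_i \quad (t_c = 0)
\end{align}
```

Under this convention the instantaneous angular frequency is dP/dt = p₁ + 2p₂t at tc = 0.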

(2-2) Frequency response of the QPS
The frequency response S₀(ω) of the QPS signal s(t) is computed by Equations 5 and 6, where T is a constant representing the analysis time, w(t) is a Gaussian window function, and γ is a positive constant that determines the duration of the window.

Equation 5 cannot be evaluated analytically over the full range of QPS parameters. It can, however, be solved analytically at least when the amplitude envelope of s(t)w(t) can be regarded as zero at the end points t = ±T of the analysis interval. Specifically, this condition means that the amplitude envelope has a peak value Apeak between t = -T and T, and that this peak exceeds the end-point value Aedge = A(±T) by a margin of Margin (dB). For this to hold, the parameters a₁ and a₂ must lie in the ranges given by Equations 7 and 8.

When the above condition is satisfied, the frequency response S₀(ω) of the QPS can be written as Equation 9. That is, the log-amplitude response and the phase response of S₀(ω) are both quadratic functions of the angular frequency ω.
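
Equations 5, 6, and 9 are not reproduced in this text. As a hedged sketch of why the response is quadratic in ω, assume the window is w(t) = exp(-γt²) (the exact parameterization of γ is an assumption), that the QPS has the form s(t) = exp(α₀ + α₁t + α₂t²), and that the envelope is negligible at t = ±T so that the integration limits can be extended to ±∞. Writing β = γ - α₂ with Re β > 0, the standard Gaussian integral gives:

```latex
\begin{align}
S_0(\omega) &= \int_{-\infty}^{\infty} s(t)\,w(t)\,e^{-j\omega t}\,dt
             = \int_{-\infty}^{\infty} \exp\bigl(\alpha_0 + (\alpha_1 - j\omega)\,t - \beta t^2\bigr)\,dt \\
            &= \sqrt{\frac{\pi}{\beta}}\;
               \exp\!\left(\alpha_0 + \frac{(\alpha_1 - j\omega)^2}{4\beta}\right) \\
\log S_0(\omega) &= \alpha_0 + \tfrac{1}{2}\log\frac{\pi}{\beta} + \frac{(\alpha_1 - j\omega)^2}{4\beta}
\end{align}
```

The real part of log S₀(ω) is the log-amplitude response and the imaginary part is the phase response; both are quadratic in ω, consistent with the statement above.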

FIG. 1 shows examples of S₀(ω).
(a) Amplitude response of the QPS. The solid line is the response of a QPS whose instantaneous log amplitude and instantaneous frequency vary with time (a₀ = a₂ = p₀ = 0, a₁ = 0.4 dB/ms, p₁ = 100 Hz, p₂ = 1.0 Hz/ms). The thin line is the response of a QPS corresponding to a pure tone (a₀ = a₁ = a₂ = p₀ = p₂ = 0, p₁ = 100 Hz). The arrow indicates the difference between the peak frequency of the amplitude response and the actual instantaneous frequency (F-shift).
(b) Phase response of the QPS. The solid and thin lines are as in (a). The arrow indicates the difference between the phase at the peak frequency of the response and the actual instantaneous phase (P-shift).
(c) Change in the F-shift as the parameters p₂ and a₁ are varied. The lines correspond to QPSs with a₁ = 0.2, 0.4, and 0.8 dB/ms.
(d) P-shift as the parameters are varied in the same way as in (c).
In FIGS. 1(a) and 1(b), the thin lines correspond to the frequency response of the pure tone and the solid lines to the response of the QPS whose amplitude and frequency change linearly. Both signals have zero instantaneous log amplitude and zero phase at time t = 0 and an instantaneous frequency of 100 Hz there, yet the peak frequency of the amplitude response of the QPS containing the time-varying terms is not 100 Hz, and the phase at the peak is not zero. These shifts of the response peak frequency and phase are shown in FIGS. 1(c) and 1(d). The magnitudes of the shifts shown here can be derived uniquely from Equation 9.
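
The F-shift can be reproduced numerically with a short sketch. The window form w(t) = exp(-γt²), the value of `gamma`, the sampling rate, and the convention that the phase is p₁t + p₂t² are all assumptions made for illustration, so the numbers below do not correspond to those plotted in FIG. 1. Under these assumptions the Gaussian integral predicts a peak displaced by a₁p₂/(γ - a₂) in angular frequency, which the code compares with a brute-force peak search.

```python
import numpy as np

fs = 8000.0
t = np.arange(-400, 400) / fs             # 100 ms span centred on t = 0
gamma = 1.0e4                             # hypothetical window constant, w(t) = exp(-gamma t^2)
p1 = 2.0 * np.pi * 100.0                  # instantaneous frequency of 100 Hz at t = 0
freqs = np.arange(95.0, 110.0, 0.01)      # dense frequency grid (Hz) around the component
kernel = np.exp(-2j * np.pi * np.outer(freqs, t))

def peak_frequency_hz(a1, p2):
    """Peak of |S0| for s(t) = exp(a1 t + j (p1 t + p2 t^2)) under the Gaussian window."""
    s = np.exp(a1 * t + 1j * (p1 * t + p2 * t**2))
    s0 = kernel @ (s * np.exp(-gamma * t**2)) / fs
    return freqs[np.argmax(np.abs(s0))]

a1 = 400.0 * np.log(10.0) / 20.0          # 0.4 dB/ms expressed in nepers per second
p2 = np.pi * 1000.0                       # d/dt (p1 t + p2 t^2) = p1 + 2 p2 t, i.e. +1 Hz per ms

pure_peak = peak_frequency_hz(0.0, 0.0)   # pure tone: peak stays at 100 Hz
qps_peak = peak_frequency_hz(a1, p2)      # time-varying QPS: peak is displaced
predicted_shift_hz = a1 * p2 / gamma / (2.0 * np.pi)
```

With a rising amplitude and a rising frequency, the later, louder, higher-frequency part of the frame dominates the spectrum, so the peak moves upward even though the instantaneous frequency at the frame centre is exactly 100 Hz.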

This result suggests that methods such as peak picking on the frequency response cannot accurately analyze the instantaneous frequency and amplitude of a time-varying QPS. It also explains why, in voiced speech with a harmonic structure, the amplitude peaks in the spectrum do not necessarily lie at integer multiples of F0 when F0 or the amplitude varies with time, even though the instantaneous frequency of each component is an integer multiple of F0.

(2-3) Local rate-of-change transform (Local Vector Transform: LVT)
From the general expression for the frequency response of the QPS, Equation 9, its first derivative S₁(ω) and second derivative S₂(ω) with respect to angular frequency can be computed as Equations 10 and 11.

If these three frequency responses S₀(ω), S₁(ω), and S₂(ω) can be obtained from the input signal s(t), the signal parameters α₀, α₁, and α₂ can be determined uniquely at any angular frequency ω. This is expressed by Equations 12, 13, and 14.
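
Equations 10 to 14 are not reproduced in this text. One way such an inversion can work is sketched below. It assumes a Gaussian window w(t) = exp(-γt²) and a QPS of the form s(t) = exp(α₀ + α₁t + α₂t²), for which the Gaussian integral gives S₀(ω) = √(π/β) exp(α₀ + (α₁ - jω)²/(4β)) with β = γ - α₂. This is an illustrative derivation under those assumptions, not necessarily the form used in the patent.

```latex
\begin{align}
R(\omega) &\equiv \frac{S_1(\omega)}{S_0(\omega)} = \frac{-j\,(\alpha_1 - j\omega)}{2\beta}, &
Q(\omega) &\equiv \frac{S_2(\omega)}{S_0(\omega)} - R(\omega)^2 = -\frac{1}{2\beta}, \\
\beta &= -\frac{1}{2\,Q(\omega)}, &
\alpha_2 &= \gamma - \beta, \\
\alpha_1 &= j\omega + 2j\,\beta\,R(\omega), &
\alpha_0 &= \log S_0(\omega) - \tfrac{1}{2}\log\frac{\pi}{\beta} - \frac{(\alpha_1 - j\omega)^2}{4\beta}
\end{align}
```

Each parameter follows from the three responses at a single ω, which matches the claim that the parameters can be determined at any angular frequency. The imaginary part of α₀ (the phase) is recovered only modulo 2π because of the complex logarithm.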

FIG. 2 shows the parameters estimated for a voiced speech input. The input corresponds to the first vowel /i/ of the utterance /iyoiyo/ by a male speaker. The speech data are M107B-0002 from the 216 phoneme-balanced words of the ATR digital speech database.
(a) The solid line is the amplitude spectrum of the input signal. The thin lines are the amplitude responses obtained from the signal parameters estimated by the Local Vector Transform (LVT). The estimated parameters are computed from the frequency response at integer multiples of 140 Hz; the first through fifth components are shown.
(b) The solid line is the phase response of the input speech. The thin lines are the phase responses computed from the parameters estimated by the LVT.
(c) Instantaneous log amplitude (parameter a₀) estimated by the LVT. In regions where the input sound has sufficient energy, nearly identical values are estimated.
(d) Instantaneous phase (p₀) estimated by the LVT.
(e) Estimated rate of change of the instantaneous log amplitude (a₁).
(f) Estimated instantaneous frequency (p₁).
(g) Estimated acceleration of the change in instantaneous log amplitude (a₂).
(h) Estimated rate of change of the instantaneous frequency (p₂).
FIGS. 2(a) and 2(b) show the amplitude and phase responses of the input, and FIGS. 2(c) to 2(h) show the parameters estimated using Equations 12, 13, and 14. The frequency response of Equation 9 was also re-synthesized using the parameters estimated at integer multiples of 140 Hz and is shown as the thin lines in FIGS. 2(a) and 2(b). Near integer multiples of 140 Hz, the input frequency response and the re-synthesized frequency response agree well in both amplitude and phase. This means that QPS-based parameter estimation is also effective for real speech such as naturally spoken speech. In this specification, this parameter estimation operation is called the Local Vector Transform (LVT).

(3) Speech coding system based on local rate of change (LVC)
A sinusoidal modeling system for voiced speech is constructed on the basis of the local rate-of-change transform (LVT) described above. The process of estimating the instantaneous log amplitude and phase of each component from the input speech signal is called local rate-of-change coding (LVC). The configuration of the LVC system is described below.

(3-1) Overview
First, the general sinusoidal modeling expression, Equation 1, is rewritten as Equation 15, where Ak(t) is the instantaneous log amplitude of the k-th component and Pk(t) is its instantaneous phase. The instantaneous angular frequency of component k can be computed as the time derivative of Pk(t).

FIG. 3 shows the computational blocks of the LVC system, where x(t) is the input signal, P_F0(t) the instantaneous phase of the first harmonic component, Ai(t) the instantaneous amplitude of the i-th harmonic component, Pi(t) the instantaneous phase of the i-th harmonic component, and y(t) the re-synthesized waveform. The input to the system is the voiced speech signal x(t), and its outputs are the instantaneous log-amplitude function Ak(t) and instantaneous phase function Pk(t) of each component. Once these parameter functions have been estimated, the signal waveform y(t) is easily re-synthesized.
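
Re-synthesis from the parameter functions can be sketched as follows. Equation 15 is not reproduced in this text, so the form y(t) = Σk exp(Ak(t)) cos(Pk(t)) used here is an assumption inferred from the definitions of Ak(t) and Pk(t); the function name `resynthesize` and the test signal are likewise illustrative.

```python
import numpy as np

def resynthesize(log_amps, phases):
    """Sum exp(A_k(t)) * cos(P_k(t)) over the components k (one row per component)."""
    a = np.asarray(log_amps, dtype=float)
    p = np.asarray(phases, dtype=float)
    return np.sum(np.exp(a) * np.cos(p), axis=0)

# Illustrative three-harmonic signal whose F0 glides from 100 Hz to 150 Hz in 100 ms.
fs = 16000.0
t = np.arange(1600) / fs
p_first = 2.0 * np.pi * (100.0 * t + 250.0 * t**2)      # phase of the first harmonic
harmonics = (1, 2, 3)
phases = np.stack([k * p_first for k in harmonics])
log_amps = np.stack([np.log(1.0 / k) - 5.0 * t for k in harmonics])  # slowly decaying amplitudes
y = resynthesize(log_amps, phases)
```

Because the amplitudes are stored as logarithms, the synthesized amplitude is always positive and amplitude scaling becomes a simple offset of Ak(t).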

These parameter functions are estimated by a two-stage computation. The first stage estimates the instantaneous phase P_F0(t) of the first component of the input signal. In each time frame of the input signal, the component parameters are first computed by the LVT described above. As shown in FIG. 2, the LVT estimates parameters from the frequency response at a given time, but obtaining the amplitude and phase functions that form the system output requires connecting the parameter values computed in each time frame along the time direction. This is done by the subsequent Trajectory Estimation module. At this point the instantaneous phase function P_F0(t) of the first component of x(t) is obtained. In the F0 detection task, the time derivative of P_F0(t) divided by 2π is the output.
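
The final step of the F0 detection task, taking the time derivative of P_F0(t) and dividing by 2π, can be sketched as below. The use of finite differences via `numpy.gradient` and the name `f0_from_phase` are assumptions; the source does not say how the derivative is evaluated, and the phase is assumed to be already unwrapped.

```python
import numpy as np

def f0_from_phase(phase, t):
    """F0 in Hz as the time derivative of the (unwrapped) first-harmonic phase over 2*pi."""
    return np.gradient(phase, t) / (2.0 * np.pi)

# Illustrative phase function whose F0 rises linearly: F0(t) = 100 + 500 t (Hz).
fs = 16000.0
t = np.arange(1600) / fs
p_f0 = 2.0 * np.pi * (100.0 * t + 250.0 * t**2)
f0 = f0_from_phase(p_f0, t)
```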

In the second stage, the time axis of the input signal x(t) is first converted into phase using the P_F0(t) = φ estimated in the first stage (Time to Phase Conversion, described later). For the higher harmonic components the rate of frequency change is large, so the energy peak of one component spreads into the frequency region of other components. Time-to-phase conversion is introduced to reduce this inter-component interference. In the converted waveform x(φ), the time variation of F0 is normalized and the inter-component interference is reduced.
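
The idea of time-to-phase conversion can be illustrated with a simple resampling sketch. The patent's own procedure is described later and is not part of this section, so the linear interpolation with `numpy.interp`, the uniform phase grid, and the name `time_to_phase` are assumptions chosen only to show the effect.

```python
import numpy as np

def time_to_phase(x, phase, n_out=None):
    """Resample x(t) onto a uniform grid of phase values (phase must be strictly increasing)."""
    x = np.asarray(x, dtype=float)
    phase = np.asarray(phase, dtype=float)
    if np.any(np.diff(phase) <= 0.0):
        raise ValueError("phase must be strictly increasing")
    n_out = len(x) if n_out is None else n_out
    phi = np.linspace(phase[0], phase[-1], n_out)
    return phi, np.interp(phi, phase, x)

# Second harmonic of a glide from 100 Hz to 150 Hz: a chirp in t, a steady tone in phi.
fs = 16000.0
t = np.arange(1600) / fs
p_f0 = 2.0 * np.pi * (100.0 * t + 250.0 * t**2)
x = np.cos(2.0 * p_f0)
phi, x_phi = time_to_phase(x, p_f0)
```

On the phase axis the second harmonic becomes cos(2φ), a constant-frequency sinusoid, which is why the harmonics no longer smear into each other's frequency regions.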

The LVT is applied once more to this converted waveform x(φ) to estimate the parameters of all components, including the higher-order harmonic components. As in the first stage, the LVT output is connected in the time direction by the Trajectory Estimation module. What is obtained at this point is the instantaneous logarithmic amplitude A_k(φ) and the instantaneous phase P_k(φ) as functions of the phase φ. Finally, the phase axis φ of these functions is converted back to time t, giving the instantaneous logarithmic amplitude function A_k(t) and the phase function P_k(t) of each component. Each calculation block is described in detail below.

(3-2) Local Vector Transform
In LVC, the frequency response of the input signal x(t) is calculated at every fixed time step Δ. In the vicinity of each analysis time, each component of the input signal is assumed to be approximable by a QPS, and the QPS parameters at that time are calculated.

First, in the n-th time frame, the input signal is multiplied by a window function w(t) centered at time t = nΔ. Next, the complex frequency response X₀(ω) of x(t) and its first and second derivatives with respect to angular frequency, X₁(ω) and X₂(ω), are calculated by Equations 16, 17, and 18.

As is clear from the above equations, X₁(ω) and X₂(ω) can in practice be computed as short-time Fourier transforms (FFTs) whose inputs are the signal x(t) multiplied by (-jωt) and (-jωt)², respectively. Three FFTs are therefore executed for each time frame. Once these frequency responses are available, the parameters at every angular frequency are determined uniquely by the LVT of Equations 12 to 14.
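The three-FFT computation can be sketched as follows. Equations 16 to 18 are not reproduced in this text, so the sketch rests on an assumption: it uses the multipliers (-jt) and (-jt)², which is what differentiating the kernel e^{-jωt} with respect to ω yields. The Hann window, the odd window length, and the function name `lvt_frequency_responses` are likewise illustrative choices, not taken from the source.

```python
import numpy as np

def lvt_frequency_responses(x, fs, center, win_len, n_fft):
    """Return (omega, X0, X1, X2) for the frame centred at sample `center`.

    X1 and X2 are the first and second derivatives of X0 with respect to
    angular frequency, obtained from three FFTs of the same windowed frame.
    """
    if win_len % 2 == 0 or n_fft < win_len:
        raise ValueError("win_len must be odd and no longer than n_fft")
    half = win_len // 2
    idx = np.arange(center - half, center + half + 1)
    t = (idx - center) / fs                  # time relative to the frame centre (s)
    frame = x[idx] * np.hanning(win_len)     # window type is an assumption

    def spectrum(sig):
        # Non-negative-frequency half of the zero-padded FFT.
        return np.fft.fft(sig, n_fft)[: n_fft // 2 + 1]

    omega = 2.0 * np.pi * fs * np.arange(n_fft // 2 + 1) / n_fft
    # The FFT measures time from the first sample of the frame;
    # this factor moves the time origin to the frame centre.
    shift = np.exp(-1j * omega * t[0])
    X0 = spectrum(frame) * shift
    X1 = spectrum((-1j * t) * frame) * shift
    X2 = spectrum((-1j * t) ** 2 * frame) * shift
    return omega, X0, X1, X2

# Illustrative input: a steady 140 Hz cosine analysed with a 40 ms frame.
fs = 8000.0
n = np.arange(2000)
x = np.cos(2.0 * np.pi * 140.0 * n / fs)
omega, X0, X1, X2 = lvt_frequency_responses(x, fs, center=1000, win_len=321, n_fft=4096)
```

All three spectra share the same window and the same frequency grid, so the subsequent parameter estimation can be applied bin by bin.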

(3-3) Trajectory Estimation: Determination of the optimal path
The LVT yields QPS parameters at all frequencies in each time frame (see, for example, FIG. 2). These estimated parameters are valid where only a single energy component exists near the frequency in question, but they contain large errors in frequency regions where several components interfere. For example, around 210, 350, 490, and 630 Hz in FIG. 2 the estimated parameter values fluctuate strongly. To obtain the instantaneous logarithmic amplitude function A_k(t) and phase function P_k(t) of each component, the meaningful values must be extracted from the parameters estimated by the LVT.

This process can be viewed as the problem of determining, for time frame n, the angular frequency ω_BEST(k, n) that outputs a valid estimated parameter for component k. This calculation is called determination of the optimal path. Two constraints are used to determine the optimal path: (1) the stability of the LVT output values and (2) the time continuity of the output parameters.

As shown in FIG. 2, in frequency regions where the energy of a single component is dominant, the variation of the LVT output parameters is small. The six parameters a₀ to p₂ take nearly constant values near 140, 280, 420, 560, and 700 Hz. In other words, the smaller the variation of the parameter estimates at a frequency, the more appropriate that frequency is for the optimal path. To evaluate this, the standard deviation σ(n, ω) of the parameters estimated in the vicinity of angular frequency ω is calculated.

The parameters estimated at time frame n and angular frequency ω also include rates of change over time. Because the instantaneous amplitude and phase of voiced speech change continuously, comparing the time continuity of the parameters estimated in different time frames makes it possible to evaluate whether they originate from a single component. The time continuity C₁₂ between the parameter values estimated at a point G₁ = (n₁, ω₁) and those at a point G₂ = (n₂, ω₂) is evaluated using Equations 19 and 20.

Here g(t) is a linear function of t, and χ₁₂ corresponds to the joint probability that g(t) satisfies the normal distributions calculated from the parameter distributions of G₁ and G₂. μ_d is the parameter estimate. The time continuity is calculated from four of the six QPS parameters, excluding a₀ and p₀. This avoids the problem of phase rotation: the instantaneous phase of a sinusoid signal is normally a monotonically increasing function of time t, but the instantaneous phase obtained in each time frame is limited to the range ±π. To obtain the correct instantaneous phase, an integer multiple of 2π must be added to this value, but it is difficult to estimate this integer and evaluate the time continuity at the same time.

The optimal path is calculated by dynamic programming with the time continuity C₁₂ as the cost function. By limiting the calculation range in the time-frequency plane, the optimal paths of several components can also be calculated. In the first-stage calculation of LVC, the optimal path of the first harmonic component is computed from the frequency range 50 to 500 Hz; in the second stage, the path of component k is computed from the range k+0.5 to k+1.5 (in units of harmonic number).
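The dynamic-programming search can be sketched generically. Equations 19 and 20 are not reproduced in this text, so the sketch takes the costs as inputs: `local_cost` stands in for a stability measure such as σ(n, ω), and `transition_cost` stands in for a continuity penalty derived from C₁₂. Both names, and the way the two costs are added, are assumptions made for illustration.

```python
def best_path(local_cost, transition_cost):
    """Return the candidate index chosen in every frame on the minimum-cost path.

    local_cost[n][i]         cost of candidate i in frame n
    transition_cost(n, i, j) cost of moving from candidate i in frame n
                             to candidate j in frame n + 1
    """
    total = [list(local_cost[0])]   # best accumulated cost per candidate
    back = []                       # back-pointers for path recovery
    for n in range(1, len(local_cost)):
        row, ptr = [], []
        for j, cost_j in enumerate(local_cost[n]):
            # Best predecessor in the previous frame for candidate j.
            best_i = min(
                range(len(local_cost[n - 1])),
                key=lambda i: total[n - 1][i] + transition_cost(n - 1, i, j),
            )
            row.append(total[n - 1][best_i] + transition_cost(n - 1, best_i, j) + cost_j)
            ptr.append(best_i)
        total.append(row)
        back.append(ptr)
    # Trace back from the cheapest end point.
    j = min(range(len(total[-1])), key=total[-1].__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return path

# Toy example: three frames with three candidate frequency bins each.
local = [[0.1, 1.0, 1.0], [1.0, 0.2, 1.0], [1.0, 0.1, 1.0]]
path = best_path(local, lambda n, i, j: 0.5 * abs(i - j))
```

Restricting the candidate set per frame to a frequency band, as the text describes for each stage, amounts to passing only the bins inside that band in `local_cost`.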

FIG. 4 shows an example of an optimal path calculation. It shows the optimal path determined in the phoneme transition from the first /i/ to /y/ of the utterance /iyoiyo/ by a male speaker (the same speech as in FIG. 2). Taking (a₁, p₁) on the determined path as the instantaneous values and using their rates of change (a₂, p₂), the parameter values at each point are displayed as three-dimensional vectors (solid lines). The direction of each vector can be seen to connect smoothly with the vectors at neighboring times.

(3-4) Trajectory Estimation: Generation of the trajectory functions
The optimal path calculation gives the QPS parameter values at N points (t₀ to t_{N-1}). Based on these values, the instantaneous logarithmic amplitude A_k(t) and instantaneous phase P_k(t) of component k are determined. First, the phase rotation problem described in the previous section is resolved. The instantaneous phase function P_k(t) to be estimated is a monotonically increasing function of time, but because the value of p₀ obtained from each time frame is limited to ±π, it is not necessarily monotonically increasing. This arises because the information about a phase rotation of 2π × u (u is a positive integer) between time frames is lost. To obtain the correct phase function, the rotation coefficient u_n between each pair of consecutive time frames must be determined.

The phase rotation coefficient u_n between time frames n and n+1 can be obtained from Equation 21. Here v(t) is a cubic function representing the instantaneous angular frequency, whose coefficients are uniquely calculated from p₁(n), p₂(n), p₁(n+1), and p₂(n+1). Δ is the time frame interval.
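Equation 21 is not reproduced in this text. One common way to recover such a rotation count, shown here as a minimal sketch and not necessarily the patent's exact formula, is to compare the phase advance predicted by integrating v(t) over one frame interval with the wrapped phase difference. The sketch assumes that p₁ is the instantaneous angular frequency (rad/s), that p₂ is its time derivative (rad/s²), and that v(t) is the cubic Hermite interpolant matching those four values. The function name `phase_rotation` is hypothetical.

```python
import math

def phase_rotation(p0_a, p1_a, p2_a, p0_b, p1_b, p2_b, delta):
    """Return the integer u such that p0_b + 2*pi*u continues the phase of frame a."""
    # Closed-form integral over [0, delta] of the cubic Hermite interpolant v(t)
    # with v(0) = p1_a, v'(0) = p2_a, v(delta) = p1_b, v'(delta) = p2_b.
    advance = delta * (p1_a + p1_b) / 2.0 + delta ** 2 * (p2_a - p2_b) / 12.0
    return round((advance - (p0_b - p0_a)) / (2.0 * math.pi))

# Illustrative case: a steady 140 Hz component observed at two frames 5 ms apart.
delta = 0.005
w = 2.0 * math.pi * 140.0
true_phase = w * delta                                  # about 4.40 rad, outside +/-pi
wrapped = math.atan2(math.sin(true_phase), math.cos(true_phase))
u = phase_rotation(0.0, w, 0.0, wrapped, w, 0.0, delta)
unwrapped = wrapped + 2.0 * math.pi * u
```

In this example the wrapped phase at the second frame is negative, one rotation is recovered, and `unwrapped` equals the true phase advance.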

After the phase rotation problem has been resolved with the above equation, the amplitude function and phase function are determined from the parameter estimates at all times. Between consecutive time frames, these functions are expressed as quintic (fifth-order) functions of time t. The coefficients of each quintic function can be determined uniquely from the estimated parameter values, but functions estimated in this way often become oscillatory. This happens because small errors in the parameter estimation are amplified.

To suppress this oscillation, the following constraints are applied to the coefficients of each quintic function. (1) At the connection points of the calculated quintic functions, the derivatives with respect to time up to the fourth order are continuous. (2) At the connection points of the calculated quintic functions, the instantaneous value and the first and second derivatives maximize the joint probability of normal distributions whose means are the estimated parameters and whose variances are those used in determining the optimal path. Determining the coefficients of each quintic function under these constraints suppresses the oscillation described above.

FIG. 5 shows an example of the instantaneous logarithmic amplitude function and instantaneous phase function calculated from the parameter values and variances on the optimal path (male speaker, /iyoiyo/). In FIG. 5(a), the thick line is the calculated instantaneous logarithmic amplitude function, the white circles are the parameter values on the optimal path, and the error bars correspond to their variances. (b) shows the instantaneous phase function (after correction of the phase rotation), (c) the rate-of-change function of the instantaneous logarithmic amplitude, (d) the instantaneous frequency function, (e) the acceleration function of the instantaneous logarithmic amplitude, and (f) the rate-of-change function of the instantaneous frequency. Because the temporal smoothness described above is also used as a constraint in calculating the trajectory functions, each trajectory does not coincide exactly with the estimated parameters. Nevertheless, trajectories with little oscillation that stay close to the parameter values are generated.

(3-5) Time to Phase Conversion
When the interference between the components of the input signal is small, the trajectory function of each component is easily obtained by the above method. In ordinary voiced speech, however, the interference between components cannot be ignored, particularly among the higher-order harmonic components, so the method cannot be applied as it is. FIG. 6 shows such a case as an example of time/phase conversion. (a) shows the waveform and spectrogram of the transition from /i/ to /y/ in /iyoiyo/ by a male speaker, with regions of strong energy displayed in white. F0 rises from 140 Hz to 170 Hz (see FIG. 5). (b) shows the frequency amplitude response at time 450 ms in (a). Because F0 changes quickly, the interference among the higher-order harmonic components (2.0 kHz and above) becomes strong, and the amplitude peaks can no longer be approximated by quadratic functions. (c) shows the waveform and spectrogram after time/phase conversion of the signal in (a); the horizontal axis of the spectrogram is the phase of the first harmonic component and the vertical axis is the harmonic number. F0 can be seen to be almost constant. (d) shows the frequency amplitude spectrum at the phase corresponding to the time in (c); the interference between components is reduced and the amplitude peaks of the higher-order harmonic components appear clearly. FIG. 6(a) is the waveform and spectrogram of the utterance /iyoiyo/ by a male speaker, and the time range of the figure is the same as in FIG. 5. F0 rises rapidly between 400 ms and 500 ms. FIG. 5(f) shows that the rate of change of F0 reaches a maximum of about 0.8 Hz/ms. Since the rate of frequency change of the n-th harmonic component is n times this value, the interference between components cannot be ignored. Indeed, in the spectrum at t = 450 ms in FIG. 6(c), no clear peaks can be identified for components at 2 kHz and above.

To address this problem, a time/phase axis conversion is introduced. The conversion is realized by estimating with the LVT the phase P_F0(t) of the first harmonic component, which is little affected by interference, and replacing the time axis t of the input signal with this phase axis. Because x(t) is in practice sampled in discrete time, the value of x(t) at an arbitrary time t is interpolated with a sinc function. In the converted signal x(φ), phase (rad) takes the place of the time axis, and when this signal is frequency-analyzed, the harmonic number takes the place of angular frequency. This is expressed by Equations 22 and 23.
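The conversion can be sketched as a resampling step. Equations 22 and 23 are not reproduced in this text, so the details below are assumptions: the phase P_F0(t) is taken as given at the sample times, its inverse t(φ) is found by linear interpolation, and x is evaluated at those times with a truncated Whittaker-Shannon (sinc) sum. The function name `time_to_phase` and the parameter `n_per_cycle` are hypothetical.

```python
import numpy as np

def time_to_phase(x, fs, phase, n_per_cycle=64):
    """Resample x(t) onto a uniform grid of the first-harmonic phase (rad).

    phase[i] is the unwrapped, monotonically increasing P_F0 at sample i.
    """
    t = np.arange(len(x)) / fs
    phi = np.arange(phase[0], phase[-1], 2.0 * np.pi / n_per_cycle)
    t_of_phi = np.interp(phi, phase, t)          # invert the monotonic phase function
    # Sinc interpolation of x at the non-uniform times t_of_phi
    # (truncated to the available samples, so the edges are less accurate).
    kernel = np.sinc(fs * t_of_phi[:, None] - np.arange(len(x))[None, :])
    return phi, kernel @ x

# Illustrative input: F0 glides from 140 Hz to 170 Hz in 100 ms, with a third harmonic.
fs = 8000.0
t = np.arange(800) / fs
phase = 2.0 * np.pi * (140.0 * t + 150.0 * t ** 2)
x = np.cos(phase) + 0.5 * np.cos(3.0 * phase)
phi, x_phi = time_to_phase(x, fs, phase)
```

On the phase axis the example signal becomes cos φ + 0.5 cos 3φ, so the glide disappears and the two components sit at the constant harmonic numbers 1 and 3.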

This conversion reduces the inter-component interference caused by the time variation of F0. FIG. 6(b) shows the waveform and spectrogram after time/phase conversion. In this spectrogram the time variation of the component frequencies (now in units of harmonic number) has almost disappeared. FIG. 6(d) is the spectrum at the same time as FIG. 6(c), but clear peaks can be identified up to the higher-order components.

The phase axis of the instantaneous amplitude function A_k(φ) and instantaneous phase function P_k(φ) obtained in this way can easily be converted back to the time axis using P_F0(t), as in Equations 24 and 25.

The output of LVC can be expressed by 3N(2K+1) values: the sum of the QPS parameters for N points × K components (6 × N × K) and the fundamental phase parameters for N points (3N).

(4) Performance evaluation experiment using synthesized speech
Next, the performance evaluation of the LVC system is described. To allow a quantitative evaluation, synthesized speech with known parameters is used as the input signal. The evaluation items are the F0 detection error and the parameter estimation performance.

(4-1) Input signals
A total of 400 synthesized double vowels are created as input signals for the evaluation. The stimuli vary in three parameters: the vowel combination (25 combinations), the F0 change pattern (8 patterns), and the speaker (male, female). The F0 of the input signal is given by Equation 26. Here F_C and F_D are constants corresponding to the average level and the range of variation of F0: F_C = 100 Hz and F_D = 10 Hz for signals simulating a male speaker, and F_C = 200 Hz and F_D = 20 Hz for a female speaker. The constant L is the signal duration, which is 300 ms for all signals.

The F0 change pattern over the duration is determined by the parameter F_P. Eight values of F_P are prepared, from 0 to 1.75 in steps of 0.25. For example, with F_P = 0, F0 decreases smoothly over the duration, and with F_P = 0.5 it shows a peak-shaped pattern with its maximum at 150 ms. FIG. 7(a) shows representative F0 change patterns. FIG. 7(b) shows the formant frequencies F1 to F4 of the double vowel /ia/ for the male speaker, and (c) shows the instantaneous amplitudes of the harmonic components, for the 2nd to 15th components.

The double vowels are combinations of two of the five Japanese vowels. The frequency response corresponding to each vowel is calculated with a five-formant cascade Klatt speech synthesizer (Klatt, 1980). Two sets of vowel formant frequencies are prepared, one simulating a male speaker and one simulating a female speaker. Table 1 lists the formant frequencies of the input signals. These values are based on values extracted from natural speech uttered by one adult male and one adult female.

A cascade Klatt synthesizer is normally implemented with a sampling frequency of 10 kHz, which approximates a male speaker with a vocal tract length of about 17 cm. Since the female vocal tract is generally shorter than the male one, the sampling frequency of the Klatt synthesizer must be raised to simulate a female speaker. Here the sampling frequency for the female speaker is set to 12 kHz, which corresponds to a vocal tract length of about 14 cm.
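The correspondence between sampling frequency and vocal tract length can be checked with a uniform-tube approximation. The following is a sketch under assumptions not stated in the source: a lossless uniform tube closed at the glottis, a speed of sound c of about 340 m/s, and a synthesizer whose five formants fill the band up to the Nyquist frequency f_s/2. The formants of such a tube are spaced c/(2L) apart, so five formants below f_s/2 imply:

```latex
\frac{f_s}{2} = 5 \cdot \frac{c}{2L}
\quad\Longrightarrow\quad
L = \frac{5c}{f_s},
\qquad
L_{10\,\mathrm{kHz}} = \frac{5 \times 340}{10000} = 0.170\ \mathrm{m},
\qquad
L_{12\,\mathrm{kHz}} = \frac{5 \times 340}{12000} \approx 0.142\ \mathrm{m}.
```

Both values agree with the lengths of roughly 17 cm and 14 cm quoted in the text.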

The frequency response H(f) above the Nyquist frequency of the Klatt synthesizer is determined using Equations 27 and 28. Here K(f) is the frequency response of the Klatt synthesizer and F5 is the fifth formant frequency. H(f) means that at frequencies above the fifth formant frequency (male: 4.5 kHz, female: 5.4 kHz) the amplitude response decays at -48 dB/oct and the phase response becomes zero. This processing makes it possible to calculate a continuous response in the region above the Nyquist frequency of the Klatt synthesizer. FIG. 8 shows examples of the calculated H(f) as amplitude envelopes of synthesized vowels: (a) is the amplitude envelope of the vowel /i/, where the line thickness indicates the speaker (thick line: male, thin line: female), and (b) is the amplitude envelope of the vowel /o/.

The formant frequencies change smoothly from the preceding vowel to the following vowel. FIG. 7(b) shows the formant frequencies of the double vowel /ia/.

Based on this information, the input signals are created by Equations 29 and 30. The number of components K is 80 for both the male and the female speaker. The sampling frequency of the signals is 48 kHz and the duration is 300 ms. FIG. 7(c) shows an example of the amplitude patterns of the harmonic components.

(4-2) F0 detection performance
F0 detection (pitch determination) of speech has already been studied extensively, and various algorithms have been proposed. As a way of evaluating the performance of F0 detection algorithms, Rabiner (1977) proposed the error rate e (%) given by Equation 31. This specification also uses this error rate to evaluate performance.

To evaluate the difference in performance from existing methods, LVC is compared with the cepstrum method (Noll, 1966), a representative F0 detection method. F0 detection by the cepstrum is implemented as follows. First, a 40 ms Hamming window is applied at the same time frames as in LVC, and the power spectrum is calculated with an 8192-point FFT. Next, the spectrum obtained by converting the power of each frequency component to a logarithm is used as the input to a second FFT, which yields the cepstrum. The quefrency at which the energy of this cepstrum peaks corresponds to the fundamental period (1/F0) of the signal.
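The cepstrum baseline described above can be sketched in a few lines. The window length, FFT size, and the two-FFT structure follow the text; the small constant added before the logarithm, the use of the cepstrum magnitude for peak picking, the restricted search range passed as arguments, and the function name `cepstrum_f0` are assumptions of this sketch. The example uses a 16 kHz test signal for brevity, not the 48 kHz signals of the experiment.

```python
import numpy as np

def cepstrum_f0(frame, fs, f0_min, f0_max, n_fft=8192):
    """Estimate F0 (Hz) of one frame from the peak of its cepstrum."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.fft(windowed, n_fft)) ** 2
    log_power = np.log(power + 1e-12)            # small floor avoids log(0)
    cepstrum = np.abs(np.fft.fft(log_power))     # second FFT; index q is q / fs seconds
    # Search only the quefrencies that correspond to the allowed F0 range.
    q_lo = int(np.ceil(fs / f0_max))
    q_hi = int(np.floor(fs / f0_min))
    peak = q_lo + int(np.argmax(cepstrum[q_lo:q_hi + 1]))
    return fs / peak

# Illustrative input: a 40 ms harmonic signal with F0 = 100 Hz.
fs = 16000.0
n_samples = 640                                   # 40 ms at 16 kHz
t = np.arange(n_samples) / fs
frame = sum(np.cos(2.0 * np.pi * 100.0 * k * t) / k for k in range(1, 80))
f0 = cepstrum_f0(frame, fs, f0_min=80.0, f0_max=120.0)
```

Because the quefrency axis is quantized to whole samples, the estimate is limited to the values fs / q, which is one reason a cepstrum estimate can be coarser than a phase-derivative estimate.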

Here, to improve the F0 detection accuracy of the cepstrum method, the range in which F0 may lie is restricted in advance. For inputs with F_C = 100 Hz, the F0 search range for the quefrency peak is 80 to 120 Hz, and for inputs with F_C = 200 Hz it is 160 to 240 Hz. This condition is used only to improve the detection performance of the cepstrum method and is not applied to LVC, which performs F0 detection with the same parameters for all input signals.

FIG. 9 shows the F0 detection errors of LVC and the cepstrum method. Regardless of the type of input signal, the F0 detection error of LVC is always smaller than that of the cepstrum method. The average error over all inputs is 0.000650% for LVC and 0.163% for the cepstrum method. This result supports the effectiveness of the LVC system for F0 detection of voiced speech whose parameters vary over time.

In FIG. 9, which shows the evaluation of F0 detection performance, the horizontal axis is the vowel type of the synthesized speech and the vertical axis is the F0 detection error (%). The F0 estimation errors of LVC are plotted with ○ and ●, and the detection errors of the cepstrum method with ◇ and ◆. The error bars correspond to the standard deviation across F0 change patterns.

(4-3) Parameter estimation performance
Here the performance of LVC as a whole system is evaluated. Because the parameters of the input sounds are known, the estimation accuracy of the parameters could be calculated for each component; however, for correspondence with the case where the input parameters are unknown, the performance of the whole system is evaluated by comparing the input signal with the signal re-synthesized from the estimated parameters. Specifically, the input signal x(t) and the waveform y(t) re-synthesized from the estimated parameters are used to calculate the signal-to-residual power ratio (S/R) given by Equation 32. The S/R becomes larger as the residual between the input signal and the re-synthesized signal becomes smaller. Because the re-synthesized signal y(t) is calculated as the sum of all components, an error in the parameter estimation of even some components is reflected directly in the S/R. A high S/R therefore requires small errors in the parameter estimation of all components.
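The S/R measure can be sketched as a power ratio in decibels. Equation 32 is not reproduced in this text, so the exact form below (total input power divided by total residual power, expressed as 10 log10) is an assumption, as is the function name `signal_to_residual_db`.

```python
import math

def signal_to_residual_db(x, y):
    """Return 10*log10(sum(x^2) / sum((x - y)^2)) for equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    signal_power = sum(v * v for v in x)
    residual_power = sum((a - b) ** 2 for a, b in zip(x, y))
    if residual_power == 0.0:
        return math.inf                      # perfect re-synthesis
    return 10.0 * math.log10(signal_power / residual_power)

# Illustrative case: a re-synthesis whose amplitude is 1 % too small everywhere.
x = [math.sin(2.0 * math.pi * 140.0 * i / 8000.0) for i in range(800)]
y = [0.99 * v for v in x]
sr = signal_to_residual_db(x, y)
```

A uniform 1 % amplitude error leaves a residual 40 dB below the input, which gives a sense of how small the parameter errors must be to reach the S/R values reported for the synthesized signals.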

FIG. 10 shows the S/R for each vowel. The S/R varies with the vowel type; in particular, there is a large difference between the cases where the formant frequencies of the input synthesized speech do not change (5 types) and the cases where they change (20 types). The average S/R of the former is very high, 65.9 dB at F_C = 100 Hz and 69.1 dB at F_C = 200 Hz, whereas that of the latter is 38.4 dB at F_C = 100 Hz and 43.5 dB at F_C = 200 Hz. The average S/R over all samples is 46.3 dB, and no sample has an S/R below 30 dB.

(5) Performance evaluation experiment using natural speech
The performance evaluation experiment with synthesized speech showed that LVC can analyze, with high accuracy, voiced speech whose F0 and amplitude envelope vary over time. Here the performance is evaluated with naturally uttered speech as the input. Because the F0 of naturally uttered speech is unknown, the S/R is the only evaluation item.

(5-1) Input signals
The input speech is extracted from the 216 phoneme-balanced words of the ATR digital speech database. Among the words in the database, those in which all phonemes are voiced were searched for, and the 12 words shown in Table 2 were selected. The start and end times of each spoken word are determined from the acoustic event layer labels in the label information supplied with the database. These 12 words, uttered in a quiet room by 17 male and 17 female speakers (M101-M117, F101-F117), are used as the input signals for the performance evaluation. The total number of samples is 384.

(5-2) Parameter estimation performance
As with the synthesized speech, the instantaneous amplitude function and instantaneous phase function of each component of the naturally uttered speech are estimated by LVC, and the signal is re-synthesized from the obtained parameters. FIG. 11(a) shows an example of the input, re-synthesized, and residual signals. The input is /yumoa/ uttered by a male speaker, and the S/R is 27.6 dB. The figure shows that the re-synthesized signal almost coincides with the input signal. FIG. 11(c) shows the spectrograms of these signals. The F0 and amplitude envelope of the input speech vary strongly over time. The spectrum of the residual signal (FIG. 11(c), bottom) shows that energy remains in the time-frequency regions where the amplitude envelope of the input is strong. FIG. 11 shows examples of LVC analysis of natural speech. (a) is the utterance /yumoa/ by a male speaker: the top panel is the input signal, the middle panel the signal re-synthesized by LVC, and the bottom panel the residual signal. The input-to-residual ratio (S/R) is 27.6 dB. (b) is the utterance /yumoa/ by a female speaker in the same format as (a), with S/R = 30.4 dB. (c) is the spectrogram of the signals in (a): top, input signal; middle, re-synthesized signal; bottom, residual signal. (d) is the spectrogram for (b), in the same format as (c).

FIGS. 11(b) and 11(d) show the input, re-synthesized, and residual signals of /yumoa/ uttered by a female speaker, in the same format. The results are almost the same as for the male speaker. In particular, the energy in the residual spectrum at 0-200 ms around 500 Hz is important. This residual energy lies between the first and second harmonic components of the input signal. In other words, it is energy that falls outside the LVC assumption that a voiced signal can be expressed as a sum of sinusoid signals whose amplitudes and frequencies vary continuously.

According to Fant's source-filter theory, vocal fold vibration is the main sound source during voiced speech, but a small amount of turbulent noise is considered to be generated at the same time. Because this noise source is shaped by the same resonance filter, the vocal tract filter, as the vocal fold vibration source, the residual energy becomes strong in the regions where the amplitude envelope of the input signal is strong. The source signal due to vocal fold vibration is known to attenuate with frequency, whereas the turbulent noise source has no such characteristic. The magnitude of the residual signal can therefore be predicted to increase toward higher frequencies. FIG. 12 shows examples of input and residual signals for other utterances. The result in FIG. 12(c) in particular supports the hypothesis above.

FIG. 12 compares the input signal spectrum (upper panel) with the residual signal spectrum (lower panel). (a) is /nyuIN/ by a male speaker, S/R = 31.2 dB; (b) is /nyuiN/ by a female speaker, S/R = 33.2 dB; (c) is /reNai/ by a male speaker, S/R = 27.6 dB; (d) is /reNai/ by a female speaker, S/R = 30.5 dB.

FIG. 13 shows the average S/R for each word. The average S/R is 21.2 dB for words uttered by male speakers, 24.0 dB for female speakers, and 22.6 dB over all samples.

(6) Speech modulation based on LVC
The experiments above show that LVC can estimate the component parameters of naturally uttered voiced speech with high accuracy. Representing voiced speech with such parameters opens up various applications in speech processing. As one example, a speech modulation method that applies source-filter theory is described here.

(6-1) Speech rate modulation
The output of LVC is, for each component k of the speech, the instantaneous log-amplitude function A_k(t) and the phase function P_k(t). To apply Fant's source-filter theory, the amplitude envelope E(t, ω) due to the vocal tract filter is calculated by Equation (33). Here, Interpolate is a function that uses the instantaneous angular frequencies and instantaneous amplitudes of all components at a given time to obtain the instantaneous amplitude at an arbitrary angular frequency. It can be realized, for example, by linear interpolation of the instantaneous amplitudes of the two components closest to ω.
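The Interpolate function described above can be sketched as follows for a single time instant. This is only an illustration of the linear-interpolation option mentioned in the text, not a reproduction of Equation (33); the function name and arguments are hypothetical, and holding the envelope at the edge values outside the component range is an assumption made here.

```python
import numpy as np


def interpolate_envelope(omega, comp_omega, comp_log_amp):
    """Envelope value at angular frequency omega for one time instant.

    comp_omega: instantaneous angular frequencies of all components.
    comp_log_amp: instantaneous log amplitudes of the same components.
    Linearly interpolates between the two components nearest to omega;
    outside the component range the edge value is held (assumption).
    """
    comp_omega = np.asarray(comp_omega, dtype=float)
    comp_log_amp = np.asarray(comp_log_amp, dtype=float)
    order = np.argsort(comp_omega)  # np.interp needs increasing x values
    return np.interp(omega, comp_omega[order], comp_log_amp[order])
```

Evaluating this function at every analysis time t gives a sampled version of E(t, ω).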

Based on this representation, consider a method for modulating the speaking rate to an arbitrary speed. Simply changing the sampling frequency to alter the speaking rate produces modulated speech that sounds like a different speaker from the original. To avoid this, the envelope E must remain unchanged along the angular frequency axis, and the range of F0 must be the same as in the original speech. Specifically, the instantaneous amplitude A'(t) and phase P'(t) of the modulated speech whose speaking rate is multiplied by M_T are given by Equations (34) and (35).
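Equations (34) and (35) are not reproduced in this text, so the following is only a sketch of one reading that satisfies the two stated conditions. It assumes A'(t) = A(M_T t) and P'(t) = P(M_T t) / M_T (up to the initial phase), so that the instantaneous frequency at each point of the utterance, and hence the F0 range and the position on the envelope, is preserved while the duration changes by 1/M_T. The function name and the uniform sampling grid are assumptions.

```python
import numpy as np


def change_rate(t, log_amp, phase, m_t):
    """Time-scale one component so the utterance is spoken m_t times faster.

    t: uniformly sampled time axis; log_amp, phase: A_k(t), P_k(t) on t.
    Returns (t_new, log_amp_new, phase_new) on the same sampling interval.
    """
    t = np.asarray(t, dtype=float)
    dt = t[1] - t[0]
    n_new = int(round((len(t) - 1) / m_t)) + 1
    t_new = t[0] + dt * np.arange(n_new)
    src = t[0] + m_t * (t_new - t[0])       # where to read the original
    log_amp_new = np.interp(src, t, log_amp)
    # Dividing the phase advance by m_t keeps the instantaneous frequency.
    phase_new = phase[0] + (np.interp(src, t, phase) - phase[0]) / m_t
    return t_new, log_amp_new, phase_new
```

With M_T = 2, for instance, a 1 s component at a constant 100 Hz becomes a 0.5 s component that is still at 100 Hz, in contrast to plain resampling, which would double the frequency.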

The above conversion yields smooth speech that is perceived as the same speaker at any speaking rate.

(6-2) Modulation of amplitude envelope and F0
The individuality of a speaker contained in speech is strongly influenced by the amplitude envelope, which is related to the vocal tract filter, and by F0, which is related to vocal fold vibration. For example, to modulate the speech of a woman with a short vocal tract into the speech of a man with a long vocal tract, it suffices to change the frequency axis of the envelope function. To change the pitch of the voice, it suffices to change the instantaneous phase function. Specifically, the instantaneous amplitude A'(t) and phase P'(t) of speech in which the speaker's vocal tract length is multiplied by M_V and the fundamental frequency by M_P can be written as Equations (36) and (37).
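Equations (36) and (37) are not reproduced in this text. The sketch below shows one plausible form for a single time instant, under the following assumptions: the pitch change multiplies every component frequency by M_P (equivalently P'(t) = M_P P(t)), and lengthening the vocal tract by M_V lowers its resonances, so the new envelope at ω is read from the original envelope at M_V ω. The function name, the direction of the frequency scaling, and the use of linear interpolation for the envelope are all assumptions.

```python
import numpy as np


def modulate_frame(comp_omega, comp_log_amp, m_v, m_p):
    """Modify vocal tract length (m_v) and F0 (m_p) for one time instant.

    comp_omega must be sorted in increasing order (harmonics normally are).
    Returns the new component frequencies and their log amplitudes.
    """
    comp_omega = np.asarray(comp_omega, dtype=float)
    comp_log_amp = np.asarray(comp_log_amp, dtype=float)
    new_omega = m_p * comp_omega                 # pitch change
    # Envelope with a stretched frequency axis: E'(w) = E(m_v * w).
    new_log_amp = np.interp(m_v * new_omega, comp_omega, comp_log_amp)
    return new_omega, new_log_amp
```

With M_V = 1 and M_P = 1 the frame is returned unchanged. Lowering only the pitch moves the components along a fixed envelope, which is the behavior the text requires for preserving the vocal tract characteristics.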

With the above conversion, modulated speech that is perceived as a female speaker can be generated from the utterance of a male speaker. The reverse is of course also possible.

(6-3) Voice morphing
Applying the amplitude envelope and F0 modulation methods described above enables continuous morphing from one voice to another. A feature of this method is that even the intermediate voices sound like natural utterances. Specifically, given the envelope functions E_A and E_B and the phase functions P_A and P_B of the two voices, the ratio R_12 of the speakers' vocal tract lengths, and the mixing ratio M_X of the two voices, the amplitude A'(t) and phase P'(t) of the morphed speech can be calculated by Equations (38) and (39).
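Equations (38) and (39) are not reproduced in this text, so the following single-frame sketch is a hypothetical illustration rather than the patented formula. It assumes that the two utterances are already time-aligned, that R_12 is the vocal tract length of speaker B divided by that of speaker A, that the intermediate vocal tract length is interpolated geometrically, and that component frequencies and log envelopes are blended linearly with weight M_X. All names are assumptions.

```python
import numpy as np


def morph_frame(omega_a, log_amp_a, omega_b, log_amp_b, r12, m_x):
    """Blend one time instant of voice A (m_x = 0) and voice B (m_x = 1).

    omega_*: sorted component angular frequencies; log_amp_*: log amplitudes.
    r12: assumed to be (tract length of B) / (tract length of A).
    """
    omega_a = np.asarray(omega_a, dtype=float)
    omega_b = np.asarray(omega_b, dtype=float)
    new_omega = (1.0 - m_x) * omega_a + m_x * omega_b
    # Length of the intermediate tract relative to A and to B.
    s_a = r12 ** m_x
    s_b = r12 ** (m_x - 1.0)
    env_a = np.interp(s_a * new_omega, omega_a, log_amp_a)
    env_b = np.interp(s_b * new_omega, omega_b, log_amp_b)
    return new_omega, (1.0 - m_x) * env_a + m_x * env_b
```

By construction the sketch returns voice A unchanged at M_X = 0 and voice B unchanged at M_X = 1, with F0 and the envelope moving continuously in between.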

FIG. 14 shows an example of morphing. FIG. 14(a) is a female utterance and FIG. 14(f) is a male utterance; as the mixing ratio is changed, F0 and the amplitude envelope can be seen to change smoothly.

FIG. 14 shows an example of morphing for the utterance /warauoNna/. (a) is the spectrogram of the female speaker; (b) is the spectrogram of modulated speech obtained by morphing (a) and (f) at a mixing ratio of 20%; (c), (d), and (e) are for mixing ratios of 40%, 60%, and 80%, respectively; (f) is the spectrogram of the male speaker.

As described above, the present invention makes possible the accurate analysis of time-varying voiced speech, which was difficult with existing techniques. A wide range of applications can therefore be expected from the technique according to the present invention. In this application, speech modulation was given as one example. In addition, a method for analyzing the temporal variation of natural speech in detail is expected to contribute greatly to the future development of speech production research (speech synthesis research).

For example, FIG. 15 shows a configuration example of a system representing the outline of the steps up to speech signal synthesis according to the present invention. By carrying the processing through to the time axis conversion of the phase spectrum in this way, the present invention enables speech signal analysis of extremely high accuracy, and the analyzed signal can be used to reproduce and synthesize speech signals accurately and with substantial freedom.

Although this application has been described mainly in terms of speech, it goes without saying that the same application and development are possible for musical instrument sounds.

FIG. 1 is a characteristic diagram showing the frequency response of QPS.
FIG. 2 is a characteristic diagram showing a calculation example of LVT.
FIG. 3 is a block diagram showing a configuration example of the LVC system.
FIG. 4 is a characteristic diagram showing an example of optimal path determination.
FIG. 5 is a characteristic diagram showing an example of trajectory function generation.
FIG. 6 is a characteristic diagram showing an example of time/phase conversion.
FIG. 7 is a characteristic diagram showing an example of the input signal for performance evaluation.
FIG. 8 is a characteristic diagram showing an example of the amplitude envelope of a synthetic vowel.
FIG. 9 is a characteristic diagram showing an example of F0 detection performance.
FIG. 10 is a characteristic diagram showing an example of parameter estimation accuracy.
FIG. 11 is a characteristic diagram showing LVC analysis example 1 for natural speech.
FIG. 12 is a characteristic diagram showing LVC analysis example 2 for natural speech.
FIG. 13 is a characteristic diagram showing an example of parameter estimation performance for natural speech.
FIG. 14 is a characteristic diagram showing an example of voice morphing.
FIG. 15 is an explanatory diagram showing a configuration example of the system according to the present invention.

Claims

An acoustic signal analysis method, characterized in that, for an input acoustic signal having time-varying characteristics, the phase function of the first harmonic component is estimated by local rate-of-change coding, the time axis of the signal is then converted to this phase axis, and the converted signal is analyzed again by local rate-of-change coding, whereby the instantaneous amplitude, instantaneous frequency, and instantaneous phase of all components of the input signal are output as re-synthesizable parameter functions.

An acoustic signal synthesis method, characterized in that an acoustic signal is re-synthesized using the parameter functions of the acoustic signal analysis method according to claim 1.

The acoustic signal synthesis method according to claim 2, wherein the sound quality is converted to a different sound quality, either from the re-synthesized acoustic signal or at the time the acoustic signal is re-synthesized.

The method according to any one of claims 1 to 3, wherein the acoustic signal comprises a speech signal.

The method according to any one of claims 1 to 3, wherein the acoustic signal comprises a musical instrument sound signal.