JP5325130B2 - LPC analysis device, LPC analysis method, speech analysis/synthesis device, speech analysis/synthesis method, and program - Google Patents

Info

Publication number: JP5325130B2
Authority: JP (Japan)
Prior art keywords: lpc, signal, speech, analysis, pitch mark
Legal status: Active
Application number: JP2010012963A
Other languages: Japanese (ja)
Other versions: JP2011150232A (en)
Inventor: Sadao Hiroya (廣谷定男)
Current Assignee: Nippon Telegraph and Telephone Corp
Original Assignee: Nippon Telegraph and Telephone Corp
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2010012963A
Publication of JP2011150232A
Application granted
Publication of JP5325130B2
Status: Active


Description

The present invention relates to a speech analysis/synthesis apparatus and a speech analysis/synthesis method that extract a source signal and a vocal tract spectrum from a speech signal and synthesize speech by driving a vocal tract filter with the source signal, and in particular to an LPC analysis apparatus and an LPC analysis method that extract a vocal tract spectrum unaffected by the fundamental frequency.

To improve the performance of speech synthesis, coding, and recognition, it has long been considered important to decompose a speech signal efficiently and accurately into a source signal and a vocal tract spectrum, following the human speech production mechanism. Linear predictive (LPC) analysis is widely used for this decomposition, but because it assumes white noise as the source signal, the resulting vocal tract spectrum is noticeably affected by the fundamental frequency (F0). In female voices in particular, the fundamental frequency is high and the white-noise assumption is not satisfied, so the vocal tract spectrum estimated by LPC analysis contains the fundamental frequency of the source signal and its harmonics, making it difficult to obtain an accurate vocal tract spectrum.

To avoid this fundamental-frequency problem in LPC analysis, a sample-selective linear prediction method has been proposed (see, for example, Non-Patent Document 1). Speech samples at times where the LPC residual signal, obtained by passing the speech signal through the LPC inverse filter, becomes large do not satisfy the white-noise source assumption, so this method excludes those samples and performs LPC analysis on the rest. This is expected to yield a vocal tract spectrum with little influence from the fundamental frequency; however, because the method does not take the phase characteristics of the speech signal into account, it can select the wrong samples to exclude, and the accuracy of the resulting vocal tract spectrum is insufficient. Moreover, since samples are discarded, the amount of data available for the analysis may become insufficient.

To address the problem of the phase characteristics contained in the speech signal, a method called AR-HMM has been proposed (see, for example, Non-Patent Document 2). By modeling the source with an HMM, this method enables LPC analysis that is robust to the fundamental frequency; however, because the source model is complex and includes phase characteristics, the computation is slow and a stable solution is difficult to obtain.

[Non-Patent Document 1] Yoshiaki Miyoshi, Kazuharu Yamato, Masuzo Yanagida, and Osamu Kakusho, "Analysis of high-pitched speech by a two-stage sample-selective linear prediction method," IEICE Transactions, Vol. J70-A, No. 8, pp. 1146-1156, 1987.
[Non-Patent Document 2] Akira Sasou and Kazuyo Tanaka, "Modeling of a sound source with an HMM and vocal tract characteristic extraction robust to high fundamental frequencies," IEICE Transactions, Vol. J84-D-II, No. 9, pp. 1960-1969, 2001.

In view of the above, an object of the present invention is to efficiently obtain, from a speech signal, an accurate and stable vocal tract spectrum that is not affected by the fundamental frequency.

According to this invention, an LPC analysis apparatus takes as input a phase-equalized speech signal and a set of pitch mark times. The source signal is assumed to consist of a single pulse of amplitude G at each pitch mark time and white noise at all other times, and the LPC coefficients and the amplitude G are determined so as to minimize the error between the phase-equalized speech signal and the speech signal obtained from the LPC coefficients and the source signal.

A speech analysis/synthesis apparatus according to this invention comprises: a speech-section detection unit that detects speech sections in an input speech signal; a fundamental-frequency analysis unit that estimates the fundamental frequency from the speech signal in each speech section; a first LPC analysis unit that windows the speech signal with a window length determined from the fundamental frequency, performs LPC analysis, and obtains an LPC residual signal by passing the speech signal through the LPC inverse filter; a pitch mark analysis unit that generates a pitch waveform corresponding to the fundamental period obtained from the fundamental frequency and extracts a set of pitch mark times using the pitch waveform and the LPC residual signal; a phase-equalized speech generation unit that generates a phase-equalized speech signal by applying to the speech signal a phase equalization filter derived from the pitch mark times and the LPC residual signal; a second LPC analysis unit that, assuming a source signal with a single pulse of amplitude G at each pitch mark time and white noise at all other times, determines the LPC coefficients and the amplitude G so as to minimize the error between the phase-equalized speech signal and the speech signal obtained from the LPC coefficients and the source signal; a multipulse source model generation unit that obtains a pulse gain and a multipulse source model using the pitch mark times, the phase-equalized speech signal, and the LPC coefficients; a white-noise gain generation unit that computes a white-noise gain using the PARCOR coefficients k and the autocorrelation function R obtained during the LPC analysis in the second LPC analysis unit; and a speech synthesis unit that synthesizes a speech signal by convolving the LPC coefficients with a source signal that, outside the speech sections, is white noise multiplied by the white-noise gain and, within the speech sections, is either a multipulse sequence computed from the fundamental frequency, the pulse gain, and the multipulse source model, or a single pulse train computed from the fundamental frequency and the pulse gain.

According to this invention, an accurate and stable vocal tract spectrum that is not affected by the fundamental frequency can be obtained efficiently.

FIG. 1 is a block diagram showing the functional configuration of an embodiment of the speech analysis/synthesis apparatus according to this invention.
FIG. 2 is a flowchart (part 1) showing the flow of processing in the speech analysis/synthesis apparatus shown in FIG. 1.
FIG. 3 is a flowchart (part 2) showing the flow of processing in the speech analysis/synthesis apparatus shown in FIG. 1.
FIG. 4 is a flowchart (part 3) showing the flow of processing in the speech analysis/synthesis apparatus shown in FIG. 1.
FIG. 5 is a graph showing analysis results for a vocal tract spectrum.
FIG. 6 shows analysis results for a vocal tract spectrum sequence.
FIG. 7 shows example waveforms at each stage of the speech analysis/synthesis process.

Embodiments of the present invention will be described by way of example with reference to the drawings.

FIG. 1 shows the functional configuration of an embodiment of the speech analysis/synthesis apparatus according to this invention. In this example, the speech analysis/synthesis apparatus 10 comprises a speech-section detection unit 11, a fundamental-frequency analysis unit 12, a first LPC analysis unit 13, a pitch mark analysis unit 14, a phase-equalized speech generation unit 15, a second LPC analysis unit 16, a multipulse source model generation unit 17, a white-noise gain generation unit 18, and a speech synthesis unit 19.

FIGS. 2 to 4 show the flow of processing in the speech analysis/synthesis apparatus 10 shown in FIG. 1. The function of each unit and the flow of processing are described below with reference to FIGS. 1 to 4.
<Speech-section detection unit>
First, the speech-section detection unit 11 detects speech sections by thresholding the power of the speech signal (original speech) (step S1).
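As a rough illustration of step S1 (the patent specifies only power thresholding, so the frame length, shift, and threshold below are assumptions), power-based speech-section detection can be sketched as:

```python
import math

def detect_speech_sections(signal, frame_len=160, shift=64, threshold=1e-4):
    """Detect speech sections by thresholding short-time power (step S1, sketch).

    Returns (start_frame, end_frame) pairs for runs of frames whose mean
    power exceeds the threshold. All parameter values are illustrative.
    """
    n_frames = max(0, (len(signal) - frame_len) // shift + 1)
    voiced = []
    for t in range(n_frames):
        frame = signal[t * shift: t * shift + frame_len]
        power = sum(x * x for x in frame) / frame_len
        voiced.append(power > threshold)
    # Merge consecutive above-threshold frames into sections
    sections, start = [], None
    for t, v in enumerate(voiced):
        if v and start is None:
            start = t
        elif not v and start is not None:
            sections.append((start, t))
            start = None
    if start is not None:
        sections.append((start, len(voiced)))
    return sections

# Example: 1000 samples of silence followed by a louder sinusoid
sig = [0.0] * 1000 + [0.5 * math.sin(0.2 * n) for n in range(1000)]
print(detect_speech_sections(sig))
```

In a real front end the threshold would typically be adaptive (relative to the noise floor) rather than fixed.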

<Fundamental-frequency analysis unit>
Next, the fundamental-frequency analysis unit 12 estimates the fundamental frequency from the speech signal in each detected speech section using a pitch extraction algorithm. In this embodiment, for example, the fundamental frequency is obtained from the instantaneous frequency amplitude spectrum with an analysis window length (analysis interval) of 30 ms and an analysis shift length of 4 ms (step S2). For the fundamental-frequency analysis, a technique based on the instantaneous frequency amplitude spectrum such as that described in Reference A below can be used.
Reference A: Arifianto, D., Tanaka, T., Masuko, T., and Kobayashi, T., "Robust F0 estimation of speech signal using harmonicity measure based on instantaneous frequency," IEICE Trans. Information and Systems, Vol. E87-D, No. 12, pp. 2812-2820, 2004.

<First LPC analysis unit>
To obtain the LPC residual signal used for phase equalization, the first LPC analysis unit 13 windows the speech signal with a Blackman window whose length is 2.5 times the fundamental period (fundamental period = 1 / fundamental frequency), at an analysis shift length of 4 ms, and performs LPC analysis by the autocorrelation method (step S3). The LPC residual signal is then obtained by passing the speech signal through the LPC inverse filter (step S4).
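A minimal sketch of autocorrelation-method LPC analysis and inverse filtering (steps S3-S4): Blackman windowing, the Levinson-Durbin recursion, and residual computation. The window length and analysis order here are illustrative, not the patent's values.

```python
import math

def blackman(n_samples):
    # Blackman window coefficients
    return [0.42 - 0.5 * math.cos(2 * math.pi * n / (n_samples - 1))
            + 0.08 * math.cos(4 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def autocorr(frame, max_lag):
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations; returns (a, parcor, err).

    Prediction model: s[n] is approximated by -sum(a[i]*s[n-i], i=1..order),
    with a[0] = 1; `parcor` are the reflection (PARCOR) coefficients and
    `err` the final prediction-error energy.
    """
    a = [1.0] + [0.0] * order
    parcor = []
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        parcor.append(k)
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, parcor, err

def lpc_residual(signal, a):
    # LPC inverse filter A(z) applied to the signal (step S4)
    p = len(a) - 1
    return [sum(a[i] * signal[n - i] for i in range(min(p, n) + 1))
            for n in range(len(signal))]

# Example: analyze one frame of a decaying resonance
sig = [0.9 ** n * math.sin(0.3 * n) for n in range(400)]
frame = [s * w for s, w in zip(sig, blackman(len(sig)))]
a, parcor, err = levinson_durbin(autocorr(frame, 10), 10)
res = lpc_residual(frame, a)
print(err, sum(x * x for x in res))
```

The autocorrelation method guarantees a stable synthesis filter (all PARCOR magnitudes below one), which is why the patent can later reuse the same machinery for the second analysis stage.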

In the phase equalization process, a flat LPC residual spectrum is the condition for leaving the spectrum of the speech signal unchanged. In this embodiment, therefore, the LPC analysis order is set high (for example, 50 for male voices and 40 for female voices) in order to flatten the spectrum of the LPC residual signal. A lag window (100 Hz) is also used to avoid the influence of the fundamental frequency.

Furthermore, because power-spectrum analysis with a window function depends on the analysis time, a TANDEM window as described in Reference B below is used to smooth the vocal tract spectrum in the time direction.
Reference B: Masanori Morise, Toru Takahashi, Hideki Kawahara, and Toshio Irino, "Speech analysis using a power spectrum estimation method for periodic signals independent of the analysis time," IEICE Transactions, Vol. J92-A, No. 3, pp. 163-171, 2009.

The TANDEM window estimates a power spectrum that does not depend on the analysis time by averaging the power spectrum of the current analysis frame with that of a frame shifted by half the fundamental period. By the Wiener-Khinchin theorem, the inverse Fourier transform of the power spectrum is the autocorrelation function; therefore, when the TANDEM window is used with autocorrelation-method LPC analysis, the autocorrelation functions of the current frame and the half-period-shifted frame are averaged, and solving the resulting autocorrelation equations with the Durbin algorithm yields the LPC coefficients.
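The autocorrelation-averaging step can be sketched as follows; the averaged values would then be fed to the Durbin recursion exactly as in plain autocorrelation LPC (a sketch of the idea, not the exact implementation of Reference B):

```python
import math

def autocorr(frame, max_lag):
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(max_lag + 1)]

def tandem_autocorr(signal, start, frame_len, half_period, max_lag):
    """Average the autocorrelation functions of the current analysis frame
    and a frame shifted by half the fundamental period (TANDEM window idea),
    giving an estimate that depends less on the analysis time."""
    f0 = signal[start:start + frame_len]
    f1 = signal[start + half_period:start + half_period + frame_len]
    r0 = autocorr(f0, max_lag)
    r1 = autocorr(f1, max_lag)
    return [(x + y) / 2.0 for x, y in zip(r0, r1)]

# Example: a sinusoid with a 20-sample fundamental period (half period = 10)
sig = [math.sin(2 * math.pi * n / 20.0) for n in range(200)]
r = tandem_autocorr(sig, 30, 60, 10, 12)
print(r[0], r[10])
```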

Reference B shows that, when the fundamental period within the analysis frame is constant at T_0, the shift required for the TANDEM window computation may simply be T_0/2. In an actual speech signal, however, the fundamental period within the analysis frame is not constant (T_0, T_1, ...), so in this embodiment the shift is taken as a weighted average in the frequency domain that accounts for the fluctuation of the fundamental period,

    (Equation 1)

where w is a Gaussian weight.

<Pitch mark analysis unit>
To obtain the pitch marks (the set of pitch mark times) used for phase equalization, the pitch mark analysis unit 14 generates, within each speech section, a pulse-train signal (pitch waveform) corresponding to the fundamental period obtained from the fundamental frequency (step S5). With frame number t and time k, the cross-correlation function between the absolute value of the pitch waveform ex(t, k) and the absolute value of the LPC residual signal e(t, k) is computed for each frame t within the speech section:

    r(t, j) = Σ_k |e(t, k)| × |ex(t, k + j)|

The sequence of j that maximizes Σ_t r(t, j) is found by dynamic programming, giving candidate pitch mark times. Then, in the neighborhood of each candidate pitch mark time, the time at which the absolute value of the LPC residual signal is maximal is searched for.

Furthermore, from the obtained pitch mark times, the pitch mark time at which the absolute value of the residual is largest is selected as a starting point, and the times at which the autocorrelation function of the LPC residual signal between adjacent pitch mark times (the modified autocorrelation function) is maximal are searched for sequentially; these are extracted as the final set of pitch mark times (step S6). The differences between adjacent pitch mark times thus obtained may also be used as a new fundamental period.
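A simplified sketch of the pitch-mark search: candidate marks are placed at the fundamental period and each is then snapped to the local maximum of the residual's absolute value. The dynamic-programming alignment over the cross-correlation r(t, j) and the modified-autocorrelation refinement are omitted here, so this is only the skeleton of steps S5-S6.

```python
def pick_pitch_marks(residual, period, search=5):
    """Place candidate marks every `period` samples, then move each to the
    time of maximal |residual| within +/-`search` samples (sketch of
    steps S5-S6; the DP over r(t, j) is omitted)."""
    marks = []
    t = period // 2
    while t < len(residual):
        lo = max(0, t - search)
        hi = min(len(residual), t + search + 1)
        best = max(range(lo, hi), key=lambda n: abs(residual[n]))
        marks.append(best)
        t += period
    return marks

# Example: a residual that is a pulse train with period 50,
# pulses slightly off the nominal candidate positions
res = [0.0] * 200
for n in (27, 77, 127, 177):
    res[n] = 1.0
print(pick_pitch_marks(res, 50))  # -> [27, 77, 127, 177]
```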

<Phase-equalized speech generation unit>
To obtain a phase-equalized speech signal, the phase-equalized speech generation unit 15 uses the pitch marks (pitch mark times) and the LPC residual signal: the values of the LPC residual signal are time-reversed about each pitch mark time and normalized, and a phase equalization filter with these values as its coefficients is constructed. Applying this filter to the speech signal (hereinafter, the original speech) yields the phase-equalized speech signal (step S7).

Here, the coefficients of the phase equalization filter obtained at each pitch mark time are smoothed in time, for example with a first-order low-pass filter, before the filter is applied to the original speech. The number of taps of the phase equalization filter equals the length of the fundamental period. The phase equalization process changes the spectrum of the original speech only slightly, and since human hearing is relatively insensitive to the short-time phase characteristics of a speech signal, the perceptual difference between the original and the phase-equalized speech is also slight. This processing follows, for example, the method described in Japanese Patent No. 2061816 (hereinafter, Patent Document 1): by concentrating the energy of the LPC residual signal in time, it allows the phase contained in the original speech to be approximated as zero.
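The filter construction around one pitch mark can be sketched as follows: the residual segment is time-reversed about the mark and normalized to unit energy, giving FIR coefficients that concentrate the residual energy at the mark. The per-mark coefficient smoothing and the full-fundamental-period tap count are omitted, so the segment length here is an assumption for illustration.

```python
import math

def phase_eq_filter(residual, mark, half_len):
    """FIR coefficients for phase equalization: the LPC residual around the
    pitch mark, time-reversed about the mark and normalized to unit energy
    (sketch of step S7)."""
    seg = residual[mark - half_len: mark + half_len + 1]
    seg = seg[::-1]  # time reversal about the pitch mark
    norm = math.sqrt(sum(x * x for x in seg))
    return [x / norm for x in seg]

def fir_filter(signal, coeffs, delay):
    """Apply the FIR filter, compensating its `delay`-sample group delay."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for i, c in enumerate(coeffs):
            m = n + delay - i
            if 0 <= m < len(signal):
                acc += c * signal[m]
        out.append(acc)
    return out

# Example: an asymmetric residual pulse with a pitch mark at n = 10;
# filtering the residual with its own equalizer concentrates energy there.
res = [0.0] * 21
res[9], res[10], res[11] = 0.3, 1.0, -0.6
coeffs = phase_eq_filter(res, 10, 3)
out = fir_filter(res, coeffs, 3)
print(max(range(len(out)), key=lambda n: out[n]))  # -> 10
```

Because the coefficients are the time-reversed residual, the filtering acts like a matched filter: its output peaks exactly at the pitch mark, which is what drives the residual toward a single pulse train.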

<Second LPC analysis unit>
The second LPC analysis unit 16 performs LPC analysis on the phase-equalized speech signal obtained by the phase-equalized speech generation unit 15.
Here, the phase-equalized speech signal can be assumed to have been generated by passing a source signal consisting of a single pulse train and white noise through the vocal tract filter. That is, assuming a pulse of amplitude G at each of the I + 1 pitch mark times (t_0, t_1, ..., t_I) contained in the analysis frame and, as in conventional LPC analysis, white noise as the source at all other times t, the analysis reduces to the problem of finding, from the phase-equalized speech signal, the LPC coefficients a and the pulse amplitude G that minimize

    (Equation 2)

where r is a weight, e is the LPC residual signal, s is the phase-equalized speech signal, and p is the LPC analysis order.

The phase-equalized speech signal is windowed with a Blackman window whose length is 2.5 times the fundamental period, at an analysis shift length of 4 ms. With r = 0, the formulation is equivalent to the sample-selective linear prediction method with s(t_0), ..., s(t_I) of the phase-equalized speech signal removed; with G = 0 and r = 1, it is equivalent to conventional LPC analysis assuming white noise.

In this embodiment, setting G ≠ 0 and r = 1 avoids the shortage of data caused by removing phase-equalized speech samples, which was the drawback of the sample-selective linear prediction method, and in addition allows the LPC analysis and the source-signal parameter G to be optimized within the same framework. The LPC coefficients that minimize the above criterion are obtained by solving the following simultaneous equations (with r = 1):

    (Equation 3)

These simultaneous equations can be solved efficiently with the Levinson algorithm. Here, the autocorrelation function R of the phase-equalized speech signal is

    (Equation 4)

and the pulse amplitude G is obtained from the PARCOR coefficients k and the autocorrelation function R produced while solving with the Levinson algorithm as

    (Equation 5)

Since equation (3) is an extension of the autocorrelation method, the TANDEM window can be applied here as well to smooth the vocal tract spectrum in the time direction. In that case, the value of the cross-correlation function involving G on the right-hand side is taken as the average of the cross-correlation functions of the current analysis frame and the analysis frame shifted by the amount given by equation (1).

The initial value of G is obtained by evaluating equation (5) with the PARCOR coefficients k and the autocorrelation function R obtained while solving the Durbin algorithm for the phase-equalized speech signal; the determination of the LPC coefficients (step S9) and the determination of G (step S8) are then repeated (for example, five times). This iteration is expected to improve the estimation accuracy of the resulting parameters. In this embodiment, the phase-equalized speech signal can be assumed to be generated from a comparatively simple source, so the problem to be solved is simple; as a result, the iteration is fast and a stable solution is easy to obtain. The analysis order of the LPC analysis here may differ from the order used for the phase equalization process. Also, since this analysis reduces the influence of the fundamental frequency, no lag window is needed.

Outside the speech sections, the LPC coefficients and the white-noise gain are obtained with a fixed window length of 15 ms and a fixed frame shift of 4 ms. This is equivalent to setting I = 0 in the LPC analysis method of this embodiment.

<Multipulse source model generation unit>
Within the speech sections, the multipulse source model generation unit 17 finds the pulse amplitude (pulse gain) and the parameters of the multipulse source model (the FIR filter coefficients v_k) that minimize the perceptually weighted error with respect to the phase-equalized speech signal (step S10). The transfer characteristic of the FIR filter (6 taps) is expressed, as in Patent Document 1, as

    (equation not reproduced)

The phase-equalized pulse source is computed pitch-synchronously, taking each pitch mark time as the analysis start point and one fundamental period as the analysis window length. In this embodiment, pitch-synchronous analysis is used for analysis, but a 4 ms frame shift is used for synthesis, so the pitch mark times and the start times of the fixed-length frames do not coincide. The parameters for each frame are therefore obtained by linear interpolation. Because this embodiment yields an LPC spectrum that is smooth in both time and frequency, the source-model parameters obtained by this computation are also expected to vary smoothly in time.
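The mapping from pitch-synchronous parameters to the fixed 4 ms synthesis frames can be sketched as plain linear interpolation; the function and variable names below are illustrative, not from the patent.

```python
def interpolate_params(mark_times, mark_values, frame_times):
    """Linearly interpolate scalar parameters measured at pitch mark times
    onto fixed-length frame start times (sketch; values outside the pitch
    mark range are held constant)."""
    out = []
    for t in frame_times:
        if t <= mark_times[0]:
            out.append(mark_values[0])
        elif t >= mark_times[-1]:
            out.append(mark_values[-1])
        else:
            # find the pair of pitch marks surrounding this frame time
            for i in range(len(mark_times) - 1):
                t0, t1 = mark_times[i], mark_times[i + 1]
                if t0 <= t <= t1:
                    w = (t - t0) / (t1 - t0)
                    out.append((1 - w) * mark_values[i] + w * mark_values[i + 1])
                    break
    return out

# Example: a parameter measured at pitch marks 0 and 100, resampled
# onto frame start times every 25 samples
print(interpolate_params([0, 100], [1.0, 3.0], [0, 25, 50, 75, 100]))
# -> [1.0, 1.5, 2.0, 2.5, 3.0]
```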

<White-noise gain generation unit>
The white-noise gain generation unit 18 computes the white-noise gain using the PARCOR coefficients k and the autocorrelation function R obtained when the second LPC analysis unit 16 solves the Levinson algorithm (step S11).
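In standard autocorrelation LPC, the prediction-error (noise) energy relates to the zero-lag autocorrelation and the PARCOR coefficients as E_p = R(0) * (1 - k_1^2) * ... * (1 - k_p^2). The patent does not give its exact gain formula, so the following is a sketch built on that standard relation; the per-sample normalization is an assumption.

```python
import math

def white_noise_gain(r0, parcor, frame_len):
    """Per-sample white-noise gain from the zero-lag autocorrelation r0 and
    the PARCOR coefficients (standard LPC residual-energy relation;
    the exact normalization used in the patent is not specified)."""
    err = r0
    for k in parcor:
        err *= (1.0 - k * k)  # prediction-error energy after each stage
    return math.sqrt(err / frame_len)

print(white_noise_gain(160.0, [0.5, -0.25], 160))
```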

<Speech synthesis unit>
The speech synthesis unit 19 synthesizes speech by convolving the LPC coefficients with the source signal (step S13). Outside the speech sections, the source signal is white noise multiplied by the white-noise gain. Within the speech sections, it is either a multipulse sequence computed from the fundamental frequency, the pulse gain, and the multipulse source model, or a single pulse train computed from the fundamental frequency and the pulse gain (step S12).
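Synthesis (step S13) amounts to driving the all-pole vocal tract filter 1/A(z) with the excitation; a minimal sketch, using the same coefficient convention a[0] = 1 as in the analysis:

```python
def lpc_synthesize(excitation, a):
    """Drive the all-pole vocal tract filter 1/A(z) with the excitation:
    s[n] = e[n] - sum(a[i] * s[n-i], i=1..p), with a[0] = 1 (sketch)."""
    p = len(a) - 1
    s = []
    for n, e in enumerate(excitation):
        acc = e
        for i in range(1, min(p, n) + 1):
            acc -= a[i] * s[n - i]
        s.append(acc)
    return s

# Example: a single pulse through a one-pole filter gives a decaying exponential
out = lpc_synthesize([1.0] + [0.0] * 4, [1.0, -0.5])
print(out)  # -> [1.0, 0.5, 0.25, 0.125, 0.0625]
```

This recursion is exactly the inverse of the LPC inverse filter used in the analysis stage, so passing a signal through A(z) and then 1/A(z) reproduces it.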

[Modification]
The original speech or the phase-equalized speech signal may be passed through band-pass filters of 0-500, 500-1000, 1000-2000, 2000-3000, 3000-4000, 4000-5000, 5000-6000, 6000-7000, and 7000-8000 Hz, and the autocorrelation function of each filtered signal computed and taken as a voiced intensity. As above, this computation is performed pitch-synchronously, taking each pitch mark time as the analysis start point. At synthesis time, based on the voiced intensity, bands above a threshold are treated as voiced and bands below it as unvoiced; a driving source is created with a multipulse sequence or single pulse train in the voiced bands and white noise mixed in in the unvoiced bands, and this is convolved with the LPC coefficients.

Speech may also be synthesized after transforming the parameters obtained by the analysis. For example, the fundamental frequency can be halved or doubled, the formant frequencies obtained by LPC analysis can be shifted arbitrarily, or the time axis can be doubled or halved.

[Experimental example]
FIG. 5 shows the LPC spectrum of /i/ in "rise" uttered by a female native speaker of English. In this experiment, the LPC analysis order for the phase equalization process was 40, and the analysis order for obtaining the LPC spectrum was 20. The speech sampling rate is 16 kHz. The thin line in FIG. 5 was obtained by conventional LPC analysis with a lag window, and the thick line by the LPC analysis method according to this invention (the proposed method).

While the conventional method picks up the fundamental frequency as part of the vocal tract spectrum, the proposed method extracts the first formant frequency (F1) accurately without picking up the fundamental frequency. With the improvement of F1, the amplitude of F2 can also be seen to recover.

FIG. 6 shows LPC spectra of the utterance "udemae" (skill) by a male speaker: (A) without the TANDEM window (conventional method) and (B) with the TANDEM window (proposed method). The LPC analysis order for the phase equalization process and for the LPC spectrum was 50. (A) shows a temporally discontinuous spectrum (vertical striping is visible), whereas (B) shows that a temporally continuous spectrum is obtained.

FIG. 7 shows example waveforms of part of the utterance material 「腕前」 ("udemae"). FIG. 7(A) is the speech signal sampled at 16 kHz; FIG. 7(B) is the LPC residual signal obtained by performing LPC analysis and passing the speech signal through the LPC inverse filter; FIG. 7(C) is the phase-equalized residual signal obtained by applying the phase equalization filter, derived from the LPC residual signal, to the LPC residual signal; FIG. 7(D) is the phase-equalized speech signal obtained by applying the phase equalization filter to the speech signal; and FIG. 7(E) is the synthesized speech signal.

In FIG. 7(B), the influence of the phase characteristics is still visible in the LPC residual signal, but the phase-equalized signal of FIG. 7(C) can be regarded as a mixture of a single pulse train and white noise. Furthermore, although the waveform of FIG. 7(D) differs from that of FIG. 7(A), it has been confirmed that almost no perceptual difference can be heard. FIG. 7(E) is the synthesized speech; its waveform is almost identical to that of FIG. 7(D), demonstrating the effectiveness of the proposed method. It was also confirmed that there is almost no perceptual difference between the speech synthesized by the proposed method and the original speech.
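The residual of FIG. 7(B) is produced by inverse filtering, i.e. running the speech signal through A(z) = 1 + a1·z^-1 + ... + ap·z^-p so that the output approximates the excitation (source) signal. A minimal sketch, with an illustrative function name:

```python
import numpy as np

def lpc_residual(x, a):
    """Pass signal x through the LPC inverse filter A(z).

    a: LPC coefficients [1, a1, ..., ap].
    Returns e with e[n] = x[n] + a1*x[n-1] + ... + ap*x[n-p].
    """
    p = len(a) - 1
    e = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        # Truncate at the signal boundary for the first p samples.
        e[n] = sum(a[i] * x[n - i] for i in range(min(n, p) + 1))
    return e
```

Driving the all-pole synthesis filter 1/A(z) with this residual reproduces the original signal, which is the sense in which FIG. 7(E) can match FIG. 7(D).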

The speech analysis/synthesis apparatus and speech analysis/synthesis method described above can be realized by a computer and a program installed in the computer. By executing the installed program, the computer functions as the speech analysis/synthesis apparatus.

Note that the second LPC analysis unit in the speech analysis/synthesis apparatus 10 shown in FIG. 1 may be handled as a stand-alone LPC analysis apparatus.

Claims (7)

1. An LPC analysis apparatus which receives a phase-equalized speech signal and a pitch mark time group as inputs, wherein
a sound source signal has a single pulse of amplitude G at each pitch mark time of the pitch mark time group and consists of white noise at times other than the pitch mark times, and
the apparatus is configured to obtain LPC coefficients and the amplitude G, using PARCOR coefficients k and an autocorrelation function R, so that an error between a speech signal obtained from the LPC coefficients and the sound source signal and the phase-equalized speech signal is minimized.
2. An LPC analysis method which receives a phase-equalized speech signal and a pitch mark time group as inputs, wherein
a sound source signal has a single pulse of amplitude G at each pitch mark time of the pitch mark time group and consists of white noise at times other than the pitch mark times, and
the method obtains LPC coefficients and the amplitude G, using PARCOR coefficients k and an autocorrelation function R, so that an error between a speech signal obtained from the LPC coefficients and the sound source signal and the phase-equalized speech signal is minimized.
3. A speech analysis/synthesis apparatus comprising:
a speech section detection unit that detects speech sections of an input speech signal;
a fundamental frequency analysis unit that estimates a fundamental frequency from the speech signal for the speech sections;
a first LPC analysis unit that cuts out the speech signal with a window length determined based on the fundamental frequency, performs LPC analysis, and obtains an LPC residual signal by passing the speech signal through an LPC inverse filter;
a pitch mark analysis unit that generates a pitch waveform corresponding to a fundamental period obtained from the fundamental frequency and extracts a pitch mark time group using the pitch waveform and the LPC residual signal;
a phase-equalized speech generation unit that generates a phase-equalized speech signal by applying, to the speech signal, a phase equalization filter obtained using the pitch mark time group and the LPC residual signal;
a second LPC analysis unit that, taking a sound source signal to have a single pulse of amplitude G at each pitch mark time of the pitch mark time group and to consist of white noise at times other than the pitch mark times, obtains LPC coefficients and the amplitude G, using PARCOR coefficients k and an autocorrelation function R, so that an error between a speech signal obtained from the LPC coefficients and the sound source signal and the phase-equalized speech signal is minimized;
a multi-pulse sound source model generation unit that obtains a pulse gain and a multi-pulse sound source model using the pitch mark time group, the phase-equalized speech signal, and the LPC coefficients;
a white noise gain generation unit that calculates a white noise gain using the PARCOR coefficients k and the autocorrelation function R obtained in the LPC analysis by the second LPC analysis unit; and
a speech synthesis unit that synthesizes a speech signal by convolving the LPC coefficients with a sound source signal which, outside the speech sections, is white noise multiplied by the white noise gain and which, within the speech sections, is a multi-pulse train calculated from the fundamental frequency, the pulse gain, and the multi-pulse sound source model, or a single pulse train calculated from the fundamental frequency and the pulse gain.
4. The speech analysis/synthesis apparatus according to claim 3, wherein the second LPC analysis unit calculates an initial value of the amplitude G using the PARCOR coefficients k and the autocorrelation function R obtained in the LPC analysis by the first LPC analysis unit.
5. A speech analysis/synthesis method comprising:
a speech section detection process of detecting speech sections of an input speech signal;
a fundamental frequency analysis process of estimating a fundamental frequency from the speech signal for the speech sections;
a first LPC analysis process of cutting out the speech signal with a window length determined based on the fundamental frequency, performing LPC analysis, and obtaining an LPC residual signal by passing the speech signal through an LPC inverse filter;
a pitch mark analysis process of generating a pitch waveform corresponding to a fundamental period obtained from the fundamental frequency and extracting a pitch mark time group using the pitch waveform and the LPC residual signal;
a phase-equalized speech generation process of generating a phase-equalized speech signal by applying, to the speech signal, a phase equalization filter obtained using the pitch mark time group and the LPC residual signal;
a second LPC analysis process of, taking a sound source signal to have a single pulse of amplitude G at each pitch mark time of the pitch mark time group and to consist of white noise at times other than the pitch mark times, obtaining LPC coefficients and the amplitude G, using PARCOR coefficients k and an autocorrelation function R, so that an error between a speech signal obtained from the LPC coefficients and the sound source signal and the phase-equalized speech signal is minimized;
a multi-pulse sound source model generation process of obtaining a pulse gain and a multi-pulse sound source model using the pitch mark time group, the phase-equalized speech signal, and the LPC coefficients;
a white noise gain generation process of calculating a white noise gain using the PARCOR coefficients k and the autocorrelation function R obtained in the LPC analysis in the second LPC analysis process; and
a speech synthesis process of synthesizing a speech signal by convolving the LPC coefficients with a sound source signal which, outside the speech sections, is white noise multiplied by the white noise gain and which, within the speech sections, is a multi-pulse train calculated from the fundamental frequency, the pulse gain, and the multi-pulse sound source model, or a single pulse train calculated from the fundamental frequency and the pulse gain.
6. The speech analysis/synthesis method according to claim 5, wherein the second LPC analysis process calculates an initial value of the amplitude G using the PARCOR coefficients k and the autocorrelation function R obtained in the LPC analysis in the first LPC analysis process.
7. A program for causing a computer to function as the speech analysis/synthesis apparatus according to claim 3 or 4.
JP2010012963A 2010-01-25 2010-01-25 LPC analysis device, LPC analysis method, speech analysis / synthesis device, speech analysis / synthesis method, and program Active JP5325130B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2010012963A JP5325130B2 (en) 2010-01-25 2010-01-25 LPC analysis device, LPC analysis method, speech analysis / synthesis device, speech analysis / synthesis method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2010012963A JP5325130B2 (en) 2010-01-25 2010-01-25 LPC analysis device, LPC analysis method, speech analysis / synthesis device, speech analysis / synthesis method, and program

Publications (2)

Publication Number Publication Date
JP2011150232A JP2011150232A (en) 2011-08-04
JP5325130B2 true JP5325130B2 (en) 2013-10-23

Family

ID=44537261

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2010012963A Active JP5325130B2 (en) 2010-01-25 2010-01-25 LPC analysis device, LPC analysis method, speech analysis / synthesis device, speech analysis / synthesis method, and program

Country Status (1)

Country Link
JP (1) JP5325130B2 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5705086B2 (en) * 2011-10-14 2015-04-22 日本電信電話株式会社 Vocal tract spectrum extraction device, vocal tract spectrum extraction method and program
JP5631915B2 (en) * 2012-03-29 2014-11-26 株式会社東芝 Speech synthesis apparatus, speech synthesis method, speech synthesis program, and learning apparatus
JP6213217B2 (en) * 2013-12-19 2017-10-18 富士通株式会社 Speech synthesis apparatus and computer program for speech synthesis
JP6285823B2 (en) * 2014-08-12 2018-02-28 日本電信電話株式会社 LPC analysis apparatus, speech analysis conversion synthesis apparatus, method and program thereof

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS61107400A (en) * 1984-10-31 1986-05-26 日本電気株式会社 Voice synthesizer
JPS61256400A (en) * 1985-05-10 1986-11-13 株式会社日立製作所 Voice analysis/synthesization system
JPH0782360B2 (en) * 1989-10-02 1995-09-06 日本電信電話株式会社 Speech analysis and synthesis method
JPH05265494A (en) * 1992-03-23 1993-10-15 Idou Tsushin Syst Kaihatsu Kk Speech encoding and decoding device
JP3292711B2 (en) * 1999-08-06 2002-06-17 株式会社ワイ・アール・ピー高機能移動体通信研究所 Voice encoding / decoding method and apparatus
JP4999757B2 (en) * 2008-03-31 2012-08-15 日本電信電話株式会社 Speech analysis / synthesis apparatus, speech analysis / synthesis method, computer program, and recording medium

Also Published As

Publication number Publication date
JP2011150232A (en) 2011-08-04

Similar Documents

Publication Publication Date Title
Bayya et al. Spectro-temporal analysis of speech signals using zero-time windowing and group delay function
Sukhostat et al. A comparative analysis of pitch detection methods under the influence of different noise conditions
Mittal et al. Study of characteristics of aperiodicity in Noh voices
Manfredi et al. Perturbation measurements in highly irregular voice signals: Performances/validity of analysis software tools
Athineos et al. LP-TRAP: Linear predictive temporal patterns
Morise Error evaluation of an F0-adaptive spectral envelope estimator in robustness against the additive noise and F0 error
Roy et al. Precise detection of speech endpoints dynamically: A wavelet convolution based approach
JP5325130B2 (en) LPC analysis device, LPC analysis method, speech analysis / synthesis device, speech analysis / synthesis method, and program
Hanilçi et al. Comparing spectrum estimators in speaker verification under additive noise degradation
Kumar et al. Performance evaluation of a ACF-AMDF based pitch detection scheme in real-time
Mitev et al. Fundamental frequency estimation of voice of patients with laryngeal disorders
Mittal et al. Significance of aperiodicity in the pitch perception of expressive voices
Bouzid et al. Voice source parameter measurement based on multi-scale analysis of electroglottographic signal
Zhao et al. A processing method for pitch smoothing based on autocorrelation and cepstral F0 detection approaches
Ganapathy et al. Robust spectro-temporal features based on autoregressive models of hilbert envelopes
Upadhya Pitch detection in time and frequency domain
Khonglah et al. Speech enhancement using source information for phoneme recognition of speech with background music
Liu et al. Speech enhancement of instantaneous amplitude and phase for applications in noisy reverberant environments
Kaewtip et al. A pitch-based spectral enhancement technique for robust speech processing.
JP5705086B2 (en) Vocal tract spectrum extraction device, vocal tract spectrum extraction method and program
Park et al. Pitch detection based on signal-to-noise-ratio estimation and compensation for continuous speech signal
Park et al. Improving pitch detection through emphasized harmonics in time-domain
Shome et al. Non-negative frequency-weighted energy-based speech quality estimation for different modes and quality of speech
Upadhya et al. Pitch estimation using autocorrelation method and AMDF
Ramesh et al. Glottal opening instants detection using zero frequency resonator

Legal Events

Date Code Title Description

RD03 Notification of appointment of power of attorney
Free format text: JAPANESE INTERMEDIATE CODE: A7423
Effective date: 20110624

A621 Written request for application examination
Free format text: JAPANESE INTERMEDIATE CODE: A621
Effective date: 20120116

A977 Report on retrieval
Free format text: JAPANESE INTERMEDIATE CODE: A971007
Effective date: 20121127

A131 Notification of reasons for refusal
Free format text: JAPANESE INTERMEDIATE CODE: A131
Effective date: 20121211

A521 Request for written amendment filed
Free format text: JAPANESE INTERMEDIATE CODE: A523
Effective date: 20130116

TRDD Decision of grant or rejection written

A01 Written decision to grant a patent or to grant a registration (utility model)
Free format text: JAPANESE INTERMEDIATE CODE: A01
Effective date: 20130709

A61 First payment of annual fees (during grant procedure)
Free format text: JAPANESE INTERMEDIATE CODE: A61
Effective date: 20130719

R150 Certificate of patent or registration of utility model
Ref document number: 5325130
Country of ref document: JP
Free format text: JAPANESE INTERMEDIATE CODE: R150

S531 Written request for registration of change of domicile
Free format text: JAPANESE INTERMEDIATE CODE: R313531

R350 Written notification of registration of transfer
Free format text: JAPANESE INTERMEDIATE CODE: R350