JP6285823B2

JP6285823B2 - LPC analysis apparatus, speech analysis conversion synthesis apparatus, method and program thereof

Info

Publication number: JP6285823B2
Application number: JP2014164234A
Authority: JP
Inventors: 定男廣谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-08-12
Filing date: 2014-08-12
Publication date: 2018-02-28
Anticipated expiration: 2034-08-12
Also published as: JP2016040571A

Description

The present invention relates to a technique for separating a speech signal into a sound source signal and a vocal tract spectrum, converting the vocal tract spectrum, and synthesizing speech based on the converted vocal tract spectrum. It also relates to a technique for obtaining the LPC coefficients used in that process.

Improving the performance of speech synthesis, speech coding, and speech recognition has long been considered to depend on decomposing a speech signal efficiently and accurately into a sound source signal and a vocal tract spectrum, based on the mechanism of human speech production. Linear prediction (linear predictive coding, hereinafter also referred to as "LPC") analysis is widely used for this decomposition. However, because it assumes white noise as the sound source signal, the resulting vocal tract spectrum is affected to no small degree by the fundamental frequency (hereinafter also referred to as "F0"). In particular, speech with a high F0 (for example, the speech of women or children) does not satisfy this assumption, so the vocal tract spectrum estimated by LPC analysis contains the fundamental frequency of the sound source signal and its harmonics, and an accurate vocal tract spectrum is difficult to obtain. This is known to cause problems in speech conversion processing such as converting the vocal tract spectrum.

To avoid the fundamental-frequency problem of the sound source signal in LPC analysis, the DAP method (discrete all-pole modeling), an LPC analysis that assumes a voiced sound source signal, has been proposed (see, for example, Non-Patent Document 1). However, the DAP method needs about 10 iterations for the solution to converge, and therefore requires at least five times the computation of ordinary LPC analysis.

To solve the computational cost problem of the DAP method, a linear prediction method based on phase equalization has been proposed (see, for example, Patent Document 1). It applies phase equalization to the speech signal and then performs LPC analysis on the phase-equalized speech signal under the assumption of a pulse train, which makes it possible to extract a vocal tract spectrum that is robust to F0. This method requires less computation than the conventional F0-robust vocal tract spectrum analysis method (for example, Non-Patent Document 1).

Patent Document 1: JP 2011-150232 A

Non-Patent Document 1: A. El-Jaroudi, J. Makhoul, "Discrete all-pole modeling," IEEE Trans. Signal Processing, 1991, pp. 411-423.

However, vocal tract spectrum analysis by the method of Patent Document 1 requires generating a phase-equalized speech signal and computing its autocorrelation function, so generating the phase-equalized speech signal increases the amount of computation. A further reduction in computation is needed in cases such as real-time analysis of the vocal tract spectrum. In addition, using the autocorrelation function of the phase-equalized speech signal may introduce analysis errors, which can lower the estimation accuracy of the LPC coefficients and the vocal tract spectrum.

An object of the present invention is to provide an LPC analysis apparatus that obtains LPC coefficients with less computation, at higher speed, and with higher accuracy than before, a speech analysis conversion synthesis apparatus that uses those LPC coefficients, and methods for both.

To solve the above problems, according to one aspect of the present invention, an LPC analysis apparatus receives a speech signal, a pitch mark time group, and an LPC residual signal as inputs. It assumes that the sound source signal has a single pulse of amplitude G at each pitch mark time of the pitch mark time group and consists of white noise at all other times. Using the speech signal, the autocorrelation function of the speech signal, the LPC residual signal, and the pitch mark time group, it obtains second LPC coefficients such that the error between the speech signal obtained from the second LPC coefficients and the sound source signal, and the phase-equalized speech signal corresponding to the speech signal, is minimized.

To solve the above problems, according to another aspect of the present invention, a speech analysis conversion synthesis apparatus includes: a speech section detection unit that detects the speech sections of an input speech signal; a first LPC analysis unit that obtains an LPC residual signal using the speech signal and the LPC coefficients obtained from the speech signal by LPC analysis; a pitch mark analysis unit that extracts a pitch mark time group; a second LPC analysis unit that assumes that the sound source signal has a single pulse of amplitude G at each pitch mark time of the pitch mark time group and consists of white noise at all other times, assumes that the LPC residual signal is uncorrelated, and obtains second LPC coefficients, using the speech signal, the autocorrelation function of the speech signal, the LPC residual signal, and the pitch mark time group, such that the error between the speech signal obtained from the second LPC coefficients and the sound source signal, and the phase-equalized speech signal corresponding to the speech signal, is minimized; and a speech conversion unit that finds the roots of the prediction polynomial given by the second LPC coefficients, selects formants using those roots, generates a conversion filter from the vocal tract spectrum corresponding to the selected formants and the vocal tract spectrum corresponding to the formants obtained by converting the selected formants by a predetermined method, and converts the speech signal with the conversion filter.

To solve the above problems, according to another aspect of the present invention, an LPC analysis method uses a speech signal, a pitch mark time group, and an LPC residual signal. It assumes that the sound source signal has a single pulse of amplitude G at each pitch mark time of the pitch mark time group and consists of white noise at all other times. Using the speech signal, the autocorrelation function of the speech signal, the LPC residual signal, and the pitch mark time group, it obtains second LPC coefficients such that the error between the speech signal obtained from the second LPC coefficients and the sound source signal, and the phase-equalized speech signal corresponding to the speech signal, is minimized.

To solve the above problems, according to another aspect of the present invention, a speech analysis conversion synthesis method includes: a speech section detection step of detecting the speech sections of an input speech signal; a first LPC analysis step of obtaining an LPC residual signal using the speech signal and the LPC coefficients obtained from the speech signal by LPC analysis; a pitch mark analysis step of extracting a pitch mark time group; a second LPC analysis step of assuming that the sound source signal has a single pulse of amplitude G at each pitch mark time of the pitch mark time group and consists of white noise at all other times, assuming that the LPC residual signal is uncorrelated, and obtaining second LPC coefficients, using the speech signal, the autocorrelation function of the speech signal, the LPC residual signal, and the pitch mark time group, such that the error between the speech signal obtained from the second LPC coefficients and the sound source signal, and the phase-equalized speech signal corresponding to the speech signal, is minimized; and a speech conversion step of finding the roots of the prediction polynomial given by the second LPC coefficients, selecting formants using those roots, generating a conversion filter from the vocal tract spectrum corresponding to the selected formants and the vocal tract spectrum corresponding to the formants obtained by converting the selected formants by a predetermined method, and converting the speech signal with the conversion filter.

According to the present invention, LPC coefficients can be obtained with less computation, at higher speed, and with higher accuracy than before.

FIG. 1 is a functional block diagram of the speech analysis conversion synthesis apparatus according to the first embodiment.
FIG. 2 is a diagram showing an example of the processing flow of the speech analysis conversion synthesis apparatus according to the first embodiment.
FIG. 3 is a diagram showing an example of formant frequency conversion.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and redundant description is omitted. In the following description, symbols such as "^" used in the text should properly be written directly above the preceding character, but because of the limitations of text notation they are written immediately after that character. In the equations, these symbols are written in their proper positions. Unless otherwise noted, processing performed on an element of a vector or matrix is applied to all elements of that vector or matrix.

<First embodiment>
In this embodiment, given the speech signal to be subjected to LPC analysis or vocal tract spectrum conversion (hereinafter also referred to as the "original speech signal"), the LPC residual signal, the autocorrelation function of the original speech signal, and the pitch mark times, the LPC coefficients are obtained by solving simultaneous equations. The LPC coefficients are then used to select formants, the vocal tract spectrum corresponding to the selected formants is converted, the original speech signal is converted using the vocal tract spectra before and after conversion, and the converted speech signal is output.

First, the theory behind this embodiment is described.
[LPC analysis based on phase equalization]
Let t be the index representing time, s(t) the input original speech signal, and sp(t) the original speech signal after pre-emphasis, which is applied to remove the tilt of the sound source spectrum. The pre-emphasized original speech signal sp(t) is given by the following equation.
sp(t) = s(t) - αs(t-1)   (1)
For example, α = 0.98 is used. Pre-emphasis is a modulation technique that amplifies the high-frequency side of a transmitted signal at the transmitting side, according to the high-frequency attenuation characteristic inherent to the transmission path, so as to improve the frequency characteristic of the signal received at the receiving side.
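Equation (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of the embodiment: the function name `pre_emphasize` is hypothetical, and treating the sample before the start of the signal as zero is an assumption that the text does not specify.

```python
import numpy as np

def pre_emphasize(s, alpha=0.98):
    """Apply sp(t) = s(t) - alpha * s(t-1) from equation (1).

    Assumption: s(-1) is treated as 0, so sp(0) = s(0).
    """
    s = np.asarray(s, dtype=float)
    sp = s.copy()
    sp[1:] -= alpha * s[:-1]  # subtract the scaled previous sample
    return sp
```

With a constant input, every output sample after the first equals (1 - α) times the input level, which shows how the filter suppresses the low-frequency component.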

Let P be the total LPC analysis order, p (p = 1, 2, ..., P) the index representing the LPC analysis order, a = {a(1), a(2), ..., a(p), ..., a(P)} the LPC coefficients obtained from the original speech signal s(t) by ordinary LPC analysis, and e(t) the LPC residual signal. The LPC residual signal e(t) is then expressed by the following equation (see Reference 1).

[Equation (2)]

(Reference 1) Furui, "Digital Speech Processing", Tokai University Press, 1985, pp. 60-65.

Next, consider obtaining the phase-equalized speech signal x(t). Let e^'(t) be the phase-equalized residual signal, M a positive even number, M+1 the number of taps of the phase equalization filter, h_{t_0} = {h_{t_0}(-M/2), h_{t_0}(-M/2+1), ..., h_{t_0}(-1), h_{t_0}(0), h_{t_0}(1), ..., h_{t_0}(M/2)} the phase equalization filter (the subscript t_0 denotes t₀), t₀ the pulse generation time, and δ(t) the delta function. The phase-equalized residual signal e^'(t) is then expressed by the following equation.

[Equation (3)]

Here, the phase equalization filter h_{t_0} is determined such that

[Equation (4)]

holds. Assuming that the LPC residual signal e(t) is uncorrelated, this gives

[Equation (5)]

Using the resulting phase equalization filter h_{t_0}, the phase-equalized speech signal x(t) is generated by the following equation.

[Equation (6)]

Next, consider finding the LPC coefficients a^ that minimize the squared error with respect to a model that assumes a pulse train as the sound source. Let xw(t) = x(t)w(t) be the phase-equalized speech signal multiplied by a window function w(t), G(t) the pulse amplitude, and Gw(t) = G(t)w(t). The LPC coefficients a^ are then expressed by the following equation.

[Equation (7)]

From this equation, the LPC coefficients a^ are obtained by solving the following simultaneous equations.

[Equation (8)]

Here, t_i is a glottal closure instant, hereinafter called a pitch mark time. R_xx is the autocorrelation function of the phase-equalized speech signal x(t) and is expressed by the following equation.

[Equation (9)]

Here, L is the number of frames required to obtain the autocorrelation function.

[LPC analysis that does not go through the phase-equalized speech signal]
Next, a method for obtaining the LPC coefficients a^ without using the phase-equalized speech signal is described.
In the phase equalization linear prediction method (phase equalization-based autoregressive, hereinafter also referred to as "PEAR"), the LPC coefficients a^ are derived using the phase-equalized speech signal x(t) and its autocorrelation function R_xx. As a result, generating the phase-equalized speech signal x(t) increases the amount of computation, and using the autocorrelation function R_xx may introduce analysis errors.
Therefore, equations that use neither the phase-equalized speech signal x(t) nor the autocorrelation function R_xx are derived.

First, letting spw(t) = sp(t)w(t), the autocorrelation function R_xx of the phase-equalized speech signal x(t) becomes

[Equation (10)]

Assuming here that the LPC residual signal e(t) is uncorrelated, this becomes

[Equation (11)]

so the autocorrelation function R_xx of the phase-equalized speech signal x(t) coincides with the autocorrelation function R_SS of the pre-emphasized original speech signal spw(t) (multiplied by the window function w(t)).

Next, the term

[Expression]

contained in the right-hand side of the simultaneous equations (8) is transformed as follows.

[Equation]

Here, the pulse amplitude G(t_i) is set to the value obtained by the least-squares solution:

[Equation (14)]

Setting the pulse amplitude G(t_i) to the least-squares value means that the error between the speech signal obtained from the LPC coefficients a^ and the sound source signal, and the phase-equalized speech signal x(t_i) corresponding to the speech signal, is minimized. Assuming w(t_i) = w(t_i-p), this gives

[Equation (15)]

That is,

[Equation (16)]

Therefore, the LPC coefficients a^ = {a^(1), a^(2), ..., a^(p), ..., a^(P)} can be obtained by solving

[Equation (17)]

Note that the matrices in equation (17) contain neither the phase-equalized speech signal x(t) nor its autocorrelation function R_xx. Moreover, because G(t) does not appear, no iterative computation as in Reference 2 is needed.

Equation (17) was obtained by assuming w(t_i) = w(t_i-p), that is, by setting the values w(t_i-1) through w(t_i-P) to w(t_i). Alternatively, so that the assumption is better satisfied, the values w(t_i) through w(t_i-P) may be set to

[Equation]

In this case, w(t_i) in equation (17) is replaced with W.
(Reference 2) Sadao Hiroya and Takemi Mochida, "Robust estimation of the vocal tract spectrum using a linear prediction method based on phase equalization", IEICE Technical Report, November 2010, vol. 110, no. 297, SP2010-76, pp. 41-46.

Computing the following value, which appears on the right-hand side of equation (17),

[Expression]

is equivalent to applying the phase equalization processing (equations (5) and (6)) to the pre-emphasized original speech signal sp(t).

To obtain LPC coefficients that are robust to the fundamental frequency F0, Non-Patent Document 1 requires at least 5 times the computation of ordinary LPC analysis, and Patent Document 1 requires 1.8 times, because it applies phase equalization to the entire original speech signal. In this embodiment, by contrast, there is no need to apply phase equalization to the entire original speech signal as in Patent Document 1. As equation (17) shows, phase equalization (or processing equivalent to it) need only be applied to the P values of the original speech signal sp(t) preceding each pitch mark time t_i. For example, when one processing unit contains about 4 pitch marks (I = 4), the total LPC analysis order P is about 18, the number of frames is about 400, and the number of taps is about 11, the computation is only about 1.2 times that of ordinary LPC analysis.

<Speech analysis conversion synthesis apparatus 100 according to the first embodiment>
FIG. 1 is a functional block diagram of the speech analysis conversion synthesis apparatus 100 according to the first embodiment, and FIG. 2 shows an example of its processing flow.
The speech analysis conversion synthesis apparatus 100 includes a speech section detection unit 110, a first LPC analysis unit 130, a pitch mark analysis unit 140, a second LPC analysis unit 160, and a speech conversion unit 170.
The speech analysis conversion synthesis apparatus 100 receives a speech signal (the original signal) and outputs a speech signal converted into the desired speech (the synthesized speech signal).

<Speech section detection unit 110>
First, the speech section detection unit 110 receives the speech signal (original speech) s(t), detects the speech sections of the input speech signal, and outputs them (S110). For example, it computes the power of the speech signal s(t), detects a speech section when the power exceeds a predetermined threshold, and outputs information representing the speech section (hereinafter also referred to as "speech section information") (S110). For example, with the speech section information denoted u(t), u(t) = 1 if the speech signal s(t) is in a speech section and u(t) = 0 otherwise. The start and end times of each speech section (and/or of each non-speech section) may also be output as the speech section information, or the speech signal s(t) detected as a speech section may itself be output as the speech section information. In short, the subsequent processing only needs to know where the speech sections are, so any method may be used to detect them and output the speech section information. Performing the subsequent processing only on speech sections reduces the processing load. In this embodiment, the speech section information is u(t).
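The power-threshold example above can be sketched as follows. This is only one possible reading of the text: the function name `detect_speech_sections`, the fixed frame length, and the threshold value are illustrative assumptions, and the text itself allows any detection method.

```python
import numpy as np

def detect_speech_sections(s, frame_len=400, threshold=1e-3):
    """Return u(t): 1 where the frame power exceeds the threshold, else 0.

    frame_len and threshold are hypothetical values chosen for illustration.
    """
    s = np.asarray(s, dtype=float)
    u = np.zeros(len(s), dtype=int)
    for start in range(0, len(s), frame_len):
        frame = s[start:start + frame_len]
        # Mean squared amplitude is used here as the frame power.
        if np.mean(frame ** 2) > threshold:
            u[start:start + frame_len] = 1
    return u
```

Later stages can then skip every sample with u(t) = 0, which is where the reduction in processing load comes from.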

<First LPC analysis unit 130>
The first LPC analysis unit 130 receives the speech signal s(t) and its speech section information u(t), obtains the LPC residual signal e(t) using the speech signal s(t) in the speech section and the LPC coefficients a obtained from it by LPC analysis (S130), and outputs the LPC residual signal e(t) together with the autocorrelation function R_SS obtained in the course of the LPC analysis.

In this embodiment, for example, the first LPC analysis unit 130 performs ordinary LPC analysis (see, for example, Reference 1) on the original speech signal s(t) to obtain the LPC coefficients a and the autocorrelation function R_SS.

Finally, the LPC residual signal e(t) is obtained by equation (2) using the original speech signal s(t), s(t-1), ..., s(t-P) and the LPC coefficients a = {a(1), a(2), ..., a(P)}.

The pre-emphasized original speech signal sp(t) in equation (2) may be obtained by applying pre-emphasis to the original speech signal s(t) according to equation (1).
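As a rough illustration of what the first LPC analysis unit computes, the sketch below estimates LPC coefficients by the autocorrelation method and then inverse-filters the signal to obtain a residual. Several points are assumptions rather than statements of the embodiment: the function names are hypothetical, no window is applied, the normal equations are solved directly instead of by the Levinson-Durbin recursion, and the sign convention e(t) = sp(t) - Σ a(p) sp(t-p) is assumed because equation (2) is not reproduced in this text.

```python
import numpy as np

def lpc_autocorrelation(x, order):
    """Estimate LPC coefficients a(1..P) and the autocorrelation r(0..P)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Autocorrelation at lags 0..order.
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = r(1..P).
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    return a, r

def lpc_residual(x, a):
    """Residual under the assumed convention e(t) = x(t) - sum_p a(p) x(t-p)."""
    x = np.asarray(x, dtype=float)
    e = x.copy()
    for p, a_p in enumerate(a, start=1):
        e[p:] -= a_p * x[:-p]
    return e
```

In a pipeline following the text, `x` would be the pre-emphasized signal sp(t) of a speech section, and the returned autocorrelation values would be passed on to the second LPC analysis.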

<Pitch mark analysis unit 140>
The pitch mark analysis unit 140 extracts the pitch mark time group {t₀, t₁, t₂, ..., t_i, ..., t_I} (S140) and outputs it.
Any method may be used to extract the pitch mark times. However, whether the pitch mark times t_i of the original speech signal s(t) can be detected accurately has a large effect on the stability of the estimated LPC coefficients a^, so a method with higher estimation accuracy is preferable.

(Extraction method example 1)
For example, the method of Reference 3 may be used.
(Reference 3) Honda, M., "Speech coding using waveform matching based on LPC residual phase equalization", Proc. ICASSP, 1990, pp. 213-216.
The method of Reference 3 proceeds as follows.
(1) First, find the LPC residual signal e(t-j) that takes the maximum value among J frames of the LPC residual signal e(t), e(t-1), ..., e(t-J+1), where j is one of 0, 1, ..., J-1.
(2) Next, create a phase equalization filter centered on the time (t-j) at which that maximum occurs (the reference pitch mark).
(3) Then phase-equalize the J frames of the LPC residual signal e(t), e(t-1), ..., e(t-J+1).
(4) In the phase-equalized residual signal, take the points that exceed a predetermined threshold as pitch marks.
However, when pitch mark times are extracted by this method, they may not be detected accurately if a peak caused by noise is taken as the maximum. For that reason, the pitch mark times may instead be detected by the method of extraction method example 2.
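Steps (1) to (4) of extraction method example 1 can be sketched as below. The form of the phase equalization filter is not reproduced in this text, so the sketch assumes a matched filter, that is, the time-reversed residual around the reference pitch mark normalized to unit energy. The relative threshold `ratio`, the added local-maximum condition, and both function names are likewise illustrative assumptions.

```python
import numpy as np

def phase_eq_filter(e, t_ref, M=10):
    """Assumed matched filter: h(m) = e(t_ref - m) / norm, m = -M/2..M/2."""
    half = M // 2
    h = np.zeros(M + 1)
    for k, m in enumerate(range(-half, half + 1)):
        idx = t_ref - m
        if 0 <= idx < len(e):
            h[k] = e[idx]
    norm = np.sqrt(np.sum(h ** 2))
    return h / norm if norm > 0 else h

def extract_pitch_marks(e, M=10, ratio=0.5):
    e = np.asarray(e, dtype=float)
    half = M // 2
    t_ref = int(np.argmax(e))                 # (1) maximum of the residual
    h = phase_eq_filter(e, t_ref, M)          # (2) filter centered on t_ref
    full = np.convolve(e, h, mode="full")
    e_eq = full[half:half + len(e)]           # (3) phase-equalized residual
    thr = ratio * e_eq.max()                  # hypothetical relative threshold
    marks = [t for t in range(1, len(e) - 1)  # (4) peaks above the threshold
             if e_eq[t] > thr and e_eq[t] > e_eq[t - 1] and e_eq[t] >= e_eq[t + 1]]
    return marks, e_eq
```

Under this assumed filter, every residual pulse that has the same shape as the one at the reference mark is compressed into a single sharp peak, which is why a simple threshold is enough in step (4).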

(Extraction method example 2)
Select several (for example, three) values of the LPC residual signal within the frame in descending order and extract the corresponding group of times. For each time in the group, compute the autocorrelation function of the phase-equalized residual signal obtained with the phase equalization filter centered on that time. The time at which the difference between the autocorrelation value at the fundamental period T₀ (the pitch lag, that is, the reciprocal of the fundamental frequency F0) and the autocorrelation value at T₀+1 exceeds a threshold (that is, the autocorrelation value changes abruptly) is taken as the reference pitch mark of extraction method example 1. Steps (2) to (4) of extraction method example 1 are then carried out to extract the pitch marks.

(Extraction method example 3)
Alternatively, the pitch mark times may be measured using an electroglottograph (Electro-Glotto-Graph, hereinafter also referred to as "EGG"). For example, the pitch mark times are detected using the derivative of the EGG signal.
Extraction method examples 1 to 3 may also be combined, or another extraction method (for example, the extraction method of Patent Document 1) may be used.
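The EGG-based detection of extraction method example 3 can be sketched as below. The text only says that the derivative of the EGG signal is used, so everything else is an assumption made for illustration: that a glottal closure shows up as a sharp rise in the EGG signal (the polarity depends on the recording setup, hence the `sign` parameter), the relative threshold `ratio`, and the function name.

```python
import numpy as np

def pitch_marks_from_egg(egg, ratio=0.5, sign=1.0):
    """Pick pitch marks at strong peaks of the EGG derivative.

    sign=1.0 assumes closures appear as sharp rises; use sign=-1.0 for
    recordings with the opposite polarity.
    """
    d = sign * np.diff(np.asarray(egg, dtype=float))  # d[i] = egg[i+1] - egg[i]
    thr = ratio * d.max()                             # hypothetical threshold
    marks = []
    for i in range(1, len(d) - 1):
        if d[i] > thr and d[i] > d[i - 1] and d[i] >= d[i + 1]:
            marks.append(i + 1)  # report the sample right after the jump
    return marks
```

Because the marks come from a separate sensor signal, a sketch like this does not depend on the LPC residual at all, which is one reason it could be combined with examples 1 and 2.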

<Second LPC analysis unit 160>
The sound source signal is assumed to have a single pulse of amplitude G at each pitch mark time of the pitch mark time group and to consist of white noise at all other times. The LPC residual signal is also assumed to be uncorrelated.

The second LPC analysis unit 160 receives the original speech signal s(t), the autocorrelation function R_SS of the speech signal, the LPC residual signal e(t), and the pitch mark time group {t₀, t₁, t₂, ..., t_i, ..., t_I}. Using these values, it obtains the second LPC coefficients a^ (S160) such that the error between the speech signal obtained from the second LPC coefficients a^ and the sound source signal G(t) (a single pulse of amplitude G at the pitch mark times t₀, t₁, t₂, ..., t_i, ..., t_I and white noise at all other times), and the phase-equalized speech signal x(t) corresponding to the original speech signal s(t), is minimized (see equation (14)), and outputs them. For example, the second LPC coefficients a^ can be obtained by solving the simultaneous equations of equation (17).

The pre-emphasized original speech signal sp(t) in equation (17) may be obtained by applying pre-emphasis to the original speech signal s(t) according to equation (1). Alternatively, the unit may receive and use the sp(t) obtained by the first LPC analysis unit 130 in place of the original speech signal s(t).

<Voice conversion unit 170>
The voice conversion unit 170 receives the original speech signal s(t), the speech section information u(t), and the second LPC coefficient a^. It obtains the roots z of the prediction polynomial given by the second LPC coefficient a^, selects formants using the roots z, and generates a conversion filter from the vocal tract spectrum corresponding to the selected formants and the vocal tract spectrum corresponding to the formants obtained by converting the selected formants by a predetermined method. It then converts the original speech signal s(t) with the conversion filter (S170) and outputs the converted speech signal y(t). The vocal tract spectrum is converted, for example, as follows.

[Vocal tract spectrum conversion]
A method for converting the formant frequency is described. The formants are obtained from the roots z of the prediction polynomial given by the second LPC coefficient a^.

Here, Fs is the sampling frequency, Re(z) and Im(z) are the real and imaginary parts of the root z, and F and B are the formant frequency and bandwidth candidates, respectively. The bandwidth expresses the sharpness of a peak in the vocal tract spectrum. For example, in a 12th-order LPC analysis, up to six complex-conjugate root pairs are obtained as formant candidates. The formants must then be selected appropriately from among the candidate roots. Normally, roots with a narrow bandwidth are selected as formants.
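The mapping from roots to formant candidates can be sketched as follows. This is a minimal sketch that assumes the widely used pole-to-formant relations F = (Fs/2π)·arg(z) and B = -(Fs/π)·ln|z|, which may differ in detail from the expression of the embodiment. The function names and the `max_bw` selection threshold are hypothetical, and the roots are taken as given (computing them from a^ is left to a polynomial root finder).

```python
import cmath
import math


def formant_candidates(roots, fs):
    """Map prediction-polynomial roots to (frequency, bandwidth) pairs in Hz.

    Assumed relations (standard, not quoted from the source):
        F = fs / (2*pi) * atan2(Im z, Re z)
        B = -fs / pi * ln|z|
    """
    cands = []
    for z in roots:
        if z.imag <= 0:  # keep one root of each complex-conjugate pair
            continue
        freq = fs / (2.0 * math.pi) * cmath.phase(z)
        bw = -fs / math.pi * math.log(abs(z))
        cands.append((freq, bw))
    return sorted(cands)  # ascending frequency


def select_formants(cands, max_bw):
    """Keep only narrow-bandwidth candidates (max_bw is a hypothetical threshold)."""
    return [(f, b) for f, b in cands if b < max_bw]
```

Sorting by frequency means the first surviving candidate can be read as the first formant, although a real selector would likely also apply frequency-range checks.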

Let A(z) be the vocal tract spectrum corresponding to the selected formant frequency and bandwidth, and let A′(z) be the vocal tract spectrum after conversion. The conversion filter F(z) is then expressed as follows.
F(z) = A(z)/A′(z) (19)
The converted speech signal Y(z) is obtained by passing the original speech signal S(z) through the conversion filter F(z), as in the following equation.
Y(z) = F(z)S(z) (20)
The converted speech signal Y(z) is then transformed into the time domain to obtain the converted time-domain speech signal y(t) (Reference 4), which is output as the output of the speech analysis conversion synthesis apparatus 100.
(Reference 4) Villacorta, V.M., Perkell, J.S., and Guenther, F.H., "Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception", J. Acoust. Soc. Am., 2007, pp. 2306-2319.

For example, the vocal tract spectra A(z) and A′(z) are expressed as follows.

For example, β = 0.9. r_p and θ_p denote the magnitude and angle of the complex root, respectively. The time-domain speech signal y(t) is then obtained by the following equation.
y(t) = s(t) - (2r cos θ)s(t-1) + r²s(t-2) + (2r cos θ′)y(t-1) - r²y(t-2)
Here, r and θ denote the magnitude and angle of the complex root, respectively.
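The time-domain equation for y(t) is a second-order recursive filter and can be sketched directly. In this sketch, θ′ is read as the angle of the root after conversion, samples before t = 0 are taken to be zero, and the function name `shift_formant` is hypothetical; these are assumptions of the illustration rather than statements from the text.

```python
import math


def shift_formant(s, r, theta, theta_new):
    """Apply y(t) = s(t) - 2r cos(theta) s(t-1) + r^2 s(t-2)
                    + 2r cos(theta_new) y(t-1) - r^2 y(t-2).

    `theta` is the angle of the selected root and `theta_new` the angle
    after conversion (theta' in the text, as assumed here). Samples
    before t = 0 are treated as zero.
    """
    b1 = -2.0 * r * math.cos(theta)     # feed-forward term removes the old formant
    a1 = 2.0 * r * math.cos(theta_new)  # feedback term adds the converted formant
    r2 = r * r
    y = []
    for t in range(len(s)):
        s1 = s[t - 1] if t >= 1 else 0.0
        s2 = s[t - 2] if t >= 2 else 0.0
        y1 = y[t - 1] if t >= 1 else 0.0
        y2 = y[t - 2] if t >= 2 else 0.0
        y.append(s[t] + b1 * s1 + r2 * s2 + a1 * y1 - r2 * y2)
    return y
```

When `theta_new` equals `theta` the filter reduces to the identity, which is a convenient sanity check before applying an actual formant shift.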

<Effect>
With this configuration, the LPC coefficients can be obtained faster and more accurately than before, with less computation. Furthermore, more natural speech can be output even when the vocal tract spectrum is converted. To obtain LPC coefficients that are robust to F0, Non-Patent Document 1 requires at least 5 times the computation of ordinary LPC analysis, and Patent Document 1 requires 1.8 times. In the present embodiment, unlike Patent Document 1, phase equalization need not be applied to the entire original speech signal. It is sufficient to apply it only to the few samples preceding each pitch mark time, so only 1.2 times the computation is needed.

By using the second LPC analysis unit described above, it is possible, for example, to realize an utterance diagnosis apparatus that obtains the second LPC coefficient a^ and displays an accurate vocal tract spectrum in real time. Furthermore, by using the speech analysis conversion synthesis apparatus described above, it is possible to realize an apparatus that corrects, in real time, pronunciation in multiple languages produced by native speakers of Japanese.

<Simulation results>
FIG. 3 shows an example of formant frequency conversion in the first embodiment. In FIG. 3, the broken line shows the speech spectrum of the input original speech signal s(t). The dash-dot line shows the speech spectrum after the first formant has been removed from the original speech signal s(t), which corresponds to dividing the original speech signal S(z) in Expression (20) by 1/A(z) (see Expression (19)). The solid line shows the speech spectrum after the converted first formant has been added, which corresponds to multiplying the original speech signal S(z) in Expression (20) by 1/A′(z) (see Expression (19)). The sampling frequency, analysis window length, shift length, α, analysis window, and LPC order were 16 kHz, 25 ms, 12.5 ms, 0.97, a Blackman window, and 18, respectively. Synthesized speech was used as input in this simulation: the five Japanese vowels were synthesized with a Klatt formant speech synthesizer. The errors (in Hz and %) of the first formant frequency relative to the correct value at a fundamental frequency of 280 Hz are shown.

In every case, the speech analysis conversion synthesis apparatus of the first embodiment yields the smaller error.
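The analysis conditions of the simulation (16 kHz sampling, 25 ms Blackman window, 12.5 ms shift) translate into 400-sample frames advanced by 200 samples. The sketch below shows that framing step only. The function names are hypothetical, the classic Blackman coefficients (0.42, 0.5, 0.08) are assumed, and dropping a trailing partial frame is a choice of this sketch.

```python
import math

FS = 16000       # sampling frequency in Hz
WIN_MS = 25.0    # analysis window length in ms
SHIFT_MS = 12.5  # frame shift in ms


def blackman(n):
    """Classic Blackman window of length n (n >= 2)."""
    return [0.42
            - 0.5 * math.cos(2.0 * math.pi * k / (n - 1))
            + 0.08 * math.cos(4.0 * math.pi * k / (n - 1))
            for k in range(n)]


def frames(signal, fs=FS, win_ms=WIN_MS, shift_ms=SHIFT_MS):
    """Split `signal` into Blackman-windowed frames.

    A trailing partial frame is dropped (a choice made in this sketch).
    """
    win_len = int(round(fs * win_ms / 1000.0))
    shift = int(round(fs * shift_ms / 1000.0))
    w = blackman(win_len)
    out = []
    for start in range(0, len(signal) - win_len + 1, shift):
        out.append([signal[start + k] * w[k] for k in range(win_len)])
    return out
```

Each windowed frame would then be passed to the LPC analysis, with the LPC order set to 18 under the conditions listed above.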

<Modification>
In the present embodiment, the first LPC analysis unit and the second LPC analysis unit use the pre-emphasized original speech signal. Pre-emphasis is not mandatory, however, and the original speech signal may be used as it is. That said, using the pre-emphasized original speech signal improves the accuracy of the LPC analysis.
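Pre-emphasis is commonly implemented as a first-order difference, which can be sketched as follows. The form sp(t) = s(t) - α·s(t-1) is assumed here as the usual definition and may differ from Expression (1) of the description; α = 0.97 is the value listed in the simulation conditions, and the function name is hypothetical.

```python
def pre_emphasis(s, alpha=0.97):
    """First-order pre-emphasis sp(t) = s(t) - alpha * s(t-1).

    The sample before t = 0 is taken to be zero (an assumption of this
    sketch). Setting alpha = 0 returns the signal unchanged, which
    corresponds to skipping pre-emphasis.
    """
    return [s[t] - alpha * (s[t - 1] if t > 0 else 0.0)
            for t in range(len(s))]
```

Because the filter boosts high frequencies relative to low ones, it flattens the spectral tilt of voiced speech before the LPC analysis.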

<Other modifications>
The present invention is not limited to the above embodiment and modification. For example, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as necessary. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The various processing functions of each apparatus described in the above embodiment and modifications may be realized by a computer. In that case, the processing content of the functions that each apparatus should have is described by a program. By executing this program on a computer, the various processing functions of each apparatus are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage unit. When executing the processing, the computer reads the program stored in its own storage unit and executes processing according to the read program. As another mode of executing the program, the computer may read the program directly from the portable recording medium and execute processing according to it. The computer may also execute processing according to the received program each time a program is transferred to it from the server computer. Alternatively, the processing described above may be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and acquisition of results. Note that the program includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In the above description, each apparatus is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

Claims

An LPC analysis apparatus that uses an original speech signal, a pitch mark time group extracted from the original speech signal, and an LPC residual signal obtained by performing LPC analysis on the original speech signal, wherein,
on the assumption that a sound source signal has a single pulse of amplitude G at each pitch mark time of the pitch mark time group and consists of white noise at times other than the pitch mark times,
the apparatus obtains a second LPC coefficient, using the original speech signal, the autocorrelation function of the original speech signal, the LPC residual signal, and the pitch mark time group, such that the error between the speech signal obtained from the second LPC coefficient and the sound source signal and a phase-equalized speech signal corresponding to the original speech signal is minimized.

The LPC analysis apparatus according to claim 1, wherein,
with sp denoting the pre-emphasized original speech signal, R_SS the autocorrelation function of the original speech signal sp, e the LPC residual signal, t_i the pitch mark times, w the window function, and a^ the second LPC coefficient, the expression

is used to obtain the second LPC coefficient a^.

A speech analysis conversion synthesis apparatus comprising:
a voice section detection unit that detects a voice section of an input original speech signal;
a first LPC analysis unit that obtains an LPC residual signal using the original speech signal and an LPC coefficient obtained from the original speech signal by LPC analysis;
a pitch mark analysis unit that extracts a pitch mark time group from the original speech signal;
a second LPC analysis unit configured to obtain a second LPC coefficient, on the assumptions that a sound source signal has a single pulse of amplitude G at each pitch mark time of the pitch mark time group and consists of white noise at times other than the pitch mark times and that the LPC residual signal is uncorrelated, using the original speech signal, the autocorrelation function of the original speech signal, the LPC residual signal, and the pitch mark time group, such that the error between the speech signal obtained from the second LPC coefficient and the sound source signal and a phase-equalized speech signal corresponding to the original speech signal is minimized; and
a voice conversion unit that obtains roots from the prediction polynomial given by the second LPC coefficient, selects a formant using the roots, generates a conversion filter using the vocal tract spectrum corresponding to the selected formant and the vocal tract spectrum corresponding to a formant obtained by converting the selected formant by a predetermined method, and converts the original speech signal with the conversion filter.

An LPC analysis method that uses an original speech signal, a pitch mark time group extracted from the original speech signal, and an LPC residual signal obtained by performing LPC analysis on the original speech signal, wherein,
on the assumption that a sound source signal has a single pulse of amplitude G at each pitch mark time of the pitch mark time group and consists of white noise at times other than the pitch mark times,
a second LPC coefficient is obtained, using the original speech signal, the autocorrelation function of the original speech signal, the LPC residual signal, and the pitch mark time group, such that the error between the speech signal obtained from the second LPC coefficient and the sound source signal and a phase-equalized speech signal corresponding to the original speech signal is minimized.

The LPC analysis method according to claim 4, wherein,
with sp denoting the pre-emphasized original speech signal, R_SS the autocorrelation function of the original speech signal sp, e the LPC residual signal, t_i the pitch mark times, w the window function, and a^ the second LPC coefficient, the expression

is used to obtain the second LPC coefficient a^.

A speech analysis conversion synthesis method comprising:
a voice section detection step of detecting a voice section of an input original speech signal;
a first LPC analysis step of obtaining an LPC residual signal using the original speech signal and an LPC coefficient obtained from the original speech signal by LPC analysis;
a pitch mark analysis step of extracting a pitch mark time group from the original speech signal;
a second LPC analysis step of obtaining a second LPC coefficient, on the assumptions that a sound source signal has a single pulse of amplitude G at each pitch mark time of the pitch mark time group and consists of white noise at times other than the pitch mark times and that the LPC residual signal is uncorrelated, using the original speech signal, the autocorrelation function of the original speech signal, the LPC residual signal, and the pitch mark time group, such that the error between the speech signal obtained from the second LPC coefficient and the sound source signal and a phase-equalized speech signal corresponding to the original speech signal is minimized; and
a voice conversion step of obtaining roots from the prediction polynomial given by the second LPC coefficient, selecting a formant using the roots, generating a conversion filter using the vocal tract spectrum corresponding to the selected formant and the vocal tract spectrum corresponding to a formant obtained by converting the selected formant by a predetermined method, and converting the original speech signal with the conversion filter.

A program for causing a computer to function as the speech analysis conversion synthesis apparatus according to claim 3.