JP4999757B2

JP4999757B2 - Speech analysis / synthesis apparatus, speech analysis / synthesis method, computer program, and recording medium

Info

Publication number: JP4999757B2
Application number: JP2008092985A
Authority: JP
Inventors: 定男廣谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-03-31
Filing date: 2008-03-31
Publication date: 2012-08-15
Anticipated expiration: 2028-03-31
Also published as: JP2009244723A

Abstract

PROBLEM TO BE SOLVED: To provide a speech analysis and synthesis device that outputs speech which is easy to hear, by converting the local speech speed of each part of input speech to a desired speed and thereby reducing local variation in the speech speed of the input speech.

SOLUTION: In a speech analysis section 100, a speech spectrum and parameters of a sound source or the like are extracted from a speech signal. In a speech conversion section 200, a predetermined conversion is performed on the speech spectrum and the parameters of the sound source or the like, based on speed information of articulation parameters, and the speech signal is generated by a vocoder-type speech synthesis section 300.

COPYRIGHT: (C)2010,JPO&INPIT

Description

The present invention relates to a speech analysis/synthesis apparatus, a speech analysis/synthesis method, a computer program, and a recording medium capable of outputting a speech signal in which the speech rate of an input speech signal has been converted to a desired rate.

Methods that focus on the speech spectrum have been proposed for synthesizing speech by converting the speaking style of a given speech signal (see, for example, Non-Patent Document 1). However, it is difficult to capture the characteristics of a speaking style from the speech spectrum alone, and sufficient quality cannot be obtained at present.

Treating speech from the viewpoint of articulatory movement, and converting the speaking style on the basis of that movement, is more intuitive than using the speech spectrum described above and is expected to be more accurate. However, because the mapping between articulatory movement and the speech spectrum is nonlinear, the characteristics of a speaking style converted on the basis of articulatory movement are not well reflected in the speech spectrum after mapping (see, for example, Patent Document 1).

The most basic form of speaking-style control is speech-rate conversion. Stretching or compressing the time axis of the speech signal itself also changes the fundamental frequency, so the use of pitch-synchronous analysis has been proposed (see, for example, Non-Patent Document 2). However, the conventional method of extracting the pitch marks needed for pitch-synchronous analysis, which thresholds the absolute value of the LPC (linear prediction coefficient) prediction residual signal, is known to extract pitch marks poorly, particularly for female voices with a high fundamental frequency (see, for example, Non-Patent Document 3). Pitch-synchronous analysis is also known to allow extraction of a stable speech spectrum and stable sound source information, unaffected by the fundamental frequency, compared with analysis methods that use a fixed window length and a fixed frame shift (see, for example, Non-Patent Document 3).

In speech synthesis, a driving sound source that switches between a single pulse sequence and white noise yields a synthesized speech signal of poor quality. One remedy replaces the single pulse sequence with a multi-pulse sequence determined so as to minimize the error with respect to a phase-equalized speech signal, but this produces buzzy speech where voiced and unvoiced segments switch (see, for example, Patent Document 2). As a way of reducing this buzziness, a driving sound source that mixes a single pulse sequence and white noise according to a voiced/unvoiced decision for each frequency band has been proposed, but further improvement in quality is needed (see, for example, Non-Patent Document 4).

Non-Patent Document 1: Tachibana, M., Yamagishi, J., Masuko, T., and Kobayashi, T., "Speech synthesis with various emotional expressions and speaking styles by style interpolation and morphing," IEICE Trans. Information and Systems, E88-D, 11, pp. 2484-2491 (2005).
Non-Patent Document 2: Moulines, E., and Charpentier, F., "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Communication, 9, pp. 453-467 (1990).
Non-Patent Document 3: Miyoshi, Y., Yamato, K., Mizoguchi, R., Yanagida, M., and Kakusho, O., "Analysis of speech signals of short pitch period by a sample-selective linear prediction," IEEE Trans. Signal Processing, 35, 9, pp. 1233-1240 (1987).
Non-Patent Document 4: McCree, A. V., and Barnwell, T. P., "A mixed excitation LPC vocoder model for low bit rate speech coding," IEEE Trans. Speech and Audio Processing, 3, 4, pp. 242-249 (1995).
Patent Document 1: Japanese Patent No. 3412798
Patent Document 2: Japanese Examined Patent Publication No. 7-82360

When speaking-style conversion based on articulatory movement uses the mapping from articulation parameters to the speech spectrum proposed in Patent Document 1, a speech signal of sufficient quality cannot be synthesized. A technique is therefore needed that applies the characteristics of the articulation parameters relevant to speaking-style conversion directly to the conversion of the speech spectrum, without going through a mapping from articulation parameters to the speech spectrum.

The present invention has been made in view of these circumstances. Its object is to provide a speech analysis/synthesis apparatus, a speech analysis/synthesis method, a computer program, and a recording medium that can output speech in which the local speech rate of each part of the input speech has been converted to a desired rate, thereby reducing local variation in the speech rate of the input speech and producing speech that is easy to hear.

The present invention has been made to solve the above problem and is a speech analysis/synthesis apparatus comprising: a data input unit that measures speech as a speech signal and collects simultaneously measured data of articulatory movement; a speech analysis unit that analyzes the measurement data; a speech conversion unit that applies a predetermined conversion to the analysis results of the speech analysis unit; and a vocoder-type speech synthesis unit that synthesizes speech from the conversion results of the speech conversion unit.

The speech analysis unit comprises: a fundamental frequency calculation unit that detects a speech section from the speech signal and calculates the fundamental frequency in the speech section; a pulse sequence generation unit that uses the fundamental frequency to generate a pitch waveform having a pulse sequence corresponding to the pitch period; an LPC coefficient calculation unit that performs linear prediction analysis on the speech signal and calculates LPC (linear prediction) coefficients; an LPC residual calculation unit that calculates an LPC prediction residual waveform from the speech signal using an inverse filter whose filter coefficients are the LPC coefficients; an LSP coefficient calculation unit that calculates LSP (line spectrum pair) coefficients from the LPC coefficients; a pitch mark calculation unit that, within the detected speech section, extracts pitch marks that maximize the cross-correlation between the absolute value of the LPC prediction residual waveform and the absolute value of the pitch waveform; a phase-equalized speech calculation unit that, within the detected speech section, generates phase-equalized speech, in which the phase component of the speech signal is equalized to a constant phase, from the speech signal, the pitch marks, and the LPC prediction residual waveform; a phase-equalized pulse sound source filter calculation unit that, within the detected speech section, calculates the filter coefficients of a phase-equalized pulse sound source model from the phase-equalized speech; a voiced intensity calculation unit that, within the detected speech section, calculates the voice intensity for each frequency band from the speech signal by a predetermined calculation method; and a white noise gain calculation unit that, within the detected speech section, calculates a white noise gain from the speech signal by a predetermined calculation method.

The speech conversion unit comprises: an articulation speed calculation unit that calculates the speed of the articulation parameters as the articulation speed from the measurement data of the articulatory movement; an LSP coefficient conversion unit that applies a predetermined conversion to the LSP coefficients according to the articulation speed; a fundamental frequency conversion unit that applies a predetermined conversion to the fundamental frequency according to the articulation speed; a phase-equalized pulse sound source filter conversion unit that applies a predetermined conversion to the filter coefficients of the phase-equalized pulse sound source model according to the articulation speed; a white noise gain conversion unit that applies a predetermined conversion to the white noise gain according to the articulation speed; and a voiced intensity conversion unit that applies a predetermined conversion to the voice intensity for each frequency band according to the articulation speed.

The speech synthesis unit comprises: a driving sound source generation unit that generates a phase-equalized pulse sound source from the fundamental frequency converted by the fundamental frequency conversion unit, the filter coefficients converted by the phase-equalized pulse sound source filter conversion unit, and the phase-equalized pulse sound source model, and that, according to the voice intensity for each frequency band converted by the voiced intensity conversion unit, generates a driving sound source in which the generated phase-equalized pulse sound source is mixed in voiced bands and white noise is mixed in unvoiced bands; and a convolution operation unit that synthesizes a speech signal from the LSP coefficients converted by the LSP coefficient conversion unit and the output signal of the driving sound source.

The apparatus is characterized in that, when the voiced intensity calculation unit calculates the voice intensity, when the white noise gain calculation unit calculates the white noise gain, and when the driving sound source generation unit generates the phase-equalized pulse sound source, the calculation or generation uses an analysis window length of two pitch periods.

The present invention is also the speech analysis/synthesis apparatus described above, characterized in that the speech synthesis unit has an LPC coefficient calculation unit that converts the LSP coefficients converted by the LSP coefficient conversion unit into LPC coefficients, and the convolution operation unit synthesizes the speech signal by convolving the LPC coefficients obtained by the LPC coefficient calculation unit with the output signal of the driving sound source.

The present invention is also a speech analysis/synthesis apparatus characterized in that the LSP coefficient conversion unit, the fundamental frequency conversion unit, the phase-equalized pulse sound source filter conversion unit, the white noise gain conversion unit, and the voiced intensity conversion unit of the speech conversion unit each use, as the articulation speed at time $t$, the RMS distance $dx_t$ computed from the articulation parameters $x_{t,i}$ ($i = 1, \dots, n$; the horizontal and vertical positions of the lips, tongue, and so on):

$$dx_t = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{t,i}-x_{t-1,i}\right)\left(x_{t,i}-x_{t-1,i}\right)}$$

where the unit of the articulation speed is mm. The average articulation speed $\mathrm{avedx}$ is calculated by dividing the sum of the articulation speeds over the whole speech section by the length (number of frames) of the whole speech section. Then, for every time $t$, the index $k$ satisfying $dx_k \le t \times \mathrm{avedx}$ and $dx_{k+1} > t \times \mathrm{avedx}$ is found, and the parameter at time $t$ is obtained by linear interpolation:

$$\frac{\left(dx_{k+1}-t\times\mathrm{avedx}\right)\,p_k+\left(t\times\mathrm{avedx}-dx_k\right)\,p_{k+1}}{dx_{k+1}-dx_k}$$

where $p_k$ is the LSP coefficient, the fundamental frequency, the filter coefficient of the phase-equalized pulse sound source, the white noise gain, or the voiced intensity for each frequency band at time $k$.
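The articulation speed and its average defined above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function names `articulatory_speed` and `average_speed` are hypothetical, and setting the speed of the first frame to zero (because it has no preceding frame) is an assumption the text does not state.

```python
import numpy as np

def articulatory_speed(x):
    """RMS frame-to-frame distance dx_t of articulation parameters.

    x: array of shape (T, n), one row of n sensor coordinates (mm) per frame.
    Returns an array of shape (T,). dx[0] is set to 0 (assumption: the first
    frame has no predecessor).
    """
    x = np.asarray(x, dtype=float)
    diff = np.diff(x, axis=0)                 # x[t] - x[t-1], shape (T-1, n)
    dx = np.sqrt(np.mean(diff ** 2, axis=1))  # RMS over the n coordinates
    return np.concatenate([[0.0], dx])

def average_speed(dx):
    """avedx: sum of articulation speeds divided by the number of frames."""
    dx = np.asarray(dx, dtype=float)
    return float(np.sum(dx) / len(dx))
```

For example, a two-coordinate sensor that moves from (0, 0) to (3, 4) between consecutive frames gives a speed of sqrt((9 + 16) / 2) for that frame.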

The present invention is also a speech analysis/synthesis method for a speech analysis/synthesis apparatus comprising: a data input unit that measures speech as a speech signal and collects simultaneously measured data of articulatory movement; a speech analysis unit that analyzes the measurement data; a speech conversion unit that applies a predetermined conversion to the analysis results of the speech analysis unit; and a vocoder-type speech synthesis unit that synthesizes speech from the conversion results of the speech conversion unit.

The speech analysis unit performs: a fundamental frequency calculation step of detecting a speech section from the speech signal and calculating the fundamental frequency in the speech section; a pulse sequence generation step of using the fundamental frequency to generate a pitch waveform having a pulse sequence corresponding to the pitch period; an LPC coefficient calculation step of performing linear prediction analysis on the speech signal and calculating LPC (linear prediction) coefficients; an LPC residual calculation step of calculating an LPC prediction residual waveform from the speech signal using an inverse filter whose filter coefficients are the LPC coefficients; an LSP coefficient calculation step of calculating LSP (line spectrum pair) coefficients from the LPC coefficients; a pitch mark calculation step of extracting, within the detected speech section, pitch marks that maximize the cross-correlation between the absolute value of the LPC prediction residual waveform and the absolute value of the pitch waveform; a phase-equalized speech calculation step of generating, within the detected speech section, phase-equalized speech, in which the phase component of the speech signal is equalized to a constant phase, from the speech signal, the pitch marks, and the LPC prediction residual waveform; a phase-equalized pulse sound source filter calculation step of calculating, within the detected speech section, the filter coefficients of a phase-equalized pulse sound source model from the phase-equalized speech; a voiced intensity calculation step of calculating, within the detected speech section, the voice intensity for each frequency band from the speech signal by a predetermined calculation method; and a white noise gain calculation step of calculating, within the detected speech section, a white noise gain from the speech signal by a predetermined calculation method.

The speech conversion unit performs: an articulation speed calculation step of calculating the speed of the articulation parameters as the articulation speed from the measurement data of the articulatory movement; an LSP coefficient conversion step of applying a predetermined conversion to the LSP coefficients according to the articulation speed; a fundamental frequency conversion step of applying a predetermined conversion to the fundamental frequency according to the articulation speed; a phase-equalized pulse sound source filter conversion step of applying a predetermined conversion to the filter coefficients of the phase-equalized pulse sound source model according to the articulation speed; a white noise gain conversion step of applying a predetermined conversion to the white noise gain according to the articulation speed; and a voiced intensity conversion step of applying a predetermined conversion to the voice intensity for each frequency band according to the articulation speed.

The speech synthesis unit performs: a driving sound source generation step of generating a phase-equalized pulse sound source from the fundamental frequency converted in the fundamental frequency conversion step, the filter coefficients converted in the phase-equalized pulse sound source filter conversion step, and the phase-equalized pulse sound source model, and of generating, according to the voice intensity for each frequency band converted in the voiced intensity conversion step, a driving sound source in which the generated phase-equalized pulse sound source is mixed in voiced bands and white noise is mixed in unvoiced bands; and a convolution operation step of synthesizing a speech signal from the LSP coefficients converted in the LSP coefficient conversion step and the output signal of the driving sound source.

The method is characterized in that, when the voice intensity is calculated in the voiced intensity calculation step, when the white noise gain is calculated in the white noise gain calculation step, and when the phase-equalized pulse sound source is generated in the driving sound source generation step, the calculation or generation uses an analysis window length of two pitch periods.

The present invention is also a speech analysis/synthesis method characterized in that the speech synthesis unit performs an LPC coefficient calculation step of converting the LSP coefficients converted in the LSP coefficient conversion step into LPC coefficients, and the convolution operation step synthesizes the speech signal by convolving the LPC coefficients obtained in the LPC coefficient calculation step with the output signal of the driving sound source.

The present invention is also a speech analysis/synthesis method characterized in that, in each of the LSP coefficient conversion step, the fundamental frequency conversion step, the phase-equalized pulse sound source filter conversion step, the white noise gain conversion step, and the voiced intensity conversion step, the speech conversion unit performs the following. First, as the articulation speed at time $t$, it uses the RMS distance $dx_t$ computed from the articulation parameters $x_{t,i}$ ($i = 1, \dots, n$; the horizontal and vertical positions of the lips, tongue, and so on):

$$dx_t = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{t,i}-x_{t-1,i}\right)\left(x_{t,i}-x_{t-1,i}\right)}$$

where the unit of the articulation speed is mm. Second, it calculates the average articulation speed $\mathrm{avedx}$ by dividing the sum of the articulation speeds over the whole speech section by the length (number of frames) of the whole speech section. Third, for every time $t$, it finds the index $k$ satisfying $dx_k \le t \times \mathrm{avedx}$ and $dx_{k+1} > t \times \mathrm{avedx}$ and obtains the parameter at time $t$ by linear interpolation:

$$\frac{\left(dx_{k+1}-t\times\mathrm{avedx}\right)\,p_k+\left(t\times\mathrm{avedx}-dx_k\right)\,p_{k+1}}{dx_{k+1}-dx_k}$$

where $p_k$ is the LSP coefficient, the fundamental frequency, the phase-equalized pulse sound source filter coefficient, the white noise gain, or the voiced intensity for each frequency band at time $k$.

The present invention is also a computer program for causing a computer in a speech analysis/synthesis apparatus to execute the steps described above, the apparatus comprising a data input unit that collects measurement data of speech and articulatory movement, a speech analysis unit that analyzes the measurement data, a speech conversion unit that applies a predetermined conversion to the analysis results of the speech analysis unit, and a vocoder-type speech synthesis unit that synthesizes speech from the conversion results of the speech conversion unit.

The present invention is also a computer-readable recording medium characterized by storing the computer program described above.

In the speech analysis/synthesis apparatus and method of the present invention, the speech analysis unit extracts the speech spectrum, sound source parameters, and the like from the speech signal; the speech conversion unit applies a predetermined conversion to them on the basis of the speed information of the articulation parameters; and a vocoder-type speech synthesizer generates the speech signal. This makes it possible to synthesize high-quality speech with a variety of speaking styles. For example, speech can be output in which the local speech rate of each part of the input speech has been converted to a desired rate without changing the pitch of the input voice. That is, by slowing down the parts where the speech rate is fast and speeding up the parts where it is slow, local variation in the speech rate of the input speech is reduced and speech that is easy to hear is obtained.

Furthermore, in the speech analysis/synthesis apparatus and method of the present invention, for the LSP coefficients, the fundamental frequency, the filter coefficients of the phase-equalized pulse sound source, the white noise gain, or the voiced intensity for each frequency band obtained by the speech analysis unit, the average articulation speed $\mathrm{avedx}$ is calculated from the speed information of the articulation parameters. For every time $t$, the index $k$ satisfying $dx_k \le t \times \mathrm{avedx}$ and $dx_{k+1} > t \times \mathrm{avedx}$ is found, and the parameter at time $t$ is calculated by linear interpolation:

$$\frac{\left(dx_{k+1}-t\times\mathrm{avedx}\right)\,p_k+\left(t\times\mathrm{avedx}-dx_k\right)\,p_{k+1}}{dx_{k+1}-dx_k}$$

where $p_k$ is the LSP coefficient, the fundamental frequency, the phase-equalized pulse sound source filter coefficient, the white noise gain, or the voiced intensity for each frequency band at time $k$. This makes it possible to generate speech with the speaking style that would result if a person spoke at a constant articulation speed.
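The interpolation above can be read as a time warp that resamples each parameter track so that articulatory distance grows at a constant rate. The sketch below rests on an interpretation that the text does not spell out: the quantity compared with t × avedx is taken to be the cumulative articulatory distance up to frame k (the running sum of the per-frame speeds), since only a cumulative quantity can be bracketed by k and k+1 in this way. The function name `warp_to_constant_speed` is hypothetical, and clipping the interpolation weight to [0, 1] is an added safeguard.

```python
import numpy as np

def warp_to_constant_speed(dx, p):
    """Resample parameter track p so that articulation speed becomes constant.

    dx: per-frame articulation speeds, shape (T,).
    p:  parameter track, shape (T,) or (T, d) (e.g. LSP vectors or F0).
    Assumption: the bracketing condition is applied to the cumulative sum of dx.
    """
    dx = np.asarray(dx, dtype=float)
    p = np.asarray(p, dtype=float)
    T = len(dx)
    cum = np.cumsum(dx)          # cumulative articulatory distance per frame
    avedx = cum[-1] / T          # average articulation speed
    out = np.empty_like(p)
    for t in range(T):
        target = t * avedx       # distance a constant-speed speaker has covered
        # Largest k with cum[k] <= target, kept inside the valid range.
        k = int(np.searchsorted(cum, target, side="right")) - 1
        k = min(max(k, 0), max(T - 2, 0))
        nxt = min(k + 1, T - 1)
        denom = cum[nxt] - cum[k]
        if denom <= 0.0:
            out[t] = p[k]        # no movement between the frames: copy p_k
        else:
            w = np.clip((target - cum[k]) / denom, 0.0, 1.0)
            out[t] = (1.0 - w) * p[k] + w * p[nxt]
    return out
```

A convenient sanity check is to pass the cumulative distance itself as the parameter track: the warped track should then increase by exactly avedx per frame.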

FIG. 1 is a diagram showing the configuration of a speech analysis/synthesis apparatus according to an embodiment of the present invention.
The apparatus shown in FIG. 1 is configured by connecting a microphone 2 and a two-dimensional magnetic sensor system 3 to a speech analysis/synthesis apparatus 1.

The speech analysis/synthesis apparatus 1 contains a main control unit 11 having a CPU (Central Processing Unit), ROM (Read Only Memory), RAM (Random Access Memory), and the like. The main control unit 11 supervises and controls the processing of each processing unit in the speech analysis/synthesis apparatus 1. The CPU in the main control unit 11 constitutes a computer system.

The data input unit 12 is connected to the microphone 2 and the two-dimensional magnetic sensor system 3 via an interface unit 13. The data input unit 12 acquires simultaneously measured data: the speech signal measured by the microphone 2 and the articulatory movement (movement of the lips and tongue) measured by the two-dimensional magnetic sensor system 3. In the speech analysis/synthesis apparatus 1 shown in FIG. 1, for example, the speech signal is sampled at 16 kHz, and the articulation parameters are the horizontal and vertical positions of six points in total (one point near the lower gum, one point on each of the upper and lower lips, and three points on the tongue), measured 250 times per second, which gives a 12-dimensional vector.
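The two measurement rates determine how the streams line up: 16,000 speech samples per second against 250 articulatory frames per second gives 64 speech samples per articulatory frame. The sketch below illustrates that bookkeeping. The helper names and the coordinate ordering inside the 12-dimensional vector (horizontal and vertical values interleaved point by point) are assumptions, not details given in the text.

```python
import numpy as np

SPEECH_RATE_HZ = 16000   # speech sampling rate stated in the text
ARTIC_RATE_HZ = 250      # articulatory measurement rate stated in the text
NUM_POINTS = 6           # lower gum, upper lip, lower lip, three tongue points

# Speech samples that span one articulatory frame: 16000 / 250 = 64.
SAMPLES_PER_FRAME = SPEECH_RATE_HZ // ARTIC_RATE_HZ

def to_frame_vectors(points_xy):
    """Flatten (T, 6, 2) sensor positions into (T, 12) parameter vectors.

    The ordering (x0, y0, x1, y1, ...) is an assumed layout.
    """
    a = np.asarray(points_xy, dtype=float)
    return a.reshape(a.shape[0], -1)

def speech_slice_for_frame(t):
    """Slice of speech samples that coincides with articulatory frame t."""
    start = t * SAMPLES_PER_FRAME
    return slice(start, start + SAMPLES_PER_FRAME)
```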

The speech analysis/synthesis apparatus 1 also has a speech analysis unit 100, a speech conversion unit 200, and a vocoder-type speech synthesis unit 300.

On the basis of the speech signal collected by the data input unit 12, the speech analysis unit 100 calculates the LPC (linear prediction) coefficients, the LSP (line spectrum pair) coefficients, the phase-equalized speech, the filter coefficients of the phase-equalized pulse sound source model, the voiced intensity for each frequency band, the white noise gain, and the like.

Based on the measurement data of the articulatory movements, the speech conversion unit 200 converts (for example, by linear interpolation) the LSP coefficients, the fundamental frequency, the filter coefficients of the phase-equalized pulse sound source model, the white noise gain, the voiced intensity for each frequency band, and the like. The speech synthesis unit 300 uses the parameters converted by the speech conversion unit 200 to generate a driving sound source and synthesizes a speech signal from the signal of this driving sound source. The speech output unit 14 outputs speech based on the speech signal synthesized by the speech synthesis unit 300.

FIG. 2 is a diagram showing the configuration of the speech analysis unit 100.
The fundamental frequency calculation unit 101 in the speech analysis unit 100 shown in FIG. 2 detects the speech sections based on the power of the speech signal and extracts the fundamental frequency.

Within each speech section, the pulse sequence generation unit 102 uses the fundamental frequency obtained by the fundamental frequency calculation unit 101 to generate a pulse sequence signal ex whose pulses follow the pitch period. This pulse sequence signal ex is called the pitch waveform (see, for example, the pitch waveform shown in FIG. 8(B)).

The LPC coefficient calculation unit 103 performs ordinary linear prediction analysis on the speech signal and calculates the LPC (linear prediction) coefficients. The LPC residual calculation unit 104 obtains the LPC prediction residual waveform res with an inverse filter whose filter coefficients are the LPC coefficients (see, for example, the LPC prediction residual waveform shown in FIG. 8(C)).

The LSP coefficient calculation unit 105 calculates the LSP (line spectrum pair) coefficients from the LPC coefficients calculated by the LPC coefficient calculation unit 103 and stores them.

The pitch mark calculation unit 106 extracts pitch marks based on the LPC prediction residual signal res and the pulse sequence signal ex (see, for example, the pitch marks shown in FIG. 8(D)).

Within each speech section, the phase-equalized speech calculation unit 107 uses the speech signal, the pitch marks obtained by the pitch mark calculation unit 106, and the waveform of the LPC prediction residual signal to generate phase-equalized speech, in which the phase component of the speech signal is equalized to a constant phase (see, for example, Patent Document 2).

Within each speech section, the phase-equalized pulse sound source filter calculation unit 108 generates a phase-equalized pulse sound source model that minimizes the perceptually weighted error between the phase-equalized speech and the synthesized speech signal, and obtains the parameters of this model (the FIR filter coefficients v_k) (see, for example, Patent Document 2).

The voiced intensity calculation unit 109 passes the speech signal through band-pass filters and calculates, for example, the autocorrelation function or the harmonic structure index for every 4 ms frame period to obtain the voiced intensity (see, for example, Non-Patent Document 4). The white noise gain calculation unit 110 calculates the white noise gain outside the speech sections.

FIG. 5 shows the flow of processing in the speech analysis unit 100, which is described below with reference to FIG. 5.

First, the speech signal from the microphone is acquired through the data input unit 12 (step S101). FIG. 8(A) shows an example of the speech signal waveform.

Next, the fundamental frequency calculation unit 101 detects the speech sections in the acquired speech signal based on its power. In the present embodiment, for example, the fundamental frequency (F0, or equivalently the pitch period) is obtained from the instantaneous-frequency amplitude spectrum using an analysis window length (analysis interval) of about 30 ms, chosen to match the characteristics of the human vocal tract, and an analysis shift length of about 4 ms (step S102).

The fundamental frequency can be calculated using, for example, the method described in IEICE document (5): Arifianto, D., Tanaka, T., Masuko, T., and Kobayashi, T., "Robust F0 estimation of speech signal using harmonicity measure based on instantaneous frequency," IEICE Trans. Information and Systems, E87-D, 12, pp. 2812-2820 (2004).

Other methods, such as the modified autocorrelation method, can also be used to extract the fundamental frequency. In the present invention, however, extraction errors in the fundamental frequency strongly affect the accuracy of speech analysis and synthesis, so it is important to use a method with as few extraction errors as possible.

Next, within each speech section, the pulse sequence generation unit 102 uses the fundamental frequency obtained by the fundamental frequency calculation unit 101 to generate the pulse sequence signal (pitch waveform) ex corresponding to the pitch period (step S103). An example of this pulse sequence signal ex is shown in FIG. 8(B).
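Purely as an illustration of step S103, the following Python sketch places one unit pulse per pitch period from a frame-wise F0 contour. The function name `pulse_train`, the phase-accumulator approach, and the convention that an F0 of 0 marks a frame outside the speech section are assumptions made for this example; the embodiment does not specify how the pulses are positioned.

```python
def pulse_train(f0_per_frame, fs=16000, frame_shift_ms=4.0):
    """Return a 0/1 pulse sequence with one pulse per pitch period.

    f0_per_frame: F0 in Hz for each 4 ms frame (0 marks a frame with no pitch;
    this convention is an assumption of the sketch).
    """
    hop = int(round(fs * frame_shift_ms / 1000.0))  # samples per frame
    ex = [0.0] * (hop * len(f0_per_frame))
    phase = 0.0  # fraction of the current pitch period already elapsed
    for t, f0 in enumerate(f0_per_frame):
        if not f0:
            phase = 0.0  # no pulses here; restart the period at the next voiced frame
            continue
        for n in range(t * hop, (t + 1) * hop):
            phase += f0 / fs  # advance by one sample
            if phase >= 1.0:  # a full pitch period has elapsed
                ex[n] = 1.0
                phase -= 1.0
    return ex
```

With a constant F0 of 125 Hz at 16 kHz sampling, the pulses are spaced exactly 128 samples apart.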

Next, the LPC coefficient calculation unit 103 performs linear prediction analysis on the speech signal, and the LPC residual calculation unit 104 obtains the LPC prediction residual waveform res with the LPC inverse filter. An example of this LPC prediction residual waveform res is shown in FIG. 8(C).

In the present embodiment, as described above, the LPC analysis uses a window length of 30 ms and a shift length of 4 ms, and the coefficients are obtained by the 28th-order autocorrelation method; a lag window is also applied to avoid the influence of the fundamental frequency. The LPC coefficients are converted into line spectrum pair (LSP) coefficients by the LSP coefficient calculation unit 105 and stored (step S104).
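As a hedged illustration of the autocorrelation method and the inverse filter, the sketch below computes LPC coefficients with the Levinson-Durbin recursion and derives the prediction residual. The sign convention A(z) = 1 + a[1]z^-1 + ... + a[P]z^-P, the Gaussian shape of the lag window, and its 60 Hz bandwidth are assumptions of this example; the embodiment states only that a lag window is used. Analysis windowing of the 30 ms frame and the LSP conversion are omitted.

```python
import math

def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def lag_window(r, fs=16000, bw_hz=60.0):
    """Gaussian lag window (assumed shape and bandwidth); leaves r[0] unchanged."""
    return [rk * math.exp(-0.5 * (2.0 * math.pi * bw_hz * k / fs) ** 2)
            for k, rk in enumerate(r)]

def levinson(r, order):
    """Levinson-Durbin recursion. Returns (a, err) with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k  # remaining prediction error energy
    return a, err

def lpc_residual(x, a):
    """Inverse filter: res[n] = sum_k a[k] * x[n - k]."""
    p = len(a) - 1
    return [sum(a[k] * x[n - k] for k in range(min(p, n) + 1)) for n in range(len(x))]
```

For the embodiment's settings one would call `levinson(lag_window(autocorr(frame, 28)), 28)` on each 30 ms frame; the order is left as a parameter so that the sketch can be checked on small inputs.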

Next, within each speech section, for frame number t (frame period 4 ms) and time k (window length 30 ms), the pitch mark calculation unit 106 calculates, for each frame t, the cross-correlation function between the absolute value of the pitch waveform ex(t, k) generated in step S103 and the absolute value of the LPC prediction residual waveform res(t, k) obtained in step S104,

$$ r(t, j) = \sum_k |res(t, k)| \, |ex(t, k + j)|, $$

and uses dynamic programming to find the sequence of j that maximizes $\sum_t r(t, j)$. Here |·| denotes the absolute value. The resulting sequence of j indicates the times at which the absolute value of the LPC prediction residual signal is large, and therefore gives the pitch mark candidates. Finally, the time at which |res| is maximal is searched for again in the neighborhood of each pitch mark candidate and is extracted as a pitch mark (step S105). An example of the pitch marks is shown in FIG. 8(D).
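The per-frame cross-correlation r(t, j) and the final local search of step S105 can be sketched as follows. This is an illustration only: the dynamic programming over frames is not shown, and the function names, the symmetric shift range, and the search radius are assumptions of the example.

```python
def cross_correlation(res, ex, max_shift):
    """r(j) = sum_k |res[k]| * |ex[k + j]| for one frame, j = -max_shift..max_shift."""
    r = {}
    for j in range(-max_shift, max_shift + 1):
        r[j] = sum(abs(res[k]) * abs(ex[k + j])
                   for k in range(len(res)) if 0 <= k + j < len(ex))
    return r

def refine_pitch_mark(res, candidate, radius):
    """Return the index within `radius` of `candidate` where |res| is largest."""
    lo = max(0, candidate - radius)
    hi = min(len(res), candidate + radius + 1)
    return max(range(lo, hi), key=lambda k: abs(res[k]))
```

Under this indexing, a pulse of ex at position p together with a best shift j points to a candidate at p - j in the residual, which `refine_pitch_mark` then moves to the nearest large-magnitude residual sample.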

Next, within each speech section, the phase-equalized speech calculation unit 107 uses the speech signal obtained in step S101, the pitch marks obtained in step S105, and the LPC prediction residual signal obtained in step S104 to generate phase-equalized speech, in which the phase component of the speech signal is equalized to a constant phase (see, for example, Patent Document 2) (step S106).

Then, within each speech section, the phase-equalized pulse sound source filter calculation unit 108 obtains the phase-equalized pulse sound source model that minimizes the perceptually weighted error with respect to the phase-equalized speech, together with the parameters of that model (the FIR filter coefficients v_k) (step S107). As in Patent Document 2, the transfer characteristic of the FIR filter (6 taps) is expressed as follows.

Here, T_i is the pitch period at pitch mark i.

Letting the autocorrelation function of the speech signal s be

the white noise gain calculated by the white noise gain calculation unit 110 is given by

where P is the order of the LPC analysis, α_k is the k-th LPC coefficient, n is the frame number, and N is the window length.

Next, the voiced intensity calculation unit 109 passes the speech signal through band-pass filters covering 0-500, 500-1000, 1000-2000, 2000-3000, 3000-4000, 4000-5000, 5000-6000, 6000-7000, and 7000-8000 Hz, calculates the autocorrelation function or the harmonic structure index (see, for example, the IEICE document (5) cited above) for every 4 ms frame period, and uses the result as the voiced intensity (see, for example, Non-Patent Document 4) (step S108).
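One common way to turn an autocorrelation function into a voiced intensity is to take the peak of the normalized autocorrelation over the plausible pitch lags. The sketch below does this for a single frame of one band-pass filtered signal. It is an assumption-laden illustration: the band-pass filtering itself is presumed to have been done already, and the normalization, the lag range, and the name `voicing_strength` are choices made for this example rather than details given in the embodiment.

```python
import math

def voicing_strength(band_frame, min_lag, max_lag):
    """Peak of the normalized autocorrelation over lags min_lag..max_lag.

    Values near 1 indicate a strongly periodic (voiced) band, values near 0
    an aperiodic one.
    """
    n = len(band_frame)
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        num = sum(band_frame[i] * band_frame[i + lag] for i in range(n - lag))
        e1 = sum(v * v for v in band_frame[:n - lag])  # energy of the leading part
        e2 = sum(v * v for v in band_frame[lag:])      # energy of the lagged part
        den = math.sqrt(e1 * e2)
        if den > 0.0:
            best = max(best, num / den)
    return best
```

At 16 kHz sampling, lags of 32 to 320 samples correspond to pitch frequencies of 500 Hz down to 50 Hz; that range is likewise an assumption.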

The phase-equalized pulse sound source, the white noise gain, and the voiced intensity are each calculated with the pitch mark position as the start of the analysis and with an analysis window length of two pitch periods. In the present embodiment the analysis is pitch-synchronous, whereas the synthesis uses a 4 ms frame shift, so the pitch mark positions do not coincide with the start times of the fixed-length frames. The parameters for each frame are therefore obtained by linear interpolation. The white noise gain outside the speech sections is obtained with a fixed window length of 15 ms and a fixed frame shift of 4 ms (step S109).
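The resampling from pitch-synchronous analysis instants to fixed 4 ms frames can be sketched as below. The function name `interp_to_frames` and the choice to hold the first and last values outside the range of the pitch marks are assumptions of this example; the embodiment states only that linear interpolation is used.

```python
def interp_to_frames(mark_times, values, frame_times):
    """Linearly interpolate values given at pitch marks onto fixed frame times.

    mark_times must be strictly increasing and frame_times non-decreasing.
    """
    out = []
    j = 0
    for t in frame_times:
        if t <= mark_times[0]:       # before the first pitch mark: hold first value
            out.append(values[0])
            continue
        if t >= mark_times[-1]:      # after the last pitch mark: hold last value
            out.append(values[-1])
            continue
        while mark_times[j + 1] < t:  # advance to the interval containing t
            j += 1
        t0, t1 = mark_times[j], mark_times[j + 1]
        w = (t - t0) / (t1 - t0)
        out.append((1.0 - w) * values[j] + w * values[j + 1])
    return out
```

For example, values 1.0, 3.0, 2.0 at pitch marks 0, 10, and 20 ms give 1.8 at the 4 ms frame and 2.8 at the 12 ms frame.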

FIG. 3 is a diagram showing a configuration example of the speech conversion unit 200.
As shown in FIG. 3, the speech conversion unit 200 has an articulation speed calculation unit 201 that calculates the speed of the articulation parameters (the articulation speed). It also comprises an LSP coefficient conversion unit 202, a fundamental frequency conversion unit 203, a phase-equalized pulse sound source filter conversion unit 204, a white noise gain conversion unit 205, and a per-band voiced intensity conversion unit 206, each of which converts (linearly interpolates) the corresponding parameter obtained by the speech analysis unit 100 based on the articulation speed. The linear interpolation is described later.

FIG. 6 is a diagram showing the flow of processing in the speech conversion unit, which is described below with reference to FIG. 6.

Within each speech section, the articulation speed calculation unit 201 calculates the speed of the articulation parameters measured with the two-dimensional magnetic sensor system (the articulation speed) (steps S201 and S202).
An example of the articulation speed waveform is shown as the bottom waveform (thin line) in FIG. 11.

When calculating this articulation speed, the articulation speed at time t is taken to be the RMS distance dx_t,

$$ dx_t = \sqrt{\frac{1}{12} \sum_{i=1}^{12} \left( x_{t,i} - x_{t-1,i} \right)^2 }, $$

where x_{t,i} (i = 1, ..., 12) are the articulation parameters (the horizontal and vertical positions of the lips, tongue, and so on). The unit of the articulation speed is mm.

The articulation speed calculation unit 201 then calculates the average articulation speed avedx by dividing the sum of the articulation speeds over the whole speech section by the length of the whole speech section (the number of frames) (step S202).
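The RMS distance dx_t and the average avedx can be computed directly from the frame-wise articulator positions. The sketch below is a minimal illustration; the function names and the representation of the measurements as a list of 12-element lists (in mm) are assumptions of the example.

```python
import math

def articulation_speed(x):
    """RMS frame-to-frame displacement dx_t for t = 1..T-1.

    x: list of frames, each a list of articulator coordinates in mm
    (12 values per frame in the embodiment).
    """
    dx = []
    for prev, cur in zip(x, x[1:]):
        sq = sum((c - p) ** 2 for p, c in zip(prev, cur))
        dx.append(math.sqrt(sq / len(cur)))
    return dx

def average_speed(dx):
    """avedx: sum of the articulation speeds divided by the number of frames."""
    return sum(dx) / len(dx)
```

If every coordinate moves by 1 mm and then by 2 mm between successive frames, the speeds are 1.0 and 2.0 and the average is 1.5.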

Then, for every time t, the index k satisfying

$$ dx_k \le t \cdot avedx \quad \text{and} \quad dx_{k+1} > t \cdot avedx $$

is found, and the parameter at time t is finally obtained by the linear interpolation

$$ \frac{(dx_{k+1} - t \cdot avedx)\, p_k + (t \cdot avedx - dx_k)\, p_{k+1}}{dx_{k+1} - dx_k}. $$

Here, p_k is the LSP coefficient, fundamental frequency, phase-equalized pulse sound source filter coefficient, white noise gain, or per-band voiced intensity at time k. The LSP coefficients are calculated by the LSP coefficient conversion unit 202 (step S203), the fundamental frequency by the fundamental frequency conversion unit 203 (step S204), and the phase-equalized pulse sound source filter coefficients by the phase-equalized pulse sound source filter conversion unit 204 (step S205). The white noise gain is calculated by the white noise gain conversion unit 205 (step S206), and the voiced intensity for each frequency band by the per-band voiced intensity conversion unit 206 (step S207).

This makes it possible to generate speech with the utterance style that would result if a human spoke at a constant articulation speed (this is called articulation-speed-equalized speech).
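The interpolation rule above can be sketched as a time warp along the distance travelled by the articulators. One interpretive assumption must be stated plainly: because t × avedx grows with t, this sketch reads the dx_k appearing in the inequality as the cumulative articulatory distance up to frame k, not as the instantaneous speed. The function name and the clamping of the interpolation weight are also assumptions of the example, and at least two frames are required.

```python
def equalize_articulation_speed(dx, p):
    """Resample parameter track p so that the articulators move at constant speed.

    dx[k]: articulatory distance moved between frame k and frame k + 1
           (len(dx) == len(p) - 1).
    p[k]:  one scalar parameter (e.g. one LSP coefficient) at frame k.
    """
    cum = [0.0]                      # cumulative distance (assumed meaning of dx_k)
    for d in dx:
        cum.append(cum[-1] + d)
    avedx = cum[-1] / len(dx)        # average distance per frame
    out = []
    k = 0
    for t in range(len(p)):
        target = t * avedx           # distance that should be covered by time t
        while k + 1 < len(cum) - 1 and cum[k + 1] <= target:
            k += 1
        span = cum[k + 1] - cum[k]
        if span <= 0.0:              # articulators did not move in this interval
            out.append(p[k])
            continue
        w = min(max((target - cum[k]) / span, 0.0), 1.0)
        out.append((1.0 - w) * p[k] + w * p[k + 1])
    return out
```

When the input already has a constant articulation speed, the track is returned unchanged; when movement is slow and then fast, samples from the fast portion are spread over more output frames.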

An example of this articulation-speed-equalized speech is shown in FIG. 11 (second waveform from the top).

Conversely, by arranging the parameters at every mean of the reciprocal of the articulation speed, it is possible to generate speech with an utterance style that is difficult for humans to produce, in which the articulation speed is fast at first and gradually slows down. These methods of converting the articulation speed are only examples, and various other methods are conceivable.

FIG. 4 is a diagram showing a configuration example of the speech synthesis unit 300.
As shown in FIG. 4, the speech synthesis unit 300 has a driving sound source generation unit 301 that generates the phase-equalized pulse sound source and the white noise sound source, and an LPC coefficient calculation unit 302 that calculates the LPC coefficients from the LSP coefficients. It also has a convolution operation unit 303 that synthesizes the final speech signal 304 from the phase-equalized pulse sound source, the white noise, and the LPC coefficients.

FIG. 7 shows the flow of processing in the speech synthesis unit 300, which is described below with reference to FIG. 7.

The speech synthesis unit 300 acquires the parameters that were linearly interpolated based on the articulation speed in the speech conversion unit 200 (step S301).

Also in step S301, the driving sound source generation unit 301 applies equation (1) to the fundamental frequency that was linearly interpolated based on the articulation speed in the speech conversion unit 200, thereby obtaining the phase-equalized pulse sound source filter coefficients and creating the phase-equalized pulse sound source.

The white noise is multiplied by the white noise gain that was interpolated based on the articulation speed in the speech conversion unit 200. Then, based on the voiced intensity interpolated in the same way, each band whose intensity exceeds a certain threshold is treated as a voiced band and each band whose intensity is below it as an unvoiced band, and a driving sound source is created by mixing the phase-equalized pulse sound source in the voiced bands with the white noise in the unvoiced bands.

Finally, the LPC coefficient calculation unit 302 converts the LSP coefficients calculated by the LSP coefficient conversion unit 202 into LPC coefficients, and the convolution operation unit 303 synthesizes speech by convolving the LPC coefficients converted by the LPC coefficient calculation unit 302 with the output signal of the driving sound source (step S302).
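In a vocoder of this type, combining the LPC coefficients with the driving sound source amounts to passing the excitation through the all-pole synthesis filter 1/A(z). The sketch below illustrates this for a single fixed set of coefficients. The sign convention A(z) = 1 + a[1]z^-1 + ... + a[P]z^-P and the function name are assumptions of the example, and the frame-by-frame updating of the coefficients every 4 ms and the LSP-to-LPC conversion are not shown.

```python
def lpc_synthesize(excitation, a):
    """Filter the excitation through 1/A(z), with a[0] == 1.

    y[n] = excitation[n] - sum_{k=1..P} a[k] * y[n - k]
    """
    p = len(a) - 1
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for k in range(1, min(p, n) + 1):
            acc -= a[k] * y[n - k]  # feedback from past output samples
        y.append(acc)
    return y
```

With a = [1.0, -0.5], a unit impulse produces the decaying response 1, 0.5, 0.25, 0.125, which is the impulse response of the corresponding first-order synthesis filter.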

The configuration and processing flow of the speech analysis/synthesis apparatus of the present invention have been described above. As concrete examples, FIGS. 8 to 11 show waveforms of the signals processed in the speech analysis unit 100, the speech conversion unit 200, and the speech synthesis unit 300.

FIG. 8 shows example waveforms for part of the utterance material "udemae" (腕前, "skill"). FIG. 8(A) shows the speech signal input from the microphone and sampled at 16 kHz, and FIG. 8(B) shows the pitch waveform, a signal generated from the fundamental frequency whose pulse sequence follows the pitch period.

FIG. 8(C) shows the waveform of the LPC prediction residual signal res obtained by performing linear prediction (LPC) analysis and applying the LPC inverse filter, and FIG. 8(D) shows the pitch marks generated from the pitch waveform and the LPC prediction residual signal within the speech section, for frame number t (frame period 4 ms) and time k (window length 30 ms). FIG. 8(E) shows the resynthesized waveform, synthesized based on the average articulation speed avedx calculated from the speed of the articulation parameters (the articulation speed).

As FIG. 8 shows, the conventional method of thresholding the absolute value of the LPC prediction residual signal may mistake the portion circled with a dotted line in the waveform of FIG. 8(C) for a pitch mark, whereas the present method makes few such errors.

FIG. 9 is a diagram showing an example of the white noise gain for the utterance material "udemae" (腕前).
FIG. 9(A) shows the case without pitch-synchronous analysis, and FIG. 9(B) the case with it. As FIG. 9 shows, pitch-synchronous analysis yields a smoothly varying gain that is not affected by the fundamental.

FIG. 10 is a diagram showing an example of the driving sound source. FIG. 10(A) shows the phase-equalized pulse sound source, FIG. 10(B) shows the white noise after the gain is applied, and FIG. 10(C) shows the signal obtained by mixing FIG. 10(A) and FIG. 10(B) based on the voiced/unvoiced decision for each frequency band.

FIG. 11 is a diagram showing an example of articulation-speed-equalized speech according to the present invention. The utterance is 「青空に入道雲が浮かんでいます」 ("Thunderclouds are floating in the blue sky").

The signal waveforms shown in FIG. 11 are, from top to bottom: the original speech; the articulation-speed-equalized speech; the fundamental frequency; the horizontal articulatory positions of the lower gum, upper lip, lower lip, tongue 1, tongue 2, and tongue 3; the vertical articulatory positions of the lower gum, upper lip, lower lip, tongue 1, tongue 2, and tongue 3; and the articulation speed. For the fundamental frequency, the articulatory positions, and the articulation speed, the thin lines correspond to the original speech and the thick lines to the articulation-speed-equalized speech.

As FIG. 11 shows, the articulation speed of the articulation-speed-equalized speech is held constant over the speech section, which confirms the effectiveness of the method of the present invention.

A listening test also confirmed that almost no perceptual distortion is noticeable between the resynthesized speech signal and the original speech signal.

In the above description, in step S302 of FIG. 7, the LPC coefficient calculation unit 302 converts the LSP coefficients calculated by the LSP coefficient conversion unit 202 into LPC coefficients, and the convolution operation unit 303 synthesizes speech by convolving the converted LPC coefficients with the output signal of the driving sound source. The invention is not limited to this, however: the convolution operation unit 303 may instead generate an LSP synthesis filter from the LSP coefficients calculated by the LSP coefficient conversion unit 202 and synthesize speech by convolving this LSP synthesis filter with the output signal of the driving sound source.

The speech analysis/synthesis apparatus of the present invention has been described above. The speech analysis/synthesis apparatus 1 shown in FIG. 1 contains a computer system. The functions of the data input unit 12, the speech analysis unit 100, the speech conversion unit 200, the speech synthesis unit 300, and the like are realized by the CPU reading and executing a program (they may of course be realized by dedicated hardware instead).

The program is stored on a computer-readable recording medium such as a hard disk or a ROM, and the above processing is performed when the computer reads and executes the program.

That is, each process in the data input unit 12, the speech analysis unit 100, the speech conversion unit 200, the speech synthesis unit 300, and the like is realized by a central processing unit such as a CPU reading the program and executing the information processing and arithmetic operations.

Here, the computer-readable recording medium refers to a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD-ROM, a semiconductor memory, or the like. The computer program may also be distributed to a computer over a communication line, and the computer that receives it may execute the program.

It is also assumed that an input device, a display device, and the like (none of which are shown) are connected to the speech analysis/synthesis apparatus 1 shown in FIG. 1 as peripheral devices. Here, the input device refers to an input device such as a keyboard or a mouse, and the display device refers to a CRT (Cathode Ray Tube), a liquid crystal display, or the like.

Although an embodiment of the present invention has been described above, the speech analysis/synthesis apparatus of the present invention is not limited to the illustrated example, and various modifications can of course be made without departing from the gist of the present invention.

FIG. 1 is a diagram showing the configuration of the speech analysis/synthesis apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing a configuration example of the speech analysis unit.
FIG. 3 is a diagram showing a configuration example of the speech conversion unit.
FIG. 4 is a diagram showing a configuration example of the speech synthesis unit.
FIG. 5 is a diagram showing the flow of processing in the speech analysis unit.
FIG. 6 is a diagram showing the flow of processing in the speech conversion unit.
FIG. 7 is a diagram showing the flow of processing in the speech synthesis unit.
FIG. 8 is a diagram showing example waveforms for part of the utterance material "udemae" (腕前).
FIG. 9 is a diagram showing an example of the white noise gain for the utterance material "udemae" (腕前).
FIG. 10 is a diagram showing an example of the driving sound source.
FIG. 11 is a diagram showing an example of articulation-speed-equalized speech according to the present invention.

Explanation of symbols

1: speech analysis/synthesis apparatus
2: microphone
3: two-dimensional magnetic sensor system
11: main control unit
12: data input unit
13: interface unit
14: speech output unit
100: speech analysis unit
101: fundamental frequency calculation unit
102: pulse sequence generation unit
103: LPC coefficient calculation unit
104: LPC residual calculation unit
105: LSP coefficient calculation unit
106: pitch mark calculation unit
107: phase-equalized speech calculation unit
108: phase-equalized pulse sound source filter calculation unit
109: voiced intensity calculation unit
110: white noise gain calculation unit
200: speech conversion unit
201: articulation speed calculation unit
202: LSP coefficient conversion unit
203: fundamental frequency conversion unit
204: phase-equalized pulse sound source filter conversion unit
205: white noise gain conversion unit
206: voiced intensity conversion unit
300: speech synthesis unit
301: driving sound source generation unit
302: LPC coefficient calculation unit
303: convolution operation unit

Claims

A speech analysis/synthesis apparatus comprising: a data input unit that collects, as measurement data, a speech signal and articulatory movement data measured simultaneously with the speech; a speech analysis unit that analyzes the measurement data; a speech conversion unit that performs a predetermined conversion on the analysis result of the speech analysis unit; and a vocoder-type speech synthesis unit that synthesizes speech based on the conversion result of the speech conversion unit, wherein
the speech analysis unit comprises:
a fundamental frequency calculation unit that detects a speech section from the speech signal and calculates a fundamental frequency in the speech section;
a pulse sequence generation unit that uses the fundamental frequency to generate a pitch waveform having a pulse sequence according to the pitch period;
an LPC coefficient calculation unit that performs linear prediction analysis on the speech signal and calculates LPC (Linear Predictive Coding) coefficients;
an LPC residual calculation unit that calculates an LPC prediction residual waveform from the speech signal and an inverse filter having the LPC coefficients as its filter coefficients;
an LSP coefficient calculation unit that calculates LSP (Line Spectrum Pair) coefficients from the LPC coefficients;
a pitch mark calculation unit that extracts, within the detected speech section, a pitch mark that maximizes the cross-correlation between the absolute value of the LPC prediction residual waveform and the absolute value of the pitch waveform;
a phase-equalized speech calculation unit that generates, within the detected speech section, phase-equalized speech in which the phase component of the speech signal is equalized to a constant phase, based on the speech signal, the pitch mark, and the LPC prediction residual waveform;
a phase equalization pulse sound source filter calculation unit that calculates, within the detected speech section, filter coefficients of a phase equalization pulse sound source model based on the phase-equalized speech;
a voiced intensity calculation unit that calculates, within the detected speech section, a voiced intensity for each frequency band by a predetermined calculation method based on the speech signal; and
a white noise gain calculation unit that calculates, within the detected speech section, a white noise gain by a predetermined calculation method based on the speech signal,
the speech conversion unit comprises:
an articulation speed calculation unit that calculates, based on the measurement data of the articulatory movement, the speed of the articulation parameters as the articulation speed;
an LSP coefficient conversion unit that performs a predetermined conversion on the LSP coefficients according to the articulation speed;
a fundamental frequency conversion unit that performs a predetermined conversion on the fundamental frequency according to the articulation speed;
a phase equalization pulse sound source filter conversion unit that performs a predetermined conversion on the filter coefficients of the phase equalization pulse sound source model according to the articulation speed;
a white noise gain conversion unit that performs a predetermined conversion on the white noise gain according to the articulation speed; and
a voiced intensity conversion unit that performs a predetermined conversion on the voiced intensity for each frequency band according to the articulation speed,
the speech synthesis unit comprises:
a driving sound source generation unit that generates a phase equalization pulse sound source based on the fundamental frequency converted by the fundamental frequency conversion unit, the filter coefficients converted by the phase equalization pulse sound source filter conversion unit, and the phase equalization pulse sound source model, and that generates a driving sound source by mixing, based on the voiced intensity for each frequency band converted by the voiced intensity conversion unit, the generated phase equalization pulse sound source in voiced bands and white noise in unvoiced bands; and
a convolution operation unit that synthesizes a speech signal from the LSP coefficients converted by the LSP coefficient conversion unit and the output signal of the driving sound source,
and the voiced intensity calculation unit when calculating the voiced intensity, the white noise gain calculation unit when calculating the white noise gain, and the driving sound source generation unit when generating the phase equalization pulse sound source each perform the calculation or generation with an analysis window length of two pitch periods.
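The LPC coefficient calculation and LPC residual calculation named in this claim can be illustrated with a minimal sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion, which the claim does not specify; the function names `lpc_coefficients` and `lpc_residual`, the sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p, and the requirement of a non-silent frame are all illustrative assumptions, not the patent's implementation.

```python
import numpy as np


def lpc_coefficients(frame, order):
    """Estimate [1, a1, ..., ap] of the prediction-error filter A(z).

    Assumption: autocorrelation method + Levinson-Durbin recursion.
    The frame must not be all zeros (the recursion divides by its energy).
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation for lags 0..order
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]  # prediction error energy of the order-0 predictor
    for i in range(1, order + 1):
        # Reflection coefficient for stage i
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_residual(signal, a):
    """Inverse-filter the signal with A(z) to obtain the prediction residual."""
    signal = np.asarray(signal, dtype=float)
    return np.convolve(signal, a)[:len(signal)]
```

In this sketch the residual is what remains after the spectral envelope has been removed, which is the waveform the pitch mark calculation unit then compares against the pulse sequence.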

The speech analysis/synthesis apparatus according to claim 1, wherein
the speech synthesis unit further comprises an LPC coefficient calculation unit that converts the LSP coefficients converted by the LSP coefficient conversion unit into LPC coefficients, and
the convolution operation unit synthesizes the speech signal by convolving the LPC coefficients obtained by the LPC coefficient calculation unit with the output signal of the driving sound source.
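As a rough illustration of the synthesis step in this claim, the sketch below passes a driving sound source through the all-pole filter 1/A(z) defined by the LPC coefficients. Reading the claimed "convolution" as recursive all-pole filtering is an assumption, as are the function name `all_pole_synthesis` and the convention A(z) = 1 + a1 z^-1 + ... + ap z^-p.

```python
import numpy as np


def all_pole_synthesis(excitation, a):
    """Filter the excitation through 1/A(z), with a = [1, a1, ..., ap].

    Assumption: zero initial filter state and a[0] == 1.
    """
    excitation = np.asarray(excitation, dtype=float)
    a = np.asarray(a, dtype=float)
    p = len(a) - 1
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        # Subtract the contribution of previously synthesized samples
        for j in range(1, min(p, n) + 1):
            acc -= a[j] * out[n - j]
        out[n] = acc
    return out
```

Under these assumptions, feeding the LPC prediction residual of a signal back through the same coefficients reproduces the signal, which is why replacing the residual with a modeled driving sound source yields vocoder-type synthesis.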

The speech analysis/synthesis apparatus according to claim 1 or 2, wherein each of the LSP coefficient conversion unit, the fundamental frequency conversion unit, the phase equalization pulse sound source filter conversion unit, the white noise gain conversion unit, and the voiced intensity conversion unit of the speech conversion unit:
uses, as the articulation speed at time $t$, the RMS distance $dx_t$ of the articulation parameters $x_{t,i}$ ($i = 1, \ldots, n$: horizontal and vertical positions of the lips, tongue, and so on),
$$dx_t = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{t,i} - x_{t-1,i}\right)^2},$$
where the unit of the articulation speed is mm;
further calculates an average articulation speed $\mathrm{avedx}$ by dividing the sum of the articulation speeds over the entire speech section by the length (number of frames) of the entire speech section;
further finds, for every time $t$, the $k$ that satisfies
$$dx_k \le t \times \mathrm{avedx} \quad \text{and} \quad dx_{k+1} > t \times \mathrm{avedx};$$
and linearly interpolates the parameter at time $t$ by
$$\frac{(dx_{k+1} - t \times \mathrm{avedx})\, p_k + (t \times \mathrm{avedx} - dx_k)\, p_{k+1}}{dx_{k+1} - dx_k},$$
where $p_k$ is the LSP coefficient, fundamental frequency, filter coefficient of the phase equalization pulse sound source, white noise gain, or voiced intensity for each frequency band at time $k$.
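The conversion in this claim can be sketched as a time warp that makes the articulators travel at the average speed in every frame. The sketch rests on one loud assumption: in the search for k and in the interpolation, dx_k is read as the cumulative articulation distance up to frame k, because a per-frame speed is not monotone in k and the search would otherwise be ill-defined. The function names, the convention dx_0 = 0, and the handling of a motionless recording are also illustrative choices.

```python
import numpy as np


def articulation_speed(x):
    """Per-frame RMS distance dx_t for articulatory positions x of shape (T, n), in mm.

    Assumption: dx_0 = 0, since the first frame has no predecessor.
    """
    x = np.asarray(x, dtype=float)
    diff = np.diff(x, axis=0)
    dx = np.sqrt(np.mean(diff ** 2, axis=1))
    return np.concatenate([[0.0], dx])


def equalize_parameters(x, params):
    """Resample params (first axis = time) at equal steps of articulatory distance."""
    params = np.asarray(params, dtype=float)
    dx = articulation_speed(x)
    cum = np.cumsum(dx)  # assumed cumulative distance used in the search for k
    num_frames = len(dx)
    if num_frames < 2 or cum[-1] == 0.0:
        return params.copy()  # nothing moves, so there is nothing to equalize
    avedx = cum[-1] / num_frames
    targets = np.arange(num_frames) * avedx
    # Largest k with cum[k] <= t * avedx, so that cum[k + 1] > t * avedx
    k = np.searchsorted(cum, targets, side="right") - 1
    k = np.clip(k, 0, num_frames - 2)
    w = (targets - cum[k]) / (cum[k + 1] - cum[k])
    w = w.reshape((-1,) + (1,) * (params.ndim - 1))
    return (1.0 - w) * params[k] + w * params[k + 1]
```

The same function would be applied to each parameter track in turn (LSP coefficients, fundamental frequency, sound source filter coefficients, white noise gain, voiced intensity), so that all tracks stay aligned after the warp.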

A speech analysis/synthesis method performed in a speech analysis/synthesis apparatus comprising: a data input unit that collects, as measurement data, a speech signal and articulatory movement data measured simultaneously with the speech; a speech analysis unit that analyzes the measurement data; a speech conversion unit that performs a predetermined conversion on the analysis result of the speech analysis unit; and a vocoder-type speech synthesis unit that synthesizes speech based on the conversion result of the speech conversion unit, wherein
the speech analysis unit performs:
a fundamental frequency calculation procedure of detecting a speech section from the speech signal and calculating a fundamental frequency in the speech section;
a pulse sequence generation procedure of using the fundamental frequency to generate a pitch waveform having a pulse sequence according to the pitch period;
an LPC coefficient calculation procedure of performing linear prediction analysis on the speech signal and calculating LPC (Linear Predictive Coding) coefficients;
an LPC residual calculation procedure of calculating an LPC prediction residual waveform from the speech signal and an inverse filter having the LPC coefficients as its filter coefficients;
an LSP coefficient calculation procedure of calculating LSP (Line Spectrum Pair) coefficients from the LPC coefficients;
a pitch mark calculation procedure of extracting, within the detected speech section, a pitch mark that maximizes the cross-correlation between the absolute value of the LPC prediction residual waveform and the absolute value of the pitch waveform;
a phase-equalized speech calculation procedure of generating, within the detected speech section, phase-equalized speech in which the phase component of the speech signal is equalized to a constant phase, based on the speech signal, the pitch mark, and the LPC prediction residual waveform;
a phase equalization pulse sound source filter calculation procedure of calculating, within the detected speech section, filter coefficients of a phase equalization pulse sound source model based on the phase-equalized speech;
a voiced intensity calculation procedure of calculating, within the detected speech section, a voiced intensity for each frequency band by a predetermined calculation method based on the speech signal; and
a white noise gain calculation procedure of calculating, within the detected speech section, a white noise gain by a predetermined calculation method based on the speech signal,
the speech conversion unit performs:
an articulation speed calculation procedure of calculating, based on the measurement data of the articulatory movement, the speed of the articulation parameters as the articulation speed;
an LSP coefficient conversion procedure of performing a predetermined conversion on the LSP coefficients according to the articulation speed;
a fundamental frequency conversion procedure of performing a predetermined conversion on the fundamental frequency according to the articulation speed;
a phase equalization pulse sound source filter conversion procedure of performing a predetermined conversion on the filter coefficients of the phase equalization pulse sound source model according to the articulation speed;
a white noise gain conversion procedure of performing a predetermined conversion on the white noise gain according to the articulation speed; and
a voiced intensity conversion procedure of performing a predetermined conversion on the voiced intensity for each frequency band according to the articulation speed,
the speech synthesis unit performs:
a driving sound source generation procedure of generating a phase equalization pulse sound source based on the fundamental frequency converted in the fundamental frequency conversion procedure, the filter coefficients converted in the phase equalization pulse sound source filter conversion procedure, and the phase equalization pulse sound source model, and of generating a driving sound source by mixing, based on the voiced intensity for each frequency band converted in the voiced intensity conversion procedure, the generated phase equalization pulse sound source in voiced bands and white noise in unvoiced bands; and
a convolution operation procedure of synthesizing a speech signal from the LSP coefficients converted in the LSP coefficient conversion procedure and the output signal of the driving sound source,
and when the voiced intensity is calculated in the voiced intensity calculation procedure, when the white noise gain is calculated in the white noise gain calculation procedure, and when the phase equalization pulse sound source is generated in the driving sound source generation procedure, the calculation or generation is performed with an analysis window length of two pitch periods.
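The pulse sequence generation and pitch mark calculation procedures can be pictured with the following sketch. It assumes a constant pitch period (in samples) over the analyzed span and a unit-amplitude pulse train, and it searches only the time offset that maximizes the cross-correlation between the absolute residual and the absolute pulse sequence; the function names and these simplifications are illustrative, not taken from the patent.

```python
import numpy as np


def pulse_train(length, period):
    """Pitch waveform: a unit pulse every `period` samples, starting at sample 0."""
    pulses = np.zeros(length)
    pulses[::period] = 1.0
    return pulses


def pitch_mark_offset(residual, period):
    """Offset (0..period-1) maximizing the cross-correlation of |residual| and |pulses|."""
    res = np.abs(np.asarray(residual, dtype=float))
    pulses = np.abs(pulse_train(len(res), period))
    scores = [np.dot(res[lag:], pulses[:len(res) - lag]) for lag in range(period)]
    return int(np.argmax(scores))


def pitch_marks(residual, period):
    """Sample indices of the pitch marks under the constant-period assumption."""
    offset = pitch_mark_offset(residual, period)
    return np.arange(offset, len(residual), period)
```

Taking absolute values before correlating makes the result independent of the polarity of the residual peaks. With real speech the pitch period changes from frame to frame, so a practical version would repeat this search frame by frame.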

The speech analysis/synthesis method according to claim 4, wherein
the speech synthesis unit further performs an LPC coefficient calculation procedure of converting the LSP coefficients converted in the LSP coefficient conversion procedure into LPC coefficients, and
in the convolution operation procedure, the speech signal is synthesized by convolving the LPC coefficients obtained in the LPC coefficient calculation procedure with the output signal of the driving sound source.
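One common way to carry out the LSP-to-LPC conversion in this claim is to rebuild the symmetric and antisymmetric polynomials P(z) and Q(z) from the line spectral frequencies and average them. The sketch below assumes an even prediction order, LSP coefficients given as ascending frequencies in radians, and the convention A(z) = (P(z) + Q(z)) / 2 with the lowest frequency belonging to P(z); the function name `lsf_to_lpc` is illustrative and the patent does not state which formulation it uses.

```python
import numpy as np


def lsf_to_lpc(lsf):
    """Convert ascending line spectral frequencies (radians) to [1, a1, ..., ap].

    Assumption: even order p; odd-numbered frequencies (1st, 3rd, ...) are roots
    of the symmetric polynomial P(z), even-numbered ones of Q(z).
    """
    lsf = np.asarray(lsf, dtype=float)
    if len(lsf) % 2 != 0:
        raise ValueError("this sketch handles even prediction orders only")
    p_poly = np.array([1.0, 1.0])   # fixed factor (1 + z^-1) of P(z)
    q_poly = np.array([1.0, -1.0])  # fixed factor (1 - z^-1) of Q(z)
    for w in lsf[0::2]:
        p_poly = np.convolve(p_poly, [1.0, -2.0 * np.cos(w), 1.0])
    for w in lsf[1::2]:
        q_poly = np.convolve(q_poly, [1.0, -2.0 * np.cos(w), 1.0])
    a = 0.5 * (p_poly + q_poly)
    return a[:-1]  # the z^-(p+1) terms of P(z) and Q(z) cancel
```

Interpolating in the LSP domain and converting back in this way keeps the synthesis filter stable as long as the interpolated frequencies remain in ascending order between 0 and pi, which is one reason LSP coefficients are a convenient representation for the conversion step.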

The speech analysis/synthesis method according to claim 4 or 5, wherein each of the LSP coefficient conversion procedure, the fundamental frequency conversion procedure, the phase equalization pulse sound source filter conversion procedure, the white noise gain conversion procedure, and the voiced intensity conversion procedure performed by the speech conversion unit includes:
a procedure of using, as the articulation speed at time $t$, the RMS distance $dx_t$ of the articulation parameters $x_{t,i}$ ($i = 1, \ldots, n$: horizontal and vertical positions of the lips, tongue, and so on),
$$dx_t = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{t,i} - x_{t-1,i}\right)^2},$$
where the unit of the articulation speed is mm;
a procedure of calculating an average articulation speed $\mathrm{avedx}$ by dividing the sum of the articulation speeds over the entire speech section by the length (number of frames) of the entire speech section;
a procedure of finding, for every time $t$, the $k$ that satisfies
$$dx_k \le t \times \mathrm{avedx} \quad \text{and} \quad dx_{k+1} > t \times \mathrm{avedx};$$
and a procedure of linearly interpolating the parameter at time $t$ by
$$\frac{(dx_{k+1} - t \times \mathrm{avedx})\, p_k + (t \times \mathrm{avedx} - dx_k)\, p_{k+1}}{dx_{k+1} - dx_k},$$
where $p_k$ is the LSP coefficient, fundamental frequency, filter coefficient of the phase equalization pulse sound source, white noise gain, or voiced intensity for each frequency band at time $k$.

A computer program for causing a computer in a speech analysis/synthesis apparatus, the apparatus comprising a data input unit that collects measurement data of speech and articulatory movement, a speech analysis unit that analyzes the measurement data, a speech conversion unit that performs a predetermined conversion on the analysis result of the speech analysis unit, and a vocoder-type speech synthesis unit that synthesizes speech based on the conversion result of the speech conversion unit,
to execute the procedures according to any one of claims 4 to 6.

A computer-readable recording medium storing the computer program according to claim 7.