JP3398968B2 - Speech analysis and synthesis method

Info

Publication number: JP3398968B2
Application number: JP09226292A
Authority: JP
Inventors: 淳松本; 正之西口
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1992-03-18
Filing date: 1992-03-18
Publication date: 2003-04-21
Anticipated expiration: 2018-04-21
Also published as: JPH05265486A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech synthesis method applied to an analysis-synthesis coding apparatus for speech signals.

[0002]

[Description of the Related Art] Human hearing acts as a kind of spectrum analyzer: sounds whose power spectra are equal are heard as the same sound. A speech analysis-synthesis method obtains synthesized speech by exploiting this property.

[0003] To obtain this synthesized speech, the analysis side analyzes the input speech signal, extracts or detects pitch information, voiced/unvoiced discrimination information, amplitude information and the like, and transmits them to the synthesis side, which artificially generates speech from that information. Depending on how the synthesis is performed, the synthesis side can be classified into the recording-editing method, the parameter-editing method, the synthesis-by-rule method, and so on.

[0004] In the recording-editing method, speech uttered by a person is stored (recorded) in advance in units such as words or phrases, and is read out and concatenated (edited) as needed to synthesize speech.

[0005] The parameter-editing method also works in units such as words and phrases, as the recording-editing method does, but the speech uttered by a person is analyzed in advance on the basis of a speech production model and stored as parameter time series. The parameter time series are concatenated as needed and used to drive a speech synthesizer, which synthesizes the speech.

[0006] The synthesis-by-rule method is a technique for converting a sequence expressed in discrete symbols, such as characters or phonetic symbols, into a continuous signal. In the course of the conversion, the universal properties and the artificial properties of speech production are applied as synthesis rules.

[0007] Each of these synthesis methods simulates the vocal tract characteristics in some form and applies to that simulation a signal whose spectrum is nearly the same as that of the source wave, thereby obtaining the synthesized speech.

[0008]

[Problems to Be Solved by the Invention] In the speech analysis-synthesis method described above, the phase on the synthesis side must be matched to the phase on the analysis side. When the synthesis side obtains the phase information, it may use linear prediction based on the angular frequency together with a correction by white noise. With white noise, however, the noise (error) between the true phase and the predicted phase cannot be controlled.

[0009] Moreover, the level of the white noise used as the correction term is varied according to the proportion of unvoiced sound in the whole band. When blocks containing mostly voiced sound occur in succession, the phase is therefore obtained by prediction alone, with no correction. As a result, when a strong vowel lasts for a long time, the errors accumulate and the sound quality deteriorates.

[0010] An object of the present invention is therefore to provide a speech synthesis method that improves sound quality by correcting the prediction with noise whose magnitude and variance can be controlled.

[0011]

[Means for Solving the Problems] The speech synthesis method according to the present invention divides an input speech signal into frames, obtains a pitch for each frame, and synthesizes speech using a group of signals representing the fundamental wave of the obtained pitch and its harmonics. To solve the above problem, the method comprises: a step of setting the phase of the fundamental wave and its harmonics at the leading end of the frame to a predetermined value; a step of predicting the phase of the fundamental wave and its harmonics at the trailing end of the frame on the basis of the phase at the leading end of the frame, which has been set to the predetermined value; and a step of correcting the predicted phase at the trailing end of the frame by adding a correction term that depends on the frequency of the fundamental wave and its harmonics.

[0012] In the speech synthesis method according to the present invention, the correcting step generates random numbers whose variance differs according to the frequency of the fundamental wave and its harmonics, and performs the correction by adding the generated values to the predicted phase at the trailing end of the frame.

[0013]

[Operation] The speech synthesis method according to the present invention sets to a predetermined value the phase, at the leading end of the frame, of the fundamental wave of the pitch obtained by dividing the input speech signal into frames and of its harmonics; predicts the phase of the fundamental wave and its harmonics at the trailing end of the frame on the basis of the phase at the leading end of the frame, which has been set to the predetermined value; and corrects the predicted phase at the trailing end of the frame by adding to it a correction term that depends on the frequency of the fundamental wave and its harmonics.

[0014]

[Embodiments] A specific example in which the speech synthesis method according to the present invention is applied to an analysis-synthesis coding apparatus for speech signals (a so-called vocoder) is described below with reference to the drawings. This apparatus models the signal on the assumption that voiced and unvoiced regions coexist along the frequency axis at the same time instant (within the same block or frame).

[0015] FIG. 1 shows the overall schematic configuration of an embodiment in which the present invention is applied to the above analysis-synthesis coding apparatus for speech signals. In FIG. 1, the embodiment of the speech synthesis method according to the present invention consists of an analysis unit 10, which analyzes pitch information and the like from the input speech signal, and a synthesis unit 20, which obtains voiced sound and unvoiced sound from the various items of information (pitch information and the like) transmitted from the analysis unit 10 by a transmission unit 2 and then combines the voiced sound and the unvoiced sound.

[0016] The analysis unit 10 has: a block extraction unit 11, which takes out the speech signal input from an input terminal 1 in blocks of a predetermined number of samples (N samples); a pitch information extraction unit 12, which extracts pitch information from the per-block input speech signal supplied by the block extraction unit 11; a data conversion unit 13, which obtains data converted onto the frequency axis from the per-block input speech signal supplied by the block extraction unit 11; a band division unit 14, which divides the frequency-axis data from the data conversion unit 13 into a plurality of bands on the basis of the pitch information from the pitch information extraction unit 12; and an amplitude information and V/UV discrimination information detection unit 15, which obtains, for each band from the band division unit 14, power (amplitude) information and discrimination information indicating whether the band is voiced (V) or unvoiced (UV).

[0017] The synthesis unit 20 receives the pitch information, the V/UV discrimination information and the amplitude information transmitted from the analysis unit 10 by the transmission unit 2. A voiced sound synthesis unit 21 synthesizes the voiced sound and an unvoiced sound synthesis unit 27 synthesizes the unvoiced sound. An adder 28 adds the synthesized voiced and unvoiced sounds, and the resulting synthesized speech signal is taken out from an output terminal 3.

[0018] Each item of information is obtained by processing the data in a block of N samples (for example, 256 samples). Because the block advances along the time axis in units of a frame of L samples, the data to be transmitted are obtained frame by frame. That is, the pitch information, the V/UV discrimination information and the amplitude information are updated at the frame period.

[0019] The voiced sound synthesis unit 21 has: a phase prediction unit 22, which predicts the frame end phase (the phase at the leading end of the next synthesis frame) from the pitch information and the frame initial phase supplied from an input terminal 4; a phase correction unit 24, which corrects the prediction from the phase prediction unit 22 using a correction term from a noise addition unit 23, to which the pitch information and the V/UV discrimination information are supplied; a sine wave generation unit 25, which reads out and outputs a sine wave from a sine wave ROM (not shown) on the basis of the corrected phase information from the phase correction unit 24; and an amplitude amplification unit 26, which is supplied with the amplitude information and amplifies the amplitude of the sine wave from the sine wave generation unit 25.

[0020] The unvoiced sound synthesis unit 27 is supplied with the pitch information, the V/UV discrimination information and the amplitude information, and synthesizes an unvoiced waveform on the time axis, for example by filtering white noise with a band-pass filter (not shown).

[0021] The adder 28 adds the voiced and unvoiced signals synthesized by the voiced sound synthesis unit 21 and the unvoiced sound synthesis unit 27 at a suitable fixed mixing ratio. The summed signal is output from the output terminal 3 as the speech signal.

[0022] In the phase prediction unit 22 within the voiced sound synthesis unit 21 of the synthesis unit 20, let $\psi_{0m}$ be the phase of the m-th harmonic at time 0 (the start of the frame), that is, the frame initial phase. The phase $\psi_{Lm}$ at the end of the frame is then predicted as

$$\psi_{Lm} = \psi_{0m} + m(\omega_{01} + \omega_{L1})\frac{L}{2} \tag{1}$$

and the phase $\phi_m$ of each band is

$$\phi_m = \psi_{Lm} + \varepsilon_m \tag{2}$$

In equations (1) and (2), $L$ is the frame interval, $\omega_{01}$ is the fundamental angular frequency at the leading end of the synthesis frame ($n = 0$), $\omega_{L1}$ is the fundamental angular frequency at the trailing end of the synthesis frame ($n = L$, the leading end of the next synthesis frame), and $\varepsilon_m$ is the prediction correction term for each band.

[0023] According to equation (1), the phase prediction unit 22 obtains the predicted phase at time $L$ by multiplying the average angular frequency of the m-th harmonic by the time and adding the initial phase of the m-th harmonic. According to equation (2), the phase $\phi_m$ of each band is the predicted phase plus the prediction correction term $\varepsilon_m$.
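The prediction of equation (1) and the correction of equation (2) can be sketched in Python as follows. This is a minimal illustration rather than the apparatus itself: the function names, the `wrap` helper and the numeric values in the note below are assumptions introduced here.

```python
import math


def predict_end_phase(psi_0m: float, m: int, omega_01: float,
                      omega_L1: float, L: int) -> float:
    """Equation (1): initial phase plus the average angular frequency
    of the m-th harmonic, m * (omega_01 + omega_L1) / 2, times L."""
    return psi_0m + m * (omega_01 + omega_L1) * L / 2.0


def corrected_phase(psi_Lm: float, eps_m: float) -> float:
    """Equation (2): predicted end phase plus the correction term."""
    return psi_Lm + eps_m


def wrap(phase: float) -> float:
    """Hypothetical helper: fold a phase into [-pi, pi] for comparison."""
    return math.remainder(phase, 2.0 * math.pi)
```

With a constant pitch period of 80 samples (fundamental angular frequency 2π/80 rad/sample) and L = 160, every harmonic advances by a whole number of cycles over the frame, so the wrapped predicted phase equals the initial phase.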

[0024] The prediction correction term $\varepsilon_m$ is distributed randomly across the bands. Random numbers in general could be used, but this embodiment uses Gaussian noise. As shown in FIG. 2, this Gaussian noise has a variance that grows toward the higher bands (for example, from $\varepsilon_1$ to $\varepsilon_{10}$). It suitably approximates the error between the true phase and the predicted phase.

[0025] If the variance shown in FIG. 2 is assumed to be simply proportional to the band index $m$, the prediction correction term $\varepsilon_m$ is expressed as

$$\varepsilon_m = h_1\,N(0, k_i) \tag{3}$$

where $h_1$ is a constant, $k_i$ is the variance, and 0 is the mean.

[0026] When the whole band is divided into voiced and unvoiced bands, the phases of the frequency components making up the speech become more random as the unvoiced portion grows. The prediction correction term $\varepsilon_m$ can therefore be expressed as

$$\varepsilon_m = h_2\,n_{uj}\,N(0, k_i) \tag{4}$$

where $h_2$ is a constant, $k_i$ is the variance, 0 is the mean, and $n_{uj}$ is the number of unvoiced bands in block $j$.
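Equations (3) and (4) can be sketched as below. This is an illustrative sketch only: the constants `h1`, `h2` and `k_unit` are placeholder values, and the rule `variance = k_unit * m` is one assumed reading of "variance proportional to the band index m".

```python
import math
import random


def gaussian_correction(m, h1=1.0, k_unit=0.01, rng=random):
    """Equation (3): h1 * N(0, k), with the variance k assumed to be
    k_unit * m so that higher bands receive a wider spread."""
    variance = k_unit * m
    return h1 * rng.gauss(0.0, math.sqrt(variance))


def unvoiced_scaled_correction(m, n_uj, h2=0.1, k_unit=0.01, rng=random):
    """Equation (4): the same Gaussian draw, additionally scaled by
    n_uj, the number of unvoiced bands in block j."""
    variance = k_unit * m
    return h2 * n_uj * rng.gauss(0.0, math.sqrt(variance))
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible. When a block has no unvoiced bands, `n_uj = 0` and the term of equation (4) vanishes, which is the situation the following paragraph addresses.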

[0027] In addition, when the distribution across the bands is not disturbed in the way described above, as when a vowel in the input speech continues for a long time, or when the speech moves from a vowel to a consonant or to silence, the prediction correction terms of equations (3) and (4) actually degrade the quality of the synthesized speech. If a delay is permissible, therefore, the level $S$ of the amplitude information (power) one frame ahead, or the decrease in the voiced portion, is examined and the correction term $\varepsilon_m$ is set to

$$\varepsilon_m = h_3 \max(a,\ S_j - S_{j+1})\,N(0, k_i) \tag{5}$$

$$\varepsilon_m = h_4 \max(b,\ n_{vj} - n_{v(j+1)})\,N(0, k_i) \tag{6}$$

where $a$, $b$, $h_3$ and $h_4$ are constants.
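The look-ahead scaling of equations (5) and (6) can be sketched as follows. The default constants are placeholders, and `n_vj` is read here as the number of voiced bands in block j, which the text does not define explicitly; both are assumptions of this sketch.

```python
import math
import random


def power_drop_scale(s_j, s_next, a=0.0, h3=1.0):
    """Scale factor of equation (5): h3 * max(a, S_j - S_{j+1})."""
    return h3 * max(a, s_j - s_next)


def voiced_drop_scale(n_vj, n_v_next, b=0.0, h4=1.0):
    """Scale factor of equation (6): h4 * max(b, n_vj - n_v(j+1))."""
    return h4 * max(b, n_vj - n_v_next)


def lookahead_correction(scale, m, k_unit=0.01, rng=random):
    """Multiply the scale factor by a zero-mean Gaussian draw whose
    variance k_unit * m is an assumed stand-in for k_i."""
    return scale * rng.gauss(0.0, math.sqrt(k_unit * m))
```

With `a = 0`, a sustained vowel whose power does not fall from one frame to the next gives a scale of zero, so no noise is added to the predicted phase; a drop in power or in the voiced band count produces a positive scale.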

[0028] Furthermore, when the pitch obtained by the pitch information extraction unit 12 is low, the number of frequency bands increases and the adverse effect of the phases lining up grows. Taking this into account, the correction term $\varepsilon_m$ is set to

$$\varepsilon_m = f(S_j, h_j)\,N(0, k_i) \tag{7}$$

where $f$ is the frequency.

[0029] As described above, the embodiment in which the present invention is applied to the analysis-synthesis coding apparatus for speech signals makes the noise used to correct the phase prediction Gaussian, so that its magnitude and variance can be controlled.

[0030] A specific example in which the speech synthesis method according to the present invention is applied to an MBE (Multiband Excitation) vocoder, one type of analysis-synthesis coding apparatus for speech signals (a so-called vocoder), is described below with reference to the drawings. The MBE vocoder is disclosed in D. W. Griffin and J. S. Lim, "Multiband Excitation Vocoder," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 36, no. 8, pp. 1223-1235, Aug. 1988. Conventional vocoders such as the PARCOR (PARtial auto-CORrelation) vocoder switch between voiced and unvoiced sections block by block or frame by frame when modeling speech, whereas the MBE vocoder models the speech on the assumption that voiced and unvoiced regions coexist along the frequency axis at the same time instant (within the same block or frame).

[0031] FIG. 3 is a block diagram showing the overall schematic configuration of an embodiment in which the present invention is applied to the MBE vocoder. In FIG. 3, a speech signal is supplied to an input terminal 101. The input speech signal is sent to a filter 102 such as an HPF (high-pass filter), which removes the so-called DC offset and at least the low-frequency components (200 Hz and below) for band limiting (for example, to 200 to 3400 Hz). The signal obtained through the filter 102 is sent to a pitch extraction unit 103 and to a windowing unit 104. In the pitch extraction unit 103, the input speech signal data are divided into blocks of a predetermined number of samples N (for example, N = 256), that is, cut out with a rectangular window, and pitch extraction is performed on the speech signal in each block. The cut-out block (256 samples) is moved along the time axis at a frame interval of L samples (for example, L = 160), as shown in A of FIG. 4, so that adjacent blocks overlap by N - L samples (for example, 96 samples). In the windowing unit 104, a predetermined window function such as a Hamming window is applied to each block of N samples, and the windowed block is likewise moved along the time axis at intervals of one frame of L samples.

[0032] This windowing is expressed as

$$x_w(k, q) = x(q)\,w(kL - q) \tag{8}$$

In equation (8), $k$ is the block number and $q$ is the time index (sample number) of the data. The equation states that the $q$-th sample $x(q)$ of the unprocessed input signal is windowed by the window function $w(kL - q)$ of the $k$-th block to give the data $x_w(k, q)$. For the rectangular window used in the pitch extraction unit 103, shown in A of FIG. 4, the window function $w_r(r)$ is

$$w_r(r) = \begin{cases} 1 & 0 \le r < N \\ 0 & r < 0,\ N \le r \end{cases} \tag{9}$$

For the Hamming window used in the windowing unit 104, shown in B of FIG. 4, the window function $w_h(r)$ is

$$w_h(r) = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi r}{N - 1}\right) & 0 \le r < N \\ 0 & r < 0,\ N \le r \end{cases} \tag{10}$$

When $w_r(r)$ or $w_h(r)$ is used, the non-zero interval of the window function $w(r)$ ($= w(kL - q)$) of equation (8) is

$$0 \le kL - q < N$$

which can be rewritten as

$$kL - N < q \le kL$$

For the rectangular window, for example, $w_r(kL - q) = 1$ when $kL - N < q \le kL$, as shown in FIG. 5. Equations (8) to (10) also show that a window of length N (= 256) samples advances L (= 160) samples at a time. In what follows, the sequences of N non-zero samples ($0 \le r < N$) cut out by the window functions of equations (9) and (10) are denoted $x_{wr}(k, r)$ and $x_{wh}(k, r)$, respectively.
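The block extraction of equations (8) to (10) can be sketched as below. The sketch follows equation (8) literally, so the index r = kL - q runs backwards in time from sample kL. The function names and the zero fill for samples outside the input are assumptions of this illustration.

```python
import math

N = 256  # block length in samples
L = 160  # frame interval (hop) in samples


def rectangular(r, n=N):
    """Equation (9): 1 inside 0 <= r < n, 0 elsewhere."""
    return 1.0 if 0 <= r < n else 0.0


def hamming(r, n=N):
    """Equation (10): Hamming window inside 0 <= r < n, 0 elsewhere."""
    if 0 <= r < n:
        return 0.54 - 0.46 * math.cos(2.0 * math.pi * r / (n - 1))
    return 0.0


def windowed_block(x, k, window=hamming, n=N, hop=L):
    """Return x_w(k, r) for r = 0..n-1, where r = k*hop - q (equation 8).
    Samples outside the input are treated as zero (an assumption)."""
    block = []
    for r in range(n):
        q = k * hop - r
        sample = x[q] if 0 <= q < len(x) else 0.0
        block.append(sample * window(r, n))
    return block
```

For k = 2 the block covers samples q = 65 to 320 and for k = 3 it covers q = 225 to 480, so consecutive blocks share N - L = 96 samples, as stated in the text.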

[0033] In the windowing unit 104, as shown in FIG. 6, 1792 zero-valued samples are appended (so-called zero padding) to the 256-sample sequence $x_{wh}(k, r)$ of one block windowed with the Hamming window of equation (10), giving 2048 samples. An orthogonal transform unit 105 applies an orthogonal transform such as an FFT (fast Fourier transform) to this 2048-sample time-axis data sequence.
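The zero padding and transform step can be sketched as follows. The recursive radix-2 FFT is a plain-Python stand-in chosen so that the example has no dependencies; it is not meant to represent how the orthogonal transform unit 105 is realized, and a practical implementation would use an optimized FFT routine.

```python
import cmath
import math


def fft(a):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even = fft(a[0::2])
    odd = fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def zero_padded_spectrum(block, fft_size=2048):
    """Append zeros up to fft_size samples (1792 zeros for a 256-sample
    block) and return the complex spectrum."""
    padded = list(block) + [0.0] * (fft_size - len(block))
    return fft(padded)
```

Padding a 256-sample block to 2048 points does not add information, but it samples the spectrum eight times more densely, which is what lets the later fine pitch search compare band shapes at fractional pitch values.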

[0034] In the pitch extraction unit 103, pitch extraction is performed on the sample sequence $x_{wr}(k, r)$ (one block of N samples). Known pitch extraction methods use the periodicity of the time waveform, the periodic frequency structure of the spectrum, or the autocorrelation function; this embodiment adopts the autocorrelation method applied to a center-clipped waveform. A single clip level may be set for each block, but here the peak level and the like of the signal in each part (sub-block) obtained by subdividing the block is detected, and when the peak levels of the sub-blocks differ greatly, the clip level is varied stepwise or continuously within the block. The pitch period is determined from the peak positions of the autocorrelation data of the center-clipped waveform. Specifically, several peaks are found in the autocorrelation data belonging to the current frame (the autocorrelation is computed over one block of N samples). When the largest of these peaks is at or above a predetermined threshold, the position of that peak is taken as the pitch period. Otherwise, a peak is sought within a pitch range that satisfies a predetermined relationship with the pitch obtained in a frame other than the current one, such as the preceding or following frame, for example within ±20% of the pitch of the preceding frame, and the pitch of the current frame is determined from the position of that peak. The pitch extraction unit 103 thus performs a relatively rough, open-loop pitch search. The extracted pitch data are sent to a high-precision (fine) pitch search unit 106, where a high-precision, closed-loop pitch search (fine pitch search) is performed.
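The rough pitch search can be sketched as below. This is a simplified illustration under stated assumptions: it uses a single clip level per block (the text allows this but prefers a level that varies across sub-blocks), takes the largest autocorrelation value over a lag range instead of detecting individual peaks, and the clip ratio, threshold and lag limits are placeholder values.

```python
def center_clip(block, ratio=0.6):
    """Zero everything within +/- level and shift the rest toward zero."""
    level = ratio * max(abs(s) for s in block)
    clipped = []
    for s in block:
        if s > level:
            clipped.append(s - level)
        elif s < -level:
            clipped.append(s + level)
        else:
            clipped.append(0.0)
    return clipped


def autocorr(x, lag):
    """Unnormalized autocorrelation of x at the given lag."""
    return sum(x[i] * x[i + lag] for i in range(len(x) - lag))


def coarse_pitch(block, prev_pitch=None, threshold=0.3,
                 min_lag=20, max_lag=147):
    """Return an integer pitch period in samples, or prev_pitch when
    the clipped block is silent."""
    clipped = center_clip(block)
    energy = autocorr(clipped, 0)
    if energy == 0.0:
        return prev_pitch
    lags = range(min_lag, max_lag + 1)
    best = max(lags, key=lambda lag: autocorr(clipped, lag))
    if prev_pitch is None or autocorr(clipped, best) / energy >= threshold:
        return best
    # Weak peak: fall back to lags within +/-20% of the previous pitch.
    near = [lag for lag in lags if 0.8 * prev_pitch <= lag <= 1.2 * prev_pitch]
    if not near:
        return best
    return max(near, key=lambda lag: autocorr(clipped, lag))
```

The integer result plays the role of the rough pitch that is handed to the fine pitch search.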

[0035] The high-precision (fine) pitch search unit 106 is supplied with the rough pitch data, an integer value, extracted by the pitch extraction unit 103 and with the frequency-axis data produced by the orthogonal transform unit 105, for example by FFT. The high-precision pitch search unit 106 varies the pitch by a few samples above and below the rough pitch value in steps of 0.2 to 0.5 and converges on the optimum fine pitch value, which has a fractional part (floating point). The fine search uses the so-called analysis-by-synthesis method: the pitch is chosen so that the synthesized power spectrum comes closest to the power spectrum of the original sound.

[0036] The fine pitch search is now described. In the MBE vocoder, the spectral data $S(j)$ on the frequency axis obtained by the orthogonal transform such as the FFT is assumed to follow the model

$$S(j) = H(j)\,|E(j)|, \qquad 0 < j < J \tag{11}$$

Here $J$ corresponds to $\omega_s / 4\pi$; when the sampling frequency $f_s = \omega_s / 2\pi$ is 8 kHz, for example, $J$ corresponds to 4 kHz. In equation (11), when the spectral data $S(j)$ on the frequency axis has the waveform shown in A of FIG. 7, $H(j)$ represents the spectral envelope of the original spectral data $S(j)$, as shown in B of FIG. 7, and $E(j)$ represents the spectrum of a periodic excitation signal of equal level, as shown in C of FIG. 7. In other words, the FFT spectrum $S(j)$ is modeled as the product of the spectral envelope $H(j)$ and the power spectrum $|E(j)|$ of the excitation signal.

[0037] The power spectrum $|E(j)|$ of the excitation signal is formed by taking into account the periodicity (pitch structure) of the frequency-axis waveform, which is determined by the pitch, and arranging a spectral waveform corresponding to one band so that it repeats in every band on the frequency axis. The one-band waveform can be formed, for example, by regarding the 256-sample Hamming window function with 1792 zero samples appended (zero padded), as shown in FIG. 6, as a time-axis signal, applying an FFT to it, and cutting out the resulting impulse waveform, which has a certain bandwidth on the frequency axis, according to the pitch.

[0038] Next, for each band divided according to the pitch, a value (a kind of amplitude) $|A_m|$ that represents $H(j)$, that is, one that minimizes the error in that band, is obtained. If the lower and upper limit points of the m-th band (the band of the m-th harmonic) are $a_m$ and $b_m$, respectively, the error $\varepsilon_m$ of the m-th band can be expressed as

[0039]

[Equation 1]

[0040] This is equation (12). The value of $|A_m|$ that minimizes this error $\varepsilon_m$ is

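[0041]

[Equation 2]

[0042] This is equation (13); with this $|A_m|$ the error $\varepsilon_m$ is minimized. This amplitude $|A_m|$ is obtained for each band, and each amplitude $|A_m|$ is used to obtain the per-band error $\varepsilon_m$ defined by equation (12). The per-band errors $\varepsilon_m$ are then summed over all bands to give the total $\Sigma\varepsilon_m$. This total error $\Sigma\varepsilon_m$ is computed for several slightly different pitches, and the pitch that minimizes $\Sigma\varepsilon_m$ is found.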
Minimize the error ε _m . Such an amplitude | A _m | is obtained for each band, and the obtained amplitude | A _m | is used to obtain the error ε m for each band defined by the above equation (12). Next, the sum total value Σε _m of all the bands of the error ε _m for each band is obtained. Further, such an error sum value Σε _m of all bands is obtained for some slightly different pitches, and a pitch that minimizes the error sum value Σε _m is obtained.

【００４３】That is, several candidate pitches are prepared above and below the rough pitch obtained by the pitch extraction unit 103, in steps of, for example, 0.25. The error sum Σε_m is obtained for each of these slightly different pitches. Once a pitch is fixed, the bandwidth is fixed, so the error ε_m of equation (12) can be computed from equation (13) using the power spectrum |S(j)| of the frequency-axis data and the excitation signal spectrum |E(j)|, and the sum Σε_m over all bands follows. This error sum is computed for every candidate pitch, and the pitch giving the smallest error sum is chosen as the optimum pitch. In this way the high-precision pitch search unit 106 finds the optimum fine pitch (for example, in steps of 0.25), and the amplitude |A_m| corresponding to this optimum pitch is determined.
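The fine search described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the squared-error form of equation (12) and the least-squares amplitude of equation (13), and the callables `excitation_for` and `bands_for` (which return the excitation spectrum and the band edges for a candidate pitch) are hypothetical helpers introduced only for this example.

```python
import numpy as np

def band_fit(S, E, a, b):
    """Least-squares amplitude and squared error over bins a..b (inclusive)."""
    s, e = S[a:b + 1], E[a:b + 1]
    energy = float(np.dot(e, e))
    amp = float(np.dot(s, e)) / energy if energy > 0.0 else 0.0
    err = float(np.sum((s - amp * e) ** 2))
    return amp, err

def fine_pitch_search(S, rough_pitch, excitation_for, bands_for,
                      step=0.25, n_each_side=3):
    """Try candidates around rough_pitch; return (pitch, amplitudes, error sum)."""
    best = None
    for k in range(-n_each_side, n_each_side + 1):
        pitch = rough_pitch + k * step
        E = excitation_for(pitch)           # assumed helper: |E(j)| for this pitch
        fits = [band_fit(S, E, a, b) for a, b in bands_for(pitch)]
        total = sum(err for _, err in fits)  # error sum over all bands
        if best is None or total < best[2]:
            best = (pitch, [amp for amp, _ in fits], total)
    return best
```

The number of candidates on each side (`n_each_side`) is an arbitrary choice here; the text only says "several".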

【００４４】In the above description of the fine pitch search, all bands were assumed to be voiced in order to simplify the explanation. As described above, however, the MBE vocoder adopts a model in which unvoiced regions can exist on the frequency axis at the same instant, so a voiced/unvoiced decision must be made for each band.

【００４５】The optimum pitch and the amplitude |A_m| data from the high-precision pitch search unit 106 are sent to the voiced/unvoiced discrimination unit 107, where a voiced/unvoiced decision is made for each band. The NSR (noise-to-signal ratio) is used for this decision. That is, the NSR of the m-th band is

【００４６】[0046]

【数３】 [Equation 3]

【００４７】When this NSR value is larger than a predetermined threshold (for example, 0.3), i.e., when the error is large, the approximation of |S(j)| by |A_m||E(j)| in that band is judged to be poor (the excitation signal |E(j)| is unsuitable as a basis), and the band is classified as UV (unvoiced). Otherwise the approximation is judged to be reasonably good, and the band is classified as V (voiced).
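The per-band decision can be illustrated with a short sketch. The image for the NSR equation is missing from this text, so the form used here (band fitting error divided by band signal energy) is an assumption consistent with the name "noise-to-signal ratio"; the function names are hypothetical. Only the threshold value 0.3 and the rule "NSR above threshold means UV" come from the description.

```python
import numpy as np

def band_nsr(S, E, a, b, amp):
    """Assumed NSR: squared fitting error over signal energy for bins a..b."""
    s, e = S[a:b + 1], E[a:b + 1]
    signal = float(np.sum(s ** 2))
    if signal == 0.0:
        # Empty band: treated as unvoiced (a choice made for this sketch only).
        return float("inf")
    return float(np.sum((s - amp * e) ** 2)) / signal

def classify_band(nsr, threshold=0.3):
    """Return 'UV' when the NSR exceeds the threshold, otherwise 'V'."""
    return "UV" if nsr > threshold else "V"
```

A band whose spectrum is an exact multiple of the excitation gives an NSR of 0 and is classified V; a band the excitation cannot explain at all gives an NSR of 1 and is classified UV.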

【００４８】Next, the amplitude re-evaluation unit 108 is supplied with the frequency-axis data from the orthogonal transform unit 105, the fine pitch and the evaluated amplitude |A_m| from the high-precision pitch search unit 106, and the V/UV (voiced/unvoiced) discrimination data from the voiced/unvoiced discrimination unit 107. The amplitude re-evaluation unit 108 recomputes the amplitude for the bands that the voiced/unvoiced discrimination unit 107 classified as unvoiced (UV). The amplitude |A_m|_UV for such a UV band is given by

【００４９】[0049]

【数４】 [Equation 4]

【００５０】That is, the UV-band amplitude is obtained from the above equation.
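The image for the UV amplitude equation is not reproduced in this text. As a hedged illustration only, the sketch below assumes the root-mean-square of |S(j)| over the band, a form often used for unvoiced bands in MBE-style coders; the patent's actual equation may differ, and the function name is hypothetical.

```python
import numpy as np

def uv_band_amplitude(S, a, b):
    """Assumed UV amplitude: RMS of the spectrum over bins a..b (inclusive)."""
    s = np.asarray(S[a:b + 1], dtype=float)
    return float(np.sqrt(np.mean(s ** 2)))
```

Unlike the voiced fit, this estimate does not involve the excitation spectrum |E(j)|, which matches the idea that the excitation is a poor basis for a UV band.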

【００５１】The data from the amplitude re-evaluation unit 108 is sent to the data number conversion unit 109 (a kind of sampling rate conversion). Because the number of bands into which the frequency axis is divided depends on the pitch, the number of data items (in particular, the number of amplitude data) also varies; the data number conversion unit 109 converts it to a constant number. For example, if the effective band extends to 3400 Hz, it is divided into 8 to 63 bands depending on the pitch, so the number m_MX + 1 of amplitude data |A_m| obtained for these bands (including the UV-band amplitudes |A_m|_UV) also varies from 8 to 63. The data number conversion unit 109 therefore converts this variable number m_MX + 1 of amplitude data into a fixed number N_C (for example, 44) of data.

【００５２】In this embodiment, dummy data that interpolates from the last value in the block to the first value in the block is appended to the amplitude data for one block of the effective band on the frequency axis, expanding the number of data to N_F. Band-limited K_OS-times (for example, 8-times) oversampling is then applied to obtain K_OS times as many amplitude data. These (m_MX + 1) × K_OS amplitude data are linearly interpolated to expand them to a larger number N_M (for example, 2048), and the N_M data are decimated to obtain the fixed number N_C (for example, 44) of data.
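The conversion chain above can be sketched as follows. This is an illustration under stated assumptions: the value N_F = 64 is not given in the text and is chosen here only so that up to 63 amplitudes fit, the band-limited oversampling is done by zero-padding the FFT (one common way to do it, not necessarily the patent's), and the decimation simply picks evenly spaced samples. The function name is hypothetical.

```python
import numpy as np

def convert_data_count(amps, n_f=64, k_os=8, n_m=2048, n_c=44):
    """Map a variable number of band amplitudes to a fixed number n_c."""
    amps = np.asarray(amps, dtype=float)
    n = len(amps)
    # 1. Dummy data ramping from the last value back to the first, so that the
    #    block of n_f values is smooth when treated as periodic.
    ramp = np.linspace(amps[-1], amps[0], n_f - n + 2)[1:-1]
    block = np.concatenate([amps, ramp])
    # 2. Band-limited k_os-times oversampling via a zero-padded FFT.
    spec = np.fft.rfft(block)
    spec[-1] *= 0.5  # split the Nyquist bin so the original samples are kept
    over = np.fft.irfft(spec, n=n_f * k_os) * k_os
    over = over[:n * k_os]  # keep the (m_MX + 1) * K_OS useful samples
    # 3. Linear interpolation up to n_m points.
    x_over = np.arange(n * k_os) / k_os
    dense = np.interp(np.linspace(0.0, n - 1, n_m), x_over, over)
    # 4. Decimation down to n_c evenly spaced points.
    idx = np.round(np.linspace(0, n_m - 1, n_c)).astype(int)
    return dense[idx]
```

With this layout the first and last output values coincide with the first and last input amplitudes, and the decoder can run the same chain in reverse once it knows the pitch and hence the original band count.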

【００５３】The data from the data number conversion unit 109 (the fixed number N_C of amplitude data) is sent to the vector quantization unit 110, where it is grouped into vectors of a predetermined number of data and vector-quantized. The quantized output data from the vector quantization unit 110 is taken out via the output terminal 111. The high-precision (fine) pitch data from the high-precision pitch search unit 106 is encoded by the pitch coding unit 115 and taken out via the output terminal 112. The voiced/unvoiced (V/UV) discrimination data from the voiced/unvoiced discrimination unit 107 is taken out via the output terminal 113. The data from the output terminals 111 to 113 is converted into a signal of a predetermined transmission format and transmitted.

【００５４】Each of these data items is obtained by processing the data within a block of N samples (for example, 256 samples). Because the block advances along the time axis in units of the frame of L samples, the data to be transmitted is obtained once per frame. That is, the pitch data, V/UV discrimination data, and amplitude data are updated at the frame period.
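The block and frame relationship can be shown with a small sketch: each analysis block has N samples and successive blocks start L samples apart, so they overlap by N − L samples. The values N = 256 and L = 160 are the examples given in the description; the generator name is hypothetical.

```python
import numpy as np

def analysis_blocks(signal, n=256, l=160):
    """Yield (start_index, block) pairs: n-sample blocks advancing by l samples."""
    signal = np.asarray(signal)
    for start in range(0, len(signal) - n + 1, l):
        yield start, signal[start:start + n]
```

One set of pitch, V/UV, and amplitude parameters would be produced per yielded block, i.e., once every L samples.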

【００５５】Next, the schematic configuration of the synthesis side (decoder side), which synthesizes a speech signal from the transmitted data, is described with reference to FIG. 8. In FIG. 8, the vector-quantized amplitude data is supplied to the input terminal 121, the encoded pitch data to the input terminal 122, and the V/UV discrimination data to the input terminal 123. The quantized amplitude data from the input terminal 121 is sent to the inverse vector quantization unit 124 and dequantized, then sent to the data number inverse conversion unit 125 and inversely converted; the resulting amplitude data is sent to the voiced sound synthesis unit 126 and the unvoiced sound synthesis unit 127. The encoded pitch data from the input terminal 122 is decoded by the pitch decoding unit 128 and sent to the data number inverse conversion unit 125, the voiced sound synthesis unit 126, and the unvoiced sound synthesis unit 127. The V/UV discrimination data from the input terminal 123 is sent to the voiced sound synthesis unit 126 and the unvoiced sound synthesis unit 127.

【００５６】The voiced sound synthesis unit 126 synthesizes a voiced waveform on the time axis by, for example, cosine wave synthesis, and the unvoiced sound synthesis unit 127 synthesizes an unvoiced waveform on the time axis by, for example, filtering white noise with a band-pass filter. The adder 129 adds the voiced and unvoiced synthesized waveforms, and the sum is taken out from the output terminal 130. The amplitude data, pitch data, and V/UV discrimination data are updated once per analysis frame (L samples, for example 160 samples). To improve continuity between frames (to smooth the transition), each value of the amplitude data and pitch data is treated as the value at, for example, the center of a frame, and the values up to the center of the next frame (one synthesis frame) are obtained by interpolation. That is, in one synthesis frame (for example, from the center of one analysis frame to the center of the next), the data values at the leading sample point and at the trailing sample point (the leading point of the next synthesis frame) are given, and the values between these sample points are obtained by interpolation.

【００５７】The synthesis process in the voiced sound synthesis unit 126 is described in detail below. Let V_m(n) be the voiced sound for one synthesis frame (L samples, for example 160 samples) on the time axis in the m-th band (the band of the m-th harmonic) classified as V (voiced). Using the time index (sample number) n within this synthesis frame, it can be expressed as

$$V_m(n) = A_m(n)\cos(\theta_m(n)), \qquad 0 \le n < L \tag{16}$$

The voiced sounds of all bands classified as V (voiced) are added (ΣV_m(n)) to synthesize the final voiced sound V(n).

【００５８】A_m(n) in equation (16) is the amplitude of the m-th harmonic interpolated from the leading end to the trailing end of the synthesis frame. Most simply, the value of the m-th harmonic of the amplitude data, which is updated frame by frame, can be linearly interpolated. That is, when the amplitude of the m-th harmonic at the leading end (n = 0) of the synthesis frame is A_0m and its amplitude at the trailing end (n = L, the leading end of the next synthesis frame) is A_Lm, A_m(n) can be calculated by

$$A_m(n) = \frac{(L-n)A_{0m}}{L} + \frac{nA_{Lm}}{L} \tag{17}$$
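The linear interpolation of equation (17) can be written directly. This is a minimal sketch; the function and argument names are hypothetical, with `a_start` and `a_end` standing for A_0m and A_Lm.

```python
import numpy as np

def interpolate_amplitude(a_start, a_end, L=160):
    """Equation (17): amplitude of one harmonic for n = 0 .. L-1."""
    n = np.arange(L)
    return ((L - n) * a_start + n * a_end) / L
```

For example, with `a_start=1.0`, `a_end=2.0`, and `L=4`, the function returns 1.0, 1.25, 1.5, 1.75; the end value 2.0 itself belongs to sample 0 of the next synthesis frame.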

【００５９】Next, the phase θ_m(n) in equation (16) can be obtained from

$$\theta_m(n) = m\omega_{01}n + \frac{n^2 m(\omega_{L1} - \omega_{01})}{2L} + \phi_{0m} + \Delta\omega\, n \tag{18}$$

In equation (18), φ_0m is the phase of the m-th harmonic at the leading end (n = 0) of the synthesis frame (the frame initial phase), ω_01 is the fundamental angular frequency at the leading end (n = 0) of the synthesis frame, and ω_L1 is the fundamental angular frequency at the trailing end of the synthesis frame (n = L, the leading end of the next synthesis frame). Δω in equation (18) is set to the smallest value for which the phase φ_Lm at n = L equals θ_m(L).
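Equation (18) and the choice of Δω can be sketched as below. The sketch reads "smallest Δω" as the value of smallest magnitude that makes θ_m(L) agree with the target phase modulo 2π; that interpretation, and all function names, are assumptions of this example.

```python
import numpy as np

def wrap_phase(x):
    """Wrap an angle to the interval [-pi, pi)."""
    return (x + np.pi) % (2.0 * np.pi) - np.pi

def phase_track(m, w_start, w_end, phi_start, phi_end, L=160):
    """Equation (18) for n = 0 .. L-1; also returns the chosen delta-omega."""
    n = np.arange(L)
    # Phase reached at n = L when delta-omega is zero.
    coast = phi_start + m * (w_start + w_end) * L / 2.0
    d_omega = wrap_phase(phi_end - coast) / L
    theta = (m * w_start * n
             + n ** 2 * m * (w_end - w_start) / (2.0 * L)
             + phi_start + d_omega * n)
    return theta, d_omega

def voiced_band(amplitude, theta):
    """Equation (16): one voiced band from its amplitude and phase tracks."""
    return amplitude * np.cos(theta)
```

Because the quadratic term makes the instantaneous frequency move linearly from mω_01 to mω_L1, the phase at n = L without correction is φ_0m + m(ω_01 + ω_L1)L/2, and Δω only has to absorb the remaining mismatch.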

【００６０】In contrast, in the embodiment of the present invention, φ_0m + Δωn of equation (18) is not sent to the synthesis side; instead, the phase is predicted on the synthesis side. That is, as shown in equation (1), the phase prediction unit 22 predicts the phase ψ_Lm at the end of the frame by adding m(ω_01 + ω_L1)L/2 to the phase ψ_0m of the m-th harmonic at time 0 (the start of the frame), i.e., the frame initial phase. The phase φ_m of each band is then expressed as the predicted phase ψ_Lm plus ε_m, where ε_m is a prediction correction term for each band. In the present invention, Gaussian noise is used for this prediction correction term ε_m.
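The prediction and correction step can be sketched as follows. The prediction formula is the one stated above; the mapping from harmonic number to standard deviation (`sigma_for`) is a hypothetical placeholder, since the description says only that the correction term is Gaussian and (in the claims) that its variance depends on frequency.

```python
import numpy as np

def predict_end_phase(m, psi_start, w_start, w_end, L=160):
    """Predicted end-of-frame phase: psi_0m + m * (w_01 + w_L1) * L / 2."""
    return psi_start + m * (w_start + w_end) * L / 2.0

def corrected_phase(m, psi_start, w_start, w_end, rng, sigma_for, L=160):
    """Predicted phase plus a Gaussian correction term eps_m."""
    psi_end = predict_end_phase(m, psi_start, w_start, w_end, L)
    eps = rng.normal(0.0, sigma_for(m))  # sigma_for is an assumed schedule
    return psi_end + eps
```

A caller might pass `rng = np.random.default_rng()` and, purely as an example, `sigma_for = lambda m: 0.1 * m` to give higher harmonics a larger phase spread.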

【００６１】FIG. 9A shows an example of the spectrum of a speech signal, in which the bands with band number (harmonic number) m = 8, 9, and 10 are UV (unvoiced) and the other bands are V (voiced). The time-axis signals of the V (voiced) bands are synthesized by the voiced sound synthesis unit 126, and the time-axis signals of the UV (unvoiced) bands are synthesized by the unvoiced sound synthesis unit 127.

【００６２】The unvoiced sound synthesis process in the unvoiced sound synthesis unit 127 is described below. The white noise signal waveform on the time axis from the white noise generator 131 is windowed with a suitable window function (for example, a Hamming window) of a predetermined length (for example, 256 samples) and subjected to STFT (short-time Fourier transform) processing by the STFT processing unit 132, giving the power spectrum of the white noise on the frequency axis as shown in FIG. 9B. The power spectrum from the STFT processing unit 132 is sent to the band amplitude processing unit 133, where, as shown in FIG. 9C, the bands classified as UV (for example, m = 8, 9, 10) are multiplied by the amplitude |A_m|_UV and the amplitudes of the other bands, classified as V, are set to 0. The band amplitude processing unit 133 is supplied with the amplitude data, pitch data, and V/UV discrimination data. The output of the band amplitude processing unit 133 is sent to the ISTFT processing unit 134, which converts it into a time-axis signal by inverse STFT processing using the phase of the original white noise. The output of the ISTFT processing unit 134 is sent to the overlap-add unit 135, which repeatedly overlaps and adds the blocks with suitable weighting on the time axis (so that the original continuous noise waveform can be restored) to synthesize a continuous time-axis waveform. The output signal of the overlap-add unit 135 is sent to the adder 129.
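The unvoiced path can be sketched as below. This is a simplified illustration: the band gains are applied to the complex noise spectrum (which keeps the phase of the original white noise), the overlap-add uses a plain sum rather than any particular weighting, and the band edges are passed in as bin index pairs. All names are hypothetical.

```python
import numpy as np

def unvoiced_block(noise, bands, is_uv, uv_amps):
    """Shape one block of white noise: scale UV bands, zero V bands."""
    n = len(noise)
    spec = np.fft.rfft(noise * np.hamming(n))   # windowed STFT of the noise
    gain = np.zeros(len(spec))
    for (a, b), uv, amp in zip(bands, is_uv, uv_amps):
        if uv:
            gain[a:b + 1] = amp                 # UV band: multiply by |A_m|_UV
    # A real-valued gain leaves the phase of the original noise untouched.
    return np.fft.irfft(spec * gain, n=n)

def overlap_add(blocks, hop):
    """Sum successive blocks, each shifted by hop samples."""
    n = len(blocks[0])
    out = np.zeros(hop * (len(blocks) - 1) + n)
    for i, blk in enumerate(blocks):
        out[i * hop:i * hop + n] += blk
    return out
```

In a full decoder the hop would equal the frame length L and the weighting would be chosen so that overlapping windows sum to a constant; both are left out here for brevity.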

【００６３】The voiced and unvoiced signals synthesized by the synthesis units 126 and 127 and returned to the time axis are added by the adder 129 at a suitable fixed mixing ratio, and the reproduced speech signal is taken out from the output terminal 130.

【００６４】Therefore, in the specific example in which the speech synthesis method according to the present invention is applied to MBE, making the noise used for phase prediction Gaussian allows its magnitude and variance to be controlled.

【００６５】Although the configuration of the speech analysis side (encoder side) in FIG. 3 and the configuration of the speech synthesis side (decoder side) in FIG. 8 have been described in terms of hardware units, they can also be realized by a software program using a so-called DSP (digital signal processor) or the like.

【００６６】[0066]

【発明の効果】[Effects of the Invention] In the speech synthesis method according to the present invention, the phases at the frame leading end of the fundamental wave and its harmonics, for the pitch obtained by dividing the input speech signal into frames, are set to a predetermined value; the phases of the fundamental wave and its harmonics at the frame trailing end are predicted from the leading-end phase set to the predetermined value; and the predicted trailing-end phase is corrected by adding a correction term that depends on the frequencies of the fundamental wave and its harmonics. This makes it possible to control the magnitude and variance of the noise and to improve sound quality. Further, by using the signal level of the speech and its temporal change, the accumulation of errors is prevented, and degradation of sound quality in vowels or at the transition from a vowel to a consonant can be prevented.

[Brief description of drawings]

【図１】FIG. 1 is a functional block diagram of a specific example in which the speech synthesis method according to the present invention is applied to a so-called vocoder.

【図２】FIG. 2 is a characteristic diagram for explaining the Gaussian noise used in the speech synthesis method according to the present invention.

【図３】FIG. 3 is a functional block diagram showing the schematic configuration of the analysis side (encoder side) of a speech signal analysis/synthesis coding apparatus as a specific example of an apparatus to which the speech synthesis method according to the present invention is applied.

【図４】FIG. 4 is a diagram for explaining the windowing process.

【図５】FIG. 5 is a diagram for explaining the relationship between the windowing process and the window function.

【図６】FIG. 6 is a diagram showing the time-axis data subjected to orthogonal transform (FFT) processing.

【図７】FIG. 7 is a diagram showing the spectrum data on the frequency axis, the spectral envelope, and the power spectrum of the excitation signal.

【図８】FIG. 8 is a functional block diagram showing the schematic configuration of the synthesis side (decoder side) of a speech signal analysis/synthesis coding apparatus as a specific example of an apparatus to which the speech synthesis method according to the present invention is applied.

【図９】FIG. 9 is a diagram for explaining unvoiced sound synthesis when synthesizing a speech signal.

Continuation of the front page
(56) References: JP-A-58-23100 (JP, A); JP-A-64-28700 (JP, A); JP-A-3-139922 (JP, A); JP-A-2-137900 (JP, A); JP-A-61-32094 (JP, A)
(58) Fields investigated (Int. Cl.7, DB name): G10L 19/02; G10H 7/08

Claims

(57) [Claims]

1. A speech synthesis method for synthesizing speech by dividing an input speech signal into frames, obtaining a pitch for each frame, and using a group of signals representing the fundamental wave of the obtained pitch and its harmonics, the method comprising: a step of setting the phases of the fundamental wave and its harmonics at the frame leading end to a predetermined value; a step of predicting the phases of the fundamental wave and its harmonics at the frame trailing end on the basis of the phase at the frame leading end set to the predetermined value; and a step of correcting the predicted phase at the frame trailing end by adding a correction term that depends on the frequencies of the fundamental wave and its harmonics.

2. The speech synthesis method according to claim 1, wherein the correcting step generates random numbers having different variances according to the frequencies of the fundamental wave and its harmonics, and corrects the predicted phase at the frame trailing end by adding the generated random value to it.