JPH05265486A

JPH05265486A - Speech analyzing and synthesizing method

Info

Publication number: JPH05265486A
Application number: JP4092262A
Authority: JP
Inventors: Atsushi Matsumoto; 淳松本; Masayuki Nishiguchi; 正之西口
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1992-03-18
Filing date: 1992-03-18
Publication date: 1993-10-15
Anticipated expiration: 2018-04-21
Also published as: JP3398968B2

Abstract

PURPOSE: To improve the quality of synthesized speech by predicting the end phase of each block from the pitch information of that block of the speech signal transmitted from the analysis side and from the block initial phase, and by correcting the predicted phase with Gaussian noise. CONSTITUTION: An analysis part 10 takes out the input speech in blocks of a specific number of samples and extracts pitch information from the speech signal in each block. The speech signal in each block is converted onto the frequency axis, and power information and voiced/unvoiced decision information are found for each of the bands divided according to the pitch information. A synthesis part 20 synthesizes speech in a voiced-sound synthesis part 21 and an unvoiced-sound synthesis part 27 according to the pitch information and other data sent from the analysis part 10; the two outputs are added in an addition part 28 and output as a synthesized speech signal. A phase prediction part 22 of the voiced-sound synthesis part 21 predicts the block end phase from the pitch information and the block initial phase, and a phase correction part 24 corrects the end phase using Gaussian noise whose variance depends on the band.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech analysis-synthesis method applied to an analysis-synthesis coding apparatus for speech signals.

[0002]

[Prior Art] Human hearing acts as a kind of spectrum analyzer: two sounds with equal power spectra are perceived as the same sound. Speech analysis-synthesis methods exploit this property to obtain synthesized sound.

[0003] To obtain such synthesized sound, the analysis side analyzes the input speech signal, extracts or detects pitch information, voiced/unvoiced decision information, amplitude information and the like, and transmits them to the synthesis side, which artificially generates speech from that information. Synthesis sides can be classified by synthesis method into the recording-and-editing method, the parameter-editing method, the synthesis-by-rule method, and so on.

[0004] In the recording-and-editing method, speech uttered by a person is stored (recorded) in advance in units such as words or phrases, and these units are read out and concatenated (edited) as needed to synthesize speech.

[0005] The parameter-editing method also uses units such as words and phrases, as in the recording-and-editing method, but speech uttered by a person is analyzed in advance on the basis of a speech production model and stored as parameter time series. A speech synthesizer is then driven by parameter time series concatenated as needed to synthesize speech.

[0006] The synthesis-by-rule method is a technique for converting a sequence expressed in discrete symbols, such as characters or phonetic symbols, into a continuous signal. In the course of the conversion, universal properties and artificial properties of speech production are applied as synthesis rules.

[0007] Each of these synthesis methods simulates the vocal tract characteristic in some form and obtains synthesized sound by applying to it a signal with nearly the same spectrum as the source wave.

[0008]

[Problem to be Solved by the Invention] In the speech analysis-synthesis methods described above, the phase on the synthesis side must be matched to the phase on the analysis side. When the synthesis side derives phase information, it may use linear prediction based on the angular frequency together with a correction by white noise. White noise, however, does not allow control of the error (noise) between the true phase and the predicted phase.

[0009] Moreover, the white noise level used as the correction term is varied according to the proportion of unvoiced sound in the whole band. When blocks containing mostly voiced sound occur in succession, only the prediction is applied and no correction is made. As a result, when a strong vowel lasts for a long time, errors accumulate and the sound quality deteriorates.

[0010] An object of the present invention is therefore to provide a speech analysis-synthesis method that improves sound quality by correcting the prediction with noise whose magnitude and variance can be controlled.

[0011]

[Means for Solving the Problem] The speech analysis-synthesis method according to the present invention solves the above problem by comprising the steps of: dividing an input speech signal into blocks and obtaining pitch information within each block; converting the speech signal of each block onto the frequency axis to obtain frequency-axis data; dividing the frequency-axis data into a plurality of bands on the basis of the pitch information; obtaining, for each of the divided bands, power information and decision information indicating voiced or unvoiced sound; transmitting the pitch information, the per-band power information, and the voiced/unvoiced decision information obtained in these steps; predicting the block end phase on the basis of the transmitted pitch information for each block and the block initial phase; and correcting the predicted block end phase using noise having a variance that depends on each band.

[0012] The speech analysis-synthesis method according to the present invention further solves the above problem in that the noise is Gaussian noise.

[0013]

[Operation] In the speech analysis-synthesis method according to the present invention, the analysis side obtains and transmits power information and voiced/unvoiced decision information for each of a plurality of bands. The bands are obtained by dividing the frequency-axis data (the speech signal of each block converted onto the frequency axis) on the basis of pitch information derived from the speech signal of each block. The synthesis side predicts the block end phase from the transmitted pitch information for each block and the block initial phase, and corrects the predicted end phase using Gaussian noise with a band-dependent variance. This makes it possible to control the error between the predicted phase and the true phase.

[0014]

[Embodiment] A specific example in which the speech analysis-synthesis method according to the present invention is applied to an analysis-synthesis coding apparatus for speech signals (a so-called vocoder) is described below with reference to the drawings. This apparatus uses a model in which voiced segments and unvoiced segments coexist in the frequency-axis domain at the same time instant (within the same block or frame).

[0015] FIG. 1 shows the overall schematic configuration of an embodiment in which the present invention is applied to the analysis-synthesis coding apparatus for speech signals. In FIG. 1, the embodiment comprises an analysis unit 10, which analyzes pitch information and the like from an input speech signal, and a synthesis unit 20, which receives the various kinds of information (pitch information and so on) transmitted from the analysis unit 10 by a transmission unit 2, synthesizes voiced sound and unvoiced sound separately, and then combines the two.

[0016] The analysis unit 10 has: a block extraction unit 11 that takes out the speech signal input from an input terminal 1 in blocks of a predetermined number of samples (N samples); a pitch information extraction unit 12 that extracts pitch information from the block-wise input speech signal supplied by the block extraction unit 11; a data conversion unit 13 that obtains data converted onto the frequency axis from the block-wise input speech signal supplied by the block extraction unit 11; a band division unit 14 that divides the frequency-axis data from the data conversion unit 13 into a plurality of bands on the basis of the pitch information from the pitch information extraction unit 12; and an amplitude information and V/UV decision information detection unit 15 that obtains, for each band from the band division unit 14, power (amplitude) information and decision information indicating voiced sound (V) or unvoiced sound (UV).

[0017] The synthesis unit 20 receives the pitch information, V/UV decision information, and amplitude information transmitted from the analysis unit 10 by the transmission unit 2. A voiced sound synthesis unit 21 synthesizes voiced sound and an unvoiced sound synthesis unit 27 synthesizes unvoiced sound; an adder 28 adds the two, and the resulting synthesized speech signal is taken out from an output terminal 3.

[0018] Each of these items of information is obtained by processing the data within a block of N samples (for example, 256 samples). Because the block advances along the time axis in units of a frame of L samples, the data to be transmitted are obtained frame by frame. That is, the pitch information, V/UV decision information, and amplitude information are updated at the frame period.

[0019] The voiced sound synthesis unit 21 has: a phase prediction unit 22 that predicts the frame end phase (the phase at the start of the next synthesis frame) on the basis of the pitch information and the frame initial phase supplied from an input terminal 4; a phase correction unit 24 that corrects the prediction from the phase prediction unit 22 using a correction term from a noise addition unit 23, to which the pitch information and the V/UV decision information are supplied; a sine wave generation unit 25 that reads out and outputs a sine wave from a sine wave ROM (not shown) on the basis of the corrected phase information from the phase correction unit 24; and an amplitude amplification unit 26 that is supplied with the amplitude information and amplifies the sine wave from the sine wave generation unit 25.

[0020] The unvoiced sound synthesis unit 27 is supplied with the pitch information, V/UV decision information, and amplitude information, and synthesizes an unvoiced waveform on the time axis by, for example, filtering white noise with a band-pass filter (not shown).

[0021] The adder 28 adds the voiced and unvoiced signals synthesized by the voiced sound synthesis unit 21 and the unvoiced sound synthesis unit 27 at a suitable fixed mixing ratio. The summed signal is output from the output terminal 3 as the speech signal.

[0022] In the phase prediction unit 22 within the voiced sound synthesis unit 21 of the synthesis unit 20, let $\psi_{0m}$ be the phase of the m-th harmonic at time 0 (the start of the frame), that is, the frame initial phase. The phase $\psi_{Lm}$ at the end of the frame is then predicted as

$$\psi_{Lm} = \psi_{0m} + m(\omega_{01} + \omega_{L1})\frac{L}{2} \tag{1}$$

and the phase $\phi_m$ of each band is

$$\phi_m = \psi_{Lm} + \varepsilon_m \tag{2}$$

In equations (1) and (2), $L$ is the frame interval, $\omega_{01}$ is the fundamental angular frequency at the start of the synthesis frame (n = 0), $\omega_{L1}$ is the fundamental angular frequency at the end of the synthesis frame (n = L, the start of the next synthesis frame), and $\varepsilon_m$ is the prediction correction term for each band.
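
Equations (1) and (2) can be sketched in a few lines of Python. This is only an illustration: the function names, the 8 kHz sampling rate, and the 100 Hz fundamental are assumptions chosen for the example, and angular frequencies are taken in radians per sample.

```python
import math

def predict_end_phase(psi_0m, m, omega_01, omega_L1, frame_len):
    """Equation (1): predicted phase of the m-th harmonic at the frame end."""
    return psi_0m + m * (omega_01 + omega_L1) * frame_len / 2.0

def corrected_phase(psi_Lm, eps_m):
    """Equation (2): predicted phase plus the per-band correction term."""
    return psi_Lm + eps_m

# Assumed example: 100 Hz fundamental at 8 kHz sampling, constant over
# a frame of L = 160 samples (values are illustrative, not from the patent).
omega = 2.0 * math.pi * 100.0 / 8000.0      # radians per sample
psi_L1 = predict_end_phase(0.0, 1, omega, omega, 160)
psi_L3 = predict_end_phase(0.0, 3, omega, omega, 160)
```

With a constant pitch the fundamental completes exactly two cycles in the frame, so the predicted end phase of the first harmonic is 4π radians.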

[0023] According to equation (1), the phase prediction unit 22 obtains the predicted phase at time L by multiplying the average angular frequency of the m-th harmonic, $m(\omega_{01} + \omega_{L1})/2$, by the elapsed time $L$ and adding the initial phase of the m-th harmonic. According to equation (2), the phase $\phi_m$ of each band is the predicted phase plus the prediction correction term $\varepsilon_m$.

[0024] The prediction correction term $\varepsilon_m$ is randomly distributed from band to band, so any random numbers could be used; this embodiment uses Gaussian noise. As shown in FIG. 2, the variance of this Gaussian noise increases toward higher bands (for example, from $\varepsilon_1$ to $\varepsilon_{10}$). This Gaussian noise suitably approximates the error between the true phase and the predicted phase.

[0025] If the variance shown in FIG. 2 is assumed to be simply proportional to the band index m, the prediction correction term $\varepsilon_m$ can be written as

$$\varepsilon_m = h_1 N(0, k_i) \tag{3}$$

where $h_1$ is a constant, $k_i$ is the variance, and 0 is the mean.

[0026] When the whole band is divided into a voiced region and an unvoiced region, the phases of the frequency components making up the speech become more random the larger the unvoiced portion is. The prediction correction term $\varepsilon_m$ can therefore be written as

$$\varepsilon_m = h_2\, n_{uj}\, N(0, k_i) \tag{4}$$

where $h_2$ is a constant, $k_i$ is the variance, 0 is the mean, and $n_{uj}$ is the number of unvoiced bands in block j.
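
The correction terms of equations (3) and (4) can be sketched as follows. The proportionality constant `k_unit`, the default values of `h1` and `h2`, and the function names are assumptions for illustration; the text only states that the variance grows with the band index. Note that `random.gauss` takes a standard deviation, so the square root of the variance is passed.

```python
import math
import random

def band_variance(m, k_unit=0.01):
    """Variance k_i for band m, assumed here to be proportional to m."""
    return k_unit * m

def correction_eq3(m, h1=1.0, rng=random):
    """Equation (3): zero-mean Gaussian noise scaled by the constant h1."""
    return h1 * rng.gauss(0.0, math.sqrt(band_variance(m)))

def correction_eq4(m, n_unvoiced, h2=1.0, rng=random):
    """Equation (4): the noise is further scaled by the unvoiced band count."""
    return h2 * n_unvoiced * rng.gauss(0.0, math.sqrt(band_variance(m)))

rng = random.Random(0)                      # seeded for repeatability
low_band = [correction_eq3(1, rng=rng) for _ in range(4000)]
high_band = [correction_eq3(10, rng=rng) for _ in range(4000)]
```

In this sketch a block with no unvoiced bands gets no correction from equation (4), which is the behavior the following paragraphs modify.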

[0027] In particular, when the distribution across bands is not disturbed as described above, as when a vowel in the input speech continues for a long time, or at a transition from a vowel to a consonant or to silence, the prediction correction terms of equations (3) and (4) actually degrade the quality of the synthesized speech. If a delay is permitted, therefore, the level S of the amplitude information (power) one frame ahead, or the decrease in the voiced portion, is examined and the correction term $\varepsilon_m$ is set to

$$\varepsilon_m = h_3 \max(a,\ S_j - S_{j+1})\, N(0, k_i) \tag{5}$$

$$\varepsilon_m = h_4 \max(b,\ n_{vj} - n_{v(j+1)})\, N(0, k_i) \tag{6}$$

where $a$, $b$, $h_3$, and $h_4$ are constants.
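
The look-ahead scale factors in equations (5) and (6) can be sketched as below. The function names and the numeric values are hypothetical; the sketch covers only the deterministic factor that multiplies the Gaussian term, and reads $n_{vj}$ as a per-block count supplied by the caller.

```python
def scale_eq5(S_j, S_next, a, h3=1.0):
    """Factor of equation (5): floor a, or the power drop to the next frame."""
    return h3 * max(a, S_j - S_next)

def scale_eq6(n_vj, n_v_next, b, h4=1.0):
    """Factor of equation (6): floor b, or the drop in the voiced-band count."""
    return h4 * max(b, n_vj - n_v_next)

# Assumed example values: a steady frame pair and a frame pair where
# the power falls sharply.
steady = scale_eq5(10.0, 10.0, a=0.1)
falling = scale_eq5(10.0, 4.0, a=0.1)
```

While the power stays level the factor sits at the floor `a`, and it rises only when the next frame shows a drop.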

[0028] Furthermore, when the pitch obtained by the pitch information extraction unit 12 is low, the number of frequency bands increases and the adverse effect of phases lining up grows. Taking this into account, the correction term $\varepsilon_m$ is set to

$$\varepsilon_m = f(S_j, h_j)\, N(0, k_i) \tag{7}$$

Here, f is the frequency

[0029] As described above, in the embodiment in which the present invention is applied to the analysis-synthesis coding apparatus for speech signals, making the noise used to correct the phase prediction Gaussian allows its magnitude and variance to be controlled.

[0030] A specific example in which the speech analysis-synthesis method according to the present invention is applied to an MBE (Multiband Excitation) vocoder, one type of analysis-synthesis coding apparatus for speech signals (so-called vocoder), is described below with reference to the drawings. The MBE vocoder is disclosed in D. W. Griffin and J. S. Lim, "Multiband Excitation Vocoder," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 36, no. 8, pp. 1223-1235, Aug. 1988. Conventional vocoders such as the PARCOR (PARtial auto-CORrelation) vocoder switch between voiced and unvoiced segments block by block or frame by frame when modeling speech, whereas the MBE vocoder models speech on the assumption that voiced and unvoiced segments coexist in the frequency-axis domain at the same time instant (within the same block or frame).

[0031] FIG. 3 is a block diagram showing the overall schematic configuration of an embodiment in which the present invention is applied to the MBE vocoder. In FIG. 3, a speech signal is supplied to an input terminal 101. The input speech signal is sent to a filter 102 such as an HPF (high-pass filter), which removes the so-called DC offset and at least the low-frequency components (200 Hz and below) for band limiting (for example, to 200 to 3400 Hz). The signal obtained through the filter 102 is sent to a pitch extraction unit 103 and to a windowing unit 104. In the pitch extraction unit 103, the input speech signal data are divided into blocks of a predetermined number of samples N (for example, N = 256), that is, cut out with a rectangular window, and pitch extraction is performed on the speech signal within each block. As shown in A of FIG. 4, the cut-out block (256 samples) is moved along the time axis at a frame interval of L samples (for example, L = 160), so that adjacent blocks overlap by N − L samples (for example, 96 samples). The windowing unit 104 applies a predetermined window function, for example a Hamming window, to each block of N samples, and moves the windowed block along the time axis at intervals of one frame of L samples.

[0032] Expressed as a formula, this windowing is

$$x_w(k, q) = x(q)\, w(kL - q) \tag{8}$$

where $k$ is the block number and $q$ is the time index (sample number) of the data. Equation (8) states that the q-th sample $x(q)$ of the unprocessed input signal is windowed by the window function $w(kL - q)$ of the k-th block to give the data $x_w(k, q)$. The window function $w_r(r)$ for the rectangular window used in the pitch extraction unit 103, shown in A of FIG. 4, is

$$w_r(r) = \begin{cases} 1, & 0 \le r < N \\ 0, & r < 0,\ N \le r \end{cases} \tag{9}$$

and the window function $w_h(r)$ for the Hamming window used in the windowing unit 104, shown in B of FIG. 4, is

$$w_h(r) = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi r}{N-1}\right), & 0 \le r < N \\ 0, & r < 0,\ N \le r \end{cases} \tag{10}$$

When such a window function $w_r(r)$ or $w_h(r)$ is used, the non-zero interval of the window function $w(r)$ (= $w(kL - q)$) in equation (8) is

$$0 \le kL - q < N$$

which can be rearranged to

$$kL - N < q \le kL$$

Thus, for the rectangular window, $w_r(kL - q) = 1$ when $kL - N < q \le kL$, as shown in FIG. 5. Equations (8) to (10) also show that a window of length N (= 256) samples advances L (= 160) samples at a time. In what follows, the non-zero sequences of N points ($0 \le r < N$) cut out by the window functions of equations (9) and (10) are denoted $x_{wr}(k, r)$ and $x_{wh}(k, r)$, respectively.
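
The block extraction of equations (8) to (10) can be sketched as follows, using N = 256 and L = 160 from the text. The helper name `windowed_block`, the treatment of samples outside the signal as zero, and the choice to return the block in increasing order of q are assumptions made for this illustration.

```python
import math

N = 256   # block length in samples
L = 160   # frame interval (hop) in samples

def rect_window(r, n=N):
    """Equation (9): rectangular window."""
    return 1.0 if 0 <= r < n else 0.0

def hamming_window(r, n=N):
    """Equation (10): Hamming window."""
    if 0 <= r < n:
        return 0.54 - 0.46 * math.cos(2.0 * math.pi * r / (n - 1))
    return 0.0

def windowed_block(x, k, window, n=N, hop=L):
    """Equation (8) over the non-zero interval kL - N < q <= kL."""
    block = []
    for q in range(k * hop - n + 1, k * hop + 1):
        sample = x[q] if 0 <= q < len(x) else 0.0   # assumed zero outside x
        block.append(sample * window(k * hop - q, n))
    return block

signal = [1.0] * 1000
rect_block = windowed_block(signal, 2, rect_window)
hamming_block = windowed_block(signal, 2, hamming_window)
```

Consecutive blocks produced this way share N − L = 96 samples, matching the overlap stated in the description.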

[0033] In the windowing unit 104, as shown in FIG. 6, 1792 samples of zero data are appended (so-called zero padding) to the one-block sequence $x_{wh}(k, r)$ of 256 samples windowed with the Hamming window of equation (10), giving 2048 samples. An orthogonal transform unit 105 applies an orthogonal transform such as an FFT (fast Fourier transform) to this 2048-sample time-axis data sequence.
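
The zero padding and transform can be illustrated with a small radix-2 FFT written in plain Python. The recursive `fft` helper and the test tone (one cycle every 16 samples) are assumptions for the example; the patent does not specify the FFT implementation.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def padded_spectrum(block, n_fft=2048):
    """Append zeros to reach n_fft samples, then transform."""
    padded = list(block) + [0.0] * (n_fft - len(block))
    return fft(padded)

N = 256
# Hamming-windowed cosine with a period of 16 samples (assumed test signal).
block = [(0.54 - 0.46 * math.cos(2.0 * math.pi * r / (N - 1)))
         * math.cos(2.0 * math.pi * r / 16.0) for r in range(N)]
spectrum = padded_spectrum(block)
peak_bin = max(range(1025), key=lambda j: abs(spectrum[j]))
```

Zero padding from 256 to 2048 points samples the same windowed spectrum on an eight times finer frequency grid, so the tone lands near bin 2048/16 = 128.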

[0034] The pitch extraction unit 103 extracts the pitch from the sample sequence $x_{wr}(k, r)$ (one block of N samples). Known pitch extraction methods use the periodicity of the time waveform, the periodic frequency structure of the spectrum, or the autocorrelation function; this embodiment adopts the autocorrelation method applied to a center-clipped waveform. One clip level per block could be used for the center clipping, but here the peak level of the signal in each part (sub-block) of a subdivided block is detected, and when the peak levels of the sub-blocks differ greatly, the clip level is changed stepwise or continuously within the block. The pitch period is determined from the peak positions of the autocorrelation data of the center-clipped waveform. Specifically, a plurality of peaks are found from the autocorrelation data belonging to the current frame (the autocorrelation is computed over one block of N samples). If the largest of these peaks is at or above a predetermined threshold, the position of that peak is taken as the pitch period. Otherwise, a peak is sought within a pitch range that satisfies a predetermined relationship to the pitch found in a frame other than the current frame, such as the preceding or following frame (for example, within ±20% of the pitch of the preceding frame), and the pitch of the current frame is determined from the position of that peak. The pitch extraction unit 103 thus performs a relatively rough open-loop pitch search, and the extracted pitch data are sent to a high-precision (fine) pitch search unit 106, which performs a closed-loop high-precision pitch search (fine pitch search).
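
A simplified sketch of the coarse pitch search follows. It uses a single clip level per block (the simpler variant the text mentions, not the sub-block scheme of the embodiment), and the clip ratio, lag range, and threshold are assumed values chosen only for illustration.

```python
import math

def center_clip(x, ratio=0.6):
    """Zero samples below the clip level and shift the rest toward zero."""
    c = ratio * max(abs(v) for v in x)
    return [v - c if v > c else v + c if v < -c else 0.0 for v in x]

def autocorr(y, lag):
    return sum(y[n] * y[n + lag] for n in range(len(y) - lag))

def coarse_pitch(x, prev_pitch=None, min_lag=20, max_lag=147, threshold=0.5):
    """Integer pitch period in samples, or prev_pitch if nothing is found."""
    y = center_clip(x)
    r0 = autocorr(y, 0)
    if r0 == 0.0:
        return prev_pitch
    r = {lag: autocorr(y, lag) / r0 for lag in range(min_lag, max_lag + 1)}
    peaks = [lag for lag in range(min_lag + 1, max_lag)
             if r[lag] > r[lag - 1] and r[lag] >= r[lag + 1]]
    if not peaks:
        return prev_pitch
    best = max(peaks, key=lambda lag: r[lag])
    if r[best] >= threshold or prev_pitch is None:
        return best
    # Fallback: strongest peak within +/-20% of the previous frame's pitch.
    near = [lag for lag in peaks if abs(lag - prev_pitch) <= 0.2 * prev_pitch]
    return max(near, key=lambda lag: r[lag]) if near else prev_pitch

block = [math.sin(2.0 * math.pi * n / 40.0) for n in range(256)]
pitch = coarse_pitch(block)
```

For the 40-sample-period test tone the strongest normalized autocorrelation peak is at lag 40, so the threshold branch is taken and the fallback is not used.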

[0035] The fine pitch search unit 106 is supplied with the coarse pitch data, an integer value extracted by the pitch extraction unit 103, and with the frequency-axis data produced, for example by FFT, by the orthogonal transform unit 105. The fine pitch search unit 106 varies the pitch around the coarse pitch value by ± several samples in steps of 0.2 to 0.5 to arrive at the optimum fine pitch value with a fractional part (floating point). The fine search uses the so-called analysis-by-synthesis method, selecting the pitch so that the synthesized power spectrum is closest to the power spectrum of the original sound.

[0036] This fine pitch search is now described. The MBE vocoder assumes a model in which $S(j)$, the spectral data on the frequency axis obtained by an orthogonal transform such as the FFT, is expressed as

$$S(j) = H(j)\,|E(j)|, \quad 0 < j < J \tag{11}$$

Here $J$ corresponds to $\omega_s / 4\pi = f_s / 2$, which is 4 kHz when the sampling frequency $f_s = \omega_s / 2\pi$ is, for example, 8 kHz. In equation (11), when the spectral data $S(j)$ on the frequency axis have the waveform shown in A of FIG. 7, $H(j)$ is the spectral envelope of the original spectral data $S(j)$, as shown in B of FIG. 7, and $E(j)$ is the spectrum of a periodic excitation signal of uniform level, as shown in C of FIG. 7. That is, the FFT spectrum $S(j)$ is modeled as the product of the spectral envelope $H(j)$ and the power spectrum $|E(j)|$ of the excitation signal.

[0037] The power spectrum $|E(j)|$ of the excitation signal is formed by repeating, for each band on the frequency axis, a spectral waveform corresponding to the waveform of one band, taking into account the periodicity (pitch structure) of the frequency-axis waveform determined by the pitch. The one-band waveform can be formed, for example, by treating the 256-sample Hamming window function with 1792 samples of zero data appended (zero-padded), as shown in FIG. 6, as a time-axis signal, applying the FFT to it, and cutting out the resulting impulse waveform, which has a certain bandwidth on the frequency axis, according to the pitch.

【００３８】次に、上記ピッチに応じて分割された各バ
ンド毎に、上記Ｈ(j) を代表させるような（各バンド毎
のエラーを最小化するような）値（一種の振幅）｜Ａ_m
｜を求める。ここで、例えば第ｍバンド（第ｍ高調波の
帯域）の下限、上限の点をそれぞれａ_m、ｂ_mとすると
き、この第ｍバンドのエラーε_mは、Next, for each band divided according to the above pitch, a value (a kind of amplitude) | A that represents the above H (j) (minimizes the error for each band) | A _m
Ask for |. Here, for example, when the lower and upper points of the m-th band (band of the m-th harmonic) are a _m and b _m , respectively, the error ε _m of the m-th band is

【００３９】[0039]

【数１】で表せる。このエラーε_mを最小化するような｜Ａ_m｜
は、[Equation 1] Can be expressed as | A _m | that minimizes this error ε _m
Is

【００４０】[0040]

[Equation 2]

With the |A_m| of equation (13), the error ε_m is minimized. Such an amplitude |A_m| is obtained for each band, and each amplitude |A_m| thus obtained is used to compute the error ε_m of each band as defined by equation (12). Next, the sum Σε_m of these per-band errors ε_m over all bands is computed. This total error Σε_m is then computed for several slightly different pitches, and the pitch that minimizes the total error Σε_m is found.
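For illustration, the per-band fit can be sketched in Python. The equation images for equations (12) and (13) are not available in this text, so the sketch assumes the usual least-squares forms, ε_m = Σ_{j=a_m..b_m} (|S(j)| − |A_m||E(j)|)² and |A_m| = Σ|S(j)||E(j)| / Σ|E(j)|², which are consistent with the surrounding description but remain an assumption. The function name `fit_band_amplitude` and the toy data are hypothetical.

```python
def fit_band_amplitude(S_mag, E_mag, a_m, b_m):
    """Least-squares amplitude |A_m| and error for bins a_m..b_m (inclusive).

    Assumed forms (the original equation images are not reproduced):
      |A_m| = sum(|S||E|) / sum(|E|^2)
      eps_m = sum((|S| - |A_m||E|)^2)
    """
    bins = range(a_m, b_m + 1)
    num = sum(S_mag[j] * E_mag[j] for j in bins)
    den = sum(E_mag[j] ** 2 for j in bins)
    A_m = num / den if den > 0.0 else 0.0
    err = sum((S_mag[j] - A_m * E_mag[j]) ** 2 for j in bins)
    return A_m, err


# Toy band: the spectrum is exactly twice the excitation, so the fit is exact.
E = [0.1, 0.5, 1.0, 0.5, 0.1]
S = [2.0 * e for e in E]
A, err = fit_band_amplitude(S, E, 0, 4)
```

Summing `err` over all bands gives the total error Σε_m that the pitch search compares across candidate pitches.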

[0041] That is, several pitch candidates are prepared above and below the rough pitch obtained by the pitch extraction unit 103, in steps of 0.25, for example. The total error Σε_m is computed for each of these slightly different pitches. Once a pitch is fixed, the bandwidth is determined, so that, from equation (13), the error ε_m of equation (12) can be obtained using the power spectrum |S(j)| of the frequency-axis data and the excitation signal spectrum |E(j)|, and the sum Σε_m over all bands can then be computed. This total error Σε_m is computed for every candidate pitch, and the pitch corresponding to the smallest total error is selected as the optimum pitch. In this way the high-precision pitch search unit 106 finds the optimum fine pitch (in steps of 0.25, for example) and determines the amplitude |A_m| corresponding to this optimum pitch.
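The candidate search described above can be sketched as follows. This is a minimal illustration, not the apparatus itself: `total_error` is a hypothetical callback that would evaluate Σε_m for one candidate pitch, and the number of candidates on each side (`n_each_side=4`) is an assumed value, since the text only says "several".

```python
def fine_pitch_search(rough_pitch, total_error, step=0.25, n_each_side=4):
    """Return the candidate pitch with the smallest total error.

    Candidates are spaced `step` apart around `rough_pitch`;
    `total_error(pitch)` is a caller-supplied (hypothetical) function.
    """
    candidates = [rough_pitch + k * step
                  for k in range(-n_each_side, n_each_side + 1)]
    return min(candidates, key=total_error)


# Toy error surface with its minimum at a pitch of 50.5 (illustrative only).
best = fine_pitch_search(50.0, lambda p: (p - 50.5) ** 2)
```

In a fuller implementation, `total_error` would recompute the band edges and the per-band fit for each candidate, because the bandwidth changes with the pitch.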

[0042] To keep the explanation simple, the above description of the fine pitch search assumes that all bands are voiced. As noted above, however, the MBE vocoder adopts a model in which unvoiced regions exist on the frequency axis at the same instant, so a voiced/unvoiced decision must be made for each band.

[0043] The optimum pitch and amplitude |A_m| data from the high-precision pitch search unit 106 are sent to the voiced/unvoiced discrimination unit 107, which makes a voiced/unvoiced decision for each band. The NSR (noise-to-signal ratio) is used for this decision. The NSR of the m-th band is expressed as

【００４４】[0044]

[Equation 3]

When this NSR value is greater than a predetermined threshold (for example, 0.3), that is, when the error is large, the approximation of |S(j)| by |A_m||E(j)| in that band is judged to be poor (the excitation signal |E(j)| is unsuitable as a basis), and the band is classified as UV (unvoiced). Otherwise, the approximation is judged to be reasonably good, and the band is classified as V (voiced).
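The band decision can be sketched in Python under one stated assumption: because the image for the NSR equation is not available in this text, the NSR is taken here to be the fit error divided by the band energy, Σ(|S(j)| − |A_m||E(j)|)² / Σ|S(j)|². The names `band_nsr` and `is_voiced` and the toy spectra are hypothetical; the threshold 0.3 is the example value from the text.

```python
def band_nsr(S_mag, E_mag, A_m, a_m, b_m):
    """Assumed NSR: fit error over band energy for bins a_m..b_m (inclusive)."""
    bins = range(a_m, b_m + 1)
    err = sum((S_mag[j] - A_m * E_mag[j]) ** 2 for j in bins)
    energy = sum(S_mag[j] ** 2 for j in bins)
    # An empty (zero-energy) band is treated as unvoiced.
    return err / energy if energy > 0.0 else float("inf")


def is_voiced(nsr, threshold=0.3):
    """UV when the NSR exceeds the threshold, V otherwise."""
    return nsr <= threshold


E = [0.1, 0.5, 1.0, 0.5, 0.1]          # one-band excitation shape (toy data)
peaky = [2.0 * e for e in E]           # matches the excitation shape
flat = [1.0] * 5                       # noise-like, does not match
A_flat = sum(s * e for s, e in zip(flat, E)) / sum(e * e for e in E)
voiced_peaky = is_voiced(band_nsr(peaky, E, 2.0, 0, 4))
voiced_flat = is_voiced(band_nsr(flat, E, A_flat, 0, 4))
```

With these toy inputs the harmonic-shaped band is classified as voiced and the flat band as unvoiced.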

[0045] Next, the amplitude re-evaluation unit 108 is supplied with the frequency-axis data from the orthogonal transform unit 105, the fine pitch and the estimated amplitude |A_m| from the high-precision pitch search unit 106, and the V/UV (voiced/unvoiced) decision data from the voiced/unvoiced discrimination unit 107. The amplitude re-evaluation unit 108 recomputes the amplitude for the bands that the voiced/unvoiced discrimination unit 107 classified as unvoiced (UV). The amplitude |A_m|_UV of such a UV band is obtained by

【００４６】[0046]

[Equation 4]

[0047] The data from the amplitude re-evaluation unit 108 are sent to the data number conversion unit 109, which performs a kind of sampling rate conversion. Because the number of bands into which the frequency axis is divided depends on the pitch, the number of data items (in particular, the number of amplitude data items) also varies, and the data number conversion unit 109 brings it to a fixed count. For example, if the effective band extends up to 3400 Hz, it is divided into 8 to 63 bands depending on the pitch, so the number m_MX + 1 of amplitude data |A_m| (including the UV-band amplitudes |A_m|_UV) obtained for these bands also varies from 8 to 63. The data number conversion unit 109 therefore converts this variable number m_MX + 1 of amplitude data into a fixed number N_C (for example, 44) of data.

[0048] In this embodiment, dummy data that interpolate the values from the last data item in the block to the first data item in the block are appended to the amplitude data for one block of the effective band on the frequency axis, expanding the number of data items to N_F. Band-limited oversampling by a factor of K_OS (for example, 8) is then applied to obtain K_OS times as many amplitude data. These (m_MX + 1) × K_OS amplitude data are linearly interpolated to expand them to a larger number N_M (for example, 2048), and the N_M data are then decimated to the fixed number N_C (for example, 44).
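The idea of mapping a variable number of band amplitudes onto a fixed count can be illustrated with a deliberately simplified stand-in. The sketch below uses direct linear interpolation only; it omits the dummy-data extension, the band-limited K_OS oversampling, and the intermediate N_M-point grid described above, so it is not the conversion of this embodiment. The function name `resample_linear` is hypothetical.

```python
def resample_linear(values, n_out):
    """Linearly resample `values` to `n_out` points (n_out >= 2).

    Simplified stand-in for the data number conversion: the first and
    last inputs map to the first and last outputs.
    """
    n_in = len(values)
    if n_in == 1:
        return [values[0]] * n_out
    out = []
    for i in range(n_out):
        pos = i * (n_in - 1) / (n_out - 1)   # fractional input index
        k = min(int(pos), n_in - 2)          # left neighbour
        frac = pos - k
        out.append((1.0 - frac) * values[k] + frac * values[k + 1])
    return out


# Ten band amplitudes (toy data) converted to the fixed count N_C = 44.
amps = [float(i) for i in range(10)]
fixed = resample_linear(amps, 44)
```

Plain linear interpolation is the simplest choice; the oversampling stage in the text serves to smooth the envelope before the final decimation.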

[0049] The data from the data number conversion unit 109 (the fixed number N_C of amplitude data) are sent to the vector quantization unit 110, where they are grouped into vectors of a predetermined number of data items and vector-quantized. The quantized output data from the vector quantization unit 110 are taken out via output terminal 111. The high-precision (fine) pitch data from the high-precision pitch search unit 106 are encoded by the pitch encoding unit 115 and taken out via output terminal 112. The voiced/unvoiced (V/UV) decision data from the voiced/unvoiced discrimination unit 107 are taken out via output terminal 113. The data from output terminals 111 to 113 are converted into a signal of a predetermined transmission format and transmitted.

[0050] Each of these data is obtained by processing the data in a block of N samples (for example, 256 samples). Because the block advances along the time axis in units of the L-sample frame, the data to be transmitted are obtained frame by frame. That is, the pitch data, the V/UV decision data, and the amplitude data are updated at the frame period.

[0051] Next, the schematic configuration of the synthesis side (decoding side), which synthesizes a speech signal from the transmitted data, is described with reference to FIG. 8. In FIG. 8, the vector-quantized amplitude data are supplied to input terminal 121, the encoded pitch data to input terminal 122, and the V/UV decision data to input terminal 123. The quantized amplitude data from input terminal 121 are sent to the inverse vector quantization unit 124 and dequantized, then sent to the data number inverse conversion unit 125 and inversely converted, and the resulting amplitude data are sent to the voiced sound synthesis unit 126 and the unvoiced sound synthesis unit 127. The encoded pitch data from input terminal 122 are decoded by the pitch decoding unit 128 and sent to the data number inverse conversion unit 125, the voiced sound synthesis unit 126, and the unvoiced sound synthesis unit 127. The V/UV decision data from input terminal 123 are sent to the voiced sound synthesis unit 126 and the unvoiced sound synthesis unit 127.

[0052] The voiced sound synthesis unit 126 synthesizes a voiced waveform on the time axis by, for example, cosine wave synthesis, and the unvoiced sound synthesis unit 127 synthesizes an unvoiced waveform on the time axis by, for example, filtering white noise with a band-pass filter. The adder 129 adds the voiced and unvoiced synthesized waveforms, and the result is taken out from output terminal 130. The amplitude data, pitch data, and V/UV decision data are updated once per analysis frame (L samples, for example 160 samples). To improve continuity between frames (to smooth them), each amplitude and pitch value is taken as the data value at, for example, the center of a frame, and the data values up to the center of the next frame (one synthesis frame) are obtained by interpolation. That is, in one synthesis frame (for example, from the center of one analysis frame to the center of the next), the data values at the leading sample point and at the trailing sample point (the leading point of the next synthesis frame) are given, and the data values between these sample points are obtained by interpolation.

[0053] The synthesis processing in the voiced sound synthesis unit 126 is described in detail below. Let V_m(n) be the voiced sound for one synthesis frame (L samples, for example 160 samples) on the time axis in the m-th band (the band of the m-th harmonic) classified as V (voiced). Using the time index (sample number) n within the synthesis frame, it can be expressed as

V_m(n) = A_m(n) cos(θ_m(n)), 0 ≤ n < L ... (16)

The voiced sounds of all bands classified as V (voiced) are added (ΣV_m(n)) to synthesize the final voiced sound V(n).

[0054] A_m(n) in equation (16) is the amplitude of the m-th harmonic interpolated from the leading end to the trailing end of the synthesis frame. Most simply, the m-th harmonic values of the amplitude data, which are updated frame by frame, can be linearly interpolated. That is, letting A_0m be the amplitude of the m-th harmonic at the leading end of the synthesis frame (n = 0) and A_Lm be its amplitude at the trailing end (n = L, the leading end of the next synthesis frame), A_m(n) can be calculated as

A_m(n) = (L − n) A_0m / L + n A_Lm / L ... (17)

[0055] Next, the phase θ_m(n) in equation (16) can be obtained by

θ_m(n) = m ω_01 n + n² m (ω_L1 − ω_01) / (2L) + φ_0m + Δω n ... (18)

In equation (18), φ_0m is the phase of the m-th harmonic at the leading end of the synthesis frame (n = 0), that is, the frame initial phase; ω_01 is the fundamental angular frequency at the leading end of the synthesis frame (n = 0); and ω_L1 is the fundamental angular frequency at the trailing end of the synthesis frame (n = L, the leading end of the next synthesis frame). Δω in equation (18) is set to the smallest value for which θ_m(L) equals the phase φ_Lm at n = L.
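Equations (16) to (18) can be combined into a short Python sketch that synthesizes one harmonic over one synthesis frame. This is an illustration under assumptions: the function name `synth_harmonic` and its parameter names are hypothetical, and the phase-matching term Δω is simply passed in (default 0) rather than solved for.

```python
import math


def synth_harmonic(m, A0, AL, w0, wL, phi0, L=160, d_omega=0.0):
    """One frame of the m-th harmonic, following equations (16) to (18).

    A0, AL : amplitudes at the frame start and end
    w0, wL : fundamental angular frequencies (rad/sample) at start and end
    phi0   : frame initial phase; d_omega is supplied by the caller
    """
    out = []
    for n in range(L):
        amp = ((L - n) * A0 + n * AL) / L                       # eq. (17)
        theta = (m * w0 * n + n * n * m * (wL - w0) / (2 * L)
                 + phi0 + d_omega * n)                          # eq. (18)
        out.append(amp * math.cos(theta))                       # eq. (16)
    return out


# Constant pitch with a 40-sample period and constant amplitude (toy values).
w = 2.0 * math.pi / 40.0
frame = synth_harmonic(1, 1.0, 1.0, w, w, 0.0)
ramp = synth_harmonic(1, 0.0, 1.0, 0.0, 0.0, 0.0)   # amplitude ramp only
```

The final voiced sound for the frame would be the sample-by-sample sum of such outputs over all bands classified as voiced.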

[0056] In contrast, in the embodiment of the present invention, the term φ_0m + Δωn of equation (18) is not sent to the synthesis side; instead, the phase is predicted on the synthesis side. That is, as shown in equation (1), the phase prediction unit 22 predicts the phase ψ_Lm at the end of the frame by adding m(ω_01 + ω_L1)L/2 to the phase ψ_0m of the m-th harmonic at time 0 (the beginning of the frame), that is, the frame initial phase. The phase φ_m of each band is given by adding ε_m to the predicted phase ψ_Lm, where ε_m is the prediction correction term for each band. In the present invention, Gaussian noise is used for this prediction correction term ε_m.
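The prediction and correction steps can be sketched in Python as follows. The function names are hypothetical, and the standard deviation `sigma` of the Gaussian correction is left as a free parameter here because this paragraph does not fix how it is chosen for each band.

```python
import random


def predict_end_phase(psi0, m, w0, wL, L=160):
    """Predicted frame-end phase: psi0 + m * (w0 + wL) * L / 2, per equation (1)."""
    return psi0 + m * (w0 + wL) * L / 2.0


def corrected_phase(psi_L, sigma, rng=random):
    """Add a zero-mean Gaussian correction term to the predicted phase."""
    return psi_L + rng.gauss(0.0, sigma)


# Second harmonic, constant fundamental of 0.1 rad/sample (toy values).
psi_L = predict_end_phase(0.0, 2, 0.1, 0.1)
phi = corrected_phase(psi_L, sigma=0.05, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the sketch reproducible; a decoder would typically keep one generator for the whole stream.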

[0057] FIG. 9A shows an example of the spectrum of a speech signal, in which the bands with band number (harmonic number) m of 8, 9, and 10 are UV (unvoiced) and the other bands are V (voiced). The time-axis signal of the V (voiced) bands is synthesized by the voiced sound synthesis unit 126, and the time-axis signal of the UV (unvoiced) bands is synthesized by the unvoiced sound synthesis unit 127.

[0058] The unvoiced sound synthesis processing in the unvoiced sound synthesis unit 127 is described below. The time-axis white noise waveform from the white noise generator 131 is windowed over a predetermined length (for example, 256 samples) with a suitable window function (for example, a Hamming window), and the STFT processing unit 132 applies the STFT (short-term Fourier transform) to obtain the power spectrum of the white noise on the frequency axis, as shown in FIG. 9B. The power spectrum from the STFT processing unit 132 is sent to the band amplitude processing unit 133, where, as shown in FIG. 9C, the bands classified as UV (unvoiced), for example m = 8, 9, 10, are multiplied by the amplitude |A_m|_UV and the amplitudes of the other bands, classified as V (voiced), are set to 0. The band amplitude processing unit 133 is supplied with the amplitude data, the pitch data, and the V/UV decision data. The output of the band amplitude processing unit 133 is sent to the ISTFT processing unit 134, which converts it to a time-axis signal by an inverse STFT that uses the phase of the original white noise. The output of the ISTFT processing unit 134 is sent to the overlap-add unit 135, which repeats overlapping and adding on the time axis with weighting chosen so that the original continuous noise waveform can be restored, thereby synthesizing a continuous time-axis waveform. The output signal of the overlap-add unit 135 is sent to the adder 129.
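A single block of this unvoiced path can be sketched in pure Python. Several points are assumptions made for illustration: a naive O(N²) DFT stands in for the STFT, the complex noise spectrum (rather than a power spectrum) is scaled so that the noise phase is retained, band edges are given directly as bin indices, and the overlap-add stage is omitted. The names `dft` and `unvoiced_block` are hypothetical.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive O(N^2) DFT; adequate for a short illustrative block."""
    N = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(N):
        acc = sum(x[n] * cmath.exp(sign * 2j * math.pi * k * n / N)
                  for n in range(N))
        out.append(acc / N if inverse else acc)
    return out


def unvoiced_block(bands, N=64, rng=random):
    """bands: list of (lo_bin, hi_bin, amplitude, is_voiced), 0 < lo <= hi < N/2."""
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * r / (N - 1))
              for r in range(N)]
    noise = [rng.gauss(0.0, 1.0) * window[r] for r in range(N)]
    X = dft(noise)
    Y = [0j] * N                          # voiced bands stay at zero
    for lo, hi, amp, voiced in bands:
        if voiced:
            continue
        for j in range(lo, hi + 1):
            Y[j] = amp * X[j]             # scale magnitude, keep noise phase
            Y[N - j] = amp * X[N - j]     # mirror bin keeps the output real
    return [v.real for v in dft(Y, inverse=True)]


block = unvoiced_block([(4, 8, 0.5, False), (9, 12, 1.0, True)],
                       N=32, rng=random.Random(0))
```

Successive blocks produced this way would then be weighted, overlapped, and added to form a continuous unvoiced waveform.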

[0059] The voiced and unvoiced signals synthesized in the synthesis units 126 and 127 and returned to the time axis are added by the adder 129 at a suitable fixed mixing ratio, and the reproduced speech signal is taken out from output terminal 130.

[0060] Thus, in the specific example in which the speech analysis/synthesis method according to the present invention is applied to MBE, making the noise used for phase prediction Gaussian allows its magnitude and variance to be controlled.

[0061] Although each part of the speech analysis side (encoding side) configuration of FIG. 3 and the speech synthesis side (decoding side) configuration of FIG. 7 is described as hardware, these configurations can also be implemented by a software program using a so-called DSP (digital signal processor) or the like.

【００６２】[0062]

[Effects of the Invention] In the speech analysis/synthesis method according to the present invention, the analysis side converts the speech signal of each block to the frequency axis, divides the resulting frequency-axis data into a plurality of bands based on pitch information obtained from the speech signal of each block, and obtains and transmits, for each band, power information and voiced/unvoiced decision information. The synthesis side predicts the block end phase from the transmitted pitch information of each block and the block initial phase, and corrects the predicted end phase with Gaussian noise whose variance depends on the band. The magnitude and variance of the noise can thereby be controlled, and an improvement in sound quality can be expected. In addition, using the signal level of the speech and its temporal change prevents errors from accumulating and prevents sound quality degradation in vowel parts or at transition points from vowel parts to consonant parts.

[Brief description of drawings]

FIG. 1 is a functional block diagram of a specific example in which the speech analysis/synthesis method according to the present invention is applied to a so-called vocoder.

FIG. 2 is a characteristic diagram for explaining the Gaussian noise used in the speech analysis/synthesis method according to the present invention.

FIG. 3 is a functional block diagram showing the schematic configuration of the analysis side (encoding side) of a speech signal synthesis/analysis coding apparatus, as a specific example of an apparatus to which the speech analysis/synthesis method according to the present invention is applied.

FIG. 4 is a diagram for explaining the windowing process.

FIG. 5 is a diagram for explaining the relationship between the windowing process and the window function.

FIG. 6 is a diagram showing the time-axis data subjected to the orthogonal transform (FFT).

FIG. 7 is a diagram showing the spectral data on the frequency axis, the spectral envelope, and the power spectrum of the excitation signal.

FIG. 8 is a functional block diagram showing the schematic configuration of the synthesis side (decoding side) of a speech signal synthesis/analysis coding apparatus, as a specific example of an apparatus to which the speech analysis/synthesis method according to the present invention is applied.

FIG. 9 is a diagram for explaining unvoiced sound synthesis when synthesizing a speech signal.

[Explanation of symbols]

10: analysis unit
20: synthesis unit
21: voiced sound synthesis unit
22: phase prediction unit
23: noise addition unit
24: phase correction unit
25: sine wave generation unit
26: amplitude amplification unit

─────────────────────────────────────────────────────

[Procedure amendment]

[Submission date] April 5, 1993

[Procedure Amendment 1]

[Document name to be amended] Specification

[Name of item to be amended] Full text

[Amendment method] Change

[Amendment content]

[Document name] Specification

[Title of the Invention] Speech analysis/synthesis method

[Claims]

[Detailed Description of the Invention]

【０００１】[0001]

【０００２】[0002]

【０００８】[0008]

[0010] Accordingly, an object of the present invention is to provide a speech analysis/synthesis method that improves sound quality by using noise whose magnitude and variance can be controlled to correct the prediction.

【００１１】[0011]

【００１３】[0013]

【００１４】[0014]

[0015] FIG. 1 shows the overall schematic configuration of an embodiment in which the present invention is applied to the above speech signal analysis/synthesis coding apparatus. In FIG. 1, the embodiment of the speech analysis/synthesis method according to the present invention comprises an analysis unit 10, which analyzes pitch information and the like from the input speech signal, and a synthesis unit 20, which obtains voiced and unvoiced sounds based on the various information (pitch information and the like) transmitted from the analysis unit 10 by the transmission unit 2 and then combines the voiced and unvoiced sounds.

[0022] In the phase prediction unit 22 in the voiced sound synthesis unit 21 of the synthesis unit 20, letting ψ_0m be the phase of the m-th harmonic at time 0 (the beginning of the frame), that is, the frame initial phase, the phase ψ_Lm at the end of the frame is predicted as

ψ_Lm = ψ_0m + m (ω_01 + ω_L1) L / 2 ... (1)

The phase φ_m of each band is then

φ_m = ψ_Lm + ε_m ... (2)

In equations (1) and (2), L is the frame interval, ω_01 is the fundamental angular frequency at the leading end of the synthesis frame (n = 0), ω_L1 is the fundamental angular frequency at the trailing end of the synthesis frame (n = L, the leading end of the next synthesis frame), and ε_m is the prediction correction term for each band.

[0025] If the variance is assumed to be simply proportional to the band number m, as shown in FIG. 2, the prediction correction term ε_m is expressed as

ε_m = h_1 N(0, k_i) ... (3)

where h_1 is a constant, k_i is the variance, and 0 is the mean.
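A correction term of this form can be sketched in Python. The proportionality is written here as variance = k · m, where the constant `k`, the default values, and the function name `correction_term` are all hypothetical; the text states only that the variance is proportional to the band number m.

```python
import math
import random


def correction_term(m, h1=1.0, k=0.01, rng=random):
    """Sample h1 * N(0, k*m): zero-mean Gaussian, variance proportional to m.

    `k` is an assumed proportionality constant; random.gauss takes a
    standard deviation, hence the square root of the variance.
    """
    variance = k * m
    return h1 * rng.gauss(0.0, math.sqrt(variance))


# One correction value per band for bands 1..10 (toy values, seeded).
rng = random.Random(0)
eps = [correction_term(m, rng=rng) for m in range(1, 11)]
```

Higher bands thus receive, on average, larger phase corrections than lower bands.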

[0028] Furthermore, when the pitch given by the pitch information extraction unit 12 is low, the number of frequency bands increases and the adverse effect of aligned phases grows. Taking this into account, the correction term ε_m is set to

ε_m = f(S_j, h_j) N(0, k_i) ... (7)

where f is the frequency.

[0031] FIG. 3 is a block diagram showing the overall schematic configuration of an embodiment in which the present invention is applied to the MBE vocoder. In FIG. 3, a speech signal is supplied to input terminal 101. This input speech signal is sent to a filter 102 such as an HPF (high-pass filter), which removes the so-called DC offset and at least the low-frequency components (200 Hz and below) for band limiting (for example, to 200 to 3400 Hz). The signal obtained through the filter 102 is sent to the pitch extraction unit 103 and to the windowing unit 104. In the pitch extraction unit 103, the input speech signal data are divided into blocks of a predetermined number N of samples (for example, N = 256), or equivalently cut out by a rectangular window, and pitch extraction is performed on the speech signal in each block. As shown in FIG. 4A, for example, this cut-out block (256 samples) is moved along the time axis at a frame interval of L samples (for example, L = 160), so that adjacent blocks overlap by N − L samples (for example, 96 samples). The windowing unit 104 applies a predetermined window function, for example a Hamming window, to each block of N samples, and this windowed block is likewise moved along the time axis at intervals of one frame of L samples.

[0032] This windowing can be written as

x_w(k, q) = x(q) w(kL − q) ... (8)

In equation (8), k is the block number and q is the time index (sample number) of the data; the equation states that the data x_w(k, q) are obtained by windowing the q-th sample x(q) of the unprocessed input signal with the window function w(kL − q) of the k-th block. The window function w_r(r) for the rectangular window used in the pitch extraction unit 103, shown in FIG. 4A, is

w_r(r) = 1 for 0 ≤ r < N
w_r(r) = 0 for r < 0, N ≤ r ... (9)

and the window function w_h(r) for the Hamming window used in the windowing unit 104, shown in FIG. 4B, is

w_h(r) = 0.54 − 0.46 cos(2πr / (N − 1)) for 0 ≤ r < N
w_h(r) = 0 for r < 0, N ≤ r ... (10)

When such a window function w_r(r) or w_h(r) is used, the non-zero interval of the window function w(r) (= w(kL − q)) in equation (8) is

0 ≤ kL − q < N

which can be rearranged as

kL − N < q ≤ kL

Thus, for the rectangular window, for example, w_r(kL − q) = 1 when kL − N < q ≤ kL, as shown in FIG. 5. Equations (8) to (10) also show that a window of length N (= 256) samples advances L (= 160) samples at a time. In the following, the non-zero sample sequences of N points (0 ≤ r < N) cut out by the window functions of equations (9) and (10) are denoted x_wr(k, r) and x_wh(k, r), respectively.
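Equations (8) and (10) can be illustrated with a short Python sketch. The function names are hypothetical, and samples outside the input are treated as zero, which is an assumption made here for the first blocks. Note that, following equation (8) literally, the block is indexed by r = kL − q, so increasing r corresponds to going back in time.

```python
import math


def hamming(N):
    """Hamming window of equation (10) for 0 <= r < N."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * r / (N - 1))
            for r in range(N)]


def windowed_block(x, k, N=256, L=160, window=None):
    """Block k per equation (8): x(q) * w(kL - q) for kL - N < q <= kL.

    The result is indexed by r = kL - q; out-of-range samples are taken
    as zero (an assumption of this sketch).
    """
    w = window if window is not None else hamming(N)
    block = []
    for r in range(N):
        q = k * L - r
        sample = x[q] if 0 <= q < len(x) else 0.0
        block.append(sample * w[r])
    return block


w = hamming(256)
ones = [1.0] * 400
blk = windowed_block(ones, 2)   # covers q = 65 .. 320
```

Consecutive values of k move the block forward by L = 160 samples, so adjacent blocks share N − L = 96 samples.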

【００３４】ピッチ抽出部１０３では、上記ｘ_wr(k,r)
のサンプル列（１ブロックＮサンプル）に基づいてピッ
チ抽出が行われる。このピッチ抽出法には、時間波形の
周期性や、スペクトルの周期的周波数構造や、自己相関
関数を用いるもの等が知られているが、本実施例では、
センタクリップ波形の自己相関法を採用している。この
ときのブロック内でのセンタクリップレベルについて
は、１ブロックにつき１つのクリップレベルを設定して
もよいが、ブロックを細分割した各部（各サブブロッ
ク）の信号のピークレベル等を検出し、これらの各サブ
ブロックのピークレベル等の差が大きいときに、ブロッ
ク内でクリップレベルを段階的にあるいは連続的に変化
させるようにしている。このセンタクリップ波形の自己
相関データのピーク位置に基づいてピッチ周期を決めて
いる。このとき、現在フレームに属する自己相関データ
（自己相関は１ブロックＮサンプルのデータを対象とし
て求められる）から複数のピークを求めておき、これら
の複数のピークの内の最大ピークが所定の閾値以上のと
きには該最大ピーク位置をピッチ周期とし、それ以外の
ときには、現在フレーム以外のフレーム、例えば前後の
フレームで求められたピッチに対して所定の関係を満た
すピッチ範囲内、例えば前フレームのピッチを中心とし
て±２０％の範囲内にあるピークを求め、このピーク位
置に基づいて現在フレームのピッチを決定するようにし
ている。このピッチ抽出部１０３ではオープンループに
よる比較的ラフなピッチのサーチが行われ、抽出された
ピッチデータは高精度（ファイン）ピッチサーチ部１０
６に送られて、クローズドループによる高精度のピッチ
サーチ（ピッチのファインサーチ）が行われる。The pitch extraction unit 103 extracts the pitch from the sample sequence x_wr(k, r) (one block of N samples). Known pitch extraction methods use the periodicity of the time waveform, the periodic frequency structure of the spectrum, or the autocorrelation function; this embodiment uses the autocorrelation method applied to the center-clipped waveform. A single clip level may be set for each block. Alternatively, the peak signal level of each part (sub-block) obtained by subdividing the block is detected, and when the peak levels of the sub-blocks differ greatly, the clip level is changed stepwise or continuously within the block. The pitch period is determined from the peak positions of the autocorrelation data of the center-clipped waveform. Specifically, a plurality of peaks are obtained from the autocorrelation data belonging to the current frame (the autocorrelation is computed over the N samples of one block). When the largest of these peaks is equal to or greater than a predetermined threshold, the position of that peak is taken as the pitch period. Otherwise, a peak is sought within a pitch range that satisfies a predetermined relationship with the pitch obtained in a frame other than the current frame, for example the preceding or following frame (for example, within ±20% of the pitch of the preceding frame), and the pitch of the current frame is determined from the position of that peak. The pitch extraction unit 103 thus performs a relatively rough open-loop pitch search, and the extracted pitch data is sent to the high-precision (fine) pitch search unit 106, where a closed-loop high-precision pitch search (fine pitch search) is performed.
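The rough open-loop search described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the clip ratio of 0.6, the normalized threshold of 0.5, and the use of the largest autocorrelation value in place of explicit peak picking are all assumptions made for the example.

```python
import math

def center_clip(x, ratio=0.6):
    """Center-clip a block: zero small samples, shift large ones toward zero.
    The clip ratio is an assumed value (one clip level for the whole block)."""
    level = ratio * max(abs(v) for v in x)
    out = []
    for v in x:
        if v > level:
            out.append(v - level)
        elif v < -level:
            out.append(v + level)
        else:
            out.append(0.0)
    return out

def autocorr(x, lag):
    """Unnormalized autocorrelation of x at the given lag."""
    return sum(x[i] * x[i + lag] for i in range(len(x) - lag))

def rough_pitch(block, min_lag, max_lag, threshold=0.5, prev_pitch=None):
    """Return an integer pitch period (in samples) for one block."""
    c = center_clip(block)
    r0 = autocorr(c, 0)
    if r0 == 0.0:
        return prev_pitch
    r = {lag: autocorr(c, lag) / r0 for lag in range(min_lag, max_lag + 1)}
    best = max(r, key=r.get)
    if r[best] >= threshold or prev_pitch is None:
        return best
    # Weak maximum: search only within +/-20% of the previous frame's pitch.
    lo = max(min_lag, int(math.ceil(0.8 * prev_pitch)))
    hi = min(max_lag, int(math.floor(1.2 * prev_pitch)))
    near = {lag: r[lag] for lag in range(lo, hi + 1)}
    return max(near, key=near.get) if near else best
```

For a 256-sample sinusoid with a 40-sample period, `rough_pitch(block, 20, 120)` selects a lag of 40 samples.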

[0036] The fine pitch search will now be described. The MBE vocoder assumes a model in which the spectrum data S(j) on the frequency axis, obtained by an orthogonal transform such as the FFT, is expressed as

$$S(j) = H(j)\,|E(j)|, \qquad 0 < j < J \tag{11}$$

Here J corresponds to ω_s/4π = f_s/2; when the sampling frequency f_s = ω_s/2π is, for example, 8 kHz, J corresponds to 4 kHz. In equation (11), when the spectrum data S(j) on the frequency axis has the waveform shown in FIG. 7A, H(j) represents the spectral envelope of the original spectrum data S(j) as shown in FIG. 7B, and E(j) represents the spectrum of an equal-level, periodic excitation signal as shown in FIG. 7C. That is, the FFT spectrum S(j) is modeled as the product of the spectral envelope H(j) and the power spectrum |E(j)| of the excitation signal.

【００３９】[0039]

【数１】 [Equation 1]

[0040] The error ε_m can be expressed as above. The amplitude |A_m| that minimizes this error ε_m is

【００４１】[0041]

【数２】 [Equation 2]

[0042] and with the |A_m| of this equation (13), the error ε_m is minimized. Such an amplitude |A_m| is obtained for each band, and each resulting amplitude |A_m| is used to evaluate the per-band error ε_m defined by equation (12). Next, the sum Σε_m of these per-band errors over all bands is obtained. This total error Σε_m is then computed for several slightly different pitches, and the pitch that minimizes Σε_m is selected.
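Equations 1 and 2 are images in the source and are not reproduced in this text. As a hedged reconstruction only, suppose the per-band error of equation (12) is the squared difference between |S(j)| and |A_m||E(j)| summed over the bins of band m, where a_m and b_m are assumed names for the lowest and highest bin of that band. Setting the derivative with respect to |A_m| to zero then gives a least-squares amplitude of the kind the text attributes to equation (13):

```latex
\varepsilon_m = \sum_{j=a_m}^{b_m} \bigl( |S(j)| - |A_m|\,|E(j)| \bigr)^2

\frac{\partial \varepsilon_m}{\partial |A_m|}
  = -2 \sum_{j=a_m}^{b_m} \bigl( |S(j)| - |A_m|\,|E(j)| \bigr)\,|E(j)| = 0

\Longrightarrow\quad
|A_m| = \frac{\sum_{j=a_m}^{b_m} |S(j)|\,|E(j)|}{\sum_{j=a_m}^{b_m} |E(j)|^2}
```

The published equations may differ in notation or normalization from this sketch.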

[0043] That is, several candidate pitches are prepared above and below the rough pitch obtained by the pitch extraction unit 103, in steps of, for example, 0.25. The total error Σε_m is obtained for each of these slightly different candidate pitches. Once a pitch is fixed, the bandwidth is determined; using equation (13) with the power spectrum |S(j)| of the frequency-axis data and the excitation signal spectrum |E(j)|, the error ε_m of equation (12) can be evaluated, and from it the sum Σε_m over all bands. This total error Σε_m is computed for each candidate pitch, and the pitch giving the smallest total error is chosen as the optimum pitch. In this way the high-precision pitch search unit 106 obtains the optimum fine pitch (for example, in steps of 0.25), and the amplitude |A_m| corresponding to this optimum pitch is determined.
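The closed-loop search can be sketched as below. It assumes a squared-error form for the per-band error and a least-squares amplitude, since the equation images are not available here. The triangular `excitation` lobe, the band edges halfway between harmonics, and all function names are hypothetical choices for the example.

```python
import math

def excitation(j, center, half_width=2.0):
    """Hypothetical |E(j)|: a triangular lobe centered on one harmonic."""
    return max(0.0, 1.0 - abs(j - center) / half_width)

def total_error(S, pitch, n_fft):
    """Sum over all bands of the least-squares error for one candidate pitch.
    S holds magnitude-spectrum bins; pitch is a period in samples."""
    f0 = n_fft / pitch                      # harmonic spacing in FFT bins
    n_bins = len(S)
    total = 0.0
    m = 1
    while (m + 0.5) * f0 < n_bins:
        a = int(math.ceil((m - 0.5) * f0))  # assumed lower band edge
        b = int(math.ceil((m + 0.5) * f0)) - 1
        E = [excitation(j, m * f0) for j in range(a, b + 1)]
        Sb = S[a:b + 1]
        den = sum(e * e for e in E)
        A = sum(s * e for s, e in zip(Sb, E)) / den if den else 0.0
        total += sum((s - A * e) ** 2 for s, e in zip(Sb, E))
        m += 1
    return total

def fine_pitch(S, rough, n_fft, step=0.25, n_each_side=4):
    """Try candidates around the rough pitch and keep the smallest total error."""
    candidates = [rough + k * step for k in range(-n_each_side, n_each_side + 1)]
    return min(candidates, key=lambda p: total_error(S, p, n_fft))
```

Because the band layout depends on the candidate pitch, each candidate re-partitions the spectrum before its error is accumulated, which is why the text notes that fixing the pitch fixes the bandwidth.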

[0044] For simplicity, the above description of the fine pitch search assumed that all bands are voiced. As noted earlier, however, the MBE vocoder adopts a model in which unvoiced regions can exist on the frequency axis at the same instant, so a voiced/unvoiced decision must be made for each band.

[0045] The optimum pitch and amplitude |A_m| data from the high-precision pitch search unit 106 are sent to the voiced/unvoiced discrimination unit 107, which makes a voiced/unvoiced decision for each band. The NSR (noise-to-signal ratio) is used for this decision. The NSR of the m-th band is

【００４６】[0046]

【数３】 [Equation 3]

[0047] When this NSR value is larger than a predetermined threshold (for example, 0.3), that is, when the error is large, the approximation of |S(j)| by |A_m||E(j)| in that band is judged to be poor (the excitation signal |E(j)| is unsuitable as a basis), and the band is classified as UV (unvoiced). Otherwise the approximation is judged to be reasonably good, and the band is classified as V (voiced).
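The per-band decision can be illustrated with the sketch below. The NSR equation is an image in the source, so the form used here (residual energy of the fit divided by the energy of |S(j)| in the band) is an assumption, as is the function name.

```python
def band_is_voiced(S_band, E_band, threshold=0.3):
    """Return (voiced, nsr) for one band.
    Assumed NSR: sum((|S| - |A||E|)^2) / sum(|S|^2) over the band's bins."""
    den = sum(e * e for e in E_band)
    A = sum(s * e for s, e in zip(S_band, E_band)) / den if den else 0.0
    err = sum((s - A * e) ** 2 for s, e in zip(S_band, E_band))
    energy = sum(s * s for s in S_band)
    nsr = err / energy if energy else float("inf")
    # NSR above the threshold means a poor harmonic fit: unvoiced (UV).
    return nsr <= threshold, nsr
```

A band whose spectrum is an exact scaled copy of the excitation lobe gives an NSR of 0 and is classified as voiced, while a flat, noise-like band fits the lobe poorly and falls on the unvoiced side of the 0.3 threshold.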

[0048] The amplitude re-evaluation unit 108 is supplied with the frequency-axis data from the orthogonal transform unit 105, the fine pitch and the evaluated amplitude |A_m| from the high-precision pitch search unit 106, and the V/UV (voiced/unvoiced) decision data from the voiced/unvoiced discrimination unit 107. The amplitude re-evaluation unit 108 recomputes the amplitude for the bands that the voiced/unvoiced discrimination unit 107 classified as unvoiced (UV). The amplitude |A_m|_UV for such a UV band is

【００４９】[0049]

【数４】 [Equation 4]

[0050] That is, the amplitude is obtained by the equation above.
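Equation 4 is not reproduced in this text. One plausible choice, shown here purely as an assumption, is the root-mean-square of |S(j)| over the bins of the band, which no longer depends on the harmonic excitation shape:

```python
import math

def unvoiced_amplitude(S_band):
    """Assumed re-evaluation for a UV band: RMS of the magnitude bins.
    The patent's own Equation 4 may differ from this form."""
    return math.sqrt(sum(s * s for s in S_band) / len(S_band))
```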

[0051] The data from the amplitude re-evaluation unit 108 is sent to the data number conversion unit 109 (a kind of sampling rate conversion). Because the number of bands into which the frequency axis is divided, and hence the number of data items (in particular the number of amplitude data), varies with the pitch, the data number conversion unit 109 converts them to a fixed number. For example, if the effective band extends to 3400 Hz, it is divided into 8 to 63 bands depending on the pitch, so the number m_MX + 1 of amplitude data |A_m| (including the UV-band amplitudes |A_m|_UV) obtained for these bands also varies from 8 to 63. The data number conversion unit 109 therefore converts this variable number m_MX + 1 of amplitude data into a fixed number N_C (for example, 44).

[0052] In this embodiment, dummy data that interpolates from the last value in the block to the first value in the block is appended to the amplitude data for one block of the effective band on the frequency axis, expanding the number of data to N_F. Band-limited K_OS-fold (for example, 8-fold) oversampling is then applied to obtain K_OS times as many amplitude data. These (m_MX + 1) × K_OS amplitude data are linearly interpolated to expand them to a still larger number N_M (for example, 2048), and the N_M data are decimated to the fixed number N_C (for example, 44).
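The goal of the conversion, mapping a pitch-dependent number of amplitudes to a fixed count, can be shown with a deliberately simplified sketch. It replaces the dummy-data padding, band-limited oversampling, and decimation chain described above with plain linear interpolation, so it illustrates only the change in data count, not the embodiment's filtering. The function name is hypothetical.

```python
def resample_linear(values, n_out):
    """Map len(values) amplitudes onto n_out evenly spaced points (n_out >= 2).
    Simplified stand-in: direct linear interpolation, no oversampling stage."""
    n_in = len(values)
    if n_in == 1:
        return [values[0]] * n_out
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)   # fractional index into the input
        i = min(int(pos), n_in - 2)
        frac = pos - i
        out.append((1 - frac) * values[i] + frac * values[i + 1])
    return out
```

With this helper, an 8-band frame and a 63-band frame both yield 44 values, which is what allows a fixed-dimension vector quantizer to follow.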

[0053] The data from the data number conversion unit 109 (the fixed number N_C of amplitude data) is sent to the vector quantization unit 110, where it is grouped into vectors of a predetermined number of data and vector-quantized. The quantized output data from the vector quantization unit 110 is taken out through output terminal 111. The high-precision (fine) pitch data from the high-precision pitch search unit 106 is encoded by the pitch encoding unit 115 and taken out through output terminal 112. The voiced/unvoiced (V/UV) decision data from the voiced/unvoiced discrimination unit 107 is taken out through output terminal 113. The data from output terminals 111 to 113 is converted into a signal of a predetermined transmission format and transmitted.

[0054] Each of these data is obtained by processing the data within a block of N samples (for example, 256 samples). Because the block advances along the time axis in units of a frame of L samples, the transmitted data is obtained frame by frame. That is, the pitch data, V/UV decision data, and amplitude data are updated at the frame period.

[0055] Next, the outline of the synthesis side (decoder side), which synthesizes a speech signal from the transmitted data, will be described with reference to FIG. 8. In FIG. 8, the vector-quantized amplitude data is supplied to input terminal 121, the encoded pitch data to input terminal 122, and the V/UV decision data to input terminal 123. The quantized amplitude data from input terminal 121 is sent to the inverse vector quantization unit 124 and dequantized, then sent to the data number inverse conversion unit 125 and inversely converted; the resulting amplitude data is sent to the voiced sound synthesis unit 126 and the unvoiced sound synthesis unit 127. The encoded pitch data from input terminal 122 is decoded by the pitch decoding unit 128 and sent to the data number inverse conversion unit 125, the voiced sound synthesis unit 126, and the unvoiced sound synthesis unit 127. The V/UV decision data from input terminal 123 is sent to the voiced sound synthesis unit 126 and the unvoiced sound synthesis unit 127.

[0056] The voiced sound synthesis unit 126 synthesizes a voiced waveform on the time axis by, for example, cosine wave synthesis. The unvoiced sound synthesis unit 127 synthesizes an unvoiced waveform on the time axis by, for example, filtering white noise with a band-pass filter. The voiced and unvoiced synthesized waveforms are added in the adder 129 and taken out from output terminal 130. The amplitude data, pitch data, and V/UV decision data are updated once per analysis frame (L samples, for example 160 samples). To improve continuity between frames (smoothing), each amplitude and pitch value is treated as the value at, for example, the center of a frame, and the values up to the center of the next frame (one synthesis frame) are obtained by interpolation. In other words, within one synthesis frame (for example, from the center of one analysis frame to the center of the next), the data values at the leading sample point and at the trailing sample point (the start of the next synthesis frame) are given, and the values between these sample points are obtained by interpolation.

[0057] The synthesis processing in the voiced sound synthesis unit 126 is described in detail below. Let V_m(n) be the voiced sound for one synthesis frame (L samples, for example 160 samples) on the time axis in the m-th band (the band of the m-th harmonic) classified as V (voiced). Using the time index (sample number) n within the synthesis frame, it can be written as

$$V_m(n) = A_m(n)\cos\bigl(\theta_m(n)\bigr), \qquad 0 \le n < L \tag{16}$$

The voiced sounds of all bands classified as V (voiced) are summed (ΣV_m(n)) to synthesize the final voiced sound V(n).

[0058] A_m(n) in equation (16) is the amplitude of the m-th harmonic interpolated from the start to the end of the synthesis frame. Most simply, the value of the m-th harmonic of the amplitude data, which is updated frame by frame, is linearly interpolated. That is, with A_0m the amplitude of the m-th harmonic at the start of the synthesis frame (n = 0) and A_Lm its amplitude at the end of the synthesis frame (n = L, the start of the next synthesis frame), A_m(n) is calculated as

$$A_m(n) = \frac{(L-n)\,A_{0m}}{L} + \frac{n\,A_{Lm}}{L} \tag{17}$$
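Equation (17) is a straight-line blend between the two frame-boundary amplitudes. A minimal sketch, with a hypothetical function name:

```python
def interp_amplitude(a0, aL, L):
    """A_m(n) for n = 0 .. L-1 following equation (17)."""
    return [((L - n) * a0 + n * aL) / L for n in range(L)]
```

For L = 160, a start amplitude of 1.0, and an end amplitude of 2.0, the value at n = 80 is 1.5; the end value itself is reached at n = L, which is the first sample of the next synthesis frame.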

[0059] The phase θ_m(n) in equation (16) can be obtained from

$$\theta_m(n) = m\,\omega_{01}\,n + \frac{n^2\,m\,(\omega_{L1}-\omega_{01})}{2L} + \phi_{0m} + \Delta\omega\,n \tag{18}$$

In equation (18), φ_0m is the phase of the m-th harmonic at the start of the synthesis frame (n = 0), that is, the frame initial phase; ω_01 is the fundamental angular frequency at the start of the synthesis frame (n = 0); and ω_L1 is the fundamental angular frequency at the end of the synthesis frame (n = L, the start of the next synthesis frame). Δω in equation (18) is set to the smallest value that makes θ_m(L) equal to the phase φ_Lm at n = L.
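The sketch below evaluates the phase track of equation (18) and picks Δω as the smallest slope that brings θ_m(L) onto the target phase φ_Lm. It assumes the two phases are compared modulo 2π, which the text does not state explicitly; the function names are hypothetical.

```python
import math

def wrap(x):
    """Wrap an angle to the interval [-pi, pi)."""
    return (x + math.pi) % (2 * math.pi) - math.pi

def harmonic_phase(m, w0, wL, phi0, phiL, L):
    """theta_m(n) for n = 0 .. L following equation (18).
    w0, wL: fundamental angular frequency (rad/sample) at frame start and end."""
    # Phase reached at n = L when d_omega is zero.
    base_end = m * w0 * L + m * (wL - w0) * L / 2 + phi0
    # Smallest-magnitude slope that closes the gap to phiL (modulo 2*pi).
    d_omega = wrap(phiL - base_end) / L
    return [m * w0 * n + n * n * m * (wL - w0) / (2 * L) + phi0 + d_omega * n
            for n in range(L + 1)]
```

The quadratic term is the integral of a fundamental frequency that moves linearly from ω_01 to ω_L1 across the frame, so the harmonic glides smoothly rather than stepping at the frame boundary.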

[0060] In contrast, in the embodiment of the present invention, φ_0m + Δωn of equation (18) is not sent to the synthesis side; instead, the phase is predicted on the synthesis side. That is, as shown in equation (1) above, the phase prediction unit 22 adds m(ω_01 + ω_L1)L/2 to the phase ψ_0m of the m-th harmonic at time 0 (the start of the frame), that is, the frame initial phase, to predict the phase ψ_Lm at the end of the frame. The phase φ_m of each band is then given by adding ε_m to the predicted phase ψ_Lm, where ε_m is the prediction correction term for each band. In the present invention, Gaussian noise is used for this prediction correction term ε_m.
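The prediction and correction steps can be sketched as follows. How the standard deviation `sigma` is chosen for each band is not specified in this passage, so it is left as a caller-supplied parameter; the function names are hypothetical.

```python
import random

def predict_end_phase(psi0, m, w0, wL, L):
    """psi_Lm = psi_0m + m * (w01 + wL1) * L / 2 (frame-end phase prediction)."""
    return psi0 + m * (w0 + wL) * L / 2

def corrected_phase(psi0, m, w0, wL, L, sigma, rng=random):
    """phi_m = psi_Lm + eps_m, with eps_m drawn from zero-mean Gaussian noise.
    sigma is an assumed per-band standard deviation supplied by the caller."""
    return predict_end_phase(psi0, m, w0, wL, L) + rng.gauss(0.0, sigma)
```

The added term m(ω_01 + ω_L1)L/2 equals the phase advance of equation (18) at n = L with Δω = 0, so the prediction is the same linear-frequency glide without any transmitted phase.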

[0061] FIG. 9A shows an example of the spectrum of a speech signal, in which the bands with band number (harmonic number) m = 8, 9, and 10 are UV (unvoiced) and the other bands are V (voiced). The time-axis signals of the V (voiced) bands are synthesized by the voiced sound synthesis unit 126, and those of the UV (unvoiced) bands are synthesized by the unvoiced sound synthesis unit 127.

[0062] The unvoiced sound synthesis processing in the unvoiced sound synthesis unit 127 is described below. The white noise signal waveform on the time axis from the white noise generator 131 is windowed with a suitable window function (for example, a Hamming window) of a predetermined length (for example, 256 samples) and subjected to STFT (short-term Fourier transform) processing by the STFT processing unit 132, giving the power spectrum of the white noise on the frequency axis as shown in FIG. 9B. The power spectrum from the STFT processing unit 132 is sent to the band amplitude processing unit 133, where, as shown in FIG. 9C, the bands classified as UV (unvoiced) (for example, m = 8, 9, 10) are multiplied by the amplitude |A_m|_UV and the amplitudes of the other bands, classified as V (voiced), are set to 0. The band amplitude processing unit 133 is supplied with the amplitude data, pitch data, and V/UV decision data. The output of the band amplitude processing unit 133 is sent to the ISTFT processing unit 134, which converts it to a time-axis signal by inverse STFT processing, using the phase of the original white noise. The output of the ISTFT processing unit 134 is sent to the overlap-add unit 135, which repeats overlap and addition with appropriate weighting on the time axis (so that the original continuous noise waveform can be restored) to synthesize a continuous time-axis waveform. The output signal of the overlap-add unit 135 is sent to the adder 129.
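One frame of this processing can be sketched with a plain DFT, as below. The sketch covers windowing, per-band scaling, and the inverse transform only; overlap-add across frames is omitted. The band edges given as bin ranges, the use of `None` to mark voiced bands, and the function names are assumptions for the example, and each band is assumed to lie strictly between bin 0 and bin n/2.

```python
import cmath
import math
import random

def dft(x):
    """Direct O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft_real(X):
    """Inverse DFT, returning the real part of each time sample."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def unvoiced_frame(n, band_edges, uv_amps, rng):
    """One Hamming-windowed noise frame, shaped band by band.
    band_edges[i] = (lo_bin, hi_bin), inclusive; uv_amps[i] is the UV amplitude
    for that band, or None for a voiced band (its bins stay at zero)."""
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)) for t in range(n)]
    noise = [rng.gauss(0.0, 1.0) * w for w in hamming]
    X = dft(noise)
    Y = [0j] * n
    for (lo, hi), amp in zip(band_edges, uv_amps):
        if amp is None:
            continue
        for k in range(lo, hi + 1):
            Y[k] = amp * X[k]              # scale magnitude, keep noise phase
            Y[n - k] = Y[k].conjugate()    # mirror bin keeps the output real
    return idft_real(Y)
```

Scaling the complex noise bins, rather than building a new spectrum, is what preserves the phase of the original white noise through the inverse transform.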

[0063] The voiced and unvoiced signals synthesized by the synthesis units 126 and 127 and returned to the time axis are added by the adder 129 at a suitable fixed mixing ratio, and the reproduced speech signal is taken out from output terminal 130.

[0064] Thus, in the specific example in which the speech analysis/synthesis method according to the present invention is applied to MBE, making the noise used for phase prediction Gaussian allows its magnitude and variance to be controlled.

[0065] Although the configurations of the speech analysis side (encoder side) in FIG. 3 and the speech synthesis side (decoder side) in FIG. 7 are described in terms of hardware, they can also be implemented as a software program using a DSP (digital signal processor) or the like.

【００６６】[0066]

【図面の簡単な説明】[Brief description of drawings]

[Explanation of reference numerals]
10: Analysis unit
20: Synthesis unit
21: Voiced sound synthesis unit
22: Phase prediction unit
23: Noise addition unit
24: Phase correction unit
25: Sine wave generation unit
26: Amplitude amplification unit

Claims

[Claims]

1. A speech analysis/synthesis method comprising: a step of dividing an input speech signal into blocks and obtaining pitch information within each block; a step of converting the speech signal of each block onto the frequency axis to obtain frequency-axis data; a step of dividing the frequency-axis data into a plurality of bands based on the pitch information; a step of obtaining, for each of the divided bands, power information and voiced/unvoiced decision information; a step of transmitting the pitch information, the power information for each band, and the voiced/unvoiced decision information obtained in the preceding steps; a step of predicting a block end phase based on the transmitted pitch information of each block and a block initial phase; and a step of correcting the predicted block end phase using noise having a variance corresponding to each band.

2. The speech analysis/synthesis method according to claim 1, wherein the noise is Gaussian noise.