JPH05297897A

JPH05297897A - Voiced sound deciding method

Info

Publication number: JPH05297897A
Application number: JP4121461A
Authority: JP
Inventors: Masayuki Nishiguchi; 正之西口; Atsushi Matsumoto; 淳松本
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1992-04-15
Filing date: 1992-04-15
Publication date: 1993-11-12
Anticipated expiration: 2016-10-22
Also published as: JP3221050B2

Abstract

PURPOSE: To eliminate harsh abnormal sounds in MBE (multiband excitation) coding by making all bands of a block unvoiced when the signal level of a block containing a band judged to be voiced is at or below a first predetermined threshold. CONSTITUTION: One block of time-axis data is converted by an orthogonal transform unit 13 into one block of frequency-axis data, which a band division unit 14 divides into a plurality of bands; a V/UV discrimination unit 15 then decides, band by band, whether each band is voiced or unvoiced. When at least one band is judged to be voiced, a level detection unit 18 detects the signal level of the block containing that band and supplies it to a comparison unit 19. The comparison unit 19 compares the level with a threshold supplied to an input terminal 20, and when the level is below the threshold, a judgment unit 21 judges that the block in which the discrimination unit 15 found a voiced band is at the granular-noise level, i.e., has a mean amplitude of 1 LSB (least significant bit) or less.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a voiced sound discrimination method that distinguishes voiced sound from noise or unvoiced sound in frequency-axis data obtained by dividing an input speech signal into blocks and converting each block to the frequency axis.

[0002]

[Prior Art] Speech is classified by its acoustic properties into voiced and unvoiced sound. Voiced sound is speech accompanied by vocal cord vibration and is observed as a periodic oscillation. Unvoiced sound is speech without vocal cord vibration and is observed as an aperiodic sound. Most ordinary speech is voiced; the only unvoiced sounds are the special consonants known as unvoiced consonants. The period of a voiced sound is determined by the period of the vocal cord vibration; this is called the pitch period, and its reciprocal is called the pitch frequency. The pitch period and pitch frequency (hereinafter, "pitch" alone means the pitch period) are important factors that determine the height of the voice and its intonation. How accurately the pitch is captured therefore governs the quality of the reproduced speech. When capturing the pitch, however, the noise surrounding the speech (so-called background noise) and the quantization noise introduced by quantization must be taken into account. Distinguishing such noise, or unvoiced sound, from voiced sound is important when coding a speech signal.

[0003] Specific examples of speech signal coding include MBE (Multiband Excitation) coding, SBE (Singleband Excitation) coding, harmonic coding, SBC (Sub-band Coding), LPC (Linear Predictive Coding), and coding based on the DCT (Discrete Cosine Transform), MDCT (Modified DCT), or FFT (Fast Fourier Transform).

[0004] In MBE coding, for example, pitch extraction from the input speech waveform is arranged so that the pitch trajectory is easy to track even when no clear pitch appears. On the decoding (synthesis) side, a voiced waveform on the time axis is synthesized from this pitch by cosine-wave synthesis, added to a separately synthesized unvoiced waveform on the time axis, and output.

[0005]

[Problems to Be Solved by the Invention] In MBE coding, when hum noise with a low pitch frequency (50 Hz) is superimposed on the background of the input speech signal, the synthesis side performs synthesis so that the peaks of the cosine waves coincide at that pitch. That is, the cosine waves are combined by the same fixed-phase addition used for synthesizing voiced sound. Because superimposing cosine waves with a fixed phase imposes periodicity, hum that was merely a kind of background noise turns into a periodic impulse-like waveform. In other words, amplitude that should be spread out along the time axis is concentrated in one part of the frame, and because this concentration repeats periodically, it is reproduced as a very harsh abnormal sound.

[0006] The present invention has been made in view of the above circumstances. Its object is to provide a voiced sound discrimination method that reliably distinguishes voiced sound from noise or unvoiced sound and thereby allows the synthesis side to suppress the generation of abnormal sounds.

[0007]

[Means for Solving the Problems] To solve the above problems, the voiced sound discrimination method according to the present invention comprises the steps of: dividing an input speech signal into blocks and converting each block to the frequency axis to obtain frequency-axis data; dividing the frequency-axis data into a plurality of bands; discriminating, for each of the divided bands, whether it is voiced; detecting the signal level of a block in which a band discriminated as voiced exists; and making all bands unvoiced when the detected signal level is at or below a first predetermined threshold.

[0008] A voiced sound discrimination method according to another aspect of the invention solves the above problems by comprising the steps of: dividing an input speech signal into blocks and converting each block to the frequency axis to obtain frequency-axis data; dividing the frequency-axis data into a plurality of bands; discriminating, for each of the divided bands, whether it is voiced; detecting the signal level of a block in which a band discriminated as voiced exists; obtaining the peak value of the signal within the block; obtaining a value by normalizing the peak value on the basis of the signal level within the block; and making all bands unvoiced when the normalized value is at or below a second predetermined threshold and the signal level of the block in which a band discriminated as voiced exists is at or below a first predetermined threshold.

[0009] The signal level may also be detected from the frequency-axis data of the block in which a band judged to be voiced exists, or from the data obtained by dividing the frequency-axis data into bands.

[0010] Discriminating whether a band is voiced means deciding whether it is voiced sound or else noise or unvoiced sound; voiced sound is reliably identified, and so is noise or unvoiced sound. In other words, noise (including very low-level hum) or unvoiced sound can also be identified in the input speech signal. In such a case, forcibly making all bands of the input speech signal unvoiced, for example, suppresses the generation of abnormal sounds on the synthesis side.

[0011]

[Operation] The input speech signal, in block units, is converted to the frequency axis; the frequency-axis data are divided into a plurality of bands; and each band is discriminated as voiced or not. When the signal level of a block containing a band discriminated as voiced is at or below the first predetermined threshold, all bands in that block are made unvoiced. The generation of abnormal sounds can therefore be suppressed.

[0012]

[Embodiments] An embodiment of the voiced sound discrimination method according to the present invention is described below with reference to the drawings. FIG. 1 shows the schematic configuration of a voiced sound discrimination apparatus used to explain the method of the first embodiment. In this embodiment, the input speech signal is divided into blocks and converted to the frequency axis, and all bands in a block are made unvoiced depending on two quantities. The first is the value obtained by normalizing the peak value of the signal within one block by the signal level within that block. The second is the signal level of a block that contains a voiced band, where one block of frequency-axis data is divided into a plurality of bands, each band is discriminated as voiced (V) or unvoiced (UV), and at least one of the bands is judged to be voiced. Specifically, all bands are made unvoiced when the normalized value is at or below a predetermined threshold and the signal level is also at or below a predetermined threshold.

[0013] In FIG. 1, input terminal 11 receives a speech signal that has passed through a filter (not shown), such as an HPF (high-pass filter), which removes the so-called DC offset and at least the low-frequency components (200 Hz and below) for band limiting (for example, to 200 to 3400 Hz). This signal is sent to the windowing unit 12. The windowing unit 12 applies a window, for example a Hamming window, to each block of N samples (for example, N = 256) and shifts the block along the time axis at intervals of one frame of L samples (for example, L = 160), so that consecutive blocks overlap by N − L samples (for example, 96 samples). The N-sample blocks from the windowing unit 12 are supplied to the orthogonal transform unit 13. The orthogonal transform unit 13 appends, for example, 1792 zero-valued samples to the 256-sample block (so-called zero padding) to form 2048 samples, and applies an orthogonal transform such as an FFT (fast Fourier transform) to this 2048-sample time-axis sequence to convert it into a frequency-axis sequence. The frequency-axis data from the orthogonal transform unit 13 are supplied to the band division unit 14 and, in some cases, also to the signal level detection unit 18 described later. The band division unit 14 divides the supplied frequency-axis data into a plurality of bands according to the pitch extracted by a pitch extraction unit (not shown). The frequency-axis data divided into bands are supplied to the V (voiced) / UV (unvoiced) discrimination unit 15, which makes a V/UV decision for each band. This V/UV discrimination information is supplied to the correction unit 16 and the judgment unit 21.
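The block and frame arithmetic described above can be checked with a short sketch. It uses the example values from the text (N = 256, L = 160, zero padding to 2048); the function name `block_starts` and the 8000-sample test length are illustrative assumptions, not part of the patent.

```python
def block_starts(num_samples, block_len=256, frame_step=160):
    """Start index of every full block of block_len samples, advanced by frame_step."""
    last_start = num_samples - block_len
    return list(range(0, last_start + 1, frame_step))

N, L = 256, 160                 # example block length and frame interval from the text
starts = block_starts(8000, N, L)
overlap = N - L                 # samples shared by two consecutive blocks
zeros_added = 2048 - N          # zero-valued samples appended before the transform
```

With these values, consecutive blocks share 96 samples and 1792 zeros are appended, matching the figures quoted in the paragraph.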

[0014] Meanwhile, the signal level detection unit 18 detects the signal level of each block of the time-axis data from the windowing unit 12 and supplies the level detected for a block containing a voiced band to the comparison unit 19. The comparison unit 19 compares this signal level with the first predetermined threshold supplied from input terminal 20 and supplies the comparison result to the judgment unit 21.

[0015] The signal level detection unit 18 may instead detect the signal level of each block from the frequency-axis data supplied by the orthogonal transform unit 13, or from the band-divided frequency-axis data supplied by the band division unit 14.

[0016] The time-axis data from the windowing unit 12 are also supplied to the peak detection unit 22 and the intra-block variance detection unit 23. The peak detection unit 22 detects the peak value of the data in each block on the time axis and supplies it to the normalization unit 24. The intra-block variance detection unit 23 detects the variance of the signal level in each block on the time axis and supplies it to the normalization unit 24. The normalization unit 24 calculates a normalized value from the peak value and the variance of the signal level and supplies it to the comparison unit 25. The comparison unit 25 compares the normalized value with the second predetermined threshold supplied from input terminal 26 and supplies the comparison result to the judgment unit 21.

[0017] The judgment unit 21 controls the correction unit 16. For example, when the signal level is at or below the first predetermined threshold and the normalized peak value is at or below the second predetermined threshold, the judgment unit 21 controls the correction unit 16 so that all bands of the one block of frequency-axis data are made unvoiced. Unvoiced-sound information is then output from output terminal 17.

[0018] The operation of this embodiment is described below. Let the number of samples N in one block cut out by the Hamming window of the windowing unit 12 be 256, and let the input sample sequence be x(n). The time-axis data of this one block (256 samples) are converted by the orthogonal transform unit 13 into one block of frequency-axis data, which the band division unit 14 divides into a plurality of bands. The V/UV discrimination unit 15 then makes a V/UV decision for each band of the divided block. When no band at all is voiced, that is, when all bands are unvoiced, the correction unit 16 makes no correction.

[0019] When the V/UV discrimination unit 15 judges at least one band to be voiced, the signal level detection unit 18 detects the signal level of the block containing that voiced band. The detected signal level is supplied to the comparison unit 19 and compared with the threshold from input terminal 20. The comparison result is supplied to the judgment unit 21.

[0020] The case of interest here, in which the V/UV discrimination unit 15 judges at least one band to be voiced, is the case where the level of the input speech signal amounts to about 1 LSB (least significant bit) of the quantization step. That is, the V/UV discrimination unit 15 has classified noise of about 1 LSB (granular noise), and when the signal level of that block of speech is very small, all bands are made unvoiced.

[0021] Specifically, the signal level detection unit 18 calculates the energy variance σ by

[0022]

[Equation 1]

[0023] equation (1) above. Here, x̄ is

[0024]

[Equation 2]

[0025] the mean value of x(n), given by equation (2).

[0026] The variance σ is supplied to the comparison unit 19 when the V/UV discrimination unit 15 has judged at least one band to be voiced. The comparison unit 19 compares σ with the threshold g_t supplied to input terminal 20 (for example, g_t = 8). If σ is smaller than g_t, the judgment unit 21 judges that the block in which the V/UV discrimination unit 15 found at least one voiced band is at the granular-noise level, that is, has a mean amplitude of 1 LSB or less.
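The level check can be sketched as follows. Equations (1) and (2) are not reproduced in this text, so the sketch assumes that σ is the root-mean-square deviation of the block samples from their mean; that form, the function names, and the use of "at or below" (the wording of the claims) are assumptions. Only the example threshold g_t = 8 comes from the text.

```python
import math

def block_level(x):
    """Assumed form of eq. (1): RMS deviation of the samples from the block mean."""
    n = len(x)
    mean = sum(x) / n                      # assumed form of eq. (2)
    return math.sqrt(sum((v - mean) ** 2 for v in x) / n)

def is_granular_level(x, g_t=8.0):
    """True when the block level is at or below the threshold g_t."""
    return block_level(x) <= g_t

# A block toggling by +/-1 LSB versus a full-scale-like tone.
quiet = [1, -1] * 128
loud = [1000 * math.sin(2 * math.pi * n / 32) for n in range(256)]
```

Under this assumption, the ±1 LSB block has a level of 1 and is flagged, while the tone (level near 707) is not.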

[0027] The peak detection unit 22 detects the peak value P_e of one block of the frequency-axis data by

[0028]

[Equation 3]

equation (3), where x̄ is the mean value of x(n) as described above. The normalization unit 24 then divides the peak value P_e by the variance σ of equation (1) to obtain the normalized value P_n:

$$P_n = \frac{P_e}{\sigma} \qquad (4)$$

The comparison unit 25 compares the normalized value P_n with the threshold P_thn supplied from input terminal 26 and supplies the comparison information to the judgment unit 21. The judgment unit 21 also receives the comparison information obtained when the comparison unit 19 compares the variance σ with the threshold g_t supplied from input terminal 20 (for example, g_t = 8). The judgment unit 21 supplies a control signal to the correction unit 16 according to the comparison information from the comparison units 25 and 19. Specifically, when the normalized value P_n is at or below the predetermined threshold P_t and the signal level σ is also at or below the predetermined threshold g_t, the correction unit 16 is activated and makes all bands of the one block of frequency-axis data unvoiced.
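The combined decision can be sketched as below. Equations (1) and (3) are not reproduced in this text, so the sketch assumes σ is the RMS deviation from the block mean and P_e is the largest absolute deviation from that mean, both taken on the time-axis samples. The function name and the value p_t = 4 are hypothetical; the text gives a numerical example only for g_t.

```python
import math

def apply_low_level_override(x, vuv_flags, g_t=8.0, p_t=4.0):
    """Return per-band V/UV flags (True = voiced), forced to all-unvoiced
    when both the block level and the normalized peak are at or below threshold."""
    if not any(vuv_flags):
        return list(vuv_flags)              # already all unvoiced: no correction
    n = len(x)
    mean = sum(x) / n
    # Assumed form of eq. (1): RMS deviation from the block mean.
    sigma = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    # Assumed form of eq. (3): largest absolute deviation from the block mean.
    p_e = max(abs(v - mean) for v in x)
    p_n = p_e / sigma if sigma > 0 else 0.0  # eq. (4), guarded for a constant block
    if sigma <= g_t and p_n <= p_t:
        return [False] * len(vuv_flags)
    return list(vuv_flags)
```

The second condition keeps a block with a low overall level but one sharp peak from being overridden, which is the reason the text pairs the level test with the normalized peak test.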

[0029] Therefore, in this embodiment, for a signal other than one judged unvoiced in all bands, that is, a signal in which at least one band has been judged voiced, all bands are forcibly made unvoiced when the signal level of the block is very small and the peak value is also very small. This prevents the abnormal sound that fixed-phase addition would otherwise produce on the synthesis side.

[0030] The voiced sound discrimination method according to the present invention is applicable to vocoders such as the MBE vocoder. That is, when the frequency-axis data of one block of the input speech signal are judged to be noise or unvoiced sound, forcibly making all bands unvoiced prevents the generation of abnormal sounds on the synthesis side of a vocoder such as the MBE vocoder.

[0031] A specific example of an MBE (Multiband Excitation) vocoder, a kind of analysis-synthesis coding apparatus for speech signals (a so-called vocoder) to which the voiced sound discrimination method of the present invention can be applied, is described below with reference to the drawings. This MBE vocoder is disclosed in D. W. Griffin and J. S. Lim, "Multiband Excitation Vocoder," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 36, no. 8, pp. 1223-1235, Aug. 1988. Whereas conventional vocoders such as the PARCOR (PARtial auto-CORrelation) vocoder switch between voiced and unvoiced segments block by block or frame by frame when modeling speech, the MBE vocoder models speech on the assumption that voiced and unvoiced regions coexist in the frequency domain at the same instant (within the same block or frame).

[0032] FIG. 2 is a block diagram showing the overall schematic configuration of an embodiment of the MBE vocoder. In FIG. 2, a speech signal is supplied to input terminal 101 and sent to a filter 102 such as an HPF (high-pass filter), which removes the so-called DC offset and at least the low-frequency components (200 Hz and below) for band limiting (for example, to 200 to 3400 Hz). The signal from the filter 102 is sent to the pitch extraction unit 103 and to the windowing unit 104. In the pitch extraction unit 103, the input speech data are divided into blocks of a predetermined number of samples N (for example, N = 256), i.e., cut out with a rectangular window, and the pitch of the speech signal in each block is extracted. The cut-out block (256 samples) is shifted along the time axis at a frame interval of L samples (for example, L = 160), as shown in FIG. 3A, so that consecutive blocks overlap by N − L samples (for example, 96 samples). The windowing unit 104 applies a predetermined window function, for example a Hamming window, to each block of N samples and shifts the windowed block along the time axis at intervals of one frame of L samples.

[0033] Expressed as a formula, this windowing is

$$x_w(k, q) = x(q)\, w(kL - q) \qquad (5)$$

In equation (5), k is the block number and q is the time index (sample number) of the data; the equation states that the q-th sample x(q) of the unprocessed input signal is windowed by the window function w(kL − q) of the k-th block to give the data x_w(k, q). For the rectangular window shown in FIG. 3A, used in the pitch extraction unit 103, the window function w_r(r) is

$$w_r(r) = \begin{cases} 1, & 0 \le r < N \\ 0, & r < 0,\ N \le r \end{cases} \qquad (6)$$

For the Hamming window shown in FIG. 3B, used in the windowing unit 104, the window function w_h(r) is

$$w_h(r) = \begin{cases} 0.54 - 0.46\cos\!\left(\dfrac{2\pi r}{N-1}\right), & 0 \le r < N \\ 0, & r < 0,\ N \le r \end{cases} \qquad (7)$$

With either window function w_r(r) or w_h(r), the non-zero interval of the window function w(r) (= w(kL − q)) in equation (5) is 0 ≤ kL − q < N, which rearranges to kL − N < q ≤ kL. Thus, for the rectangular window, for example, w_r(kL − q) = 1 when kL − N < q ≤ kL, as shown in FIG. 4. Equations (5) to (7) also show that a window of length N (= 256) samples advances L (= 160) samples at a time. In the following, the non-zero sample sequences of N points (0 ≤ r < N) cut out by the window functions of equations (6) and (7) are denoted x_wr(k, r) and x_wh(k, r), respectively.
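Equations (5) and (7) can be written out directly. The sketch below evaluates the Hamming window of equation (7) and forms the windowed block x_wh(k, r) = x(kL − r) w_h(r) for 0 ≤ r < N, which is equation (5) with r = kL − q. The function names, and the choice to treat samples outside the available signal as zero, are assumptions made for illustration.

```python
import math

def hamming(r, n=256):
    """Window function w_h(r) of eq. (7): non-zero only for 0 <= r < n."""
    if 0 <= r < n:
        return 0.54 - 0.46 * math.cos(2 * math.pi * r / (n - 1))
    return 0.0

def windowed_block(x, k, n=256, step=160):
    """x_wh(k, r) = x(kL - r) * w_h(r), r = 0..n-1 (eq. (5) with r = kL - q).

    Samples with an index outside x are taken as 0 (an assumption for this sketch).
    """
    out = []
    for r in range(n):
        q = k * step - r
        sample = x[q] if 0 <= q < len(x) else 0.0
        out.append(sample * hamming(r, n))
    return out
```

Because r = kL − q, block k covers the input samples kL − N < q ≤ kL, the same interval derived in the paragraph above.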

[0034] In the windowing unit 104, as shown in FIG. 5, 1792 zero-valued samples are appended (so-called zero padding) to the 256-sample sequence x_wh(k, r) of one block windowed by the Hamming window of equation (7), giving 2048 samples. The orthogonal transform unit 105 then applies an orthogonal transform such as an FFT (fast Fourier transform) to this 2048-sample time-axis sequence.
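A minimal sketch of the zero padding and transform follows. It uses a plain recursive radix-2 FFT from the standard library rather than any particular implementation from the patent; the function names are illustrative, and the 8 kHz sampling rate in the usage lines is the example rate mentioned elsewhere in the text.

```python
import cmath
import math

def fft(a):
    """Recursive radix-2 decimation-in-time FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2])
    odd = fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def zero_padded_spectrum(block, fft_len=2048):
    """Append zeros to the block up to fft_len samples, then transform."""
    padded = list(block) + [0.0] * (fft_len - len(block))
    return fft(padded)

# Usage: a 1 kHz tone at an assumed 8 kHz sampling rate, one unwindowed 256-sample block.
tone = [math.cos(2 * math.pi * 1000 * n / 8000) for n in range(256)]
spectrum = zero_padded_spectrum(tone)
peak_bin = max(range(1025), key=lambda k: abs(spectrum[k]))
```

Padding 256 samples to 2048 does not add information, but it samples the spectrum eight times more densely (a bin spacing of 8000/2048 Hz at 8 kHz), which helps when the spectrum is later divided into bands at harmonics of a fractional pitch.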

【００３５】ピッチ抽出部１０３では、上記ｘ_wr(k,r)
のサンプル列（１ブロックＮサンプル）に基づいてピッ
チ抽出が行われる。このピッチ抽出法には、時間波形の
周期性や、スペクトルの周期的周波数構造や、自己相関
関数を用いるもの等が知られているが、本実施例では、
センタクリップ波形の自己相関法を採用している。この
ときのブロック内でのセンタクリップレベルについて
は、１ブロックにつき１つのクリップレベルを設定して
もよいが、ブロックを細分割した各部（各サブブロッ
ク）の信号のピークレベル等を検出し、これらの各サブ
ブロックのピークレベル等の差が大きいときに、ブロッ
ク内でクリップレベルを段階的にあるいは連続的に変化
させるようにしている。このセンタクリップ波形の自己
相関データのピーク位置に基づいてピーク周期を決めて
いる。このとき、現在フレームに属する自己相関データ
（自己相関は１ブロックＮサンプルのデータを対象とし
て求められる）から複数のピークを求めておき、これら
The pitch extraction unit 103 extracts the pitch from the sample sequence $x_{wr}(k, r)$ above (N samples per block). Known pitch extraction methods use the periodicity of the time waveform, the periodic frequency structure of the spectrum, or the autocorrelation function; this embodiment uses the autocorrelation method applied to a center-clipped waveform. For the center clip level within a block, a single clip level may be set per block. Alternatively, the peak signal level of each part (sub-block) obtained by subdividing the block is detected, and when the peak levels differ greatly between sub-blocks, the clip level is changed stepwise or continuously within the block. The pitch period is determined from the peak positions of the autocorrelation data of the center-clipped waveform. Specifically, a plurality of peaks is found in the autocorrelation data belonging to the current frame (the autocorrelation is computed over the N samples of one block). When the largest of these peaks is at or above a predetermined threshold, the position of that largest peak is taken as the pitch period. Otherwise, a peak is sought within a pitch range that satisfies a predetermined relationship to the pitch found in a frame other than the current one, for example the preceding or following frame (for example, within ±20% of the pitch of the previous frame), and the pitch of the current frame is determined from that peak position. The pitch extraction unit 103 performs a relatively rough open-loop pitch search. The extracted pitch data is sent to the high-precision (fine) pitch search unit 106, which performs a closed-loop high-precision pitch search (fine pitch search).
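The peak-selection rule above can be sketched as follows. This is a minimal illustration, assuming a single clip level per block and a peak threshold expressed relative to the zero-lag autocorrelation. The names `coarse_pitch`, `clip_ratio`, `peak_threshold`, `min_lag`, and `max_lag`, and all default values, are hypothetical; the source specifies only the ±20% range around the previous pitch.

```python
import math


def center_clip(x, clip_level):
    """Zero samples inside +/-clip_level and shift the rest toward zero."""
    out = []
    for v in x:
        if v > clip_level:
            out.append(v - clip_level)
        elif v < -clip_level:
            out.append(v + clip_level)
        else:
            out.append(0.0)
    return out


def autocorr(x, max_lag):
    """Unnormalized autocorrelation for lags 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def coarse_pitch(x, prev_pitch, clip_ratio=0.6, peak_threshold=0.5,
                 min_lag=20, max_lag=147):
    """Integer pitch lag for one block; all defaults are assumed values."""
    clip = clip_ratio * max(abs(v) for v in x)
    r = autocorr(center_clip(x, clip), max_lag)
    if r[0] <= 0.0:
        return prev_pitch  # nothing survives clipping: keep previous pitch
    peaks = [lag for lag in range(min_lag, max_lag)
             if r[lag] > r[lag - 1] and r[lag] >= r[lag + 1]]
    if not peaks:
        return prev_pitch
    best = max(peaks, key=lambda lag: r[lag])
    if r[best] / r[0] >= peak_threshold:
        return best  # largest peak is strong enough
    # Otherwise look only within +/-20% of the previous frame's pitch.
    near = [lag for lag in peaks
            if 0.8 * prev_pitch <= lag <= 1.2 * prev_pitch]
    return max(near, key=lambda lag: r[lag]) if near else prev_pitch
```

The result is an integer lag, which matches the integer-valued coarse pitch that the fine search later refines.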

[0036] The high-precision (fine) pitch search unit 106 is supplied with the integer-valued coarse pitch data extracted by the pitch extraction unit 103 and with the frequency-axis data produced by the orthogonal transform unit 105, for example by FFT. The unit varies the pitch by ± several samples around the coarse pitch value, in steps of 0.2 to 0.5, and converges on the optimum fractional (floating-point) fine pitch value. The fine search uses the so-called analysis-by-synthesis method: the pitch is chosen so that the synthesized power spectrum is closest to the power spectrum of the original sound.

[0037] The fine pitch search is now described. The MBE vocoder assumes a model in which the spectral data $S(j)$ on the frequency axis, obtained by an orthogonal transform such as the FFT, is expressed as

$$S(j) = H(j)\,|E(j)|, \qquad 0 < j < J \tag{8}$$

Here $J$ corresponds to $\omega_s/4\pi = f_s/2$; when the sampling frequency $f_s = \omega_s/2\pi$ is, for example, 8 kHz, $J$ corresponds to 4 kHz. In equation (8), when the spectral data $S(j)$ on the frequency axis has the waveform shown in FIG. 6A, $H(j)$ is the spectral envelope of the original spectral data $S(j)$, as shown in FIG. 6B, and $E(j)$ is the spectrum of an equal-level, periodic excitation signal, as shown in FIG. 6C. That is, the FFT spectrum $S(j)$ is modeled as the product of the spectral envelope $H(j)$ and the power spectrum $|E(j)|$ of the excitation signal.

[0038] The power spectrum $|E(j)|$ of the excitation signal is formed by taking into account the periodicity (pitch structure) of the frequency-axis waveform, which is determined by the pitch, and repeating a spectral waveform corresponding to one band at every band on the frequency axis. The one-band waveform can be formed, for example, by appending 1792 samples of zero data (zero padding) to the 256-sample Hamming window function shown in FIG. 5, treating the result as a time-axis signal, applying the FFT, and cutting out the resulting impulse waveform, which has a certain bandwidth on the frequency axis, according to the pitch.

[0039] Next, for each band divided according to the pitch, a value (a kind of amplitude) $|A_m|$ is determined that represents $H(j)$ in that band, that is, that minimizes the error in that band. Letting $a_m$ and $b_m$ be the lower and upper limit points of the m-th band (the band of the m-th harmonic), the error $\varepsilon_m$ of the m-th band is expressed as

[0040]

[Equation 4] (equation (9))

The $|A_m|$ that minimizes this error $\varepsilon_m$ is

[0041]

[Equation 5] (equation (10))

With the $|A_m|$ of equation (10), the error $\varepsilon_m$ is minimized. This amplitude $|A_m|$ is determined for each band, and each resulting amplitude $|A_m|$ is used to compute the per-band error $\varepsilon_m$ defined by equation (9). The sum $\sum\varepsilon_m$ of these per-band errors over all bands is then computed. This all-band error sum $\sum\varepsilon_m$ is computed for several slightly different pitches, and the pitch that minimizes $\sum\varepsilon_m$ is found.
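The images for equations (9) and (10) are not reproduced in this text. A form consistent with the surrounding description, assuming a squared-error criterion over the bins $a_m \le j \le b_m$ of the band (an assumption, not confirmed by the text here), would be:

```latex
\varepsilon_m = \sum_{j=a_m}^{b_m} \bigl( |S(j)| - |A_m|\,|E(j)| \bigr)^2
\qquad\text{and}\qquad
\frac{\partial \varepsilon_m}{\partial |A_m|} = 0
\;\Longrightarrow\;
|A_m| = \frac{\sum_{j=a_m}^{b_m} |S(j)|\,|E(j)|}{\sum_{j=a_m}^{b_m} |E(j)|^{2}}
```

Under that assumption, $|A_m|$ is the ordinary least-squares gain that best fits $|A_m||E(j)|$ to $|S(j)|$ within the band.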

[0042] That is, several pitch candidates are prepared above and below the rough pitch found by the pitch extraction unit 103, for example in steps of 0.25. The error sum $\sum\varepsilon_m$ is computed for each of these slightly different pitches. Once a pitch is fixed, the bandwidth is fixed; using equation (10), the error $\varepsilon_m$ of equation (9) can then be computed from the power spectrum $|S(j)|$ of the frequency-axis data and the excitation signal spectrum $|E(j)|$, and the sum $\sum\varepsilon_m$ over all bands can be obtained. This error sum is computed for every candidate pitch, and the pitch giving the smallest error sum is selected as the optimum pitch. In this way the high-precision pitch search unit 106 finds the optimum fine pitch (for example in steps of 0.25), and the amplitude $|A_m|$ corresponding to that optimum pitch is determined.
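The candidate loop described above can be sketched as follows. This is an illustration only: it assumes a squared-error band criterion with a least-squares amplitude (the equation images are not reproduced in this text), and it takes the construction of the excitation spectrum and band edges as a caller-supplied function. The names `fine_pitch_search`, `make_bands`, `step`, and `span` are hypothetical.

```python
def band_amplitude(S, E, a, b):
    """Least-squares gain fitting A*E(j) to S(j) over bins a..b (assumed form)."""
    num = sum(S[j] * E[j] for j in range(a, b + 1))
    den = sum(E[j] * E[j] for j in range(a, b + 1))
    return num / den if den > 0.0 else 0.0


def band_error(S, E, a, b):
    """Squared error of the band after fitting its amplitude (assumed form)."""
    A = band_amplitude(S, E, a, b)
    return sum((S[j] - A * E[j]) ** 2 for j in range(a, b + 1))


def fine_pitch_search(S, coarse_pitch, make_bands, step=0.25, span=4):
    """Try pitches around coarse_pitch and keep the one with least total error.

    make_bands(pitch) is a hypothetical helper returning (E, bands), where E
    is the excitation magnitude spectrum for that pitch and bands is a list
    of inclusive (a_m, b_m) bin ranges.
    """
    best_pitch, best_err = None, None
    for k in range(-span, span + 1):
        pitch = coarse_pitch + k * step
        E, bands = make_bands(pitch)
        err = sum(band_error(S, E, a, b) for a, b in bands)
        if best_err is None or err < best_err:
            best_pitch, best_err = pitch, err
    return best_pitch, best_err
```

Because the band edges change with every candidate pitch, the excitation and bands are rebuilt inside the loop rather than once up front.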

[0043] To simplify the explanation, the description of the fine pitch search above assumes that all bands are voiced. As stated earlier, however, the MBE vocoder adopts a model in which unvoiced regions exist on the frequency axis at the same instant, so a voiced/unvoiced decision must be made for each band.

[0044] The optimum pitch and amplitude $|A_m|$ data from the high-precision pitch search unit 106 are sent to the voiced/unvoiced discrimination unit 107, which makes a voiced/unvoiced decision for each band. The decision uses the NSR (noise-to-signal ratio). The NSR of the m-th band is expressed as

[0045]

[Equation 6]

When this NSR value is larger than a predetermined threshold (for example 0.3), that is, when the error is large, the approximation of $|S(j)|$ by $|A_m||E(j)|$ in that band is judged to be poor (the excitation signal $|E(j)|$ is unsuitable as a basis), and the band is classified as UV (unvoiced). Otherwise the approximation is judged to be reasonably good, and the band is classified as V (voiced).
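A minimal sketch of the per-band decision follows. The image for the NSR equation is not reproduced in this text, so the sketch assumes the NSR is the fitting error of the band divided by the band's signal energy; that form, the function names, and the handling of an empty band are assumptions. Only the example threshold 0.3 comes from the source.

```python
def band_nsr(S, E, a, b):
    """Noise-to-signal ratio of bins a..b (assumed: fit error / signal energy)."""
    num = sum(S[j] * E[j] for j in range(a, b + 1))
    den = sum(E[j] * E[j] for j in range(a, b + 1))
    A = num / den if den > 0.0 else 0.0
    err = sum((S[j] - A * E[j]) ** 2 for j in range(a, b + 1))
    energy = sum(S[j] ** 2 for j in range(a, b + 1))
    # Assumption: a band with no energy cannot be called voiced.
    return err / energy if energy > 0.0 else float("inf")


def classify_bands(S, E, bands, threshold=0.3):
    """Label each band 'UV' when its NSR exceeds the threshold, else 'V'."""
    return ["UV" if band_nsr(S, E, a, b) > threshold else "V"
            for a, b in bands]
```

For example, a band whose spectrum is an exact multiple of the excitation has an NSR of zero and is labeled V, while a flat band fitted by a single-peak excitation is labeled UV.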

[0046] The amplitude re-evaluation unit 108 is supplied with the frequency-axis data from the orthogonal transform unit 105, the fine pitch and the estimated amplitude $|A_m|$ from the high-precision pitch search unit 106, and the V/UV (voiced/unvoiced) discrimination data from the voiced/unvoiced discrimination unit 107. For bands judged unvoiced (UV) by the voiced/unvoiced discrimination unit 107, the amplitude re-evaluation unit 108 determines the amplitude again. The amplitude $|A_m|_{UV}$ for such a UV band is given by

[0047]

[Equation 7]

[0048] The data from the amplitude re-evaluation unit 108 is sent to the data number conversion unit 109 (which performs a kind of sampling rate conversion). Because the number of bands into which the frequency axis is divided depends on the pitch, the number of data (in particular the number of amplitude data) also varies; the data number conversion unit 109 converts this to a fixed number. For example, if the effective band extends to 3400 Hz, it is divided into 8 to 63 bands depending on the pitch, so the number $m_{MX}+1$ of amplitude data $|A_m|$ obtained for these bands (including the UV-band amplitudes $|A_m|_{UV}$) also varies from 8 to 63. The data number conversion unit 109 therefore converts these $m_{MX}+1$ amplitude data, a variable number, into a fixed number $N_C$ of data (for example 44).

[0049] In this embodiment, dummy data that interpolates from the last data value in the block to the first data value in the block is appended to the amplitude data of one block of the effective band on the frequency axis, expanding the number of data to $N_F$. Band-limited oversampling by a factor of $K_{OS}$ (for example 8) is then applied to obtain $K_{OS}$ times as many amplitude data. These $(m_{MX}+1) \times K_{OS}$ amplitude data are linearly interpolated to expand them to a still larger number $N_M$ (for example 2048), and the $N_M$ data are decimated to the fixed number $N_C$ (for example 44).

[0050] The data from the data number conversion unit 109 (the fixed number $N_C$ of amplitude data) is sent to the vector quantization unit 110, where it is grouped into vectors of a predetermined number of data and vector-quantized. The quantized output data from the vector quantization unit 110 is taken out via the output terminal 111. The high-precision (fine) pitch data from the high-precision pitch search unit 106 is encoded by the pitch encoding unit 115 and taken out via the output terminal 112. The voiced/unvoiced (V/UV) discrimination data from the voiced/unvoiced discrimination unit 107 is taken out via the output terminal 113. The data from the output terminals 111 to 113 is converted into a signal of a predetermined transmission format and transmitted.

[0051] Each of these data is obtained by processing the data within a block of N samples (for example 256 samples). Because the block advances along the time axis in units of the L-sample frame, the data to be transmitted is obtained once per frame. That is, the pitch data, the V/UV discrimination data, and the amplitude data are updated at the frame period.

[0052] Next, the schematic configuration of the synthesis side (decoding side), which synthesizes a speech signal from the transmitted data, is described with reference to FIG. 7. In FIG. 7, the vector-quantized amplitude data is supplied to the input terminal 121, the encoded pitch data to the input terminal 122, and the V/UV discrimination data to the input terminal 123. The quantized amplitude data from the input terminal 121 is sent to the inverse vector quantization unit 124 and inversely quantized, then sent to the data number inverse conversion unit 125 and inversely converted; the resulting amplitude data is sent to the voiced sound synthesis unit 126 and the unvoiced sound synthesis unit 127. The encoded pitch data from the input terminal 122 is decoded by the pitch decoding unit 128 and sent to the data number inverse conversion unit 125, the voiced sound synthesis unit 126, and the unvoiced sound synthesis unit 127. The V/UV discrimination data from the input terminal 123 is sent to the voiced sound synthesis unit 126 and the unvoiced sound synthesis unit 127.

[0053] The voiced sound synthesis unit 126 synthesizes a voiced waveform on the time axis, for example by cosine wave synthesis. The unvoiced sound synthesis unit 127 synthesizes an unvoiced waveform on the time axis, for example by filtering white noise with a band-pass filter. The adder 129 adds the voiced and unvoiced synthesized waveforms, and the result is taken out from the output terminal 130. The amplitude data, pitch data, and V/UV discrimination data are updated once per analysis frame (L samples, for example 160 samples). To improve (smooth) the continuity between frames, each value of the amplitude and pitch data is taken as the value at, for example, the center of a frame, and the values up to the center of the next frame (one synthesis frame) are obtained by interpolation. That is, in one synthesis frame (for example from the center of one analysis frame to the center of the next), the data values at the leading sample point and at the trailing sample point (the leading point of the next synthesis frame) are given, and the values between these sample points are obtained by interpolation.

[0054] The synthesis processing in the voiced sound synthesis unit 126 is described in detail below. Let $V_m(n)$ be the voiced sound, over one synthesis frame on the time axis (L samples, for example 160 samples), of the m-th band (the band of the m-th harmonic) judged to be V (voiced). Using the time index (sample number) $n$ within the synthesis frame, it can be expressed as

$$V_m(n) = A_m(n)\cos(\theta_m(n)), \qquad 0 \le n < L \tag{13}$$

The voiced sounds of all bands judged to be V (voiced) are added ($\sum V_m(n)$) to synthesize the final voiced sound $V(n)$.

[0055] $A_m(n)$ in equation (13) is the amplitude of the m-th harmonic, interpolated from the leading end to the trailing end of the synthesis frame. Most simply, the m-th harmonic value of the amplitude data, which is updated frame by frame, is linearly interpolated. That is, letting $A_{0m}$ be the amplitude of the m-th harmonic at the leading end of the synthesis frame ($n = 0$) and $A_{Lm}$ its amplitude at the trailing end ($n = L$, the leading end of the next synthesis frame), $A_m(n)$ is calculated as

$$A_m(n) = \frac{(L-n)\,A_{0m}}{L} + \frac{n\,A_{Lm}}{L} \tag{14}$$

[0056] Next, the phase $\theta_m(n)$ in equation (13) can be obtained from

$$\theta_m(n) = m\omega_{01}n + \frac{n^2 m(\omega_{L1}-\omega_{01})}{2L} + \phi_{0m} + \Delta\omega\, n \tag{15}$$

In equation (15), $\phi_{0m}$ is the phase of the m-th harmonic at the leading end of the synthesis frame ($n = 0$), that is, the frame initial phase; $\omega_{01}$ is the fundamental angular frequency at the leading end of the synthesis frame ($n = 0$); and $\omega_{L1}$ is the fundamental angular frequency at the trailing end of the synthesis frame ($n = L$, the leading end of the next synthesis frame). $\Delta\omega$ in equation (15) is set to the smallest value for which $\theta_m(L)$ equals the phase $\phi_{Lm}$ at $n = L$.

[0057] The following describes how the amplitude $A_m(n)$ and the phase $\theta_m(n)$ are obtained for an arbitrary m-th band according to the V/UV discrimination results at $n = 0$ and $n = L$. When the m-th band is V (voiced) at both $n = 0$ and $n = L$, the amplitude $A_m(n)$ is calculated by linearly interpolating the transmitted amplitude values $A_{0m}$ and $A_{Lm}$ using equation (14) above. For the phase $\theta_m(n)$, $\Delta\omega$ is set so that the phase runs from $\theta_m(0) = \phi_{0m}$ at $n = 0$ to $\theta_m(L) = \phi_{Lm}$ at $n = L$.

[0058] Next, when the band is V (voiced) at $n = 0$ and UV (unvoiced) at $n = L$, the amplitude $A_m(n)$ is linearly interpolated from the transmitted amplitude value $A_{0m}$ at $A_m(0)$ down to 0 at $A_m(L)$. The transmitted amplitude value $A_{Lm}$ at $n = L$ is the amplitude of an unvoiced sound and is used in the unvoiced sound synthesis described later. The phase $\theta_m(n)$ is set with $\theta_m(0) = \phi_{0m}$ and $\Delta\omega = 0$.

[0059] Further, when the band is UV (unvoiced) at $n = 0$ and V (voiced) at $n = L$, the amplitude $A_m(n)$ is linearly interpolated from $A_m(0) = 0$ at $n = 0$ up to the transmitted amplitude value $A_{Lm}$ at $n = L$. For the phase $\theta_m(n)$, the phase at $n = 0$ is set from the phase value $\phi_{Lm}$ at the end of the frame as

$$\theta_m(0) = \phi_{Lm} - \frac{m(\omega_{01}+\omega_{L1})L}{2} \tag{16}$$

with $\Delta\omega = 0$.

[0060] The method of setting $\Delta\omega$ so that $\theta_m(L)$ becomes $\phi_{Lm}$ when the band is V (voiced) at both $n = 0$ and $n = L$ is as follows. Putting $n = L$ in equation (15) gives

$$\theta_m(L) = m\omega_{01}L + \frac{L^2 m(\omega_{L1}-\omega_{01})}{2L} + \phi_{0m} + \Delta\omega L = \frac{m(\omega_{01}+\omega_{L1})L}{2} + \phi_{0m} + \Delta\omega L = \phi_{Lm}$$

Rearranging, $\Delta\omega$ is

$$\Delta\omega = \frac{\operatorname{mod}2\pi\bigl((\phi_{Lm}-\phi_{0m}) - mL(\omega_{01}+\omega_{L1})/2\bigr)}{L} \tag{17}$$

In equation (17), $\operatorname{mod}2\pi(x)$ is a function that returns the principal value of $x$ as a value between $-\pi$ and $+\pi$. For example, $\operatorname{mod}2\pi(x) = -0.7\pi$ when $x = 1.3\pi$, $\operatorname{mod}2\pi(x) = 0.3\pi$ when $x = 2.3\pi$, and $\operatorname{mod}2\pi(x) = 0.7\pi$ when $x = -1.3\pi$.
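Equations (13) to (17) for a band that is voiced at both frame ends can be sketched as follows. The function and parameter names (`mod_2pi`, `delta_omega`, `synth_harmonic`, `w_0`, `w_L`, and so on) are hypothetical, and the sketch covers only the V-to-V case; it is an illustration rather than the synthesizer of the embodiment.

```python
import math


def mod_2pi(x):
    """Principal value of x, wrapped into the range -pi..+pi."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi


def delta_omega(m, phi_0, phi_L, w_0, w_L, L):
    """Frequency correction of equation (17)."""
    return mod_2pi((phi_L - phi_0) - m * L * (w_0 + w_L) / 2.0) / L


def phase(n, m, phi_0, w_0, w_L, L, dw):
    """Phase of the m-th harmonic at sample n, equation (15)."""
    return m * w_0 * n + n * n * m * (w_L - w_0) / (2.0 * L) + phi_0 + dw * n


def synth_harmonic(m, A_0, A_L, phi_0, phi_L, w_0, w_L, L):
    """One synthesis frame of the m-th harmonic, equations (13) and (14)."""
    dw = delta_omega(m, phi_0, phi_L, w_0, w_L, L)
    out = []
    for n in range(L):
        amp = ((L - n) * A_0 + n * A_L) / L  # linear interpolation, eq. (14)
        out.append(amp * math.cos(phase(n, m, phi_0, w_0, w_L, L, dw)))
    return out
```

With this choice of $\Delta\omega$, the phase at $n = L$ differs from $\phi_{Lm}$ only by a whole multiple of $2\pi$, so consecutive frames join without a phase jump.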

[0061] FIG. 8A shows an example of the spectrum of a speech signal in which the bands with band numbers (harmonic numbers) $m$ of 8, 9, and 10 are UV (unvoiced) and the other bands are V (voiced). The time-axis signals of the V (voiced) bands are synthesized by the voiced sound synthesis unit 126, and the time-axis signals of the UV (unvoiced) bands are synthesized by the unvoiced sound synthesis unit 127.

[0062] The unvoiced sound synthesis processing in the unvoiced sound synthesis unit 127 is described below. The white noise signal waveform on the time axis from the white noise generation unit 131 is windowed with a suitable window function (for example a Hamming window) of a predetermined length (for example 256 samples), and the STFT processing unit 132 applies STFT (short-term Fourier transform) processing to it, giving the power spectrum of the white noise on the frequency axis as shown in FIG. 8B. The power spectrum from the STFT processing unit 132 is sent to the band amplitude processing unit 133. There, as shown in FIG. 8C, the bands judged UV (unvoiced), for example $m$ = 8, 9, 10, are multiplied by the amplitude $|A_m|_{UV}$, and the amplitudes of the other bands, judged V (voiced), are set to 0. The band amplitude processing unit 133 is supplied with the amplitude data, the pitch data, and the V/UV discrimination data. The output of the band amplitude processing unit 133 is sent to the ISTFT processing unit 134, which converts it to a time-axis signal by inverse STFT processing, using the phase of the original white noise as the phase. The output of the ISTFT processing unit 134 is sent to the overlap-add unit 135, which repeats overlapping and addition on the time axis with suitable weighting (chosen so that the original continuous noise waveform can be restored) to synthesize a continuous time-axis waveform. The output signal of the overlap-add unit 135 is sent to the adder 129.
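The band-shaping step for a single noise frame can be sketched as follows. This is a simplified illustration, not the circuit of the embodiment: it uses a plain O(N²) DFT instead of an FFT, scales the complex noise spectrum by a per-band gain (which keeps the original noise phase), and leaves out the overlap-add of successive frames. The names `dft`, `unvoiced_frame`, `band_edges`, and `uv_amps` are hypothetical; a gain of 0.0 stands for a band judged voiced.

```python
import cmath
import math
import random


def dft(x, inverse=False):
    """Naive discrete Fourier transform (inverse includes the 1/n factor)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def unvoiced_frame(noise, band_edges, uv_amps):
    """Shape one windowed white-noise frame band by band.

    band_edges: inclusive (a_m, b_m) bin ranges below the Nyquist bin.
    uv_amps: gain per band; 0.0 for voiced bands, |A_m|_UV for UV bands.
    """
    n = len(noise)
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * t / (n - 1))
              for t in range(n)]  # Hamming window
    spec = dft([noise[t] * window[t] for t in range(n)])
    gain = [0.0] * n
    for (a, b), amp in zip(band_edges, uv_amps):
        for k in range(a, b + 1):
            gain[k] = amp
            gain[(n - k) % n] = amp  # mirror bin so the output stays real
    shaped = [spec[k] * gain[k] for k in range(n)]
    return [v.real for v in dft(shaped, inverse=True)]


# Usage: a 64-sample frame with one voiced band and one unvoiced band.
random.seed(0)
frame = unvoiced_frame([random.gauss(0.0, 1.0) for _ in range(64)],
                       [(4, 7), (8, 11)], [0.0, 1.5])
```

Successive frames produced this way would still need to be weighted and overlap-added to give a continuous waveform, as the paragraph above describes.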

[0063] The voiced and unvoiced signals synthesized by the synthesis units 126 and 127 and returned to the time axis in this way are added by the adder 129 at a suitable fixed mixing ratio, and the reproduced speech signal is taken out from the output terminal 130.

[0064] The configuration of the speech analysis side (encoding side) in FIG. 2 and of the speech synthesis side (decoding side) in FIG. 7 has been described with each unit as hardware, but it can also be realized by a software program using a so-called DSP (digital signal processor) or the like.

[0065] The voiced sound discrimination method according to the present invention can also be used as a means of detecting background noise, for example when environmental noise (background noise and the like) is to be reduced on the transmitting side of a car telephone. That is, it is also applicable to noise detection in so-called speech enhancement, in which low-quality speech corrupted by noise is processed to remove the influence of the noise and make the sound easier to listen to.

[0066]

[Effects of the Invention] The voiced sound discrimination method according to the present invention converts the input speech signal, in block units, to the frequency axis, divides the frequency axis into a plurality of bands, and decides for each band whether it is voiced. When the signal level of a block containing a band judged to be voiced is at or below a first predetermined threshold, all bands in the block are made unvoiced. In MBE coding, for example, this prevents very unpleasant sounds, such as the abnormal sound caused by fixed-phase addition on the synthesis side when hum noise with a low pitch frequency (50 Hz) is superimposed on the background of the input speech signal.

[Brief description of drawings]

[FIG. 1] A functional block diagram showing the schematic configuration of a voiced sound discrimination apparatus, used to explain an embodiment of the voiced sound discrimination method according to the present invention.

[FIG. 2] A functional block diagram showing the schematic configuration of the analysis side (encoding side) of an analysis-by-synthesis coding apparatus for speech signals, as a specific example of an apparatus to which the voiced sound discrimination method according to the present invention is applicable.

[FIG. 3] A diagram for explaining the windowing processing.

[FIG. 4] A diagram for explaining the relationship between the windowing processing and the window function.

[FIG. 5] A diagram showing the time-axis data subjected to the orthogonal transform (FFT).

[FIG. 6] A diagram showing the spectral data on the frequency axis, the spectral envelope, and the power spectrum of the excitation signal.

[FIG. 7] A functional block diagram showing the schematic configuration of the synthesis side (decoding side) of an analysis-by-synthesis coding apparatus for speech signals, as a specific example of an apparatus to which the voiced sound discrimination method according to the present invention is applicable.

[FIG. 8] A diagram for explaining unvoiced sound synthesis when synthesizing a speech signal.

[Explanation of symbols]

12: windowing processing unit
13: orthogonal transform unit
14: band division unit
15: V/UV discrimination unit
16: correction unit
18: signal level detection unit
19: comparison unit (signal level)
21: determination unit
22: peak detection unit
23: in-block variance detection unit
24: normalization unit
25: comparison unit (normalized peak)

─────────────────────────────────────────────────────

[Procedure amendment]

[Submission date] May 12, 1993

[Procedure amendment 1]

[Name of document to be amended] Specification

[Name of item to be amended] Entire text

[Method of amendment] Change

[Contents of amendment]

[Document name] Specification

[Title of the invention] Voiced sound discrimination method

[Claims]

[Detailed description of the invention]

[0001]

[0002]

[0005]

[0007]

[0011]

[0012]

[0019] On the other hand, when the V/UV discrimination unit 15 judges at least one band to be voiced, the signal level detection unit 18 detects the signal level of the one block that contains the voiced band. The detected signal level is supplied to the comparison unit 19 and compared with the threshold from the input terminal 20. The comparison result is supplied to the determination unit 21.

[0020] When the V/UV discrimination unit 15 judges at least one band to be voiced, the signal level detection unit 18 determines whether the level of the input speech signal corresponds to 1 LSB (least significant bit) of the quantization step; when the level is 1 LSB or less, all bands are made unvoiced.

[0021] Specifically, the signal level detection unit 18 obtains the standard deviation $\sigma$ of the energy as

[0022]

[Equation 1]

[0024]

[Equation 2]

[0026] This $\sigma$ is supplied to the comparison unit 19 when the V/UV discrimination unit 15 judges at least one band to be voiced. The comparison unit 19 compares $\sigma$ with the threshold $g_t$ supplied to the input terminal 20 (for example $g_t = 8$). If $\sigma$ is smaller than the threshold $g_t$, the determination unit 21 judges that the one-block signal in which the V/UV discrimination unit 15 found at least one voiced band is at the granular noise level, that is, has a mean amplitude of 1 LSB or less.

【００２７】また、上記ピーク検出部２２は、１ブロッ
クのデータのピーク値Ｐ_eを、Further, the peak detecting section 22 calculates the peak value P _e of the data of one block as

【００２８】[0028]

【数３】 [Equation 3]

【００２９】[0029] equation (3) shown above. Here, $\bar{x}$ is the mean value of $x(n)$, as described above. The normalization unit 24 divides the peak value $P_e$ by the standard deviation σ given by equation (1) to obtain the normalized value $P_n$:

$$P_n = P_e / \sigma \qquad (4)$$

The comparison unit 25 compares the normalized value $P_n$ with the threshold $P_{thn}$ supplied from input terminal 26 and supplies the comparison result to the decision unit 21. The decision unit 21 also receives the result obtained by the comparison unit 19 from comparing the standard deviation σ with the threshold $g_t$ (for example, $g_t = 8$) supplied from input terminal 20. The decision unit 21 supplies a control signal to the correction unit 16 according to the results from the comparison units 25 and 19. Specifically, when the normalized value $P_n$ is at or below the predetermined threshold $P_{thn}$ and the signal level σ is also at or below the predetermined threshold $g_t$, the correction unit 16 is activated and all bands of the frequency-axis data of the block are set to unvoiced.
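The decision described in paragraphs [0026] to [0029] can be sketched as follows. This is a minimal illustration, not the patented implementation: the images for equations (1) to (3) are missing from this text, so the standard deviation about the block mean and the peak as the largest deviation from the mean are assumed forms. The function name, the default `p_thn = 3.0`, and the handling of an all-constant block are likewise assumptions; only `g_t = 8` comes from the text.

```python
import math

def force_all_unvoiced(block, band_is_voiced, g_t=8.0, p_thn=3.0):
    """Return per-band V/UV flags after the low-level correction.

    block          : time-domain samples x(n) of one block
    band_is_voiced : per-band flags from the V/UV discrimination step
    g_t            : level threshold (8 is the example given in the text)
    p_thn          : normalized-peak threshold (hypothetical value)
    """
    flags = list(band_is_voiced)
    if not any(flags):
        return flags  # already unvoiced in every band; nothing to correct

    n = len(block)
    mean = sum(block) / n
    # Assumed form of equation (1): standard deviation about the block mean.
    sigma = math.sqrt(sum((x - mean) ** 2 for x in block) / n)
    # Assumed form of equation (3): largest deviation from the mean.
    peak = max(abs(x - mean) for x in block)

    if sigma == 0.0:
        # Constant block: treated as unvoiced here (an assumption).
        return [False] * len(flags)

    p_n = peak / sigma  # equation (4)
    if sigma <= g_t and p_n <= p_thn:
        return [False] * len(flags)  # force all bands to unvoiced
    return flags
```

A block whose samples toggle by one quantization step is forced to unvoiced, while a block with the same shape at high level keeps its original flags.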

【００３０】[0030] Thus, in this embodiment, for a signal other than one judged unvoiced in all bands, that is, a signal in which at least one band has been judged voiced, all bands are forcibly set to unvoiced when the signal level of the block is very small and its peak value is also very small. This prevents the abnormal sound that fixed-phase addition would otherwise produce on the synthesis side.

【００３１】[0031] The voiced sound discrimination method according to the present invention is applicable to vocoders such as the MBE vocoder. That is, when the frequency-axis data of one block of the input speech signal is judged to be noise or unvoiced sound, forcing all bands to unvoiced prevents abnormal sound on the synthesis side of a vocoder such as the MBE vocoder.

【００３２】[0032] A specific example of an MBE (Multiband Excitation) vocoder, a kind of analysis-synthesis coding apparatus for speech signals (a so-called vocoder) to which the voiced sound discrimination method according to the present invention can be applied, is described below with reference to the drawings. This MBE vocoder is disclosed in D. W. Griffin and J. S. Lim, "Multiband Excitation Vocoder," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 36, no. 8, pp. 1223-1235, Aug. 1988. Whereas a conventional PARCOR (PARtial auto-CORrelation) vocoder and the like switch between voiced and unvoiced sections block by block or frame by frame when modeling speech, the MBE vocoder models speech on the assumption that voiced and unvoiced regions coexist along the frequency axis at the same time (within the same block or frame).

【００３３】[0033] FIG. 2 is a block diagram showing the overall schematic configuration of an embodiment of the MBE vocoder. In FIG. 2, a speech signal is supplied to input terminal 101 and sent to a filter 102 such as an HPF (high-pass filter), which removes the so-called DC offset and at least the low-frequency components (200 Hz and below) for band limiting (for example, to 200 to 3400 Hz). The signal obtained through the filter 102 is sent to the pitch extraction unit 103 and to the windowing unit 104. In the pitch extraction unit 103, the input speech data is divided into blocks of a predetermined number of samples N (for example, N = 256), in other words cut out with a rectangular window, and pitch extraction is performed on the speech signal in each block. The cut-out block (256 samples) is moved along the time axis at a frame interval of L samples (for example, L = 160), as shown in A of FIG. 3, so that adjacent blocks overlap by N - L samples (for example, 96 samples). The windowing unit 104 applies a predetermined window function, for example a Hamming window, to each block of N samples, and this windowed block is likewise moved along the time axis at intervals of one frame of L samples.
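The block and frame arithmetic of paragraph [0033] can be illustrated with a short sketch. It assumes a plain Python list of samples; the function names are hypothetical, and only N = 256 and L = 160 come from the text.

```python
def block_starts(num_samples, n=256, l=160):
    """Start indices of N-sample blocks that advance by L samples."""
    return list(range(0, num_samples - n + 1, l))

def split_blocks(signal, n=256, l=160):
    """Cut the signal into overlapping blocks (rectangular-window cut-out)."""
    return [signal[s:s + n] for s in block_starts(len(signal), n, l)]

# Consecutive blocks share N - L = 96 samples.
blocks = split_blocks(list(range(576)))
overlap = len(blocks[0]) - 160
```

With 576 input samples this yields three blocks starting at samples 0, 160 and 320, and the last 96 samples of each block reappear as the first 96 samples of the next.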

【００３４】[0034] Expressed as a formula, this windowing is

$$x_w(k, q) = x(q)\, w(kL - q) \qquad (5)$$

In equation (5), $k$ is the block number and $q$ is the time index (sample number) of the data. The equation states that the $q$-th sample $x(q)$ of the unprocessed input signal is windowed by the window function $w(kL - q)$ of the $k$-th block to give the data $x_w(k, q)$. The window function $w_r(r)$ for the rectangular window used in the pitch extraction unit 103, shown in A of FIG. 3, is

$$w_r(r) = \begin{cases} 1, & 0 \le r < N \\ 0, & r < 0,\ N \le r \end{cases} \qquad (6)$$

and the window function $w_h(r)$ for the Hamming window used in the windowing unit 104, shown in B of FIG. 3, is

$$w_h(r) = \begin{cases} 0.54 - 0.46 \cos\bigl(2\pi r/(N-1)\bigr), & 0 \le r < N \\ 0, & r < 0,\ N \le r \end{cases} \qquad (7)$$

When such a window function $w_r(r)$ or $w_h(r)$ is used, the nonzero interval of the window function $w(r)$ $(= w(kL - q))$ in equation (5) is

$$0 \le kL - q < N,$$

which can be rearranged to

$$kL - N < q \le kL.$$

Accordingly, for the rectangular window, for example, $w_r(kL - q) = 1$ holds when $kL - N < q \le kL$, as shown in FIG. 4. Equations (5) to (7) also show that a window of length N (= 256) samples advances by L (= 160) samples at a time. In the following, the sequences of N nonzero samples ($0 \le r < N$) cut out by the window functions of equations (6) and (7) are written $x_{wr}(k, r)$ and $x_{wh}(k, r)$, respectively.
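Equations (5) to (7) translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the input is assumed to be a Python list indexed by the sample number q.

```python
import math

def w_rect(r, n=256):
    """Rectangular window of equation (6)."""
    return 1.0 if 0 <= r < n else 0.0

def w_hamming(r, n=256):
    """Hamming window of equation (7)."""
    if 0 <= r < n:
        return 0.54 - 0.46 * math.cos(2.0 * math.pi * r / (n - 1))
    return 0.0

def windowed(x, k, q, window, n=256, l=160):
    """x_w(k, q) = x(q) * w(kL - q), equation (5)."""
    return x[q] * window(k * l - q, n)

def nonzero_range(k, n=256, l=160):
    """Inclusive range of q with kL - N < q <= kL."""
    return (k * l - n + 1, k * l)
```

For block k = 2 the window is nonzero for q from 65 to 320, which matches the interval kL - N < q ≤ kL derived above.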

【００３５】[0035] In the windowing unit 104, as shown in FIG. 5, 1792 samples of zero data are appended (so-called zero padding) to the 256-sample sequence $x_{wh}(k, r)$ of one block to which the Hamming window of equation (7) has been applied, giving 2048 samples. The orthogonal transform unit 105 then applies an orthogonal transform such as an FFT (fast Fourier transform) to this 2048-sample time-axis data sequence.

【００３６】[0036] The pitch extraction unit 103 extracts the pitch from the sample sequence $x_{wr}(k, r)$ (one block of N samples). Known pitch extraction methods use the periodicity of the time waveform, the periodic frequency structure of the spectrum, or the autocorrelation function; this embodiment uses the autocorrelation of a center-clipped waveform. A single center-clip level may be set for each block, but here the peak level of the signal in each part (sub-block) of a subdivided block is detected, and when the peak levels of the sub-blocks differ greatly, the clip level is changed stepwise or continuously within the block. The pitch period is determined from the peak position of the autocorrelation data of the center-clipped waveform. Several peaks are first obtained from the autocorrelation data belonging to the current frame (the autocorrelation is computed over one block of N samples). If the largest of these peaks is at or above a predetermined threshold, its position is taken as the pitch period. Otherwise, a peak is sought within a pitch range that satisfies a predetermined relationship to the pitch obtained in a frame other than the current one, for example the preceding or following frame (for example, within ±20% of the pitch of the preceding frame), and the pitch of the current frame is determined from the position of that peak. The pitch extraction unit 103 thus performs a relatively rough open-loop pitch search, and the extracted pitch data is sent to the fine pitch search unit 106, which performs a closed-loop high-precision pitch search (fine search).
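A rough sketch of the open-loop search described in paragraph [0036] is given below. It uses a single clip level per block (the simpler of the two options the text mentions) and an unnormalized autocorrelation. The clip ratio, the lag range, the peak threshold and all function names are assumptions made for illustration; only the ±20% fallback range comes from the text.

```python
import math

def center_clip(block, ratio=0.6):
    """Center-clip with one level per block (ratio is an assumed value)."""
    level = ratio * max(abs(v) for v in block)
    out = []
    for v in block:
        if v > level:
            out.append(v - level)
        elif v < -level:
            out.append(v + level)
        else:
            out.append(0.0)
    return out

def autocorr(x, lag):
    """Unnormalized autocorrelation at the given lag."""
    return sum(x[i] * x[i + lag] for i in range(len(x) - lag))

def coarse_pitch(block, prev_pitch=None, peak_threshold=0.3,
                 min_lag=20, max_lag=147):
    """Integer pitch lag from the autocorrelation of the clipped block."""
    clipped = center_clip(block)
    r0 = autocorr(clipped, 0)
    if r0 == 0.0:
        return prev_pitch  # no usable signal; keep the previous estimate
    lags = range(min_lag, max_lag + 1)
    best = max(lags, key=lambda lag: autocorr(clipped, lag))
    if prev_pitch is None or autocorr(clipped, best) / r0 >= peak_threshold:
        return best
    # Weak peak: search within +/-20% of the previous frame's pitch.
    near = [lag for lag in lags if abs(lag - prev_pitch) <= 0.2 * prev_pitch]
    if not near:
        return best
    return max(near, key=lambda lag: autocorr(clipped, lag))

# Example input: a sinusoid with a period of 50 samples.
sine_block = [math.sin(2.0 * math.pi * n / 50.0) for n in range(256)]
```

Calling `coarse_pitch(sine_block)` recovers the 50-sample period; passing `prev_pitch=100` together with an unreachable `peak_threshold` exercises the fallback path, which then selects the peak nearest the previous pitch.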

【００３７】[0037] The fine pitch search unit 106 is supplied with the coarse integer-valued pitch data extracted by the pitch extraction unit 103 and with the frequency-axis data produced, for example by FFT, in the orthogonal transform unit 105. The fine pitch search unit 106 varies the pitch around the coarse value by ± several samples in steps of 0.2 to 0.5 and homes in on the optimum fine pitch value with a fractional part (floating point). The fine search uses the so-called analysis-by-synthesis method, selecting the pitch so that the synthesized power spectrum is closest to the power spectrum of the original sound.

【００３８】[0038] The fine pitch search is now described. The MBE vocoder assumes a model in which the spectral data $S(j)$ on the frequency axis, obtained by an orthogonal transform such as the FFT, is expressed as

$$S(j) = H(j)\,|E(j)|, \qquad 0 < j < J \qquad (8)$$

Here $J$ corresponds to $\omega_s/4\pi = f_s/2$; when the sampling frequency $f_s = \omega_s/2\pi$ is, for example, 8 kHz, $J$ corresponds to 4 kHz. In equation (8), when the spectral data $S(j)$ on the frequency axis has the waveform shown in A of FIG. 6, $H(j)$ represents the spectral envelope of the original spectral data $S(j)$ as shown in B of FIG. 6, and $E(j)$ represents the spectrum of an equal-level, periodic excitation signal as shown in C of FIG. 6. That is, the FFT spectrum $S(j)$ is modeled as the product of the spectral envelope $H(j)$ and the power spectrum $|E(j)|$ of the excitation signal.

【００３９】[0039] The power spectrum $|E(j)|$ of the excitation signal is formed by taking into account the periodicity (pitch structure) of the waveform on the frequency axis, which is determined by the pitch, and repeating a spectral waveform corresponding to one band in every band along the frequency axis. The waveform for one band can be formed, for example, by treating the 256-sample Hamming window function with 1792 samples of zero data appended (zero-padded), as shown in FIG. 5, as a time-axis signal, applying an FFT to it, and cutting out the resulting impulse-like waveform, which has a certain bandwidth on the frequency axis, according to the pitch.

【００４０】[0040] Next, for each band divided according to the pitch, a value (a kind of amplitude) $|A_m|$ that represents $H(j)$, that is, one that minimizes the error within the band, is obtained. Letting $a_m$ and $b_m$ be the lower and upper limit points of the $m$-th band (the band of the $m$-th harmonic), the error $\varepsilon_m$ of the $m$-th band is

【００４１】[0041]

【００４２】[0042]

【数５】 [Equation 5]

【００４３】[0043] and the $|A_m|$ given by equation (10) minimizes the error $\varepsilon_m$. Such an amplitude $|A_m|$ is obtained for each band, and each amplitude $|A_m|$ is used to obtain the per-band error $\varepsilon_m$ defined by equation (9). Next, the sum $\sum \varepsilon_m$ of the per-band errors over all bands is obtained. This all-band error sum $\sum \varepsilon_m$ is then computed for several slightly different pitches, and the pitch that minimizes it is found.
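The per-band fit can be sketched as below. The images for equations (9) and (10) are missing from this text, so the code assumes the least-squares form that is conventional for MBE analysis: the error is the squared difference between $|S(j)|$ and $|A_m||E(j)|$ summed over the band, and the amplitude is the value that minimizes it. Function and parameter names are hypothetical.

```python
def band_amplitude(s_mag, e_mag, a, b):
    """Least-squares |A_m| over bins a..b (assumed form of equation (10))."""
    num = sum(s_mag[j] * e_mag[j] for j in range(a, b + 1))
    den = sum(e_mag[j] ** 2 for j in range(a, b + 1))
    return num / den if den > 0.0 else 0.0

def band_error(s_mag, e_mag, a, b, amp):
    """Squared fitting error over bins a..b (assumed form of equation (9))."""
    return sum((s_mag[j] - amp * e_mag[j]) ** 2 for j in range(a, b + 1))

# A band whose spectrum is exactly three times the excitation shape.
e_shape = [1.0, 2.0, 3.0, 2.0, 1.0]
s_band = [3.0 * v for v in e_shape]
amp = band_amplitude(s_band, e_shape, 0, 4)
err = band_error(s_band, e_shape, 0, 4, amp)
```

When the band spectrum is an exact multiple of the excitation shape, the fitted amplitude equals that multiple and the error is zero; any mismatch in shape leaves a positive residual.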

【００４４】[0044] That is, several pitch values are prepared above and below the rough pitch obtained by the pitch extraction unit 103, for example in steps of 0.25. The error sum $\sum \varepsilon_m$ is obtained for each of these slightly different pitches. Once the pitch is fixed, the bandwidth is determined; using equation (10), the error $\varepsilon_m$ of equation (9) is obtained from the power spectrum $|S(j)|$ of the frequency-axis data and the excitation signal spectrum $|E(j)|$, and the sum $\sum \varepsilon_m$ over all bands follows. This error sum is computed for each pitch, and the pitch giving the smallest error sum is chosen as the optimum pitch. In this way the fine pitch search unit 106 obtains the optimum fine pitch (for example, in steps of 0.25) and determines the amplitude $|A_m|$ corresponding to that pitch.
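The search loop of paragraph [0044] reduces to evaluating an error function over a small grid of candidate pitches. In the sketch below, `total_error` is a hypothetical callable standing in for the all-band error sum $\sum \varepsilon_m$, and `count` (the number of candidates on each side) is an assumed parameter; only the 0.25 step is taken from the text.

```python
def pitch_candidates(coarse, step=0.25, count=4):
    """Candidate pitches centered on the coarse integer pitch."""
    return [coarse + i * step for i in range(-count, count + 1)]

def fine_pitch(coarse, total_error, step=0.25, count=4):
    """Pick the candidate with the smallest all-band error sum."""
    return min(pitch_candidates(coarse, step, count), key=total_error)

# Toy error surface with its minimum near 50.3 samples.
best = fine_pitch(50, lambda p: (p - 50.3) ** 2)
```

With the toy error surface the search settles on 50.25, the grid point closest to the true minimum.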

【００４５】[0045] The description of the fine pitch search above assumes, for simplicity, that all bands are voiced. As noted earlier, however, the MBE vocoder adopts a model in which unvoiced regions exist on the frequency axis at the same time, so a voiced/unvoiced decision must be made for each band.

【００４６】[0046] The optimum pitch and amplitude $|A_m|$ data from the fine pitch search unit 106 are sent to the voiced/unvoiced discrimination unit 107, which makes the voiced/unvoiced decision for each band. The NSR (noise-to-signal ratio) is used for this decision. The NSR of the $m$-th band is

【００４７】[0047]

【数６】 [Equation 6]

【００４８】[0048] When this NSR value is larger than a predetermined threshold (for example, 0.3), that is, when the error is large, the approximation of $|S(j)|$ by $|A_m||E(j)|$ in that band is judged to be poor (the excitation signal $|E(j)|$ is unsuitable as a basis), and the band is classified as UV (unvoiced). Otherwise the approximation is judged to be reasonably good, and the band is classified as V (voiced).
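The per-band decision can be sketched as follows. The image for the NSR equation is missing from this text, so the code assumes the usual definition: the fitting error of the band divided by the band's signal energy $\sum |S(j)|^2$. The 0.3 threshold is the example given in the text; the function names and the treatment of a zero-energy band as unvoiced are assumptions.

```python
def band_nsr(s_mag, e_mag, a, b, amp):
    """Noise-to-signal ratio of bins a..b (assumed form of the NSR equation)."""
    err = sum((s_mag[j] - amp * e_mag[j]) ** 2 for j in range(a, b + 1))
    energy = sum(s_mag[j] ** 2 for j in range(a, b + 1))
    if energy == 0.0:
        return float("inf")  # empty band: treated as unvoiced (assumption)
    return err / energy

def is_voiced(nsr, threshold=0.3):
    """A band is unvoiced when its NSR exceeds the threshold."""
    return nsr <= threshold
```

A band that the excitation shape fits exactly has an NSR of zero and is classified as voiced, whereas a band with half of its energy left unexplained (NSR = 0.5) is classified as unvoiced.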

【００４９】[0049] Next, the amplitude re-evaluation unit 108 is supplied with the frequency-axis data from the orthogonal transform unit 105, the fine pitch and the estimated amplitude $|A_m|$ from the fine pitch search unit 106, and the V/UV (voiced/unvoiced) decision data from the voiced/unvoiced discrimination unit 107. The amplitude re-evaluation unit 108 recomputes the amplitude for each band that the voiced/unvoiced discrimination unit 107 has classified as unvoiced (UV). The amplitude $|A_m|_{UV}$ of such a UV band is

【００５０】[0050]

【数７】 [Equation 7], from which it is obtained.

【００５１】[0051] The data from the amplitude re-evaluation unit 108 is sent to the data number conversion unit 109 (a kind of sampling rate converter). Because the number of bands into which the frequency axis is divided depends on the pitch, the number of data items (in particular, the number of amplitude data items) varies; the data number conversion unit 109 makes this number constant. For example, if the effective band extends up to 3400 Hz, it is divided into 8 to 63 bands depending on the pitch, so the number $m_{MX} + 1$ of amplitude data items $|A_m|$ obtained for these bands (including the UV-band amplitudes $|A_m|_{UV}$) also varies from 8 to 63. The data number conversion unit 109 therefore converts this variable number $m_{MX} + 1$ of amplitude data items into a fixed number $N_C$ (for example, 44).

【００５２】[0052] In this embodiment, dummy data that interpolates between the last and the first data value in the block is appended to the amplitude data for one block of the effective band on the frequency axis, expanding the number of data items to $N_F$. Band-limited oversampling by a factor of $K_{OS}$ (for example, 8) is then applied to obtain $K_{OS}$ times as many amplitude data items. These $(m_{MX} + 1) \times K_{OS}$ amplitude data items are linearly interpolated to expand them to a still larger number $N_M$ (for example, 2048), and the $N_M$ data items are decimated to the fixed number $N_C$ (for example, 44).
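The purpose of the data number conversion, mapping a pitch-dependent number of amplitudes onto a fixed-length vector, can be illustrated with a deliberately simplified sketch. It uses plain linear interpolation only and omits the dummy-data padding, the band-limited $K_{OS}$-times oversampling and the intermediate expansion to $N_M$ points described in the text, so it is a stand-in rather than the method of this embodiment. The function name is hypothetical.

```python
def resample_linear(values, n_out=44):
    """Resample a variable-length amplitude list to n_out points.

    Simplified stand-in: direct linear interpolation, without the
    oversampling filter and decimation stages described in the text.
    """
    n_in = len(values)
    if n_in == 1 or n_out == 1:
        return [values[0]] * n_out
    out = []
    for i in range(n_out):
        pos = i * (n_in - 1) / (n_out - 1)  # position on the input grid
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append(values[lo] * (1.0 - frac) + values[hi] * frac)
    return out

# Both ends of the 8..63 range map to 44 values.
short = resample_linear([float(i) for i in range(8)])
long_ = resample_linear([float(i) for i in range(63)])
```

Whether a frame yields 8 or 63 amplitudes, the output always has 44 values, which is what allows the subsequent vector quantizer to work with vectors of fixed dimension.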

【００５３】[0053] The data from the data number conversion unit 109 (the fixed number $N_C$ of amplitude data items) is sent to the vector quantization unit 110, where it is grouped into vectors of a predetermined number of data items and vector-quantized. The quantized output data from the vector quantization unit 110 is taken out through output terminal 111. The fine pitch data from the fine pitch search unit 106 is encoded by the pitch encoding unit 115 and taken out through output terminal 112. The voiced/unvoiced (V/UV) decision data from the voiced/unvoiced discrimination unit 107 is taken out through output terminal 113. The data from output terminals 111 to 113 is transmitted as a signal in a predetermined transmission format.

【００５４】[0054] Each of these data items is obtained by processing the data within a block of N samples (for example, 256 samples). Because the block advances along the time axis in units of a frame of L samples, however, the data to be transmitted is obtained frame by frame. That is, the pitch data, V/UV decision data and amplitude data are updated at the frame period.

【００５５】[0055] Next, the schematic configuration of the synthesis side (decoder side), which synthesizes a speech signal from the transmitted data, is described with reference to FIG. 7. In FIG. 7, the vector-quantized amplitude data is supplied to input terminal 121, the encoded pitch data to input terminal 122, and the V/UV decision data to input terminal 123. The quantized amplitude data from input terminal 121 is sent to the inverse vector quantization unit 124 and dequantized, then sent to the data number inverse conversion unit 125 and inversely converted; the resulting amplitude data is sent to the voiced sound synthesis unit 126 and the unvoiced sound synthesis unit 127. The encoded pitch data from input terminal 122 is decoded by the pitch decoding unit 128 and sent to the data number inverse conversion unit 125, the voiced sound synthesis unit 126 and the unvoiced sound synthesis unit 127. The V/UV decision data from input terminal 123 is sent to the voiced sound synthesis unit 126 and the unvoiced sound synthesis unit 127.

【００５６】[0056] The voiced sound synthesis unit 126 synthesizes a voiced waveform on the time axis by, for example, cosine-wave synthesis, and the unvoiced sound synthesis unit 127 synthesizes an unvoiced waveform on the time axis by, for example, filtering white noise with a band-pass filter. The adder 129 adds the voiced and unvoiced synthesized waveforms, and the sum is taken out from output terminal 130. The amplitude data, pitch data and V/UV decision data are updated once per analysis frame (L samples, for example 160 samples). To improve continuity between frames (to smooth the output), each amplitude or pitch value is taken as the value at, for example, the center of a frame, and the values up to the center of the next frame (one synthesis frame) are obtained by interpolation. That is, within one synthesis frame (for example, from the center of one analysis frame to the center of the next), the data values at the leading sample point and at the trailing sample point (the leading point of the next synthesis frame) are given, and the values between these sample points are obtained by interpolation.

【００５７】[0057] The synthesis processing in the voiced sound synthesis unit 126 is described in detail below. Let $V_m(n)$ be the voiced sound for one synthesis frame (L samples, for example 160 samples) on the time axis in the $m$-th band (the band of the $m$-th harmonic) classified as V (voiced). Using the time index (sample number) $n$ within the synthesis frame, it can be written as

$$V_m(n) = A_m(n) \cos\bigl(\theta_m(n)\bigr), \qquad 0 \le n < L \qquad (13)$$

The voiced sounds of all bands classified as V (voiced) are summed ($\sum V_m(n)$) to synthesize the final voiced sound $V(n)$.

【００５８】[0058] $A_m(n)$ in equation (13) is the amplitude of the $m$-th harmonic interpolated between the start and the end of the synthesis frame. The simplest approach is to interpolate linearly the $m$-th harmonic values of the amplitude data, which is updated frame by frame. That is, letting $A_{0m}$ be the amplitude of the $m$-th harmonic at the start of the synthesis frame ($n = 0$) and $A_{Lm}$ its amplitude at the end of the synthesis frame ($n = L$, the start of the next synthesis frame), $A_m(n)$ can be computed as

$$A_m(n) = (L - n)\,A_{0m}/L + n\,A_{Lm}/L \qquad (14)$$
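Equation (14) is a plain linear cross-fade between the amplitudes at the two ends of the synthesis frame. A minimal sketch, with a hypothetical function name and the frame length L = 160 taken from the text:

```python
def interp_amplitude(a_0m, a_lm, n, l=160):
    """A_m(n) = ((L - n) * A_0m + n * A_Lm) / L, equation (14)."""
    return ((l - n) * a_0m + n * a_lm) / l

# Amplitude track of one harmonic across a synthesis frame.
track = [interp_amplitude(2.0, 4.0, n) for n in range(161)]
```

The track starts at $A_{0m}$, ends at $A_{Lm}$, and passes through their average at the middle of the frame.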

【００５９】[0059] Next, the phase $\theta_m(n)$ in equation (13) can be obtained from

$$\theta_m(n) = m\,\omega_{01}\,n + n^2 m\,(\omega_{L1} - \omega_{01})/2L + \phi_{0m} + \Delta\omega\,n \qquad (15)$$

In equation (15), $\phi_{0m}$ is the phase of the $m$-th harmonic at the start of the synthesis frame ($n = 0$), that is, the frame initial phase; $\omega_{01}$ is the fundamental angular frequency at the start of the synthesis frame ($n = 0$); and $\omega_{L1}$ is the fundamental angular frequency at the end of the synthesis frame ($n = L$, the start of the next synthesis frame). $\Delta\omega$ in equation (15) is set to the smallest value for which the phase $\phi_{Lm}$ at $n = L$ equals $\theta_m(L)$.

【００６０】[0060] The following describes how the amplitude $A_m(n)$ and the phase $\theta_m(n)$ are obtained for an arbitrary $m$-th band according to the V/UV decisions at $n = 0$ and $n = L$. When the $m$-th band is V (voiced) at both $n = 0$ and $n = L$, the amplitude $A_m(n)$ is calculated by linearly interpolating the transmitted amplitude values $A_{0m}$ and $A_{Lm}$ according to equation (14) above. For the phase $\theta_m(n)$, $\Delta\omega$ is set so that the phase goes from $\theta_m(0) = \phi_{0m}$ at $n = 0$ to $\theta_m(L) = \phi_{Lm}$ at $n = L$.

【００６１】[0061] Next, when the band is V (voiced) at $n = 0$ and UV (unvoiced) at $n = L$, the amplitude $A_m(n)$ is linearly interpolated from the transmitted amplitude value $A_{0m}$ at $A_m(0)$ down to 0 at $A_m(L)$. The transmitted amplitude value $A_{Lm}$ at $n = L$ is the amplitude of the unvoiced sound and is used in the unvoiced sound synthesis described later. The phase $\theta_m(n)$ is set with $\theta_m(0) = \phi_{0m}$ and $\Delta\omega = 0$.

【００６２】[0062] Further, when the band is UV (unvoiced) at $n = 0$ and V (voiced) at $n = L$, the amplitude $A_m(n)$ is linearly interpolated from $A_m(0) = 0$ at $n = 0$ up to the transmitted amplitude value $A_{Lm}$ at $n = L$. For the phase $\theta_m(n)$, the phase $\theta_m(0)$ at $n = 0$ is set using the phase value $\phi_{Lm}$ at the end of the frame,

$$\theta_m(0) = \phi_{Lm} - m\,(\omega_{01} + \omega_{L1})\,L/2 \qquad (16)$$

with $\Delta\omega = 0$.
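The three transition cases of paragraphs [0060] to [0062] can be gathered into one selection function. This is an illustrative sketch: the function name, the tuple layout and the use of `None` for a band that is unvoiced at both ends (which the voiced synthesizer does not handle) are assumptions.

```python
def voiced_track_params(v_start, v_end, a_0m, a_lm, phi_0m, phi_lm,
                        m, w_01, w_l1, l=160):
    """Endpoint settings for harmonic m over one synthesis frame.

    Returns (amp_start, amp_end, theta_start, delta_omega_is_free), or
    None when the band is unvoiced at both ends. delta_omega_is_free is
    True only in the V -> V case, where delta-omega is chosen to match
    the end phase; in the other cases delta-omega is 0.
    """
    if v_start and v_end:
        return (a_0m, a_lm, phi_0m, True)
    if v_start:
        # V -> UV: fade the amplitude out to 0, keep the start phase.
        return (a_0m, 0.0, phi_0m, False)
    if v_end:
        # UV -> V: fade in from 0; start phase from equation (16).
        theta_0 = phi_lm - m * (w_01 + w_l1) * l / 2.0
        return (0.0, a_lm, theta_0, False)
    return None
```

In the UV to V case, starting from the phase of equation (16) with $\Delta\omega = 0$ makes the phase of equation (15) arrive exactly at $\phi_{Lm}$ at the end of the frame.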

【００６３】The method of setting Δω so that θ_m(L) becomes φ_Lm when the band is V (voiced) at both n = 0 and n = L is described next. Putting n = L in equation (15) above gives

$$\theta_m(L) = m\omega_{01}L + L^2 m(\omega_{L1} - \omega_{01})/2L + \phi_{0m} + \Delta\omega L = m(\omega_{01} + \omega_{L1})L/2 + \phi_{0m} + \Delta\omega L = \phi_{Lm}$$

and rearranging for Δω yields

$$\Delta\omega = \mathrm{mod}2\pi\bigl((\phi_{Lm} - \phi_{0m}) - mL(\omega_{01} + \omega_{L1})/2\bigr)/L \qquad (17)$$

In equation (17), mod2π(x) is a function that returns the principal value of x, a value between −π and +π. For example, mod2π(x) = −0.7π when x = 1.3π, mod2π(x) = 0.3π when x = 2.3π, and mod2π(x) = 0.7π when x = −1.3π.
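As a numerical check of equation (17), the following sketch computes Δω and confirms that the resulting phase track ends at φ_Lm modulo 2π. The function names are hypothetical, and the form of `theta` is inferred from the n = L substitution shown above rather than copied from equation (15) itself. The half-open interval [−π, π) for `mod2pi` is also an assumption, since the text only says "between −π and +π".

```python
import math


def mod2pi(x):
    """Principal value of x, wrapped into [-pi, pi)."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi


def delta_omega(phi_0m, phi_Lm, m, w01, wL1, L):
    """Phase correction rate per equation (17)."""
    return mod2pi((phi_Lm - phi_0m) - m * L * (w01 + wL1) / 2.0) / L


def theta(n, phi_0m, m, w01, wL1, L, d_omega):
    """Phase track; form inferred from the n = L substitution in the text."""
    return (m * w01 * n
            + n * n * m * (wL1 - w01) / (2.0 * L)
            + phi_0m
            + d_omega * n)


# Made-up example values: frame length L = 160, band m = 3.
L, m = 160, 3
w01, wL1 = 0.050, 0.054
phi_0m, phi_Lm = 0.4, -1.1

d_omega = delta_omega(phi_0m, phi_Lm, m, w01, wL1, L)
end_error = mod2pi(theta(L, phi_0m, m, w01, wL1, L, d_omega) - phi_Lm)
print(d_omega, end_error)  # end_error should be numerically zero
```

Because the argument is wrapped before dividing by L, |Δω| never exceeds π/L, so the correction is the smallest frequency offset that makes the end phase match.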

【００６４】FIG. 8A shows an example of the spectrum of a speech signal, in which the bands with band number (harmonics number) m of 8, 9, and 10 are UV (unvoiced) and the other bands are V (voiced). The time-axis signals of the V (voiced) bands are synthesized by the voiced sound synthesis unit 126, and the time-axis signals of the UV (unvoiced) bands are synthesized by the unvoiced sound synthesis unit 127.

【００６５】The unvoiced sound synthesis process in the unvoiced sound synthesis unit 127 is described below. The time-axis white noise signal waveform from the white noise generation unit 131 is windowed with a suitable window function (for example, a Hamming window) of a predetermined length (for example, 256 samples) and subjected to STFT (short-term Fourier transform) processing by the STFT processing unit 132, which yields the frequency-axis power spectrum of the white noise as shown in FIG. 8B. The power spectrum from the STFT processing unit 132 is sent to the band amplitude processing unit 133, where, as shown in FIG. 8C, the bands determined to be UV (unvoiced) (for example, m = 8, 9, 10) are multiplied by the amplitude |A_m|_UV and the amplitudes of the other bands, determined to be V (voiced), are set to 0. The amplitude data, pitch data, and V/UV discrimination data are supplied to the band amplitude processing unit 133. The output from the band amplitude processing unit 133 is sent to the ISTFT processing unit 134, where inverse STFT processing, using the phase of the original white noise as the phase, converts it into a time-axis signal. The output from the ISTFT processing unit 134 is sent to the overlap-add unit 135, which repeats overlapping and addition with appropriate weighting on the time axis (so that the original continuous noise waveform can be restored) to synthesize a continuous time-axis waveform. The output signal from the overlap-add unit 135 is sent to the adder 129.
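The band amplitude processing step described above can be illustrated with a small sketch. It assumes the noise spectrum is available as a list of complex frequency bins and that each band is given as a half-open range of bin indices. Both the function name `shape_noise_spectrum` and the `band_edges` representation are hypothetical (the text derives the bands from the pitch data, which is not modeled here). Scaling a complex bin by a real amplitude leaves its phase unchanged, which corresponds to reusing the phase of the original white noise.

```python
def shape_noise_spectrum(noise_bins, band_edges, is_voiced, uv_amplitudes):
    """Scale UV bands of a noise spectrum and zero the V bands.

    noise_bins    -- complex spectrum of the windowed white noise
    band_edges    -- list of (lo, hi) bin ranges, one per band (hypothetical layout)
    is_voiced     -- list of booleans, True where the band was judged V
    uv_amplitudes -- per-band amplitude |A_m|_UV (ignored for V bands)
    """
    shaped = [0j] * len(noise_bins)
    for band, (lo, hi) in enumerate(band_edges):
        if is_voiced[band]:
            continue  # V bands are synthesized elsewhere; leave them at 0
        for k in range(lo, hi):
            # Real scaling keeps the phase of the original noise bin.
            shaped[k] = noise_bins[k] * uv_amplitudes[band]
    return shaped


# Tiny made-up example: two bands of two bins each, the second band is UV.
noise_bins = [1 + 1j, 2j, 1 - 1j, 3 + 4j]
shaped = shape_noise_spectrum(
    noise_bins,
    band_edges=[(0, 2), (2, 4)],
    is_voiced=[True, False],
    uv_amplitudes=[0.5, 2.0],
)
print(shaped)
```

The shaped spectrum would then be passed to an inverse transform and overlap-added frame by frame, which this sketch does not cover.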

【００６６】The voiced and unvoiced signals synthesized in the synthesis units 126 and 127 and returned to the time axis in this way are added by the adder 129 at a suitable fixed mixing ratio, and the reproduced speech signal is taken out from the output terminal 130.

【００６７】Although each part of the speech analysis side (encoder side) configuration of FIG. 2 and the speech synthesis side (decoder side) configuration of FIG. 8 is described as hardware, these configurations can also be realized by a software program using a so-called DSP (digital signal processor) or the like.

【００６８】The voiced sound discrimination method according to the present invention can also be used as a means of detecting background noise, for example when environmental noise (background noise and the like) is to be reduced on the transmitting side of a car telephone. That is, it is also applicable to noise detection in so-called speech enhancement, in which low-quality speech disturbed by noise is processed to remove the influence of the noise and make the sound easier to hear.

【００６９】[0069]

【図面の簡単な説明】[Brief description of drawings]

【符号の説明】[Explanation of Codes]
12 ... Windowing processing unit
13 ... Orthogonal transformation unit
14 ... Band division unit
15 ... V/UV discrimination unit
16 ... Modifying unit
18 ... Signal level detection unit
19 ... Comparison unit (signal level)
21 ... Judging unit
22 ... Peak detection unit
23 ... In-block variance detection unit
24 ... Normalization unit
25 ... Comparison unit (normalized peak)

Claims

[Claims]

1. A voiced sound determination method comprising: a step of dividing an input audio signal into blocks and converting it onto the frequency axis to obtain frequency-axis data; a step of dividing the frequency-axis data into a plurality of bands; a step of determining, for each of the divided bands, whether or not the band is voiced sound; and a step of detecting the signal level of each block in which a band determined to be voiced sound by the above determination exists, wherein all bands are made unvoiced when the detected signal level is equal to or less than a first predetermined threshold value.

2. A voiced sound determination method comprising: a step of dividing an input audio signal into blocks and converting it onto the frequency axis to obtain frequency-axis data; a step of dividing the frequency-axis data into a plurality of bands; a step of determining, for each of the divided bands, whether or not the band is voiced sound; a step of detecting the signal level of each block in which a band determined to be voiced sound by the above determination exists; a step of obtaining a peak value of the signal in the block; and a step of normalizing the peak value based on the signal level in the block, wherein all bands are made unvoiced when the normalized value is equal to or less than a second predetermined threshold value and the signal level of the block in which the band determined to be voiced sound exists is equal to or less than a first predetermined threshold value.