JPH10222194A

JPH10222194A - Discriminating method for voice sound and voiceless sound in voice coding

Info

Publication number: JPH10222194A
Application number: JP3262397A
Authority: JP
Inventors: Shinto Rin; 進燈林; Shinan Rin; 信安林
Original assignee: GOTAI HANDOTAI KOFUN YUGENKOSHI
Current assignee: GOTAI HANDOTAI KOFUN YUGENKOSHI
Priority date: 1997-02-03
Filing date: 1997-02-03
Publication date: 1998-08-21

Abstract

PROBLEM TO BE SOLVED: To provide a method for reliably discriminating voiced and unvoiced sounds in speech coding. SOLUTION: The speech frame data of the input speech is divided into four sub-frames, and each sub-frame is judged to be voiced or unvoiced as follows. The NC value (normalized correlation value) of the sub-frames is compared with a high threshold and with a low threshold. A stationary/non-stationary decision step then examines the energy value and the LSP (line spectrum pair) coefficient value of the sub-frame. When both are larger than set thresholds, a decision step on the low-band/high-band energy ratio (LOH) is performed: if the LOH value of a sub-frame is greater than or equal to a threshold, the sub-frame is judged to be a voiced signal, and otherwise an unvoiced signal.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to speech coding, and more particularly to a method used in speech coding to discriminate between voiced and unvoiced sounds.

[0002]

[Prior Art] In speech synthesis, linear predictive coding (LPC) is in general use. Among LPC methods, the LPC-10 speech encoder is widely used for low-bit-rate speech compression. For an LPC speech encoder, accurately identifying whether the input speech signal is voiced or unvoiced has been an important problem, because the voiced/unvoiced decision can greatly affect the output quality of the synthesized speech.

[0003] FIG. 1 is a block diagram of a conventional speech coding technique. The blocks in the figure are an impulse train generator 11, a random noise generator 12, a voiced/unvoiced switch 13, a gain unit 14, an LPC (linear predictive coding) filter 15, and an LPC filter control variable setting unit 16.

[0004] Either the periodic impulse train generated by the impulse train generator 11 or the white noise signal generated by the random noise generator 12 is selected by the voiced/unvoiced switch 13 according to the type of the input signal. The selected signal passes through the gain unit 14, which applies a gain to adjust the signal level. The LPC filter 15 then filters the signal according to the LPC parameters set in the LPC filter control variable setting unit 16, and the speech output S(n) is delivered at the output of the LPC filter 15.
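The signal path of FIG. 1 as described in [0004] can be sketched in a few lines. This is only an illustrative sketch: the function name `synthesize_frame`, the unit-amplitude pulses, the uniform noise source, and the sign convention of the all-pole recursion are assumptions, not details given in the text.

```python
import random

def synthesize_frame(voiced, pitch_period, gain, lpc_coeffs, n_samples=160, seed=0):
    """Source-filter synthesis of one frame (illustrative sketch only)."""
    rng = random.Random(seed)
    if voiced:
        # Impulse train generator 11: one unit pulse every pitch_period samples.
        excitation = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    else:
        # Random noise generator 12: white noise (uniform here, by assumption).
        excitation = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    # Gain unit 14: scale the selected excitation.
    excitation = [gain * e for e in excitation]
    # LPC filter 15: all-pole recursion s(n) = e(n) + sum_k a_k * s(n - k).
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out
```

The `voiced` flag plays the role of switch 13, which is why a wrong voiced/unvoiced decision changes the character of the whole synthesized frame.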

[0005] When performing the speech identification step described above, the identification device updates the voiced/unvoiced decision, the pitch period, the LPC parameters, and the gain value for every speech frame of the input speech, so that it can track changes in the input speech. In a typical current technique each speech frame comprises 160 samples; that is, one frame of the predetermined size is taken every 0.02 seconds.

[0006] In the speech identification described above, the voiced/unvoiced decision has traditionally been based on the strength of the pitch correlation. For example, if the normalized cross-correlation value (NC value) is greater than or equal to a preset threshold, for example 0.4, the speech frame is judged to be a normal (voiced) speech signal, and the speech synthesizer excites the LPC filter with a periodic pulse train. Conversely, if the NC value is smaller than the threshold 0.4, the speech frame is judged to be an unvoiced signal, and the speech synthesizer excites the LPC filter with the random noise generator. The NC value is defined by Equation 2 below.

[Equation 2]
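The single-threshold decision of [0006] can be illustrated as follows. Equation 2 is not reproduced in this text, so the correlation below uses a commonly used normalized form (correlation at a candidate pitch lag divided by the geometric mean of the two segment energies); that formula and both function names are assumptions for illustration and may differ from the patent's own definition.

```python
import math

def normalized_correlation(frame, lag):
    """Normalized cross-correlation of a frame with itself shifted by `lag` (assumed form)."""
    x = frame[lag:]
    y = frame[:-lag]
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den if den > 0 else 0.0

def classify_single_threshold(nc, threshold=0.4):
    """Conventional decision: voiced if NC >= threshold, otherwise unvoiced."""
    return "voiced" if nc >= threshold else "unvoiced"
```

A strongly periodic frame evaluated at its own period gives an NC value close to 1, whereas values just below 0.4 are the borderline cases that the following paragraphs discuss.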

[0007] However, for an unstable speech signal, that is, one that fluctuates in the uncertain region around the threshold, the NC value may fall below the threshold 0.4 by only a very small margin, and the simple decision method described above then cannot determine accurately whether the signal is voiced or unvoiced. In practical applications, therefore, misclassification could occur.

[0008] To overcome this problem and improve the accuracy of the decision, the known technique had to evaluate the energy of the speech signal in addition to the NC value, and in this way achieved a comparatively accurate decision.

[0009] The known art also includes a second kind of voiced/unvoiced discrimination method. In this second known technique, the evaluation of the speech signal energy covers the following two quantities.

a. Speech energy. In general, the energy of unvoiced speech is lower than that of voiced speech. The root mean square (RMS) value of the energy is given by Equation 3 below.

[Equation 3]

where N represents the entire speech frame of the input speech signal.

b. Zero-crossing rate (ZC). This is defined as the number of zero crossings in the entire speech frame and is given by Equation 4 below.

[Equation 4]

In the speech coding technique described above, each speech frame contains 160 samples. In terms of bits, each speech frame comprises 34 bits of LPC parameters, 6 bits of pitch, 1 bit for the voiced/unvoiced flag, and 7 bits for the gain value, for a total of 48 bits.
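The quantities named in [0009] can be sketched as follows. Equations 3 and 4 are not reproduced in this text, so `rms` and `zero_crossings` use the standard textbook definitions, which is an assumption. The bit-rate arithmetic additionally assumes that the 0.02 seconds mentioned in [0005] is the duration of one 160-sample frame.

```python
import math

def rms(frame):
    """Root mean square over the whole frame (standard definition, assumed)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def zero_crossings(frame):
    """Count sign changes between consecutive samples; zero is treated as non-negative."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

# Frame layout stated in the text.
FRAME_SAMPLES = 160
FRAME_SECONDS = 0.02  # assumed to be the frame duration
BITS = {"lpc": 34, "pitch": 6, "voicing": 1, "gain": 7}

bits_per_frame = sum(BITS.values())            # 34 + 6 + 1 + 7
sample_rate = FRAME_SAMPLES / FRAME_SECONDS    # samples per second
bit_rate = bits_per_frame / FRAME_SECONDS      # bits per second
```

Under these assumptions the figures work out to an 8 kHz sampling rate and 2400 bit/s, of which only 1 bit per frame carries the voicing decision.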

[0010] As stated above, accurately determining whether the input speech signal is voiced or unvoiced is an important problem in speech coding, and the decision process strongly affects the output quality of the synthesized speech. If an unvoiced sound is misjudged as voiced, the synthesized speech sounds like a growl; if a voiced sound is misjudged as unvoiced, the synthesized speech sounds like a knocking noise. The conventional techniques described above could not solve this problem effectively.

[0011] Furthermore, the second conventional technique uses a single bit to represent the voiced or unvoiced state of a speech frame, and that single bit must also cover the borderline state between voiced and unvoiced. As a result, when an entire speech frame lies in the borderline region and cannot be clearly judged as voiced or unvoiced, the synthesized output often sounds noisy.

[0012]

[Problem to Be Solved by the Invention] The shortcomings of the known techniques described above show that conventional speech coding needs improvement. Accordingly, the main object of the present invention is to provide an improved speech coding technique that yields excellent synthesized speech output quality in the speech coding process.

[0013] Another object of the present invention is to provide a method, used in speech coding, for accurately discriminating voiced from unvoiced sounds, so that it can be accurately determined whether a speech frame in the input speech signal is voiced or unvoiced.

[0014] A further object of the present invention is to provide a quarter voiced/unvoiced decision scheme (Quarter Voiced/Unvoiced Decision Scheme). In this scheme, each speech frame of the input speech signal is divided into four subframes, and each subframe is then judged comprehensively, on the basis of its related parameters, to be voiced or unvoiced, so that an accurate and natural speech signal is output at the speech synthesis output according to the result.
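The division into four subframes can be sketched as below. Equal-length subframes (40 samples each for a 160-sample frame) are an assumption; the text only states that the frame is divided into four. The function name is hypothetical.

```python
def split_into_subframes(frame, n_sub=4):
    """Split a frame into n_sub equal-length, consecutive subframes (assumed layout)."""
    if len(frame) % n_sub != 0:
        raise ValueError("frame length must be a multiple of the number of subframes")
    size = len(frame) // n_sub
    return [frame[i * size:(i + 1) * size] for i in range(n_sub)]
```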

[0015] A further object of the present invention is to provide an accurate method of discriminating voiced and unvoiced sounds in a speech frame of an input speech signal. In the method, the speech frame of the input speech is first divided into four subframes, and it is determined whether the NC value (normalized cross-correlation value) of the four subframes is greater than or equal to a high threshold (for example 0.7). It is then determined whether the NC value is smaller than a low threshold (for example 0.4). These two decision steps identify the signals that are clearly voiced or clearly unvoiced; the signals lying between those two classes are examined next. That is, if the NC value is found not to be smaller than the low threshold, a stationary/non-stationary decision step is performed, which examines the magnitudes of the energy value and of the line spectrum pair (LSP) coefficient value of the subframe. If the energy value and the LSP coefficient value are not greater than preset thresholds, the speech signal is judged to be in a stationary state, and all four subframes are assigned the voiced/unvoiced state of the last subframe of the preceding speech frame. If the energy value and the LSP coefficient value are greater than the set thresholds, a decision step on the low-to-high band energy ratio (LOH) of the subframes is performed: for each subframe it is determined whether the LOH value is greater than a threshold; if so, the subframe is judged to be a voiced speech signal, and otherwise an unvoiced speech signal. The decision then proceeds to the next subframe in the same way, until all four subframes have been judged.
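The first two tests of [0015] amount to a three-way gate on the NC value. A minimal sketch, using the example thresholds 0.7 and 0.4 from the text; the function name and the label strings are hypothetical.

```python
def nc_gate(nc, high=0.7, low=0.4):
    """Three-way split on the NC value using the example thresholds from the text."""
    if nc >= high:
        return "voiced"      # clearly voiced
    if nc < low:
        return "unvoiced"    # clearly unvoiced
    return "uncertain"       # handed on to the stationary/non-stationary decision
```

Only the "uncertain" band, 0.4 <= NC < 0.7 with these example values, needs the additional energy, LSP, and LOH tests.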

[0016]

[Means for Solving the Problem] The invention of claim 1 is a method of discriminating voiced and unvoiced sounds in speech coding, used to identify the attribute of speech frame data of input speech, the method comprising the following steps a to f:

a. dividing the speech frame data of the current input speech into four subframes;

b. determining whether the normalized cross-correlation value (NC value) of the four subframes is greater than or equal to a high threshold and, if so, judging that all four subframes of the current input speech frame are voiced signals;

c. if in step b the NC value of the subframes is not greater than or equal to the high threshold, determining whether the NC value is smaller than a low threshold and, if so, judging that all four subframes of the speech frame are unvoiced signals;

d. if in step c the NC value is found not to be smaller than the low threshold, performing a stationary/non-stationary decision step that examines the magnitudes of the energy value and of the line spectrum pair (LSP) coefficient value of the subframe;

e. if the energy value and the LSP coefficient value are not greater than set thresholds, judging that the speech signal is in a stationary state and setting the attributes of all four subframes to the voiced or unvoiced state of the last subframe of the preceding speech frame; and

f. if in step e the energy value and the LSP coefficient value are greater than the set thresholds, performing a decision step on the low-to-high band energy ratio (LOH value) of the subframe, in which each LOH value is compared with a threshold, the subframe being judged a voiced signal if its LOH value is greater than the threshold and an unvoiced signal otherwise, and the decision proceeding to the next subframe until all four subframes have been judged.
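Steps a to f can be read as a small decision procedure. The sketch below assumes that the features (NC value, energy change, LSP change, and one LOH value per subframe) have already been computed; the function name, parameter names, and default thresholds (taken from the example values 0.7, 0.4, 0.45, 0.4, and 1 given elsewhere in the description) are illustrative assumptions. The stationary test follows the embodiment's "both changes at or above their thresholds" reading.

```python
def classify_frame(nc, energy_change, lsp_change, loh_values, prev_last_voiced,
                   nc_high=0.7, nc_low=0.4, energy_thr=0.45, lsp_thr=0.4, loh_thr=1.0):
    """Return four booleans (True = voiced), one per subframe."""
    # Step b: clearly voiced frame.
    if nc >= nc_high:
        return [True] * 4
    # Step c: clearly unvoiced frame.
    if nc < nc_low:
        return [False] * 4
    # Steps d and e: stationary frames inherit the state of the previous
    # frame's last subframe.
    nonstationary = energy_change >= energy_thr and lsp_change >= lsp_thr
    if not nonstationary:
        return [prev_last_voiced] * 4
    # Step f: per-subframe decision on the low-to-high band energy ratio.
    return [loh > loh_thr for loh in loh_values]
```

For example, a frame with NC = 0.5 and large energy and LSP changes is decided subframe by subframe, so a voiced-to-unvoiced transition inside the frame can be represented.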

[0017] The invention of claim 2 is the method of discriminating voiced and unvoiced sounds in speech coding according to claim 1, characterized in that the high threshold used in step b to evaluate the NC value of the subframes is set to 0.7.

[0018] The invention of claim 3 is the method of discriminating voiced and unvoiced sounds in speech coding according to claim 1, characterized in that the low threshold used in step c to evaluate the NC value of the subframes is set to 0.4.

[0019] The invention of claim 4 is the method of discriminating voiced and unvoiced sounds in speech coding according to claim 1, wherein, in the stationary/non-stationary decision of step d, the energy value of the subframe is evaluated by determining whether the difference between the previous energy and the current energy is greater than or equal to a set threshold.

[0020] The invention of claim 5 is the method of discriminating voiced and unvoiced sounds in speech coding according to claim 4, wherein the threshold set in the energy decision step is 0.45.

[0021] The invention of claim 6 is the method of discriminating voiced and unvoiced sounds in speech coding according to claim 1, characterized in that, in the stationary/non-stationary decision of step d, the LSP coefficient value of the subframe is evaluated by examining the difference between the previous average LSP coefficient value and the current LSP coefficient value.

[0022] The invention of claim 7 is the method of discriminating voiced and unvoiced sounds in speech coding according to claim 6, characterized in that the threshold used in the LSP coefficient decision step for the subframe is set to 0.4.
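Claims 4 to 7 describe the stationary/non-stationary test in terms of two differences and two thresholds. The text does not say how the differences are measured, so the sketch below assumes an absolute difference for the energy and a mean absolute difference across coefficients for the LSP vector; both choices, and the function name, are assumptions.

```python
def is_nonstationary(prev_energy, cur_energy, past_avg_lsp, cur_lsp,
                     energy_thr=0.45, lsp_thr=0.4):
    """True when both the energy change and the LSP change reach their thresholds."""
    # Energy test (claims 4 and 5): previous energy versus current energy.
    energy_dis = abs(cur_energy - prev_energy)
    # LSP test (claims 6 and 7): previous average LSP versus current LSP.
    # Mean absolute difference is an assumed distance measure.
    lsp_dis = sum(abs(c - p) for c, p in zip(cur_lsp, past_avg_lsp)) / len(cur_lsp)
    return energy_dis >= energy_thr and lsp_dis >= lsp_thr
```

Thresholds of this size suggest that the inputs are normalized or log-domain quantities, but the text does not specify their scaling.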

[0023] The invention of claim 8 is the method of discriminating voiced and unvoiced sounds in speech coding according to claim 1, characterized in that, in the LOH decision of step f, LOH is defined by Equation 1 below:

[Equation 1]

where i denotes the i-th subframe and S_lp1k denotes the signal obtained after the original signal has passed through a 1 kHz low-pass filter. In this definition, the energy of the speech signal below 1 kHz and the energy above 1 kHz are taken over a window of length W and divided one by the other. The window length W is defined as follows:

W = pitch, when the pitch is greater than Nsubframe;

W = 2 * pitch, when the pitch is greater than or equal to Nsubframe/2 and smaller than Nsubframe;

where Nsubframe denotes the subframe length in samples. In the LOH definition, the silence threshold Tsil is the maximum speech value of the current speech frame. The Tsil value may be added to the energy of the speech signal that has passed through a 1 kHz high-pass filter, which biases low-energy voiced signals toward being classified as unvoiced. doffset(j) is the center position of each subframe and is defined as

doffset(j) = Nsubframe * (j - 1/2), j = 1 to 4,

where j denotes the subframe number.
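The window length W and the subframe center doffset(j) translate directly into code. In this sketch the function names are hypothetical, and returning `None` for pitch values that fall outside the two stated cases (pitch equal to Nsubframe, or below Nsubframe/2) is an assumption, since the text does not cover them.

```python
def window_length(pitch, n_subframe):
    """Window length W as a function of pitch and subframe length, per the two stated cases."""
    if pitch > n_subframe:
        return pitch
    if n_subframe / 2 <= pitch < n_subframe:
        return 2 * pitch
    return None  # case not specified in the text

def doffset(j, n_subframe):
    """Center position of subframe j (j = 1..4): Nsubframe * (j - 1/2)."""
    return n_subframe * (j - 0.5)
```

With Nsubframe = 40 (an assumed equal split of a 160-sample frame), the four centers fall at samples 20, 60, 100, and 140.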

[0024]

[Embodiments of the Invention] In the discrimination method of the present invention, a speech frame of the input speech signal is divided into four subframes, and each subframe is then judged to be voiced or unvoiced on the basis of its related parameters. These parameters are the NC value, the energy, the line spectrum pair (LSP) coefficients, and the low-to-high band energy ratio (LOH).

[0025] The decision steps of the present invention are as follows. FIG. 2 is the decision flowchart of the present invention. After the start step 101 of the flowchart, step 102 is executed, in which the current speech frame data is acquired. Step 103 then determines whether the NC value is greater than or equal to a high threshold of 0.7 (for the definition of the NC value, see the description above). If the result is yes, step 104 is executed: all four subframes in the current input speech frame data are judged to be voiced signals, and the decision process ends.

[0026] If step 103 finds that the NC value is not greater than or equal to the high threshold 0.7, step 105 determines whether the NC value is lower than the low threshold 0.4. If so, all four subframes in the speech frame are judged to be unvoiced signals, and the decision process ends.

[0027] After the decisions of steps 103 and 105, the signals that are clearly voiced or clearly unvoiced have been identified. What remains are the signals lying between those two classes. In this unstable transition region, the NC value tests of steps 103 and 105 alone cannot discriminate voiced from unvoiced accurately, and it is the following decision method that solves the problem addressed by the invention. The following decision steps are therefore the key steps that characterize the present invention.

[0028] If step 105 finds that the NC value is not smaller than 0.4, a stationary/non-stationary decision (S/NS decision) is performed. This step comprises two tests. The first is the energy test, which examines the difference between the previous energy and the current energy, dis(PrEng, CuEng). To further improve the accuracy of the S/NS decision, this step also includes an LSP coefficient test. The LSP coefficients are obtained from the LPC equalizer. The LSP test takes the difference between the previous average LSP and the current LSP, dis(PaLSP, CuLSP). The S/NS decision of step 107 determines whether

a. dis(PrEng, CuEng) is 0.45 or more, and

b. dis(PaLSP, CuLSP) is 0.4 or more.

If the result is no, the speech signal is in a stationary state, and step 108 is executed: all four subframes are set to the voiced or unvoiced state of the last subframe of the preceding speech frame. Conversely, if the result of step 107 is yes (meaning that the energy or the LSP coefficients are changing very rapidly), the LOH decision steps (steps 109 to 113) are executed, classifying each subframe as voiced or unvoiced to obtain an accurate result. LOH is defined by Equation 1 below.

[Equation 1]

where i denotes the i-th subframe and S_lp1k denotes the signal obtained after the original signal has passed through a 1 kHz low-pass filter. In this definition, the energy of the speech signal below 1 kHz and the energy above 1 kHz are taken over a window of length W and divided one by the other. The window length W is defined as follows:

W = pitch, when the pitch is greater than Nsubframe;

W = 2 * pitch, when the pitch is greater than or equal to Nsubframe/2 and smaller than Nsubframe;

where Nsubframe denotes the subframe length in samples. In addition, in the LOH definition a silence threshold Tsil is chosen as the maximum speech value of the current speech frame. The Tsil value may be added to the energy of the speech signal that has passed through a 1 kHz high-pass filter, which biases low-energy voiced signals toward being classified as unvoiced. doffset(j) is the center position of each subframe and is defined as

doffset(j) = Nsubframe * (j - 1/2), j = 1 to 4,

where j denotes the subframe number.

In the LOH decision flowchart of the present invention, step 110 first determines whether the LOH of the first subframe (see the definition above) is greater than 1. If yes, step 112 is executed and the subframe is judged to be a voiced signal. If no, step 111 is executed and the subframe is judged to be an unvoiced signal. The process then returns through steps 113 and 119 to evaluate the next subframe, and ends once all four subframes have been judged. In other words, in the LOH decision each subframe whose LOH value is greater than the threshold is judged voiced, and otherwise unvoiced. Once all four subframes of a speech frame have been judged, the process proceeds to encoding on the basis of the result. In the present invention the four subframes are encoded with only 3 bits, as shown in FIG. 3, where 1 denotes voiced and 0 denotes unvoiced.
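The per-subframe LOH loop of steps 109 to 113 can be sketched as follows. Equation 1 is not reproduced in this text, so the ratio below is only one plausible reading of the prose: low-band energy over high-band energy plus Tsil, both summed over a window of length W centered at doffset(j). The inputs `s_lp` and `s_hp` are assumed to be the frame after 1 kHz low-pass and high-pass filtering, and all function and parameter names are hypothetical.

```python
def loh(s_lp, s_hp, center, window, t_sil):
    """Assumed LOH: low-band energy / (high-band energy + Tsil) over one window."""
    half = window // 2
    start = max(0, center - half)
    stop = min(len(s_lp), center + half)
    low = sum(x * x for x in s_lp[start:stop])
    high = sum(x * x for x in s_hp[start:stop]) + t_sil
    return low / high if high > 0 else float("inf")

def classify_subframes(s_lp, s_hp, window, t_sil, n_subframe=40, threshold=1.0):
    """Judge each of the four subframes: True (voiced) if LOH > threshold."""
    decisions = []
    for j in range(1, 5):
        center = int(n_subframe * (j - 0.5))   # doffset(j)
        decisions.append(loh(s_lp, s_hp, center, window, t_sil) > threshold)
    return decisions
```

Adding Tsil to the denominator lowers the ratio most for quiet windows, which is how low-energy segments are pushed toward the unvoiced decision.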

[0029] After the index value shown in FIG. 3 is obtained, the corresponding value is stored, which completes the encoding process. In an actual application, the data is subsequently decoded with a known speech synthesis technique to generate the required synthesized speech.

[0030]

[Effects of the Invention] The present invention provides an improved speech coding technique that yields excellent synthesized speech output quality in the speech coding process. It further provides a method, used in speech coding, for accurately discriminating voiced from unvoiced sounds, so that it can be accurately determined whether a speech frame in the input speech signal is voiced or unvoiced. It also provides a quarter voiced/unvoiced decision scheme (Quarter Voiced/Unvoiced Decision Scheme).

[Brief description of the drawings]

FIG. 1 is a basic block diagram of a traditional speech coding technique.

FIG. 2 is a determination flowchart of the present invention.

FIG. 3 is a code table in which four subframes are encoded with 3 bits in the present invention.

[Explanation of symbols]

11 impulse train generator
12 random noise generator
13 voiced/unvoiced sound changeover switch
14 gain unit
15 LPC filter
16 LPC filter control variable setting unit

Claims

[Claims]

1. A method for discriminating voiced and unvoiced sounds in speech coding, used to identify an attribute of speech frame data of an input speech, comprising the steps of:
a. dividing the speech frame data of the current input speech into four subframes;
b. determining whether the normalized cross-correlation value (Normalized Cross Correlation Value), i.e., the NC value, of the four subframes is greater than or equal to a high critical value, and, if yes, determining that all four subframes of the current input speech frame are voiced sound signals;
c. if the NC value is not greater than or equal to the high critical value in step b, determining whether the NC value is smaller than a low critical value, and, if yes, determining that each of the subframes belongs to an unvoiced sound signal;
d. if the NC value is determined in step c not to be smaller than the low critical value, performing a stable/unstable determination step in which the magnitude of the energy value of the subframe and the magnitude of the line spectrum pair (Line Spectrum Pair) coefficient value, i.e., the LSP coefficient value, are determined;
e. if the energy value and the LSP coefficient value are not greater than the set threshold values, determining that the speech signal is in a stable state and setting the attributes of all four subframes to the voiced or unvoiced state of the last subframe of the previous speech frame;
f. if, in step e, the energy value and the LSP coefficient value are greater than the set threshold values, performing a step of determining the low-to-high band energy ratio (Low to High Band Energy Ratio Value), i.e., the LOH value, of the subframe, in which it is determined whether each LOH value is greater than a threshold value, the subframe being determined to be a voiced sound signal if the LOH value is greater than the threshold value and an unvoiced sound signal if it is not, the determination then proceeding to the next subframe and ending once all four subframes have been determined.

2. The method for discriminating voiced and unvoiced sounds in speech coding according to claim 1, wherein the high critical value used when determining the NC value of the subframe in step b is set to 0.7.

3. The method for discriminating voiced and unvoiced sounds in speech coding according to claim 1, wherein the low critical value used when determining the NC value of the subframe in step c is set to 0.4.

4. The method for discriminating voiced and unvoiced sounds in speech coding according to claim 1, wherein, in the stable/unstable determination step of step d, the determination of the energy value of the subframe determines whether the difference between the previous energy value and the current energy value is equal to or more than a set critical value.

5. The method for discriminating voiced and unvoiced sounds in speech coding according to claim 4, wherein in the step of determining the energy amount value, the set critical value is 0.45.

6. The method for discriminating voiced and unvoiced sounds in speech coding according to claim 1, wherein, in the stable/unstable determination step of step d, the determination of the LSP coefficient value of the subframe determines the difference between the previous average LSP coefficient value and the current LSP coefficient value.

7. The method according to claim 6, wherein the threshold value used in the step of determining the LSP coefficient value of the subframe is set to 0.4.

8. The method for discriminating voiced and unvoiced sounds in speech coding according to claim 1, wherein, in the step of determining the LOH value of the subframe in step f, LOH is defined by equation (1), where i represents the i-th subframe and S_lp1k represents the signal obtained after the original signal has passed through a 1 kHz low-pass filter; the definition takes the ratio of the energy below 1 kHz to the energy above 1 kHz in the speech signal over one window length W; the window length W is defined as follows: when the pitch is larger than Nsubframe, W = pitch, and when the pitch is Nsubframe/2 or more and less than Nsubframe, W = 2 * pitch, where Nsubframe indicates the subframe length in samples; in the definition of LOH, the silence threshold Tsil is the maximum speech value of the current speech frame, and this value may be added to the energy of the speech signal that has passed through the 1 kHz high-pass filter, thereby giving voiced signals of low energy a tendency to be selected as unvoiced; and doffset(j) is the center position of each subframe, defined as doffset(j) = Nsubframe * (j - 1/2), j = 1 to 4, where j represents the number of the subframe.
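
The decision cascade of claims 1 to 7 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation. The function name classify_frame and its parameters are hypothetical. A single NC value per frame is assumed. The energy and LSP inputs are assumed to be the already computed differences of claims 4 and 6. The frame is treated as unstable only when both differences exceed their thresholds, with a strict comparison. The LOH critical value of 1 is assumed from step 110 of the description.

```python
NC_HIGH = 0.7        # high critical value for NC (claim 2)
NC_LOW = 0.4         # low critical value for NC (claim 3)
ENERGY_DIFF = 0.45   # critical value for the energy difference (claim 5)
LSP_DIFF = 0.4       # threshold for the LSP difference (claim 7)
LOH_CRITICAL = 1.0   # assumption: value used in step 110 of the description


def classify_frame(nc, energy_diff, lsp_diff, loh_values, prev_last_flag):
    """Return four flags (1 = voiced, 0 = unvoiced) for one speech frame.

    nc             -- normalized cross-correlation value (assumed: one per frame)
    energy_diff    -- difference between previous and current energy values
    lsp_diff       -- difference between previous average and current LSP values
    loh_values     -- four precomputed LOH values, one per subframe
    prev_last_flag -- flag of the last subframe of the previous frame
    """
    if nc >= NC_HIGH:            # step b: all four subframes voiced
        return [1] * 4
    if nc < NC_LOW:              # step c: all subframes unvoiced
        return [0] * 4
    # Step d: stable/unstable determination (assumption: both must exceed).
    unstable = energy_diff > ENERGY_DIFF and lsp_diff > LSP_DIFF
    if not unstable:             # step e: inherit the previous state
        return [prev_last_flag] * 4
    # Step f: per-subframe LOH decision.
    return [1 if loh > LOH_CRITICAL else 0 for loh in loh_values]


# Hypothetical inputs: mid-range NC and an unstable frame, so LOH decides.
print(classify_frame(0.5, 0.5, 0.5, [2.0, 0.5, 1.5, 0.9], prev_last_flag=0))
# [1, 0, 1, 0]
```

Ordering the tests this way means the LOH values are only needed when the NC value falls between the two critical values and the frame is judged unstable.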