JPH10232699A - LPC vocoder

Info

Publication number: JPH10232699A
Application number: JP9052589A
Authority: JP
Inventors: Akihiro Nakahara; 聡宏中原
Original assignee: Japan Radio Co Ltd
Current assignee: Japan Radio Co Ltd
Priority date: 1997-02-21
Filing date: 1997-02-21
Publication date: 1998-09-02

Abstract

PROBLEM TO BE SOLVED: To reproduce clear and natural synthesized speech by mixing noise components into the voiced sound excitation signal and adding a secondary pulse and fluctuation. SOLUTION: A pitch extraction and voice decision unit 5 extracts the pitch and the voice decision value of the frame data and outputs the voice decision value to an encoding unit 11. When the voice decision value indicates a voiced sound, it also outputs the extracted pitch to the encoding unit 11 as a pitch coefficient that approximates the periodic component of the vocal cord wave of the frame. When the voice decision value indicates a voiced sound, a mixing coefficient calculation unit 6 compares this pitch coefficient with the band pitches of the 3rd to 5th band data, sums the band voice decision values of the band data whose pitch difference is within a fixed value, and outputs the sum to the encoding unit 11 as a mixing coefficient. A fluctuation decision unit 7 extracts a fluctuation decision value for the pitch from the voice decision value, the 1st and 2nd band pitches, and the frame data held in a shift register 2.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to improving the quality of an LPC vocoder (linear predictive speech analysis and synthesis apparatus) used for highly efficient transmission and recording of speech data.

[0002]

[Prior Art] A vocoder is a speech analysis and synthesis apparatus that analyzes input speech to extract feature parameters characterizing that speech, for example encodes them, transmits them at a low bit rate and decodes them, and then resynthesizes the speech from the feature parameters, thereby enabling highly efficient transmission and recording of the original speech. LPC (Linear Predictive Coding) vocoders based on linear predictive analysis, among others, have been implemented as LSIs and are widely used.

[0003] However, a conventional vocoder divides the speech data into frames of a fixed time length, for example 22.5 ms, and extracts feature parameters and resynthesizes speech frame by frame. For this and similar reasons, the synthesized speech takes on the artificial quality peculiar to vocoders.

[0004] For this reason, various attempts have been made to synthesize more natural speech. One of them is Japanese Patent Application No. 7-302945, "Speech signal synthesizing method" (hereinafter referred to as the prior art). For the feature parameters transmitted frame by frame, that invention refers to the feature parameters of the previous frame and performs inter-frame interpolation so that the feature parameters change continuously between frames, thereby achieving more natural speech synthesis.

[0005] FIG. 3 is a block diagram showing the configuration of the vocoder of this prior art. Feature parameters extracted from the input speech by the analysis unit 50 of a first vocoder are transmitted via a transmission line 54, and the synthesis unit 60 of a second vocoder synthesizes the output speech based on the transmitted feature parameters. In general, both the first and second vocoders are provided with an analysis unit and a synthesis unit for two-way communication, but because these are duplicates, the synthesis unit of the first vocoder and the analysis unit of the second vocoder are omitted from FIG. 3.

[0006] The analysis unit 50 comprises an analysis processing unit 51 that analyzes the input speech signal, encoders 53-1 to 53-4 that encode the outputs of the analysis processing unit 51, and a multiplexer 52 that multiplexes the outputs of these encoders and outputs them to the transmission line 54. The synthesis unit 60 comprises a demultiplexer 61 that separates the multiplexed signal input from the transmission line 54, decoders 62-1 to 62-4 that decode the separated signals, an interpolation processing unit 63 that performs interpolation based on the outputs of these decoders, and a synthesis processing unit 64 that synthesizes the speech signal based on the output of the interpolation processing unit 63.

[0007] The analysis processing unit 51 samples the input speech signal at a sampling frequency of 8 kHz and converts it into a digital signal, divides it into frames (analysis intervals) of 22.5 ms, and for each frame extracts feature parameters relating to the signal waveform of that frame. In the example of FIG. 3 these are the voiced/unvoiced distinction, the pitch period, the LSP (line spectrum pair) coefficients, and the frame power. These feature parameters are encoded by the encoders 53-1 to 53-4, multiplexed by the multiplexer 52, passed over the transmission line 54, demultiplexed by the demultiplexer 61, decoded by the decoders 62-1 to 62-4, and then supplied to the interpolation processing unit 63.

[0008] Because the feature parameters supplied in this way are extracted frame by frame in the analysis processing unit 51, they are discrete per-frame values, and output speech synthesized directly from them inevitably has the artificial quality peculiar to vocoders. The prior art shown in FIG. 3 therefore provides the interpolation processing unit 63, which linearly interpolates the feature parameter values of the current frame with reference to those of the previous frame so that they change continuously between frames, yielding more natural speech synthesis. For example, when both the current frame and the previous frame are voiced, the pitch period is interpolated so that successive pitch intervals change in an arithmetic progression. The transmitted line spectrum pair and frame power of the current frame are corrected, with reference to the corresponding values of the previous frame, so that they change continuously from pitch to pitch over the interpolated periods, and line spectrum pair coefficients and a per-pitch frame power are assigned to each pitch.

[0009] For each pitch period interpolated in this way, the synthesis processing unit 64 generates an excitation whose power corresponds to the corrected frame power: a single impulse for voiced sound, or random noise for unvoiced sound. This excitation signal is passed through a filter circuit whose transfer characteristic corresponds to the interpolated line spectrum pair values, thereby synthesizing the speech waveform data for that pitch. In this way, the prior art of FIG. 3 reduces the unnaturalness caused by discontinuous changes in feature parameter values at frame boundaries.
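The prior-art excitation described in [0009] can be illustrated with a minimal Python sketch. The function name `prior_art_excitation` and the scaling rule (mean-square value over the segment equal to the target power) are assumptions made for illustration; the source only says that the excitation power corresponds to the corrected frame power.

```python
import random


def prior_art_excitation(voiced, length, power, rng=None):
    """Excitation for one segment of `length` samples (hypothetical helper).

    voiced:   one impulse followed by zeros (one pitch period).
    unvoiced: uniform random noise.
    Assumption: the segment is scaled so that its mean-square value
    equals `power`; the source does not give the exact scaling.
    """
    rng = rng or random.Random(0)
    if voiced:
        # All energy sits in one sample: a*a / length == power.
        return [(power * length) ** 0.5] + [0.0] * (length - 1)
    noise = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    mean_square = sum(x * x for x in noise) / length
    scale = (power / mean_square) ** 0.5 if mean_square > 0 else 0.0
    return [x * scale for x in noise]
```

A single impulse per pitch period is exactly the approximation that the following paragraphs identify as a source of unnatural sound.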

[0010]

[Problems to Be Solved by the Invention] However, although the above prior art can eliminate the unnaturalness caused by discontinuity of the feature parameters, it still approximates the voiced sound excitation signal by a single impulse. As a result, several problems remain: the quality of the synthesized speech is degraded by spectral peaks that arise in the high frequency band; the unnaturalness that comes from simply approximating the voiced sound excitation signal by a regular pulse train and the unvoiced sound excitation signal by random noise is not improved; and the reproducibility of the synthesized speech is poor because frame power and other values extracted over a fixed frame length are assigned to a synthesis frame length that expands and contracts according to the interpolated pitch.

[0011] The present invention has been made to solve these problems. Its object is to provide an LPC vocoder that can reproduce clear and highly natural synthesized speech by mixing a noise component into the voiced sound excitation signal and adding a secondary pulse and fluctuation, and that achieves better reproducibility by lengthening or shortening the frame length subject to analysis when extracting feature parameters, on the premise that the pitch is interpolated in the synthesis unit.

[0012]

[Means for Solving the Problems] The present invention is an LPC vocoder that divides a speech signal into frame data of a fixed time length, extracts from each frame data feature parameters including at least the voiced/unvoiced distinction, the power, the characteristic coefficients of the vocal tract equivalent filter used when synthesizing the frame data, and, for a voiced sound, its pitch, and restores the speech signal by synthesizing the frame data corresponding to these feature parameters, characterized by comprising means for adding a secondary pulse to the excitation signal generated when synthesizing voiced frame data.

[0013] The invention is further characterized by comprising means for additionally extracting, when the feature parameters of voiced frame data are extracted, a mixing coefficient indicating the degree to which a noise component is to be mixed, and means for adding a noise component to the excitation signal generated at synthesis in accordance with this mixing coefficient.

[0014] The invention is further characterized by comprising means for additionally extracting, when the feature parameters of voiced frame data are extracted, a fluctuation decision value indicating the degree of pitch fluctuation, and means for adding a fluctuation component to the pitch intervals of the excitation signal generated at synthesis in accordance with this fluctuation decision value.

[0015] Further, in the case where, when synthesizing the frame data corresponding to the feature parameters, the feature parameter indicating the pitch is interpolated so as to change smoothly between frames and the frame length of each synthesized frame data is adjusted accordingly to restore the speech signal, the invention is characterized by comprising means for adjusting, when the power is extracted during feature parameter extraction, the length of the frame data subject to extraction in correspondence with the frame length adjusted at synthesis.

[0016] Furthermore, an LPC vocoder that, when synthesizing the frame data corresponding to the feature parameters, interpolates the feature parameter indicating the pitch so as to change smoothly between frames and adjusts the frame length of each synthesized frame data accordingly to restore the speech signal is characterized by comprising means for adjusting, when the characteristic coefficients of the vocal tract equivalent filter used in synthesizing the frame data are extracted during feature parameter extraction, the length of the frame data subject to extraction in correspondence with the frame length adjusted at synthesis.

[0017]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 and FIG. 2 are block diagrams showing one embodiment of the LPC vocoder of the present invention. FIG. 1 shows the configuration of the analysis side and FIG. 2 shows the configuration of the synthesis side. First, extraction of the feature parameters on the analysis side is described with reference to FIG. 1.

[0018] On the speech analysis side shown in FIG. 1, the speech input, an analog signal in a band of, for example, 100 Hz to 3 kHz, is sampled by an A/D converter 1 at a sampling frequency of 8 kHz, converted into a digital signal, and buffered in a shift register 2. In this embodiment the speech data is divided into frames with a reference frame length of 22.5 ms (180 sample periods), and the feature parameters are extracted frame by frame. The shift register 2 holds the speech data of the frame under analysis and of the frame immediately before it (hereinafter referred to as the current frame data and the previous frame data, respectively).
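The framing arithmetic in [0018] (8 kHz sampling, 22.5 ms frames, 180 samples per frame) and the role of the shift register can be sketched as follows. This is only an illustration: `frames_with_previous` is a hypothetical helper, and starting with an all-zero previous frame is an assumption.

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 22.5
# 8000 samples/s * 0.0225 s = 180 samples per reference frame
FRAME_LEN = int(SAMPLE_RATE_HZ * FRAME_MS / 1000)


def frames_with_previous(samples, frame_len=FRAME_LEN):
    """Yield (previous_frame, current_frame) pairs, mimicking a buffer
    that always holds the current frame and the one before it.
    Assumption: the first "previous frame" is all zeros.
    """
    previous = [0] * frame_len
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        current = samples[start:start + frame_len]
        yield previous, current
        previous = current
```

Trailing samples that do not fill a whole frame are simply dropped in this sketch.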

[0019] The current frame data is divided by a filter bank 3, composed of digital filters, into band data for a plurality of frequency bands, each of which is input to a speech analysis unit 4 and analyzed. The filter bank 3 of this embodiment divides the current frame data held in the shift register 2 into five band data, the first to the fifth, covering 100 to 800 Hz, 500 to 1000 Hz, 1000 to 1500 Hz, 1500 to 2000 Hz, and 2000 Hz and above, respectively, and outputs them to the speech analysis unit 4. For each of the first to fifth band data, the speech analysis unit 4 calculates the pitch by the AMDF (Average Magnitude Difference Function) method, analyzes the autocorrelation coefficients and the average low-band energy ratio, and determines the voice decision value, that is, voiced sound "1" or unvoiced sound "0".
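As a rough illustration of the AMDF method named in [0019], the sketch below picks the lag that minimizes the average magnitude difference. The lag search range (20 to 120 samples, roughly 400 Hz down to 67 Hz at 8 kHz) and the function names are assumptions; the source gives no search range and does not describe how the autocorrelation and low-band energy feed the voiced/unvoiced decision.

```python
def amdf(frame, lag):
    """Average magnitude difference of `frame` against itself shifted by `lag`."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n


def amdf_pitch(frame, min_lag=20, max_lag=120):
    """Return the lag (in samples) with the smallest AMDF value.

    Assumption: the lag range is illustrative only. On ties the
    smallest lag wins, because min() returns the first minimum.
    """
    return min(range(min_lag, max_lag + 1), key=lambda lag: amdf(frame, lag))
```

For a strictly periodic signal the AMDF is also near zero at multiples of the true period, so a practical detector needs some guard against pitch doubling; this sketch relies only on the tie-breaking order.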

[0020] From the band pitch and band voice decision value of each of these five band data, four of the feature parameters of the current frame data are extracted: the voice decision value indicating voiced sound "1" or unvoiced sound "0"; the pitch coefficient indicating the pitch when the voice decision value is voiced; the mixing coefficient that determines the mixing ratio when unvoiced sound is to be mixed into the voiced sound; and the fluctuation decision value indicating whether fluctuation should be added to the pitch intervals.

[0021] Extraction of these four feature parameters in this embodiment is described below. First, the pitch extraction and voice decision unit 5 compares the pitches of the first and second band data with the pitch of the previous frame data as determined by the pitch extraction and voice decision unit 5. Of the first and second band pitches, the one closer to the pitch of the previous frame data is chosen as the pitch of the current frame data, and the band voice decision value of the band data showing that band pitch is used as the voice decision value of the current frame. If the first and second band data show the same pitch but their band voice decision values differ, the first band voice decision value is used as the voice decision value of the current frame data. In this way the pitch extraction and voice decision unit 5 extracts the pitch and the voice decision value of the current frame data and outputs the extracted voice decision value to the encoding unit 11. When this voice decision value is "1" (voiced sound), it also outputs the extracted pitch to the encoding unit 11 as a pitch coefficient that approximates the periodic component of the vocal cord wave of the frame.
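The selection rule in [0021] can be written as a short Python sketch. The function name and the tuple layout are hypothetical, and the handling of the case where both band pitches are equally distant from the previous pitch (but not equal to each other) is an assumption, since the source does not cover it.

```python
def select_pitch_and_voicing(band1, band2, prev_pitch):
    """Choose the frame pitch and voice decision value from bands 1 and 2.

    band1, band2: (band_pitch, band_voice_decision) tuples.
    prev_pitch:   pitch chosen for the previous frame.
    """
    p1, v1 = band1
    p2, v2 = band2
    if p1 == p2:
        # Same pitch: the first band's voice decision value is used.
        return p1, v1
    # Assumption: on an exact distance tie, prefer the first band.
    if abs(p1 - prev_pitch) <= abs(p2 - prev_pitch):
        return p1, v1
    return p2, v2
```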

[0022] When the voice decision value output by the pitch extraction and voice decision unit 5 is "1", the mixing coefficient calculation unit 6 compares the band pitch of each of the third, fourth, and fifth band data with the pitch coefficient, calculates the sum (0 to 3) of the band voice decision values of those band data whose difference is within a fixed value, and outputs it to the encoding unit 11 as the mixing coefficient. For example, suppose the voice decision value of the current frame data is "1" and the pitch coefficient is 40 (in samples; at a sampling frequency of 8 kHz this is 5 ms, a pitch frequency of 200 Hz), the third, fourth, and fifth band pitches are 42, 45, and 80, and their band voice decision values are "1", "1", and "0", respectively. If the fixed value is 3, the mixing coefficient is 1; if the fixed value is 5, the mixing coefficient is 2.
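The worked example in [0022] can be checked with a few lines of Python. The function name `mixing_coefficient` and its argument layout are illustrative; "within a fixed value" is read here as an inclusive comparison, which is what the example with threshold 5 and band pitch 45 requires.

```python
def mixing_coefficient(pitch_coeff, band_pitches, band_decisions, tolerance):
    """Sum the band voice decision values of bands 3 to 5 whose band pitch
    lies within `tolerance` samples (inclusive) of the pitch coefficient."""
    return sum(
        decision
        for pitch, decision in zip(band_pitches, band_decisions)
        if abs(pitch - pitch_coeff) <= tolerance
    )


# Example from the text: pitch coefficient 40, band pitches 42, 45, 80,
# band voice decision values 1, 1, 0.
print(mixing_coefficient(40, [42, 45, 80], [1, 1, 0], 3))  # 1
print(mixing_coefficient(40, [42, 45, 80], [1, 1, 0], 5))  # 2
```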

[0023] The fluctuation decision unit 7 extracts a fluctuation decision value, indicating that pitch fluctuation is present ("1") or absent ("0"), from the voice decision value output by the pitch extraction and voice decision unit 5, the first and second band pitches, and the current frame data held in the shift register 2. Specifically, the frame is treated as a voiced sound with fluctuation and the fluctuation decision value is set to "1" in two cases: when the voice decision value is "1" and the difference between the first and second band pitches is at least a fixed value, for example 7; and when the voice decision value is "0" and the frame power of the current frame data is at least a fixed value. The latter is an analysis result often seen at the onset of speech. In that case the voice decision value is corrected to "1" together with the fluctuation decision value, and the mixing coefficient is calculated again.
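The two cases in [0023] map onto a small decision function. This is a sketch under assumptions: the function name, the return layout, and in particular the default `power_threshold` value are hypothetical, because the source only says "a fixed value" for the frame power; the pitch difference threshold of 7 is the example given in the text.

```python
def fluctuation_decision(voice_decision, band1_pitch, band2_pitch, frame_power,
                         pitch_diff_threshold=7, power_threshold=1000.0):
    """Return (voice_decision, fluctuation_decision, recompute_mixing).

    power_threshold is a placeholder value, not taken from the source.
    """
    if voice_decision == 1:
        fluct = 1 if abs(band1_pitch - band2_pitch) >= pitch_diff_threshold else 0
        return 1, fluct, False
    if frame_power >= power_threshold:
        # Typical of speech onset: treat as voiced with fluctuation and
        # signal that the mixing coefficient must be calculated again.
        return 1, 1, True
    return 0, 0, False
```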

[0024] The above is one example of extracting the feature parameters, in particular the mixing coefficient and the fluctuation decision value, and the present invention is not limited to this example. For instance, the sampling frequency of the A/D converter 1 may be set to a suitable value according to the band of the input speech, and the reference frame length may be set to a suitable value in view of coding efficiency and the quality of the synthesized sound. Also, although this example divides the current frame data into five bands for analysis, it may instead be divided into, for example, two bands. In that case the band voice decision value and band pitch of the lower band are used as the voice decision value and pitch coefficient of the current frame data; a value such as three times the band voice decision value of the higher band, when its band pitch is close to the pitch coefficient, is used as the mixing coefficient; and the fluctuation decision value is set to "1" only when the voice decision value is "0" and the frame power is at least a fixed value. Likewise, the method of calculating the band pitch and the method of discriminating voiced from unvoiced sound are not limited to those of the above example, and other suitable known techniques can be applied.

[0025] Next, extraction of the other feature parameters of the current frame data, the line spectrum pair (LSP) coefficients and the frame power, is described. In this embodiment, before these feature parameters are extracted, an analysis window adjustment unit 8 adjusts the frame length subject to analysis (the analysis window length) on the premise that the pitch is interpolated between frames on the synthesis side, so that the output speech can be synthesized more faithfully. Adjustment of the frame length in the analysis window adjustment unit 8 is described below.

[0026] First, consider the case where the current frame data is a voiced sound, that is, where the voice decision value output by the pitch extraction and voice decision unit 5 is "1" or the voice decision value has been corrected to "1" by the fluctuation decision unit 7. The pitch intervals are first interpolated, by the same method as the inter-frame pitch interpolation on the synthesis side, so that each pitch interval of the current frame changes continuously from the pitch interval of the previous frame. Specifically, the current extended frame length is the reference frame length plus the number of residual samples (described later) left over when the previous analysis window (the corrected frame) was determined. Let Ppst be the final pitch interval of the previous analysis window (in samples) and Pref be the pitch coefficient determined by the pitch extraction and voice decision unit 5. The provisional pitch count Ntp of the current frame data is the integer value of (extended frame length) / (average of Ppst and Pref). The pitch intervals of the current frame are then linearly interpolated in order as Ppst + δ, Ppst + 2δ, and so on, for example so that each pitch interval increases or decreases by the common difference δ = (Pref - Ppst) / Ntp.
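The interpolation in [0026] can be sketched as follows. Treating "the integer value" as truncation toward zero and leaving the interpolated intervals as fractional sample counts are both assumptions; the function name is hypothetical.

```python
def interpolate_pitch_intervals(extended_len, p_pst, p_ref):
    """Return (ntp, delta, intervals) for one voiced frame.

    extended_len: reference frame length plus residual samples.
    p_pst:        final pitch interval of the previous window (samples).
    p_ref:        pitch coefficient of the current frame (samples).
    Assumption: the integer value of Ntp is obtained by truncation.
    """
    ntp = int(extended_len / ((p_pst + p_ref) / 2))
    if ntp == 0:
        return 0, 0.0, []
    delta = (p_ref - p_pst) / ntp
    intervals = [p_pst + k * delta for k in range(1, ntp + 1)]
    return ntp, delta, intervals


# Extended frame of 180 samples, pitch moving from 40 to 50 samples:
# Ntp = int(180 / 45) = 4, delta = 2.5
print(interpolate_pitch_intervals(180, 40, 50))
# (4, 2.5, [42.5, 45.0, 47.5, 50.0])
```

Note that the four intervals in this example sum to 185 samples, more than the 180-sample extended frame, which is why the window length still has to be determined by accumulating the intervals.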

[0027] Further, when the fluctuation decision value of the fluctuation decision unit 7 is "1" (fluctuation present), each of the linearly interpolated pitch intervals is multiplied by a value obtained from a random number R by a fixed formula, for example 0.75 + R (where R takes a value from 0 to 0.5), to add fluctuation. The pitch intervals obtained in this way are accumulated in order. The number of pitches that gives the largest accumulated value not exceeding the current extended frame length is the analysis pitch count of the current frame, and that largest accumulated value is the current analysis window length. Using this analysis window length, the samples that follow the last sample of the previous analysis window data, for the length of the current analysis window, are taken from the previous frame data and the current frame data held in the shift register 2 as the current analysis window data to be analyzed. The difference between the current extended frame length and the current analysis window length, that is, the number of samples of the current frame data that were not analyzed, becomes the residual sample count, which is added to the reference frame length to calculate the next extended frame length. When the voice decision value of the current frame data is unvoiced ("0"), the residual samples together with the current frame data form the current analysis window data. In other words, for consecutive unvoiced frames the residual sample count is 0 and the current analysis window data coincides with the current frame data.
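The window length rule in [0027] can be sketched in Python. The function names are hypothetical, the example formula 0.75 + R from the text is used for the fluctuation, and truncating the accumulated length to a whole number of samples is an assumption, since the source does not say how fractional intervals are rounded.

```python
import random


def analysis_window(extended_len, intervals, fluctuating, rng=None):
    """Return (pitch_count, window_len, residual) for a voiced frame.

    intervals:   linearly interpolated pitch intervals (samples).
    fluctuating: True when the fluctuation decision value is 1.
    """
    rng = rng or random.Random(0)
    total = 0.0
    count = 0
    for interval in intervals:
        if fluctuating:
            # Multiply by 0.75 + R, with R drawn from 0 to 0.5.
            interval *= 0.75 + rng.uniform(0.0, 0.5)
        if total + interval > extended_len:
            break
        total += interval
        count += 1
    window_len = int(total)  # assumption: truncate to whole samples
    return count, window_len, extended_len - window_len


def unvoiced_window(residual, frame_len=180):
    """Unvoiced frame: the window is the residual plus the whole frame,
    so nothing is left over for the next frame."""
    return residual + frame_len, 0
```

With intervals of 42.5, 45, 47.5, and 50 samples and an extended frame of 180 samples, three pitches fit (135 samples) and 45 samples carry over to the next frame.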

[0028] The correction of the analysis window length in the analysis window adjustment unit 8 and the identification of the analysis window data to be analyzed have been described above, but the present invention is not limited to this. For example, the shift register 2 may store the current frame data together with the frame data before and after it, each pitch interval may be interpolated so that the pitch changes smoothly across the three consecutive frames, the analysis window length may be corrected and the analysis window data identified in the same manner, and the synthesis side may interpolate each pitch interval by the same method.

[0029] The line spectrum pair coefficient analysis unit 9 calculates the line spectrum pair coefficients for the speech waveform of the analysis window data identified in this way, and outputs them to the encoding unit 11 as feature parameters that specify the vocal tract equivalent filter used at synthesis. In this embodiment, PARCOR (partial autocorrelation) coefficients are also calculated at this time and a correction is applied so that the vocal tract equivalent filter does not diverge. The gain calculation unit 10 divides the same analysis window data into two equal parts, a first half and a second half, calculates the RMS (Root Mean Square) value of each, and outputs them to the encoding unit 11 as feature parameters that specify the frame power. Extraction of the feature parameters that specify the vocal tract equivalent filter is not limited to line spectrum pair analysis; other methods such as cepstrum analysis may be used. Line spectrum pair coefficients, however, have the advantage that inter-frame interpolation in the synthesis unit described later is easy.
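The gain calculation in [0029], one RMS value for each half of the analysis window, is simple enough to sketch directly. The function name is hypothetical, and giving the extra sample to the second half when the window length is odd is an assumption.

```python
import math


def half_window_rms(window):
    """Return (rms_first_half, rms_second_half) of the analysis window.

    Assumption: for an odd number of samples the second half is one
    sample longer.
    """
    half = len(window) // 2

    def rms(segment):
        if not segment:
            return 0.0
        return math.sqrt(sum(x * x for x in segment) / len(segment))

    return rms(window[:half]), rms(window[half:])
```

Sending two values per window lets the synthesis side follow a level change inside a frame, for example at a speech onset.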

[0030] The encoding unit 11 encodes the pitch coefficient, frame power, line spectrum pair coefficients, fluctuation decision value, voice decision value, and mixing coefficient, analyzed and extracted as described above, as the feature parameters of each frame data and transmits them to the synthesis side.

[0031] Next, one embodiment of the synthesis side is described with reference to the block diagram of FIG. 2. The encoded data transmitted from the analysis side is decoded by a decoding unit 21 into the feature parameters of each frame data, that is, the pitch coefficient, frame power, line spectrum pair coefficients, fluctuation decision value, voice decision value, and mixing coefficient.

[0032] First, speech synthesis when the voice determination value is "1" (voiced sound) is described. The pitch coefficient is first input to the pitch interpolation unit 22, which interpolates the pitch intervals so that they change continuously between frames, by the same method as the analysis window adjustment unit 8 of FIG. 1. Specifically, in this embodiment the extended frame calculation unit 25 adds the number of residual samples left over from the calculation of the previous synthesis window to the reference frame length (22.5 ms = 180 samples in this embodiment) to obtain the extended frame length of the current frame. From this extended frame length, the final pitch interval Ppst of the previous synthesis window, and the pitch coefficient Pref of the current frame data, the pitch interpolation unit 22 calculates the provisional pitch count Ntp = 2 × (extended frame length) / (Ppst + Pref) and the pitch interval increment δ = (Pref − Ppst) / Ntp, and linearly interpolates the pitch intervals of the frame in order as Ppst + δ, Ppst + 2δ, and so on.
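The interpolation in paragraph [0032] can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the intervals are kept as floating-point values (the text does not say how they are rounded to samples), and generating ceil(Ntp) candidate intervals is an assumption, since the text leaves the final count to the synthesis window calculation.

```python
import math


def interpolate_pitch_intervals(p_pst, p_ref, ext_frame_len):
    """Return candidate pitch intervals Ppst + d, Ppst + 2d, ...

    p_pst: final pitch interval of the previous synthesis window (samples)
    p_ref: pitch coefficient of the current frame (samples)
    ext_frame_len: reference frame length plus residual samples
    """
    # Provisional pitch count: Ntp = 2 * extended frame length / (Ppst + Pref)
    n_tp = 2.0 * ext_frame_len / (p_pst + p_ref)
    # Pitch interval increment: delta = (Pref - Ppst) / Ntp
    delta = (p_ref - p_pst) / n_tp
    # Assumption: produce ceil(Ntp) candidates; a later step decides how
    # many of them actually fit in the extended frame.
    count = math.ceil(n_tp)
    return [p_pst + k * delta for k in range(1, count + 1)]


if __name__ == "__main__":
    print(interpolate_pitch_intervals(40, 50, 180))
```

With Ppst = 40, Pref = 50 and an extended frame length of 180, Ntp is 4 and δ is 2.5, so the intervals ramp from 42.5 up to 50.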

[0033] Next, when the fluctuation determination value is "1", the fluctuation adjustment unit 23 multiplies each pitch interval by 0.75 + R (where R is a random number from 0 to 0.5) to add fluctuation. The synthesis window calculation unit 24 accumulates the pitch intervals calculated in this way in order, takes as the synthesis pitch count of the frame the number of pitches that gives the largest accumulated value not exceeding the extended frame length, and takes that largest value as the synthesis window length. The value obtained by subtracting the synthesis window length from the extended frame length is output to the extended frame calculation unit 25 as the new number of residual samples.
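A minimal Python sketch of the fluctuation step and the synthesis window calculation described in paragraph [0033] follows. The function names and the injectable `rng` argument are hypothetical, and a uniform distribution for R is an assumption; the text only says R is a random number from 0 to 0.5.

```python
import random


def apply_fluctuation(intervals, fluctuation_flag, rng=random):
    """Scale each pitch interval by 0.75 + R, with R drawn from [0, 0.5]."""
    if fluctuation_flag != 1:
        return list(intervals)
    # Assumption: R is uniformly distributed.
    return [p * (0.75 + rng.uniform(0.0, 0.5)) for p in intervals]


def synthesis_window(intervals, ext_frame_len):
    """Accumulate intervals while the running sum fits the extended frame.

    Returns (intervals used, synthesis window length, new residual samples).
    """
    total = 0.0
    used = []
    for p in intervals:
        if total + p > ext_frame_len:
            break
        total += p
        used.append(p)
    residual = ext_frame_len - total
    return used, total, residual


if __name__ == "__main__":
    jittered = apply_fluctuation([42.5, 45.0, 47.5, 50.0], 1, random.Random(0))
    print(synthesis_window(jittered, 180))
```

The residual returned here is what the extended frame calculation unit would add to the reference frame length for the next frame.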

[0034] Meanwhile, the first power interpolation unit 26 interpolates and distributes the first-half and second-half values of the frame power output from the decoding unit 21 over the pitch intervals of the synthesis window, calculating the power value to be assigned to each pitch interval so that it is continuous between frames and between the first and second halves within the frame. For example, when the synthesis window consists of a single pitch, the average of the first-half and second-half values of the frame power is used as the power of that pitch. When the number of pitches in the synthesis window is even, the first-half value is distributed to the pitches of the first half and the second-half value to the pitches of the second half, with linear interpolation in the same manner as for the pitch intervals and with reference to the power of the final pitch of the previous synthesis window. When the synthesis window consists of an odd number of pitches, the pitch counts of the two halves are set so that the first half has one fewer pitch than the second half, and the first-half and second-half values of the frame power are distributed to these pitches in the same way as in the even case. This is because there is less unnaturalness when the power of the first half is larger.
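One possible reading of the power distribution in paragraph [0034] is sketched below in Python. The text fixes the single-pitch case and the even/odd split but not the exact interpolation weights, so the ramp used here (from the previous window's final power to the first-half value, then on to the second-half value) is an assumption, as is the function name.

```python
def distribute_power(prev_power, first_half, second_half, n_pitches):
    """Assign a power value to each pitch interval of a synthesis window."""
    if n_pitches == 1:
        # Single pitch: average of the first-half and second-half values.
        return [(first_half + second_half) / 2.0]
    # Odd counts: the first half gets one fewer pitch than the second half.
    n_first = n_pitches // 2
    n_second = n_pitches - n_first
    powers = []
    # Assumed ramp: previous final power -> first-half value.
    for k in range(1, n_first + 1):
        powers.append(prev_power + (first_half - prev_power) * k / n_first)
    # Assumed ramp: first-half value -> second-half value.
    for k in range(1, n_second + 1):
        powers.append(first_half + (second_half - first_half) * k / n_second)
    return powers


if __name__ == "__main__":
    print(distribute_power(2.0, 4.0, 8.0, 4))
    print(distribute_power(2.0, 4.0, 8.0, 3))
```

Other weightings that keep the values continuous across frame and half-frame boundaries would fit the description equally well.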

[0035] In the present invention, more natural speech synthesis is achieved by adding a certain amount of noise to the voiced excitation signal in the synthesis window, in accordance with the mixing coefficient obtained by the analysis unit. For this purpose, the first noise generation unit 27 sets a noise value derived from a random number for each sample of each pitch interval of the synthesis window, and this is filtered by the high-pass filter 28, which has the characteristic 1 − bz⁻¹, to produce the noise signal for each pitch interval of the synthesis window.

[0036] The pulse generation unit 29 generates, as excitation pulses for the synthesized speech, one pulse for each pitch interval calculated by the synthesis window calculation unit 24, with an amplitude corresponding to the power value interpolated for each pitch by the first power interpolation unit 26 multiplied by the mixing coefficient decoded by the decoding unit 21. For example, in this embodiment, with the mixing coefficient denoted M (0 to 3), the power value multiplied by 0.6 + 0.1M is used as the amplitude of the excitation pulse. This amplitude is set at the last sample of each pitch interval, the result is filtered by the low-pass filter 30, which has the characteristic 1 + az⁻¹, and it is superimposed on the noise signal output from the high-pass filter 28 to obtain the excitation signal for each pitch.
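The construction of one pitch interval of voiced excitation in paragraphs [0035] and [0036] can be sketched in Python as below. Several things are assumptions because the text does not specify them: the filter constants `a` and `b`, the noise amplitude (`noise_level`), the uniform noise distribution, and resetting the filter state at each pitch interval. The function names are hypothetical.

```python
import random


def first_order_fir(x, c):
    """Apply y[n] = x[n] + c * x[n-1] with zero initial state."""
    y = []
    prev = 0.0
    for s in x:
        y.append(s + c * prev)
        prev = s
    return y


def voiced_excitation(length, power, m, a=0.5, b=0.5,
                      noise_level=0.1, rng=random):
    """Build the excitation for one pitch interval of `length` samples."""
    # One pulse at the last sample, amplitude = power * (0.6 + 0.1 * M).
    pulse = [0.0] * length
    pulse[-1] = power * (0.6 + 0.1 * m)
    low = first_order_fir(pulse, a)        # low-pass: 1 + a z^-1
    # Assumed noise scaling; the source only says "a certain amount".
    noise = [noise_level * power * rng.uniform(-1.0, 1.0)
             for _ in range(length)]
    high = first_order_fir(noise, -b)      # high-pass: 1 - b z^-1
    return [p + n for p, n in zip(low, high)]


if __name__ == "__main__":
    print(voiced_excitation(8, 10.0, 2, rng=random.Random(0)))
```

Because the pulse sits on the last sample and the filter state is reset per interval in this sketch, the low-pass tail that would spill into the next interval is dropped; a fuller implementation would carry the state across intervals.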

[0037] A secondary pulse is added to this excitation signal to improve the clarity and naturalness of the synthesized sound. To this end, the secondary pulse adding unit 31 adds a secondary pulse to the excitation signal of each pitch. In this embodiment the secondary pulse is added by, for example, passing the signal through a 20th-order FIR filter, but other suitable means may be used. The excitation signal with the secondary pulse added is then normalized by the power normalization unit 32 so that each pitch interval has the power allotted to it by the first power interpolation unit 26. In this way, an excitation signal for each pitch is obtained for a voiced synthesis window.
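The power normalization step of paragraph [0037] might look like the following Python sketch. It assumes that "power" means the RMS value, as on the analysis side where the frame power is specified by RMS values; the function name is hypothetical, and the FIR coefficients of the secondary pulse filter are not given in the text, so that stage is omitted.

```python
import math


def normalize_power(signal, target_rms):
    """Scale one pitch interval so that its RMS equals target_rms."""
    energy = sum(s * s for s in signal)
    if energy == 0.0:
        # An all-zero interval cannot be rescaled; return it unchanged.
        return list(signal)
    rms = math.sqrt(energy / len(signal))
    gain = target_rms / rms
    return [s * gain for s in signal]


if __name__ == "__main__":
    print(normalize_power([3.0, -4.0, 0.0, 0.0], 1.0))
```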

[0038] Next, generation of the excitation signal for a frame whose voice determination value output from the decoding unit 21 is "0" (unvoiced sound) is described. When the voice determination value is "0", the subframe division unit 33 sets the synthesis window as four subframes regardless of the pitch coefficient. Each subframe length is basically 45 samples, obtained by dividing the reference frame length (180 samples in this embodiment) into four equal parts. However, when the previous synthesis window was voiced and residual samples remain in the extended frame calculation unit 25, this residual sample count is added to the last subframe length and the residual sample count in the extended frame calculation unit 25 is cleared. The reason is the same as for making the second half of an odd-pitch voiced synthesis window contain more pitches, as described above.
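The subframe division of paragraph [0038] reduces to a few lines; the Python sketch below uses a hypothetical function name and returns the cleared residual count alongside the subframe lengths.

```python
def unvoiced_subframe_lengths(residual_samples, frame_len=180, n_sub=4):
    """Split an unvoiced synthesis window into subframes.

    Residual samples left by a preceding voiced window are appended to
    the last subframe, and the residual count is cleared.
    """
    base = frame_len // n_sub          # 45 samples in the embodiment
    lengths = [base] * n_sub
    lengths[-1] += residual_samples
    new_residual = 0
    return lengths, new_residual


if __name__ == "__main__":
    print(unvoiced_subframe_lengths(7))
```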

[0039] The second power interpolation unit 34 distributes the first-half value of the frame power of the synthesis window, output from the decoding unit 21, to the first two subframes and the second-half value to the last two subframes, interpolating so that the power value changes continuously between subframes. Based on these power values, the second noise generation unit 35 sets white noise for each subframe using random numbers. In this embodiment, the white noise is shaped by the shaping filter 36, a 20th-order FIR filter similar to that of the secondary pulse adding unit 31, and is output as the excitation signal for each subframe of the unvoiced synthesis window. The excitation signal obtained in this way is output pitch by pitch from the secondary pulse adding unit 31 for voiced synthesis windows and subframe by subframe from the shaping filter 36 for unvoiced synthesis windows, and is input sequentially to the synthesis filter 37, which has a transfer characteristic equivalent to the vocal tract.

[0040] The line spectrum pair interpolation unit 38 interpolates the line spectrum pair coefficients of each frame, output from the decoding unit 21, for each pitch interval or each subframe so that they change continuously between synthesis windows. The α parameter conversion unit 39 converts them into parameters for the synthesis filter 37, defining its transfer characteristic for each pitch or each subframe. In this embodiment, as with the first power interpolation unit 26, the line spectrum pair coefficients of the frame data are linearly interpolated so that they change continuously, per pitch in a voiced synthesis window and per subframe in an unvoiced synthesis window, and are then converted into linear prediction parameters (α parameters) for each pitch or subframe. The characteristic values of the synthesis filter 37, a speech synthesis filter based on linear prediction (LPC), are set for each pitch or subframe in synchronization with the sequentially input excitation signal, and the speech data is thereby synthesized. The output of the synthesis filter 37 is converted to an analog signal by the D/A converter 40 and output as synthesized speech.
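To illustrate how a synthesis filter of this kind can have its coefficients switched per pitch or subframe while the signal stays continuous, here is a minimal Python sketch of an all-pole LPC filter. The sign convention A(z) = 1 + Σ αᵢ z⁻ⁱ, the function name, and the way the filter memory is passed between calls are assumptions; the text does not give the filter equation, and the conversion from line spectrum pair coefficients to α parameters is not shown.

```python
def lpc_synthesize(excitation, alphas, history):
    """Run s[n] = e[n] - sum_i alphas[i-1] * s[n-i] over one unit.

    excitation: excitation samples for one pitch interval or subframe
    alphas: linear prediction (alpha) parameters for this unit
    history: the last len(alphas) output samples, most recent last
    Returns (output samples, updated history for the next unit).
    """
    order = len(alphas)
    hist = list(history)
    out = []
    for e in excitation:
        s = e - sum(a * hist[-i] for i, a in enumerate(alphas, start=1))
        out.append(s)
        hist.append(s)
        hist = hist[-order:]
    return out, hist


if __name__ == "__main__":
    # Two consecutive units with different coefficients share filter memory.
    out1, mem = lpc_synthesize([1.0, 0.0, 0.0], [-0.5], [0.0])
    out2, mem = lpc_synthesize([0.0, 0.0], [-0.25], mem)
    print(out1, out2)
```

Carrying `history` from one call to the next is what lets the coefficients change at unit boundaries without resetting the filter output.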

[0041] The LPC vocoder of this embodiment is configured as described above. By mixing a noise component into the voiced excitation signal and adding a secondary pulse and fluctuation, it can reproduce high-quality synthesized speech that is clear and more natural. In addition, by lengthening or shortening the frame length under analysis when extracting the feature parameters, on the premise that the pitch will be interpolated in the synthesis unit, it can reproduce synthesized speech with better reproduction fidelity.

[0042] Although this embodiment combines mixing of a noise component into the voiced excitation signal, addition of a secondary pulse, addition of fluctuation, and extraction of the vocal tract equivalent filter coefficients and the frame power on the premise of pitch interpolation at synthesis, the present invention is not limited to this. These means may be used individually or in any combination as required.

【００４３】[0043]

[Effects of the Invention] As described above, according to the LPC vocoder of the present invention, mixing a noise component into the voiced excitation signal and adding a secondary pulse and fluctuation makes it possible to reproduce high-quality synthesized speech that is clear and more natural. In addition, by lengthening or shortening the frame length under analysis when extracting the feature parameters, on the premise that the pitch will be interpolated in the synthesis unit, synthesized speech with better reproducibility can be obtained.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment of the analysis-side configuration of the LPC vocoder of the present invention.

FIG. 2 is a block diagram showing an embodiment of the synthesis-side configuration of the LPC vocoder of the present invention.

FIG. 3 is a block diagram showing an example of the configuration of a conventional LPC vocoder.

[Explanation of symbols]

1 A/D converter
2 shift register
3 filter bank
4 speech analysis unit
5 pitch extraction / voice determination unit
6 mixing coefficient calculation unit
7 fluctuation determination unit
8 analysis window adjustment unit
9 line spectrum pair coefficient analysis unit
10 gain calculation unit
11 encoding unit
21 decoding unit
22 pitch interpolation unit
23 fluctuation adjustment unit
24 synthesis window calculation unit
25 extended frame calculation unit
26 first power interpolation unit
27 first noise generation unit
28 high-pass filter
29 pulse generation unit
30 low-pass filter
31 secondary pulse adding unit
32 power normalization unit
33 subframe division unit
34 second power interpolation unit
35 second noise generation unit
36 shaping filter
37 synthesis filter
38 line spectrum pair interpolation unit
39 α parameter conversion unit
40 D/A converter
50 analysis unit
51 analysis processing unit
52 multiplexer
53-1 to 53-4 encoders
54 transmission line
60 synthesis unit
61 demultiplexer
62-1 to 62-4 decoders
63 interpolation processing unit
64 synthesis processing unit

Claims

1. An LPC vocoder that divides a speech signal into frame data of a predetermined time length, extracts feature parameters including at least a voiced/unvoiced determination, power, characteristic coefficients of a vocal tract equivalent filter used when synthesizing the frame data, and, in the case of voiced sound, a pitch, and restores the speech signal by synthesizing each frame data in accordance with these feature parameters, the LPC vocoder comprising means for adding a secondary pulse to an excitation signal generated when synthesizing voiced frame data.

2. An LPC vocoder that divides a speech signal into frame data of a predetermined time length, extracts feature parameters including at least a voiced/unvoiced determination, power, characteristic coefficients of a vocal tract equivalent filter used when synthesizing the frame data, and, in the case of voiced sound, a pitch, and restores the speech signal by synthesizing each frame data in accordance with these feature parameters, the LPC vocoder comprising: means for additionally extracting, when extracting the feature parameters of voiced frame data, a mixing coefficient indicating the degree of mixing of a noise component; and means for adding a noise component to the generated excitation signal.

3. An LPC vocoder that divides a speech signal into frame data of a predetermined time length, extracts feature parameters including at least a voiced/unvoiced determination, power, characteristic coefficients of a vocal tract equivalent filter used when synthesizing the frame data, and, in the case of voiced sound, a pitch, and restores the speech signal by synthesizing each frame data in accordance with these feature parameters, the LPC vocoder comprising: means for additionally extracting, when extracting the feature parameters of voiced frame data, a fluctuation determination value indicating the degree of pitch fluctuation; and means for adding, in accordance with the fluctuation determination value, a fluctuation component to the pitch intervals of the excitation signal generated at the time of synthesis.

4. An LPC vocoder that divides a speech signal into frame data of a predetermined time length, extracts feature parameters including at least a voiced/unvoiced determination, power, characteristic coefficients of a vocal tract equivalent filter used when synthesizing the frame data, and, in the case of voiced sound, a pitch, and, when synthesizing each frame data in accordance with these feature parameters, interpolates the feature parameter indicating the pitch so that it changes smoothly between frames, adjusts the frame length of each frame data to be synthesized in accordance with the interpolation, and restores the speech signal, the LPC vocoder comprising means for adjusting, when extracting the power during extraction of the feature parameters, the data length of the frame to be extracted in accordance with the frame length adjusted at the time of synthesis.

5. An LPC vocoder that divides a speech signal into frame data of a fixed time length, extracts feature parameters including at least a voiced/unvoiced determination, power, characteristic coefficients of a vocal tract equivalent filter used when synthesizing the frame data, and, in the case of voiced sound, a pitch, and, when synthesizing each frame data in accordance with these feature parameters, interpolates the feature parameter indicating the pitch so that it changes smoothly between frames, adjusts the frame length of each frame data to be synthesized in accordance with the interpolation, and restores the speech signal, the LPC vocoder comprising means for adjusting, when extracting the characteristic coefficients of the vocal tract equivalent filter during extraction of the feature parameters, the data length of the frame to be extracted in accordance with the frame length adjusted at the time of synthesis.

6. The LPC vocoder according to claim 2, wherein the mixing coefficient is extracted by dividing the frame data to be extracted into a plurality of band data and comparing the pitches and voiced/unvoiced determinations obtained by analyzing the respective band data.

7. The LPC vocoder according to claim 6, wherein the plurality of band data are five band data; whether the frame data to be extracted is voiced or unvoiced is determined from the pitches and voiced/unvoiced determinations obtained by analyzing the band data of the two lower bands; and, when the frame data to be extracted is determined to be voiced, the mixing coefficient is extracted from the pitches and voiced/unvoiced determinations obtained by analyzing the band data of the three higher bands of the five band data.

8. The LPC vocoder according to claim 3, wherein the fluctuation determination value is calculated by dividing the frame data to be extracted into a plurality of band data and comparing the pitches obtained by analyzing the respective band data.

9. The LPC vocoder according to claim 3, wherein the fluctuation determination value is calculated by determining that the frame data to be extracted is a voiced sound with fluctuation when the frame data has the characteristics of unvoiced sound and its frame power is equal to or greater than a predetermined value.