JP3342310B2

JP3342310B2 - Audio decoding device

Info

Publication number: JP3342310B2
Application number: JP23238896A
Authority: JP
Inventors: 延佳海木
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1996-09-02
Filing date: 1996-09-02
Publication date: 2002-11-05
Anticipated expiration: 2016-09-02
Also published as: JPH1074095A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech encoding device that analyzes, compresses, and encodes input speech, and to a speech decoding device that decodes and synthesizes speech from the encoded data.

[0002]

[Prior Art] In speech encoding/decoding devices, the compression ratio of the encoding is generally a major concern. When encoded speech is transmitted, a higher compression ratio allows more speech data to be sent over the same transmission capacity. Likewise, when encoded data is stored in a storage device and speech is reproduced from the stored data, a higher compression ratio allows more speech data to be held in a storage device of the same capacity.

[0003] In typical high-efficiency speech coding, the speech production process is taken into account and speech is separated into spectral feature parameters and sound source parameters before encoding. Among the sound source parameters, pitch is an important parameter that characterizes the height of a sound, and it is widely used as a parameter in speech coding.

[0004] The dynamic range of the pitch frequency of speech, including zero, is roughly 50 Hz to 250 Hz for typical male speakers and roughly 100 Hz to 400 Hz for typical female speakers. Encoding the pitch therefore requires compression-encoding a value with a dynamic range of roughly 50 to 400, including zero.

[0005] Conventionally, pitch has been compression-encoded either by (1) exploiting human auditory characteristics and quantizing the logarithm of the pitch frequency, or by (2) encoding pitch differences. Given the human speech production mechanism, pitch does not become discontinuous in time. Consequently, when speech feature parameters are compression-encoded at fixed intervals in a time series and the analysis interval is sufficiently short, the dynamic range of the difference values is smaller than that of the pitch frequency itself. Method (2) exploits this fact.
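The two conventional approaches can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the 6-bit code width, the 50 to 400 Hz limits (taken from the range quoted in [0004]), the function names, and the synthetic sinusoidal contour. The patent does not specify any of these.

```python
import math

def log_quantize(f0_hz, lo=50.0, hi=400.0, bits=6):
    """Quantize a pitch frequency on a log axis; code 0 is reserved for f0 == 0."""
    if f0_hz <= 0:
        return 0
    levels = (1 << bits) - 1                 # codes 1..levels carry voiced pitch
    f = min(max(f0_hz, lo), hi)              # clamp into the assumed range
    pos = math.log(f / lo) / math.log(hi / lo)   # 0.0 .. 1.0 on the log axis
    return 1 + round(pos * (levels - 1))

def log_dequantize(code, lo=50.0, hi=400.0, bits=6):
    """Inverse of log_quantize (up to quantization error)."""
    if code == 0:
        return 0.0
    levels = (1 << bits) - 1
    pos = (code - 1) / (levels - 1)
    return lo * (hi / lo) ** pos

def difference_range(track, step):
    """Largest absolute difference between samples taken every `step` frames."""
    sampled = track[::step]
    return max(abs(b - a) for a, b in zip(sampled, sampled[1:]))

# Synthetic, smoothly varying pitch contour (hypothetical data).
track = [120.0 + 40.0 * math.sin(2 * math.pi * n / 100) for n in range(200)]
short_interval = difference_range(track, 1)    # small differences
long_interval = difference_range(track, 20)    # much larger differences
```

With the short interval the differences span only a few hertz, while with the long interval they approach the range of the contour itself, which is the limitation that [0007] goes on to describe.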

[0006] When encoded speech data is stored in a storage device and reproduced, a speech synthesis method based on editing synthesis is also known as a way to reduce the required storage capacity: portions of the content to be uttered are recorded separately and then edited and concatenated to produce synthesized speech.

[0007]

[Problems to Be Solved by the Invention] Current pitch compression encoding relies either on quantization on a logarithmic axis based on auditory characteristics or on reducing the dynamic range by encoding differences. When the interval at which the speech waveform is analyzed and compressed is sufficiently short, the dynamic range of the pitch differences is smaller than that of the pitch itself, so differential processing can raise the compression ratio of the pitch. When the analysis interval is made longer, however, the dynamic range of the differences does not shrink much, and the compression ratio cannot be raised.

[0008] An object of the present invention is therefore to solve this problem of the prior art and to further raise the compression ratio of compression encoding. In addition, when encoded speech data stored in a storage device is edited to synthesize speech, the pitch does not connect continuously between the recorded phoneme segments, so the height of the sound changes abruptly, producing abnormal sounds or an unnatural impression. Another object of the present invention is to eliminate this pitch discontinuity in an editing synthesis method that edits compression-encoded phoneme segments and outputs synthesized speech.

[0009]

[Means for Solving the Problems] These objects are achieved by modeling the pitch pattern and compression-encoding the speech with the model parameters, or by additionally including information on the difference between the actual pitch pattern and the pitch pattern model. The modeling uses a model grounded in the speech production mechanism and compresses the pitch information by modeling the pitch of whole sections together, such as accent phrase units, phrase units, and phoneme units.

[0010] The pitch of the input speech is analyzed and extracted at each analysis interval, and the extracted pitch is modeled as the sum of accent, phrase, and phoneme patterns. Compression-encoded speech code data is generated by transmitting these patterns together with the error between the pitch of the actual speech and the modeled pitch. On decoding, the pitch is generated by adding the accent, phrase, and phoneme patterns obtained from the encoded speech data and the error between the modeled pitch and the pitch of the actual speech.
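The additive scheme of [0010] can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the three component patterns are assumed to be already aligned frame by frame, and integer values are used so that the round trip is exact.

```python
def encode_pitch(pitch, accent, phrase, phoneme):
    """Return the per-frame error between the real pitch and the modeled pitch.

    The modeled pitch is the sum of the accent, phrase, and phoneme patterns.
    In a real codec the patterns would be sent as codebook indices; here they
    are plain lists (an assumption made for brevity).
    """
    model = [a + p + s for a, p, s in zip(accent, phrase, phoneme)]
    return [f - m for f, m in zip(pitch, model)]

def decode_pitch(accent, phrase, phoneme, error):
    """Rebuild the pitch by adding the three patterns and the error."""
    return [a + p + s + e for a, p, s, e in zip(accent, phrase, phoneme, error)]

# Hypothetical five-frame example.
accent = [0, 20, 40, 20, 0]
phrase = [120, 118, 116, 114, 112]
phoneme = [0, -2, 0, 3, 0]
pitch = [121, 135, 157, 136, 113]

error = encode_pitch(pitch, accent, phrase, phoneme)
restored = decode_pitch(accent, phrase, phoneme, error)
```

When the patterns fit well, the error values are small compared with the pitch itself, which is what makes them cheaper to encode.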

[0011] The problem that the pitch becomes discontinuous between phonemes is thought to arise because the pitch pattern is not modeled. Using a model grounded in the speech production mechanism for pitch control therefore removes the pitch discontinuity. Removing the discontinuity resolves the problem of the pitch changing abruptly, producing abnormal sounds or an unnatural impression, and enables more natural speech synthesis.

[0012] Furthermore, model parameter control means is provided that modifies the model parameters of the modeled and compressed pitch information on the basis of the pitch model. Even with a pitch model, the pitch can become too large, for example when large accent components occur in succession. By modifying the model parameters, the pitch can be kept from becoming too high or too low.

[0013] That is, the speech encoding device of the present invention comprises: sound source extraction means for analyzing natural speech and extracting the pitch pattern of a predetermined section; storage means storing a plurality of pitch pattern models for the predetermined section; selection means for selecting the pitch pattern model that best matches the pitch pattern extracted by the sound source extraction means; and means for encoding the pitch pattern of the predetermined section of the natural speech using the selected pitch pattern model. The encoding can be performed with a code corresponding to the pitch pattern model.

[0014] The predetermined section may be a section in accent component units; sections in accent component units and in phrase component units; or sections in accent component units, phrase component units, and phoneme component units. The device may also comprise pitch error extraction means for extracting the pitch error between the actual pitch pattern and the pitch pattern model, and may further comprise pitch error time-series encoding means for encoding the pitch error extracted by the pitch error extraction means.

[0015] The speech decoding device according to the present invention comprises: storage means storing a plurality of pitch pattern models for a predetermined section of natural speech; and pitch generation means for decoding, using the pitch pattern models, data encoded in association with the pitch pattern models. When the predetermined section is a section in accent component units, the pitch generation means comprises accent pattern generation means.

[0016] When the predetermined sections are sections in accent component units and in phrase component units, the pitch generation means comprises accent pattern generation means, phrase pattern generation means, and addition means for adding the accent pattern generated by the accent pattern generation means and the phrase pattern generated by the phrase pattern generation means.

[0017] When the predetermined sections are sections in accent component units, phrase component units, and phoneme component units, the pitch generation means comprises accent pattern generation means, phrase pattern generation means, phoneme pattern generation means, and addition means for adding the accent pattern generated by the accent pattern generation means, the phrase pattern generated by the phrase pattern generation means, and the phoneme pattern generated by the phoneme pattern generation means.

[0018] When the data includes encoded data obtained by encoding the pitch error between the pitch pattern of the natural speech and the pitch pattern model, the pitch generation means comprises pitch error restoration means for restoring the encoded pitch error data, and the pitch error restored by the pitch error restoration means is added by the addition means.

[0019] When the data includes encoded data obtained by time-series encoding of the pitch error data between the pitch pattern of the natural speech and the pitch pattern model, the pitch generation means comprises pitch error time-series restoration means for restoring the time-series-encoded pitch error data, and the pitch error data restored by the pitch error time-series restoration means is input to the pitch error restoration means.

[0020] The speech decoding device is further characterized by comprising control means that, when speech is synthesized by the editing synthesis method, modifies the data so that the pitch of the edited phoneme segments does not exceed the upper limit or fall below the lower limit.

[0021]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a block diagram schematically showing the configuration of a speech encoding/decoding apparatus according to the present invention. The apparatus consists of a speech compression encoding device 1, which compression-encodes speech, and a speech decoding device 2, which decodes speech from the compression-encoded data and outputs synthesized speech.

[0022] The speech compression encoding device 1 comprises a speech input device 20, a spectrum analyzer 21, a sound source extractor 22, a pitch modeler 23, and an encoder 24. The speech decoding device 2 comprises an encoded data storage (input unit) 10, a pitch generator 11, a sound source generator 12, a controller 13, a spectrum generator 14, and a synthesized speech output unit 15.

[0023] FIG. 2 shows a detailed configuration example of the pitch modeler 23 in the speech compression encoding device 1, and FIG. 3 shows a detailed configuration example of the pitch generator 11 in the speech decoding device 2. As shown in FIG. 1, the pitch modeler 23 is connected to the sound source extractor 22 and the encoder 24. As shown in FIG. 2, it comprises a pitch frequency input device 30, a pattern matching unit 31, a pitch pattern output unit 32, a pitch pattern ROM 33, a pitch error extractor 34, and so on.

[0024] As shown in FIG. 1, the pitch generator 11 is connected to the encoded data storage (input unit) 10 and the sound source generator 12. As shown in FIG. 3, it comprises a pitch pattern input device 40, an accent pattern generator 41, a phrase pattern generator 42, a phoneme pattern generator 43, an adder 44, a pitch error restorer 45, a pitch error time-series restorer 46, and so on.

[0025] The operation of these components is described below. First, the operation of the speech compression encoding device 1 is described. The speech input device 20 receives speech, cuts it into analysis frames, and outputs the frames to the spectrum analyzer 21. The spectrum analyzer 21 performs spectrum analysis on the speech output from the speech input device 20 and separates it into vocal tract information and sound source information. It extracts the spectrum information representing the vocal tract information and outputs it to the encoder 24, and it outputs the separated sound source information to the sound source extractor 22.

[0026] The sound source extractor 22 takes the sound source information separated by the spectrum analyzer 21 as input, extracts the pitch frequency (interval), which generally represents the height of the voice, and outputs it to the pitch modeler 23. It also outputs the sound source information other than the pitch frequency, together with the power information, to the encoder 24. The pitch modeler 23 models the pitch frequency extracted by the sound source extractor 22 to reduce and compress the amount of information, and outputs the model parameters to the encoder 24.

[0027] One form of pitch modeling that the pitch modeler 23 can perform is the Fujisaki model [Journal of the Acoustical Society of Japan 27, pp. 445-453 (1971)], a superposition model of accent phrase components and phrase components that has been formulated as rules and is widely used in Japanese rule-based synthesis. A model by Takeda adds phoneme patterns to the Fujisaki model ["Basic frequency pattern generation model and speech synthesis rule considering changes due to phonemes", IEICE Transactions (A) J73-A, pp. 379-386 (1990)]. There is also a model by Abe et al. that uses quantification theory type I ["Two-layer control method of fundamental frequency pattern", Proceedings of the Acoustical Society of Japan meeting 1-2-11, pp. 227-228 (March 1992)].
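For orientation, the commonly quoted form of the Fujisaki superposition model is given below. It is taken from the general literature on the model, not from this patent, which cites only the 1971 paper and does not reproduce any formula. The patent also does not say whether its adder 44 sums the components in the log-frequency domain as this formulation does.

```latex
\ln F_0(t) = \ln F_b
  + \sum_{i=1}^{I} A_{p,i}\, G_p(t - T_{0,i})
  + \sum_{j=1}^{J} A_{a,j}\, \bigl[ G_a(t - T_{1,j}) - G_a(t - T_{2,j}) \bigr]

G_p(t) =
  \begin{cases}
    \alpha^2 t\, e^{-\alpha t}, & t \ge 0 \\
    0, & t < 0
  \end{cases}

G_a(t) =
  \begin{cases}
    \min\bigl[\, 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \,\bigr], & t \ge 0 \\
    0, & t < 0
  \end{cases}
```

Here $F_b$ is the baseline frequency, $A_{p,i}$ and $T_{0,i}$ are the magnitude and timing of the $i$-th phrase command, $A_{a,j}$, $T_{1,j}$, and $T_{2,j}$ are the amplitude, onset, and offset of the $j$-th accent command, $\alpha$ and $\beta$ are the natural angular frequencies of the phrase and accent control mechanisms, and $\gamma$ is the ceiling of the accent component.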

[0028] An example of applying this kind of modeling to speech recognition is the prosody recognition method of Shimodaira et al. ["Phrase boundary detection of continuous speech by clustering of pitch patterns", Proceedings of the Acoustical Society of Japan meeting 2-5-14, pp. 81-82 (1991)]. By applying these techniques, the pitch modeler 23 can extract the pitch pattern of each accent phrase, phrase, and phoneme. The pitch modeling does not have to correspond exactly to speech phenomena or linguistic units such as accent phrases and phrases. It may instead be chosen to give a high compression ratio from the standpoint of coding efficiency.

[0029] When the content of the input speech is known, or when analysis and synthesis need not be performed at the same time and it suffices to store the compression-encoded data in a ROM or the like, the types of the automatically extracted patterns can be reviewed by hand during pitch modeling and the compression-encoded data corrected. This makes it possible to create pitch compression-encoded data that is high in quality, natural, and highly compressed.

[0030] In the configuration example of the pitch modeler 23 shown in FIG. 2, the pitch frequency input device 30 receives the pitch frequency extracted frame by frame by the sound source extractor 22 and outputs to the pattern matching unit 31 a pitch frequency data sequence grouped into a unit such as an accent phrase, a phrase, or a sentence. The pitch pattern ROM 33 holds patterns registered in advance by examining the pitch patterns of accents, phrases, and phonemes in speech and applying clustering or a similar method. The pitch pattern ROM 33 may hold accent patterns only, accent patterns and phrase patterns, or all (accent, phrase, and phoneme) patterns.

[0031] The pattern matching unit 31 compares the pitch frequency data sequence input from the pitch frequency input device 30 with the pitch patterns registered in the pitch pattern ROM 33, judges the best-matching pattern to be the pattern of the input pitch frequency, and outputs the pattern type for each accent, phrase, and phoneme to the pitch pattern output unit 32.
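A minimal Python sketch of this selection step follows. The patent does not state the matching criterion, so the sum of squared differences used here is an assumption, as are the function name, the tiny codebook standing in for the pitch pattern ROM 33, and the requirement that the observed sequence and the stored patterns have the same length (a real system would need some form of time normalization).

```python
def best_pattern(observed, codebook):
    """Return (index, pattern) of the codebook entry closest to `observed`.

    Closeness is the sum of squared differences (an assumed criterion).
    """
    def sq_err(pattern):
        if len(pattern) != len(observed):
            raise ValueError("pattern and observation must have equal length")
        return sum((o - p) ** 2 for o, p in zip(observed, pattern))

    index = min(range(len(codebook)), key=lambda i: sq_err(codebook[i]))
    return index, codebook[index]

# Hypothetical accent-pattern codebook (stand-in for pitch pattern ROM 33).
codebook = [
    [0, 0, 0, 0],      # flat
    [0, 10, 20, 10],   # rise then fall
    [20, 10, 0, 0],    # initial fall
]
index, pattern = best_pattern([1, 9, 18, 11], codebook)
```

Only `index` needs to be transmitted or stored, which is where the compression comes from. The residual between the observation and `pattern` is what the pitch error extractor 34 would handle.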

[0032] The pitch modeler 23 may also include a pitch error extractor 34. In that case, the pitch error extractor 34 takes as input the pitch frequency sequence output from the pitch frequency input device 30 and the value sequence of the pattern matched by the pattern matching unit 31, and outputs the sequence of errors (differences) between them to the pitch pattern output unit 32.

[0033] The pitch modeler 23 may further include a pitch error time-series encoder 35 that compression-encodes the pitch error extracted by the pitch error extractor 34. In that case, the pitch error time-series encoder 35 takes the error value sequence output from the pitch error extractor 34 as input, compresses it by time-series encoding such as ADPCM (Adaptive Differential Pulse Code Modulation), and outputs the compressed error data to the pitch pattern output unit 32.
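The idea of time-series encoding the error sequence can be sketched as follows. This is a deliberately simplified adaptive differential coder written for illustration. It is not the IMA or ITU-T G.726 ADPCM algorithm, and the step-size rule, the 3-bit code width, and all names are assumptions rather than anything the patent specifies.

```python
def _adapt(step, code, levels):
    """Grow the step after an outermost code, shrink it otherwise (assumed rule)."""
    if abs(code) == levels:
        return min(step * 2.0, 64.0)
    return max(step * 0.75, 0.5)

def adpcm_encode(errors, bits=3):
    """Encode an error sequence as small signed codes with an adaptive step."""
    levels = (1 << (bits - 1)) - 1        # bits=3 gives codes in -3..3
    step, pred, codes = 1.0, 0.0, []
    for e in errors:
        code = max(-levels, min(levels, round((e - pred) / step)))
        codes.append(code)
        pred += code * step               # track what the decoder will rebuild
        step = _adapt(step, code, levels)
    return codes

def adpcm_decode(codes, bits=3):
    """Rebuild the error sequence; mirrors the encoder's state updates."""
    levels = (1 << (bits - 1)) - 1
    step, pred, out = 1.0, 0.0, []
    for code in codes:
        pred += code * step
        out.append(pred)
        step = _adapt(step, code, levels)
    return out

# Hypothetical error sequence (difference between real and modeled pitch).
errors = [0.0, 1.0, 2.0, 2.5, 2.0, 1.0, 0.0, -1.0]
codes = adpcm_encode(errors)
decoded = adpcm_decode(codes)
```

The encoder predicts each value from what the decoder will have reconstructed so far, so both sides stay in step without sending the step size. The reconstruction is lossy: `decoded` approximates `errors` rather than reproducing it exactly.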

[0034] The pitch pattern output unit 32 outputs the pattern for each accent, phrase, and phoneme that was output by the pattern matching unit 31. When the pitch error extractor 34 is provided, it additionally outputs the error (difference) sequence between the patterns and the actual pitch frequency sequence that is output by the pitch error extractor 34. When the pitch error time-series encoder 35 is also provided, the pitch pattern output unit 32 outputs, instead of that error (difference) sequence, the parameters produced by the pitch error time-series encoder 35 through time-series encoding such as ADPCM. The patterns for each accent, phrase, and phoneme, the error (difference) sequence between the patterns and the actual pitch frequency sequence, and the parameters produced by time-series encoding such as ADPCM or by encoding such as VQ are collectively referred to as modeled pitch information.

[0035] Finally, the encoder 24 takes as input the spectrum information, the sound source and power information, and the modeled pitch information analyzed and extracted by the spectrum analyzer 21, the sound source extractor 22, and the pitch modeler 23, respectively, compresses and encodes them, and outputs the encoded data to the speech decoding device 2.

[0036] Next, the speech decoding device 2 is described. The encoded data storage (input unit) 10 of the speech decoding device 2 serves as an input unit from the transmission path when speech is transmitted in combination with the speech compression encoding device 1, and serves as a store of encoded data when encoded speech data held in a ROM or the like is used.

[0037] The encoded data storage (input unit) 10 separates the compression-encoded data, which is input frame by frame, into sound source information, spectrum information, power information, and modeled pitch information. It outputs the modeled pitch information to the pitch generator 11, the sound source information and power information to the sound source generator 12, and the spectrum information to the spectrum generator 14. The pitch generator 11 takes the modeled pitch information output from the encoded data storage (input unit) 10 as input, decodes the pitch frequency (interval), and outputs it to the sound source generator 12.

[0038] When the speech decoding device 2 synthesizes speech by editing stored encoded speech data (editing synthesis), it includes the controller 13. When the pitch frequency would become too high or too low, the controller 13 outputs to the pitch generator 11 a control signal for the modeled pitch information that limits the magnitude of the accent pattern and the phrase pattern so that the pitch frequency does not exceed the upper limit or fall below the lower limit. Based on the control signal from the controller 13, the pitch generator 11 modifies the modeled pitch information so that the pitch frequency stays within the lower and upper limits, generates the pitch frequency (interval), and outputs it to the sound source generator 12. This resolves the problem of the pitch being too high or too low and sounding unnatural, and enables more natural speech synthesis.
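One way such a limit could be applied is to scale the accent component until the summed pitch fits inside the allowed range. The sketch below illustrates that idea only. The patent does not say how the controller 13 computes its control signal, so the single shared gain, the decision to scale the accent pattern alone (the text also mentions the phrase pattern), the assumption that the phrase component by itself already lies within the limits, and all names are hypothetical.

```python
def limit_accent(phrase, accent, lower, upper):
    """Scale the accent pattern so phrase + gain * accent stays in [lower, upper].

    Returns (gain, pitch), where gain is the largest value in [0, 1] that keeps
    every frame inside the limits, assuming the phrase component alone is
    already inside them.
    """
    gain = 1.0
    for p, a in zip(phrase, accent):
        total = p + a
        if total > upper and a > 0:
            gain = min(gain, max(0.0, (upper - p) / a))
        elif total < lower and a < 0:
            gain = min(gain, max(0.0, (lower - p) / a))
    pitch = [p + gain * a for p, a in zip(phrase, accent)]
    return gain, pitch

# Hypothetical case: a large accent component pushes the last frame to 450 Hz.
gain, pitch = limit_accent(
    phrase=[150.0, 150.0, 150.0],
    accent=[0.0, 100.0, 300.0],
    lower=50.0,
    upper=400.0,
)
```

Scaling the whole pattern by one gain, rather than clipping individual frames, keeps the shape of the accent contour and avoids introducing a new discontinuity at the limit.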

[0039] Beyond speech synthesis by editing synthesis as described above, modeling the pitch also benefits recording-and-editing synthesis, in which phoneme segments (speech segments) are concatenated to reproduce synthesized speech. Because the pitch is no longer joined discontinuously, the problem of the pitch changing abruptly and producing abnormal sounds or an unnatural impression is resolved, and more natural speech synthesis becomes possible.

[0040] The sound source generator 12 takes as input the sound source information and power information from the encoded data storage (input unit) 10 and the pitch frequency (interval) output from the pitch generator 11, generates the sound source, and outputs it to the spectrum generator 14. The spectrum generator 14 constructs a synthesis filter based on the spectrum information output from the encoded data storage (input unit) 10 and generates a synthesized speech signal by filtering the sound source generated by the sound source generator 12. The synthesized speech output unit 15 generates synthesized speech by D/A-converting the synthesized speech signal generated by the spectrum generator 14.

[0041] FIG. 3 shows an example of the configuration of the pitch generator 11, which is described in more detail below with reference to FIG. 3. The pitch pattern input device 40 of the pitch generator 11 takes as input the modeled pitch information, compression-encoded frame by frame, from the encoded data storage (input unit) 10. It outputs the accent pattern to the accent pattern generator 41, the phrase pattern to the phrase pattern generator 42, and the phoneme pattern to the phoneme pattern generator 43. If data for the error (difference) sequence between the patterns and the actual pitch frequency sequence is present, it outputs that data to the pitch error restorer 45. If parameters produced by time-series encoding such as ADPCM are present, it outputs them to the pitch error time-series restorer 46.

[0042] When the modeled pitch information covers accent patterns only, the pitch generator 11 consists of only the pitch pattern input device 40 and the accent pattern generator 41. The accent pattern generator 41 generates and outputs a pitch frequency (interval) sequence in accent units from the input accent pattern.

[0043] When the modeled pitch information covers accent patterns and phrase patterns, the pitch generator 11 consists of the pitch pattern input device 40, the accent pattern generator 41, the phrase pattern generator 42, and the adder 44. The phrase pattern generator 42 generates and outputs a pitch frequency (interval) sequence in phrase units from the input phrase pattern. The adder 44 adds the input pitch frequencies (intervals) in accent units, phrase units, and so on, and outputs the pitch frequency (interval) of the original speech.

[0044] When the modeled pitch information covers all (accent, phrase, and phoneme) patterns, the pitch generator 11 consists of the pitch pattern input device 40, the accent pattern generator 41, the phrase pattern generator 42, the phoneme pattern generator 43, and the adder 44.

[0045] When the modeled pitch information includes the error (difference) sequence between the modeled pitch and the actual pitch, the pitch generator 11 also includes the pitch error restorer 45. The pitch error restorer 45 takes as input the error (difference) sequence between the patterns and the actual pitch frequency sequence, output from the pitch pattern input device 40, and passes it to the adder 44.

[0046] Furthermore, when the modeled pitch information includes parameters that have been time-series coded by ADPCM or the like, the pitch generator 11 includes the pitch error time-series restorer 46 together with the pitch error restorer 45. The pitch error time-series restorer 46 takes as input the parameters output from the pitch pattern input unit 40, which were obtained by time-series coding (ADPCM or the like) of the sequence of errors (differences) between the pitch frequency produced by adding the patterns and the actual pitch frequency sequence. From these parameters it estimates and generates, by ADPCM or the like, that error (difference) sequence. The generated error sequence is output to the pitch error restorer 45.
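As a rough illustration of restoring a time-series coded error sequence, the sketch below uses plain fixed-step differential coding (DPCM). This is a deliberate simplification: the specification names ADPCM only as an example and gives no coder details, and a real ADPCM decoder would also adapt its step size. The function name, the step size of 0.5 Hz, and all values are assumptions.

```python
from typing import List, Sequence


def decode_error_sequence(codes: Sequence[int], step: float = 0.5) -> List[float]:
    """Rebuild a pitch error sequence from quantized frame-to-frame differences."""
    errors: List[float] = []
    previous = 0.0
    for code in codes:
        # Each code is a quantized delta relative to the previous error value.
        previous += code * step
        errors.append(previous)
    return errors


# Hypothetical coded differences and modeled pitch values (Hz).
codes = [2, 1, -1, -2]
errors = decode_error_sequence(codes)

modeled_pitch = [120.0, 131.0, 135.0, 118.0]
# Adding the restored error to the modeled pitch approximates the actual pitch.
restored_pitch = [m + e for m, e in zip(modeled_pitch, errors)]
```

In the terms of the description, `decode_error_sequence` stands in for the pitch error time-series restorer 46, and the final addition for the adder 44.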

[0047] As described above, modeling the pitch when encoding and decoding it allows the pitch to be encoded with less data than conventional pitch coding, so that high-quality decoded speech can be generated with a small transmission capacity and a small amount of memory. The flow of processing according to the present invention is described below using a specific example. FIG. 4 shows an example of the processing when the utterance 「次の電車は天理行きです。」 ("The next train is bound for Tenri.") is input to the speech compression encoding device 1.

[0048] The sound source extractor 22 calculates the pitch pattern of the input natural speech and decomposes it into phrase, accent, and phoneme components. This decomposition can be performed by techniques such as automatic phoneme segmentation using well-known speech recognition, for example an HMM (Hidden Markov Model), or the prosody recognition method of Shimodaira et al. mentioned above ["Phrase boundary detection in continuous speech by clustering of pitch patterns," Proceedings of the Acoustical Society of Japan, 2-5-14, pp. 81-82 (1991)].

[0049] In the pitch modeler 23, the pattern matching unit 31 selects, for each component, the best-matching pattern among those registered in advance in the pitch pattern ROM 33. As a processing example, the accents are first segmented, and the best-matching accent pattern and phrase pattern are searched for and extracted from the patterns registered in the ROM. Next, the phoneme pattern for each phoneme (consonant) is searched for and extracted from the phoneme patterns registered in the ROM. The encoder 24 uses the code number of each retrieved pattern as the compressed pattern.
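The search for the best-matching registered pattern can be sketched as a nearest-neighbor lookup. The specification does not state the matching criterion, so the squared-error distance used here is an assumption, as are the function name `best_matching_code`, the table `ACCENT_PATTERNS`, and the requirement that the segment and the registered patterns have equal length.

```python
from typing import Dict, Sequence


def best_matching_code(segment: Sequence[float],
                       codebook: Dict[int, Sequence[float]]) -> int:
    """Return the code number of the registered pattern closest to the segment."""
    def squared_error(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # The code number with the smallest distance is what the encoder would emit.
    return min(codebook, key=lambda code: squared_error(segment, codebook[code]))


# Hypothetical accent patterns standing in for the contents of the pitch pattern ROM.
ACCENT_PATTERNS = {
    0: [0.0, 10.0, 20.0, 10.0],
    1: [0.0, 20.0, 10.0, 0.0],
    2: [0.0, 0.0, 0.0, 0.0],
}

segment = [1.0, 18.0, 12.0, 1.0]
code = best_matching_code(segment, ACCENT_PATTERNS)
```

Only the code number needs to be transmitted or stored, which is where the reduction in data comes from.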

[0050] During the pattern matching, a function may be provided that deforms the patterns registered in the pitch pattern ROM 33 using some or all of parameters such as the compression/expansion ratio along the time axis and the start value, end value, maximum value, minimum value, and compression/expansion ratio of the pitch pattern. This reduces the number of patterns the ROM must hold, and the code number and deformation parameters of each pitch pattern may then be used as the compression parameters of that pattern.
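One way such a deformation could work is sketched below: a registered pattern is stretched along the time axis by linear interpolation and then rescaled to a given minimum and maximum. The specification lists the parameters but not the deformation formulas, so both functions (`resample`, `rescale`) and their behavior are assumptions made for illustration.

```python
from typing import List, Sequence


def resample(pattern: Sequence[float], new_length: int) -> List[float]:
    """Stretch or compress a pattern along the time axis by linear interpolation."""
    if new_length == 1 or len(pattern) == 1:
        return [pattern[0]] * new_length
    scale = (len(pattern) - 1) / (new_length - 1)
    out: List[float] = []
    for i in range(new_length):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(pattern) - 1)
        frac = pos - lo
        out.append(pattern[lo] * (1 - frac) + pattern[hi] * frac)
    return out


def rescale(pattern: Sequence[float], new_min: float, new_max: float) -> List[float]:
    """Map the pattern's value range linearly onto [new_min, new_max]."""
    old_min, old_max = min(pattern), max(pattern)
    if old_max == old_min:
        return [new_min] * len(pattern)
    gain = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * gain for v in pattern]


# A hypothetical normalized pattern, deformed to 5 frames spanning 100 to 140 Hz.
registered = [0.0, 1.0, 0.0]
stretched = resample(registered, 5)
deformed = rescale(stretched, 100.0, 140.0)
```

With a scheme like this, one registered shape can serve many utterances, at the cost of sending the target length and value range alongside the code number.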

[0051] A typical way to create the patterns registered in the pitch pattern ROM 33 is to extract the pitch patterns from a large amount of natural speech, cluster them, and thereby determine the pitch patterns to be stored in the ROM. Storing many pitch patterns in the ROM 33 raises the modeling accuracy but lowers the compression ratio and increases the amount of processing.
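The clustering step can be sketched with k-means, which is one common choice; the specification says only that clustering is performed and does not name an algorithm. The function name `kmeans_codebook`, the deterministic initialization from the first `k` patterns, and the sample data are assumptions.

```python
from typing import List, Sequence


def kmeans_codebook(patterns: Sequence[Sequence[float]], k: int,
                    iterations: int = 10) -> List[List[float]]:
    """Cluster equal-length pitch patterns and return k centroid patterns."""
    # Deterministic initialization: start from the first k patterns.
    centroids = [list(p) for p in patterns[:k]]
    for _ in range(iterations):
        clusters: List[List[Sequence[float]]] = [[] for _ in range(k)]
        for p in patterns:
            nearest = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Hypothetical two-frame patterns extracted from natural speech.
patterns = [[0.0, 10.0], [10.0, 0.0], [0.0, 12.0], [12.0, 0.0]]
codebook = kmeans_codebook(patterns, k=2)
```

The parameter `k` expresses the trade-off described above: a larger codebook models pitch more accurately but needs more bits per code number and more matching work.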

[0052] The pitch generator 11 on the speech decoding device 2 side has a ROM storing the same pitch patterns as the pitch pattern ROM held on the speech compression encoding device 1 side, and generates a pitch pattern from the code numbers encoded and sent by the encoder 24.

[0053]

[Effects of the Invention] As is clear from the above, the present invention models the pitch and compression-encodes it using the model parameters, or additionally encodes the difference between the actual pitch and the pitch model, or further estimates that difference as a time series. In each case the compression ratio of speech coding can be raised further.

[0054] Furthermore, in an editing and synthesis scheme that edits speech and outputs synthesized speech, modeling the pitch with a model grounded in the speech production mechanism connects the pitch continuously across the speech segments being joined, so that more natural-sounding speech can be synthesized.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the circuit configuration of an example of a speech synthesis device according to the present invention.

FIG. 2 is a block diagram showing a configuration example of the pitch modeler.

FIG. 3 is a block diagram showing a configuration example of the pitch generator.

FIG. 4 is a diagram illustrating an example of pattern extraction from natural speech.

[Explanation of symbols]

1: speech compression encoding device, 2: speech decoding device, 10: encoded data storage unit (input unit), 11: pitch generator, 12: sound source generator, 13: controller, 14: spectrum generator, 15: synthesized speech output unit, 20: speech input unit, 21: spectrum analyzer, 22: sound source extractor, 23: pitch modeler, 24: encoder, 30: pitch frequency input unit, 31: pattern matching unit, 32: pitch pattern output unit, 33: pitch pattern ROM, 34: pitch error extractor, 35: pitch error time-series encoder, 40: pitch pattern input unit, 41: accent pattern generator, 42: phrase pattern generator, 43: phoneme pattern generator, 44: adder, 45: pitch error restorer, 46: pitch error time-series restorer

Claims

(57) [Claims]

1. A speech decoding device comprising: storage means for storing a plurality of pitch pattern models of a predetermined section of natural speech; control means for outputting, when speech is synthesized, a control signal for controlling the pitch frequency of a phoneme segment to be edited; and pitch generation means for decoding, using the pitch pattern model, data encoded in association with the pitch pattern model, the pitch generation means having a pitch generator that generates, based on the control signal, a pitch frequency changed so that it does not exceed an upper limit and a lower limit.

2. The speech decoding device according to claim 1, wherein the predetermined section is a section of an accent component unit, and the pitch generation means includes accent pattern generation means.

3. The speech decoding device according to claim 1, wherein the predetermined section comprises a section of an accent component unit and a section of a phrase component unit, and the pitch generation means includes accent pattern generation means, phrase pattern generation means, and addition means for adding the accent pattern generated by the accent pattern generation means and the phrase pattern generated by the phrase pattern generation means.

4. The speech decoding device according to claim 1, wherein the predetermined section comprises a section of an accent component unit, a section of a phrase component unit, and a section of a phoneme component unit, and the pitch generation means includes accent pattern generation means, phrase pattern generation means, phoneme pattern generation means, and addition means for adding the accent pattern generated by the accent pattern generation means, the phrase pattern generated by the phrase pattern generation means, and the phoneme pattern generated by the phoneme pattern generation means.

5. The speech decoding device according to claim 1, wherein the data includes encoded data obtained by encoding a pitch error between a pitch pattern of natural speech and the pitch pattern model, the pitch generation means includes pitch error restoration means for restoring the encoded pitch error data, and the pitch error restored by the pitch error restoration means is input to the addition means.

6. The speech decoding device according to claim 5, wherein the data includes encoded data obtained by time-series encoding of pitch error data between a pitch pattern of natural speech and the pitch pattern model, the pitch generation means includes pitch error time-series restoration means for restoring the time-series encoded pitch error data, and the pitch error data restored by the pitch error time-series restoration means is input to the pitch error restoration means.