JP2001034299A - Sound synthesis device

Info

Publication number: JP2001034299A
Application number: JP11205847A
Authority: JP
Inventors: Akitoshi Saito; 彰利斉藤
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 1999-07-21
Filing date: 1999-07-21
Publication date: 2001-02-09

Abstract

PROBLEM TO BE SOLVED: To reproduce vocal data with high quality. SOLUTION: Musical tone data, vocal data compression-encoded by an analysis-synthesis coding method that uses a codebook, and the codebook used for that compression encoding are distributed via a base station 2 and stored in a storage means 13. When the compression-encoded vocal data is decoded in a voice synthesis unit 15, the codebook specific to that vocal is loaded into the voice synthesis unit 15 beforehand and used for decoding. A realistic, high-quality vocal sound can therefore be reproduced. Moreover, a musical tone synthesis unit 16 synthesizes musical tones from the musical tone data, and the vocal data is reproduced in synchronization with these tones and output together with them, so musical tones accompanied by a high-quality vocal sound can be reproduced.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a voice synthesis device suitable for use in car telephones, mobile telephones, and the like.

[0002]

[Prior Art] In mobile telephone systems such as PDC (Personal Digital Cellular telecommunication system), known as a digital cellular system, and in the Personal Handyphone System (PHS), the frequency band that can be occupied is limited, so voice signals are transmitted after high-efficiency compression encoding. One such high-efficiency voice compression coding method is analysis-synthesis coding, which uses a voice synthesis model consisting of a sound source model and a vocal tract model. Analysis-synthesis coding in turn includes the MPC (Multi-Pulse Excited LPC) method and the CELP (Code Excited LPC) method, which performs vector quantization using a codebook; the CELP method has been put to practical use in certain digital cellular systems. The VSELP (Vector Sum Excited Linear Prediction) method, which can reduce the number of code vectors in the codebook from, for example, 512 in the CELP method to 9, is also in practical use. In the VSELP method, for example, each of 9 code vectors is given a + or - sign and the 2^9 possible sums are formed, which yields 512 code vectors. Compared with the CELP method, the VSELP method therefore requires less codebook memory, is less affected by code errors, and needs less computation in the synthesis filter.
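
The vector-sum construction described above can be illustrated with a minimal Python sketch. Only the counting argument (9 stored vectors, 2^9 = 512 derived vectors) comes from the text. The names `BASIS` and `vselp_codevector`, the dimension `K = 40`, and the random basis values are assumptions made for illustration; real VSELP basis vectors are designed, not random.

```python
import random

M = 9    # number of stored basis vectors (from the text)
K = 40   # vector dimension; an arbitrary value chosen for this sketch

random.seed(0)
# Placeholder basis vectors (assumption); a real codec would use trained vectors.
BASIS = [[random.uniform(-1.0, 1.0) for _ in range(K)] for _ in range(M)]

def vselp_codevector(index):
    """Build the code vector for a 9-bit index: bit m selects the sign of basis vector m."""
    signs = [1.0 if (index >> m) & 1 else -1.0 for m in range(M)]
    return [sum(s * v[n] for s, v in zip(signs, BASIS)) for n in range(K)]

num_codevectors = 2 ** M                     # code vectors that can be derived
stored_values = M * K                        # values held by the vector-sum codebook
full_codebook_values = num_codevectors * K   # values an explicit codebook would hold
```

Under these assumptions the vector-sum codebook stores 9 vectors instead of 512, which is the memory saving the paragraph refers to.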

[0003] FIG. 6 shows a typical configuration of a conventional CELP-type encoder capable of compressing and encoding voice information with high efficiency. The characteristics of a voice can be expressed by the pitch L and noise component of the source voice generated by the vocal cords (hereinafter, "source voice feature parameters") and by the vocal tract transfer characteristic as the voice passes through the throat and mouth together with the radiation characteristic at the lips (hereinafter collectively, "vocal tract feature parameters"). In other words, a voice synthesis model can be represented by a vocal cord model that generates the source voice and a vocal tract model cascaded with the vocal cord model.

[0004] The CELP-type encoder shown in FIG. 6 encodes input voice on the basis of this voice synthesis model. The long-term predictor 101 cuts the input voice into frames of, for example, about 20 ms and processes each frame with a filter having the inverse characteristic of the throat filter to obtain the source voice. The cross-correlation between this source voice and the synthesized source voice of the previous frame is computed for each frame, and the sample shift that gives the maximum cross-correlation is detected as the pitch. This pitch is output for each frame as the pitch L parameter, which corresponds approximately to the fundamental frequency of the input voice. The reflection coefficient analysis unit 104 obtains the reflection coefficients γ by executing FLAT (fixed-point covariance lattice algorithm) using the autocorrelation of the input voice, and outputs them for each frame as the vocal tract feature parameters. Further, the reflection coefficients γ are supplied as vocal tract feature parameters to the throat approximation filter unit 105, which receives the synthesized source voice, so that the throat approximation filter unit 105 outputs synthesized voice. The subtractor 106 subtracts the input voice from this synthesized voice to output a residual signal for each frame.
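
The pitch detection step can be sketched as a search for the sample shift that maximizes a cross-correlation. The following is a simplified, hypothetical illustration, not the long-term predictor 101 itself: it correlates the current frame with one buffer of earlier samples, normalizes by the energy of the delayed segment, and assumes `len(history) >= max_lag`. The function name and the lag range are assumptions.

```python
import math

def find_pitch_lag(history, frame, min_lag, max_lag):
    """Return the lag (in samples) that maximizes the normalized cross-correlation
    between the current frame and the same signal delayed by that lag."""
    signal = list(history) + list(frame)
    start = len(history)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        num = 0.0
        energy = 0.0
        for n in range(len(frame)):
            delayed = signal[start + n - lag]
            num += frame[n] * delayed
            energy += delayed * delayed
        # Normalizing keeps high-energy segments from winning on level alone.
        score = num / math.sqrt(energy) if energy > 0.0 else float("-inf")
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

For a signal that repeats every 50 samples, a search over lags 20 to 90 returns 50. At an 8 kHz sampling rate (an assumed value) that lag corresponds to a fundamental frequency of 160 Hz.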

[0005] For each frame, one code vector is selected from the source sound codebook 102 so that the synthesized voice is closest to the input voice (that is, so that the residual signal is smallest), and an index I identifying the selected code vector is output. The dimension of the code vector is made equal to the number of samples per frame, and the index I, which is the output code of the vector quantization, together with the pitch L forms the source voice feature parameters. When the code vector has K dimensions and the index I has B bits, the number of bits per sample b is b = B / K, so the larger the number of dimensions K, the higher the compression ratio of the vector quantization. The codebook consists of 2^B code vectors of representative patterns. Meanwhile, the synthesis unit 103 combines the decoded signal based on the pitch L parameter from the long-term predictor 101 with the code vector indicated by the index I to generate the synthesized source voice, and the throat approximation filter unit 105 filters this synthesized source voice according to the reflection coefficients γ to produce the synthesized voice described above.
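
The codebook search and the relation b = B / K can be sketched as follows. This is a deliberately simplified illustration: it compares the target directly with each code vector, whereas the encoder in the text passes each candidate through the throat approximation filter before forming the residual. The function names are hypothetical.

```python
def search_codebook(target, codebook):
    """Return the index I of the code vector with the smallest squared residual."""
    best_index, best_error = 0, float("inf")
    for index, vector in enumerate(codebook):
        error = sum((t - v) ** 2 for t, v in zip(target, vector))
        if error < best_error:
            best_index, best_error = index, error
    return best_index

def bits_per_sample(index_bits, dimension):
    """b = B / K: bits spent per sample for a B-bit index over a K-dimensional vector."""
    return index_bits / dimension
```

With a 9-bit index (512 code vectors) and K = 160 samples per frame (an assumed frame size of 20 ms at 8 kHz), `bits_per_sample(9, 160)` gives about 0.056 bits per sample for the index alone.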

[0006] The original voice signal can be reproduced by supplying the voice parameters output from the encoder in this way, namely the index I, the pitch L, and the reflection coefficients γ, to a CELP-type decoder. FIG. 7 shows a typical configuration of such a CELP-type decoder. In FIG. 7, the data processing unit 130 separates the per-frame voice parameters input to the decoder into the index I, the pitch L, and the reflection coefficients γ, and routes the pitch L parameter to the pitch L generation unit 132, the index I parameter to the codebook 131, and the reflection coefficient γ parameter to the throat approximation filter 134. The codebook 131 has the same contents as the source sound codebook 102 in the encoder, and those contents are recorded in a ROM (Read Only Memory).

[0007] On the basis of the pitch L parameter, the pitch L generation unit 132 generates a decoded signal at the frequency of the voice and supplies it to the source waveform reproduction unit 133. The source waveform reproduction unit 133 is also supplied with the data of the code vector indicated by the index I and read from the codebook 131, and reproduces the synthesized source waveform by combining this data with the decoded pitch L signal. The source waveform reproduction unit 133 thus outputs, as the synthesized source waveform, a waveform similar to that produced by the vibration of the human vocal cords. This synthesized source waveform is filtered by the throat approximation filter 134, whose filter coefficients are controlled by the reflection coefficient γ parameter, to produce synthesized voice. The throat approximation filter 134 reproduces the transfer function of the human throat and mouth; it stores in advance the reflection coefficients γ supplied from the data processing unit 130 and supplies them to each filter as needed. The synthesized voice output from the throat approximation filter 134 is supplied to the spectral filter 135, which removes any unnaturalness as speech.
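
The decoder path described above (code vector plus pitch contribution, then a filter controlled by reflection coefficients) can be sketched as below. This is a minimal illustration under stated assumptions, not the circuit of FIG. 7: the gains `pitch_gain` and `code_gain` do not appear in the text, the throat approximation filter is modeled as a standard all-pole lattice filter, and the sign convention for the reflection coefficients is one of several in use. `past_excitation` must hold at least `pitch_lag` samples.

```python
def build_excitation(code_vector, past_excitation, pitch_lag,
                     pitch_gain=0.5, code_gain=1.0):
    """Combine the pitch contribution (excitation delayed by pitch_lag samples)
    with the code vector selected by index I. Gains are hypothetical."""
    buffer = list(past_excitation)
    start = len(buffer)
    for n, c in enumerate(code_vector):
        delayed = buffer[start + n - pitch_lag]
        buffer.append(pitch_gain * delayed + code_gain * c)
    return buffer[start:]

def lattice_synthesis(excitation, k):
    """All-pole lattice filter driven by reflection coefficients k[0..p-1]."""
    p = len(k)
    b_prev = [0.0] * (p + 1)          # backward signals from the previous sample
    out = []
    for x in excitation:
        f = x                          # forward signal entering stage p
        b = [0.0] * (p + 1)
        for m in range(p, 0, -1):
            f = f - k[m - 1] * b_prev[m - 1]
            b[m] = k[m - 1] * f + b_prev[m - 1]
        b[0] = f
        out.append(f)
        b_prev = b
    return out
```

With a single coefficient the filter reduces to y[n] = x[n] - k·y[n-1], so a unit impulse through `lattice_synthesis` with `k = [0.5]` produces 1, -0.5, 0.25.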

[0008] Because the reflection coefficients γ involve a large amount of data, they may also be transmitted at a lower bit rate using compression coding such as vector quantization. For example, to transmit vector-quantized reflection coefficients γ, the encoder is provided with a reflection coefficient codebook consisting of a plurality of representative vectors, the reflection coefficients γ output from the reflection coefficient analysis unit 104 are replaced by one of the representative vectors, and a reflection coefficient index indicating that representative vector is transmitted. The decoder is likewise provided with a reflection coefficient codebook identical in content to the one in the encoder, and reads out the reflection coefficients γ indicated by the transmitted reflection coefficient index.
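
The round trip for vector-quantized reflection coefficients can be sketched as an encoder-side nearest-vector lookup and a decoder-side table read against the same codebook. The function names and the three-entry codebook in the usage note are hypothetical.

```python
def quantize(gamma, codebook):
    """Encoder side: index of the representative vector nearest to gamma."""
    return min(range(len(codebook)),
               key=lambda i: sum((g - r) ** 2 for g, r in zip(gamma, codebook[i])))

def dequantize(index, codebook):
    """Decoder side: read the representative vector for a received index."""
    return codebook[index]
```

For example, with the codebook `[[0.9, -0.5], [0.2, 0.1], [-0.3, 0.4]]`, the coefficients `[0.25, 0.0]` are sent as index 1 and recovered as `[0.2, 0.1]`; only the index crosses the channel.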

[0009]

[Problem to Be Solved by the Invention] CELP-type high-efficiency voice compression methods are generally adopted as the voice coding method in digital cellular systems, but because transmission capacity is limited, the transmitted voice parameters are kept to the minimum amount needed to recognize the speaker and understand the content of the conversation. In the VSELP method in particular, the number of code vectors stored in the codebook is limited. As a result, the sound quality obtained when the transmitted voice data is decoded is not good.

[0010] In addition, because the characteristics of the vocal cords and vocal tract differ from person to person, the amount of source voice feature parameters and vocal tract feature parameters would need to be varied according to the speaker; digital cellular systems, however, use a standard parameter amount that cannot be changed for each speaker. Consequently, for some speakers, decoding the transmitted voice data yields sound quality barely sufficient to recognize the other party. With sound quality limited in this way, when a musical tone signal or a vocal sound intended for listening is compression-encoded by a CELP-type encoder and transmitted, the musical tone or vocal sound obtained by decoding the transmitted data is not fit for listening.

[0011] In view of these problems, an object of the present invention is to provide a voice synthesis device that, when musical tone data and vocal data subjected to high-efficiency voice compression encoding are distributed, can reproduce the vocal sound with high sound quality and can reproduce high-quality vocal sound in synchronization with the musical tone data.

[0012]

[Means for Solving the Problem] To achieve the above object, the voice synthesis device of the present invention comprises: storage means for storing at least distributed vocal data, compression-encoded by an analysis-synthesis coding method that performs vector quantization using a codebook, and the codebook used for that compression encoding; and voice synthesis means for synthesizing vocal sound by decoding the vocal data read from the storage means using the codebook stored in the storage means.

[0013] In the voice synthesis device of the present invention, distributed musical tone data may also be stored in the storage means, and the device may further comprise musical tone synthesis means for generating musical tones from the musical tone data read from the storage means, with the musical tones generated by the musical tone synthesis means synchronized with the vocal sound. Further, a second codebook may be stored in advance in the voice synthesis means and used to decode compression-encoded call signals. Still further, a plurality of channels of vocal data, each compression-encoded by an analysis-synthesis coding method that performs vector quantization using a codebook specific to the respective vocal, may be distributed together with the respective vocal-specific codebooks and stored in the storage means; the voice synthesis means may then decode the plurality of channels of vocal data in parallel, each with its own vocal-specific codebook stored in the storage means, thereby synthesizing a plurality of channels of vocal sound synchronized with the musical tones synthesized by the musical tone synthesis means.

[0014] According to the present invention, vocal data compression-encoded by an analysis-synthesis coding method that performs vector quantization using a codebook is distributed together with the codebook used for that compression encoding. The compression-encoded vocal data can therefore be decoded with reference to the codebook specific to that vocal, so realistic vocal sound of high quality can be reproduced. When musical tone data is also distributed, musical tones can be synthesized from the transmitted musical tone data and the vocal data can be reproduced in synchronization with those tones and output together with them, so musical tones accompanied by high-quality vocal sound can be reproduced. Because call signals are decoded using a codebook stored in advance, distributing a codebook for vocal data does not degrade the sound quality of decoded call signals. Furthermore, when the distributed vocal data consists of a plurality of channels of vocal data, distributing a plurality of corresponding codebooks with it allows each channel of vocal data to be decoded with high quality.
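
The separation between the preinstalled call codebook and the downloaded vocal-specific codebooks can be sketched as a small lookup structure. Everything here is a hypothetical illustration: the class `CodebookStore`, its method names, the channel identifiers, and the `decode_frame` callback are assumptions, and the channels are decoded one after another rather than in parallel.

```python
class CodebookStore:
    """Keeps the call codebook apart from downloaded vocal-specific codebooks."""

    def __init__(self, call_codebook):
        self.call_codebook = call_codebook   # preinstalled; downloads never replace it
        self.vocal_codebooks = {}            # downloaded, keyed by vocal channel id

    def store_download(self, channel_id, codebook):
        self.vocal_codebooks[channel_id] = codebook

    def codebook_for(self, kind, channel_id=None):
        if kind == "call":
            return self.call_codebook
        return self.vocal_codebooks[channel_id]

def decode_channels(store, streams, decode_frame):
    """Decode every vocal channel with its own codebook.
    streams maps channel id -> list of received indices."""
    return {
        channel_id: [decode_frame(index, store.codebook_for("vocal", channel_id))
                     for index in indices]
        for channel_id, indices in streams.items()
    }
```

Because call decoding always reads `call_codebook`, storing a new vocal codebook leaves call quality unchanged, which mirrors the point made in the paragraph.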

[0015]

[Embodiments of the Invention] FIG. 1 illustrates an embodiment in which the voice synthesis device of the present invention is applied to a mobile telephone. In FIG. 1, reference numeral 1 denotes a mobile telephone and 2 denotes a base station that manages a radio zone. Cellular systems for mobile telephones generally adopt a small-zone scheme in which many radio zones are arranged within the service area. Each radio zone is managed by a base station 2, and when the mobile telephone 1, which is a mobile station, makes a call to an ordinary telephone, the mobile telephone 1 is connected via the base station 2 to an exchange, and from the exchange to the general telephone network.

[0016] The mobile telephone 1 has an antenna 10, which is generally retractable, connected to a transceiver unit 11. The transceiver unit 11 demodulates signals received by the antenna 10, and modulates signals to be transmitted and supplies them to the antenna 10. The telephone function unit 12 performs the control needed for the mobile telephone 1 to function as a telephone when talking with another telephone, and includes a CELP-type encoder and a decoder (voice synthesis unit) 15 to support high-efficiency voice compression. During a call, the voice signal input from the microphone 19 is compression-encoded by the encoder of the telephone function unit 12, modulated by the transceiver unit 11, and transmitted from the antenna 10. The compression-encoded voice data received by the antenna 10 is demodulated by the transceiver unit 11, decoded into the original voice signal by the voice synthesis unit 15 of the telephone function unit 12, and output from the output unit 20, which consists of a speaker or the like. In this way, signals are exchanged between the transceiver unit 11 and the telephone function unit 12 during a call.

[0017] The storage means 13 stores, as described later, distributed musical tone data in MIDI format, vocal data subjected to high-efficiency voice compression encoding by a CELP-type encoder that has a codebook, and the codebook used when that vocal data was encoded. The MIDI decoder 14 decodes the MIDI-format musical tone data stored in the storage means 13 and supplies it to the musical tone synthesis unit 16 as performance data. In the voice synthesis unit 15 within the telephone function unit 12, the vocal-specific codebook stored in the storage means 13 is loaded into the memory (RAM) of the codebook section, and the vocal data read from the storage means 13 is decoded in the voice synthesis unit 15 using the vocal-specific codebook to synthesize vocal sound. The musical tones synthesized by the musical tone synthesis unit 16 serve as the accompaniment, and the voice synthesis unit 15 synthesizes the vocal sound in synchronization with this accompaniment. The accompaniment output from the musical tone synthesis unit 16 and the vocal sound output from the voice synthesis unit 15 are mixed by the mixer 17, amplified by the amplification unit 18, and then emitted from the output unit 20.
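
The mixing stage can be sketched as a sample-by-sample sum of the accompaniment and the decoded vocal. This is a hypothetical illustration of the role of the mixer 17 only: the text does not specify gains or clipping, so `vocal_gain` and `limit` are assumptions, and synchronization is reduced here to both signals starting at the same sample.

```python
def mix(accompaniment, vocal, vocal_gain=1.0, limit=1.0):
    """Sum two sample sequences that start at the same instant and
    clip the result to the range [-limit, limit]."""
    length = max(len(accompaniment), len(vocal))
    out = []
    for n in range(length):
        a = accompaniment[n] if n < len(accompaniment) else 0.0
        v = vocal[n] if n < len(vocal) else 0.0
        s = a + vocal_gain * v
        out.append(max(-limit, min(limit, s)))
    return out
```

When one signal is shorter than the other, the sketch treats the missing samples as silence.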

[0018] Next, the operation of the configuration shown in FIG. 1 is described. When the mobile telephone 1 is connected to a server that is connected to the exchange via a telephone line or the like, desired musical tone data and vocal data can be downloaded to the mobile telephone 1 from the database held on the server, via the exchange and the base station 2. The vocal data is accompanied by the vocal-specific codebook used when it was encoded. For the download, the mobile telephone 1 is connected to the base station 2 over a radio link, and the data transmitted from the antenna 2a of the base station 2, for example the musical tone data together with the vocal data and codebook data described above, is received by the antenna 10 and demodulated by the transceiver unit 11. Because the mobile telephone 1 is at this time set to the communication mode for connecting to a server, the musical tone data, vocal data, and codebook data demodulated by the transceiver unit 11 are stored in the storage means 13, which completes the download.

[0019] When the mobile telephone 1 is in the communication mode for talking with another telephone, the call signal from the other telephone, transmitted from the antenna 2a of the base station 2, is received by the antenna 10 and demodulated by the transceiver unit 11. Because the mobile telephone 1 is at this time set to the communication mode for talking with another telephone, the call signal demodulated by the transceiver unit 11 is supplied to the telephone function unit 12, decoded in the voice synthesis unit 15 using the codebook for calls, and emitted as sound from the output unit 20, which consists of a speaker or the like. Voice input through the microphone 19 is encoded by the encoder in the telephone function unit 12, modulated by the transceiver unit 11, and transmitted from the antenna 10. This transmitted signal is received by the base station 2 via the antenna 2a and forwarded to the other telephone through the exchange connected to the base station 2, so that a two-way call takes place.

[0020] When another telephone dials the telephone number of the mobile telephone 1 in order to call it, the exchange detects the base station 2 with which the mobile telephone 1 is registered and notifies the mobile telephone 1 of the incoming call via that base station 2. On receiving this notification, the mobile telephone 1 emits, from the telephone function unit 12, a ringtone that informs the user of the incoming call. At this time, the vocal data and MIDI-format musical tone data set in advance as the ringtone are read from the storage means 13 and supplied to the voice synthesis unit 15 of the telephone function unit 12 and to the MIDI decoder 14, respectively. The MIDI decoder 14 decodes the musical tone data and supplies it to the musical tone synthesis unit 16, which generates a musical tone signal corresponding to the supplied musical tone data and supplies it to the mixer 17.

[0021] Before decoding, the vocal-specific codebook stored in the storage means is set in the voice synthesis unit 15. The voice synthesis unit 15 then decodes the CELP-type compression-encoded vocal data using the vocal-specific codebook. This decoding of the vocal data in the voice synthesis unit 15 is executed in synchronization with the musical tone signal generated by the musical tone synthesis unit 16. The vocal signal decoded by the voice synthesis unit 15 and the musical tone signal generated by the musical tone synthesis unit 16 are then mixed in the mixer 17, amplified by the amplification unit 18, and emitted from the output unit 20. In other words, the vocal sound is emitted with the musical tones as accompaniment. As a result, when a call arrives at the mobile telephone 1, vocal sound with accompaniment (hereinafter, "ringtone melody") is emitted from the output unit 20 as the ringtone. Because the vocal sound in this ringtone melody is decoded using the distributed codebook specific to that vocal sound, realistic vocal sound of high quality is obtained from the voice synthesis unit 15.

[0022] The generation of the musical tone signal by the musical tone synthesis unit 16 together with the decoding of the vocal signal by the voice synthesis unit 15 is not limited to incoming calls; it may also take place while a call is on hold. In that case, the vocal data and MIDI-format musical tone data set in advance as the hold tone are read from the storage means 13 and supplied to the voice synthesis unit 15 of the telephone function unit 12 and to the MIDI decoder 14, respectively. The MIDI decoder 14 decodes the musical tone data and supplies it to the musical tone synthesis unit 16, which generates a musical tone signal corresponding to the supplied musical tone data. The voice synthesis unit 15 decodes the supplied vocal data using the vocal-specific codebook, and the decoded vocal sound is mixed with the musical tone signal in the mixer 17, amplified by the amplification unit 18, and emitted from the output unit 20. Because the vocal sound in this hold tone is decoded using the distributed codebook specific to that vocal sound, realistic vocal sound of high quality is obtained from the voice synthesis unit 15.

[0023] The vocal data and musical tone data that have been distributed and stored in the storage means 13 can also be reproduced for listening. In this case, the mobile telephone 1 is put into a musical tone reproduction mode and the vocal data to be reproduced is selected. For the selection, the song titles and other information for the vocal data stored in the storage means 13 are shown on a display unit (not shown). When reproduction is then instructed, the selected vocal data and musical tone data are read from the storage means 13 and supplied to the voice synthesis unit 15 of the telephone function unit 12 and to the MIDI decoder 14, respectively. The MIDI decoder 14 decodes the musical tone data and supplies it to the musical tone synthesis unit 16, which generates a musical tone signal corresponding to the supplied musical tone data. The voice synthesis unit 15 decodes the supplied vocal data using the vocal-specific codebook, and the decoded vocal sound is mixed with the musical tone signal in the mixer 17, amplified by the amplification unit 18, and emitted from the output unit 20. Because the vocal sound reproduced for listening in this way is decoded using the distributed codebook specific to that vocal sound, realistic vocal sound of high quality is obtained from the voice synthesis unit 15.

[0024] Here, the vocal-specific codebook will be described. If vocal sound is compression-encoded by an encoder configured as shown in FIG. 6, the code vector that minimizes the residual signal is selected from the source sound codebook 102. However, because the source sound codebook 102 consists of a limited number of code vectors, there may be no code vector that makes the residual signal sufficiently small. The vocal sound is therefore input as a training signal and the codebook is trained offline. In this training, the code vectors of the source sound codebook 102 are rewritten so that the residual signal is always small. This yields a source sound codebook 102 adapted to the input vocal sound. In this specification, the source sound codebook 102 adapted to the vocal sound is referred to as the vocal-specific codebook.
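The offline training described above can be illustrated with a small sketch. The specification does not name a training algorithm, so the Lloyd (k-means style) update used here is an assumption, as are the function names `nearest_index`, `mean_residual`, and `adapt_codebook` and the use of squared error as the residual measure.

```python
def nearest_index(codebook, vector):
    """Index of the code vector with the smallest squared error to `vector`."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )


def mean_residual(codebook, vectors):
    """Average squared error between each vector and its nearest code vector."""
    total = 0.0
    for vector in vectors:
        best = codebook[nearest_index(codebook, vector)]
        total += sum((c - v) ** 2 for c, v in zip(best, vector))
    return total / len(vectors)


def adapt_codebook(codebook, training_vectors, iterations=10):
    """Rewrite code vectors so that the residual on the training signal shrinks.

    Assumed method: each code vector is replaced by the mean of the
    training vectors that currently select it (Lloyd iteration).
    """
    adapted = [list(v) for v in codebook]  # leave the generic codebook untouched
    for _ in range(iterations):
        groups = [[] for _ in adapted]
        for vector in training_vectors:
            groups[nearest_index(adapted, vector)].append(vector)
        for i, group in enumerate(groups):
            if group:  # a code vector that nothing selects is kept as-is
                adapted[i] = [sum(column) / len(group) for column in zip(*group)]
    return adapted
```

Starting from a generic codebook and feeding in vectors taken from one singer, the adapted codebook never has a larger mean residual on that training data than the starting codebook, which is the property the paragraph relies on.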

[0025] In the configuration shown in FIG. 1, the tone synthesis unit 16 and the voice synthesis unit 15 may instead be connected in cascade as shown in FIG. 2. In the configuration of FIG. 2, the output of the MIDI decoder 14 is supplied to the tone synthesis unit 16, which generates a tone signal based on the tone data. This tone signal is supplied to the voice synthesis unit 15 together with the vocal data, which passes through the tone synthesis unit 16. The vocal data is decoded using the distributed vocal-specific codebook, mixed with the tone signal, and output. Although not shown, the mixed signal is amplified by the amplification unit 18 and emitted from the output unit 20 in the same manner as described above.

[0026] The vocal data may be assigned to a specific MIDI channel and distributed together with the tone data in MIDI format, or the vocal data may be distributed in CELP format and the tone data in MIDI format. As an example, FIG. 3 shows the data format used when vocal data is inserted into MIDI-format tone data for transmission. As shown in FIG. 3, flags are inserted before and after the voice parameters constituting the vocal data to mark the start and end of the voice parameters within the MIDI data. The voice parameters consist of pitch L, index I, and reflection coefficient γ, and, as illustrated, each parameter is preceded by a flag identifying that parameter as a header. Because the reflection coefficient γ involves a large amount of data, its bit count may be reduced by vector quantization. To vector-quantize the reflection coefficient γ, the encoder is provided with a reflection coefficient codebook consisting of a plurality of representative vectors; the reflection coefficient γ output from the reflection coefficient analysis unit is replaced with one of the representative vectors, and a reflection coefficient index identifying that representative vector is transmitted as the parameter. The reflection coefficient codebook of the encoder is distributed together with the vocal data so that the voice synthesis unit on the decoder side holds a reflection coefficient codebook with identical contents; reading out the reflection coefficient γ designated by the transmitted reflection coefficient index then reproduces the reflection coefficient γ. This reduces the amount of voice parameter data.
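The framing idea of FIG. 3 (start and end flags around the voice parameters, plus a header flag on each parameter) can be sketched as follows. The flag byte values, the one-byte field width, and the function names are all hypothetical; the specification states only that such flags exist.

```python
# Hypothetical one-byte flag values; FIG. 3 only states that flags are present.
FLAG_START, FLAG_END = 0x7E, 0x7F
FLAG_PITCH, FLAG_INDEX, FLAG_REFLECTION = 0x01, 0x02, 0x03

_NAMES = {FLAG_PITCH: "pitch", FLAG_INDEX: "index", FLAG_REFLECTION: "reflection_index"}


def pack_voice_parameters(pitch, index, reflection_index):
    """Wrap one set of voice parameters in start/end flags, one header flag per field."""
    for value in (pitch, index, reflection_index):
        if not 0 <= value <= 0xFF:
            raise ValueError("this sketch assumes one-byte parameter values")
    return bytes([
        FLAG_START,
        FLAG_PITCH, pitch,
        FLAG_INDEX, index,
        FLAG_REFLECTION, reflection_index,
        FLAG_END,
    ])


def parse_voice_parameters(frame):
    """Recover the parameters from a frame built by pack_voice_parameters."""
    if len(frame) < 2 or frame[0] != FLAG_START or frame[-1] != FLAG_END:
        raise ValueError("missing start or end flag")
    body = frame[1:-1]
    if len(body) % 2:
        raise ValueError("truncated parameter")
    params = {}
    for pos in range(0, len(body), 2):
        flag, value = body[pos], body[pos + 1]
        if flag not in _NAMES:
            raise ValueError("unknown parameter flag")
        params[_NAMES[flag]] = value
    return params
```

A real format would also have to keep parameter bytes from being mistaken for flags within the surrounding MIDI stream; this sketch sidesteps that by relying on fixed positions inside the frame.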

[0027] The vocal data and the data of the codebook specific to that vocal may be supplied together at the initial download, or the data may be supplied as needed while the voice synthesis unit 15 is decoding. Furthermore, the amount of voice parameters in the distributed vocal data may be selected as appropriate according to whether the required vocal sound quality is telephone quality, better than telephone quality, or hi-fi quality.

[0028] Next, FIG. 4 shows an example of the configuration of the speech synthesizer according to the embodiment of the present invention. The configuration shown in FIG. 4 can be used as the voice synthesis unit 15 in the mobile phone 1 shown in FIG. 1; it can of course decode vocal data, and it can also decode voice data during a call. In FIG. 4, the voice parameters read from the storage means 13 are separated by the data processing unit 30 into the parameters index I, pitch L, and reflection coefficient γ. The pitch L parameter is routed to the pitch L generation unit 32, the index I parameter to the codebook 31, and the reflection coefficient γ parameter to the throat approximation filter 34. The ROM of the codebook 31 stores the call codebook, common to mobile phones 1, that is used to decode voice data; the RAM of the codebook 31 stores the vocal-specific codebook used to decode vocal data when that codebook is distributed.
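The split between a fixed call codebook in ROM and a downloadable vocal codebook in RAM can be pictured with the sketch below. The class name `Codebook31`, its method names, and the dictionary layout used by `split_parameters` are illustrative assumptions, not interfaces defined in the specification.

```python
class Codebook31:
    """Sketch of codebook 31: a fixed call codebook (ROM) plus a loadable vocal codebook (RAM)."""

    def __init__(self, call_codebook):
        # ROM contents are fixed, so they are stored as immutable tuples.
        self._rom = tuple(tuple(v) for v in call_codebook)
        self._ram = None  # filled only when a vocal-specific codebook is distributed

    def load_vocal_codebook(self, vocal_codebook):
        """Store a distributed vocal-specific codebook in RAM."""
        self._ram = [list(v) for v in vocal_codebook]

    def lookup(self, index, vocal=False):
        """Return the code vector for index I from RAM (vocal data) or ROM (call voice)."""
        if vocal:
            if self._ram is None:
                raise LookupError("no vocal-specific codebook has been distributed")
            return list(self._ram[index])
        return list(self._rom[index])


def split_parameters(frame):
    """Stand-in for data processing unit 30: separate pitch L, index I, reflection coefficients."""
    return frame["pitch"], frame["index"], frame["reflection"]
```

Loading a vocal codebook leaves the ROM lookup unchanged, which mirrors the statement that call decoding keeps using the common codebook.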

[0029] In decoding, the pitch L generation unit 32 generates a decoded signal at the frequency of the voice based on the supplied pitch L parameter and supplies it to the source waveform reproduction unit 33. The source waveform reproduction unit 33 is also supplied with the data of the code vector designated by index I, read from the codebook 31, and reproduces a synthesized source waveform by combining this data with the pitch L decoded signal. The synthesized source waveform output from the source waveform reproduction unit 33 resembles the waveform produced by vibration of the human vocal cords. This synthesized source waveform is filtered by the throat approximation filter 34, whose filter coefficients are controlled by the reflection coefficient γ parameter, to produce synthesized speech. The throat approximation filter 34 reproduces the transfer function of the human throat and mouth; it stores in advance the reflection coefficients γ supplied from the data processing unit 30 and supplies them to its individual filters when needed. The synthesized speech output from the throat approximation filter 34 is supplied to the spectral filter 35, which removes unnaturalness from the speech.
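The decoding chain above can be sketched under common CELP-style assumptions. The specification does not give the arithmetic, so the following are assumptions: the pitch component is modeled as the past excitation delayed by the pitch lag and scaled by a gain, and the throat approximation filter is modeled as an all-pole lattice filter driven by reflection coefficients (sign conventions for such lattices vary). The spectral filter 35 is omitted, and all function names are hypothetical.

```python
def build_excitation(code_vector, pitch_lag, history=(), pitch_gain=0.5):
    """Combine a code vector with a pitch-periodic component.

    Each sample is the code vector sample plus a scaled copy of the
    excitation `pitch_lag` samples earlier (zero if not yet available).
    """
    buffer = list(history)
    out = []
    for sample in code_vector:
        past = buffer[-pitch_lag] if len(buffer) >= pitch_lag else 0.0
        value = sample + pitch_gain * past
        buffer.append(value)
        out.append(value)
    return out


def lattice_synthesis(excitation, reflection):
    """All-pole lattice filter controlled by reflection coefficients k_1..k_M.

    Recursion per sample, for m = M down to 1:
        f_{m-1}(n) = f_m(n) - k_m * b_{m-1}(n-1)
        b_m(n)     = b_{m-1}(n-1) + k_m * f_{m-1}(n)
    with f_M(n) the input and the output y(n) = f_0(n) = b_0(n).
    """
    order = len(reflection)
    backward = [0.0] * (order + 1)  # backward[m] holds b_m(n-1)
    out = []
    for x in excitation:
        forward = x
        updated = [0.0] * (order + 1)
        for m in range(order, 0, -1):
            forward = forward - reflection[m - 1] * backward[m - 1]
            updated[m] = backward[m - 1] + reflection[m - 1] * forward
        updated[0] = forward
        backward = updated
        out.append(forward)
    return out
```

With a single coefficient the filter reduces to y(n) = x(n) - k1 * y(n-1), and the filter is stable as long as every reflection coefficient has magnitude below 1, which is one reason reflection coefficients are a convenient parameter to transmit.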

[0030] When vocal data is decoded, the code vector is read by index I from the vocal-specific codebook stored in the RAM, so realistic, high-quality vocal sound is output from the spectral filter 35. When call voice data is decoded, the code vector is read by index I from the common codebook stored in the ROM, so voice of normal quality is output from the spectral filter 35. Call voice data is supplied directly from the transceiver unit 11 to the data processing unit 30, without passing through the storage means 13, and is decoded in real time.

[0031] When the reflection coefficient γ in the vocal data has been vector-quantized and a reflection coefficient index is used as the parameter, the reflection coefficient codebook used to vector-quantize the reflection coefficient γ is also distributed along with the vocal data. The speech synthesizer stores the distributed reflection coefficient codebook, and the reflection coefficient γ designated by the reflection coefficient index separated by the data processing unit 30 is read from the reflection coefficient codebook to reproduce the reflection coefficient γ. The reproduced reflection coefficient γ is supplied to the throat approximation filter 34, and the decoding operation described above is performed.
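Vector quantization of the reflection coefficients amounts to a nearest-neighbor search at the encoder and a table lookup at the decoder. The sketch below assumes squared error as the distance measure; the function names and the example codebook values are hypothetical.

```python
def quantize_reflection(gamma, codebook):
    """Encoder side: index of the representative vector closest to `gamma`."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((g - r) ** 2 for g, r in zip(gamma, codebook[i])),
    )


def restore_reflection(index, codebook):
    """Decoder side: look up the representative vector for a received index."""
    return list(codebook[index])


# Hypothetical reflection coefficient codebook with three representative vectors.
reflection_codebook = [[0.9, -0.5], [0.2, 0.1], [-0.7, 0.3]]
sent_index = quantize_reflection([0.25, 0.0], reflection_codebook)
restored = restore_reflection(sent_index, reflection_codebook)
```

Only `sent_index` crosses the channel, which is why both sides must hold reflection coefficient codebooks with identical contents.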

[0032] The factors that determine sound quality include the number of dimensions of the code vectors used in vector quantization, the number of distinct code vectors, and the order of the throat approximation filter, that is, the number of reflection coefficient γ parameters. More realistic, higher-quality vocal sound can be reproduced by increasing the number of distinct code vectors in the vocal-specific codebook (which increases the number of bits of index I) and by increasing the number of reflection coefficient γ parameters. Because the effect of these parameters depends on the character of the voice, the memory that stores index I and the reflection coefficients γ may be shared so that memory is used efficiently: index I is enlarged when increasing the number of code vectors yields higher-quality reproduction, and the number of reflection coefficient γ parameters is increased when raising the order of the throat approximation filter yields higher-quality reproduction.
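The trade-off between codebook size and filter order can be made concrete with a little arithmetic. In the sketch below, the bit widths for the pitch parameter and for each reflection coefficient are hypothetical defaults chosen only for illustration; the specification gives no such figures.

```python
def index_bits(num_code_vectors):
    """Bits needed for index I to address every code vector."""
    if num_code_vectors < 1:
        raise ValueError("a codebook needs at least one code vector")
    return (num_code_vectors - 1).bit_length()


def frame_bits(num_code_vectors, num_reflection_coeffs, bits_per_coeff=6, pitch_bits=7):
    """Bits per frame for pitch L, index I, and unquantized reflection coefficients.

    bits_per_coeff and pitch_bits are illustrative assumptions.
    """
    return pitch_bits + index_bits(num_code_vectors) + num_reflection_coeffs * bits_per_coeff
```

Under these assumed widths, doubling the codebook from 512 to 1024 code vectors costs one extra bit per frame, while each additional reflection coefficient costs `bits_per_coeff` bits, so a shared memory budget can be spent on whichever helps a given voice more.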

[0033] When making a call to another telephone, the codebook specific to one's own voice is transmitted before the call and downloaded to the other party's telephone, while the codebook specific to the other party's voice, transmitted by the other party, is downloaded to one's own telephone. That is, each telephone holds a first codebook for vector-quantizing the user's own voice and a second codebook for decoding the other party's vector-quantized voice data. Each party's voice data can then be decoded using the codebook specific to that voice, enabling realistic, high-quality calls. Alternatively, instead of the codebook specific to one's own voice, a codebook specific to a famous singer that has been distributed and stored in the RAM may be selected, transmitted, and downloaded to the other party's telephone. One's own voice is then heard on the other party's telephone as the voice of the famous singer, making for a playful call.
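The codebook exchange before a call can be sketched with two small objects. The class name `Phone` and its methods are hypothetical, and the transfer is modeled as a simple copy rather than an actual radio transmission.

```python
class Phone:
    """Sketch of a telephone holding a first (own) and second (peer) codebook."""

    def __init__(self, own_codebook):
        self.own_codebook = [list(v) for v in own_codebook]  # first codebook
        self.peer_codebook = None                            # second codebook

    def connect(self, other):
        """Exchange voice-specific codebooks before the call starts."""
        self.peer_codebook = [list(v) for v in other.own_codebook]
        other.peer_codebook = [list(v) for v in self.own_codebook]

    def encode(self, vector):
        """Vector-quantize own voice with the first codebook; returns the index to send."""
        book = self.own_codebook
        return min(
            range(len(book)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(vector, book[i])),
        )

    def decode(self, index):
        """Decode a received index with the second (peer) codebook."""
        if self.peer_codebook is None:
            raise RuntimeError("codebooks have not been exchanged yet")
        return list(self.peer_codebook[index])
```

The famous-singer variation corresponds to sending a different codebook from the one used in `encode`: the received indexes then select the singer's code vectors on the other telephone.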

[0034] Next, FIG. 5 shows another configuration example of the speech synthesizer according to the embodiment of the present invention in the mobile phone 1 shown in FIG. 1. The configuration shown in FIG. 5 is a voice synthesis unit for two vocals and has parallel signal paths so that the two sets of vocal data can be decoded simultaneously. In FIG. 5, the voice parameters for two vocals read from the storage means 13 are separated by the data processing unit 30 into indexes Ia and Ib, pitches La and Lb, and reflection coefficients γa and γb. The pitch La parameter is routed to the pitch L generation unit 32a, the pitch Lb parameter to the pitch L generation unit 32b, the index Ia and Ib parameters to the codebook 31, the reflection coefficient γa parameter to the throat approximation filter 34a, and the reflection coefficient γb parameter to the throat approximation filter 34b. The ROM of the codebook 31 stores the call codebook, common to mobile phones 1, that is used to decode voice data; the RAM of the codebook 31 stores the codebooks for two vocals, used to decode the vocal data for two vocals, when they are distributed.

[0035] When the two-vocal data is decoded, the pitch L generation unit 32a generates a decoded signal at the frequency of the first vocal based on the supplied pitch La parameter and supplies it to the source waveform reproduction unit 33a. At the same time, the pitch L generation unit 32b generates a decoded signal at the frequency of the second vocal based on the supplied pitch Lb parameter and supplies it to the source waveform reproduction unit 33b. The source waveform reproduction unit 33a is supplied with the data of the code vector designated by index Ia, read from the codebook specific to the first vocal stored in the RAM of the codebook 31, and reproduces the synthesized source waveform of the first vocal by combining this data with the pitch La decoded signal. At the same time, the source waveform reproduction unit 33b is supplied with the data of the code vector designated by index Ib, read from the codebook specific to the second vocal stored in the RAM of the codebook 31, and reproduces the synthesized source waveform of the second vocal by combining this data with the pitch Lb decoded signal.

[0036] The synthesized source waveform of the first vocal is filtered by the throat approximation filter 34a, whose filter coefficients are controlled by the reflection coefficient γa parameter, to produce the synthesized speech of the first vocal. At the same time, the synthesized source waveform of the second vocal is filtered by the throat approximation filter 34b, whose filter coefficients are controlled by the reflection coefficient γb parameter, to produce the synthesized speech of the second vocal. The throat approximation filter 34a (34b) reproduces the transfer function of the human throat and mouth; it stores in advance the reflection coefficients γa (γb) supplied from the data processing unit 30 and supplies them to its individual filters when needed. The synthesized speech of the first vocal output from the throat approximation filter 34a is supplied to the spectral filter 35a, which removes unnaturalness from the first vocal sound. At the same time, the synthesized speech of the second vocal output from the throat approximation filter 34b is supplied to the spectral filter 35b, which removes unnaturalness from the second vocal sound.

[0037] As described above, when the first and second vocal data are decoded, the code vectors are read by indexes Ia and Ib from the respective vocal-specific codebooks for two vocals stored in the RAM, so realistic, high-quality vocal sounds for the two vocals are output from the spectral filters 35a and 35b, respectively. When call voice data is decoded, only one of the parallel signal paths is used, and the code vector is read by index I from the call codebook stored in the ROM and decoded. In this case, voice of normal quality is output from the spectral filter 35a (35b). Call voice data is supplied directly from the transceiver unit 11 to the data processing unit 30, without passing through the storage means 13, and is decoded in real time.
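The codebook selection for the two-vocal case and for a call can be summarized in a short sketch. The function name `select_code_vectors` and the list-based layout of the RAM codebooks are assumptions; the pitch, throat filter, and spectral filter stages of each path are left out.

```python
def select_code_vectors(indexes, ram_codebooks, rom_codebook, call=False):
    """Pick one code vector per active path.

    Vocal playback: path n reads index n from the n-th vocal-specific codebook in RAM.
    Call: only the first path is used, with the common call codebook in ROM.
    """
    if call:
        return [list(rom_codebook[indexes[0]])]
    if len(indexes) != len(ram_codebooks):
        raise ValueError("one index is needed for each vocal-specific codebook")
    return [list(book[i]) for i, book in zip(indexes, ram_codebooks)]
```

Because the function simply iterates over however many codebooks are supplied, the same sketch covers three or more vocals by adding entries to `ram_codebooks`.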

[0038] When the reflection coefficients γa and γb in the two-vocal data have been vector-quantized and the respective reflection coefficient indexes are used as the parameters, the reflection coefficient codebooks used to vector-quantize the reflection coefficients γa and γb are also distributed along with the vocal data. The speech synthesizer stores each distributed reflection coefficient codebook, and the reflection coefficients γa and γb designated by the respective reflection coefficient indexes separated by the data processing unit 30 are read from the corresponding reflection coefficient codebooks to reproduce the reflection coefficients γa and γb. The reproduced reflection coefficients γa and γb are supplied to the throat approximation filters 34a and 34b, and the decoding operation described above is performed. Here, the reflection coefficient codebooks for the two vocals are each tailored to the corresponding vocal.

[0039] With two vocals, the first vocal can be set as the lead vocal and the second as the backing chorus. The number of vocals is not limited to two and may be three or more; in that case, as many parallel signal paths as there are vocals are provided. Instead of distributing a codebook specific to each vocal, a common codebook for chorus may be distributed and used to decode the plurality of vocals.

[0040] The tone generation method of the tone synthesis unit 16 built into the mobile phone 1 may be an FM tone generator, a waveform memory (PCM) tone generator, a physical model tone generator, or the like, and the tone synthesis unit 16 may be implemented as a hardware tone generator using a DSP or the like or as a software tone generator that executes a tone generator program. Although the musical tone has been described as being output as accompaniment to the vocal sound, the vocal sound alone may be reproduced without the musical tone. Furthermore, although the tone data has been described as being distributed as MIDI data, the present invention is not limited to this, and the tone data may be distributed in another data format.

【００４１】[0041]

[Effects of the Invention] As described above, the present invention distributes vocal data compression-encoded by an analysis-synthesis encoding method that performs vector quantization using a codebook, together with the codebook used in the compression encoding. The compression-encoded vocal data can therefore be decoded with reference to the codebook specific to that vocal, so high-quality, realistic vocal sound can be reproduced. When tone data is also distributed, a musical tone can be synthesized from the transmitted tone data and the vocal data can be reproduced in synchronization with this musical tone and output together with it, so a musical tone accompanied by high-quality vocal sound can be reproduced. Because the call signal is decoded using a codebook stored in advance, the sound quality of the decoded call signal does not deteriorate even when a codebook for vocal data is distributed. Furthermore, when the distributed vocal data consists of vocal data of a plurality of systems, distributing the codebooks of the plurality of systems together with it allows the vocal data of each system to be decoded with high quality.

[Brief description of the drawings]

FIG. 1 is a diagram for describing an embodiment in which the speech synthesizer of the present invention is applied to a mobile phone.

FIG. 2 is a diagram showing a modification of the configuration in which the speech synthesizer of the present invention is applied to a mobile phone.

FIG. 3 is a diagram showing an example of the data format used in the speech synthesizer according to the embodiment of the present invention when vocal data is inserted into MIDI-format tone data for transmission.

FIG. 4 is a diagram showing an example of the configuration of the speech synthesizer according to the embodiment of the present invention.

FIG. 5 is a diagram showing another configuration of the speech synthesizer according to the embodiment of the present invention.

FIG. 6 is a diagram showing a typical configuration of a conventional CELP-type encoder capable of compression-encoding voice information with high efficiency.

FIG. 7 is a diagram showing a typical configuration of a conventional CELP-type decoder that decodes voice information compression-encoded with high efficiency.

[Explanation of symbols]

1: mobile phone; 2: base station; 2a: antenna; 10: antenna; 11: transceiver unit; 12: telephone function unit; 13: storage means; 14: MIDI decoder; 15: voice synthesis unit; 16: tone synthesis unit; 17: mixer; 18: amplification unit; 19: microphone; 20: output unit; 30: data processing unit; 31: codebook; 32, 32a, 32b: pitch L generation unit; 33, 33a, 33b: source waveform reproduction unit; 34, 34a, 34b: throat approximation filter; 35, 35a, 35b: spectral filter; 101: long-term predictor; 102: source sound codebook; 103: synthesis unit; 104: reflection coefficient analysis unit; 105: throat approximation filter unit; 106: subtractor; 130: data processing unit; 131: codebook; 132: pitch L generation unit; 133: source waveform reproduction unit; 134: throat approximation filter; 135: spectral filter

Claims

[Claims]

1. A speech synthesizer comprising: storage means for storing at least distributed vocal data, compression-encoded by an analysis-synthesis encoding method that performs vector quantization using a codebook, and the codebook used in the compression encoding; and speech synthesis means for synthesizing vocal sound by decoding the vocal data read from the storage means using the codebook stored in the storage means.

2. The speech synthesizer according to claim 1, wherein the storage means also stores distributed tone data, the speech synthesizer further comprises tone synthesis means for generating a musical tone from the tone data read from the storage means, and the musical tone generated by the tone synthesis means and the vocal sound are synchronized.

3. The speech synthesizer according to claim 1, wherein a second codebook is stored in advance, and the second codebook is used when a compression-encoded speech signal is decoded.

4. The speech synthesizer according to claim 1, wherein vocal data of a plurality of systems, compression-encoded by an analysis-synthesis encoding method that performs vector quantization using a codebook unique to each vocal, and the codebooks unique to the respective vocals are distributed and stored in the storage means, and the speech synthesis means synthesizes a plurality of vocal sounds by decoding the vocal data of the plurality of systems in parallel using the respective vocal-specific codebooks stored in the storage means.