JPH08263090A

JPH08263090A - Synthesis unit accumulating method and synthesis unit dictionary device

Info

Publication number: JPH08263090A
Application number: JP7060963A
Authority: JP
Inventors: Takao Koyama; 貴夫小山; Noriya Murakami; 憲也村上; Ayanori Yoshitani; 文徳吉谷
Original assignee: N T T DATA TSUSHIN KK; NTT Data Communications Systems Corp
Current assignee: N T T DATA TSUSHIN KK; NTT Data Corp
Priority date: 1995-03-20
Filing date: 1995-03-20
Publication date: 1996-10-11

Abstract

PURPOSE: To provide a synthesis unit dictionary device that improves retrieval efficiency when selecting the synthesis units used in speech synthesis and that stores local peak position information efficiently. CONSTITUTION: A synthesis unit dictionary device 2 comprises: an accent information file 25 that stores, for each synthesis unit, accent relative position information expressing how many moras the starting syllable of that synthesis unit lies from the accent position of the phrase (bunsetsu) from which the unit was taken; synthesis unit selecting means 21 and 22 that use the accent relative position information to select candidate synthesis units during retrieval; and a waveform modification information file 26 in which, of the voiced sound section information and the unvoiced/silent section information defined by the intervals between local peaks, the voiced sound section information is stored in a relatively small storage unit and the unvoiced/silent section information is stored in a relatively large storage unit, with an identification code of the storage unit attached to a predetermined part of each storage unit.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech synthesis unit dictionary device used for rule-based speech synthesis.

[0002]

[Prior Art] A conventional synthesis unit dictionary device used for rule-based speech synthesis stores a plurality of synthesis units cut out from speech waveforms, together with prosodic information for each synthesis unit (pitch frequency, duration, section power, and so on) and information on the phonemic environment preceding and following each unit. When speech is synthesized, candidate synthesis units are retrieved using the prosodic information and phonemic environment as keys, the parameter difference between each retrieved candidate and the target prosodic pattern is evaluated with a predetermined evaluation formula, and a specific synthesis unit is chosen.

[0003] To generate more natural synthesized speech, the prosody of each synthesis unit to be connected is modified so that no mismatch in the prosodic pattern occurs at the connection points. One known method of modifying the prosody is the "pitch-synchronous waveform overlap method using local peaks of the speech waveform" described in Hirokawa and Hakoda, "A Study of Pitch Control in Waveform-Editing Rule-Based Synthesis," Proceedings of the Acoustical Society of Japan, March 1990, 1-4-7.

[0004]

[Problems to Be Solved by the Invention] In a conventional synthesis unit dictionary device, however, the range that must be searched grows as the number of stored synthesis units increases. Moreover, when the prosodic pattern is used as the evaluation criterion, many synthesis units turn out to be numerically similar, which makes it hard to rank them. One countermeasure would be to add the phonemic environment of several preceding and following phonemes to the evaluation criteria alongside the parameters above, but doing so degrades the search efficiency of the speech synthesis unit dictionary device.

[0005] Furthermore, to apply the "pitch-synchronous waveform overlap method using local peaks" in real time in order to keep the synthesized speech natural, local peak position information must be extracted from the speech waveform beforehand and stored. As the number of stored synthesis units grows, however, the amount of local peak position information that must be stored grows in proportion.

[0006] In view of these problems, an object of the present invention is to provide a method that improves search efficiency when selecting the synthesis unit closest to the prosodic pattern of natural speech and that stores local peak position information efficiently in a small capacity, and to provide a synthesis unit dictionary device created by this method.

[0007]

[Means for Solving the Problems] The present invention first provides a synthesis unit accumulating method. When a plurality of synthesis units to be used for speech synthesis are stored in a dictionary device, this method stores, in association with each synthesis unit, accent relative position information indicating how many moras (a mora being a timing unit of syllables; the same applies hereinafter) the starting syllable of that synthesis unit lies from the accent position of the phrase (bunsetsu) from which the unit was taken.

[0008] When local peak position information used for prosody control of the synthesis units is stored as well, the method assigns different storage units to voiced sound section information and to unvoiced/silent section information, both of which are determined by the intervals between local peaks (a local peak being a point of maximum amplitude in the pitch structure; the same applies hereinafter), and attaches an identification code of the storage unit to a predetermined part of each assigned storage unit. The voiced sound section information and the unvoiced/silent section information are stored, for example, as the number of samples between local peaks in the PCM data obtained by PCM-encoding a synthesis unit cut out from a speech waveform.

[0009] The present invention also provides a synthesis unit dictionary device obtained by carrying out the above method. This device, which stores a plurality of synthesis units to be synthesized, has an accent information file that stores, in association with each synthesis unit, accent relative position information indicating how many moras the starting syllable of that unit lies from the accent position of the phrase from which the unit was taken, and synthesis unit selecting means that selects candidate synthesis units at search time using the accent relative position information stored in the accent information file.

[0010] A synthesis unit dictionary device according to another configuration of the present invention, which likewise stores a plurality of synthesis units to be synthesized, is characterized by a local peak position information file in which, of the voiced sound section information and the unvoiced/silent section information determined by the intervals between local peaks, the voiced sound section information is stored in a relatively small storage unit and the unvoiced/silent section information in a relatively large storage unit, with an identification code of the storage unit attached to a predetermined part of each storage unit.

[0011]

[Operation] In general, a sentence is divided into several utterance units, and each utterance unit is represented by a model in which a phrase-specific pitch frequency, that is, an accent component, is superimposed on a characteristic whose frequency falls at a constant slope (the speech tone component). Each model is described by the arrangement of high and low tones over the moras (the pitch pattern), and pitch patterns in Japanese tend to take roughly the shape of the kana へ (an inverted V) with the accent position at the apex. Consequently, if the accent type, that is, the position of the accent relative to the synthesis unit, is known, the rough shape of the pitch pattern of the phrase from which the unit was taken can be inferred. There is also a correlation between the number of moras in a phrase and phoneme duration: the more moras a phrase contains, the shorter each phoneme tends to be. In other words, the mora count and the accent relative position roughly reflect the pitch and duration that serve as criteria in selecting a synthesis unit, so using accent relative position information that carries these values enables a search that takes the duration and pitch shape of the synthesis units roughly into account. Because there is then no need to examine an unnecessarily long span of preceding and following phonemic environment, the search range for selecting a synthesis unit can be narrowed and search efficiency improves.

[0012] When local peak position information is stored, storing voiced sound section information in relatively small storage units and unvoiced/silent section information in large storage units reduces the required dictionary capacity compared with storing everything in a uniform storage unit. The absolute position of a local peak is obtained by summing the section information entries, the size (data length) of each entry being derived from the identification code of its storage unit. The positions and intervals of local peaks can therefore be derived easily and quickly.

[0013]

[Embodiments] A preferred embodiment of the synthesis unit dictionary device of the present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram of a rule-based speech synthesizer according to one embodiment of the present invention. The overall processing of this rule-based speech synthesizer 1 is first outlined with reference to FIG. 1.

[0014] To synthesize speech with the rule-based speech synthesizer 1, the character string to be synthesized, its accent type, and its prosodic pattern are supplied to the input terminal IN and passed to the synthesis unit search unit 3. The synthesis unit search unit 3 converts the input into the key information used for searching speech synthesis units and, based on that key information, retrieves from the synthesis unit dictionary device 2 the candidate segments of the speech synthesis units to be synthesized, together with the prosodic information and accent relative position information of each candidate segment. The synthesis unit search unit 3 then narrows the retrieved candidates down to one by applying a suitable evaluation formula to their prosodic information and accent relative position information, and obtains the acoustic parameters and waveform modification information of the selected segment. All synthesis units needed for synthesis are retrieved in this way. The synthesis unit connection unit 4 connects the retrieved synthesis units smoothly to prevent noise and the like from being introduced, modifies them so that the prosody approaches the target shape, and sends the result to the synthesized speech output unit 5. The synthesized speech output unit 5 converts the acoustic parameters received from the synthesis unit connection unit 4 into a speech signal and outputs it through a loudspeaker or the like.

[0015] Next, the synthesis unit dictionary device 2 of this embodiment is described. As shown in FIG. 2, the synthesis unit dictionary device 2 consists of a synthesis unit management unit 21 connected to the synthesis unit search unit 3 for bidirectional access, and of a synthesis unit search table 22, an acoustic parameter file 23, a prosody information file 24, an accent information file 25, and a waveform modification information file 26, each connected to the synthesis unit management unit 21 for bidirectional access.

[0016] The synthesis unit management unit 21 stores the information to be registered efficiently and serves as an interface that supplies information to other processing units on request. The synthesis unit search table 22 is a table for finding a desired synthesis unit quickly. The acoustic parameter file 23 is an area that stores the acoustic parameters used in generating synthesized speech. The prosody information file 24 is an area that stores, for each synthesis unit, information such as its average pitch frequency, duration, section power, and the types of its adjacent phonemes. Because this embodiment stores speech waveforms digitized by PCM (pulse code modulation), the parameters above correspond to that digital data (PCM data).

[0017] The accent information file 25 stores, for each synthesis unit, accent relative position information indicating how many moras the phrase from which the unit was cut contains and how many moras the unit lies from the accent position of that phrase.

[0018] The accent information file 25 is described in detail with reference to FIGS. 3 and 4. As noted above, if the position of the accent relative to a synthesis unit is known, the rough shape of the pitch pattern of the phrase from which the unit was taken can be inferred. For a mora count of three, for example, FIGS. 3(a) to 3(d) show the four accent types: 3-mora type 0 (flat), 3-mora type 1 (accent on the first mora), 3-mora type 2 (accent on the middle mora), and 3-mora type 3 (accent on the last mora). In each case, knowing the accent position relative to the head of a synthesis unit reveals the pitch pattern of the phrase from which the unit was taken. In this embodiment, the accent relative position is therefore defined as '0' when the head position coincides with the accent position, '-1' when it is one syllable before, and '+1' when it is one syllable after. For the flat type, which has no accent, the second mora is treated as the accent position for convenience. This information is attached to every synthesis unit. FIG. 4 shows an example table image 401 of the accent information file 25 created in this way.
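The definition in paragraph [0018] can be illustrated with a minimal Python sketch. The function name, the 1-based mora numbering, and the sign convention (negative when the accent precedes the start of the unit) are assumptions made for illustration; the text does not fix the direction of the sign unambiguously.

```python
def accent_relative_position(unit_start_mora: int, accent_type: int) -> int:
    """Return the accent position relative to the first mora of a synthesis unit.

    Moras are numbered from 1. accent_type follows the "type N" naming in the
    text: N > 0 means the accent falls on mora N, and 0 means the flat type.
    """
    # The flat type has no accent; the text treats the second mora as the accent.
    accent_mora = 2 if accent_type == 0 else accent_type
    # Assumed sign convention: negative when the accent precedes the unit start.
    return accent_mora - unit_start_mora


# Relative positions for units starting at mora 1, 2 and 3 of a 3-mora phrase.
for accent_type in range(4):
    row = [accent_relative_position(start, accent_type) for start in (1, 2, 3)]
    print(f"3-mora type {accent_type}: {row}")
```

Under these assumptions, a unit that starts on the accented mora always receives the value 0, regardless of the accent type of the phrase.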

[0019] In FIG. 4, the synthesis unit number is a value that the rule-based speech synthesizer 1 of this embodiment assigns uniquely to each synthesis unit, and each number is associated with exactly one mora count and accent relative position. The mora count and accent relative position can therefore be used when searching for a synthesis unit number, which also gives a rough indication of the pitch and duration that serve as criteria in selecting a synthesis unit.
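As a rough illustration of how such a table might narrow the search, the following Python sketch filters synthesis unit numbers by mora count and accent relative position. The record layout, the table rows, and the function name are hypothetical; the actual contents of FIG. 4 are not reproduced here.

```python
from typing import NamedTuple


class AccentInfo(NamedTuple):
    unit_number: int    # unique synthesis unit number
    mora_count: int     # number of moras in the phrase the unit was cut from
    relative_pos: int   # accent relative position of the unit


# Hypothetical rows, for illustration only.
ACCENT_TABLE = [
    AccentInfo(1, 3, 0),
    AccentInfo(2, 3, -1),
    AccentInfo(3, 4, 0),
    AccentInfo(4, 3, 0),
]


def select_candidates(table, mora_count, relative_pos):
    """Return the unit numbers whose accent information matches the target."""
    return [row.unit_number for row in table
            if row.mora_count == mora_count and row.relative_pos == relative_pos]


print(select_candidates(ACCENT_TABLE, mora_count=3, relative_pos=0))
```

Only the surviving candidates would then need to be scored with the evaluation formula, which is the narrowing effect the paragraph describes.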

[0020] Next, the waveform modification information file 26 is described in detail with reference to FIGS. 5 and 6. In a voiced section the speech waveform is periodic; this periodicity is generally called the pitch structure. As FIG. 5 shows, a large-amplitude peak (502, 503, 504) appears periodically, once per pitch period. These peaks are called local peaks. The pitch period of speech is distributed over roughly 2.5 ms to 10 ms; the shorter the period, the higher the voice sounds, and the longer the period, the lower it sounds. The pitch of the voice can therefore be controlled by changing the pitch period of the original speech. The pitch intervals in an actual speech waveform are shown in the lower part of FIG. 5; the illustrated example shows intervals of 3600 samples (505), 97 samples (506), and 92 samples (507), for speech sampled at 10 kHz. In addition, thinning out or repeating the waveform in units of pitch periods makes it possible to control the duration of a synthesis unit. The technique that performs this prosody control is the pitch-synchronous waveform overlap method mentioned above, which uses the local peak positions as markers for the individual pitch periods.
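The relation between a peak interval in samples, the pitch period, and the pitch frequency can be checked with a short Python sketch. The function names are illustrative; the 10 kHz sampling frequency and the 97- and 92-sample intervals are taken from the example in the text.

```python
SAMPLING_RATE_HZ = 10_000  # sampling frequency used in the example


def interval_to_period_ms(samples: int, rate_hz: int = SAMPLING_RATE_HZ) -> float:
    """Convert a peak interval in samples to a period in milliseconds."""
    return samples * 1000.0 / rate_hz


def interval_to_pitch_hz(samples: int, rate_hz: int = SAMPLING_RATE_HZ) -> float:
    """Convert a peak interval in samples to a pitch frequency in hertz."""
    return rate_hz / samples


for samples in (97, 92):
    period = interval_to_period_ms(samples)
    pitch = interval_to_pitch_hz(samples)
    print(f"{samples} samples: {period:.1f} ms, {pitch:.1f} Hz")
```

Both voiced intervals fall inside the stated 2.5 ms to 10 ms range, whereas the 3600-sample interval corresponds to 360 ms and so lies well outside it.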

[0021] FIG. 6 illustrates how local peak position information is stored in the waveform modification information file 26, that is, the local peak position information file. In this embodiment, local peak position information is expressed as the number of samples between peaks in the PCM data. The first entry, however, is the interval between the start of the speech data and the first peak. The absolute position of any local peak is then obtained by summing the peak intervals from the beginning. Peak intervals fall broadly into two groups: short intervals in voiced sections, and longer intervals in unvoiced or silent sections. In voiced sections the pitch period is about 2.5 ms to 10 ms, so at a sampling frequency of 10 kHz local peaks appear at intervals of 25 to 100 samples. Unvoiced sections, by contrast, generally last about 100 ms to 1 s, giving local peak intervals of about 1000 to 10000 samples.
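Recovering absolute peak positions from the stored intervals is a running sum, which the following Python sketch shows. The function name is illustrative, and the interval values are those of the FIG. 5 example, used here on the assumption that the first value is the distance from the start of the data to the first peak.

```python
from itertools import accumulate


def absolute_peak_positions(intervals):
    """Return the sample index of each local peak from the peak-to-peak intervals.

    The first interval is measured from the start of the speech data.
    """
    return list(accumulate(intervals))


print(absolute_peak_positions([3600, 97, 92]))
```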

[0022] In this embodiment, therefore, a short storage unit and a long storage unit are provided in the local peak position information file (waveform modification information file 26), and the length of the storage unit is chosen to suit each peak interval. In the example of FIG. 6, the short storage unit 601 shown in (a) is allocated 8 bits. The leading bit is a flag 603 indicating the storage unit length, with logical 0 denoting a short storage unit; the remaining 7 bits 602 hold the peak interval, so intervals of 0 to 127 samples can be represented. Since extremely short peak intervals (for example, about 0 to 2 ms) cannot occur, efficiency can be improved further by adding a constant offset to the representable values. The long storage unit 604 shown in (b) is allocated 16 bits. Here too the leading bit serves as a flag 606 indicating the storage unit length, with logical 1 as the identification code of a long storage unit. The remaining 15 bits 605 hold the peak interval, so intervals of 0 to 32767 samples can be represented.
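The two storage units can be sketched as a small Python encoder. This is an illustration under stated assumptions rather than the layout of the actual file: the leading bit is taken to be the most significant bit of the first byte, long units are written big-endian, and the optional constant offset is not applied. All names are hypothetical.

```python
SHORT_MAX = 0x7F     # largest value that fits in 7 bits
LONG_MAX = 0x7FFF    # largest value that fits in 15 bits
LONG_FLAG = 0x8000   # leading bit set to 1 marks a long storage unit


def encode_interval(samples: int) -> bytes:
    """Encode one peak interval as an 8-bit or 16-bit storage unit."""
    if not 0 <= samples <= LONG_MAX:
        raise ValueError(f"interval out of range: {samples}")
    if samples <= SHORT_MAX:
        # Short unit: flag bit 0 followed by 7 value bits.
        return bytes([samples])
    # Long unit: flag bit 1 followed by 15 value bits (big-endian, assumed).
    return (LONG_FLAG | samples).to_bytes(2, "big")


def encode_intervals(intervals) -> bytes:
    """Concatenate the storage units for a sequence of peak intervals."""
    return b"".join(encode_interval(s) for s in intervals)


print(encode_intervals([3600, 97, 92]).hex())
```

With this layout the three intervals of the FIG. 5 example occupy four bytes instead of the six that three 16-bit units would need.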

[0023] For a synthesis unit 607 in actual use, as shown in (c), the flag 608 in the leading bit is checked first to determine the length of the data to be read. The peak interval is then read, and the read position of the next datum (the leading flag of the next datum) 609 is calculated from the length 610 of the datum just read. Repeating this process reads out the whole series of local peak position information. Local peak positions are stored per synthesis unit, and the end of each synthesis unit is recognized when a predetermined terminator value is detected. The read position (address) of the start is specified by predetermined management information.
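The read procedure of paragraph [0023] can be sketched in Python as follows. The byte layout (flag in the most significant bit, big-endian long units) and the terminator value 0x00 are assumptions made for illustration; the text states only that a predetermined value marks the end of each synthesis unit.

```python
TERMINATOR = 0x00  # hypothetical end marker; the text does not give the value


def decode_intervals(data: bytes, start: int = 0) -> list:
    """Read peak intervals from `data`, beginning at byte offset `start`."""
    intervals = []
    pos = start
    while pos < len(data):
        first = data[pos]
        if first & 0x80:
            # Flag bit 1: long storage unit, 15 value bits over two bytes.
            if pos + 1 >= len(data):
                raise ValueError("truncated long storage unit")
            value = ((first & 0x7F) << 8) | data[pos + 1]
            pos += 2
        else:
            # Flag bit 0: short storage unit, 7 value bits in one byte.
            if first == TERMINATOR:
                break
            value = first
            pos += 1
        intervals.append(value)
    return intervals


print(decode_intervals(bytes.fromhex("8e10615c00")))
```

The `start` argument plays the role of the start address that the text says is supplied by management information.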

[0024] As described above, using the accent information file 25 enables a search that takes pitch and duration into account and greatly narrows the search range when candidate synthesis units are narrowed down. Because the accent information file 25 also holds the mora count, which strongly affects duration, using it as an evaluation criterion during the search brings the duration of the selected synthesis units close to that of natural speech. Because local peak position information can be retrieved quickly, prosody can be controlled in real time. Moreover, because a storage unit suited to each local peak interval is assigned, the same information can be stored in roughly 50%, or even 25%, of the capacity needed when a single uniform storage unit is used. The waveform dictionary can thus be made far smaller than before.
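One way to read the 50% and 25% figures is as a comparison against uniform 16-bit and 32-bit storage units. The Python sketch below works through that reading on a made-up interval sequence; the uniform unit sizes and the data are assumptions, not values given in the text.

```python
def variable_size_bytes(intervals) -> int:
    """Size when 7-bit values use 1 byte and larger values use 2 bytes."""
    return sum(1 if s <= 127 else 2 for s in intervals)


def uniform_size_bytes(intervals, unit_bytes: int) -> int:
    """Size when every interval uses the same storage unit."""
    return len(intervals) * unit_bytes


# Hypothetical synthesis unit: one long silent interval, then 99 voiced ones.
intervals = [3600] + [95] * 99
variable = variable_size_bytes(intervals)

for unit_bytes in (2, 4):
    uniform = uniform_size_bytes(intervals, unit_bytes)
    print(f"uniform {unit_bytes}-byte units: {uniform} bytes, "
          f"variable: {variable} bytes, ratio {variable / uniform:.1%}")
```

Because voiced intervals dominate, the ratio approaches one half against 16-bit units and one quarter against 32-bit units in this example.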

[0025] The mora counts, accent relative positions, and numbers of samples per pitch interval used in this embodiment are examples only, and the invention is not limited to these values.

[0026]

[Effects of the Invention] As is clear from the above description, the synthesis unit accumulating method of the present invention and the synthesis unit dictionary device obtained with it make it possible to indicate a rough search range when the synthesis unit closest to the prosodic pattern of natural speech is selected from the large number of synthesis units used by a waveform-based rule speech synthesizer, which greatly improves search efficiency.

[0027] In addition, the local peak position information of the speech waveform can be stored efficiently and the required local peak position information can be extracted quickly, which is particularly advantageous in applications that change, for example, the speaking rate or pitch of synthesized speech in real time.

[Brief description of drawings]

FIG. 1 is a block diagram showing an example of a speech synthesizer that uses a synthesis unit dictionary device according to an embodiment of the present invention.

FIG. 2 is a diagram showing an example block configuration of the synthesis unit dictionary device of the embodiment.

FIG. 3 is a diagram explaining the relationship between moras and the accent relative position in the embodiment.

FIG. 4 is a diagram explaining the table image of the accent information file in the embodiment.

FIG. 5 is a diagram explaining local peaks and peak-to-peak information.

FIG. 6 is a diagram explaining how local peak position information is stored in the waveform modification information file in the embodiment.

[Explanation of symbols]

1: Rule-based speech synthesizer
2: Synthesis unit dictionary device
3: Synthesis unit search unit
4: Synthesis unit connection unit
5: Synthesized speech output unit
21: Synthesis unit management unit
22: Synthesis unit search table
23: Acoustic parameter file
24: Prosody information file
25: Accent information file
26: Waveform modification information file

Claims

[Claims]

1. A synthesis unit accumulating method characterized in that, when a plurality of synthesis units to be used for speech synthesis are stored in a dictionary device, accent relative position information indicating how many moras the starting syllable of each synthesis unit lies from the accent position of the phrase (bunsetsu) from which that unit was taken is stored in association with that synthesis unit.

2. A synthesis unit accumulating method characterized in that, when synthesis units to be used for speech synthesis are stored in a dictionary device together with local peak position information used for prosody control of the synthesis units, different storage units are assigned to voiced sound section information and to unvoiced/silent section information, both determined by the intervals between local peaks, and an identification code of the storage unit is attached to a predetermined part of each assigned storage unit.

3. The synthesis unit accumulating method as described, wherein the voiced sound section information and the unvoiced/silent section information are each the number of samples between local peaks of PCM data obtained by PCM-encoding a synthesis unit cut out from a speech waveform.

4. A synthesis unit dictionary device storing a plurality of synthesis units to be synthesized, comprising: an accent information file that stores, in association with each synthesis unit, accent relative position information indicating how many moras the starting syllable of that synthesis unit lies from the accent position of the phrase (bunsetsu) from which the unit was taken; and synthesis unit selecting means that selects candidate synthesis units at search time using the accent relative position information stored in the accent information file.

5. A synthesis unit dictionary device storing a plurality of synthesis units to be synthesized, characterized by comprising a local peak position information file in which, of voiced sound section information and unvoiced/silent section information determined by the intervals between local peaks, the voiced sound section information is stored in a relatively small storage unit and the unvoiced/silent section information is stored in a relatively large storage unit, with an identification code of the storage unit attached to a predetermined part of each storage unit.