JPH113095A - Speech synthesis device

Info

Publication number: JPH113095A
Application number: JP9156489A
Authority: JP
Inventors: Kazuhiko Miyata; 和彦宮田
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1997-06-13
Filing date: 1997-06-13
Publication date: 1999-01-06

Abstract

PROBLEM TO BE SOLVED: To reduce the amount of computation at speech synthesis time and to obtain highly natural speech. SOLUTION: A speech unit information addition unit 3 attaches a speech unit vector to each frame of a natural speech pattern. An inter-frame distance calculation unit 4 calculates the distance between frames. An inter-pattern connection frame test unit 5 treats a frame pair whose distance is within a threshold as an inter-pattern connection frame pair. A transition weight calculation unit 7 connects the inter-pattern connection frame pairs, adds transition weights according to the distance, forms a speech pattern connection network, and stores it in a speech pattern connection network storage unit 8. At speech synthesis time, the optimal path on the speech pattern connection network is searched to obtain the synthesis parameters. Speech data are thus connected in frame units, which are smaller than a phoneme, and highly natural speech is obtained. Moreover, the amount of computation at synthesis time is reduced because frame transition paths are set only between acoustically connectable frames.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] This invention relates to a speech synthesizer that performs speech synthesis by appropriately editing feature spectra obtained by analyzing natural speech patterns. In particular, it relates to a speech synthesizer that achieves more natural speech output, reduces the amount of computation and stored data during speech synthesis, and adapts flexibly to the vocabulary to be generated.

[0002]

[Prior Art] One conventional text-to-speech synthesis method is the editing synthesis method, in which natural speech data obtained from a speaker is stored in a database for each speech segment, and speech is generated by joining the speech segments that fit the target sentence. In the conventional editing synthesis method, the speech data of the speech segments needed for synthesis is selected from the database and used. A database must therefore be created in which text information and speech segment data are associated, so that the speech segment data can be looked up from the text information.

[0003] That is, the natural speech data recorded from the speaker is first segmented into speech segments and labeled, using as phonetic symbols a priori segmentation information based on phoneme information: phonemes, short phoneme chains (CV, VC, VCV, and so on, where C is a consonant and V is a vowel), syllables, words, or phrases. A database is then created in advance in which the resulting speech segment interval information and label information are added to the speech data of each speech segment and registered.

[0004] In this case, no consideration is given to concatenating speech data over finer, mutually complementary intervals within the speech segment interval determined by the phonetic symbols used as the segment's label information. Moreover, the connection point between speech segments is either fixed at the end points of the segmented speech segments, or the search range and search direction (order) for the connection point are set depending on the type of phonetic symbol. The selection of speech segments and the search for connection points are performed at synthesis time, taking into account the phoneme environment and spectral continuity.

[0005]

[Problems to Be Solved by the Invention] However, text-to-speech synthesis based on the conventional editing synthesis method has the following problems. As described above, synthesis speech segment data is obtained by searching the database on the basis of label information assigned according to some linguistic or phoneme information, so the relationship between the speech segment unit and that linguistic or phoneme information must be clearly expressible. For this reason, speech segment units based on units smaller than speech units such as phonemes are not used.

[0006] In other words, the conventional editing synthesis method does not test mutual complementation or connection of speech data in units smaller than the synthesis speech segment unit, which is determined by the speech unit set by the phonetic symbols. Consequently, even if a more effective chain of feature vectors exists at the finest level, the frame level, those feature vectors cannot be evaluated at synthesis time. This means that the editing synthesis method requires a relatively large storage capacity, which is an obstacle to reducing the scale of speech synthesis systems.

[0007] Consider, for example, the synthesis speech segment creation methods that use segment clustering with a distance measure based on the feature spectra within speech segments, as shown in Nakatsuta, Hamada, et al., "Rule-based synthesis method that automatically generates synthesis units," Speech Research Group Material SP87-15, pp. 57-64, or in Ito, Nakatsuta, Hirokawa, et al., "Speech synthesis method using waveform COC units," Proceedings of the Acoustical Society of Japan, October 1993, 2-8-17, pp. 253-254. In these methods, cluster division is performed on the basis of label information whose unit is a phoneme or a concatenation of several phonemes. Here too, therefore, frame connections at points other than the speech segment connection points restricted by the label information are never evaluated.

[0008] Furthermore, in the editing synthesis method, the required speech segments must be selected from the database each time, by measuring their suitability, on the basis of the phonetic symbol labels. As a result, the computation required for searching the database and testing connections grows explosively with the amount of stored synthesis speech segments. This is because, when speech segments are selected, all candidates bearing the same phonetic symbol label are tested on an equal footing based on that label information. Even combinations of speech segments that are acoustically unsuitable for connection are tested, which incurs superfluous computation.

[0009] In Abe, Takeda, Sagisaka, et al., "A study of methods for connecting speech synthesis segments according to phonological environment," Speech Research Group Material SP89-66, pp. 17-22, the degree of freedom in the search range for speech segment connection points is indeed improved. However, the same problem as above arises when selecting speech segments with the same phoneme environment. When a synthesis speech segment sequence (subsequence) is hypothesized, the split point candidates are still determined by the phonetic symbols assigned to the synthesis speech segment interval and by the set of split point candidates used to form the hypothesis. There is thus no freedom in choosing split points, and the limited combinations of split points restrict the search range for connection points.

[0010] Moreover, even with the method of hypothesizing split point candidates disclosed in Iwahashi, Kaiki, Sagisaka, et al., "Speech Segment Selection for Concatenative Synthesis Based on Spectral Distortion Minimization," IEICE Trans., Vol. E76-A, No. 11, Nov. 1993, pp. 1942-1948, speech segments are divided using the statistical phoneme chain properties of the phonetic symbols as the measure. This cannot solve the problem that, at split points that were not selected, there may exist speech segment connections that give a more natural and smoother spectral connection than the speech segment candidate pairs that were actually split.

[0011] This is an important problem for the editing synthesis method, which creates speech segments from natural speech and uses them for speech synthesis.

[0012] As described above, the paths that a spectral transition can take in the feature spectrum space of speech, and the search range for connection points between path units, are restricted by the phonetic symbols used to describe the speech segments. Therefore, to cover all spectral transitions adequately, it is not enough for the samples of feature spectra to fill the utterance space sufficiently. The spectral transitions in other directions that are possible within the interval defined by a speech segment (that is, within the path that a spectral transition can take in the feature spectrum space, as described above) must also be sufficient.

[0013] If the spectral transitions available within the interval defined by a speech segment are insufficient, more natural speech data must be added to increase the available spectral transition paths. This in turn increases redundant feature spectrum information, which causes a needless increase in storage and significantly increases the computation required for the connection point search.

[0014] In addition, the correspondence between speech segments and phonetic symbols is fixed. When a speech segment obtained from an articulatory state with closely similar acoustic features has been given a different phonetic symbol, the feature spectra of that segment are therefore never evaluated for use as synthesis parameters, and the articulatory transition information is not used effectively.

[0015] Also, consider applications of a speech synthesizer in which the sentences and phoneme chains that occur (that is, the path distribution of spectral transitions) are limited by the conditions of use. Depending on the vocabulary size, a speech synthesizer based on text-to-speech synthesis can then be more advantageous in system scale than one based on analysis-synthesis. However, with the conventional text-to-speech method based on editing synthesis, it is difficult to reduce the risk that particularly undesirable speech segments are connected to each other even within a limited sentence space, and so it is difficult to improve quality for restricted applications.

[0016] Accordingly, an object of this invention is to provide a speech synthesizer that effectively uses feature spectra obtained from natural speech to generate synthesis parameters, reduces the computation at synthesis time to keep the system from growing, and obtains highly natural synthesized speech by exploiting the strengths of editing synthesis.

[0017]

[Means for Solving the Problems] To achieve the above object, the speech synthesizer of the invention according to claim 1 comprises: an input unit; a speech pattern connection network storage unit that stores a speech pattern connection network formed, for a plurality of natural speech patterns obtained by frame analysis of natural speech from a speaker with a speech unit label attached to each frame, by connecting to each other frame pairs that belong to different natural speech patterns and have acoustically similar feature spectra, and by adding a transition weight to each connection between frames; a parameter sequence search unit that, referring to the speech unit label of each frame constituting the speech pattern connection network, searches for the optimal path for which the cumulative transition weight is maximal when the network is traversed in the order of the speech unit label sequence based on the linguistic information input from the input unit, and outputs the sequence of feature spectra of the frames on the optimal path as a synthesis parameter sequence; a prosody generation unit that generates prosody control parameters based on the input linguistic information; and a speech synthesis unit that generates synthesized speech based on the synthesis parameter sequence from the parameter sequence search unit and the prosody control parameters from the prosody generation unit.

[0018] With this configuration, the synthesis parameters used for speech synthesis are held as a speech pattern connection network, in which acoustically connectable frames are connected between natural speech patterns obtained by frame analysis of natural speech and a transition weight is added to each connection between frames. At synthesis time, the optimal path on the speech pattern connection network is searched to obtain the synthesis parameters. The synthesis parameters are thus connected smoothly in frame units, which are finer than phonemes and the like, and highly natural synthesized speech is generated.
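
The optimal path search described here can be sketched as a small dynamic program. This is only an illustration under stated assumptions, not the patent's implementation: the target label sequence is assumed to be given with one label per output frame, each frame carries a single label rather than a speech unit vector, and the names `best_path`, `labels`, and `edges` are hypothetical.

```python
from math import inf

def best_path(labels, edges, target):
    """Find the path whose frame labels spell out `target` and whose
    cumulative transition weight is maximal.

    labels: dict mapping frame id -> speech unit label (assumed single label)
    edges:  dict mapping (src frame, dst frame) -> transition weight
    target: list of labels, one per output frame (an assumption of this sketch)
    """
    # score[n] = best cumulative weight of a partial path ending at frame n
    score = {n: 0.0 for n, lab in labels.items() if lab == target[0]}
    back = []  # one dict of back-pointers per step
    for want in target[1:]:
        new_score, pointers = {}, {}
        for (src, dst), w in edges.items():
            if src in score and labels[dst] == want:
                cand = score[src] + w
                if cand > new_score.get(dst, -inf):
                    new_score[dst] = cand
                    pointers[dst] = src
        score = new_score
        back.append(pointers)
    if not score:
        return None, -inf  # the label sequence cannot be traced on the network
    end = max(score, key=score.get)
    total = score[end]
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, total

# Two toy patterns: A = "k a", B = "s a i", with two cross-pattern connections.
labels = {"a0": "k", "a1": "a", "b0": "s", "b1": "a", "b2": "i"}
edges = {
    ("a0", "a1"): 1.0, ("b0", "b1"): 1.0, ("b1", "b2"): 1.0,  # within a pattern
    ("a1", "b2"): 0.75, ("a0", "b1"): 0.5,                    # between patterns
}
path, total = best_path(labels, edges, ["k", "a", "i"])
```

In this toy network the sequence "k a i" does not occur in either pattern, so the search has to cross from pattern A into pattern B, and it picks the crossing with the larger cumulative weight.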

[0019] The invention according to claim 2 is the speech synthesizer of the invention according to claim 1, further comprising: a natural speech pattern storage unit that stores natural speech patterns obtained by frame analysis of natural speech from a speaker; a speech unit information addition unit that attaches the speech unit label to each frame of the natural speech patterns; an inter-frame distance calculation unit that obtains the distance between the feature spectra of frames belonging to different natural speech patterns; an inter-pattern connection frame test unit that tests whether the inter-frame distance is at or below a predetermined value, and thereby whether the frame pair having that distance can serve as an inter-pattern connection frame pair connecting the different natural speech patterns; and a transition weight calculation unit that obtains the transition weights for transitions between frame pairs that can serve as inter-pattern connection frame pairs and between frame pairs that are already connected, adds the transition weights to the connections between frames of the speech pattern connection network formed by connecting the inter-pattern connection frame pairs of the plurality of natural speech patterns, and stores the result in the speech pattern connection network storage unit.

[0020] With this configuration, the speech pattern connection network, in which acoustically connectable frames are connected between natural speech patterns obtained by frame analysis of natural speech and a transition weight is added to each connection between frames, is created automatically from the natural speech patterns stored in the natural speech pattern storage unit.

[0021] The invention according to claim 3 is the speech synthesizer of the invention according to claim 2, wherein the inter-frame distance calculation unit obtains a corrected distance by correcting the distance between the feature spectra of the two frames under comparison according to the power difference between the two feature spectra, unvoiced portions of the two feature spectra, the direction of change of the two feature spectra, or the variance of the two feature spectra.

[0022] With this configuration, the distance between frames is calculated by taking the distance based on the feature spectra obtained by frame analysis of the natural speech from the speaker and applying a correction that emphasizes the differences in character between the two feature spectra under comparison. The resulting inter-frame distance allows acoustically similar frame pairs to be identified accurately.

[0023] The invention according to claim 4 is the speech synthesizer of the invention according to claim 2, further comprising a frame cluster storage unit that stores, in association with each cluster, a speech unit vector whose elements are the proportions, per speech unit label, of the frames that fall into that cluster when all frames of the natural speech patterns are clustered, wherein the speech unit information addition unit refers to the frame cluster storage unit and attaches to each frame of the natural speech patterns the speech unit vector associated with the cluster to which the frame belongs, thereby attaching speech unit labels to each frame.

[0024] With this configuration, the correspondence between a frame, which is a node of the speech pattern connection network, and the speech unit labels is neither one-to-one nor discrete. It takes the form of a vector whose elements are the passage probabilities for a plurality of speech unit labels. The multiple pieces of articulatory transition information that a node represents are therefore used effectively, and transition states to many speech unit labels are expressed.

[0025] The invention according to claim 5 is the speech synthesizer of the invention according to claim 2, wherein the inter-pattern connection frame test unit comprises frame merging means that merges a frame pair into a single frame when the inter-frame distance of that pair is at or below another predetermined value smaller than the predetermined value.

[0026] With this configuration, acoustically very similar frame pairs are merged into a single frame, redundant feature spectrum information is removed, and storage capacity is reduced.

[0027] The invention according to claim 6 is the speech synthesizer of the invention according to any one of claims 1 to 5, further comprising a network evaluation unit that, referring to the speech unit label of each frame constituting the speech pattern connection network, searches for the optimal path for which the cumulative transition weight is maximal when the network is traversed in the order of the speech unit label sequence of a specific speech pattern whose speech unit labels are known, and outputs the sequence of speech unit labels attached to the frames on the optimal path as the recognition result for the specific speech pattern.

[0028] With this configuration, the suitability of the speech pattern connection network for speech synthesis is evaluated from the match rate between the speech unit label sequence of the speech pattern whose labels are known and the recognition result (a speech unit label sequence), and from the cumulative transition weight of the optimal path.

[0029] The invention according to claim 7 is the speech synthesizer of the invention according to any one of claims 1 to 6, further comprising a network editing unit that changes the transition weights between frames in the speech pattern connection network.

[0030] With this configuration, when speech synthesis is performed with a restricted vocabulary, inappropriate inter-frame connections in the speech pattern connection network are easily corrected, and the quality of speech synthesized from the restricted linguistic information is improved.

[0031] Here, linguistic information is a concept that includes text, syntactic structure, and morphemes.

[0032]

[Embodiments of the Invention] This invention is described in detail below with reference to the illustrated embodiment. FIG. 1 is a block diagram of the speech synthesizer of this embodiment. This speech synthesizer is a text-to-speech synthesizer that includes a speech pattern connection network creation device.

[0033] First, the speech pattern connection network creation device 1 is described. The speech pattern connection network creation device 1 has a natural speech pattern storage unit 2, a speech unit information addition unit 3, an inter-frame distance calculation unit 4, an inter-pattern connection frame test unit 5, a transition weight calculation unit 7, a speech pattern connection network storage unit 8, and a frame cluster codebook storage unit 9.

[0034] The natural speech pattern storage unit 2 stores a group of natural speech patterns created by frame analysis of prerecorded natural speech, with each frame labeled to indicate its speech unit (the attached label is hereinafter called the speech unit label). In addition, vector quantization is performed over all frames of the natural speech patterns to create a codebook. The frames that fall into the cluster assigned to each code are counted per speech unit label, and the counts are normalized within each cluster. The vector whose elements are these normalized values (equation (1)) is stored, as the speech unit vector of that cluster, together with the codebook in the frame cluster codebook storage unit 9.

(Equation 1)

[0035] Here, k-means clustering or the like is used to create the speech unit vectors. The number of divisions is increased, K ← K + 1, until the average distortion $D_m$ ($1 \le m \le K$) within each cluster and the threshold $D_r$ satisfy

$$D_r \ge D_m \tag{3}$$

so that differences in articulatory state are separated as well as possible.
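
The codebook construction can be illustrated with a minimal pure-Python sketch. Several things here are assumptions rather than details from the text: the distortion is taken to be the mean squared Euclidean distance to the cluster centre, the farthest-point initialisation is an arbitrary deterministic choice, and the function names `kmeans` and `build_codebook` are hypothetical.

```python
from collections import Counter

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(frame, centers):
    return min(range(len(centers)), key=lambda m: sqdist(frame, centers[m]))

def kmeans(frames, k, iters=20):
    # Deterministic farthest-point initialisation (an assumption of this sketch).
    centers = [frames[0]]
    while len(centers) < k:
        centers.append(max(frames, key=lambda f: min(sqdist(f, c) for c in centers)))
    for _ in range(iters):
        assign = [nearest(f, centers) for f in frames]
        for m in range(k):
            members = [f for f, a in zip(frames, assign) if a == m]
            if members:
                centers[m] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, [nearest(f, centers) for f in frames]

def build_codebook(frames, labels, d_r):
    """Increase K until every cluster's average distortion D_m is <= d_r,
    then build one speech unit vector (label proportions) per cluster."""
    k = 1
    while True:
        centers, assign = kmeans(frames, k)
        distortion = []
        for m in range(k):
            d = [sqdist(f, centers[m]) for f, a in zip(frames, assign) if a == m]
            distortion.append(sum(d) / len(d) if d else 0.0)
        if max(distortion) <= d_r or k == len(frames):
            break
        k += 1  # K <- K + 1
    vectors = []
    for m in range(k):
        counts = Counter(lab for lab, a in zip(labels, assign) if a == m)
        total = sum(counts.values())
        vectors.append({lab: c / total for lab, c in counts.items()} if total else {})
    return centers, vectors

# Toy data: four 2-D "feature spectra" with their speech unit labels.
frames = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = ["a", "a", "i", "a"]
centers, vectors = build_codebook(frames, labels, d_r=2.0)
```

With this toy data one cluster is too coarse, so the loop stops at K = 2. The second cluster contains frames labeled both "i" and "a", so its speech unit vector splits the weight between the two labels.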

[0036] The speech unit information addition unit 3 uses the codebook stored in the frame cluster codebook storage unit 9 and the speech unit vector of each code's cluster to attach a speech unit vector to each frame of the natural speech patterns read from the natural speech pattern storage unit 2. In this way each frame is associated with speech unit labels.
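
As a rough illustration of this step, the sketch below quantizes each frame to its nearest codeword and attaches that cluster's speech unit vector. The nearest-neighbour rule with squared Euclidean distance and the name `attach_unit_vectors` are assumptions made for the example.

```python
def attach_unit_vectors(pattern, codebook, unit_vectors):
    """Return (frame, speech unit vector) pairs for one natural speech pattern.

    pattern:      list of feature vectors (tuples), one per frame
    codebook:     list of cluster centres
    unit_vectors: list of dicts (label -> proportion), parallel to codebook
    """
    tagged = []
    for frame in pattern:
        # Index of the nearest codeword (squared Euclidean distance, assumed).
        m = min(
            range(len(codebook)),
            key=lambda i: sum((x - y) ** 2 for x, y in zip(frame, codebook[i])),
        )
        tagged.append((frame, unit_vectors[m]))
    return tagged

codebook = [(0.0, 1.0), (10.0, 1.0)]
unit_vectors = [{"a": 1.0}, {"i": 0.5, "a": 0.5}]
tagged = attach_unit_vectors([(1.0, 1.0), (9.0, 0.0)], codebook, unit_vectors)
```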

[0037] The inter-frame distance calculation unit 4 calculates, as described below, the distances between temporally consecutive frames in two natural speech patterns and sends the calculated distance d to the inter-pattern connection frame test unit 5. The distance d between frames is not calculated directly from the feature spectra obtained by the frame analysis (waveform, cepstrum, LPC (linear prediction analysis), LSP (line spectrum pair), and so on). Instead, a distance is used in which a corrective gradient is applied to the space in which the feature spectra lie.

[0038] An example of the corrected distance formula for this case is shown in equation (4).

(Equation 2)

[0039] The first term of the numerator in the correction term of equation (4) takes the absolute value of the difference between the logarithms of the powers $P_u$ and $P_v$ of the feature spectra $u$ and $v$, which increases the distance between frames with different power $P$. The second term corrects the distance for frames, such as unvoiced fricatives and unvoiced affricates, that have no formant structure and occupy a wide utterance spectrum space in the high-frequency range. The first term of the denominator takes the inner product of the changes in the feature spectra $u$ and $v$ and normalizes it by their norms, which corrects the distance when $u$ and $v$ change in the same direction. The delta cepstrum or the like may be used as the feature spectrum for this correction. The fourth term uses the variances $\sigma_{c(u)}$ and $\sigma_{c(v)}$ within the clusters $c$ to which the feature spectra $u$ and $v$ belong to correct the overall distance gradient according to the distribution of the spectra.
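
Equation (4) itself is not reproduced in this text, so the following is not that formula. It is a hypothetical sketch of how two of the described corrections could act on a base distance: a log-power difference that enlarges the distance, and a normalized inner product of the delta features that shrinks it when both spectra change in the same direction. The way the terms are combined, the omission of the unvoiced and variance terms, and the name `corrected_distance` are all assumptions.

```python
import math

def corrected_distance(u, v, p_u, p_v, du, dv):
    """Hypothetical corrected distance between two frames.

    u, v:     feature vectors of the two frames
    p_u, p_v: frame powers (positive)
    du, dv:   delta features (direction of spectral change) of each frame
    """
    base = math.dist(u, v)  # plain Euclidean distance
    # Frames with different power are pushed apart.
    power_term = abs(math.log(p_u) - math.log(p_v))
    # Cosine of the angle between the two directions of change.
    norm = math.hypot(*du) * math.hypot(*dv)
    cos_dir = sum(a * b for a, b in zip(du, dv)) / norm if norm > 0 else 0.0
    # Frames moving the same way are pulled together; opposite motion is ignored.
    return base * (1.0 + power_term) / (1.0 + max(0.0, cos_dir))

same_direction = corrected_distance((0, 0), (3, 4), 1.0, 1.0, (1, 0), (1, 0))
opposite_direction = corrected_distance((0, 0), (3, 4), 1.0, 1.0, (1, 0), (-1, 0))
```

The point of the sketch is only the qualitative behaviour: with equal power, two frames whose spectra change in the same direction end up closer than two frames whose spectra change in opposite directions.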

【００４０】上記パターン間接続フレーム検定部５は、
上記フレーム間距離計算部４で計算されたフレーム間の
正規化距離ｄ_norm(以下、単に距離ｄと言う)の値が所定
値以内か否かを検定する。そして、所定値以内の距離値
ｄを呈するフレーム対を総て抽出してパターン間接続フ
レーム対とする。尚、上記フレーム間距離計算部４で行
われるフレーム間の距離計算からパターン間接続フレー
ム検定部５で行われるパターン間接続フレーム対の検定
までの処理には、“伊東,木山,小島,関,岡等「標準パタ
ーンの任意区間によるスポッティングのための Referen
ce Interval-free 連続ＤＰ」電子情報通信学会技報，
ＳＰ９５−３４,Ｐ７３〜８０”に報告されているよう
な、フレーム同期型の任意の区間同士のスポッティング
を応用することもできる。The inter-pattern connection frame test unit 5 tests whether the normalized inter-frame distance d_norm (hereinafter simply called the distance d) calculated by the inter-frame distance calculation unit 4 is within a predetermined value. It then extracts every frame pair whose distance d is within the predetermined value and treats these as inter-pattern connection frame pairs. For the processing from the inter-frame distance calculation in unit 4 through the test of inter-pattern connection frame pairs in unit 5, it is also possible to apply frame-synchronous spotting between arbitrary intervals, as reported in Ito, Kiyama, Kojima, Seki, Oka, et al., "Reference Interval-free Continuous DP for Spotting by Arbitrary Intervals of Reference Patterns," IEICE Technical Report, SP95-34, pp. 73-80.
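The threshold test on frame pairs can be sketched as follows. This is a minimal illustration, not the patent's implementation: the corrected distance of equation (4) is not reproduced in this text, so a plain Euclidean distance stands in for it, and the function names, the all-pairs comparison, and the sample values are assumptions made for the example.

```python
import math

def euclidean(u, v):
    # Stand-in distance (assumption); equation (4) adds correction terms not modelled here.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def connection_frame_pairs(pattern_a, pattern_b, threshold, distance=euclidean):
    """Return (i, j, d) for every frame pair whose distance d is within the threshold."""
    pairs = []
    for i, u in enumerate(pattern_a):
        for j, v in enumerate(pattern_b):
            d = distance(u, v)
            if d <= threshold:
                pairs.append((i, j, d))
    return pairs

# Two toy "patterns", each a list of 2-dimensional feature vectors (hypothetical values).
pattern_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pattern_b = [(1.0, 0.1), (5.0, 5.0)]
pairs = connection_frame_pairs(pattern_a, pattern_b, threshold=0.5)
```

Only the pair formed by frame 1 of the first pattern and frame 0 of the second falls within the threshold here, so it alone would become an inter-pattern connection frame pair.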

【００４１】さらに、上記パターン間接続フレーム検定
部５は、フレーム併合手段６を有する。そして、このフ
レーム併合手段６によって、パターン間接続フレーム対
のなかでより近接したフレーム対の特徴スペクトルを併
合して、記憶容量の削減と音声合成動作時のパス削減に
よる高速化を図る。上記遷移重み計算部７は、上述のよ
うにして抽出されたパターン間接続フレーム対を接続す
る。そして、この接続されたフレーム対に対応する特徴
スペクトル間の遷移に当該２フレーム間の距離に基づい
て決定した遷移重みｂを付加したものを、音声パターン
接続ネットワークの情報として音声パターン接続ネット
ワーク格納部８に格納する。ここまでの処理が、上記音
声パターン接続ネットワーク作成装置１によって行われ
る音声パターン接続ネットワーク生成処理である。Further, the inter-pattern connection frame test unit 5 has frame merging means 6. The frame merging means 6 merges the feature spectra of the closer frame pairs among the inter-pattern connection frame pairs, reducing storage capacity and speeding up speech synthesis by reducing the number of paths. The transition weight calculation unit 7 connects the inter-pattern connection frame pairs extracted as described above. It then adds to the transition between the feature spectra of each connected frame pair a transition weight b determined from the distance between the two frames, and stores the result in the speech pattern connection network storage unit 8 as speech pattern connection network information. The processing up to this point is the speech pattern connection network generation processing performed by the speech pattern connection network creation device 1.

【００４２】図２は、上記音声パターン接続ネットワー
ク作成装置１によって行われる音声パターン接続ネット
ワーク生成処理動作のフローチャートである。以下、上
記音声パターン接続ネットワーク生成処理動作につい
て、今一度簡単に説明する。ここで、自然音声をフレー
ム分析して各フレームに音声単位ラベルが付けられた自
然音声パターンが自然音声パターン格納部２に格納さ
れ、上記自然音声パターンに基づいて作成された上記コ
ードブックと音声単位ベクトルとがフレームクラスタコ
ードブック格納部９に格納されているものとする。FIG. 2 is a flowchart of the speech pattern connection network generation processing performed by the speech pattern connection network creation device 1. This processing is briefly reviewed below. It is assumed that natural speech patterns, obtained by frame analysis of natural speech with a speech unit label attached to each frame, are stored in the natural speech pattern storage unit 2, and that the codebook and speech unit vectors created from those natural speech patterns are stored in the frame cluster codebook storage unit 9.

【００４３】ステップＳ1で、上記音声単位情報付加部
３によって、上記フレームクラスタコードブック格納部
９に格納されているコードブックと音声単位ラベルとを
参照して、自然音声パターン格納部２に格納されている
総ての自然音声パターンの各フレームに上記音声単位ベ
クトルが付けられる。ステップＳ2で、上記フレーム間
距離計算部４によって、上記音声単位ベクトルが付けら
れた自然音声パターンの群から１つの自然音声パターン
が読み出される。In step S1, the speech unit information adding unit 3 refers to the codebook and the speech unit labels stored in the frame cluster codebook storage unit 9 and attaches a speech unit vector to each frame of every natural speech pattern stored in the natural speech pattern storage unit 2. In step S2, the inter-frame distance calculation unit 4 reads one natural speech pattern from the group of natural speech patterns to which speech unit vectors have been attached.

【００４４】ステップＳ3で、上記フレーム間距離計算
部４によって、上記自然音声パターン格納部２から上記
ステップＳ1において読み出された自然音声パターンと
は異なる自然音声パターンの１つが参照パターンとして
読み出される。ステップＳ4で、上記フレーム間距離計
算部４によって、上記ステップＳ3において読み出され
た参照パターンのフレームと上記ステップＳ1において
読み出された自然音声パターンのフレームとのうち時間
的に連続するフレーム間の距離ｄが、上述のように式
(４)によって算出される。ステップＳ5で、上記パター
ン間接続フレーム判定部５によって、上記ステップＳ4
において算出されたフレーム間距離ｄの値が所定値以内
か否かを検定することによって、上記両フレームが異な
る自然音声パターン間を接続するパターン間接続フレー
ム対に適合するか否かが判定される。その結果、適合す
ると判定された場合にはステップＳ6に進み、適合しな
いと判定された場合にはステップＳ8に進む。In step S3, the inter-frame distance calculation unit 4 reads from the natural speech pattern storage unit 2, as a reference pattern, one natural speech pattern different from the one read in step S1. In step S4, the inter-frame distance calculation unit 4 calculates, by equation (4) as described above, the distance d between temporally consecutive frames among the frames of the reference pattern read in step S3 and the frames of the natural speech pattern read in step S1. In step S5, the inter-pattern connection frame test unit 5 tests whether the inter-frame distance d calculated in step S4 is within a predetermined value, and thereby determines whether the two frames qualify as an inter-pattern connection frame pair connecting different natural speech patterns. If they qualify, the process proceeds to step S6; otherwise it proceeds to step S8.

【００４５】ステップＳ6で、上記遷移重み計算部７に
よって、上記パターン間接続フレーム対間を遷移する場
合の遷移重みｂが、上記パターン間接続フレーム対の距
離ｄに基づいて算出される。ステップＳ7で、上記遷移
重み計算部７によって、上記算出された遷移重みｂとパ
ターン間接続フレーム対とが対応付けられて、音声パタ
ーン接続ネットワークの情報として音声パターン接続ネ
ットワーク格納部８に格納される。ステップＳ8で、上
記フレーム間距離計算部４によって、上記自然音声パタ
ーン格納部２に格納されている総ての自然音声パターン
が参照パターンとして読み出されたか否かが判別され
る。その結果、読み出されていればステップＳ9に進む
一方、読み出されていなければ上記ステップＳ3に戻っ
て次の参照パターンとの処理に移行する。In step S6, the transition weight calculation unit 7 calculates the transition weight b for a transition between the frames of the inter-pattern connection frame pair from the distance d of that pair. In step S7, the transition weight calculation unit 7 associates the calculated transition weight b with the inter-pattern connection frame pair and stores them in the speech pattern connection network storage unit 8 as speech pattern connection network information. In step S8, the inter-frame distance calculation unit 4 determines whether every natural speech pattern stored in the natural speech pattern storage unit 2 has been read as a reference pattern. If so, the process proceeds to step S9; if not, it returns to step S3 and moves on to the next reference pattern.

【００４６】ステップＳ9で、上記フレーム間距離計算
部４によって、自然音声パターン格納部２に未処理の自
然音声パターンが存在するか否かが判別される。その結
果、存在すれば上記ステップＳ2に戻って次の自然音声
パターンの処理に移行し、そうでなければステップＳ10
に進む。ステップＳ10で、上記パターン間接続フレーム
判定部５のフレーム併合手段６によって、上記ステップ
Ｓ5におけるパターン間接続フレーム対の適合性判定に
用いられた所定値よりも更に小さい他の所定値以下のフ
レーム間距離ｄを呈するパターン間接続フレーム対(以
下、近接接続フレーム対と言う)が選出される。そし
て、この近接接続フレーム対に対応する特徴スペクトル
が併合される。ステップＳ11で、上記遷移重み計算部７
によって、上記併合後のフレームの遷移重みが算出さ
れ、当該併合フレームとその遷移重みとで、音声パター
ン間接続ネットワーク格納部８が更新される。そうした
後、音声パターン接続ネットワーク生成処理動作を終了
する。In step S9, the inter-frame distance calculation unit 4 determines whether an unprocessed natural speech pattern remains in the natural speech pattern storage unit 2. If one remains, the process returns to step S2 and moves on to the next natural speech pattern; otherwise it proceeds to step S10. In step S10, the frame merging means 6 of the inter-pattern connection frame test unit 5 selects the inter-pattern connection frame pairs whose inter-frame distance d is no greater than a second predetermined value that is smaller than the predetermined value used in the suitability test of step S5 (hereinafter, close connection frame pairs). The feature spectra corresponding to each close connection frame pair are then merged. In step S11, the transition weight calculation unit 7 calculates the transition weights of the merged frames, and the speech pattern connection network storage unit 8 is updated with the merged frames and their transition weights. The speech pattern connection network generation processing then ends.
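The nested loops of steps S2 to S9 can be sketched roughly as below. This is an illustrative outline under stated assumptions only: the Euclidean distance replaces equation (4), the mapping from distance d to transition weight b is hypothetical (the text says only that b is determined from d), every frame of one pattern is compared with every frame of the other, and the merging of steps S10 and S11 is omitted.

```python
import math

def distance(u, v):
    # Placeholder Euclidean distance standing in for equation (4) (assumption).
    return math.dist(u, v)

def weight_from_distance(d):
    # Hypothetical mapping from distance d to transition weight b.
    return 1.0 / (1.0 + d)

def build_network(patterns, threshold):
    """patterns: dict name -> list of feature vectors. Returns a list of weighted edges."""
    edges = []
    for name, frames in patterns.items():                 # S2 / S9: each natural speech pattern
        for ref_name, ref_frames in patterns.items():     # S3 / S8: each reference pattern
            if ref_name == name:
                continue
            for i, u in enumerate(frames):                # S4: inter-frame distance
                for j, v in enumerate(ref_frames):
                    d = distance(u, v)
                    if d <= threshold:                    # S5: threshold test
                        b = weight_from_distance(d)       # S6: transition weight
                        edges.append(((name, i), (ref_name, j), b))  # S7: store
    return edges

edges = build_network(
    {"A": [(0.0, 0.0), (1.0, 0.0)], "B": [(1.0, 0.0), (9.0, 9.0)]},
    threshold=0.5,
)
```

In this toy input, frame 1 of pattern A and frame 0 of pattern B coincide, so one edge is recorded in each direction with weight 1.0.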

【００４７】こうして作成された音声パターン接続ネッ
トワークを用いて、言語解析部１１,音声単位ラベル列
変換部１２,パラメータ列探索部１３,音声合成部１４,
韻律生成部１５および音声出力部１６によって、テキス
ト音声合成を行うのである。以下、上記テキスト音声合
成について説明する。Using the speech pattern connection network created in this way, text-to-speech synthesis is performed by the language analysis unit 11, the speech unit label string conversion unit 12, the parameter string search unit 13, the speech synthesis unit 14, the prosody generation unit 15, and the speech output unit 16. This text-to-speech synthesis is described below.

【００４８】上記言語解析部１１は、テキスト入力部１
０から与えられたテキストを言語解析して、解析結果を
音声単位ラベル列変換部１２に送出する。音声単位ラベ
ル列変換部１２は、音声単位情報付加部３で用いた音声
単位ラベル群と同じ音声単位ラベル群を用いて入力テキ
ストを音声単位ラベル列に変換する。この場合、従来の
テキスト音声合成装置においては、入力テキストに対し
て言語情報あるいは音素情報を単位としたラベル列を一
意に決定しているが、本実施の形態においては、上記音
声パターン接続ネットワークの使用によって音素よりも
微小なフレーム単位での接続が可能になるので、無声化
母音の取り扱いや異音化結合の発声の有無等を含む複数
の音声単位ラベル列候補を入力テキストに与えることが
可能となる。その場合には、上述のようにして得られた
複数の音声単位ラベル列候補に基づいて複数の合成音声
候補を作成し、これらの合成音声候補のうち最も滑らか
に発声可能な合成音声候補を選択して音声合成を行うの
である。The language analysis unit 11 performs language analysis on the text supplied from the text input unit 10 and sends the analysis result to the speech unit label string conversion unit 12. The speech unit label string conversion unit 12 converts the input text into a speech unit label string using the same set of speech unit labels as that used by the speech unit information adding unit 3. A conventional text-to-speech synthesizer uniquely determines, for the input text, a label string whose units are linguistic or phonemic information. In the present embodiment, by contrast, the speech pattern connection network allows connection in frame units smaller than a phoneme, so the input text can be given several candidate speech unit label strings that differ in, for example, the treatment of devoiced vowels and whether allophonic combinations are uttered. In that case, several synthesized speech candidates are created from the candidate speech unit label strings, and the candidate that can be uttered most smoothly is selected for speech synthesis.

【００４９】さらに、上記音声単位ラベル列変換部１２
は、従来のテキスト音声合成装置のように、同一ラベル
の継続を１つのラベルに圧縮して単一の音声素片長の設
定や複数の音声素片の継続長の設定は総て後段の音声合
成部に委ねるのではなく、同じ音声単位ラベルを有する
フレームの繰り返し回数およびその繰り返し回数の変動
幅を設定できるフレーム同期型の音声単位ラベル列に展
開しておくのである。こうして上記音声単位ラベル列変
換部１２において変換された音声単位ラベル列はパラメ
ータ列探索部１３に送出される。Further, unlike a conventional text-to-speech synthesizer, the speech unit label string conversion unit 12 does not compress a run of identical labels into a single label and leave the setting of each speech segment length, and of the durations of multiple speech segments, entirely to the downstream speech synthesis unit. Instead, it expands the text into a frame-synchronous speech unit label string in which the number of repetitions of frames having the same speech unit label, and the allowed variation in that number, can be set. The speech unit label string produced by the speech unit label string conversion unit 12 is sent to the parameter string search unit 13.
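One way to picture a frame-synchronous label string with a repetition count and a variation range is the small sketch below. The tuple layout (label, repeat, spread) and the function name are hypothetical; the text does not specify how the repetition count or its variation range is represented.

```python
from itertools import product

def expand_labels(units):
    """units: list of (label, repeat, spread). Yield every frame-level label string
    whose repetition counts lie within repeat - spread .. repeat + spread (at least 1)."""
    ranges = [range(max(1, n - s), n + s + 1) for _, n, s in units]
    for counts in product(*ranges):
        yield "".join(label * c for (label, _, _), c in zip(units, counts))

# /a/ held for 2 frames, /b/ for 4 frames give or take 1, /f/ for 1 frame.
candidates = list(expand_labels([("a", 2, 0), ("b", 4, 1), ("f", 1, 0)]))
print(candidates)  # ['aabbbf', 'aabbbbf', 'aabbbbbf']
```

Each resulting string has one label per frame, which is the form the path search consumes.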

【００５０】上記パラメータ列探索部１３は、上記音声
パターン接続ネットワーク格納部８に格納されている音
声パターン接続ネットワークにおいて、最も尤度の高い
(つまり、自然で滑らかに接続される)特徴スペクトル列
(合成パラメータ列)を与えるパスを探索することによっ
て、音声合成部１４を駆動するための合成パラメータ列
を生成するのである。The parameter string search unit 13 searches the speech pattern connection network stored in the speech pattern connection network storage unit 8 for the path that yields the feature spectrum sequence (synthesis parameter sequence) with the highest likelihood, that is, the sequence connected most naturally and smoothly, and thereby generates the synthesis parameter sequence that drives the speech synthesis unit 14.

【００５１】図３は、上記音声パターン接続ネットワー
ク構造の一部を模式的に示し、併せて１つの合成パラメ
ータ列のパス探索過程を模式的に示す。また、図４は、
図３と同じ音声パターン接続ネットワーク構造において
他の合成パラメータ列のパスを模式的に示す。以下、図
３および図４に従って、パラメータ列探索部１３によっ
て行われる合成パラメータ列のパス探索動作の一例を説
明する。FIG. 3 schematically shows part of the structure of the speech pattern connection network, together with the path search process for one synthesis parameter sequence. FIG. 4 schematically shows the path of another synthesis parameter sequence in the same speech pattern connection network structure as FIG. 3. An example of the path search for a synthesis parameter sequence performed by the parameter string search unit 13 is described below with reference to FIGS. 3 and 4.

【００５２】上述のごとく、図３および図４は上記音声
パターン接続ネットワークの一部を切り出したものであ
り、発声/ａｂｅ/から得られた音声パターンＡ(○印の
連鎖)と、発声/ｂｂｆ/から得られた音声パターンＢ(□
印の連鎖)と、発声/ｄｂｃ/から得られた音声パターン
Ｃ(△印の連鎖)との間で、パターン間接続が起こってい
る部分である。尚、図中、実線は同じ音声パターン内の
次フレームへ遷移する「不変パス(図中においては単に
「不変」と記載)」を示し、一点鎖線は他の音声パター
ンの次フレームへ遷移する「遷移パス(図中においては
単に「遷移」と記載)」を示す。そして、各「不変パ
ス」および「遷移パス」には遷移重みが付加されている
(図３および図４においては遷移重みの一部のみ記載し
ている)。また、上記音声パターン接続ネットワークの
各ノード(各フレームの特徴スペクトル)には、上記音声
単位ベクトルが付加されている。例えば、音声パターン
Ｂの第３フレームには音声単位ベクトル（０.１(/a/),
０.６(/b/),０.３(/d/))が付加されている。これは、音
声パターンＢの第３フレームが属するクラスタに属する
フレームのうち、１割が音声単位ラベル/ａ/のフレーム
であり、６割が音声単位ラベル/ｂ/のフレームであり、
３割りが音声単位ラベル/ｄ/のフレームであることを表
し、これは取りも直さず各音声単位ラベル/ａ/,/ｂ/,/
ｄ/のフレームが音声パターンＢの第３フレームに遷移
する確率を表している。As described above, FIGS. 3 and 4 show a portion cut out of the speech pattern connection network, in which inter-pattern connections occur among speech pattern A (chain of ○ marks) obtained from the utterance /abe/, speech pattern B (chain of □ marks) obtained from the utterance /bbf/, and speech pattern C (chain of △ marks) obtained from the utterance /dbc/. In the figures, a solid line indicates an "invariant path" (labeled simply "invariant" in the figures) that moves to the next frame within the same speech pattern, and a dash-dot line indicates a "transition path" (labeled simply "transition") that moves to the next frame of another speech pattern. A transition weight is attached to each invariant path and each transition path (only some of the transition weights are shown in FIGS. 3 and 4). The speech unit vector is attached to each node (the feature spectrum of each frame) of the speech pattern connection network. For example, the speech unit vector (0.1 (/a/), 0.6 (/b/), 0.3 (/d/)) is attached to the third frame of speech pattern B. This means that, of the frames belonging to the cluster that contains the third frame of speech pattern B, 10% have the speech unit label /a/, 60% have the label /b/, and 30% have the label /d/. In other words, it expresses the probability that a frame with each of the labels /a/, /b/, and /d/ transitions to the third frame of speech pattern B.
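The speech unit vector described above is a per-cluster label proportion, which the following sketch computes. The function name and the toy cluster of ten frames are assumptions chosen to reproduce the (0.1, 0.6, 0.3) example; how clusters are actually formed is outside this sketch.

```python
from collections import Counter

def speech_unit_vector(labels_in_cluster):
    """Return the fraction of frames in a cluster that carry each speech unit label."""
    counts = Counter(labels_in_cluster)
    total = len(labels_in_cluster)
    return {label: count / total for label, count in counts.items()}

# A hypothetical cluster of ten frames: one /a/, six /b/, three /d/.
cluster = ["a"] * 1 + ["b"] * 6 + ["d"] * 3
vector = speech_unit_vector(cluster)
print(vector)  # {'a': 0.1, 'b': 0.6, 'd': 0.3}
```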

【００５３】以上のことから、音声パターンＡの第２フ
レームから音声パターンＢの第３フレームへの遷移パス
の遷移重み「０.６」と音声パターンＢの第３フレームの
音声単位ベクトルにおける音声単位ラベル/ｂ/の要素値
「０.６」との積が、第３フレームの音声単位ラベルが/
ｂ/である場合に、音声パターンＡの第２フレームから
音声パターンＢの第３フレームに遷移する確率となる。
以下、この「確率」を以て遷移パスの尤度とするのであ
る。したがって、与えられたテキストの音声単位ラベル
列が音声パターン接続ネットワーク上の各経路を通過す
る場合の累積尤度が最大となる経路を探索し、各ノード
(フレーム)に対応付けられている特徴スペクトル(合成
パラメータ)を接続することによって、音響的に違和感
がなく自然な合成音声を生成できるのである。From the above, the product of the transition weight 0.6 of the transition path from the second frame of speech pattern A to the third frame of speech pattern B and the element value 0.6 for the speech unit label /b/ in the speech unit vector of the third frame of speech pattern B is the probability of transitioning from the second frame of speech pattern A to the third frame of speech pattern B when the speech unit label of the third frame is /b/. This "probability" is used below as the likelihood of the transition path. Accordingly, by searching for the route that maximizes the cumulative likelihood when the speech unit label string of the given text passes along the routes of the speech pattern connection network, and connecting the feature spectra (synthesis parameters) associated with the nodes (frames) on that route, natural synthesized speech free of acoustic discontinuity can be generated.

【００５４】以下、上記音声パターンＡ,Ｂ,Ｃのパスを
フレーム同期展開して得られた音声単位ラベル列/ａａ
ｂｂｂｂｆ/をテキストから得られた音声単位ラベル列
とした場合に、合成パラメータ列が通過する最適パスの
探索を考える。Consider below the search for the optimal path traversed by the synthesis parameter sequence when the speech unit label string /aabbbbf/, obtained by frame-synchronous expansion of paths through speech patterns A, B, and C, is taken as the speech unit label string obtained from the text.

【００５５】上記音声パターン接続ネットワークを音声
単位ラベル列/ａａｂｂｂｂｆ/が通過するときの部分尤
度計算を次のように行う。図３に示すように、初期フレ
ームは音声パターンＡの第１フレームであり、音声単位
ベクトルは/ａ/の要素のみを持つ。そして、第２フレー
ムに遷移するパスは同じ音声パターンＡへの不変パスの
みであり、不変パスの遷移重みは１.０と定義されてい
るから、第２フーム/ａ/までの累積尤度は、初期値
「１.０」に、不変パスの遷移重み１.０と音声パターン
Ａの第２フレームの音声単位ベクトルにおける/ａ/の要
素値０.８とを積和して、ｐ(i│i=２）＝１.０＋１.０×０.８＝１.８ …（５）但し、ｉ：フレーム番号(最大値Ｎ)となる。The partial likelihood when the speech unit label string /aabbbbf/ passes through the speech pattern connection network is calculated as follows. As shown in FIG. 3, the initial frame is the first frame of speech pattern A, whose speech unit vector has only the element /a/. The only path to the second frame is the invariant path within the same speech pattern A, and the transition weight of the invariant path is defined as 1.0. The cumulative likelihood up to the second frame /a/ is therefore the initial value 1.0 plus the product of the invariant-path transition weight 1.0 and the element value 0.8 for /a/ in the speech unit vector of the second frame of speech pattern A:
p(i | i = 2) = 1.0 + 1.0 × 0.8 = 1.8 … (5)
where i is the frame number (maximum value N).

【００５６】次に、第２フレーム/ａ/から第３フレーム
/ｂ/に遷移する際のパスは、音声パターンＡへの不変パ
スと音声パターンＢへの遷移パスとの２つのパスがあ
る。そこで、先ず、音声パターンＡへの不変パスを考え
る。この場合には、第２フレームまでの累積尤度ｐ(２)
＝１.８に、第３フレーム/ｂ/への不変パスの遷移重み
１.０と音声パターンＡの第３フレームの音声単位ベク
トルにおける/ｂ/の要素値０.６との積を加算して、第
３フレームまでの累積尤度ｐ(３)はｐ(３)＝２.４とな
る。次に、音声パターンＢへの遷移パスを考える。この
場合には、第２フレームまでの累積尤度ｐ(２)＝１.８
に、第３フレーム/ｂ/への遷移パスの遷移重み０.６と
音声パターンＢの第３フレームの音声単位ベクトルにお
ける/ｂ/の要素値０.６との積を加算して、第３フレー
ムまでの累積尤度ｐ(３)はｐ(３)＝２.１６となる。し
たがって、第２フレーム/ａ/から第３フレーム/ｂ/への
遷移としては、音声パターンＢへの遷移パスよりも音声
パターンＡへの不変パスを取った方が尤度は高いので不
変パスを取り、第３フレームまでの累積尤度Ｐ(３)はＰ
(３)＝２.４となる。Next, there are two paths for the transition from the second frame /a/ to the third frame /b/: the invariant path within speech pattern A and the transition path to speech pattern B. Consider first the invariant path within speech pattern A. Adding to the cumulative likelihood up to the second frame, p(2) = 1.8, the product of the invariant-path transition weight 1.0 to the third frame /b/ and the element value 0.6 for /b/ in the speech unit vector of the third frame of speech pattern A gives a cumulative likelihood up to the third frame of p(3) = 2.4. Consider next the transition path to speech pattern B. Adding to p(2) = 1.8 the product of the transition weight 0.6 of the transition path to the third frame /b/ and the element value 0.6 for /b/ in the speech unit vector of the third frame of speech pattern B gives p(3) = 2.16. For the transition from the second frame /a/ to the third frame /b/, the invariant path within speech pattern A therefore has a higher likelihood than the transition path to speech pattern B, so the invariant path is taken and the cumulative likelihood up to the third frame is P(3) = 2.4.
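The comparison at the third frame can be checked with a few lines of code. The helper name and the tuple layout are assumptions; the weights and element values for frames 2 and 3 are the ones quoted in the text.

```python
def step(p_prev, candidates):
    """candidates: list of (name, transition_weight, label_element).
    Return the (name, cumulative likelihood) of the best-scoring candidate."""
    scored = [(name, p_prev + b * n) for name, b, n in candidates]
    return max(scored, key=lambda item: item[1])

# Frame 2: initial value 1.0 plus invariant weight 1.0 times element value 0.8.
p2 = 1.0 + 1.0 * 0.8

# Frame 3: invariant path in A (1.0 x 0.6) against transition path to B (0.6 x 0.6).
best3 = step(p2, [("invariant A", 1.0, 0.6), ("transition B", 0.6, 0.6)])
print(best3[0], round(best3[1], 2))  # invariant A 2.4
```

The transition path scores 1.8 + 0.36 = 2.16, so the invariant path wins at this frame.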

【００５７】同様にして、第３フレーム/ｂ/から第４フ
レーム/ｂ/に遷移する場合の累積尤度ｐ(４)を求める。
但し、この場合には、音声パターンＡへの不変パスを取
った場合の累積尤度ｐ(４)はｐ(４)＝３.２となる。一
方、音声パターンＢへの遷移パスを取った場合の累積尤
度ｐ(４)はｐ(４)＝３.２１となる。したがって、第３
フレーム/ｂ/から第４フレーム/ｂ/への遷移としては遷
移パスを取り、第４フレームまでの累積尤度Ｐ(４)はＰ
(４)＝３.２１となる。The cumulative likelihood p(4) for the transition from the third frame /b/ to the fourth frame /b/ is obtained in the same way. In this case, taking the invariant path within speech pattern A gives p(4) = 3.2, whereas taking the transition path to speech pattern B gives p(4) = 3.21. The transition path is therefore taken for the transition from the third frame /b/ to the fourth frame /b/, and the cumulative likelihood up to the fourth frame is P(4) = 3.21.

【００５８】こうして、式(６)によって示される漸化式
の最終項ｐ(Ｎ)、ｐ(ｉ＋１）←ｐ(ｉ)＋ｂ_n(i+1)ｎ(ｉ＋１) …（６）但し、ｐ(１）＝ｎ(１）ｐ(ｉ）：合成音声パターンのパスにおける第ｉフレー
ムまでの累積尤度ｎ(ｉ）：合成音声パターンのパスにおける第ｉフレー
ムの音声単位ベクトルの評価対象要素の値ｂ_n(i)：次フレームｉへの遷移重みすなわち、式(７)によって求められる最終フレームまで
の累積尤度値

【数３】を最大にするパスが最適経路として選択される。Thus, the path that maximizes the final term p(N) of the recurrence given by equation (6) is selected as the optimal route:
p(i+1) ← p(i) + b_{n(i+1)} · n(i+1) … (6)
where
p(1) = n(1);
p(i): cumulative likelihood up to the i-th frame on the path of the synthesized speech pattern;
n(i): value of the evaluated element of the speech unit vector of the i-th frame on the path of the synthesized speech pattern;
b_{n(i)}: transition weight to the next frame i.
The final term p(N) is the cumulative likelihood up to the final frame given by equation (7) (Equation 3).
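Unrolling the recurrence of equation (6) from p(1) = n(1) gives a closed-form sum for the final term. The following is a derivation from equation (6) alone, offered as a reading aid; equation (7) is not reproduced in this text, so whether its printed form matches the last line exactly is an assumption.

```latex
\begin{aligned}
p(1) &= n(1),\\
p(i+1) &= p(i) + b_{n(i+1)}\, n(i+1),\\
p(N) &= n(1) + \sum_{i=2}^{N} b_{n(i)}\, n(i).
\end{aligned}
```

With N = 7 this is the shape of the sums in equations (8) and (9): an initial element value followed by six products of a transition weight and an element value.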

【００５９】したがって、図３に示す例の場合には、最
終フレームである第７フレームまでの累積尤度ｐ(７)がｐ(７)＝１.０＋(１.０×０.８)＋(１.０×０.６) ＋(０.９×０.９)＋(０.８×０.９)＋(１.０×０.４) ＋(１.０×０.９)＝５.２３ …（８）となるパス(図３における太線：一点鎖線は遷移パス)
が、最大尤度を示す最適経路となる。Therefore, in the example shown in FIG. 3, the optimal route exhibiting the maximum likelihood is the path (thick line in FIG. 3; dash-dot segments are transition paths) whose cumulative likelihood up to the seventh and final frame is
p(7) = 1.0 + (1.0 × 0.8) + (1.0 × 0.6) + (0.9 × 0.9) + (0.8 × 0.9) + (1.0 × 0.4) + (1.0 × 0.9) = 5.23 … (8)

【００６０】図４は、上述と同様にして得られたテキス
ト音声単位ラベル列/ｂｂｂｂｃｃｃ/が通過するパスを
示す。この場合には、最終フレームである第７フレーム
までの累積尤度ｐ(７)がｐ(７)＝０.５＋(１.０×０.５)＋(１.０×０.７) ＋(１.０×０.９)＋(１.０×０.２)＋(１.０×０.７) ＋(１.０×１.０)＝４.５ …（９）となるパス(図４における太線：一点鎖線は遷移パス)
が、最大尤度を示す最適経路となる。FIG. 4 shows the path traversed by the text speech unit label string /bbbbccc/, obtained in the same manner as above. In this case, the optimal route exhibiting the maximum likelihood is the path (thick line in FIG. 4; dash-dot segments are transition paths) whose cumulative likelihood up to the seventh and final frame is
p(7) = 0.5 + (1.0 × 0.5) + (1.0 × 0.7) + (1.0 × 0.9) + (1.0 × 0.2) + (1.0 × 0.7) + (1.0 × 1.0) = 4.5 … (9)
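The two totals in equations (8) and (9) can be verified directly. The function name is an assumption; each pair is a (transition weight, element value) term copied from the equations.

```python
def cumulative_likelihood(initial, steps):
    """Initial element value plus the sum of weight x element products along the path."""
    return initial + sum(b * n for b, n in steps)

# Terms of equation (8), the FIG. 3 path for /aabbbbf/.
fig3 = cumulative_likelihood(
    1.0, [(1.0, 0.8), (1.0, 0.6), (0.9, 0.9), (0.8, 0.9), (1.0, 0.4), (1.0, 0.9)]
)
# Terms of equation (9), the FIG. 4 path for /bbbbccc/.
fig4 = cumulative_likelihood(
    0.5, [(1.0, 0.5), (1.0, 0.7), (1.0, 0.9), (1.0, 0.2), (1.0, 0.7), (1.0, 1.0)]
)
print(round(fig3, 2), round(fig4, 2))  # 5.23 4.5
```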

【００６１】上述のようにして選択された最適経路が通
過する音声パターン接続ネットワーク上のノードに該当
する特徴スペクトルの列を合成パラメータ列として得
る。The sequence of feature spectra corresponding to the nodes of the speech pattern connection network traversed by the optimal route selected as described above is obtained as the synthesis parameter sequence.
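The search for the route with the largest cumulative likelihood can be sketched as a frame-by-frame dynamic programming pass. This is one plausible reading of the recurrence in equation (6), not the patent's stated algorithm; the data layout, the function name, and the three-frame toy network (which reuses only the values quoted for FIG. 3) are assumptions.

```python
def best_path(nodes, edges, labels, start_nodes):
    """
    nodes: dict node -> {label: element value of its speech unit vector}
    edges: dict node -> list of (next_node, transition_weight)
    labels: one target speech unit label per frame
    Returns (cumulative likelihood, list of nodes) for the best route.
    """
    # For each reachable node keep only the best-scoring route ending there.
    score = {s: (nodes[s].get(labels[0], 0.0), [s]) for s in start_nodes}
    for label in labels[1:]:
        nxt = {}
        for node, (p, path) in score.items():
            for succ, b in edges.get(node, []):
                cand = p + b * nodes[succ].get(label, 0.0)  # recurrence of equation (6)
                if succ not in nxt or cand > nxt[succ][0]:
                    nxt[succ] = (cand, path + [succ])
        score = nxt
    return max(score.values(), key=lambda item: item[0])

nodes = {
    "A1": {"a": 1.0},
    "A2": {"a": 0.8},
    "A3": {"b": 0.6},
    "B3": {"a": 0.1, "b": 0.6, "d": 0.3},
}
edges = {"A1": [("A2", 1.0)], "A2": [("A3", 1.0), ("B3", 0.6)]}
likelihood, path = best_path(nodes, edges, "aab", ["A1"])
print(path, round(likelihood, 2))  # ['A1', 'A2', 'A3'] 2.4
```

The nodes on the returned route index the feature spectra that would be concatenated into the synthesis parameter sequence. The sketch raises an error if no route matches the label string, which a fuller version would need to handle.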

【００６２】ここで、上述したように、上記音声単位ラ
ベル列変換部１２によって入力テキストに複数の音声単
位ラベル列候補を与えた場合には、各音声単位ラベル列
候補に対応した複数の合成パラメータ列候補が得られる
ことになる。その場合には、各合成パラメータ列候補の
最大尤度を比較して、最も大きな最大尤度を呈する合成
パラメータ列候補を最も滑らかに発声可能な目的の合成
パラメータ列であると決定すればよい。As described above, when the speech unit label string conversion unit 12 gives the input text several candidate speech unit label strings, a synthesis parameter sequence candidate is obtained for each of them. In that case, the maximum likelihoods of the candidates are compared, and the synthesis parameter sequence candidate with the largest maximum likelihood is chosen as the target synthesis parameter sequence that can be uttered most smoothly.

【００６３】こうして得られた合成パラメータ列は音声
合成部１４に送出される。そして、音声合成部１４によ
って、言語解析部１１による言語解析結果に基づいて韻
律生成部１５によって生成された韻律制御パラメータに
基づいて韻律が制御されると共に、上記合成パラメータ
列に基づいて合成音声が生成され、この生成された合成
音声信号が音声出力部１６に送出される。そうすると、
音声出力部１６は、上記合成音声信号に基づいてスピー
カ１７を駆動して合成音声を出力する。The synthesis parameter sequence obtained in this way is sent to the speech synthesis unit 14. The speech synthesis unit 14 controls the prosody according to the prosody control parameters that the prosody generation unit 15 generates from the language analysis result of the language analysis unit 11, generates synthesized speech from the synthesis parameter sequence, and sends the resulting synthesized speech signal to the speech output unit 16. The speech output unit 16 then drives the speaker 17 with the synthesized speech signal to output the synthesized speech.

【００６４】本実施の形態においては、上記音声合成部
１４で用いる音声合成方式については特に限定しない。
尚、上記音声合成方式は、上記音声パーン接続ネットワ
ーク作成装置１の自然音声パターン格納部２に格納する
自然音声パターンを生成する際に用いたフレーム分析法
に関係する。本実施の形態の場合には、上記音声合成方
式として零位相波形重畳方式,線形音声合成方式、線形
ＬＳＰを用いる方式、ケプストラム合成方式等の適用が
考えられる。その際に、言語解析部１１による言語解析
や韻律生成部１５による韻律生成方式には関係しない。In the present embodiment, the speech synthesis method used in the speech synthesis unit 14 is not particularly limited. The speech synthesis method is, however, related to the frame analysis method used to generate the natural speech patterns stored in the natural speech pattern storage unit 2 of the speech pattern connection network creation device 1. Applicable speech synthesis methods in the present embodiment include the zero-phase waveform superposition method, the linear speech synthesis method, a method using linear LSP, and the cepstrum synthesis method. The choice does not depend on the language analysis performed by the language analysis unit 11 or on the prosody generation method of the prosody generation unit 15.

【００６５】最後に、ネットワーク評価部１８による音
声パターン接続ネットワーク適性評価処理、および、ネ
ットワーク編集部１９による音声パターン接続ネットワ
ーク編集について述べる。上記音声パターン接続ネット
ワーク格納部８に格納されている音声パターン接続ネッ
トワークを用いて自然音声の認識を行うことが可能であ
る。そこで、ネットワーク評価部１８は、発声既知の音
声を上記音声パターン接続ネットワークを用いて認識
し、認識結果に基づいて上記音声パターン接続ネットワ
ークの適性を評価するのである。Finally, the speech pattern connection network suitability evaluation by the network evaluation unit 18 and the speech pattern connection network editing by the network editing unit 19 are described. The speech pattern connection network stored in the speech pattern connection network storage unit 8 can also be used to recognize natural speech. The network evaluation unit 18 therefore recognizes speech whose content is known by using the speech pattern connection network, and evaluates the suitability of the network from the recognition result.

【００６６】上記ネットワーク評価部１８による音声パ
ターン接続ネットワークの適性評価は次のように行う。
すなわち、上記パラメータ列探索部１３によって最終フ
レームまでの累積尤度ｐ(Ｎ)を算出する際に用いる式
(７)中のｉ番目のフレームにおける音声単位ベクトルの
評価対象要素の値ｎ(ｉ)を、入力音声の(ｉ−１)番目の
フレームと音声パターン接続ネットワークにおけるｉ番
目のフレームとの特徴スペクトル間距離によって定義さ
れるスコアと見なす。そして、上記式(７)を用いて、音
声単位ラベルが既知の入力音声に関して上記音声パター
ン接続ネットワーク上の最大スコアを示す最適経路を求
め、その最適経路が通過する音声パターン接続ネットワ
ーク上のノードに割り当てられた選択(通過)音声単位ラ
ベルの列に基づいて入力音声を認識するのである。その
場合に、認識された音声単位ラベル列が入力音声に適合
し、且つ、最大スコアの値が大きい程、得られた最適経
路に基づいて得られる合成パラメータ列は自然音声の遷
移に近いものであることを表しており、当該音声パター
ン接続ネットワークはテキスト音声合成に対する適性が
十分高いと評価できるのである。尚、上記式(７)によっ
て最適パスのスコアを求める代わりに、フレーム間距離
計算部４で用いる式(４)によって算出された距離を用い
ても構わない。The network evaluation unit 18 evaluates the suitability of the speech pattern connection network as follows. The value n(i) of the evaluated element of the speech unit vector at the i-th frame in equation (7), which the parameter string search unit 13 uses to calculate the cumulative likelihood p(N) up to the final frame, is regarded as a score defined by the distance between the feature spectra of the (i-1)-th frame of the input speech and the i-th frame of the speech pattern connection network. Using equation (7), the optimal route with the maximum score on the speech pattern connection network is found for input speech whose speech unit labels are known, and the input speech is recognized from the sequence of selected (traversed) speech unit labels assigned to the nodes of the network that the optimal route passes through. The better the recognized speech unit label string matches the input speech, and the larger the maximum score, the closer the synthesis parameter sequence obtained from the optimal route is to the transitions of natural speech, and the speech pattern connection network can then be judged sufficiently suitable for text-to-speech synthesis. Instead of obtaining the score of the optimal path by equation (7), the distance calculated by equation (4), as used by the inter-frame distance calculation unit 4, may be used.

【００６７】上述の音声パターン接続ネットワークの適
性評価の結果、上記音声パターン接続ネットワーク中に
不適当な遷移が発見された場合には、ネットワーク編集
部１９によって、不必要な遷移パスを削除することによ
って、更に良い合成音声品質を得ることができるのであ
る。ここで、上記ネットワーク編集部１９による編集処
理は、単純に、音声パターン接続ネットワーク中におけ
る不適当な遷移重みを「０」に置き換えればよい。ま
た、音声パターン接続ネットワークの構造を変換しても
よい。If the suitability evaluation described above finds an inappropriate transition in the speech pattern connection network, the network editing unit 19 deletes the unnecessary transition path, which yields still better synthesized speech quality. The editing performed by the network editing unit 19 may simply replace the inappropriate transition weight in the speech pattern connection network with 0. Alternatively, the structure of the speech pattern connection network may be changed.
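Replacing an inappropriate transition weight with 0 can be sketched as below. The edge layout and the function name are assumptions carried over for illustration; how inappropriate transitions are identified is left to the evaluation step.

```python
def prune_transitions(edges, bad_transitions):
    """Return a copy of the edge table with the weight of each listed transition set to 0."""
    return {
        node: [(succ, 0.0 if (node, succ) in bad_transitions else b) for succ, b in outs]
        for node, outs in edges.items()
    }

edges = {"A2": [("A3", 1.0), ("B3", 0.6)]}
pruned = prune_transitions(edges, {("A2", "B3")})
print(pruned)  # {'A2': [('A3', 1.0), ('B3', 0.0)]}
```

Because a zero weight contributes nothing to the cumulative likelihood, the pruned transition can no longer improve any route, while the network structure itself stays unchanged.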

【００６８】このように、本実施の形態においては、上
記音声パターン接続ネットワーク作成装置１を有してい
る。そして、この音声パターン接続ネットワーク作成装
置１は、音声単位情報付加部３によって、フレームクラ
スタコードブック格納部９に格納されたコードブックと
音声単位ベクトルとを参照して、自然音声パターン格納
部２に格納された自然音声パターンのフレームに音声単
位ベクトルを付ける。そして、異なる自然音声パターン
に属するフレーム間の距離ｄをフレーム間距離計算部４
によって計算し、パターン間接続フレーム検定部５によ
って、所定値以内のフレーム間距離ｄを呈するフレーム
対を総て抽出してパターン間接続フレーム対とする。遷
移重み計算部７は、上記パターン間接続フレーム対を接
続し、その接続されたフレーム対に対応する特徴スペク
トル間の遷移に距離ｄに応じた遷移重みｂを付加して音
声パターン接続ネットワークを形成し、音声パターン接
続ネットワーク格納部８に格納する。As described above, the present embodiment has the speech pattern connection network creation device 1. In this device, the speech unit information adding unit 3 refers to the codebook and the speech unit vectors stored in the frame cluster codebook storage unit 9 and attaches a speech unit vector to each frame of the natural speech patterns stored in the natural speech pattern storage unit 2. The inter-frame distance calculation unit 4 calculates the distance d between frames belonging to different natural speech patterns, and the inter-pattern connection frame test unit 5 extracts every frame pair whose inter-frame distance d is within a predetermined value as an inter-pattern connection frame pair. The transition weight calculation unit 7 connects the inter-pattern connection frame pairs, adds a transition weight b corresponding to the distance d to the transition between the feature spectra of each connected frame pair to form the speech pattern connection network, and stores it in the speech pattern connection network storage unit 8.

【００６９】そして、合成音声時には、与えられたテキ
ストを言語解析部１１によって言語解析し、音声単位ラ
ベル列変換部１２によって上記言語解析結果に基づいて
入力テキストを音声単位ラベル列に変換する。パラメー
タ列探索部１３は、上記変換された音声単位ラベル列に
基づいて、式(７)に従って、最終フレームまでの累積尤
度ｐ(Ｎ)を最大にする上記音声パターン接続ネットワー
ク上の最適経路を選択する。そうすると、音声合成部１
４は、上述のようにして選択された最適経路に合致した
合成パラメータ列に基づいて合成音声を生成するように
している。At synthesis time, the language analysis unit 11 analyzes the given text, and the speech unit label string conversion unit 12 converts the input text into a speech unit label string based on the analysis result. From the converted speech unit label string, the parameter string search unit 13 selects, according to equation (7), the optimal route on the speech pattern connection network that maximizes the cumulative likelihood p(N) up to the final frame. The speech synthesis unit 14 then generates synthesized speech from the synthesis parameter sequence corresponding to the selected optimal route.

[0070] That is, in the present embodiment the synthesis parameters are not stored as a database of speech segments. Instead, acoustically connectable inter-pattern connection frame pairs are identified in advance among natural speech patterns obtained by frame analysis of natural speech. In addition to the "invariant paths" between frames within a natural speech pattern, "transition paths" to frames of other natural speech patterns are provided, and both kinds of path are given transition weights. The result is stored as the speech pattern connection network. At synthesis time, the optimal path on this network is searched to obtain the synthesis parameters. Consequently, during speech synthesis by the editing synthesis method described above, feature spectra of other natural speech patterns obtained from articulatory states with similar acoustic features can be evaluated as candidate synthesis parameters, so that natural speech patterns can complement one another over even finer intervals. As a result, according to the present embodiment, speech data can be connected smoothly and naturally in frame units, which are finer than units based on phoneme information, and the synthesis parameter sequence best suited to the input text can be obtained.

[0071] In the speech pattern connection network stored in the speech pattern connection network storage unit 8, acoustically connectable inter-pattern connection frame pairs are identified in advance from the natural speech patterns, and transition paths are set only between those pairs in each natural speech pattern. Consequently, connection evaluation for acoustically unconnectable frame pairs is avoided at synthesis time, and the amount of calculation needed to search for connecting frames is greatly reduced.

[0072] As shown in FIG. 3 or FIG. 4, each node of the speech pattern connection network is associated with speech unit labels in the form of the probability that each speech unit label passes through that node. Therefore, if several speech unit labels can pass through a given node with some probability, the feature spectrum of that node can be used for different speech unit labels in different synthesis parameter sequences. A small amount of speech pattern connection network information can thus represent a large amount of spectral transition information, and feature spectra obtained from natural speech can be used effectively to generate synthesis parameters. This also eliminates the storage of redundant feature spectrum information and prevents a needless increase in storage.

[0073] In the present embodiment, the network evaluation unit 18 makes it possible to evaluate the speech pattern connection network stored in the speech pattern connection network storage unit 8 using limited text (speech patterns whose utterance content is known). In addition, the network editing unit 19 makes it possible to delete unnecessary transition paths that cause errors. This overcomes a drawback of text-to-speech synthesis, namely that it is difficult to reduce the risk of particularly undesirable speech segments being connected to each other even within a limited sentence space, and thereby improves the quality of speech synthesized from limited text.

[0074]

[Effects of the Invention] As is clear from the above, the speech synthesizer according to claim 1 stores, in the speech pattern connection network storage unit, a speech pattern connection network in which acoustically connectable frames among natural speech patterns obtained by frame analysis of natural speech are connected and a transition weight is attached to each connection between frames. The parameter sequence search unit searches for the optimal path that maximizes the cumulative transition weight when the network is traversed in the order of the speech unit label sequence derived from the input linguistic information, and thereby obtains a synthesis parameter sequence. The speech synthesis unit then generates synthesized speech from that synthesis parameter sequence and the prosodic parameters generated by the prosody generation unit. The synthesis parameter sequence can therefore be obtained with mutual complementation over even finer intervals within the natural speech patterns.
Thus, according to this invention, feature spectra can be connected smoothly in frame units finer than phonemes and the like, and highly natural synthesized speech can be generated.

[0075] Moreover, because the speech pattern connection network is formed by connecting only acoustically connectable frames among the natural speech patterns, the search for the optimal path at synthesis time avoids processing useless paths that cannot be connected acoustically. The amount of calculation at synthesis time can therefore be reduced.

[0076] In the speech synthesizer according to claim 2, the speech unit information adding unit attaches a speech unit label to each frame of the natural speech patterns; the inter-frame distance calculation unit obtains the distance between the feature spectra of frames belonging to different natural speech patterns; the transition weight calculation unit obtains the transition weight between frame pairs whose inter-frame distance is equal to or less than a predetermined value and which can therefore serve as inter-pattern connection frame pairs; and the transition weights are attached to the connections between frames of the speech pattern connection network formed by interconnecting the inter-pattern connection frame pairs of the plurality of natural speech patterns, the network then being stored in the speech pattern connection network storage unit. A speech pattern connection network in which acoustically connectable patterns among natural speech patterns are connected and given transition weights can therefore be created automatically from the natural speech patterns stored in the natural speech pattern storage unit.

[0077] The inter-frame distance calculation unit of the speech synthesizer according to claim 3 obtains a corrected distance by correcting the distance between the feature spectra of the two frames under comparison according to the power difference between the two feature spectra, non-speech portions of the two feature spectra, the direction of change of the two feature spectra, or the variance of the two feature spectra. The distance between frames can thus be calculated with a correction that emphasizes the difference in features between the two feature spectra. Therefore, according to this invention, an inter-frame distance can be calculated that allows acoustically similar frame pairs to be identified accurately.

[0078] The speech unit information adding unit of the speech synthesizer according to claim 4 refers to the frame cluster storage unit and attaches to each frame of the natural speech patterns the speech unit vector associated with the cluster to which that frame belongs, thereby attaching speech unit labels to each frame. A frame serving as a node of the speech pattern connection network can thus be associated with speech unit labels in the form of a vector whose elements are the passage probabilities of a plurality of speech unit labels. Consequently, the feature spectrum of a single node of the network can be associated with different speech unit labels, and a small amount of speech pattern connection network information can represent a large amount of articulatory transition information. Storage of redundant feature spectrum information is also eliminated, which prevents an increase in storage capacity.
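
As an illustration of how such speech unit vectors might be derived, the following Python sketch counts, for each cluster, how many of its frames carry each speech unit label and normalizes the counts. Normalizing within each cluster is an assumption made here, and the names `speech_unit_vectors`, `cluster_ids`, and `labels` are hypothetical; the clustering step itself is taken as already done.

```python
from collections import Counter, defaultdict

def speech_unit_vectors(cluster_ids, labels):
    """cluster_ids[i]: cluster of frame i; labels[i]: its speech unit label.
    Returns {cluster: {label: share of the cluster's frames with that label}}."""
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        counts[cluster][label] += 1
    vectors = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        vectors[cluster] = {label: n / total for label, n in counter.items()}
    return vectors
```

A frame assigned to a cluster then inherits that cluster's vector, so one node can serve more than one speech unit label.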

[0079] The inter-pattern connection frame testing unit of the speech synthesizer according to claim 5 uses frame merging means to merge a frame pair into a single frame when its inter-frame distance is equal to or less than another predetermined value that is smaller than the above predetermined value. Acoustically very similar frame pairs are thus merged into one frame, redundant feature spectrum information is deleted, and storage capacity can be reduced.
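
The two-threshold decision can be sketched as follows. The function name `classify_pair` and the string return values are hypothetical, chosen only to show the ordering of the two thresholds; how the merged frame's feature spectrum is formed is not specified here and is left out.

```python
def classify_pair(d, connect_threshold, merge_threshold):
    """Decide what to do with a frame pair at inter-frame distance d.
    merge_threshold must be smaller than connect_threshold."""
    if merge_threshold >= connect_threshold:
        raise ValueError("merge_threshold must be smaller than connect_threshold")
    if d <= merge_threshold:
        return "merge"    # nearly identical: keep a single frame
    if d <= connect_threshold:
        return "connect"  # similar: add a weighted transition path
    return "none"         # dissimilar: no path is created
```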

[0080] In the speech synthesizer according to claim 6, the network evaluation unit searches for the optimal path that maximizes the cumulative transition weight when the speech pattern connection network is traversed in the order of the speech unit label sequence of a specific speech pattern whose speech unit labels are known, and outputs the sequence of speech unit labels attached to the frames on that path as the recognition result for the specific speech pattern. The suitability of the speech pattern connection network for speech synthesis can therefore be evaluated from the match rate between the known speech unit label sequence and the recognition result (a speech unit label sequence), and from the cumulative transition weight of the optimal path.
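
One simple way to compute such a match rate is a position-by-position comparison, sketched below. It assumes that the reference and recognized label sequences are aligned frame by frame and have equal length; the function name `match_rate` is hypothetical, and the text does not state how the rate is actually calculated.

```python
def match_rate(reference, recognized):
    """Fraction of positions where the recognized label equals the known label."""
    if len(reference) != len(recognized):
        raise ValueError("sequences must be aligned and of equal length")
    if not reference:
        return 0.0
    hits = sum(r == h for r, h in zip(reference, recognized))
    return hits / len(reference)
```

A low rate on known utterances would point to transition paths that deserve inspection or editing.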

[0081] The speech synthesizer according to claim 7 has a network editing unit that changes the transition weights between frames in the speech pattern connection network. When speech synthesis is performed with a limited vocabulary, inappropriate inter-frame connections in the network can therefore be corrected easily, which improves the quality of speech synthesized from limited linguistic information.

[Brief description of the drawings]

FIG. 1 is a block diagram of a speech synthesizer according to the present invention.

FIG. 2 is a flowchart of the speech pattern connection network generation process performed by the speech pattern connection network creation device shown in FIG. 1.

FIG. 3 is a schematic diagram showing the optimal path of one synthesis parameter sequence passing through part of the speech pattern connection network structure stored in the speech pattern connection network storage unit in FIG. 1.

FIG. 4 is a schematic diagram showing the optimal path of a synthesis parameter sequence different from that of FIG. 3.

[Explanation of symbols]

1: speech pattern connection network creation device, 2: natural speech pattern storage unit, 3: speech unit information adding unit, 4: inter-frame distance calculation unit, 5: inter-pattern connection frame testing unit, 6: frame merging means,
7: transition weight calculation unit, 8: speech pattern connection network storage unit, 9: frame cluster codebook storage unit, 10: text input unit, 11: language analysis unit, 12: speech unit label sequence conversion unit, 13: parameter sequence search unit,
14: speech synthesis unit, 15: prosody generation unit, 16: speech output unit, 17: speaker, 18: network evaluation unit, 19: network editing unit.

Claims

[Claims]

1. A speech synthesizer comprising: an input unit through which linguistic information is input; a speech pattern connection network storage unit storing a speech pattern connection network formed by interconnecting, among a plurality of natural speech patterns obtained by frame analysis of natural speech from a speaker and having a speech unit label attached to each frame, frame pairs whose feature spectra are similar, and by attaching a transition weight to each connection between frames; a parameter sequence search unit that refers to the speech unit label of each frame constituting the speech pattern connection network, searches for the optimal path that maximizes the cumulative transition weight when the speech pattern connection network is traversed in the order of the speech unit label sequence based on the linguistic information input from the input unit, and outputs the sequence of feature spectra of the frames on the found optimal path as a synthesis parameter sequence; a prosody generation unit that generates prosody control parameters based on the input linguistic information; and a speech synthesis unit that generates synthesized speech based on the synthesis parameter sequence from the parameter sequence search unit and the prosody control parameters from the prosody generation unit.

2. The speech synthesizer according to claim 1, further comprising: a natural speech pattern storage unit storing natural speech patterns obtained by frame analysis of natural speech from a speaker; a speech unit information adding unit that attaches the speech unit label to each frame of the natural speech patterns; an inter-frame distance calculation unit that calculates the distance between the feature spectra of frames belonging to different natural speech patterns; an inter-pattern connection frame testing unit that tests, according to whether the inter-frame distance is equal to or less than a predetermined value, whether the frame pair having that inter-frame distance can serve as an inter-pattern connection frame pair connecting the different natural speech patterns; and a transition weight calculation unit that obtains the transition weights for transitions between frame pairs that can serve as inter-pattern connection frame pairs and between already connected frame pairs, attaches the transition weights to the connections between frames of the speech pattern connection network formed by interconnecting the inter-pattern connection frame pairs of the plurality of natural speech patterns, and stores the network in the speech pattern connection network storage unit.

3. The speech synthesizer according to claim 2, wherein the inter-frame distance calculation unit obtains, for the distance between the feature spectra of the two frames subject to distance calculation, a corrected distance corrected according to the power difference between the two feature spectra, non-speech portions of the two feature spectra, the direction of change of the two feature spectra, or the variance of the two feature spectra.

4. The speech synthesizer according to claim 2, further comprising a frame cluster storage unit that stores, in association with each cluster obtained by clustering all frames of the natural speech patterns, a speech unit vector whose elements are the ratios, for each speech unit label, of the number of frames falling into that cluster, wherein the speech unit information adding unit refers to the frame cluster storage unit and attaches to each frame of the natural speech patterns the speech unit vector associated with the cluster to which that frame belongs, thereby attaching the speech unit label to each frame.

5. The speech synthesizer according to claim 2, wherein the inter-pattern connection frame testing unit comprises frame merging means for merging, when the inter-frame distance is equal to or less than another predetermined value smaller than the predetermined value, the frame pair having that inter-frame distance into a single frame.

6. The speech synthesizer according to claim 1, further comprising a network evaluation unit that refers to the speech unit label of each frame constituting the speech pattern connection network, searches for the optimal path that maximizes the cumulative transition weight when the speech pattern connection network is traversed in the order of the speech unit label sequence of a specific speech pattern whose speech unit labels are known, and outputs the sequence of speech unit labels attached to the frames on the found optimal path as the recognition result for the specific speech pattern, whereby the suitability of the speech pattern connection network can be evaluated from the match rate between the speech unit label sequence of the specific speech pattern and the recognition result and from the cumulative transition weight of the optimal path.

7. The speech synthesizer according to claim 1, further comprising a network editing unit that changes the transition weights between the frames in the speech pattern connection network.