JP2008026452A

JP2008026452A - Speech synthesizer, method and program

Info

Publication number: JP2008026452A
Application number: JP2006196662A
Authority: JP
Inventors: Nobuyuki Nishizawa; 信行西澤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2006-07-19
Filing date: 2006-07-19
Publication date: 2008-02-07
Anticipated expiration: 2026-07-19
Also published as: JP4882569B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer that performs speech synthesis at high speed by reducing the amount of processing in the synthesizer, in particular the amount of processing in preliminary selection.

SOLUTION: A cost deterioration value database is provided that holds the cost deterioration value of each speech segment for each piece of synthesis target information. Taking the speech segment sequence selected by a predetermined criterion as the reference, the cost deterioration value indicates the degree of degradation of the synthesized speech waveform of the speech segment sequence selected by the same criterion under the condition that a given speech segment is used for a given piece of synthesis target information. For each piece of input synthesis target information, one or more pieces of synthesis target information contained in the cost deterioration value database are selected. Speech segment candidates for the input synthesis target information are then selected, based on the cost deterioration values, from the speech segments corresponding to the selected synthesis target information. Finally, the series of speech segments used for the output synthesized speech waveform is selected from the series of input synthesis target information and the speech segment candidates selected for each piece of it.

COPYRIGHT: (C) 2008, JPO & INPIT

Description

The present invention relates to a speech synthesizer, method, and program that synthesize speech by selecting optimal speech segments from a database of speech segments and concatenating the selected segments.

Segment-concatenation speech synthesis, one of several speech synthesis techniques, stores a large number of speech segments in a database in advance. At synthesis time, it selects from the segment database speech segments that are close to each parameter of the specified synthesis target information and that connect well with the preceding and following segments, and synthesizes speech from them. Each speech segment is annotated with parameters such as a phoneme label, acoustic parameters, and the context in which it appears in the speech corpus.

In segment-concatenation speech synthesis, the speech segments to be used are selected based on the specified synthesis target information (hereinafter, "segment selection"). Segment selection is based on a distortion measure called cost, that is, an index of how far the speech waveform synthesized from the selected segments degrades from the target synthesized waveform. The cost is usually divided into a target cost, which indicates the error between the synthesis target information and a speech segment, and a connection cost, which indicates the degree of discontinuity between adjacent speech segments. Segment selection is performed so as to minimize the overall cost.
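As a minimal sketch of the cost described above, the total cost of one candidate segment sequence can be written as the sum of its target costs and connection costs. The function name and the list-based layout are assumptions made for illustration; the source does not prescribe how the individual costs are computed.

```python
def sequence_cost(target_costs, connection_costs):
    """Total cost of one candidate segment sequence.

    target_costs[i]:     target cost of the i-th selected segment.
    connection_costs[i]: connection cost between segment i and segment i + 1.
    """
    if len(connection_costs) != len(target_costs) - 1:
        raise ValueError("need exactly one connection cost per adjacent pair")
    return sum(target_costs) + sum(connection_costs)
```

Segment selection then amounts to finding the sequence that minimizes this value over all candidate combinations.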

Obtaining more natural synthesized speech requires a large segment database, but as the database grows, the number of segment combinations to consider becomes enormous and segment selection becomes difficult. To address this, a configuration has been proposed in which, before the final selection that considers both target cost and connection cost, a preliminary selection narrows down the speech segments in the segment database based on target cost.

In recent years, however, segment databases have grown ever larger, and as a result the processing amount of the preliminary selection, which is itself performed to reduce the processing amount of segment selection, can no longer be ignored. Patent Document 1 therefore describes a configuration that reduces the preliminary selection processing by performing it in multiple stages. According to Patent Document 1, the target cost used for preliminary selection is divided into sub-costs such as fundamental frequency, duration, and MFCC (Mel-Frequency Cepstrum Coefficient), and candidates are narrowed down by each sub-cost in turn, reducing the total number of calculations.

Patent Document 2 describes a configuration that reduces selection processing by synthesizing test sentences to measure how often each speech segment is selected and deleting rarely selected segments from the segment database.

Patent Document 1: JP 2005-265895 A
Patent Document 2: JP 2004-37605 A

The configuration of Patent Document 1 is the same as the prior art in that every speech segment candidate must still be evaluated at synthesis time, so the reduction in processing it can achieve is limited. Moreover, because preliminary selection and segment selection use different selection criteria, the candidates chosen in preliminary selection are not necessarily the most favorable under the criterion of segment selection. Coping with this requires selecting as many candidates as possible in preliminary selection, but increasing that number increases the processing amount of segment selection.

With the configuration of Patent Document 2, when there are few test sentences, a frequently used speech segment that happens not to appear in the test sentences may be deleted from the segment database, and a rarely used segment that happens to appear may remain. Coping with this requires a volume of test sentences that is large relative to the segment database, which becomes difficult to prepare as the database grows.

Accordingly, an object of the present invention is to provide a speech synthesizer, method, and program that perform speech synthesis at high speed by reducing the amount of processing in the speech synthesizer, in particular in preliminary selection.

A speech synthesizer according to the present invention selects a series of speech segments from a plurality of speech segments held in a segment database based on a series of input synthesis target information, concatenates the speech waveforms corresponding to the selected segments, and outputs a synthesized speech waveform. It comprises cost deterioration value calculation means, preliminary selection means, and segment selection means.

The cost deterioration value calculation means takes as its reference the speech segment sequence selected from the segment database by a predetermined criterion for a series of test synthesis target information. For a given piece of test synthesis target information and a given speech segment, it obtains the cost deterioration value of that segment for that target information, which indicates the degree of degradation of the synthesized speech waveform of the segment sequence selected by the same criterion under the condition that the given segment is used for the given target information. It generates a cost deterioration value database holding, for each piece of test synthesis target information, the cost deterioration values of the corresponding speech segments.

The preliminary selection means, for each piece of input synthesis target information, selects one or more pieces of test synthesis target information from the cost deterioration value database and, from the speech segments corresponding to the selected test synthesis target information, selects speech segment candidates for the input synthesis target information based on the cost deterioration values.

The segment selection means, based on the series of input synthesis target information, selects the series of speech segments to be used for the output synthesized speech waveform from among the candidates that the preliminary selection means selected for each piece of input synthesis target information.

According to another embodiment of the speech synthesizer of the present invention, the preliminary selection means preferably comprises preliminary selection result database generation means and selection means. The former groups the test synthesis target information in the cost deterioration value database, selects a plurality of speech segments based on the cost deterioration values from the speech segments corresponding to the test synthesis target information of each group, and records the selected segments in association with the group. The latter determines the group to which each piece of input synthesis target information belongs and selects the speech segments associated with that group as the speech segment candidates for that input synthesis target information.

According to another embodiment of the speech synthesizer of the present invention, the preliminary selection result database generation means preferably takes all test synthesis target information belonging to the same unit speech as the root node, builds a decision tree by successively splitting the root node, and treats the test synthesis target information contained in each leaf node as one group. The selection means then determines the group to which input synthesis target information belongs by following the decision tree.

According to a further embodiment of the speech synthesizer of the present invention, the preliminary selection result database generation means preferably builds the decision tree by repeating the set split that minimizes an evaluation value. The evaluation value is the sum of the evaluation values of the leaf nodes produced by the split, each weighted according to the number of pieces of test synthesis target information the leaf node contains. The evaluation value of a leaf node is the average of the cost deterioration values of a predetermined number of speech segments selected, based on the cost deterioration values, from the speech segments corresponding to the test synthesis target information contained in that leaf node.
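The split criterion above can be sketched as follows. This is only one possible reading: the text does not say how a segment's cost deterioration values are combined across the targets in a leaf, nor the exact weighting, so the sketch assumes (a) each segment is scored by its mean deterioration value over the leaf's targets, (b) the `n_best` lowest-scoring segments are selected, and (c) the weight of a leaf is simply its number of targets. All names are hypothetical.

```python
import heapq

def leaf_value(rows, n_best):
    """Evaluation value of one non-empty leaf.

    rows: list of {segment id: cost deterioration value}, one dict per
    test synthesis target contained in the leaf.
    """
    segments = set().union(*rows)
    # Assumption: score each segment by its mean value over the leaf's targets.
    mean_d = {}
    for seg in segments:
        values = [row[seg] for row in rows if seg in row]
        mean_d[seg] = sum(values) / len(values)
    best = heapq.nsmallest(n_best, mean_d.values())
    return sum(best) / len(best)

def split_value(leaves, n_best):
    """Leaf values weighted by the number of targets in each leaf, then summed."""
    return sum(len(rows) * leaf_value(rows, n_best) for rows in leaves)
```

Under these assumptions, a tree builder would evaluate `split_value` for every candidate split of a node and keep the split with the smallest value.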

According to a further embodiment of the speech synthesizer of the present invention, the segment selection means preferably selects the series of speech segments considering only the degradation of the synthesized speech waveform caused by concatenating speech segments.

According to a further embodiment of the speech synthesizer of the present invention, the cost deterioration value calculation means preferably selects from the segment database one or more speech segments corresponding to each piece of the series of test synthesis target information, arranges the selected segments in the same order as the series of test synthesis target information, maps each segment to a node of a search graph, and computes costs indicating the degree of degradation of the synthesized speech waveform by dynamic programming from both the head and the tail. It thereby obtains the cost of the reference speech segment sequence and the cost of the speech segment sequence under the condition that the given speech segment is used, and from these costs obtains, for each piece of test synthesis target information, the cost deterioration values of the speech segments selected from the segment database.

A speech synthesis method according to the present invention outputs a synthesized speech waveform from a series of input synthesis target information using a cost deterioration value database. The database holds cost deterioration values of speech segments for test synthesis target information: taking as the reference the speech segment sequence selected from a segment database by a predetermined criterion for a series of test synthesis target information, each value indicates the degree of degradation of the synthesized speech waveform of the segment sequence selected by the same criterion under the condition that a given speech segment is used for a given piece of test synthesis target information.

The method comprises the steps of: selecting, for each piece of input synthesis target information, one or more pieces of test synthesis target information from the cost deterioration value database and selecting, from the speech segments corresponding to the selected test synthesis target information, speech segment candidates for the input synthesis target information based on the cost deterioration values; selecting, based on the series of input synthesis target information, the series of speech segments to be used for the output synthesized speech waveform from among the candidates for each piece of input synthesis target information; and concatenating the speech waveforms corresponding to the selected speech segments to generate the synthesized speech waveform.

A program according to the present invention causes a computer to function as the above speech synthesizer.

A cost deterioration value database holding the cost deterioration value of each speech segment for test synthesis target information is created in advance, and preliminary selection is performed based on the cost deterioration values. The cost deterioration value of a speech segment for a piece of test synthesis target information is defined relative to the synthesized speech waveform of the segment sequence selected by a predetermined criterion: it indicates how much the synthesized waveform degrades when the segment sequence is selected by the same criterion under the condition that the segment is used for that test synthesis target information, and it therefore also takes the connections between segments into account. In other words, the criterion for preliminary selection is of the same kind as the criterion for segment selection, so preliminary selection becomes more accurate. Speech synthesis can therefore be performed efficiently and at high speed without increasing the processing amount of preliminary selection or segment selection.

Furthermore, because preliminary selection is more accurate, high-quality speech can be synthesized even if the number of candidates passed on by preliminary selection is reduced. In addition, because the cost deterioration value provides an evaluation of how usable each speech segment is, high-quality synthesis is possible without a large volume of test sentences for building the cost deterioration value database, unlike the prior art that relies on frequency information from segment selection results.

By grouping the test synthesis target information in the cost deterioration value database in advance and selecting speech segments for each group, the work of the preliminary selection means at synthesis time is reduced to determining the group to which the input synthesis target information belongs, which reduces the processing amount of preliminary selection. A decision tree can be used to create the groups and to determine group membership. Even when input synthesis target information differs from every piece of test synthesis target information, a suitable result reflecting the results for similar test synthesis target information can be returned quickly.

Selecting the series of speech segments considering only the degradation caused by segment connections makes it possible to omit the target cost calculation, which requires acoustic feature parameters, and so reduces the processing amount of speech synthesis. Furthermore, computing the cost deterioration values by mapping each speech segment to a node of a search graph and performing one dynamic-programming cost calculation from the head and one from the tail reduces the processing load of creating the cost deterioration value database.

The best mode for carrying out the present invention is described in detail below with reference to the drawings. Here, “unit speech” is the unit of synthesis processing of the speech synthesizer according to the present invention, and a plurality of speech segments (hereinafter simply “segments”) correspond to one unit speech. Examples of unit speech are phonemes, syllables, and words. Each segment may correspond to only one unit speech or to several. Synthesis target information is an index for selecting the segments to be used for the speech waveform to be synthesized and, like a segment, is associated with a unit speech. That is, given synthesis target information, the segments corresponding to it are found via the unit speech, and the speech synthesizer selects the optimal segment from among them, considering target cost and connection cost, to synthesize speech.

FIG. 1 is a block diagram of a speech synthesizer according to the present invention. As shown in FIG. 1, the speech synthesizer includes a preprocessing unit 1, a cost deterioration value calculation unit 2, a preliminary selection unit 3, a segment selection unit 4, a speech waveform connection unit 5, a cost deterioration value database 6, a segment information database 7, and a speech waveform database 8.

The segment information database 7 holds, for each segment, a segment ID identifying the segment and the segment's parameters, such as fundamental frequency, MFCC, and duration. The speech waveform database 8 holds the actual waveform of the segment identified by each segment ID. Together, the segment information database 7 and the speech waveform database 8 constitute the segment database.

The preprocessing unit 1 divides input text into unit speeches by morphological analysis, syntactic analysis, word dictionary lookup, and the like, attaches the prosodic and other information obtained by the analysis, and outputs the result as text analysis information. It then attaches acoustic feature parameters such as MFCC and fundamental frequency to the text analysis information and outputs the result as a series of synthesis target information.

The cost deterioration value calculation unit 2 uses the segment information database 7 to generate the cost deterioration value database 6 from the series of synthesis target information generated from the input text supplied to the preprocessing unit 1. The database holds, for each piece of synthesis target information, the cost deterioration values of the corresponding segments. The cost deterioration value database 6 is created before speech synthesis begins, using one or more test sentences as the input text.

The creation of the cost deterioration value database 6 by the cost deterioration value calculation unit 2 is described below, taking as an example a test sentence consisting of four unit speeches. The preprocessing unit 1 divides the input test sentence into four unit speeches and generates synthesis target information for each. The cost deterioration value calculation unit 2 acquires from the segment information database 7 the segment information of the segments corresponding to each of the four pieces of synthesis target information generated by the preprocessing unit 1. In this example, the segment information database 7 is assumed to hold four corresponding segments for each piece of synthesis target information.

FIG. 4 shows a search graph for calculating cost deterioration values. Nodes n11 to n14 represent the four segments corresponding to the first synthesis target information, nodes n21 to n24 the four segments for the second, nodes n31 to n34 the four for the third, and nodes n41 to n44 the four for the fourth. The search graph is completed by adding a node X at the head and a node Y at the tail, both representing silence.

The edges between nodes correspond to connection costs and the nodes to target costs. The cost from one node to another is obtained by successively adding the connection cost of each edge on the path and the target cost of each node passed. The target costs of node X and node Y are 0. Suppose that, among the paths from node X to node Y, the optimal (minimum-cost) path passes through nodes n11, n22, n32, and n43, and that the minimum-cost path under the condition that node n23 is used for the second synthesis target information passes through nodes n14, n23, n34, and n43. The degree to which the cost of the path n14, n23, n34, n43 is worse than the cost of the path n11, n22, n32, n43 is then the cost deterioration value of the segment corresponding to node n23 for the second synthesis target information.

In other words, the reference is the segment sequence selected as optimal by a predetermined criterion for the series of synthesis target information given by a test sentence. The cost deterioration value of a segment for a piece of synthesis target information is the index of how much the synthesized speech waveform degrades when the segment sequence is selected by the same criterion under the condition that the segment is used for that synthesis target information. In the following description, the difference between the cost under the condition that a given segment is used and the cost of the optimal path is used as the cost deterioration value.
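The definition above can be checked directly on a small graph by enumerating every path. The sketch below does exactly that; it is a brute-force illustration, not the calculation procedure of the embodiment. The data layout is an assumption: `target[t][u]` is the target cost of segment `u` at position `t`, `connection[t][u][v]` is the connection cost between segment `u` at position `t` and segment `v` at position `t + 1`, and `start_cost[u]` / `end_cost[u]` are the connection costs to the silence nodes X and Y.

```python
from itertools import product

def path_cost(path, target, connection, start_cost, end_cost):
    """Cost of X -> path[0] -> ... -> path[-1] -> Y (X and Y have target cost 0)."""
    cost = start_cost[path[0]] + end_cost[path[-1]]
    cost += sum(target[t][u] for t, u in enumerate(path))
    cost += sum(connection[t][path[t]][path[t + 1]] for t in range(len(path) - 1))
    return cost

def deterioration_brute_force(target, connection, start_cost, end_cost, t_fixed, u_fixed):
    """(Minimum cost with segment u_fixed forced at position t_fixed) - (global minimum)."""
    paths = list(product(*(range(len(column)) for column in target)))
    costs = {p: path_cost(p, target, connection, start_cost, end_cost) for p in paths}
    best = min(costs.values())
    forced = min(c for p, c in costs.items() if p[t_fixed] == u_fixed)
    return forced - best
```

Every segment on the optimal path has a deterioration value of 0 and every other segment has a non-negative value. Enumeration grows exponentially with the number of positions, which is why a more economical calculation is needed in practice.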

In this embodiment, the cost deterioration value of each segment is calculated in a simple manner using dynamic programming on the search graph of FIG. 4. Specifically, for each node at t = 1, the sum of the connection cost from node X and the node's target cost is recorded as the node's minimum cost from node X. Next, for each node at t = 2, the connection cost of the edge from each of the four nodes one time step earlier is added to the minimum cost recorded at that earlier node; the smallest of these four sums plus the node's own target cost is recorded as the node's minimum cost from node X. This is repeated in time order until node Y is reached. As a result, the minimum cost $c(X \to n)$ from node X to every node n in the search graph is recorded.

The same processing is then performed in the reverse direction along the time axis, starting from node Y. In this pass, however, the value stored for node n is

$$c'(Y \to n) = c(Y \to n) - c_t(n),$$

where $c_t(n)$ is the target cost of node n. For any node n on the graph, the minimum cost over the paths X → n → Y is $c(X \to n) + c(n \to Y) - c_t(n)$, because the target cost of n is included in both $c(X \to n)$ and $c(n \to Y)$ and must be counted only once. Since $c(n \to Y) = c(Y \to n)$,

$$c(X \to n) + c(n \to Y) - c_t(n) = c(X \to n) + c'(Y \to n),$$

and the cost deterioration value $d(n)$ is obtained as

$$d(n) = c(X \to n) + c'(Y \to n) - c(X \to Y).$$

In this embodiment, performing the processing once from node X and once from node Y yields the cost deterioration value of the segment at every node; there is no need to search separately for the minimum cost from each node to node Y.
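The two-pass calculation can be sketched as follows. The data layout is an assumption for illustration: `target[t][u]` is the target cost of segment `u` at position `t`, `connection[t][u][v]` is the connection cost between segment `u` at position `t` and segment `v` at position `t + 1`, and `start_cost[u]` / `end_cost[u]` are the connection costs to nodes X and Y. `fwd` holds $c(X \to n)$ and `bwd` holds $c'(Y \to n)$.

```python
def cost_deterioration(target, connection, start_cost, end_cost):
    """Return d[t][u] for every node using one forward and one backward pass."""
    T = len(target)
    # Forward pass: fwd[t][u] = c(X -> n), including the target cost of n.
    fwd = [[start_cost[u] + target[0][u] for u in range(len(target[0]))]]
    for t in range(1, T):
        fwd.append([
            min(fwd[t - 1][p] + connection[t - 1][p][u]
                for p in range(len(target[t - 1]))) + target[t][u]
            for u in range(len(target[t]))
        ])
    # Backward pass: bwd[t][u] = c'(Y -> n), excluding the target cost of n.
    bwd = [None] * T
    bwd[T - 1] = [end_cost[u] for u in range(len(target[T - 1]))]
    for t in range(T - 2, -1, -1):
        bwd[t] = [
            min(connection[t][u][s] + target[t + 1][s] + bwd[t + 1][s]
                for s in range(len(target[t + 1])))
            for u in range(len(target[t]))
        ]
    # c(X -> Y): cost of the best complete path.
    best = min(fwd[T - 1][u] + bwd[T - 1][u] for u in range(len(target[T - 1])))
    # d(n) = c(X -> n) + c'(Y -> n) - c(X -> Y)
    return [[fwd[t][u] + bwd[t][u] - best for u in range(len(target[t]))]
            for t in range(T)]
```

Each pass visits every edge once, so the work is proportional to the number of edges in the search graph rather than to the number of paths.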

FIG. 5 shows the cost deterioration value database 6. For each piece of synthesis target information A1, A2, A3, A4, A5, ... belonging to a given unit speech, it lists the segments belonging to the same unit speech and their cost deterioration values. For example, if the unit speech is the phoneme “a” (あ), then A1, A2, A3, A4, and A5 are pieces of synthesis target information belonging to the unit speech “a” obtained from the test sentences, and a_1, a_2, a_3, ... are the prepared segments that belong to the unit speech “a”. Although not shown, the database contains the same kind of entries for every unit speech; for example, for synthesis target information B1, B2, ... belonging to the unit speech “i” (い), it holds the cost deterioration value of each of the prepared segments b_1, b_2, ....
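One way to hold such a table in memory is a nested mapping from unit speech to synthesis target to per-segment values. The class and method names below are hypothetical and only illustrate the layout described for FIG. 5; the source does not specify a storage format.

```python
from collections import defaultdict

class CostDeteriorationDB:
    """unit speech -> target id -> {segment id: cost deterioration value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def add(self, unit_speech, target_id, segment_id, value):
        """Record the cost deterioration value of one segment for one target."""
        self._rows[unit_speech].setdefault(target_id, {})[segment_id] = value

    def row(self, unit_speech, target_id):
        """All segments and their deterioration values for one target."""
        return dict(self._rows[unit_speech][target_id])

    def targets(self, unit_speech):
        """Ids of the targets stored for a unit speech, in insertion order."""
        return list(self._rows.get(unit_speech, {}))
```

A row corresponds to one line of FIG. 5, for example target A1 with its values for a_1, a_2, a_3, and so on.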

During actual speech synthesis, the preliminary selection unit 3 refers to the cost deterioration value database 6 for each item in the series of synthesis target information generated from the input text by the preprocessing unit 1, selects segment candidates for each item of synthesis target information, and outputs the segment IDs of the selected segment candidates.

As one example of how the preliminary selection unit 3 selects segment candidates, when synthesis target information identical to that input from the preprocessing unit 1 exists in the cost deterioration value database 6, the unit selects, from the segments corresponding to that synthesis target information, either a predetermined number of segments in ascending order of cost deterioration value or the segments whose cost deterioration value is smaller than a predetermined value. When no identical synthesis target information exists in the cost deterioration value database 6, segment candidates are selected in the same way from the closest synthesis target information in the database.

As another example, the preliminary selection unit 3 selects from the cost deterioration value database 6 a predetermined number of items of synthesis target information close to the synthesis target information input from the preprocessing unit 1, averages the cost deterioration values of each identical segment across the selected items, and selects either a predetermined number of segments in ascending order of average or the segments whose average is smaller than a predetermined value. The two methods may also be combined: when synthesis target information identical to the input exists in the cost deterioration value database 6, candidates are selected from that synthesis target information; when it does not, a predetermined number of items of synthesis target information are selected from the database and candidates are selected from the averaged cost deterioration values.
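A minimal sketch of the averaging method follows. The source does not say how synthesis target information is represented or how closeness is measured, so the tuple-of-floats encoding, the Euclidean distance, and the names `preselect`, `n_targets` and `n_units` are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Target = Tuple[float, ...]                # assumed numeric encoding of a target
CostDb = Dict[Target, Dict[str, float]]   # db[target][segment_id] = d


def preselect(query: Target, db: CostDb, n_targets: int, n_units: int) -> List[str]:
    """Average d over the nearest stored targets and keep the best segments."""
    # Assumed closeness measure: Euclidean distance between encodings.
    nearest = sorted(db, key=lambda t: math.dist(t, query))[:n_targets]

    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for t in nearest:
        for unit, d in db[t].items():
            sums[unit] = sums.get(unit, 0.0) + d
            counts[unit] = counts.get(unit, 0) + 1

    mean = {u: sums[u] / counts[u] for u in sums}
    # Ties are broken by segment ID so the result is deterministic.
    return sorted(mean, key=lambda u: (mean[u], u))[:n_units]
```

With `n_targets=1` the sketch reduces to the first method: an exact match has distance zero and is therefore the single nearest entry, and otherwise the closest stored entry is used.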

The segment selection unit 4 acquires the parameters of the segment candidates selected by the preliminary selection unit 3 from the segment information database 7, selects from the candidates the segments that form the optimal combination based on the series of synthesis target information from the preprocessing unit 1, and outputs the segment IDs of the selected segments.

The speech waveform connection unit 5 acquires the waveform information of the corresponding segments from the speech waveform database 8 based on the segment IDs output by the segment selection unit 4, connects the waveforms, and outputs the synthesized speech waveform.

In the speech synthesizer according to the present invention, the cost deterioration values of the segments belonging to each item of synthesis target information are obtained in advance from test sentences, and preliminary selection is performed based on these cost deterioration values. The cost deterioration value expresses the degree of degradation of the synthesized speech waveform when a segment is forcibly selected; that is, it is an index that includes both the target cost and the connection cost. Preliminary selection in the present invention therefore narrows down the segments prepared in the segment database using an index that includes both costs, which raises the likelihood that the segments the segment selection unit 4 would choose from all segments, if no preliminary selection were performed, are included among the segment candidates. In other words, the accuracy of preliminary selection improves. Because the segment selection unit 4 selects from such segment candidates, speech synthesis can be performed efficiently and at high speed without increasing the amount of processing in the preliminary selection unit 3 and the segment selection unit 4.

Furthermore, because the accuracy of preliminary selection improves, high-quality speech can be synthesized even when the number of candidates in preliminary selection is reduced. In addition, using the cost deterioration value provides an evaluation of usability for every segment candidate corresponding to the unit speech contained in the test sentences. Unlike the prior art, which uses frequency information from segment selection results, high-quality speech synthesis is therefore possible without a large number of test sentences.

Next, another embodiment of the speech synthesizer according to the present invention will be described. The block diagram of the entire speech synthesizer of this embodiment is the same as FIG. 1, but in this embodiment the preliminary selection unit 3 includes, as shown in FIG. 2, a preliminary selection result database generation unit 31 and a selection unit 32.

The preliminary selection result database generation unit 31 creates the preliminary selection result database 9 in advance from the cost deterioration value database 6. FIG. 7 shows the preliminary selection result database 9. As shown in FIG. 7, the preliminary selection result database 9 is obtained by grouping, by a predetermined method described later, the synthesis target information in the cost deterioration value database 6 that belongs to the same unit speech, and by selecting in advance, also as described later, a predetermined number of segments from the segments corresponding to the synthesis target information in each group. In FIG. 7, for example, the synthesis target information A1, A2, A3 forms one group, and the k segments a_1, a_5, a_9, ..., a_38 have been selected for this group.

The selection unit 32 determines to which group each item of synthesis target information from the preprocessing unit 1 belongs, and outputs the segment IDs of the k segments corresponding to that group as segment candidates. The creation of the preliminary selection result database 9 by the preliminary selection result database generation unit 31 is described below.

The preliminary selection result database generation unit 31 builds a decision tree for each unit speech in order to group the synthesis target information. FIG. 6 shows a decision tree used to create the preliminary selection result database 9. First, the set of all synthesis target information belonging to the same unit speech is denoted T and taken as the root node of the decision tree. A question q0 divides T into the set T1 of synthesis target information that satisfies q0 and the set T2 of synthesis target information that does not. The tree structure is built by repeating this recursively, dividing T1 by a question q1 and T2 by a question q2. Let d(u, t) be the cost deterioration value when segment u is used for synthesis target information t, let t1 and n1 denote the synthesis target information belonging to T1 and its number of items, and let t2 and n2 denote the synthesis target information belonging to T2 and its number of items. For every u, the average of d(u, t1) over all t1 is computed; the k1 smallest of these averages are chosen, and their mean is denoted m1. Likewise, for every u, the average of d(u, t2) over all t2 is computed; the k2 smallest are chosen, and their mean is denoted m2. Here k1 and k2 are suitably chosen integers. Each question q is then chosen so that it minimizes expression (1)
(n1 × m1 + n2 × m2) / (n1 + n2)   (1)
and can be answered with either Yes or No. In other words, the decision tree is built by repeating this set division. A question answerable with Yes or No is, for a discrete quantity contained in the synthesis target information, a question such as "the central phoneme is a vowel", and, for a continuous quantity, a question such as "the duration of the segment is less than 50 ms". A number of such questions are prepared empirically in advance with reference to the information contained in the synthesis target information.
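The choice of a question by expression (1) can be sketched as below. The data layout is an assumption made for illustration: `costs[u][t]` holds d(u, t) keyed by hypothetical segment and target IDs, `features[t]` holds whatever attributes the questions inspect, and `questions` maps a label to a Yes/No predicate. Skipping questions that leave one side empty is also an assumption, not something the source states.

```python
from typing import Callable, Dict, List, Optional, Tuple

Costs = Dict[str, Dict[str, float]]  # costs[segment_id][target_id] = d(u, t)


def group_score(targets: List[str], costs: Costs, k: int) -> float:
    """Mean of the k smallest per-segment averages of d(u, t) over targets."""
    per_unit = sorted(
        sum(costs[u][t] for t in targets) / len(targets) for u in costs
    )
    best = per_unit[:k]
    return sum(best) / len(best)


def split_score(yes: List[str], no: List[str], costs: Costs,
                k1: int, k2: int) -> float:
    """Expression (1): (n1*m1 + n2*m2) / (n1 + n2)."""
    n1, n2 = len(yes), len(no)
    m1 = group_score(yes, costs, k1)
    m2 = group_score(no, costs, k2)
    return (n1 * m1 + n2 * m2) / (n1 + n2)


def best_question(
    features: Dict[str, dict],
    questions: Dict[str, Callable[[dict], bool]],
    costs: Costs,
    k1: int,
    k2: int,
) -> Optional[Tuple[str, float]]:
    """Return (label, score) of the question minimizing expression (1)."""
    best: Optional[Tuple[str, float]] = None
    for label, ask in questions.items():
        yes = [t for t in features if ask(features[t])]
        no = [t for t in features if not ask(features[t])]
        if not yes or not no:  # assumed: ignore questions that do not split
            continue
        score = split_score(yes, no, costs, k1, k2)
        if best is None or score < best[1]:
            best = (label, score)
    return best
```

A low score means that each side of the split has at least a few segments that work well for all of its targets, which is what makes a shared candidate list per group reasonable.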

To keep the tree structure from growing too large, division stop conditions such as the following are set.
(A) The number of levels from the root node reaches a predetermined value (for example, 30 levels) or more.
(B) The difference between the evaluation values before and after division is less than a threshold. Here, the evaluation value before division is the mean m obtained by computing, for every u, the average of d(u, t) over all t and choosing the k smallest of these averages. The evaluation value after division is the value given by expression (1) above.
(C) The number of items of synthesis target information contained in the node, that is, the number n of synthesis target information t in the group, is equal to or less than a certain number.
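Conditions (A) to (C) amount to a single predicate evaluated before each division, as in this sketch. Only the depth limit of 30 is taken from the text; the function name and the default values of `min_gain` and `min_targets` are hypothetical placeholders.

```python
def should_stop(
    depth: int,
    n_targets: int,
    score_before: float,
    score_after: float,
    max_depth: int = 30,     # example value given in the text
    min_gain: float = 0.01,  # hypothetical threshold for condition (B)
    min_targets: int = 5,    # hypothetical limit for condition (C)
) -> bool:
    """Return True when any of the stop conditions (A), (B), (C) holds."""
    too_deep = depth >= max_depth                         # (A)
    too_little_gain = (score_before - score_after) < min_gain  # (B)
    too_few_targets = n_targets <= min_targets            # (C)
    return too_deep or too_little_gain or too_few_targets
```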

A leaf node that does not satisfy the division stop conditions is divided further; when a stop condition is satisfied, division stops. Then, for each leaf node, let Tm be the set of synthesis target information it contains, tm each item of synthesis target information belonging to Tm, and N the number of items in Tm. The average cost deterioration value for a segment u,

(1/N) Σ d(u, tm), summed over all tm in Tm,

is computed for every segment, and the k' segments with the smallest averages are stored as the preliminary selection result for that leaf node. The preliminary selection result database 9 is generated in this way.

When synthesis target information is given from the preprocessing unit 1, the selection unit 32 selects segment candidates for it using the decision tree for the unit speech to which the given synthesis target information corresponds. Specifically, the selection unit determines whether the given synthesis target information satisfies each question q of the decision tree, follows the tree from the root node toward the leaf nodes, and, on reaching a leaf node, outputs the segment IDs of the k' segments corresponding to that leaf node as segment candidates.
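The lookup performed by the selection unit 32 can be sketched as a simple walk down the tree. The `Node` and `Leaf` classes, the dictionary representation of synthesis target information, and the field names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Leaf:
    unit_ids: List[str]  # the k' preselected segment IDs for this group


@dataclass
class Node:
    question: Callable[[dict], bool]  # Yes/No question on the target
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def lookup(tree: Union[Node, Leaf], target: dict) -> List[str]:
    """Follow the answers from the root to a leaf and return its candidates."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(target) else node.no
    return node.unit_ids
```

The cost of a lookup is bounded by the depth of the tree and does not depend on how many test targets or segments were used to build it.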

According to this embodiment, the segment candidates of preliminary selection can be obtained quickly from the preliminary selection result database 9 by using the decision tree. Moreover, even when synthesis target information that was not contained in the test sentences is given, a suitable result reflecting the results for similar synthesis target information can be returned quickly, so fast selection is possible without a large number of test sentences. A decision tree need not be used; for example, all the prepared questions may be evaluated and the preliminary selection result for the synthesis target information whose answers match most closely may be output.

Finally, the preprocessing unit 1 will be described. FIG. 3 is a block diagram of the preprocessing unit 1. As shown in FIG. 3, the preprocessing unit 1 includes a text processing unit 11 and a synthesis parameter generation unit 12. The text processing unit 11 divides the input text into unit speech by morphological analysis, syntactic analysis, reference to a word dictionary, and the like, attaches the prosodic information and other information obtained by the analysis, and outputs the result as text information. The synthesis parameter generation unit 12 attaches acoustic feature parameters to the text information and outputs the result.

The synthesis target information output by the preprocessing unit 1 may be either the output of the text processing unit 11 or the output of the synthesis parameter generation unit 12. In the former case, the synthesis target information contains no acoustic feature parameters, so the segment selection unit 4 cannot calculate the target cost and instead treats the target cost of every segment candidate as 0. That is, segment selection considers only the connection cost, which expresses the degradation of the synthesized speech waveform caused by connecting segments. Because the error between each segment candidate output by the preliminary selection unit 3 and the synthesis target information is considered to be relatively small, the final cost can be expected to remain relatively small even when the subsequent selection treats all target costs as 0. The calculation of the connection cost, on the other hand, does not require the acoustic feature parameters contained in the synthesis target information. In the former configuration, therefore, the synthesis parameter generation unit 12 can be omitted and the segment selection unit 4 can skip the target cost calculation that requires acoustic feature parameters, which reduces the amount of processing for speech synthesis.
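A sketch of segment selection that treats the target cost as 0 when none is supplied follows. It is a generic dynamic-programming search over the preselected candidates and is not taken from the source; the function name, the `join_cost(prev_id, next_id)` callback and the optional `target_cost(position, segment_id)` callback are assumptions.

```python
from typing import Callable, Dict, List, Optional


def select_units(
    candidates: List[List[str]],
    join_cost: Callable[[str, str], float],
    target_cost: Optional[Callable[[int, str], float]] = None,
) -> List[str]:
    """Pick one segment per position minimizing the total cost."""
    if target_cost is None:
        # Former configuration: every target cost is regarded as 0.
        target_cost = lambda i, u: 0.0

    cost: List[Dict[str, float]] = [{u: target_cost(0, u) for u in candidates[0]}]
    back: List[Dict[str, str]] = [{}]
    for i in range(1, len(candidates)):
        cost.append({})
        back.append({})
        for u in candidates[i]:
            prev = min(candidates[i - 1],
                       key=lambda p: cost[i - 1][p] + join_cost(p, u))
            cost[i][u] = cost[i - 1][prev] + join_cost(prev, u) + target_cost(i, u)
            back[i][u] = prev

    # Trace back from the cheapest final segment.
    path = [min(cost[-1], key=cost[-1].get)]
    for i in range(len(candidates) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]
```

Because the candidate lists come from preliminary selection, they are short, and the search cost grows only with the product of adjacent list sizes.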

In the above description, the cost deterioration value of a segment was described as the difference between the cost of the optimal path and the optimal cost when that segment is forcibly selected. Other definitions may also be used, such as the ratio of the cost under forced selection to the cost of the optimal path. In addition, the segments for which the cost deterioration value calculation unit 2 performs the calculation when generating the cost deterioration value database 6 may be all the segments contained in the segment information database 7, or may be narrowed down by target cost to a predetermined number of segments with small target cost values. Narrowing down by target cost reduces the processing load of generating the cost deterioration value database 6, while targeting all segments yields a more accurate cost deterioration value database 6. When narrowing down by target cost, a segment excluded from the calculation is assigned, as its cost deterioration value, the worst value among the segments calculated for that synthesis target information, or a value worse than that.
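The two variations mentioned here, the ratio definition and the filling in of segments excluded by target-cost narrowing, can be sketched as follows. The function names and the `margin` parameter are hypothetical; `margin=0.0` assigns exactly the worst computed value, and a positive margin assigns a worse one.

```python
from typing import Dict, Iterable


def deterioration_ratio(forced_cost: float, optimal_cost: float) -> float:
    """Alternative definition: forced-selection cost over optimal-path cost."""
    return forced_cost / optimal_cost


def fill_pruned(
    computed: Dict[str, float],
    all_units: Iterable[str],
    margin: float = 0.0,
) -> Dict[str, float]:
    """Give segments skipped by narrowing the worst computed value (+ margin)."""
    worst = max(computed.values()) + margin
    return {u: computed.get(u, worst) for u in all_units}
```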

The speech synthesizer according to the present invention can also be realized by a program that, when read into a computer, causes the computer to perform the operations of the functional blocks shown in FIGS. 1, 2, and 3.

FIG. 1 is a block diagram of a speech synthesizer according to the present invention.
FIG. 2 is a block diagram of the preliminary selection unit.
FIG. 3 is a block diagram of the preprocessing unit.
FIG. 4 is an explanatory diagram of the creation of the cost deterioration value database.
FIG. 5 is a diagram showing the cost deterioration value database.
FIG. 6 is a diagram showing a decision tree for creating the preliminary selection result database.
FIG. 7 is a diagram showing the preliminary selection result database.

Explanation of symbols

1 Preprocessing unit
11 Text processing unit
12 Synthesis parameter generation unit
2 Cost deterioration value calculation unit
3 Preliminary selection unit
31 Preliminary selection result database generation unit
32 Selection unit
4 Segment selection unit
5 Speech waveform connection unit
6 Cost deterioration value database
7 Segment information database
8 Speech waveform database
9 Preliminary selection result database

Claims

A speech synthesizer that, based on a series of input synthesis target information, selects a series of speech units from a plurality of speech units prepared in a segment database, connects the speech waveforms corresponding to each of the selected series of speech units, and outputs a synthesized speech waveform,
Cost deterioration value calculating means for obtaining, based on a series of test synthesis target information and with a speech unit sequence selected from the segment database by a predetermined standard as a reference, the cost deterioration value of a speech unit for an item of test synthesis target information, the cost deterioration value indicating the degree of degradation of the synthesized speech waveform of the speech unit sequence selected by the predetermined standard under the condition that the speech unit is used for that item of test synthesis target information, and for generating a cost deterioration value database indicating the cost deterioration values of the corresponding speech units for each item of test synthesis target information;
Preliminary selection means for selecting, for each item of input synthesis target information, one or more items of test synthesis target information from the cost deterioration value database, and for selecting, based on the cost deterioration values, speech unit candidates for the input synthesis target information from the speech units corresponding to the selected test synthesis target information;
Unit selection means for selecting, based on the series of input synthesis target information, a series of speech units to be used for the synthesized speech waveform to be output from the speech unit candidates selected by the preliminary selection means for each item of input synthesis target information;
A speech synthesizer characterized by comprising:

The preliminary selection means comprises:
Preliminary selection result database generation means for grouping the test synthesis target information included in the cost deterioration value database, selecting a plurality of speech units based on the cost deterioration values from the speech units corresponding to the grouped test synthesis target information, and recording the selected speech units in association with the group; and
Selecting means for determining the group to which each item of input synthesis target information belongs, and selecting the speech units corresponding to the determined group as speech unit candidates for the input synthesis target information;
The speech synthesizer according to claim 1.

The preliminary selection result database generation means takes all test synthesis target information belonging to the same unit speech as a root node, and treats the test synthesis target information included in each leaf node of a decision tree, generated by successively dividing the root node, as one group,
The selection means determines a group to which the input synthesis target information belongs according to the decision tree,
The speech synthesizer according to claim 2.

The preliminary selection result database generation means generates a decision tree by repeating set division to minimize the evaluation value,
The evaluation value is a sum of values obtained by weighting the evaluation value of each leaf node generated by the division based on the number of pieces of test synthesis target information included in each leaf node,
The evaluation value of a leaf node is the average of the cost deterioration values of a predetermined number of speech units selected, based on the cost deterioration values, from the speech units corresponding to each item of test synthesis target information included in the leaf node,
The speech synthesizer according to claim 3.

The unit selection means selects a series of speech units considering only the degree of deterioration of the synthesized speech waveform caused by the connection of speech units,
The speech synthesizer according to any one of claims 1 to 4.

The cost deterioration value calculation means
selects from the segment database one or more speech units corresponding to each item in the series of test synthesis target information,
arranges the selected speech units in the same order as the series of test synthesis target information,
associates each speech unit with a node of a search graph and performs, by dynamic programming from the beginning and from the end, cost calculation indicating the degree of degradation of the synthesized speech waveform, thereby obtaining the cost of the speech unit sequence serving as the reference and the cost of the speech unit sequence under the condition that each speech unit is used, and
determines, based on the obtained cost of each speech unit sequence, the cost deterioration value of each speech unit selected from the segment database for each item of test synthesis target information,
The speech synthesizer according to any one of claims 1 to 5.

A speech synthesis method for outputting a synthesized speech waveform from a series of input synthesis target information by using a cost deterioration value database that holds, based on a series of test synthesis target information and with a speech unit sequence selected from a segment database by a predetermined standard as a reference, the cost deterioration value of a speech unit for an item of test synthesis target information, the cost deterioration value indicating the degree of degradation of the synthesized speech waveform of the speech unit sequence selected by the predetermined standard under the condition that the speech unit is used for that item of test synthesis target information,
Selecting, for each item of input synthesis target information, one or more items of test synthesis target information from the cost deterioration value database, and selecting, based on the cost deterioration values, speech unit candidates for the input synthesis target information from the speech units corresponding to the selected test synthesis target information;
Selecting a series of speech units to be used for a synthesized speech waveform to be output from speech unit candidates for each input synthesis target information based on the series of input synthesis target information;
Connecting a speech waveform corresponding to each of a selected series of speech units to generate a synthesized speech waveform;
A speech synthesis method characterized by comprising:

A program that causes a computer to function as the speech synthesizer according to any one of claims 1 to 6.