JP6398523B2

JP6398523B2 - Speech synthesizer, method, and program

Info

Publication number: JP6398523B2
Application number: JP2014193108A
Authority: JP
Inventors: 淳一郎副島
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2014-09-22
Filing date: 2014-09-22
Publication date: 2018-10-03
Anticipated expiration: 2034-09-22
Also published as: JP2016065899A

Description

The present invention relates to a technique for synthesizing speech by selecting speech units from a speech corpus.

A known speech synthesis technique selects speech units for a synthesis target generated from input text data by referring to a speech corpus, which is a large body of digitized language and speech data, and outputs synthesized speech by concatenating the selected speech units (see, for example, Non-Patent Document 1).

In such speech synthesis, the following method is conventionally known for selecting, from the speech corpus, the speech unit sequence that best matches the synthesis target (see, for example, Non-Patent Document 1). First, for each phoneme segment extracted from the input text data, the data of speech units having the same phoneme (hereinafter "unit data") is extracted from the speech corpus as unit candidate data. Next, a DP (dynamic programming) algorithm determines the best set of unit candidate data (the best unit data sequence), that is, the set that minimizes the cost over the entire input text data. The costs used include the differences in phoneme string and prosody between the input text data and each unit data in the speech corpus, and the discontinuity of acoustic parameters (feature vector data), such as the spectral envelope, between adjacent unit data chosen as unit candidate data.

For concatenative speech synthesis of this kind to sound natural, it is essential that the output be faithful to the specified prosodic transitions and that the joins between units be smooth and continuous.

Achieving both at once requires either a dictionary that contains a large amount of source speech, so that continuous stretches can be used wherever possible, or a dictionary in which phoneme boundaries are accurately defined. In general, dividing speech into phoneme units for dictionary creation is costly: automatic segmentation is not accurate enough, so a person has to listen to the recorded speech and cut out the units by hand.

The following technique is known as a conventional concatenative speech synthesis technique that takes the continuity of speech into account (see, for example, Patent Document 1). When searching for the optimum connection point at which to join speech units, the search range is set to a predetermined range from the end of the preceding speech unit, or to a predetermined range from the beginning of the speech unit to be connected.

Patent Document 1: JP 2008-191334 A

Non-Patent Document 1: 河井恒, "Knowledge Base 3-4: Corpus-Based Speech Synthesis" [online], ver.1/2011.1.7, IEICE, [retrieved December 25, 2013], Internet <URL: http://27.34.144.197/files/02/02gun_07hen_03.pdf#page=6>

However, the conventional technique described above merely uses a predetermined range around the join as the search range for the optimum connection point. When prosodic information is evaluated in order to select speech units from the speech corpus, the range cut out of each speech unit is not considered. Furthermore, when the fitness of a speech unit at selection time and the fitness of the connection point between speech units are evaluated with a predetermined weighted function, changing the search range of the optimum connection point also changes the fitness of the speech unit, but the conventional technique does not account for that change. For these reasons, the conventional technique cannot select the optimum unit that matches the originally targeted prosodic information.

In addition, the waveform could be deformed, for example in the synthesis section, to make the speech match the specified prosody, but the resulting risk of degraded sound quality is also a problem.

An object of the present invention is to enable the optimum speech units to be selected correctly from a speech corpus.

In one example aspect, a speech synthesizer includes: an extraction section that extracts, from input text data, a sequence of segment data in which a phoneme and a target prosody are associated with each other; a generation section that, for each segment data extracted by the extraction section, when a speech unit acquired from a speech corpus has a longer duration than the target prosody of that segment data, generates a new speech unit by deleting from the acquired speech unit the portion exceeding the duration of the target prosody; a unit list-up section that calculates a unit cost indicating the degree of mismatch between the segment data and each of the acquired speech unit and the new speech unit and, based on the calculated unit costs, lists speech units from among the acquired speech unit and the new speech unit as speech unit candidate data for the segment data; a phoneme string selection section that, for each segment data in the extracted sequence, calculates a connection cost indicating the discontinuity between the speech unit candidate data for that segment data and the speech unit candidate data for the adjacent segment data and, based on the calculated connection cost and the unit cost of each of the adjacent speech unit candidate data, selects one of the speech units listed as speech unit candidate data to generate a speech unit data sequence; and a speech generation section that generates synthesized speech based on the generated speech unit data sequence.

According to the present invention, the optimum speech units can be selected correctly from a speech corpus.

FIG. 1 is a block diagram of an embodiment of a speech synthesizer according to the present invention.
FIG. 2 is a block diagram of the waveform selection section.
FIG. 3 is an explanatory diagram of the phoneme string cost, the prosody cost, and the connection cost.
FIG. 4 is an explanatory diagram of this embodiment.
FIG. 5 is a diagram showing an example hardware configuration of a computer that can implement the speech synthesizer as software processing.
FIG. 6 is a diagram showing an example data structure of the control variable.
FIG. 7 is a diagram showing an example data structure of the segment data.
FIG. 8 is a diagram showing an example data structure of the prosody data.
FIG. 9 is a diagram showing an example data structure of the unit candidate data.
FIG. 10 is a diagram showing an example data structure of the speech dictionary data.
FIG. 11 is a diagram showing an example data structure of the unit data.
FIG. 12 is a diagram showing an example data structure of the phoneme data.
FIG. 13 is a diagram showing an example data structure of the feature vector data.
FIG. 14 is a flowchart showing an example of the unit selection processing.
FIG. 15 is an explanatory diagram of the operation of selecting the best unit candidate data.
FIG. 16 is a flowchart (part 1) showing an example of the unit list-up processing.
FIG. 17 is a flowchart (part 2) showing an example of the unit list-up processing.
FIG. 18 is a flowchart showing an example of the unit cost calculation and candidate addition processing.
FIG. 19 is a flowchart showing an example of the phoneme string selection processing.

Hereinafter, embodiments for carrying out the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram of an embodiment of a speech synthesizer 100 according to the present invention. The speech synthesizer 100 includes a text input section 101, a morphological analysis section 102, a prosody prediction section 103, a prosody dictionary 104, a waveform selection section 105, a speech dictionary 106, and a waveform synthesis section 107.

The text input section 101 receives input text data.

The morphological analysis section 102 performs morphological analysis on the input text data received by the text input section 101 and thereby extracts the phoneme string corresponding to the input text data. The input text data is divided into segments, one per phoneme in the phoneme string, and phoneme data indicating each phoneme is registered in the segment data that make up the synthesis target obtained by this segmentation.

Based on the linguistic information obtained by the morphological analysis section 102, the prosody prediction section 103 refers to the prosody dictionary 104, which stores a statistical model of prosody built from actual speech data, and predicts, for each phoneme in the phoneme string of the synthesis target, the prosody expressed by pitch (the fundamental frequency of the vocal cords), duration, and intensity (amplitude). The prosody prediction section 103 thus generates target prosody data, which is prosodic information, for each phoneme segment and registers it in the segment data that make up the synthesis target.

That is, in the segment data sequence generated from the input text data as the synthesis target, each segment data has phoneme data and target prosody data.

For each segment data, which contains target prosody data and phoneme data, the waveform selection section 105 evaluates the unit cost and lists unit candidate data from the speech corpus in the speech dictionary 106. Then, for each segment data, the waveform selection section 105 evaluates the connection cost and the unit cost and selects the best unit candidate data from the listed unit candidate data.

The waveform synthesis section 107 generates and outputs synthesized speech based on the unit data sequence that the waveform selection section 105 selected from the speech dictionary 106 for the respective segment data.

FIG. 2 is a block diagram showing the detailed configuration of the waveform selection section 105 of FIG. 1. The waveform selection section 105 handles the target prosody data 201 output from the prosody prediction section 103 of FIG. 1 and includes a prosody input section 202, a unit selection section 207, and an evaluation section 208. The unit selection section 207 includes a unit list-up section 207a, which outputs unit candidate data 209, and a phoneme string selection section 207b. The evaluation section 208 includes a unit evaluation section 208a and a connection evaluation section 208b.

The prosody input section 202 receives the target prosody data 201 output by the prosody prediction section 103 of FIG. 1.

In the unit selection section 207, the unit list-up section 207a takes each segment data output from the prosody prediction section 103 of FIG. 1 in turn (hereinafter "processing target segment data") and selects, from the speech corpus in the speech dictionary 106, one or more unit data (speech units) whose phoneme matches the phoneme contained in the processing target segment data (hereinafter "processing target unit data").

The unit evaluation section 208a in the evaluation section 208 of FIG. 2 calculates the phoneme string cost by comparing two phoneme strings: one consisting of the phoneme of the processing target segment data and the phonemes of the two segment data before it and the two after it, and one consisting of the phoneme of the processing target unit data and the phonemes of the two unit data before it and the two after it. The phoneme string cost indicates the degree of mismatch between the phoneme strings. It is calculated so that the better the phoneme string around the segment data matches the phoneme string around the unit data, the lower the cost, because selecting unit data whose preceding and following phonemes match yields more natural synthesized speech.

ctxt_distance in FIG. 3 indicates the phoneme string cost. In FIG. 3, segment_k-2, segment_k-1, segment_k, segment_k+1, and segment_k+2 are a discrete time series of the segment data making up the synthesis target that corresponds to the input text data, with segment_k being the processing target segment data. unit_u-2, unit_u-1, unit_u, unit_u+1, and unit_u+2 are a discrete time series of unit data cut out from some position in the speech corpus of the speech dictionary 106, with unit_u being the processing target unit data. The phoneme string cost ctxt_distance of the processing target unit data unit_u for the processing target segment data segment_k is calculated as the degree of mismatch between the two strings of five consecutive phonemes (marked "●" in FIG. 3), each made up of the centre item, segment_k or unit_u, and the two items on either side of it.
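The five-phoneme comparison can be illustrated with a short sketch. The description does not give the mismatch formula, so the fraction of differing window positions used here is an assumption, as are the function names and the choice to pad positions beyond the ends of a sequence with None.

```python
from typing import Optional, Sequence

CONTEXT = 2  # two phonemes on each side of the centre, five in total


def context_window(phones: Sequence[str], center: int) -> list[Optional[str]]:
    """Return the five phonemes centred on `center`; out-of-range slots become None."""
    return [phones[i] if 0 <= i < len(phones) else None
            for i in range(center - CONTEXT, center + CONTEXT + 1)]


def ctxt_distance(target_phones: Sequence[str], k: int,
                  corpus_phones: Sequence[str], u: int) -> float:
    """Assumed metric: share of the five window positions whose phonemes differ."""
    target = context_window(target_phones, k)
    unit = context_window(corpus_phones, u)
    mismatches = sum(1 for t, c in zip(target, unit) if t != c)
    return mismatches / len(target)
```

A unit whose surrounding phonemes all agree with the target scores 0.0, and each differing position adds 0.2 under this assumed metric.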

In FIG. 2, the unit list-up section 207a has the unit evaluation section 208a calculate and evaluate the prosody cost between the prosody data of the processing target unit data (hereinafter "unit prosody data") and the target prosody data 201 of the processing target segment data. Specifically, the unit evaluation section 208a calculates the prosody cost based on the difference between the target prosody data 201 of the processing target segment data and the unit prosody data of the processing target unit data. The prosody cost indicates the distance between the target prosody data 201 and the unit prosody data. pros_distance in FIG. 3 indicates the prosody cost, which is calculated based on the difference between the target prosody data 201 of the processing target segment data segment_k and the unit prosody data of the processing target unit data unit_u.
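As a rough illustration of a prosody cost of this kind, the sketch below sums relative differences in pitch, duration, and intensity. The description states only that the cost is based on the difference between the two sets of prosody data, so the summary fields, the relative-difference form, and the weights are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ProsodySummary:
    """Hypothetical per-phoneme summary; the target values are assumed positive."""
    pitch_hz: float
    duration_s: float
    power: float


def pros_distance(target: ProsodySummary, unit: ProsodySummary,
                  w_pitch: float = 1.0, w_dur: float = 1.0,
                  w_pow: float = 1.0) -> float:
    """Assumed metric: weighted sum of relative differences from the target."""
    return (w_pitch * abs(target.pitch_hz - unit.pitch_hz) / target.pitch_hz
            + w_dur * abs(target.duration_s - unit.duration_s) / target.duration_s
            + w_pow * abs(target.power - unit.power) / target.power)
```

Normalizing each term by the target value keeps pitch in hertz and duration in seconds on a comparable scale, which is one possible design choice among many.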

Here, for each processing target segment data, when the processing target unit data acquired from the speech corpus has a longer duration than the target prosody data 201 of the processing target segment data, the unit list-up section 207a generates new unit data by deleting from the acquired processing target unit data the portion exceeding the duration of the target prosody data 201. It then calculates a unit cost indicating the degree of mismatch between the processing target segment data and each of the acquired processing target unit data and the new unit data, and, based on the calculated unit costs, lists the acquired processing target unit data and the new unit data as processing target unit candidate data for the processing target segment data. Specifically, in a first pattern, shown in FIG. 4(a), a new speech unit is generated and listed by cutting off the head portion 401 of the acquired processing target unit data unit_u, which exceeds the duration of the target prosody data of segment_k. In a second pattern, shown in FIG. 4(b), a new speech unit is generated and listed by cutting off the tail portion 402 of unit_u, which exceeds the duration of the target prosody data of segment_k. In a third pattern, shown in FIG. 4(c), a new speech unit is generated and listed by cutting off, in equal amounts, the head portion 403 and the tail portion 404 of unit_u, which together exceed the duration of the target prosody data of segment_k. In addition to these three patterns, the original processing target unit data acquired from the speech corpus is also listed.
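The three cut-out patterns plus the untrimmed original can be sketched as pairs of shifts applied to the start and end of a unit. The function names are hypothetical, and the sign convention (a negative tail shift moves the end earlier) is an assumption, since the description does not state one.

```python
def trim_patterns(unit_duration: float,
                  target_duration: float) -> list[tuple[float, float]]:
    """Return (top_shift, tail_shift) pairs: the original unit first, then the
    three trimmed variants when the unit is longer than the target."""
    candidates = [(0.0, 0.0)]                         # original unit, untrimmed
    excess = unit_duration - target_duration
    if excess > 0:
        candidates.append((excess, 0.0))              # FIG. 4(a): drop the head
        candidates.append((0.0, -excess))             # FIG. 4(b): drop the tail
        candidates.append((excess / 2, -excess / 2))  # FIG. 4(c): drop both equally
    return candidates


def trimmed_duration(unit_duration: float, top_shift: float,
                     tail_shift: float) -> float:
    """Duration left after moving the start later and the end earlier."""
    return unit_duration - top_shift + tail_shift
```

For a 0.12 s unit and a 0.08 s target, this yields four candidates, and each of the three trimmed ones lasts 0.08 s.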

The unit evaluation section 208a calculates the weighted sum of the phoneme string cost and the prosody cost as the unit cost for each unit data, that is, for the current cut-out interval and for the cut-out intervals of the three patterns above. The resulting phoneme string cost, prosody cost, and unit cost are recorded in memory as unit candidate data 209 (candidate[0] and so on in FIG. 9, described later), together with the cut-out start position 202s (top_shift in FIG. 9) and the cut-out end position 202e (tail_shift in FIG. 9) of each unit data.

In this case, when the processing target unit data acquired from the speech corpus has a longer duration than the target prosody data 201 of the processing target segment data, a total of four unit candidate data 209 are output for each processing target unit data that the unit list-up section 207a acquired from the speech corpus: the original processing target unit candidate data 209 and the newly generated unit candidate data 209 of the three patterns shown in FIGS. 4(a), 4(b), and 4(c).

The unit list-up section 207a sorts the unit candidate data 209 in ascending order of the unit cost evaluated by the unit evaluation section 208a, links them to the processing target segment data, and outputs them.
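The scoring and ordering of candidates can be sketched as follows. The record type ScoredUnit, the function name, and the default weights are assumptions for illustration; only the weighted sum of the two costs and the ascending sort come from the description.

```python
from dataclasses import dataclass


@dataclass
class ScoredUnit:
    """Hypothetical candidate record holding the two partial costs."""
    unit_id: int
    ctxt_distance: float        # phoneme string cost
    pros_distance: float        # prosody cost
    unit_distance: float = 0.0  # unit cost, filled in by rank_candidates


def rank_candidates(cands: list[ScoredUnit], w_ctxt: float = 1.0,
                    w_pros: float = 1.0) -> list[ScoredUnit]:
    """Set each unit cost to the weighted sum and return candidates, cheapest first."""
    for c in cands:
        c.unit_distance = w_ctxt * c.ctxt_distance + w_pros * c.pros_distance
    return sorted(cands, key=lambda c: c.unit_distance)
```

Because Python's sort is stable, candidates with equal unit cost keep their original relative order.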

For each segment data (the processing target segment data), the phoneme string selection section 207b takes each unit candidate data 209 listed for it and each unit candidate data 209 listed for the immediately preceding segment data (hereinafter "forward segment data" and "forward unit candidate data 209", respectively), and calculates the connection cost (cont_distance in FIG. 3) indicating the discontinuity of the acoustic parameters between the two unit candidate data. It then recalculates and evaluates the unit costs of the two unit candidate data, selects the best unit candidate data, generates the unit data sequence, and outputs it to the waveform synthesis section 107.
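Selection over adjacent candidates is commonly done with a dynamic-programming (Viterbi-style) search. The sketch below is one such search under stated assumptions, not the flowchart of this embodiment: costs are supplied as plain lists, join_cost is a hypothetical callback, and the weights are illustrative.

```python
from typing import Callable


def select_units(unit_costs: list[list[float]],
                 join_cost: Callable[[int, int, int], float],
                 w_unit: float = 1.0, w_join: float = 1.0) -> list[int]:
    """unit_costs[i][j] is the unit cost of candidate j for segment i.
    join_cost(i, p, j) is the connection cost between candidate p of
    segment i-1 and candidate j of segment i.
    Returns the index of the chosen candidate for every segment."""
    if not unit_costs:
        return []
    total = [[w_unit * c for c in unit_costs[0]]]  # best cost ending at each candidate
    back = [[-1] * len(unit_costs[0])]             # best predecessor index
    for i in range(1, len(unit_costs)):
        row_total, row_back = [], []
        for j, ucost in enumerate(unit_costs[i]):
            best_p = min(range(len(unit_costs[i - 1])),
                         key=lambda p: total[i - 1][p] + w_join * join_cost(i, p, j))
            row_total.append(total[i - 1][best_p]
                             + w_join * join_cost(i, best_p, j) + w_unit * ucost)
            row_back.append(best_p)
        total.append(row_total)
        back.append(row_back)
    # Trace back from the cheapest candidate of the last segment.
    j = min(range(len(total[-1])), key=lambda idx: total[-1][idx])
    path = [j]
    for i in range(len(unit_costs) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return path[::-1]
```

With a join cost that penalizes switching between candidate indices, the search prefers a path that stays on one index even if an individual unit cost along it is slightly higher.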

FIG. 5 is a diagram showing an example hardware configuration of a computer that can implement the speech synthesizer 100 of FIG. 1 as software processing. The computer shown in FIG. 5 has a CPU 501, a ROM (read-only memory) 502, a RAM (random access memory) 503, an input device 504, an output device 505, an external storage device 506, a portable recording medium drive 507 into which a portable recording medium 510 is inserted, and a communication interface 508, all interconnected by a bus 509. The configuration shown is one example of a computer that can implement the system described above, and such a computer is not limited to this configuration.

The ROM 502 is a memory that stores programs, including the speech synthesis program that controls the computer. The RAM 503 is a memory that temporarily holds the programs or data stored in the ROM 502 while each program runs.

The external storage device 506 is, for example, an SSD (solid-state drive) or a hard disk drive, and is used to store input text data and synthesized speech data.

The CPU 501 controls the entire computer by reading each program from the ROM 502 into the RAM 503 and executing it.

The input device 504 detects input operations made by the user with a keyboard, mouse, or the like, and notifies the CPU 501 of the result. The input device 504 also performs the function of the text input section 101 of FIG. 1: it takes in input text data from outside and stores it in the RAM 503 or the external storage device 506.

The output device 505 outputs data sent under the control of the CPU 501 to a display device or a printing device. The output device 505 also plays back, as sound, the synthesized speech data that the waveform synthesis section 107 of FIG. 1 has written to the external storage device 506 or the RAM 503.

The portable recording medium drive 507 accepts a portable recording medium 510 such as an optical disc, SDRAM, or CompactFlash, and serves as an auxiliary to the external storage device 506.

The communication interface 508 is a device for connecting to a communication line such as a LAN (local area network) or a WAN (wide area network).

The system according to this embodiment is implemented by the CPU 501 reading the speech synthesis program, which provides the functions of the processing sections of FIGS. 1 and 2, from the ROM 502 into the RAM 503 and executing it. The program may be distributed recorded on, for example, the external storage device 506 or the portable recording medium 510, or may be obtained from a network through the communication interface 508.

Next, the various data that the computer of FIG. 5 holds in the RAM 503 or the external storage device 506 in order to operate as the speech synthesizer 100 having the functions of FIGS. 1 and 2 will be described.

FIG. 6 is a diagram showing an example data structure of the control variable WavSel held in the RAM 503. The control variable WavSel holds the variables unitdb, seg_count, and segment. unitdb holds a pointer to the speech dictionary data stored in the speech dictionary 106 on the external storage device 506. seg_count holds the total number of segment data. segment holds a pointer to the first segment data (the start address of segment[0] in FIG. 7, described later).

FIG. 7 is a diagram showing an example data structure of the segment data segment[0] to segment[seg_count-1], which are referenced from the segment pointer in the control variable WavSel of FIG. 6 and held in the RAM 503 or the external storage device 506. The segment data are obtained by the prosody prediction section 103 of FIG. 1 as segment[0], segment[1], ..., segment[seg_count-1], one for each of the seg_count phonemes (the number held in seg_count of the control variable WavSel) produced when the morphological analysis section 102 of FIG. 1 analyzes the input text data. The storage address of the segment data is given by segment in the control variable WavSel. Each segment data segment[i] (i = 0, ..., seg_count-1) holds the variables seg_id, phone_id, target_prosody, candidate, best_cand, prev, and next. seg_id holds the segment ID (identifier). phone_id holds the phoneme ID. target_prosody holds a pointer to the head of the target prosody data 201 held in the RAM 503 or the external storage device 506. candidate holds a pointer to the first unit candidate data 209 (the start address of candidate[0] in FIG. 9, described later). best_cand holds a pointer to the best unit candidate data 209 (the start address of one of candidate[0] to candidate[N], ... in FIG. 9) selected for the current segment data by the processing corresponding to the phoneme string selection section 207b of FIG. 2. prev holds a pointer to the preceding segment data, and next holds a pointer to the following segment data. If the current segment data is, for example, segment[1], prev holds the start address of segment[0] and next holds the start address of segment[2]. If the current segment data is the first entry, segment[0], prev holds NULL, which is an undefined value. If the current segment data is the last entry, segment[seg_count-1], next holds NULL.
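The doubly linked layout of the segment data can be mirrored in a small sketch. The field names follow FIG. 7 as described; the use of Python object references in place of pointers, None in place of NULL, and the helper build_segments are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through prev/next
class Segment:
    seg_id: int
    phone_id: int
    target_prosody: Optional[object] = None  # head of the target prosody list
    candidate: Optional[object] = None       # first listed unit candidate
    best_cand: Optional[object] = None       # candidate chosen by selection
    prev: Optional["Segment"] = None
    next: Optional["Segment"] = None


def build_segments(phone_ids: list[int]) -> list[Segment]:
    """Create one Segment per phoneme ID and link neighbours in both directions."""
    segments = [Segment(seg_id=i, phone_id=p) for i, p in enumerate(phone_ids)]
    for left, right in zip(segments, segments[1:]):
        left.next = right
        right.prev = left
    return segments
```

The first segment keeps prev as None and the last keeps next as None, matching the NULL values described for the two ends of the list.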

FIG. 8 shows an example data structure of the prosody data prosody[0], prosody[1], ..., prosody[N], ..., which is referenced from the target_prosody pointer in each segment data entry of FIG. 7 or from the prosody pointer in each unit data entry of FIG. 11, described later, and is stored in the RAM 503 or the external storage device 506. Each prosody data entry prosody[i] (i = 0, ..., N, ...) holds the variables time, pitch, power, prev, and next. time holds the time at which the prosody occurs. pitch holds the pitch of the prosody (pitch frequency). power holds the intensity of the prosody. prev holds a pointer to the preceding prosody data, and next holds a pointer to the following prosody data. prev holds NULL if the current prosody data is the first entry, and next holds NULL if it is the last entry.

FIG. 9 shows an example data structure of the unit candidate data candidate[0], candidate[1], ..., candidate[N], ..., which is the unit candidate data 209 of FIG. 2, referenced from the candidate pointer in the segment data of FIG. 7 and stored in the RAM 503 or the external storage device 506. Each unit candidate data entry candidate[i] (i = 0, ..., N, ...) is generated by the speech unit list-up unit 207a of FIG. 2 and holds the variables unit_id, ctxt_distance, pros_distance, unit_distance, cont_distance, prev_total_cost, total_cost, best_cand, top_shift, tail_shift, prev, and next. unit_id holds the unit ID (see FIG. 11) that identifies unit data in the speech dictionary 106, and is set by the speech unit list-up unit 207a of FIG. 2. ctxt_distance holds the phoneme string cost described above (the degree of mismatch between phoneme strings), and is calculated and set by the speech unit evaluation unit 208a of FIG. 2. pros_distance holds the prosody cost described above (the distance between the target prosody data 201 and the unit prosody data), and is calculated and set by the speech unit evaluation unit 208a of FIG. 2. unit_distance holds the unit cost described above, which is the weighted sum of the phoneme string cost and the prosody cost, and is calculated and set by the speech unit evaluation unit 208a of FIG. 2. cont_distance holds the connection cost described above (the feature distance at the phoneme connection point), and is calculated and set by the connection evaluation unit 208b of FIG. 2. prev_total_cost holds the total cost determined from the first segment data up to the segment data immediately preceding the one to which this unit candidate data belongs (the front segment data). total_cost holds the total cost determined from the first segment data up to the segment data to which this unit candidate data belongs and, as described above, is calculated and set by the phoneme string selection unit 207b of FIG. 2. best_cand holds a pointer to the best front unit candidate data to be connected to this unit candidate data, and is calculated and set by the phoneme string selection unit 207b described above. Here, the best front unit candidate data is defined as follows. The unit candidate data containing best_cand is the processing target unit candidate data, and the segment data to which it belongs is the processing target segment data. The unit candidate data belonging to the segment data immediately preceding the processing target segment data (the front segment data) is the front unit candidate data. The best front unit candidate data is the front unit candidate data for which the weighted sum of its determined total cost and its connection cost to the processing target unit candidate data is smallest. top_shift is the time by which the cut-out start position is moved when this unit candidate data 209 is selected, and is set by the unit list-up processing described later (FIG. 17). The origin of this shift is the start time start in the unit data of FIG. 11 indicated by unit_id. tail_shift is the time by which the cut-out end position is moved when this unit candidate data 209 is selected, and is set by the unit list-up processing described later (FIG. 17). The origin of this shift is (start time start + duration duration) in the unit data of FIG. 11 indicated by unit_id. prev holds a pointer to the preceding unit candidate data, and next holds a pointer to the following unit candidate data. prev holds NULL if the current unit candidate data is the first entry, and next holds NULL if it is the last entry.
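To make the role of top_shift and tail_shift concrete, the sketch below derives a cut-out interval from a candidate and the unit it refers to. It assumes that each shift is a signed offset added to its stated origin (start for top_shift, start + duration for tail_shift); the description gives the origins but not the sign convention, so that part is an assumption, as are the class layouts and the function name `cut_interval`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Unit:
    unit_id: int
    start: float     # start time of the unit in the corpus recording
    duration: float  # how long the unit lasts


@dataclass
class Candidate:
    unit_id: int
    top_shift: float = 0.0   # shift of the cut-out start, measured from start
    tail_shift: float = 0.0  # shift of the cut-out end, measured from start + duration


def cut_interval(cand: Candidate, unit: Unit) -> Tuple[float, float]:
    """Return (begin, end) of the waveform span to cut out (assumed convention)."""
    begin = unit.start + cand.top_shift
    end = unit.start + unit.duration + cand.tail_shift
    return begin, end
```

With both shifts at 0 the interval is the whole unit, which matches the case where unit data is registered without any cut-out.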

FIG. 10 shows an example data structure of the speech dictionary data unitdb, which is stored in the RAM 503 or the external storage device 506 constituting the speech dictionary 106 of FIG. 1 and is referenced from the unitdb pointer of the control variable WavSel of FIG. 6. The speech dictionary data unitdb holds the variables phone_count, phoneme, unit_count, unit, and fval_count. phone_count holds the number of phonemes defined in unitdb. phoneme holds a pointer to the first phoneme data (the head address of phoneme[0] in FIG. 12). unit_count holds the number of unit data entries contained in unitdb. unit holds a pointer to the first unit data contained in unitdb (the head address of unit[0] in FIG. 11). fval_count holds the number of feature vectors.

FIG. 11 shows an example data structure of the unit data unit[0] to unit[unit_count-1], which is stored in the RAM 503 or the external storage device 506 constituting the speech dictionary 106 of FIG. 1 and is referenced from the unit pointer of the speech dictionary data unitdb of FIG. 10. The number of units contained in the speech dictionary 106, unit_count, is registered as the unit_count data of the speech dictionary data unitdb of FIG. 10. Each unit data entry unit[i] (i = 0, ..., unit_count-1) holds the variables unit_id, phone_id, start, duration, prosody, prev, and next, and the array variables featvalue[0] to featvalue[fval_count-1]. unit_id holds the unit ID that identifies the unit data. phone_id holds the phoneme ID used to identify, from the phoneme data described later with reference to FIG. 12, the phoneme associated with this unit data. start holds the start time of this unit data. duration holds the length of time for which this unit data lasts. prosody holds a pointer to the head of the unit prosody data, which has the data structure of FIG. 8 and is held in the RAM 503 or the external storage device 506. featvalue[0] to featvalue[fval_count-1] hold pointers to the first entries of the first to fval_count-th feature vector data, which has the data structure shown in FIG. 13, described later. prev holds a pointer to the preceding unit data, and next holds a pointer to the following unit data. prev holds NULL if the current unit data is the first entry, and next holds NULL if it is the last entry.

FIG. 12 shows an example data structure of the phoneme data phoneme[0] to phoneme[phone_count-1], which is referenced from the phoneme pointer in the speech dictionary data unitdb of FIG. 10 and stored in the RAM 503 or the external storage device 506. The number of phoneme data entries is set in the phone_count data of unitdb. Each phoneme data entry phoneme[i] (i = 0, ..., phone_count-1) holds the variables phone_id, phomene, prev, and next. phone_id holds the phoneme ID that identifies a phoneme. The segment data of FIG. 7 and the unit data of FIG. 11 are each associated with a phoneme name through their own phone_id data: starting from unitdb in the control variable WavSel of FIG. 6, following phoneme in the speech dictionary data unitdb of FIG. 10, and reaching the entry among phoneme[0] to phoneme[phone_count-1] of FIG. 12 that stores that phone_id value, the phoneme name phomene in that entry is obtained. phomene holds the phoneme name. prev holds a pointer to the preceding phoneme data, and next holds a pointer to the following phoneme data. prev holds NULL if the current phoneme data is the first entry, and next holds NULL if it is the last entry.
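The lookup from a phone_id to a phoneme name amounts to walking the linked phoneme entries until the ID matches. The following is a minimal sketch under that reading. The field name phomene is kept as spelled in the description; `build_phoneme_list` and `phoneme_name` are hypothetical helpers, and assigning IDs in list order is an assumption made only to build test data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Phoneme:
    phone_id: int
    phomene: str                      # phoneme name, spelled as in the description
    prev: Optional["Phoneme"] = None
    next: Optional["Phoneme"] = None


def build_phoneme_list(names: List[str]) -> Optional[Phoneme]:
    """Hypothetical helper: chain phoneme entries, IDs assigned in order."""
    head = None
    tail = None
    for phone_id, name in enumerate(names):
        node = Phoneme(phone_id=phone_id, phomene=name, prev=tail)
        if tail is None:
            head = node
        else:
            tail.next = node
        tail = node
    return head


def phoneme_name(head: Optional[Phoneme], phone_id: int) -> Optional[str]:
    """Follow next pointers from the head until phone_id matches."""
    node = head
    while node is not None:
        if node.phone_id == phone_id:
            return node.phomene
        node = node.next
    return None  # no entry stores this phone_id
```

Both segment data and unit data can resolve their phoneme label through the same function, since each carries only a phone_id.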

FIG. 13 shows an example data structure of the feature vector data featvalue[0], featvalue[1], ..., featvalue[N], ..., which is referenced from the featvalue[i] (i = 0, ..., fval_count-1) pointers in each unit data entry of FIG. 11 and stored in the RAM 503 or the external storage device 506. Each feature vector data entry featvalue[i] (i = 0, ..., N, ...) holds the variables time, dimension, prev, and next, and the array variables value[0] to value[dimension-1]. time holds the time corresponding to the feature vector data. dimension holds the number of dimensions of the feature vector data. value[0] to value[dimension-1] hold the first to dimension-th feature values. prev holds a pointer to the preceding feature vector data, and next holds a pointer to the following feature vector data. prev holds NULL if the current feature vector data is the first entry, and next holds NULL if it is the last entry. As described above, this feature vector data is used by the connection evaluation unit 208b of FIG. 2 to calculate the distance between the spectral envelopes of the respective unit data at the phoneme connection point between the processing target unit candidate data 209 and the front unit candidate data 209.
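As an illustration of how the value[] arrays could feed a connection cost, the sketch below measures the distance between the last feature frame of the front unit and the first frame of the target unit. The Euclidean metric and the choice of boundary frames are assumptions for the example; the description states only that a distance between spectral envelopes at the connection point is calculated.

```python
import math
from typing import Sequence


def feature_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def connection_distance(front_frames: Sequence[Sequence[float]],
                        target_frames: Sequence[Sequence[float]]) -> float:
    """Distance across the joint: last frame of the front unit against the
    first frame of the target unit (assumed choice of frames)."""
    return feature_distance(front_frames[-1], target_frames[0])
```

Each frame here stands for one value[0] to value[dimension-1] array, so the dimension check mirrors the dimension field of the feature vector data.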

FIG. 14 is a flowchart showing an example of the unit selection processing in the case where the CPU 501 of a computer having the hardware configuration of FIG. 5 implements, by executing a software program, the function corresponding to the speech unit selection unit 207 of FIG. 2. All of the processing described below is executed by the CPU 501.

First, the undefined value NULL is stored in the variable segprev on the RAM 503, and the value of the segment data in the control variable WavSel, which has the data structure of FIG. 6 described above, is stored in the variable seg (step S1401). This value is a pointer to the head address of the first entry, segment[0], of the segment data having the data structure of FIG. 7. The variable seg indicates the processing target segment data, and the variable segprev indicates the front segment data.

Next, it is determined whether the value of the variable seg is not the undefined value NULL, that is, whether any of the segment data segment[0] to segment[seg_count-1] of FIG. 7 remains to be processed (step S1402).

While processing of all the segment data has not been completed, so that the value of seg is not NULL and the determination in step S1402 is YES, the unit list-up processing of step S1403 and the phoneme string selection processing of step S1404 are executed repeatedly. After each iteration, in step S1405, the value of seg is stored in segprev, and the pointer to the next segment data, given by the next pointer (see FIG. 7) in the segment data indicated by seg, is newly set in seg.

When processing of all the segment data is complete, so that the value of seg becomes NULL and the determination in step S1402 is NO, the best candidates already determined are traced starting from the segment data last indicated by segprev, that is, from the end of the segment data toward its head. The final selected candidate of unit data is designated for each segment data entry, and the result is output as a unit data string (step S1406).
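The control flow of steps S1401 to S1406 can be sketched as a loop over the linked segments followed by a backward trace. This is an illustrative outline only: the classes `Seg` and `Cand`, the callback parameters `list_up` and `select_path`, and the function name `select_units` are hypothetical stand-ins for the processing of steps S1403 and S1404, which are described separately.

```python
from typing import Callable, List, Optional


class Cand:
    def __init__(self, unit_id: int, total_cost: float = 0.0,
                 best_cand: Optional["Cand"] = None):
        self.unit_id = unit_id
        self.total_cost = total_cost
        self.best_cand = best_cand  # best candidate in the preceding segment


class Seg:
    def __init__(self) -> None:
        self.candidates: List[Cand] = []
        self.best_cand: Optional[Cand] = None
        self.prev: Optional["Seg"] = None
        self.next: Optional["Seg"] = None


def select_units(first_segment: Optional[Seg],
                 list_up: Callable[[Seg], None],
                 select_path: Callable[[Seg, Optional[Seg]], None]) -> List[int]:
    segprev = None
    seg = first_segment                      # S1401
    while seg is not None:                   # S1402: YES while seg is not NULL
        list_up(seg)                         # S1403: list candidates
        select_path(seg, segprev)            # S1404: choose best front candidates
        segprev, seg = seg, seg.next         # S1405: advance
    # S1406: trace best candidates from the last segment back to the first.
    unit_ids: List[int] = []
    seg = segprev
    cand = seg.best_cand if seg is not None else None
    while seg is not None and cand is not None:
        seg.best_cand = cand                 # final selection for this segment
        unit_ids.append(cand.unit_id)
        cand = cand.best_cand
        seg = seg.prev
    unit_ids.reverse()
    return unit_ids
```

Note that the backward trace may overwrite a segment's locally best candidate: the final choice for each segment is whichever candidate lies on the path ending at the best candidate of the last segment.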

FIG. 15 is an explanatory diagram of the operation of selecting the best unit candidate data realized by the flowchart of FIG. 14. In this figure, ..., segment_{k-2}, segment_{k-1}, segment_{k}, segment_{k+1}, segment_{k+2}, ... denote the discrete time series of segment data constituting the synthesis target corresponding to the input text data, and segment_{k} is assumed to be the processing target segment data. In the segment data format of FIG. 7, these are ..., segment[k-2], segment[k-1], segment[k], segment[k+1], segment[k+2], .... Further, for example, ..., candidate_{k-2,0}, candidate_{k-2,1}, candidate_{k-2,2}, candidate_{k-2,3}, candidate_{k-2,4}, ..., which are connected to segment_{k-2} by solid lines, denote the unit candidate data 209 listed for the segment data segment_{k-2}. Similarly, candidate_{k-1,0}, candidate_{k-1,1}, candidate_{k-1,2}, candidate_{k-1,3}, candidate_{k-1,4}, ..., which are connected to segment_{k-1} by solid lines, denote the unit candidate data 209 listed for the segment data segment_{k-1}. Likewise, candidate_{k,0}, candidate_{k,1}, candidate_{k,2}, candidate_{k,3}, candidate_{k,4}, ..., which are connected to segment_{k} by solid lines, denote the unit candidate data 209 listed for the segment data segment_{k}.

Each of the unit candidate data 209 described above is generated by the unit list-up processing of step S1403 in FIG. 14. This processing implements the function of the speech unit list-up unit 207a of FIG. 2. As described earlier for the speech unit list-up unit 207a, when, for example, the unit candidate data candidate_{k,0}, candidate_{k,1}, candidate_{k,2}, candidate_{k,3}, candidate_{k,4}, ... are listed, the unit cost between each of them and the corresponding segment data segment_{k} is calculated, and the order of the unit candidate data is determined based on the calculated unit costs.

Next, the dark-colored unit candidate data candidate_{k-2,4} in the segment data segment_{k-2} indicates the best unit candidate data 209 detected when the dark-colored unit candidate data candidate_{k-1,1} in the following segment data segment_{k-1} is processed as the processing target unit candidate data 209. Similarly, the dark-colored unit candidate data candidate_{k-1,1} in the segment data segment_{k-1} indicates the best unit candidate data 209 detected when the dark-colored unit candidate data candidate_{k,2} in the following segment data segment_{k} is processed as the processing target unit candidate data 209. Now, let the processing target segment data be segment_{k} and the processing target unit candidate data 209 be candidate_{k,2}. Connection costs are calculated between candidate_{k,2} and each unit candidate data (front unit candidate data) candidate_{k-1,0}, candidate_{k-1,1}, candidate_{k-1,2}, candidate_{k-1,3}, candidate_{k-1,4}, ... in the front segment data segment_{k-1}. Then the weighted sum of the connection cost so calculated, the two unit costs, and the total cost determined for the front unit candidate data 209 up to its own best unit candidate data 209 one position further forward is calculated, and the result is taken as the total cost for the processing target unit candidate data candidate_{k,2}. This calculation is performed for candidate_{k,2} against all the front unit candidate data candidate_{k-1,0}, candidate_{k-1,1}, candidate_{k-1,2}, candidate_{k-1,3}, candidate_{k-1,4}, and the front unit candidate data 209 with the smallest total cost is determined as the best front unit candidate data 209 for candidate_{k,2}. For example, the dark-colored front unit candidate data candidate_{k-1,1} in the front segment data segment_{k-1} is determined as the best front unit candidate data 209 for candidate_{k,2}. In that case, the connection cost 1502_{k} between candidate_{k,2} and the best front unit candidate data candidate_{k-1,1} is calculated. Then the sum of the weighted sum of the unit costs 1501_{k} and 1501_{k-1} and the connection cost 1502_{k}, and the total cost determined for the best front unit candidate data candidate_{k-1,1}, is calculated as the total cost for the processing target unit candidate data candidate_{k,2}.
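The choice of the best front candidate for one processing target candidate can be sketched as a minimum search over the candidates of the preceding segment. The cost formula below follows the last sentence of the paragraph above literally (a weighted sum of the two unit costs and the connection cost, added to the front candidate's determined total cost). The weights `w_unit` and `w_conn`, the `connection_cost` callback, and the function name `choose_best_front` are placeholders, since their actual values and definitions are not given here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(eq=False)
class Cand:
    unit_distance: float                 # unit cost of this candidate
    cont_distance: float = 0.0           # connection cost to the chosen front candidate
    prev_total_cost: float = 0.0         # total cost up to the front segment
    total_cost: float = 0.0              # total cost up to this segment
    best_cand: Optional["Cand"] = None   # best front candidate


def choose_best_front(target: Cand,
                      fronts: List[Cand],
                      connection_cost: Callable[[Cand, Cand], float],
                      w_unit: float = 1.0,
                      w_conn: float = 1.0) -> Optional[Cand]:
    """Try every front candidate and keep the one giving the smallest total."""
    best = None
    for front in fronts:
        conn = connection_cost(front, target)
        total = (w_unit * (target.unit_distance + front.unit_distance)
                 + w_conn * conn
                 + front.total_cost)
        if best is None or total < target.total_cost:
            best = front
            target.best_cand = front
            target.cont_distance = conn
            target.prev_total_cost = front.total_cost
            target.total_cost = total
    return best
```

Running this for every candidate of every segment, from the first segment onward, leaves each candidate with a best_cand link, and following those links backward yields the selected path.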

As described above, in this embodiment, the pair of processing target segment data and front segment data is advanced in order from the head toward the end of the segment data. For each unit candidate data 209 listed in the processing target segment data (the processing target unit candidate data 209), the path of the best unit candidate data 209 for each segment data entry, from the head to the current processing target unit candidate data 209 in the current processing target segment data, is searched (the loop of steps S1402 to S1405). The total costs of the unit candidate data 209 searched in this way within the current processing target segment data are also compared, and the best unit candidate data 209 within the processing target segment data is determined. When processing has been completed through the last segment data (the determination in step S1402 is NO), a search that follows the best front unit candidate data 209 in order from the last segment data toward the first segment data is executed, and the unit data for each segment is determined in turn. When the search finally reaches the first segment data, the unit data corresponding to all the segment data has been determined, and it is output as a unit data string to the waveform synthesis unit 107 of FIG. 2. In this way, this embodiment can output the optimal unit data string by means of the so-called Viterbi algorithm.

FIGS. 16 and 17 are flowcharts showing a detailed example of the unit list-up processing of step S1403 in FIG. 14, and implement the function of the speech unit list-up unit 207a of FIG. 2.

First, a pointer to the first unit data of the speech corpus in the speech dictionary 106 is stored in the variable u on the RAM 503 (step S1601). This pointer value can be obtained as the unit pointer value of the speech dictionary data unitdb of FIG. 10.

Next, while it is determined in step S1602 that the search has not yet passed the last unit data (while the determination is YES), the series of processing from step S1603 to step S1620 below is executed for each unit data entry, with the pointer to the next unit data being stored in the variable u in step S1608. Here, the pointer to the next unit data is obtained as the next pointer in the unit data of FIG. 11 indicated by the variable u. The determination in step S1602 can be realized by determining whether the value of the next pointer set in step S1608 is NULL.

The phoneme label string of the unit data referenced by the variable u and the two unit data entries on either side of it is compared with the phoneme label string of the segment data referenced by the variable seg and the two segment data entries on either side of it. The phoneme string cost is calculated from this comparison and stored in the variable context on the RAM 503 (step S1603). The calculation method is as described above with reference to FIG. 3. The phoneme label of the unit data referenced by u is obtained by looking up the phoneme name data phomene in the phoneme data of FIG. 12 from the phoneme ID data phone_id in the unit data of FIG. 11 referenced by u. The phoneme labels of the two unit data entries on either side are obtained in the same way from the unit data reached by following the prev pointer and the next pointer in the unit data of FIG. 11 referenced by u two steps each. The phoneme label of the segment data referenced by seg is obtained by looking up the phoneme name data phomene in the phoneme data of FIG. 12 from the phoneme ID data phone_id in the segment data of FIG. 7 referenced by seg. The phoneme labels of the two segment data entries on either side are obtained in the same way from the segment data reached by following the prev pointer and the next pointer in the segment data of FIG. 7 referenced by seg two steps each.
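Gathering the five-label context window through prev and next pointers can be sketched as follows. The window collection follows the description; the cost itself is computed here as a plain count of mismatching positions, which is only a stand-in, because the actual calculation method belongs to the explanation of FIG. 3. The `Node` class, the `chain` helper, and the function names are hypothetical.

```python
from typing import List, Optional


class Node:
    """Stand-in for either a unit data entry or a segment data entry."""
    def __init__(self, label: str) -> None:
        self.label = label                  # phoneme name already resolved
        self.prev: Optional["Node"] = None
        self.next: Optional["Node"] = None


def chain(labels) -> List[Node]:
    """Hypothetical helper: link one node per label via prev/next."""
    nodes = [Node(label) for label in labels]
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left
    return nodes


def context_labels(node: Node, width: int = 2) -> List[Optional[str]]:
    """Labels of the node plus `width` neighbours on each side; None past the ends."""
    before: List[Optional[str]] = []
    cur = node.prev
    for _ in range(width):
        before.append(cur.label if cur is not None else None)
        cur = cur.prev if cur is not None else None
    after: List[Optional[str]] = []
    cur = node.next
    for _ in range(width):
        after.append(cur.label if cur is not None else None)
        cur = cur.next if cur is not None else None
    return before[::-1] + [node.label] + after


def context_mismatch(unit_node: Node, seg_node: Node, width: int = 2) -> int:
    """Assumed cost: number of window positions whose labels differ."""
    pairs = zip(context_labels(unit_node, width), context_labels(seg_node, width))
    return sum(1 for a, b in pairs if a != b)
```

Positions beyond the first or last entry are represented by `None`, so a unit at the edge of the corpus still yields a full-width window.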

Next, a new candidate entry candidate[i] (where i is the number of the entry appended after the current last entry) is generated at the end of the unit candidate data 209 of FIG. 9 on the RAM 503. A unit ID unit_id is newly assigned to the entry, and the value of the variable context on the RAM 503 set in step S1603 is assigned as the phoneme string cost ctxt_distance (step S1604).

Next, the value 0 is assigned to both the cut-out start position top_shift and the cut-out end position tail_shift of the entry of the unit candidate data 209 of FIG. 9 generated on the RAM 503 in step S1604 (step S1605). The cut-out start position and the cut-out end position indicate the start and end of the cut-out interval when unit data of the three patterns shown in FIG. 4 is registered. Steps S1603 to S1606, however, register the unit data acquired from the speech corpus as it is, without any cut-out, so both values are set to 0.

Next, for the entry of the unit candidate data 209 generated in steps S1603 to S1605, the prosody cost and the unit cost are further calculated, and the entry is registered in the entry of the processing target segment stored in the variable seg on the RAM 503 in step S1405 of FIG. 14 (step S1606). The details of this processing are described later using the flowchart of FIG. 18.

Next, it is determined whether the current processing target unit data acquired from the speech corpus has a longer duration than the target prosody data 201 of the processing target segment data stored in the variable seg on the RAM 503 in step S1405 of FIG. 14 (step S1607). The prosody data length of the processing target unit data can be obtained as the length of the prosody data referenced from the prosody pointer in the corresponding unit data entry of FIG. 11. The length of the target prosody data 201 of the processing target segment data can be obtained as the length of the prosody data referenced from the target_prosody pointer in the corresponding segment data entry of FIG. 7.

If the determination in step S1607 is NO, the pointer to the next unit data is stored in the variable u (step S1608), and processing returns to step S1602.

If the determination in step S1607 is YES, processing proceeds to the flowchart of FIG. 17, where the following are executed: the list-up processing for unit data of the first pattern of FIG. 4(a) described above (steps S1609 to S1611), the list-up processing for unit data of the second pattern of FIG. 4(b) (steps S1612 to S1614), and the list-up processing for unit data of the third pattern of FIG. 4(c) (steps S1615 to S1620).

First, to register the unit data of the first pattern (FIG. 4(a)), a new selection candidate entry candidate[i] is created at the end of the unit candidate data 209 of FIG. 9 on the RAM 503, in the same manner as in step S1604 of FIG. 16. A new unit ID unit_id is assigned to the entry, and the value of the variable context on the RAM 503 set in step S1603 is assigned as the phoneme string cost ctxt_distance (step S1609).

Next, the duration difference (the value corresponding to 401 in FIG. 4(a)), obtained by subtracting the prosody data length of the segment data being processed from the prosody data length of the unit data being processed, both determined in step S1607 of FIG. 16, is assigned to the cut-out start position top_shift of the entry of the unit candidate data 209 of FIG. 9 created on the RAM 503 in step S1609 (step S1610). The cut-out end position tail_shift is set to 0.

Next, the prosody cost and the unit cost are calculated for the entry of the unit candidate data 209 generated in steps S1609 and S1610, and the entry is registered in the entry of the segment being processed, which was stored in the variable seg on the RAM 503 in step S1405 of FIG. 14 (step S1611). Details of this process are described later with reference to the flowchart of FIG. 18.

Next, to register the unit data of the second pattern (FIG. 4(b)), a new selection candidate entry candidate[i] is created at the end of the unit candidate data 209 of FIG. 9 on the RAM 503, in the same manner as in step S1604 of FIG. 16. A new unit ID unit_id is assigned to the entry, and the value of the variable context on the RAM 503 set in step S1603 is assigned as the phoneme string cost ctxt_distance (step S1612).

Next, the duration difference (the value corresponding to 402 in FIG. 4(b)), obtained by subtracting the prosody data length of the segment data being processed from the prosody data length of the unit data being processed, both determined in step S1607 of FIG. 16, is assigned to the cut-out end position tail_shift of the entry of the unit candidate data 209 of FIG. 9 created on the RAM 503 in step S1612 (step S1613). The cut-out start position top_shift is set to 0.

Next, the prosody cost and the unit cost are calculated for the entry of the unit candidate data 209 generated in steps S1612 and S1613, and the entry is registered in the entry of the segment being processed, which was stored in the variable seg on the RAM 503 in step S1405 of FIG. 14 (step S1614). Details of this process are described later with reference to the flowchart of FIG. 18.

Finally, to register the unit data of the third pattern (FIG. 4(c)), a new selection candidate entry candidate[i] is created at the end of the unit candidate data 209 of FIG. 9 on the RAM 503, in the same manner as in step S1604 of FIG. 16. A new unit ID unit_id is assigned to the entry, and the value of the variable context on the RAM 503 set in step S1603 is assigned as the phoneme string cost ctxt_distance (step S1615).

Next, the duration difference, obtained by subtracting the prosody data length of the segment data being processed from the prosody data length of the unit data being processed, both determined in step S1607 of FIG. 16, is divided by 2, and the result (the value corresponding to 403 in FIG. 4(c)) is stored in the variable shift on the RAM 503 (step S1616).

Next, the value stored in the variable shift in step S1616 is assigned to the cut-out start position top_shift of the entry of the unit candidate data 209 of FIG. 9 created on the RAM 503 in step S1615 (step S1617).

Further, the value stored in the variable shift in step S1616 is subtracted from the duration difference, obtained by subtracting the prosody data length of the segment data being processed from the prosody data length of the unit data being processed as determined in step S1607 of FIG. 16, and the result overwrites the variable shift (step S1618). The value of shift obtained in step S1616 is not reused as is, so that the small residual error left by the division is absorbed.

Then, the value stored in the variable shift in step S1618 is assigned to the cut-out end position tail_shift of the entry of the unit candidate data 209 of FIG. 9 created on the RAM 503 in step S1615 (step S1619).

Finally, the prosody cost and the unit cost are calculated for the entry of the unit candidate data 209 generated in steps S1615 to S1619, and the entry is registered in the entry of the segment being processed, which was stored in the variable seg on the RAM 503 in step S1405 of FIG. 14 (step S1620). Details of this process are described later with reference to the flowchart of FIG. 18.

After the process illustrated in the flowchart of FIG. 17, the process moves to step S1608 of FIG. 16, where a pointer to the next unit data is stored in the variable u (step S1608), and then returns to step S1602.
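The three cut-out patterns of FIG. 17 (steps S1609 to S1620) can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name trim_patterns is hypothetical, and durations are assumed to be integer frame counts, which the description does not specify.

```python
from typing import List, Tuple


def trim_patterns(unit_len: int, target_len: int) -> List[Tuple[int, int]]:
    """Return (top_shift, tail_shift) pairs for the three cut-out patterns.

    Assumption of this sketch: lengths are integer frame counts.
    """
    diff = unit_len - target_len
    if diff <= 0:
        # Step S1607 is NO: the unit is not longer than the target,
        # so no trimmed candidates are generated.
        return []

    # Pattern 1 (FIG. 4(a)): remove the whole excess from the head.
    head_only = (diff, 0)

    # Pattern 2 (FIG. 4(b)): remove the whole excess from the tail.
    tail_only = (0, diff)

    # Pattern 3 (FIG. 4(c)): split the excess between head and tail.
    shift = diff // 2        # S1616: half of the excess
    top = shift              # S1617: top_shift
    shift = diff - shift     # S1618: the tail absorbs the remainder
    both = (top, shift)      # S1619: tail_shift

    return [head_only, tail_only, both]
```

Recomputing the tail amount as the difference minus the head amount keeps top_shift + tail_shift equal to the full excess even when the excess is odd, which is one plausible reading of the "residual error" that step S1618 absorbs.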

FIG. 18 is a flowchart showing details of the unit cost calculation and candidate addition process executed in each of step S1606 of FIG. 16 and steps S1611, S1614, and S1620 of FIG. 17.

First, the prosodic transition of the unit data being processed (phoneme unit) is compared with that of the target prosody data 201 of the segment data being processed, and the prosody cost is calculated (step S1801). The specific processing was described above in connection with the speech unit list-up unit 207a of FIG. 2. The prosody cost is then registered as the prosody cost pros_distance of the entry of the unit candidate data 209 of FIG. 9 created on the RAM 503 in step S1604, S1609, S1612, or S1615.

Next, the unit cost is calculated as the weighted sum of the phoneme string cost calculated in step S1603 and the prosody cost calculated in step S1801. The unit cost is then registered as the unit cost unit_distance of the entry of the unit candidate data 209 of FIG. 9 created on the RAM 503 in step S1604, S1609, S1612, or S1615 (step S1802).

Finally, the entry of the unit candidate data 209 of FIG. 9 created on the RAM 503 in step S1604, S1609, S1612, or S1615 is linked to and registered in the entry of the segment being processed, which was stored in the variable seg on the RAM 503 in step S1405 of FIG. 14 (step S1803). Specifically, the candidate pointer of the segment data entry of FIG. 7 links to the unit_id of an entry of the unit candidate data 209 of FIG. 9, and the next pointer of that entry links to the unit_id of the following entry of the unit candidate data 209. The links are arranged in ascending order of the unit cost calculated in step S1802. Whether an entry may be added can also be limited by, for example, the number of candidates or the unit cost value.
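Steps S1801 to S1803 can be sketched in Python as a weighted-sum cost followed by an insertion into a singly linked list kept in ascending order of unit cost. This is a minimal sketch, not the patented implementation: the class Candidate mirrors only some fields of FIG. 9, and the weights w_ctxt and w_pros, the max_count cap, and the helper names are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    unit_id: int
    ctxt_distance: float                  # phoneme string cost from S1603
    pros_distance: float = 0.0            # prosody cost, set at S1801
    unit_distance: float = 0.0            # unit cost, set at S1802
    next: Optional["Candidate"] = None    # next candidate in the list


def add_candidate(head: Optional[Candidate], cand: Candidate, pros_cost: float,
                  w_ctxt: float = 1.0, w_pros: float = 1.0,
                  max_count: Optional[int] = None) -> Candidate:
    """Score cand, insert it in ascending unit-cost order, return the new head.

    w_ctxt, w_pros and max_count (>= 1 when given) are hypothetical parameters.
    """
    cand.pros_distance = pros_cost                                     # S1801
    cand.unit_distance = w_ctxt * cand.ctxt_distance + w_pros * pros_cost  # S1802

    # S1803: walk to the first entry whose cost exceeds the new one.
    prev, cur = None, head
    while cur is not None and cur.unit_distance <= cand.unit_distance:
        prev, cur = cur, cur.next
    cand.next = cur
    if prev is None:
        head = cand
    else:
        prev.next = cand

    # Optional limit on the number of candidates: cut the list after max_count.
    if max_count is not None:
        node, n = head, 1
        while node is not None and n < max_count:
            node, n = node.next, n + 1
        if node is not None:
            node.next = None
    return head


def unit_ids(head: Optional[Candidate]) -> List[int]:
    """Collect unit_id values in list order (helper for inspection)."""
    ids = []
    while head is not None:
        ids.append(head.unit_id)
        head = head.next
    return ids
```

Keeping the list sorted at insertion time means the cheapest candidates are always at the front, so a cap on the candidate count simply truncates the tail.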

In FIGS. 16 and 17, when the variable u has been updated until the last speech unit has been searched and the determination in step S1602 becomes NO, the processing of the flowcharts of FIGS. 16 and 17 ends, completing one pass of the unit listing process of step S1403 of FIG. 14.

FIG. 19 is a flowchart showing a detailed example of the phoneme string selection process of step S1404 of FIG. 14, which implements the function of the phoneme string selection unit 207b of FIG. 2.

First, the pointer seg.best_cand to the best unit candidate data 209, in the segment data of FIG. 7 referenced by the variable seg, is initialized (cleared) (step S1901).

Next, the address of the first entry of the unit candidate data 209 listed for the current segment data (the segment data being processed) indicated by the variable seg is set in the variable ct on the RAM 503 (step S1902). Specifically, this address is obtained as the value of the pointer candidate to the first unit candidate data in the segment data of FIG. 7 referenced by the variable seg.

Next, for as long as step S1903 determines that the search of the unit candidate data 209 has not finished (the determination is YES), the series of processes in steps S1904 to S1914 below is executed for each piece of unit candidate data 209 (the unit candidate data 209 being processed), with a pointer to the next unit candidate data 209 being stored in the variable ct in step S1915. The pointer to the next unit candidate data 209 to be processed is obtained as the next pointer in the unit candidate data 209 of FIG. 9 indicated by the variable ct. The determination in step S1903 can be implemented by checking whether the value of the next pointer set in step S1915 is NULL.

First, the address of the first entry of the unit candidate data 209 listed for the segment data preceding the segment data being processed (the preceding unit candidate data 209) is set in the variable cp on the RAM 503 (step S1904). Specifically, this address is obtained as the value of the pointer candidate to the first unit candidate data in the segment data of FIG. 7 that is referenced by the prev pointer in the segment data of FIG. 7 referenced by the variable seg.

Next, the best_cand pointer in the structure data referenced by the variable ct (see the unit candidate data 209 being processed in FIG. 9) is initialized (cleared) (step S1905). best_cand is a pointer to the best preceding unit candidate data 209 to be connected to the unit candidate data 209 being processed.

Then, for as long as step S1906 determines that the search of the preceding unit candidate data 209 has not finished (the determination is YES), the series of processes in steps S1907 to S1910 below is executed for each piece of preceding unit candidate data 209, with a pointer to the next preceding unit candidate data 209 being stored in the variable cp in step S1911. The pointer to the next preceding unit candidate data 209 is obtained as the next pointer in the unit candidate data 209 of FIG. 9 indicated by the variable cp. The determination in step S1906 can be implemented by checking whether the value of the next pointer set in step S1911 is NULL.

First, the connection cost between the unit candidate data 209 being processed, indicated by the variable ct, and the preceding unit candidate data 209, indicated by the variable cp, is calculated (step S1907). Specifically, the feature vector data corresponding to each piece of unit candidate data 209 is referenced by following the unit_id data of FIG. 9 to featvalue[0] through featvalue[fval_count-1] of FIG. 11, and the feature vector data corresponding to the cut-out start position and the cut-out end position is extracted. The cut-out start position and the cut-out end position are obtained as the top_shift and tail_shift data of the entry of FIG. 9. The distance between the spectral envelopes of the two pieces of unit data (for example, the Euclidean distance between mel-cepstra), computed from this pair of feature vector data, is then taken as the connection cost.
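The connection cost of step S1907 can be sketched in Python as a Euclidean distance between two feature vectors at the junction. The sketch makes assumptions that the description leaves open: top_shift and tail_shift are treated as frame offsets into the featvalue array, and the frames compared are the last retained frame of the preceding unit and the first retained frame of the current unit. The function name join_cost is hypothetical.

```python
import math
from typing import Sequence

Frame = Sequence[float]  # one feature vector, e.g. mel-cepstrum coefficients


def join_cost(prev_frames: Sequence[Frame], prev_tail_shift: int,
              cur_frames: Sequence[Frame], cur_top_shift: int) -> float:
    """Euclidean distance across the junction between two trimmed units.

    Assumption: shifts are counted in frames and index the feature arrays.
    """
    # Last frame kept after trimming prev_tail_shift frames from the tail.
    last_prev = prev_frames[len(prev_frames) - 1 - prev_tail_shift]
    # First frame kept after trimming cur_top_shift frames from the head.
    first_cur = cur_frames[cur_top_shift]
    return math.dist(last_prev, first_cur)
```

Because trimming changes which frames meet at the junction, the same corpus unit can yield different connection costs under each of the three cut-out patterns, which is why each pattern is registered as a separate candidate.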

Next, the sum of the accumulated cost up to the preceding candidate, cp.total_cost, which is already determined in the preceding unit candidate data 209 shown in FIG. 9, and the connection cost calculated in step S1907 multiplied by a weighting coefficient is calculated. The result is stored as the total_cost value (ct.total_cost) in the structure of the unit candidate data 209 being processed indicated by the variable ct on the RAM 503 (step S1908).

Next, it is determined whether the above ct.total_cost value is smaller than the total_cost value shown in FIG. 9 in the original unit candidate data 209 being processed on the RAM 503 indicated by the variable ct, that is, whether the current ct.total_cost value is the best so far (step S1909). If the determination in step S1909 is YES, all member variable values shown in FIG. 9 in the original unit candidate data 209 being processed are replaced with all member variable values of the structure indicated by the variable ct (step S1910). If the determination in step S1909 is NO, the replacement of step S1910 is not performed.

After that, the pointer next to the next preceding unit candidate data 209 is stored in the variable cp, the process returns to step S1906, and, following the determination in step S1906, the series of processes in steps S1907 to S1910 described above is repeated.

When the variable cp has been updated until the last preceding unit candidate data 209 has been searched and the determination in step S1906 becomes NO, the best-predecessor information indicated by the variable ct is first rewritten with the data of the best-information storage variable set in step S1910 (step S1912).

After that, it is determined whether the total_cost value newly set in the unit candidate data 209 being processed, indicated by the variable ct, is smaller than the total_cost value in the unit candidate data 209 of FIG. 9 indicated by the best_cand pointer in the segment data of FIG. 7 indicated by the variable seg (step S1913). If the determination in step S1913 is YES, the value of the variable ct is set in the best_cand pointer in the segment data of FIG. 7 indicated by the variable seg (step S1914). If the determination in step S1913 is NO, step S1914 is not executed.

After that, a pointer to the next unit candidate data 209 to be processed is stored in the variable ct (step S1915), and the process returns to step S1903.

When the variable ct has been updated until the last unit candidate data 209 to be processed has been searched and the determination in step S1903 becomes NO, the processing of the flowchart of FIG. 19 ends, completing one pass of the phoneme string selection process of step S1404 of FIG. 14.
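One pass of the FIG. 19 selection process for a single segment can be sketched in Python as a dynamic-programming step over the candidates of the current and preceding segments. The sketch makes several labeled assumptions: Python lists stand in for the linked lists of FIG. 9, join_cost is a caller-supplied function, w_join is a hypothetical weight, the candidate's own unit_distance is added to the accumulated cost (the description of step S1908 mentions only the weighted connection cost, while the appended notes state that selection uses both costs), and the first segment, which has no predecessor, starts from its unit cost.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Cand:
    unit_id: int
    unit_distance: float                 # unit cost from FIG. 18
    total_cost: float = float("inf")     # accumulated cost up to this candidate
    best_cand: Optional["Cand"] = None   # best candidate of the preceding segment


def select_step(prev_cands: List[Cand], cur_cands: List[Cand],
                join_cost: Callable[[Cand, Cand], float],
                w_join: float = 1.0) -> Optional[Cand]:
    """Update every current candidate and return the segment's best one."""
    seg_best = None                                        # S1901
    for ct in cur_cands:                                   # S1903 / S1915 loop
        ct.total_cost, ct.best_cand = float("inf"), None   # S1905
        if not prev_cands:
            # Assumption: the first segment has no predecessor to connect to.
            ct.total_cost = ct.unit_distance
        for cp in prev_cands:                              # S1906 / S1911 loop
            # S1907-S1908; adding unit_distance is an assumption of this sketch.
            total = cp.total_cost + w_join * join_cost(cp, ct) + ct.unit_distance
            if total < ct.total_cost:                      # S1909
                ct.total_cost, ct.best_cand = total, cp    # S1910 / S1912
        if seg_best is None or ct.total_cost < seg_best.total_cost:  # S1913
            seg_best = ct                                  # S1914
    return seg_best
```

Calling select_step once per segment, from the first to the last, leaves every candidate pointing at its cheapest predecessor, which is what the later back-tracing over best_cand relies on.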

When the above series of processes has completed processing of all segment data, so that the value of the variable seg becomes the undefined value NULL and the determination in step S1402 of FIG. 14 becomes NO, the best_cand pointer of FIG. 7 in the segment data last indicated by the variable segprev is referenced to reach the data of FIG. 9, and one piece of unit data is determined by its unit_id data. A search process then follows the best_cand pointers of FIG. 9 in order from the last segment data toward the first segment data; for each preceding segment data, the unit_id data in the best preceding unit candidate data is referenced, and the unit data are determined one after another. Finally, when the search reaches the first segment data, the unit data corresponding to all segment data have been determined, and they are output as a unit data string to the waveform synthesis unit 107 of FIG. 2.
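The back-tracing described above can be sketched in Python as a walk along the best_cand links from the best candidate of the last segment to the first segment. The class PathNode and the function backtrace are hypothetical names that keep only the two FIG. 9 fields needed here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PathNode:
    unit_id: int
    best_cand: Optional["PathNode"] = None  # best candidate of the preceding segment


def backtrace(last_best: Optional[PathNode]) -> List[int]:
    """Follow best_cand links backwards and return unit IDs in utterance order."""
    ids: List[int] = []
    node = last_best
    while node is not None:
        ids.append(node.unit_id)   # unit chosen for this segment
        node = node.best_cand      # move to the preceding segment
    ids.reverse()                  # collected last-to-first, so flip the order
    return ids
```

For example, with three segments whose chosen candidates have unit IDs 10, 20, and 30, starting from the node for 30 yields the list in first-to-last order, ready to hand to waveform synthesis.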

The embodiment described above makes it possible to select, from the speech corpus, units that are faithful to the specified prosodic transition and whose junctions connect smoothly.

Regarding the above embodiment, the following appendices are further disclosed.
(Appendix 1)
A speech synthesizer comprising:
an extraction unit that extracts, from input text data, a sequence of segment data in which a phoneme and a target prosody are associated;
a generation unit that, for each piece of segment data extracted by the extraction unit, when a speech unit acquired from a speech corpus has a duration longer than that of the target prosody of the segment data, generates a new speech unit by deleting from the acquired speech unit the portion exceeding the duration of the target prosody;
a speech unit list-up unit that calculates a unit cost indicating the degree of mismatch between the segment data and each of the acquired speech unit and the new speech unit, and, based on the calculated unit costs, lists from among the acquired speech unit and the new speech unit the speech units that serve as speech unit candidate data for the segment data;
a phoneme string selection unit that, for each piece of segment data constituting the extracted sequence of segment data, calculates a connection cost indicating the discontinuity between the speech unit candidate data corresponding to that segment data and the speech unit candidate data corresponding to the segment data adjacent to it, and, based on the calculated connection cost and the unit cost of each of the adjacent pieces of speech unit candidate data, selects one of the speech units listed as the speech unit candidate data to generate a speech unit data string; and
a speech generation unit that generates synthesized speech based on the generated speech unit data string.
(Appendix 2)
The speech synthesizer according to appendix 1, wherein the generation unit generates, as the new speech units obtained by deleting from the acquired speech unit the portion exceeding the duration of the target prosody, a new speech unit from which the head portion of the acquired speech unit is deleted, a new speech unit from which the tail portion of the acquired speech unit is deleted, and a new speech unit from which the head portion and the tail portion of the acquired speech unit are deleted in equal amounts.
(Appendix 3)
The speech synthesizer according to appendix 1 or 2, wherein
the segment data and the speech units include phoneme data and prosody data, and
the unit cost is calculated as a weighted sum of a phoneme string cost and a prosody cost, the phoneme string cost being calculated by comparing a phoneme string composed of the phoneme of the segment data being processed and the phonemes of a predetermined number of pieces of segment data before and after it with a phoneme string composed of the phoneme of the speech unit candidate data being processed and the phonemes of the predetermined number of pieces of speech unit data before and after it, and the prosody cost being calculated based on the difference between the prosody data of the segment data being processed and the prosody data of the speech unit candidate data being processed.
(Appendix 4)
The speech synthesizer according to any one of appendices 1 to 3, wherein the connection cost is calculated as the distance between the feature vector data at the phoneme connection point between the speech unit candidate data being processed, which corresponds to the segment data being processed, and the speech unit candidate data corresponding to the segment data immediately preceding that segment data.
(Appendix 5)
A method for synthesizing speech with a speech synthesizer having a speech corpus that stores speech units, wherein the speech synthesizer:
extracts, from input text data, a sequence of segment data in which a phoneme and a target prosody are associated;
for each piece of extracted segment data, when a speech unit acquired from the speech corpus has a duration longer than that of the target prosody of the segment data, generates a new speech unit by deleting from the acquired speech unit the portion exceeding the duration of the target prosody;
calculates a unit cost indicating the degree of mismatch between the segment data and each of the acquired speech unit and the new speech unit;
based on the calculated unit costs, lists from among the acquired speech unit and the new speech unit the speech units that serve as speech unit candidate data for the segment data;
for each piece of segment data constituting the extracted sequence of segment data, calculates a connection cost indicating the discontinuity between the speech unit candidate data corresponding to that segment data and the speech unit candidate data corresponding to the segment data adjacent to it;
based on the calculated connection cost and the unit cost of each of the adjacent pieces of speech unit candidate data, selects one of the speech units listed as the speech unit candidate data to generate a speech unit data string; and
generates synthesized speech based on the generated speech unit data string.
(Appendix 6)
A program that causes a computer used as a speech synthesizer having a speech corpus that stores speech units to execute the steps of:
extracting, from input text data, a sequence of segment data in which a phoneme and a target prosody are associated;
for each piece of extracted segment data, when a speech unit acquired from the speech corpus has a duration longer than that of the target prosody of the segment data, generating a new speech unit by deleting from the acquired speech unit the portion exceeding the duration of the target prosody;
calculating a unit cost indicating the degree of mismatch between the segment data and each of the acquired speech unit and the new speech unit;
based on the calculated unit costs, listing from among the acquired speech unit and the new speech unit the speech units that serve as speech unit candidate data for the segment data;
for each piece of segment data constituting the extracted sequence of segment data, calculating a connection cost indicating the discontinuity between the speech unit candidate data corresponding to that segment data and the speech unit candidate data corresponding to the segment data adjacent to it;
based on the calculated connection cost and the unit cost of each of the adjacent pieces of speech unit candidate data, selecting one of the speech units listed as the speech unit candidate data to generate a speech unit data string; and
generating synthesized speech based on the generated speech unit data string.

100 Speech synthesizer
101 Text input unit
102 Morphological analysis unit
103 Prosody prediction unit
104 Prosody dictionary
105 Waveform selection unit
106 Speech dictionary
107 Waveform synthesis unit
201 Target prosody data
202 Prosody input unit
207 Speech unit selection unit
207a Speech unit list-up unit
207b Phoneme string selection unit
208 Evaluation unit
208a Speech unit evaluation unit
208b Connection evaluation unit
209 Unit candidate data
210 Synthesis unit
501 CPU
502 ROM (read-only memory)
503 RAM (random access memory)
504 Input device
505 Output device
506 External storage device
507 Portable recording medium drive device
508 Communication interface
509 Bus
510 Portable recording medium

Claims

A speech synthesizer comprising:
an extraction unit that extracts, from input text data, a sequence of segment data in which a phoneme and a target prosody are associated;
a generation unit that, for each piece of segment data extracted by the extraction unit, when a speech unit acquired from a speech corpus has a duration longer than that of the target prosody of the segment data, generates a new speech unit by deleting from the acquired speech unit the portion exceeding the duration of the target prosody;
a speech unit list-up unit that calculates a unit cost indicating the degree of mismatch between the segment data and each of the acquired speech unit and the new speech unit, and, based on the calculated unit costs, lists from among the acquired speech unit and the new speech unit the speech units that serve as speech unit candidate data for the segment data;
a phoneme string selection unit that, for each piece of segment data constituting the extracted sequence of segment data, calculates a connection cost indicating the discontinuity between the speech unit candidate data corresponding to that segment data and the speech unit candidate data corresponding to the segment data adjacent to it, and, based on the calculated connection cost and the unit cost of each of the adjacent pieces of speech unit candidate data, selects one of the speech units listed as the speech unit candidate data to generate a speech unit data string; and
a speech generation unit that generates synthesized speech based on the generated speech unit data string.

The speech synthesizer, wherein the generation unit generates, as the new speech units from which the portion exceeding the duration of the target prosody has been deleted from the acquired speech unit, a new speech unit in which the beginning portion of the acquired speech unit is deleted, a new speech unit in which the end portion of the acquired speech unit is deleted, and a new speech unit in which the beginning portion and the end portion of the acquired speech unit are deleted in equal amounts.
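The three shortened variants named in this claim can be illustrated with a minimal sketch. It assumes a speech unit is a plain list of frames and that durations are counted in frames; the function name `trim_variants` and the dictionary keys are hypothetical and do not come from the patent.

```python
def trim_variants(frames, target_len):
    """Return the three shortened variants of a unit that is longer than the target.

    Assumption: `frames` is a list of frames and `target_len` is the target
    duration expressed in the same frame count.
    """
    excess = len(frames) - target_len
    if excess <= 0:
        return {}  # unit is not longer than the target, nothing to delete
    # Assumption: when the excess is odd, the extra frame is taken from the end.
    half = excess // 2
    return {
        "head_deleted": frames[excess:],                 # drop the beginning
        "tail_deleted": frames[:target_len],             # drop the end
        "both_deleted": frames[half:half + target_len],  # drop both ends equally
    }


variants = trim_variants(list(range(10)), 6)
```

Each variant has exactly the target duration, so all three can be scored alongside the untrimmed unit when the candidates are listed.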

The speech synthesizer according to claim 1, wherein the segment data and the speech units include phoneme data and prosodic data, and
The unit cost is calculated as a weighted sum of a phoneme sequence cost and a prosody cost, the phoneme sequence cost being calculated by comparing a phoneme string composed of the phoneme of the segment data to be processed and the phonemes of a predetermined number of segment data before and after it with a phoneme string composed of the phoneme of the speech unit candidate data to be processed and the phonemes of the predetermined number of speech unit data before and after it, and the prosody cost being calculated based on the difference between the prosodic data of the segment data to be processed and the prosodic data of the speech unit candidate data to be processed.
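As an illustration of the weighted sum described in this claim, the sketch below combines a phoneme sequence cost and a prosody cost. The claim does not fix either formula, so the mismatch fraction, the sum of absolute differences, the weight values, and all function and parameter names here are assumptions made for the example.

```python
def phoneme_sequence_cost(target_context, candidate_context):
    """Assumed measure: fraction of context positions whose phonemes differ."""
    mismatches = sum(t != c for t, c in zip(target_context, candidate_context))
    return mismatches / len(target_context)


def prosody_cost(target_prosody, candidate_prosody):
    """Assumed measure: sum of absolute differences over the prosodic features."""
    return sum(abs(target_prosody[k] - candidate_prosody[k]) for k in target_prosody)


def unit_cost(target_context, candidate_context, target_prosody, candidate_prosody,
              w_phoneme=1.0, w_prosody=1.0):
    """Weighted sum of the two costs; the weights are hypothetical tuning values."""
    return (w_phoneme * phoneme_sequence_cost(target_context, candidate_context)
            + w_prosody * prosody_cost(target_prosody, candidate_prosody))


cost = unit_cost(["k", "a", "s"], ["k", "a", "t"],
                 {"pitch": 120.0, "duration": 0.10},
                 {"pitch": 110.0, "duration": 0.10},
                 w_phoneme=3.0, w_prosody=0.5)
```

In practice the prosodic features would need to be normalized before being summed, since pitch and duration are on very different scales.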

The speech synthesizer according to claim 1, wherein the connection cost is calculated as the distance between the feature vector data, at the phoneme connection point, of the speech unit candidate data to be processed, which corresponds to the segment data to be processed, and of the speech unit candidate data corresponding to the segment data immediately preceding that segment data.
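A minimal sketch of such a connection cost follows. It assumes each speech unit is a sequence of per-frame feature vectors, that the connection point is the last frame of the preceding unit and the first frame of the current unit, and that the distance is Euclidean; the claim itself only requires a distance between feature vector data, and the function name is hypothetical.

```python
import math


def connection_cost(previous_unit_frames, current_unit_frames):
    """Euclidean distance between feature vectors at the join (assumed metric)."""
    end_of_previous = previous_unit_frames[-1]   # last frame of the preceding unit
    start_of_current = current_unit_frames[0]    # first frame of the current unit
    return math.dist(end_of_previous, start_of_current)


cost = connection_cost([[0.0, 0.0], [1.0, 2.0]], [[4.0, 6.0], [9.0, 9.0]])
```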

A speech synthesis method performed by a speech synthesizer having a speech corpus that stores speech units, the method comprising:
Extracting, from input text data, a segment data string in which phonemes are associated with target prosody;
Generating, for each of the extracted segment data, when a speech unit acquired from the speech corpus has a duration longer than that of the target prosody of the segment data, a new speech unit by deleting from the acquired speech unit the portion that exceeds the duration of the target prosody;
Calculating a unit cost indicating the degree of mismatch between the segment data and each of the acquired speech unit and the new speech unit;
Listing, based on the calculated unit costs, the speech units that serve as speech unit candidate data for the segment data from the acquired speech unit and the new speech unit;
Calculating, for each of the segment data constituting the extracted segment data string, a connection cost indicating the discontinuity between the speech unit candidate data corresponding to that segment data and the speech unit candidate data corresponding to the segment data adjacent to it;
Selecting, based on the calculated connection cost and the unit cost of each of the adjacent speech unit candidate data, one of the speech units listed as the speech unit candidate data, thereby generating a speech unit data string; and
Generating synthesized speech based on the generated speech unit data string.
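The selection step of this method can be sketched as a dynamic-programming search over the listed candidates. The claim does not name a search procedure; the description's background mentions a DP algorithm, and this example assumes one in which unit costs and connection costs are simply added. The function name `select_units` and its arguments are hypothetical, and scalars stand in for speech units in the toy usage.

```python
def select_units(candidates, unit_costs, connection_cost):
    """Pick one candidate per segment so that the summed cost is minimal.

    candidates[i]   : list of candidate units for segment i
    unit_costs[i][j]: unit cost of candidates[i][j]
    connection_cost : function(previous_unit, current_unit) -> float
    """
    best = [list(unit_costs[0])]           # best[i][j]: cheapest path ending at i, j
    back = [[None] * len(candidates[0])]   # back-pointers for path recovery
    for i in range(1, len(candidates)):
        row, pointers = [], []
        for j, unit in enumerate(candidates[i]):
            totals = [best[i - 1][k] + connection_cost(prev, unit)
                      for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(totals)), key=totals.__getitem__)
            row.append(totals[k_best] + unit_costs[i][j])
            pointers.append(k_best)
        best.append(row)
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


# Toy usage: scalar "units" and an absolute-difference connection cost.
path, total = select_units(
    [[1.0, 5.0], [2.0, 9.0], [8.0, 3.0]],
    [[0.0, 0.0], [0.5, 0.0], [0.0, 0.0]],
    lambda a, b: abs(a - b),
)
```

Because only the best predecessor is kept for each candidate, the search cost grows with the product of adjacent candidate-list sizes rather than with the number of possible unit sequences.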

A program that causes a computer used as a speech synthesizer having a speech corpus that stores speech units to execute:
Extracting, from input text data, a segment data string in which phonemes are associated with target prosody;
Generating, for each of the extracted segment data, when a speech unit acquired from the speech corpus has a duration longer than that of the target prosody of the segment data, a new speech unit by deleting from the acquired speech unit the portion that exceeds the duration of the target prosody;
Calculating a unit cost indicating the degree of mismatch between the segment data and each of the acquired speech unit and the new speech unit;
Listing, based on the calculated unit costs, the speech units that serve as speech unit candidate data for the segment data from the acquired speech unit and the new speech unit;
Calculating, for each of the segment data constituting the extracted segment data string, a connection cost indicating the discontinuity between the speech unit candidate data corresponding to that segment data and the speech unit candidate data corresponding to the segment data adjacent to it;
Selecting, based on the calculated connection cost and the unit cost of each of the adjacent speech unit candidate data, one of the speech units listed as the speech unit candidate data, thereby generating a speech unit data string; and
Generating synthesized speech based on the generated speech unit data string.