JP2012042974A

JP2012042974A - Voice synthesizer

Info

Publication number: JP2012042974A
Application number: JP2011234467A
Authority: JP
Inventors: Yusuke Fujita; 雄介藤田; Ryota Kamoshita; 亮太鴨志田; Kenji Nagamatsu; 健司永松
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2011-10-26
Filing date: 2011-10-26
Publication date: 2012-03-01

Abstract

PROBLEM TO BE SOLVED: To provide a high-quality voice synthesizer that generates voice data for a sentence consisting of a fixed portion and a variable portion by combining recorded voice with rule-synthesized voice, and that avoids the discontinuity in sound quality and prosody perceived when recorded voice is connected to synthesized voice.

SOLUTION: A voice synthesizer comprises: a recorded voice storing unit 5 for pre-storing recorded voice data including a fixed portion; a rule synthesizing unit 7 for generating, from an accepted text, rule-synthesized voice data including a variable portion and at least part of the fixed portion; a connection boundary calculating unit 8 for calculating a connection boundary position in a section where the recorded voice data overlaps the rule-synthesized voice data, based on acoustic feature information of the recorded voice data and the rule-synthesized voice data corresponding to the text; and a connection synthesizing unit 9 for generating synthesized voice data corresponding to the text by connecting the recorded voice data and the rule-synthesized voice data, each cut at the connection boundary position.

Description

The present invention relates to a speech synthesis apparatus, and more particularly to a speech synthesis technique that synthesizes speech data for a sentence consisting of a fixed part and a variable part by combining recorded speech with rule-synthesized speech.

In general, recorded speech refers to speech produced from a recording, and rule-synthesized speech refers to speech synthesized from a string of characters or codes representing pronunciation. Rule-based speech synthesis applies language processing to the input text to generate an intermediate symbol string carrying reading and accent information, then determines prosodic parameters such as the fundamental frequency pattern (the vibration period of the vocal cords, which corresponds to voice pitch) and phoneme durations (the length of each phoneme, which corresponds to speaking rate), and finally generates a speech waveform that matches those prosodic parameters through waveform generation processing. Concatenative speech synthesis, which combines speech segments corresponding to phonemes or syllables, is widely used as a method for generating the speech waveform from the prosodic parameters.

The general flow of rule synthesis is as follows. First, language processing generates two kinds of information from the input text: reading information, which expresses the sequence of phonemes (the smallest units that distinguish meaning in speech) and syllables (perceptual units of speech consisting of roughly one to three combined phonemes), and accent information, which expresses accent (information specifying the strength of pronunciation) and intonation (information indicating a question or the speaker's emotion). Together these form the intermediate symbol string. Dictionary-based language processing and morphological analysis are applied to generate the intermediate symbol string. Next, prosodic parameters such as the fundamental frequency pattern and phoneme durations are determined so as to correspond to the accent information in the intermediate symbol string. The prosodic parameters are generated from a prosody model trained in advance on natural speech, or from heuristics (control rules found empirically). Finally, waveform generation processing produces a speech waveform that matches the prosodic parameters.

Because rule synthesis can output any input text as speech, it allows a more flexible voice guidance system to be built than one that uses recorded speech. However, its quality falls short of natural speech, and this quality gap has been an obstacle to introducing rule-synthesized speech into voice guidance systems that have conventionally used recorded speech, such as in-vehicle car navigation.

To realize a voice guidance system that uses rule-synthesized speech, a method has therefore been used that combines the high quality of recorded speech with the flexibility of rule-synthesized speech: pre-recorded speech is used for the fixed parts and rule-synthesized speech for the variable parts.

However, in speech output by combining recorded speech and rule-synthesized speech, discontinuities in sound quality and prosody between the two are perceived, so that even though the recorded portion is of high quality, the utterance as a whole is not.

As a method for eliminating prosodic discontinuity, a method has been disclosed that uses the characteristics of the recorded speech when setting the parameters for the rule-synthesized speech (see, for example, Patent Document 1). A method has also been disclosed that extends the rule-synthesized portion in consideration of prosodic continuity between the fixed part and the variable part (see, for example, Patent Document 2).

Patent Document 1: JP-A-11-249677
Patent Document 2: JP 2005-321520 A

With the prior art, the prosody of the rule-synthesized portion becomes natural, but the difference in sound quality between the rule-synthesized speech and the recorded speech can become large, so natural-sounding speech cannot be obtained overall.

The present invention solves the above problem, and its object is to provide a high-quality speech synthesizer in which no discontinuity in sound quality or prosody is perceived when recorded speech and synthesized speech are connected.

To achieve the above object, the present invention provides a speech synthesizer that synthesizes text consisting of a fixed part and a variable part, comprising: recorded-speech storage means that stores in advance first speech data (recorded speech data), which is speech data created from recorded speech and includes the fixed part; rule synthesis means that generates, from the received text, second speech data (rule-synthesized speech data) including the variable part and at least a portion of the fixed part; connection boundary calculation means that selects the position of the connection boundary between the recorded speech data and the speech data generated by rule synthesis, based on acoustic feature information of the section, corresponding to the text, in which the first speech data and the second speech data overlap; and connection synthesis means that synthesizes the speech data of the text by connecting third speech data, obtained by cutting the first speech data at the connection boundary, to fourth speech data, obtained by cutting the second speech data at the connection boundary. As an example, the fixed part can be defined as the part for which corresponding speech data exists, and the variable part as the part for which no corresponding speech data exists.

In this configuration, the rule-synthesized speech data is generated so as to include a portion of the fixed part in addition to the variable part, which creates a section in which the rule-synthesized speech data and the recorded speech data overlap. This makes the position at which the recorded speech and the rule-synthesized speech are connected variable. By calculating the optimum connection position from the acoustic feature information of the recorded speech and the rule-synthesized speech in the overlapping section, more natural synthesized speech is generated than with the prior art.
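The cut-and-join operation described above can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each waveform is a list of samples and that the sample index of every phoneme boundary in the overlapping section is known for both waveforms. The function name `splice_at_boundary` and the toy data are illustrative only and do not come from the patent. The sketch covers the case in which the rule-synthesized speech precedes the recorded speech.

```python
def splice_at_boundary(synth, recorded, synth_bounds, rec_bounds, k):
    """Join synthesized speech (before boundary k) to recorded speech (after it).

    synth_bounds[k] and rec_bounds[k] are the sample indices of the same
    phoneme boundary in the two waveforms (hypothetical alignment data).
    """
    head = synth[:synth_bounds[k]]      # part taken from rule synthesis
    tail = recorded[rec_bounds[k]:]     # part taken from the recording
    return head + tail


# Toy data: 's' marks synthesized samples, 'r' marks recorded samples.
synth = ["s"] * 10
recorded = ["r"] * 12
synth_bounds = [4, 6, 8]   # overlap boundaries in the synthesized waveform
rec_bounds = [5, 8, 10]    # the same boundaries in the recorded waveform

early = splice_at_boundary(synth, recorded, synth_bounds, rec_bounds, 0)
late = splice_at_boundary(synth, recorded, synth_bounds, rec_bounds, 2)
```

Because the overlapping section offers several candidate values of `k`, the join can be moved to whichever boundary gives the least audible discontinuity.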

In another configuration of the present invention, rule synthesis means is provided that uses the acoustic feature information of the recorded speech data in the overlapping section to generate rule-synthesized speech data consistent with the recorded speech data.

In this configuration, matching the prosody in the overlapping section eliminates the prosodic discontinuity. Moreover, the rule-synthesized speech data of the variable part preceding or following the overlapping section is matched at the same time, so the synthesized speech is consistent as a whole, not only at the connection boundary.

In yet another configuration of the present invention, rule synthesis means is provided that modifies the rule-synthesized speech data based on the acoustic feature information of the recorded speech data and the rule-synthesized speech data at the connection boundary position obtained from the connection boundary calculation means.

In this configuration, after the connection boundary is determined, the features of the rule-synthesized speech data are modified so that the acoustic features near the connection boundary come closer to those of the recorded speech. This produces synthesized speech in which discontinuities in prosody and sound quality are even less noticeable.

Using the phoneme category as the acoustic feature information in the present invention yields a suitable connection boundary. The phoneme category is information that specifies the class of a phoneme, for example voiced, unvoiced, plosive, or fricative. Connecting in a pause (silent) section naturally makes connection distortion inconspicuous, and the same holds at the start of an unvoiced plosive, where a short silent interval exists. Connecting within a voiced section, by contrast, can produce audible artifacts because of differences in fundamental frequency or phase across the connection boundary, so connecting in an unvoiced section is preferable. In addition, using power as the acoustic feature information allows a connection boundary with low power to be selected, which makes connection distortion inconspicuous.
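The preference described above can be sketched in Python as a ranking of candidate boundaries. The patent does not state how phoneme category and power are combined, so the lexicographic ordering used here (category first, then power) is an assumption, as are the names `CATEGORY_RANK`, `Candidate`, and `select_boundary` and all numeric values.

```python
from dataclasses import dataclass

# Hypothetical preference order (lower is better): pauses first, then the
# onset of an unvoiced plosive, then other unvoiced sounds, then voiced sounds.
CATEGORY_RANK = {"pause": 0, "unvoiced_plosive": 1, "unvoiced": 2, "voiced": 3}


@dataclass
class Candidate:
    position: int   # index of the phoneme boundary in the overlapping section
    category: str   # category of the phoneme that starts at this boundary
    power: float    # signal power around the boundary (illustrative units)


def select_boundary(candidates):
    """Pick the candidate with the best category, breaking ties by low power."""
    return min(candidates, key=lambda c: (CATEGORY_RANK[c.category], c.power))


candidates = [
    Candidate(position=0, category="voiced", power=0.10),
    Candidate(position=1, category="unvoiced_plosive", power=0.30),
    Candidate(position=2, category="unvoiced_plosive", power=0.05),
    Candidate(position=3, category="unvoiced", power=0.01),
]
best = select_boundary(candidates)
```

In this toy data no pause is available, so the two unvoiced-plosive onsets outrank the other candidates and the one with the lower power is chosen.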

Using the fundamental frequency as the acoustic feature information yields a connection boundary at which the prosody connects smoothly. Selecting a phoneme boundary at which the difference between the fundamental frequencies of the recorded speech and the rule-synthesized speech is small makes a discontinuity in fundamental frequency hard to perceive. Using phoneme duration allows a connection boundary to be selected such that the speaking rate does not change abruptly across it.

Using the spectrum (information representing the frequency components of speech) as the acoustic feature information avoids an abrupt change in sound quality near the connection boundary. This is particularly effective in the configuration that, after determining the connection boundary, modifies the features of the rule-synthesized speech data using the acoustic feature information near the boundary: the spectrum of the rule-synthesized speech near the connection boundary can be modified to come closer to that of the recorded speech.

In the present invention, the range over which rule-synthesized speech data is created includes a portion of the fixed part in addition to the variable part. This range is preferably defined as one breath group (a unit delimited by pauses for breathing), one sentence (a unit delimited by sentence-ending punctuation), or the entire fixed part. A large overlapping section is especially helpful for matching the prosody of the recorded speech and the rule-synthesized speech. However, when another means of matching prosody is available, or when computational cost is a concern, the range may be defined to be shorter than one breath group.

In the connection boundary calculation means of the present invention, the candidate positions for the connection boundary are all sample points in the overlapping section, but restricting the selection to phoneme boundaries yields an effective connection boundary. With this configuration, the acoustic feature information of the recorded speech and the rule-synthesized speech needs to be calculated only at phoneme boundaries, which is advantageous in terms of storage capacity and computational cost.

In the recorded-speech storage means of the present invention, storing speech data recorded in advance in units of one breath group or one sentence, including the fixed part and some portion outside the fixed part, makes it possible to use sections of the recorded speech outside the fixed part as well. When the text of the fixed part is set in advance, the recorded speech can be chosen according to the text of the variable part. If recorded speech is then available for a portion of the variable part too, that portion of the variable part can be included in the overlapping section. This allows more of the recorded speech to be exploited and higher-quality synthesized speech to be generated.

Furthermore, a speech synthesizer of the present invention synthesizes text consisting of a fixed part and a variable part and comprises: a recorded-speech storage unit that stores in advance recorded speech data including the fixed part; a rule synthesis unit that generates, from the received text, rule-synthesized speech data including the variable part and at least a portion of the fixed part; a connection boundary calculation unit that calculates, based on acoustic feature information of the recorded speech data and the rule-synthesized speech data corresponding to the text, a connection boundary position in the section where the recorded speech data and the rule-synthesized speech data overlap; and a connection synthesis unit that connects the recorded speech data and the rule-synthesized speech data, each cut at the connection boundary position, to generate synthesized speech data corresponding to the text.

Furthermore, a speech synthesizer of the present invention synthesizes text consisting of a fixed part and a variable part and comprises: a recorded-speech storage unit that stores in advance recorded speech data including the fixed part; a rule synthesis parameter calculation unit that calculates, from the received text, rule synthesis parameters covering the variable part and at least a portion of the fixed part, and generates acoustic feature information of the rule-synthesized speech; a connection boundary calculation unit that uses the acoustic feature information of the recorded speech and the acoustic feature information of the rule-synthesized speech to calculate a connection boundary position in the section where the recorded speech data and the rule synthesis parameters overlap; a rule-synthesized speech data unit that generates rule-synthesized speech data using the acoustic feature information of the recorded speech, the acoustic feature information of the rule-synthesized speech, and the connection boundary position; a connection synthesis unit that connects the recorded speech data and the rule-synthesized speech data, each cut at the connection boundary position, and outputs synthesized speech data corresponding to the text; and means for outputting the synthesized speech data.

Furthermore, a speech synthesizer of the present invention creates synthesized speech by connecting pre-recorded sound pieces, namely a sound piece including a variable part and a sound piece including a fixed part, and comprises: a recorded-speech storage unit that stores speech data made up of the pre-recorded sound pieces; an input analysis unit that creates, from the received input text, an intermediate symbol string for the sound piece of the variable part and an intermediate symbol string for the sound piece of the fixed part; a recorded-speech selection unit that, according to the input for the variable part, selects appropriate recorded speech data from among a plurality of recorded speech data having the same fixed part; a rule synthesis unit that determines the range over which rule-synthesized speech data is generated, using the intermediate symbol string of the variable-part sound piece and the intermediate symbol string of the fixed-part sound piece obtained by the input analysis unit; a connection boundary calculation unit that uses the acoustic feature information of the recorded speech data and the acoustic feature information of the rule-synthesized speech data to calculate a connection boundary position in the section where the recorded speech data and the rule-synthesized speech data overlap; a connection synthesis unit that cuts the recorded speech data and the rule-synthesized speech data at the connection boundary position obtained from the connection boundary calculation unit and connects the cut data to create synthesized speech data corresponding to the sound piece including the variable part; and a sound piece connection unit that connects the sound pieces in the order obtained from the input text to generate the output speech.

A speech synthesis method of the present invention stores in advance recorded speech data and a first intermediate symbol string corresponding to the recorded speech data, and comprises: a first step of preparing input text; a second step of converting the input text into a second intermediate symbol string; a third step of referring to the first intermediate symbol string and dividing the second intermediate symbol string into a fixed part, which corresponds to the first intermediate symbol string, and a variable part, which does not; a fourth step of acquiring, from the recorded speech data, the portion whose first intermediate symbol string corresponds to the fixed part; a fifth step of generating, using the second intermediate symbol string, rule-synthesized speech data for all of the portion corresponding to the variable part and at least part of the portion corresponding to the fixed part; and a sixth step of combining part of the acquired recorded speech data with part of the generated rule-synthesized speech data.

Here, the acquired recorded speech data and the generated rule-synthesized speech data can each be one continuous phrase. Because the two phrases overlap, there is considerable freedom in choosing where to join them, and they can be joined with a natural transition. That is, since the two sets of speech data overlap within the fixed part, a position in this section at which the two match can be selected as the connection boundary and the data joined there. As a criterion for judging where they match, one can, for example, select a position where the difference between the two sets of speech data in features such as fundamental frequency, spectrum, and duration is small. If necessary, one of the two sets of data can also be modified before joining. For example, the parameters used to generate the rule-synthesized speech data can be adjusted so that the difference in features between the recorded speech data and the rule-synthesized speech data becomes small, bringing the acoustic features into agreement.
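One way to read the matching criterion above is as a cost that is minimized over the candidate boundaries. The following Python sketch assumes a weighted sum of absolute differences in fundamental frequency, spectrum, and duration. The patent names these features but not how they are combined, so the cost function, the weights, the feature representation, and the names `boundary_cost` and `best_boundary` are all hypothetical.

```python
import math


def boundary_cost(rec, syn, weights):
    """Weighted sum of feature differences at one candidate boundary.

    rec and syn are dicts with 'f0' (Hz), 'spectrum' (list of floats) and
    'duration' (seconds); this feature layout is an assumption.
    """
    f0_diff = abs(rec["f0"] - syn["f0"])
    spec_diff = math.dist(rec["spectrum"], syn["spectrum"])  # Euclidean distance
    dur_diff = abs(rec["duration"] - syn["duration"])
    return (weights["f0"] * f0_diff
            + weights["spectrum"] * spec_diff
            + weights["duration"] * dur_diff)


def best_boundary(rec_feats, syn_feats, weights):
    """Return the index of the candidate boundary with the lowest cost."""
    costs = [boundary_cost(r, s, weights) for r, s in zip(rec_feats, syn_feats)]
    return min(range(len(costs)), key=costs.__getitem__)


# Illustrative features at three candidate boundaries in the overlap.
rec_feats = [
    {"f0": 120.0, "spectrum": [1.0, 0.5], "duration": 0.08},
    {"f0": 130.0, "spectrum": [0.9, 0.4], "duration": 0.10},
    {"f0": 125.0, "spectrum": [0.8, 0.6], "duration": 0.09},
]
syn_feats = [
    {"f0": 150.0, "spectrum": [0.2, 0.1], "duration": 0.12},
    {"f0": 131.0, "spectrum": [0.9, 0.5], "duration": 0.10},
    {"f0": 140.0, "spectrum": [0.8, 0.6], "duration": 0.09},
]
weights = {"f0": 1.0, "spectrum": 10.0, "duration": 100.0}

chosen = best_boundary(rec_feats, syn_feats, weights)
```

With these toy values the second candidate has the smallest combined difference, so it is selected. The weights only balance the different units of the three features and would need tuning in practice.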

According to the present invention, a high-quality speech synthesizer can be realized in which no discontinuity in sound quality or prosody is perceived when recorded speech and synthesized speech are connected.

FIG. 1 is a block diagram showing the configuration of the speech synthesizer in the first embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the speech synthesizer in the first embodiment.
FIG. 3 is a diagram showing the information stored in the recorded-speech storage unit in the first embodiment.
FIG. 4 is a diagram showing a specific example of the information stored in the recorded-speech storage unit in the first embodiment.
FIG. 5 is an explanatory diagram of the method of generating rule-synthesized speech in the first embodiment.
FIG. 6 is an explanatory diagram of the method of selecting the connection boundary position in the first embodiment.
FIG. 7 is a block diagram showing the configuration of the speech synthesizer in the second embodiment of the present invention.
FIG. 8 is a flowchart showing the operation of the speech synthesizer in the second embodiment.
FIG. 9 is a block diagram showing the configuration of the speech synthesizer in the third embodiment of the present invention.
FIG. 10 is a flowchart showing the operation of the speech synthesizer in the third embodiment.
FIG. 11 is a diagram showing the configuration of the input screen in the third embodiment.
FIG. 12 is a diagram showing the information stored in the sound piece information storage unit in the third embodiment.
FIG. 13 is a diagram showing a specific example of the information stored in the sound piece information storage unit in the third embodiment.
FIG. 14 is an explanatory diagram of the method of selecting the connection boundary position in the third embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

(Example 1)
FIG. 1 is a block diagram showing a speech synthesizer according to the first embodiment of the present invention, configured for a car navigation system.
As shown in the figure, this embodiment consists of a speech synthesizer 1 and a navigation control device 2. The speech synthesizer 1 of the present invention comprises: an input analysis unit 4 that analyzes the text input from the navigation control unit 3; a recorded-speech selection unit 6 that retrieves recorded speech data from the recorded-speech storage unit 5 using the intermediate symbol string of the fixed part obtained by the input analysis unit 4; a rule synthesis unit 7 that generates rule-synthesized speech data using the intermediate symbol string of the variable part and a portion of the intermediate symbol string of the fixed part obtained by the input analysis unit 4, together with the acoustic feature information of the recorded speech obtained by the recorded-speech selection unit 6; a connection boundary calculation unit 8 that calculates the connection boundary between the recorded speech data and the rule-synthesized speech data using the acoustic feature information of the recorded speech obtained by the recorded-speech selection unit 6 and the acoustic feature information of the rule-synthesized speech obtained by the rule synthesis unit 7; and a connection synthesis unit 9 that cuts out and connects the recorded speech data and the rule-synthesized speech data using the connection boundary obtained by the connection boundary calculation unit 8.

Next, the operation of the speech synthesizer 1 according to the first embodiment of the present invention will be described with reference to FIG. 1 and FIG. 2. FIG. 2 is a flowchart showing the operation of the speech synthesizer 1 according to the first embodiment.

First, the navigation control device 2 determines the input text to be passed to the speech synthesizer 1.

The navigation control unit 3 receives various information, such as weather forecasts and traffic information, from the information receiving unit 10, combines it with, for example, the current position information obtained from the GPS 11 and the map information held in the navigation data storage unit 12, and creates the input text to be passed to the speech synthesizer 1 (step 101).

Next, the input analysis unit 4 receives the input text to be output as speech from the navigation control device 2 and converts it into an intermediate symbol string (step 102). The input text is a character string of mixed kanji and kana, for example 「国分寺の明日の天気です。」 ("This is tomorrow's weather for Kokubunji."). The input analysis unit 4 performs language processing and converts it into an intermediate symbol string for speech synthesis such as 「コクブンジノアシタノテンキデス」 (kokubunji no ashita no tenki desu).

Next, the input analysis unit 4 refers to the intermediate symbol strings 402, which are stored in the recorded-speech storage unit 5 in association with the recorded speech data 401 as shown in FIG. 3, and searches for a matching portion. It determines the intermediate symbol string to be treated as the fixed part, and determines the portion that cannot be associated with the recorded speech data 401 to be the variable part (step 103).

録音音声格納部５には、上述のように、図３に示すような構成で、録音音声データ４０１と関連付けられた中間記号列４０２が複数組格納されている。ここで、図４に示すように、録音音声格納部５に中間記号列「シンジュクノアシタノテンキデス」が格納されている場合を例としてステップ１０３の動作を説明する。 As described above, the recorded voice storage unit 5 stores a plurality of pairs of recorded voice data 401 and an associated intermediate symbol string 402, in the configuration shown in FIG. 3. The operation of step 103 is described here for the case where, as shown in FIG. 4, the intermediate symbol string 「シンジュクノアシタノテンキデス」 (shinjuku no ashita no tenki desu) is stored in the recorded voice storage unit 5.

入力解析部４から得られる中間記号列「コクブンジノアシタノテンキデス」と録音音声格納部５に格納されている中間記号列４０２と順次比較すると、「シンジュクノアシタノテ’ンキデス」が、「ノアシタノテンキデス」の部分で入力解析部４から得られる中間記号列と一致するため、該当する部分を定型部分として、録音音声データ４０１を用いることができる。そこで、「ノアシタノテンキデス」を定型部分と決定し、録音音声データと関連付けることができない「コクブンジ」を可変部分と決定する。 When the intermediate symbol string 「コクブンジノアシタノテンキデス」 obtained from the input analysis unit 4 is compared in turn with the intermediate symbol strings 402 stored in the recorded voice storage unit 5, the stored string 「シンジュクノアシタノテ’ンキデス」 matches it in the portion 「ノアシタノテンキデス」 (no ashita no tenki desu), so the recorded voice data 401 can be used for that portion as the fixed part. Accordingly, 「ノアシタノテンキデス」 is determined as the fixed part, and 「コクブンジ」 (kokubunji), which cannot be associated with the recorded voice data, is determined as the variable part.
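The matching in step 103 can be sketched as a longest-common-substring search. The Python sketch below is illustrative only: the function name `split_fixed_variable` and the character-level comparison of katakana strings are assumptions, and the description does not specify which matching algorithm the input analysis unit 4 actually uses.

```python
from difflib import SequenceMatcher


def split_fixed_variable(input_symbols, stored_symbols):
    """Return (variable_prefix, fixed_part, variable_suffix, stored_match).

    Hypothetical helper: picks the stored intermediate symbol string that
    shares the longest contiguous run of characters with the input string.
    """
    if not stored_symbols:
        raise ValueError("no stored intermediate symbol strings")
    best_stored, best = None, None
    for stored in stored_symbols:
        matcher = SequenceMatcher(None, input_symbols, stored, autojunk=False)
        match = matcher.find_longest_match(0, len(input_symbols), 0, len(stored))
        if best is None or match.size > best.size:
            best_stored, best = stored, match
    start, end = best.a, best.a + best.size
    # Everything outside the matched run is treated as the variable part.
    return input_symbols[:start], input_symbols[start:end], input_symbols[end:], best_stored


prefix, fixed, suffix, source = split_fixed_variable(
    "コクブンジノアシタノテンキデス", ["シンジュクノアシタノテンキデス"]
)
print(prefix, fixed, suffix, source)
```

A real system would presumably compare phoneme or mora sequences rather than raw characters, and would also carry the accent marks of the intermediate symbol strings.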

次に録音音声選択部６において、録音音声データ４０１と録音音声の音響特徴情報４０３を取得する（ステップ１０４）。 Next, the recorded voice selection unit 6 acquires the recorded voice data 401 and the acoustic feature information 403 of the recorded voice (step 104).

録音音声選択部６は、入力解析部４で得られる定型部分の中間記号列を用いて、録音音声格納部５から録音音声データ４０１を取得する。ここで、定型部分の中間記号列が「ノアシタノテンキデス」となっている場合でも、当該中間記号列の前および後ろの少なくとも一方の録音音声データを一緒に取得する。ここでは一例として、「シンジュクノアシタノテンキデス」に対応する録音音声データ全体を取得するものとした。定型部分に対応する部分だけ切り出す処理はここでは行わない。 The recorded voice selection unit 6 acquires the recorded voice data 401 from the recorded voice storage unit 5 using the intermediate symbol string of the fixed part obtained by the input analysis unit 4. Even when the intermediate symbol string of the fixed part is 「ノアシタノテンキデス」, the recorded voice data that precedes and/or follows that intermediate symbol string is acquired together with it. In this example, the entire recorded voice data corresponding to 「シンジュクノアシタノテンキデス」 is acquired. The portion corresponding to the fixed part alone is not cut out at this stage.

また、録音音声格納部５に録音音声データ４０１と関連付けて格納されている音響特徴情報４０３を取得する。音響特徴情報は、図４の例に示すような構成で格納されており、録音音声の各音素に関して，音素カテゴリ・始終端の時刻・基本周波数が記述されている。 Also, acoustic feature information 403 stored in the recorded voice storage unit 5 in association with the recorded voice data 401 is acquired. The acoustic feature information is stored in a configuration as shown in the example of FIG. 4, and for each phoneme of the recorded voice, the phoneme category, the start / end time, and the fundamental frequency are described.
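One way to picture the acoustic feature information 403 is as a list of per-phoneme records. The sketch below is an assumption for illustration: the class name `PhonemeFeature`, the field names, the category labels, and the numeric values are all hypothetical, since the contents of FIG. 4 are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PhonemeFeature:
    category: str           # phoneme category label (label set is assumed)
    start: float            # start time in seconds
    end: float              # end time in seconds
    f0_hz: Optional[float]  # fundamental frequency; None for unvoiced phonemes

    @property
    def duration(self) -> float:
        return self.end - self.start


def phoneme_boundaries(features: List[PhonemeFeature]) -> List[float]:
    """Start time of every phoneme followed by the end time of the last one."""
    if not features:
        return []
    return [p.start for p in features] + [features[-1].end]


# Hypothetical values, not taken from the patent figures.
features = [
    PhonemeFeature("unvoiced_fricative", 0.00, 0.08, None),
    PhonemeFeature("vowel", 0.08, 0.15, 210.0),
    PhonemeFeature("nasal", 0.15, 0.27, 205.0),
]
print(phoneme_boundaries(features))
```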

規則合成部７は、入力解析部４で得られる可変部分の中間記号列と定型部分の中間記号列を用いて、規則合成音声を生成する範囲を決定する（ステップ１０５)。ここで、規則合成音声を作成する範囲は、可変部分を含む一文と定義しておくと、可変部分の「コクブンジ」に加えて、定型部分「ノアシタノテンキデス」を含めて規則合成音声を生成する。 The rule synthesis unit 7 determines the range over which rule-synthesized speech is generated, using the intermediate symbol string of the variable part and the intermediate symbol string of the fixed part obtained by the input analysis unit 4 (step 105). If this range is defined as one sentence containing the variable part, rule-synthesized speech is generated for the fixed part 「ノアシタノテンキデス」 in addition to the variable part 「コクブンジ」.

次に、規則合成部７は、録音音声の音響特徴情報４０３を参照して、規則合成音声データを生成する（ステップ１０６）。ここで、基本周波数や音素継続長時間などの規則合成パラメータを、規則合成部７があらかじめ記憶している韻律モデル１３を用いて算出するが、その際、録音音声の音響特徴情報を参照して、規則合成パラメータを修正することにより、録音音声と接続しやすい規則合成音声データを生成することが出来る。 Next, the rule synthesis unit 7 generates rule-synthesized voice data with reference to the acoustic feature information 403 of the recorded voice (step 106). Rule synthesis parameters such as the fundamental frequency and the phoneme duration are calculated using the prosody model 13 stored in advance in the rule synthesis unit 7. By correcting these parameters with reference to the acoustic feature information of the recorded voice, rule-synthesized voice data that connects easily to the recorded voice can be generated.

録音音声の音響特徴情報４０３のうち基本周波数情報を用いて規則合成パラメータを決定する様子を、図５に示す。図５に示すように、録音音声データと生成される規則合成音声データの重複する区間５０１において、録音音声データの音響特徴情報（録音音声の基本周波数パタン）５０２との誤差が小さくなるように、韻律モデル１３を用いて算出された規則合成パラメータ（韻律モデルが設定する基本周波数パタン）５０３を修正し、規則合成音声データの音響特徴情報（修正された基本周波数パタン）５０４を生成する。修正方法として、平行移動および、ダイナミックレンジの拡大や縮小などの操作を使用する。 FIG. 5 shows how the rule synthesis parameters are determined using the fundamental frequency information in the acoustic feature information 403 of the recorded voice. As shown in FIG. 5, in the section 501 where the recorded voice data and the rule-synthesized voice data to be generated overlap, the rule synthesis parameter 503 calculated using the prosody model 13 (the fundamental frequency pattern set by the prosody model) is corrected so that its error from the acoustic feature information 502 of the recorded voice data (the fundamental frequency pattern of the recorded voice) becomes small. This yields the acoustic feature information 504 of the rule-synthesized voice data (the corrected fundamental frequency pattern). The correction uses operations such as shifting the pattern and expanding or compressing its dynamic range.

このように、録音音声データと規則合成音声データとの重複する区間５０１において音響特徴を合わせる操作を行い、同様の操作が録音音声データと重複しない可変部分５０５に対しても行われることにより、可変部分と定型部分の韻律の整合をとることが可能となる。 In this way, the acoustic features are matched in the section 501 where the recorded voice data and the rule-synthesized voice data overlap, and the same operation is also applied to the variable part 505, which does not overlap with the recorded voice data. This makes it possible to keep the prosody of the variable part consistent with that of the fixed part.
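The shift and dynamic-range correction described above can be sketched as fitting a gain and an offset on the overlapping section and then applying them to the whole synthesized contour. This is a minimal sketch under stated assumptions: the least-squares fit, the function names, and the sample values are illustrative choices, and the description does not state how the correction amounts are actually computed.

```python
def fit_shift_and_range(model_f0, recorded_f0):
    """Least-squares gain (range scaling) and offset (shift) so that
    gain * model + offset approximates the recorded contour."""
    n = len(model_f0)
    if n == 0 or n != len(recorded_f0):
        raise ValueError("contours must be non-empty and equally long")
    mean_m = sum(model_f0) / n
    mean_r = sum(recorded_f0) / n
    var_m = sum((m - mean_m) ** 2 for m in model_f0)
    cov = sum((m - mean_m) * (r - mean_r) for m, r in zip(model_f0, recorded_f0))
    gain = cov / var_m if var_m > 0 else 1.0  # flat model contour: shift only
    offset = mean_r - gain * mean_m
    return gain, offset


def correct_contour(model_f0, overlap, recorded_f0):
    """Fit on the overlapping frames, then correct every frame, including
    the variable part that has no recorded counterpart."""
    gain, offset = fit_shift_and_range(model_f0[overlap], recorded_f0)
    return [gain * f + offset for f in model_f0]


# Hypothetical frame-level F0 values in Hz; frames 2..4 overlap the recording.
model = [100.0, 110.0, 120.0, 130.0, 140.0]
recorded_overlap = [150.0, 165.0, 180.0]
corrected = correct_contour(model, slice(2, 5), recorded_overlap)
print(corrected)
```

Applying the same gain and offset outside the overlap is what carries the recorded speaker's pitch level and range over to the variable part.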

音響特徴情報は、基本周波数のみに限らず、音韻継続長時間をあわせて利用することにより、録音音声データと規則合成音声データとの間のリズムの不整合が解消される。また、音響特徴情報として録音音声のスペクトル情報を用いることもでき、音質面でも、録音音声データと規則合成音声データの不連続を解消することができる。 The acoustic feature information is not limited to the fundamental frequency. Using the phoneme duration as well removes the rhythm mismatch between the recorded voice data and the rule-synthesized voice data. Spectrum information of the recorded voice can also be used as acoustic feature information, which removes the discontinuity between the recorded voice data and the rule-synthesized voice data in terms of sound quality as well.

次に、接続境界算出部８は、録音音声データの音響特徴情報５０２と規則合成音声データの音響特徴情報５０４とを用いて、録音音声データと規則合成音声データとの重複区間５０１における、図６に示す接続境界位置６０１を算出する（ステップ１０７）。録音音声データと規則合成音声データとの重複区間５０１における音響特徴情報として、基本周波数が与えられている際の算出方法を、図６を例として説明する。 Next, the connection boundary calculation unit 8 uses the acoustic feature information 502 of the recorded voice data and the acoustic feature information 504 of the rule-synthesized voice data to calculate the connection boundary position 601 shown in FIG. 6 within the overlapping section 501 of the recorded voice data and the rule-synthesized voice data (step 107). The calculation method used when the fundamental frequency is given as the acoustic feature information in the overlapping section 501 is described below, taking FIG. 6 as an example.

まず、音素カテゴリ情報を用いて、無声破裂音の先頭など、語中の無声音区間を、接続境界の候補として選択する。続いて、音素境界候補における、録音音声と規則合成音声の基本周波数の差を算出して、差が小さくなるものを接続境界の候補とする。この時点で、算出された同等な候補が複数ある場合には、規則合成音声データの区間を短くすることを考慮して、接続境界位置６０１を決定する。 First, using phoneme category information, an unvoiced sound segment in a word, such as the head of an unvoiced plosive, is selected as a connection boundary candidate. Subsequently, a difference between the fundamental frequencies of the recorded speech and the rule-synthesized speech in the phoneme boundary candidate is calculated, and a connection boundary candidate is selected if the difference is small. At this time, when there are a plurality of calculated equivalent candidates, the connection boundary position 601 is determined in consideration of shortening the section of the regularly synthesized speech data.
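The three-stage selection described above (unvoiced candidates, small fundamental frequency difference, shortest synthesized section as tie-breaker) can be sketched as follows. The record layout, the category labels, the tolerance used to decide that candidates are "equivalent", and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Boundary:
    time: float          # boundary time in seconds
    category: str        # phoneme category at the boundary (labels assumed)
    recorded_f0: float   # F0 of the recorded voice near the boundary, Hz
    synth_f0: float      # F0 of the rule-synthesized voice near the boundary, Hz
    synth_length: float  # seconds of synthesized speech used if cut here


UNVOICED = {"unvoiced_plosive", "unvoiced_fricative"}


def choose_boundary(boundaries, tolerance_hz=1.0):
    # Stage 1: keep only boundaries in unvoiced segments.
    candidates = [b for b in boundaries if b.category in UNVOICED]
    if not candidates:
        raise ValueError("no unvoiced boundary candidates")
    # Stage 2: keep candidates whose F0 difference is close to the smallest one.
    diffs = [abs(b.recorded_f0 - b.synth_f0) for b in candidates]
    best = min(diffs)
    near_best = [b for b, d in zip(candidates, diffs) if d - best <= tolerance_hz]
    # Stage 3: among equivalent candidates, prefer the shortest synthesized section.
    return min(near_best, key=lambda b: b.synth_length)


# Hypothetical candidates.
boundaries = [
    Boundary(0.40, "vowel", 200.0, 200.0, 0.40),
    Boundary(0.55, "unvoiced_plosive", 180.0, 176.0, 0.55),
    Boundary(0.80, "unvoiced_plosive", 170.0, 166.5, 0.80),
    Boundary(1.10, "unvoiced_fricative", 150.0, 140.0, 1.10),
]
chosen = choose_boundary(boundaries)
print(chosen.time)
```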

音素カテゴリ情報を用いて接続境界の候補を得る際には、無声破裂音の先頭位置が有効であるが、その他の無声音についても、有声音と比較すれば滑らかな接続が可能である。ただし、接続合成部９での接続方法に、クロスフェードを用いることができるときは、有声音中でも滑らかな接続ができる可能性があるため、接続境界の候補の選び方は、無声破裂音の先頭位置に限るものではない。 When connection boundary candidates are obtained from the phoneme category information, the head position of an unvoiced plosive is effective, but other unvoiced sounds also allow a smoother connection than voiced sounds. Moreover, when cross-fading can be used as the connection method in the connection synthesis unit 9, a smooth connection may be possible even within a voiced sound, so the choice of connection boundary candidates is not limited to the head positions of unvoiced plosives.

接続位置を算出するための音響特徴情報として、基本周波数の差を用いる以外にも、音韻継続長の差、スペクトルの差を併せて用いることで、より接続時に滑らかとなる位置を算出することが可能となる。 In addition to using the fundamental frequency difference as the acoustic feature information for calculating the connection position, it is possible to calculate a smoother position at the time of connection by using the phoneme duration difference and the spectrum difference together. It becomes possible.

接続境界算出部８は、上述の例のように、音素カテゴリ情報で候補を絞りこんだ後に、基本周波数の差を計算する順序で、接続境界を算出するだけでなく、下記に示す（数１）の例に示すようなコスト関数を定義して算出することもできる。 In addition to the procedure in the above example, in which candidates are first narrowed down using the phoneme category information and the fundamental frequency difference is then calculated, the connection boundary calculation unit 8 can also calculate the connection boundary by defining a cost function such as the one shown in (Equation 1) below.
$$C(b) = W_p C_p(b) + W_f C_f(b) + W_d C_d(b) + W_s C_s(b) + W_l C_l(b) \tag{1}$$
ここで、音素カテゴリ情報から決定される接続のしにくさを音素カテゴリコストCp(b)として定義し、その重み付けをWpとする。また、音響特徴情報における差も、それぞれ、基本周波数コストCf(b)、音韻継続長コストCd(b)、スペクトルコストCs(b)として定義し、それらの重み付けを、それぞれ、Wf、Wd、Wsとする。さらに、各音素境界位置から、可変部分と定型部分の境界との時刻の差を求め、規則合成音声長コストCl(b)として定義し、その重み付けをWlとする。各コストの重み付け和として、接続境界位置に関するコストC(b)を算出し、最も小さなコストを持つ音素境界を接続境界位置とすることも可能である。 Here, the difficulty of connection determined from the phoneme category information is defined as the phoneme category cost $C_p(b)$, with weight $W_p$. The differences in acoustic feature information are likewise defined as the fundamental frequency cost $C_f(b)$, the phoneme duration cost $C_d(b)$, and the spectrum cost $C_s(b)$, with weights $W_f$, $W_d$, and $W_s$, respectively. In addition, the time difference between each phoneme boundary position and the boundary between the variable part and the fixed part is defined as the rule-synthesized speech length cost $C_l(b)$, with weight $W_l$. The cost $C(b)$ of a connection boundary position is calculated as the weighted sum of these costs, and the phoneme boundary with the smallest cost can be taken as the connection boundary position.
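Selecting the boundary with the cost function of (Equation 1) amounts to a weighted sum followed by an arg-min over the phoneme boundaries. The sketch below assumes that the five sub-costs have already been computed for each candidate; the weight values, the candidate names, and the cost values are hypothetical.

```python
# Keys p, f, d, s, l correspond to the subscripts of (Equation 1).
WEIGHTS = {"p": 1.0, "f": 0.5, "d": 0.5, "s": 0.5, "l": 0.2}  # assumed values


def boundary_cost(costs, weights):
    """C(b) = Wp*Cp(b) + Wf*Cf(b) + Wd*Cd(b) + Ws*Cs(b) + Wl*Cl(b)."""
    return sum(weights[k] * costs[k] for k in weights)


def select_boundary(candidates, weights):
    """Return the name of the phoneme boundary with the smallest total cost."""
    return min(candidates, key=lambda name: boundary_cost(candidates[name], weights))


# Hypothetical sub-costs for three phoneme boundaries in the overlapping section.
candidates = {
    "b1": {"p": 0.0, "f": 4.0, "d": 2.0, "s": 2.0, "l": 1.0},
    "b2": {"p": 1.0, "f": 1.0, "d": 1.0, "s": 1.0, "l": 3.0},
    "b3": {"p": 0.0, "f": 6.0, "d": 0.0, "s": 2.0, "l": 5.0},
}
best = select_boundary(candidates, WEIGHTS)
print(best, boundary_cost(candidates[best], WEIGHTS))
```

Unlike the staged procedure, the weighted sum lets a boundary with a slightly worse phoneme category win if its acoustic differences are much smaller.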

次に、接続合成部９は、接続境界算出部８から得られる接続境界位置を用いて、録音音声データと規則合成音声データを切断し、切断された録音音声データと規則合成音声データを接続することにより、入力テキストに対応する合成音声データを出力する（ステップ１０８）。ここで、接続境界位置は、録音音声データにおける時刻および規則合成音声データにおける時刻として算出し、算出された時刻を用いて音声データの切断および接続を行う。 Next, the connection synthesis unit 9 cuts the recorded voice data and the rule synthesized voice data using the connection boundary position obtained from the connection boundary calculation unit 8, and connects the cut recorded voice data and the rule synthesized voice data. Thus, synthesized speech data corresponding to the input text is output (step 108). Here, the connection boundary position is calculated as the time in the recorded voice data and the time in the rule synthesized voice data, and the voice data is disconnected and connected using the calculated time.

接続合成部９は、切断した音声を接続する際に、単に接続を行うだけでなく、クロスフェード処理を用いて接続部分を目立たなくすることもできる。特に有声部の中間で接続が行われる場合には、基本周波数に同期して、接続境界位置の音声波形の１基本周期分だけクロスフェード処理を行うことで、接続時の異音を解消することができる。ただし、クロスフェード処理を用いて信号が劣化する可能性もあるため、有声部の中間で接続を行うことは避けるように、接続境界位置を決定しておくことが望ましい。 When connecting the cut voice data, the connection synthesis unit 9 can either simply join the pieces or use cross-fade processing to make the joint less noticeable. In particular, when the connection falls in the middle of a voiced part, abnormal noise at the joint can be eliminated by cross-fading over one fundamental period of the speech waveform at the connection boundary position, in synchronization with the fundamental frequency. However, cross-fade processing may degrade the signal, so it is desirable to determine the connection boundary position so that connections in the middle of a voiced part are avoided.
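A cross-fade over one fundamental period can be sketched as below. This is a simplified illustration, not the described implementation: the function name, the linear fade shape, and the sample values are assumptions, the fade length is merely set to one period (sample rate divided by F0), and no alignment to pitch marks is attempted.

```python
def crossfade_join(head, tail, sample_rate, f0_hz):
    """Join two sample lists, cross-fading the last samples of `head` with
    the first samples of `tail` over one fundamental period."""
    if f0_hz <= 0:
        raise ValueError("f0_hz must be positive")
    n = max(1, round(sample_rate / f0_hz))  # samples in one fundamental period
    n = min(n, len(head), len(tail))
    out = list(head[: len(head) - n])
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of the incoming signal rises linearly
        out.append((1.0 - w) * head[len(head) - n + i] + w * tail[i])
    out.extend(tail[n:])
    return out


# Toy signals: a constant 1.0 followed by a constant 0.0 (hypothetical values).
joined = crossfade_join([1.0] * 10, [0.0] * 10, sample_rate=8, f0_hz=2)
print(joined)
```

The caller is assumed to cut `head` so that it extends one period past the boundary and `tail` so that it starts at the boundary, which keeps the total duration unchanged.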

なお、上記実施例では、規則合成音声データを作成する範囲は、可変部分を含む一文と定義した場合について述べたが、一呼気段落、一文のいずれかの単位で生成するようにしてもよい。 In the above-described embodiment, the range in which the rule-synthesized voice data is created is defined as one sentence including a variable part. However, the range may be generated in one unit of one breath paragraph or one sentence.

以上のように第１の実施例では、車載用カーナビゲーションシステム用に構成された、録音音声データと規則合成音声データを接続する音声合成装置において、規則合成音声データの音質と韻律を録音音声データに近づけるとともに、好適な接続境界を算出することにより、自然な合成音声を生成することが可能となる。 As described above, in the first embodiment, a speech synthesizer that is configured for an in-vehicle car navigation system and connects recorded voice data with rule-synthesized voice data can generate natural synthesized speech by bringing the sound quality and prosody of the rule-synthesized voice data close to those of the recorded voice data and by calculating a suitable connection boundary.

（実施例２） (Example 2)
次に、本発明の第２の実施例について説明する。 Next, a second embodiment of the present invention will be described.

第１の実施例は、規則合成音声データを生成した後に決定される接続境界位置を用いて、録音音声データと規則合成音声データを接続するものであるが、接続境界位置の決定後に、規則合成音声データを生成する構成としてもよい。 In the first embodiment, the recorded voice data and the rule-synthesized voice data are connected using a connection boundary position that is determined after the rule-synthesized voice data has been generated. Alternatively, the rule-synthesized voice data may be generated after the connection boundary position has been determined.

図７は、本発明の第２の実施例を示すブロック図である。第２の実施例は、第１の実施例における規則合成部７の代わりに、規則合成パラメータ算出部２１と規則合成音声データ生成部２２とを設けた構成となる。図８は、第２の実施例に係る音声合成装置２０の動作を示すフローチャートである。図７と図８を用いて、第２の実施例に係る音声合成装置２０の動作について説明する。 FIG. 7 is a block diagram showing a second embodiment of the present invention. In the second embodiment, a rule synthesis parameter calculation unit 21 and a rule-synthesized voice data generation unit 22 are provided in place of the rule synthesis unit 7 of the first embodiment. FIG. 8 is a flowchart showing the operation of the speech synthesizer 20 according to the second embodiment. The operation of the speech synthesizer 20 according to the second embodiment will be described with reference to FIG. 7 and FIG. 8.

まず、ナビゲーション制御部３において、音声合成装置２０へ渡す入力テキストを決定する（ステップ２０１）。 First, the navigation control unit 3 determines an input text to be passed to the speech synthesizer 20 (step 201).

次に、入力解析部４において、定型部分の中間記号列と可変部分の中間記号列が決定され（ステップ２０２〜ステップ２０３）、録音音声選択部６で録音音声データと録音音声の音響特徴情報が得られる(ステップ２０４)。続いて、規則合成音声の作成範囲が決定される（ステップ２０５）。ここまでの処理は第１の実施例と同様の方法で行われる。 Next, the input analysis unit 4 determines the intermediate symbol string of the fixed part and the intermediate symbol string of the variable part (steps 202 to 203), and the recorded voice data and the acoustic feature information of the recorded voice are obtained by the recorded voice selection unit 6. Is obtained (step 204). Subsequently, a creation range of the rule synthesized speech is determined (step 205). The processing so far is performed by the same method as in the first embodiment.

次に、規則合成パラメータ算出部２１において、規則合成パラメータを算出し、規則合成音声の音響特徴情報を生成する（ステップ２０６）。ここで第１の実施例では、規則合成部７において規則合成音声データを作成したが、第２の実施例においては、規則合成音声データを作成しない。 Next, the rule synthesis parameter calculation unit 21 calculates a rule synthesis parameter and generates acoustic feature information of the rule synthesis voice (step 206). Here, in the first embodiment, the rule synthesis voice data is created in the rule synthesis unit 7, but in the second embodiment, the rule synthesis voice data is not created.

次に、接続境界算出部８は、録音音声の音響特徴情報と規則合成音声の音響特徴情報とを用いて、録音音声データと規則合成パラメータとの重複区間における、接続境界位置を算出する（ステップ２０７）。本ステップは、第１の実施例と同様の方法で行われる。 Next, the connection boundary calculation unit 8 uses the acoustic feature information of the recorded voice and the acoustic feature information of the rule-synthesized voice to calculate the connection boundary position in the section where the recorded voice data and the rule synthesis parameters overlap (step 207). This step is performed in the same manner as in the first embodiment.

次に、規則合成音声データ生成部２２において、録音音声の音響特徴情報と規則合成音声の音響特徴情報と接続境界算出部８で得られる接続境界位置とを用いて、規則合成音声データを生成する（ステップ２０８）。本ステップは、接続境界位置に録音音声の音響特徴情報を参照して、ステップ２０６で得られた規則合成パラメータを修正し、規則合成音声データを生成するものである。 Next, the rule-synthesized voice data generation unit 22 generates rule-synthesized voice data using the acoustic feature information of the recorded voice, the acoustic feature information of the rule-synthesized voice, and the connection boundary position obtained by the connection boundary calculation unit 8. (Step 208). This step refers to the acoustic feature information of the recorded voice at the connection boundary position, corrects the rule synthesis parameter obtained in step 206, and generates rule synthesis voice data.

例えば、接続境界位置にある音素に対して、音響特徴の差が小さくなるように、規則合成パラメータを修正すると、より接続歪の少ない合成音声が生成されることになる。 For example, when the rule synthesis parameter is modified so that the difference in acoustic characteristics is small with respect to the phoneme at the connection boundary position, a synthesized speech with less connection distortion is generated.

第１の実施例では、可変部分を含む１文として定義された規則合成音声データの範囲と、録音音声データとの重複区間の音響特徴情報を用いて、規則合成パラメータを作成するものであったが、第２の実施例では、接続境界算出部８で得られる接続境界位置における、録音音声の音響特徴情報を用いて、規則合成パラメータを再度修正した上で、規則合成音声データを生成するものである。これにより、接続境界位置を考慮した、より滑らかな接続が行われる。 In the first embodiment, the rule synthesis parameter is created using the range of the rule synthesis voice data defined as one sentence including a variable part and the acoustic feature information of the overlapping section with the recorded voice data. In the second embodiment, however, the rule synthesis parameters are generated again using the acoustic feature information of the recorded voice at the connection boundary position obtained by the connection boundary calculation unit 8, and the rule synthesis voice data is generated. It is. Thereby, a smoother connection in consideration of the connection boundary position is performed.

次に、接続合成部９は、接続境界算出部８から得られる接続境界位置を用いて、録音音声データと規則合成音声データとを切断し、切断された録音音声データと規則合成音声データとを接続することにより、入力テキストに対応する合成音声データを出力する（ステップ２０９）。 Next, using the connection boundary position obtained from the connection boundary calculation unit 8, the connection synthesis unit 9 cuts the recorded voice data and the rule synthesized voice data, and combines the cut recorded voice data and the rule synthesized voice data. By connecting, synthesized speech data corresponding to the input text is output (step 209).

以上のように、第２の実施例では、第１の実施例と異なり、規則合成パラメータの設定を２段階で行う。１段目では、文全体の滑らかな接続を考慮した規則合成パラメータが設定され、２段目では、接続境界算出部８で得られる接続境界位置を考慮して規則合成パラメータが修正される。このようにして、規則合成パラメータを修正することで、録音音声データと規則合成音声データのより自然な接続を可能とする。 As described above, in the second embodiment, unlike the first embodiment, the rule synthesis parameter is set in two stages. In the first stage, rule synthesis parameters are set in consideration of the smooth connection of the entire sentence. In the second stage, the rule synthesis parameters are corrected in consideration of the connection boundary position obtained by the connection boundary calculation unit 8. In this way, by modifying the rule synthesis parameter, a more natural connection between the recorded voice data and the rule synthesized voice data is enabled.

（実施例３） (Example 3)
次に、本発明の第３の実施例について説明する。 Next, a third embodiment of the present invention will be described.

図９は、本発明の第３の実施例に係り、鉄道放送システムに本発明を適用する構成を示すブロック図である。図１０は、第３の実施例に係る音声合成装置３０の動作を示すフローチャートである。 FIG. 9 is a block diagram showing a configuration in which the present invention is applied to a railway broadcast system according to a third embodiment of the present invention. FIG. 10 is a flowchart showing the operation of the speech synthesizer 30 according to the third embodiment.

本実施例は、あらかじめ録音された音片を接続して合成音声を作成する装置において、本発明の実施により可変部分を含む音片を生成する機能を備えた構成となっている。 The present embodiment is a device that creates a synthesized speech by connecting prerecorded sound pieces, and has a function of generating a sound piece including a variable part by implementing the present invention.

入力部３１は、図１１に示すように、文例を選択するための表示手段３３と、選択された文例に従った音片の順序構成の表示手段３４と、可変部分を含む音片においては、テキストの定型部分と可変部分が分かるような表示手段３５を有する入力画面３２と、入力画面３２を見ながら、複数の文例の中から利用者が出力したい文例を選択し、音片の順序構成を編集し、可変部分のテキストをキーボード等で入力するための入力装置３６を備えている。 As shown in FIG. 11, the input unit 31 includes an input screen 32 and an input device 36. The input screen 32 has display means 33 for selecting a sentence example, display means 34 for showing the order of the sound pieces according to the selected sentence example, and, for a sound piece that includes a variable part, display means 35 that distinguishes the fixed part of the text from the variable part. While viewing the input screen 32, the user uses the input device 36 to select the sentence example to be output from among a plurality of sentence examples, to edit the order of the sound pieces, and to enter the text of the variable part with a keyboard or the like.

また、音片情報格納部３５は、図１２に示すような構成で、録音音声格納部５にあらかじめ録音された音声データを、図１３の例に示すように分類しておき、文例を、音片分類コード７０１の組み合わせで表現できるように構成する。また、音片情報格納部３５は、図１３に示すように各録音音声データについて一意に定められた音片コード７０２を格納する。このとき音片コード７０２から音片分類コード７０１が分かるように構成しておく。例として図１３では、音片コード７０２の最上位の桁が音片分類コード７０１の最上位の桁と一致するように構成している。 The sound piece information storage unit 35 is configured as shown in FIG. 12 and classifies the sound data recorded in advance in the recorded sound storage unit 5 as shown in the example of FIG. It is configured so that it can be expressed by a combination of the piece classification codes 701. Also, the sound piece information storage unit 35 stores a sound piece code 702 uniquely determined for each recorded sound data as shown in FIG. At this time, the sound piece classification code 701 is determined from the sound piece code 702. For example, in FIG. 13, the highest digit of the sound piece code 702 is configured to match the highest digit of the sound piece classification code 701.
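The relation between sound piece codes and sound piece classification codes can be sketched as a check on the most significant digit. The sketch assumes three-digit string codes and assumes that a classification code is the leading digit padded with zeros, as in the "200" example; the codes "101" and "301" are hypothetical.

```python
def classification_code(piece_code: str) -> str:
    """Derive the classification code from a sound piece code by keeping the
    most significant digit and zero-filling the rest (assumed convention)."""
    return piece_code[0] + "0" * (len(piece_code) - 1)


def pieces_in_class(class_code: str, piece_codes):
    """All sound piece codes whose most significant digit matches the class."""
    return [code for code in piece_codes if code[0] == class_code[0]]


stored_codes = ["101", "201", "202", "203", "301"]
print(classification_code("203"), pieces_in_class("200", stored_codes))
```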

以下、第３の実施例の動作について説明する。 The operation of the third embodiment will be described below.

入力部３１では、文例を選択することによって、音片の構成を決定する（ステップ３０１）。ここで、音片の順序構成において、音片コードが指定されている場合は、固定の音片を利用し、音片分類コードが指定されている場合は、該当する音片を、本発明の音声合成方法によって生成することができる。例えば、図１３の例に示す音片情報が格納されており、入力部で、音片分類コード「２００」が設定されると、入力画面には、可変部分のテキストを入力するための領域と、表示データ７０３として、定型部分の「行きがまいります」が表示される。 In the input unit 31, the configuration of the sound pieces is determined by selecting a sentence example (step 301). In the order of sound pieces, a fixed sound piece is used where a sound piece code is specified, and where a sound piece classification code is specified, the corresponding sound piece can be generated by the speech synthesis method of the present invention. For example, when the sound piece information shown in FIG. 13 is stored and the sound piece classification code "200" is set in the input unit, the input screen shows an area for entering the text of the variable part and, as display data 703, the fixed part 「行きがまいります」 ("the train bound for ... is arriving").

続いて、可変部分のテキストをキーボードから入力し、可変部分のテキストを決定する（ステップ３０２）。例えば、可変部分のテキストとして、「原宿」と入力されると、定型部分と組み合わせた「原宿行きがまいります」を、音片として生成する。 Subsequently, the text of the variable part is entered from the keyboard and thereby determined (step 302). For example, when 「原宿」 (Harajuku) is entered as the text of the variable part, it is combined with the fixed part and 「原宿行きがまいります」 ("the train bound for Harajuku is arriving") is generated as a sound piece.

入力解析部４は、入力部３１で指定した可変部分を含む音片を作成するために、音片分類コード７０１と対応する定型部分の中間記号列７０４を取得する。また、入力部から得られる可変部分のテキストを言語処理により中間記号列に変換し、可変部分の中間記号列を決定する（ステップ３０３）。このステップにより、可変部分のテキストが「原宿」である場合、可変部分の中間記号列「ハラジュク」が得られる。 To create the sound piece containing the variable part specified in the input unit 31, the input analysis unit 4 acquires the intermediate symbol string 704 of the fixed part corresponding to the sound piece classification code 701. It also converts the text of the variable part obtained from the input unit into an intermediate symbol string by language processing, and thereby determines the intermediate symbol string of the variable part (step 303). When the text of the variable part is 「原宿」, this step yields the intermediate symbol string 「ハラジュク」 (harajuku) for the variable part.

次に、録音音声選択部６は、可変部分の入力に従って、同じ定型部分をもつ複数の録音音声の中から適切な録音音声を選択する。ここで、定型部分と可変部分を含めた中間記号列と、録音音声に対応する中間記号列を比較し、最も長く中間記号列が一致するものを選択する（ステップ３０４）。このようにすると、録音音声と規則合成音声の接続境界位置は、定型部分の中に決定されるだけでなく、場合によっては可変部分の中に決定することも可能となり、より高品質な合成音声を生成することができる。 Next, the recorded voice selection unit 6 selects an appropriate recorded voice from a plurality of recorded voices having the same fixed part according to the input of the variable part. Here, the intermediate symbol string including the fixed part and the variable part is compared with the intermediate symbol string corresponding to the recorded voice, and the one that matches the longest intermediate symbol string is selected (step 304). In this way, the connection boundary position between the recorded voice and the regular synthesized voice can be determined not only in the fixed part but also in the variable part depending on the case, so that a higher quality synthesized voice can be obtained. Can be generated.
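The selection in step 304 can be sketched as choosing the recording whose intermediate symbol string shares the longest run with the target string. The sketch below compares from the end of the strings, which is an assumption that fits this example because the fixed part comes last; the strings for sound piece codes "202" and "203" are hypothetical, since only the string for "201" is given in the description.

```python
def common_suffix_length(a: str, b: str) -> int:
    """Number of trailing characters that two strings have in common."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def select_recording(target: str, recordings: dict) -> str:
    """Return the sound piece code whose symbol string matches the target longest."""
    return max(recordings, key=lambda code: common_suffix_length(target, recordings[code]))


recordings = {
    "201": "シンジュクユキガマイリマス",
    "202": "トウキョウユキガマイリマス",  # hypothetical entry
    "203": "シナガワユキガマイリマス",    # hypothetical entry
}
target = "ハラジュクユキガマイリマス"
selected = select_recording(target, recordings)
print(selected, common_suffix_length(target, recordings[selected]))
```

Here the match with "201" extends past the fixed part into the variable part (「ジュク」), which is what allows the connection boundary to fall inside the variable part.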

次に、規則合成部７は、入力解析部４で得られる可変部分の中間記号列と定型部分の中間記号列を用いて、規則合成音声を生成する範囲を決定する（ステップ３０５)。ここで、規則合成音声を作成する範囲は、可変部分を含む一音片と定義しておくと、可変部分の「ハラジュク」に加えて、定型部分「ユキガマイリマス」を含めて規則合成音声を生成する。 Next, the rule synthesis unit 7 determines the range over which rule-synthesized speech is generated, using the intermediate symbol string of the variable part and the intermediate symbol string of the fixed part obtained by the input analysis unit 4 (step 305). If this range is defined as one sound piece containing the variable part, rule-synthesized speech is generated for the fixed part 「ユキガマイリマス」 (yuki ga mairimasu) in addition to the variable part 「ハラジュク」.

次に、接続境界算出部８は、録音音声の音響特徴情報と規則合成音声の音響特徴情報を用いて、録音音声データと規則合成音声データの重複区間における、接続境界位置を算出する（ステップ３０６）。 Next, the connection boundary calculation unit 8 uses the acoustic feature information of the recorded voice and the acoustic feature information of the rule-synthesized voice to calculate the connection boundary position in the overlapping section of the recorded voice data and the rule-synthesized voice data (step 306).

このステップ３０６は、第１の実施例のステップ１０６と同様であるが、録音音声と規則合成音声の接続境界位置は、定型部分の中に決定されるだけでなく、場合によっては可変部分の中に決定することも可能となる。可変部分の中に、接続境界位置が決定される例を、図１４に示す。図１３に示すような音片情報に対応した録音音声が録音音声格納部５に格納されており、定型部分として、音片分類コード「２００」が指定されると、音片コード「２０１」、「２０２」、「２０３」の録音音声が選択の対象となる。ここで、可変部分の中間記号列が「ハラジュク」である場合、定型部分と組み合わせた中間記号列「ハラジュクユキガマイリマス」と、各録音音声の中間記号列を比較すると、音片コード「２０１」の「シンジュクユキガマイリマス」が選択される。 This step 306 is the same as step 106 of the first embodiment, except that the connection boundary position between the recorded voice and the rule-synthesized voice is not necessarily determined within the fixed part and may in some cases be determined within the variable part. FIG. 14 shows an example in which the connection boundary position is determined within the variable part. When recorded voices corresponding to the sound piece information shown in FIG. 13 are stored in the recorded voice storage unit 5 and the sound piece classification code "200" is specified as the fixed part, the recorded voices with sound piece codes "201", "202", and "203" become candidates for selection. If the intermediate symbol string of the variable part is 「ハラジュク」, comparing the combined intermediate symbol string 「ハラジュクユキガマイリマス」 (harajuku yuki ga mairimasu) with the intermediate symbol string of each recorded voice selects 「シンジュクユキガマイリマス」 (shinjuku yuki ga mairimasu), which has sound piece code "201".

このようにすると、録音音声と規則合成音声の重複区間８０１は、「ジュクユキガマイリマス」に対応する区間となり、あらかじめ指定された定型部分８０３のみならず、可変部分８０２の一部である「ジュク」の部分に関しても録音音声を利用することができるようになり、接続境界位置８０４を可変部分８０２の中に決定することが可能となる。 In this case, the overlapping section 801 of the recorded voice and the rule-synthesized voice is the section corresponding to 「ジュクユキガマイリマス」 (juku yuki ga mairimasu). The recorded voice can therefore be used not only for the fixed part 803 specified in advance but also for 「ジュク」, which is a portion of the variable part 802, and the connection boundary position 804 can be determined within the variable part 802.

次に、接続合成部９は、接続境界算出部８から得られる接続境界位置を用いて、録音音声データと規則合成音声データを切断し、切断された録音音声データと規則合成音声データを接続することにより、可変部分を含む音片に対応する合成音声データを作成する（ステップ３０７）。ここで、接続境界位置は、録音音声データにおける時刻および規則合成音声データにおける時刻として算出し、算出された時刻を用いて音声データの切断および接続を行う。このステップは、第１の実施例のステップ１０７と同様であるが、音声データをスピーカから出力する処理は、次の音片接続部３６が行う。 Next, the connection synthesis unit 9 cuts the recorded voice data and the rule synthesized voice data using the connection boundary position obtained from the connection boundary calculation unit 8, and connects the cut recorded voice data and the rule synthesized voice data. Thus, synthesized speech data corresponding to the sound piece including the variable portion is created (step 307). Here, the connection boundary position is calculated as the time in the recorded voice data and the time in the rule synthesized voice data, and the voice data is disconnected and connected using the calculated time. This step is the same as step 107 in the first embodiment, but the process of outputting audio data from the speaker is performed by the next sound piece connecting unit 36.

音片接続部３６は、入力部から得られる音片の順序に基づいて、音片を接続して出力音声を生成する（ステップ３０８）。ここで、可変部分を含む音片は、接続合成部から得られる合成音声を用いる。 The sound piece connecting unit 36 connects the sound pieces based on the order of the sound pieces obtained from the input unit, and generates output sound (step 308). Here, the synthesized speech obtained from the connection synthesis unit is used as the sound piece including the variable part.

このようにして、録音された音片を接続して合成音声を作成する装置において、規則合成音声を用いた音片を用いて、自然な接続の合成音声を出力することができる。 In this way, in a device that creates a synthesized speech by connecting recorded sound pieces, a synthesized speech with a natural connection can be output using the sound pieces using the regular synthesized speech.

以上のように第３の実施例では、鉄道放送システムに本発明を適用した場合、あらかじめ録音された音片を接続して合成音声を作成する装置において、可変部分を含む音片を生成する機能を備え、高品質な音声を出力することができる。 As described above, in the third embodiment, when the present invention is applied to a railway broadcasting system, a function for generating a sound piece including a variable part in an apparatus for creating a synthesized voice by connecting sound pieces recorded in advance. It is possible to output high-quality sound.

以上詳述したように、本発明によれば、予め格納された録音された音声データと、規則合成により生成された音声データとが重複する区間の音響特徴情報にもとづいて、録音音声と規則合成音声との間の音質および韻律の連続性を考慮した接続境界を選択し、自然な合成音声を生成することが可能となる。また、規則合成作成手段は、重複する区間の音響特徴情報を目標として規則合成音声を作成することにより、規則合成音声の音質と韻律が録音音声に近づき、自然な合成音声を生成することが可能となる。 As described in detail above, according to the present invention, a connection boundary that takes into account the continuity of sound quality and prosody between the recorded voice and the rule-synthesized voice is selected based on the acoustic feature information of the section where the pre-stored recorded voice data and the voice data generated by rule synthesis overlap, so that natural synthesized speech can be generated. In addition, because the rule synthesis means creates the rule-synthesized voice with the acoustic feature information of the overlapping section as its target, the sound quality and prosody of the rule-synthesized voice approach those of the recorded voice, which also contributes to natural synthesized speech.

本発明は、車載用カーナビゲーションシステムや鉄道放送システムへの適用が好適であるが、テキストを音声出力する音声案内システム一般に適用可能である。 The present invention is preferably applied to an in-vehicle car navigation system and a railroad broadcasting system, but can be applied to a general voice guidance system that outputs text as a voice.

１…音声合成装置、２…ナビゲーション制御装置、３…ナビゲーション制御部、４…入力解析部、５…録音音声格納部、６…録音音声選択部、７…規則合成部、８…接続境界算出部、９…接続合成部、１０…情報受信部、１１…ＧＰＳ、１２…ナビゲーション用データ記憶部、１３…韻律モデル、２０…音声合成装置、２１…規則合成パラメータ算出部、２２…規則合成音声データ生成部、３０…音声合成装置、３１…入力部、３５…音片情報格納部、３６…音片接続部。 DESCRIPTION OF SYMBOLS 1 ... Speech synthesizer, 2 ... Navigation control device, 3 ... Navigation control part, 4 ... Input analysis part, 5 ... Recording voice storage part, 6 ... Recording voice selection part, 7 ... Rule synthesis part, 8 ... Connection boundary calculation part , 9 ... connection synthesis unit, 10 ... information reception unit, 11 ... GPS, 12 ... navigation data storage unit, 13 ... prosody model, 20 ... speech synthesis device, 21 ... rule synthesis parameter calculation unit, 22 ... rule synthesis voice data Generating unit, 30 ... speech synthesizer, 31 ... input unit, 35 ... sound piece information storage unit, 36 ... sound piece connecting unit.

Claims

In a speech synthesizer that synthesizes text consisting of a fixed part and a variable part,
Recorded voice storage means for storing in advance first voice data that is created from recorded speech and includes the fixed part;
Rule synthesis means for generating, from the received text, second voice data including the variable part and at least a part of the fixed part;
Connection boundary calculation means for selecting the position of a connection boundary between the recorded voice data and the voice data generated by rule synthesis, based on acoustic feature information corresponding to the text, the acoustic feature information being at least one of a phoneme category, a fundamental frequency, a phoneme duration, a power, and a spectrum in a section where the first voice data and the second voice data overlap; and
Connection synthesis means for synthesizing voice data of the text by connecting third voice data obtained by dividing the first voice data at the connection boundary and fourth voice data obtained by dividing the second voice data at the connection boundary,
The speech synthesizer characterized in that the connection boundary calculation means selects the position of the connection boundary from a plurality of phoneme boundaries included in the section where the first voice data and the second voice data overlap.

The speech synthesizer according to claim 1, wherein the rule synthesis means uses the acoustic feature information of the first voice data in the section, corresponding to the text, where the first voice data and the second voice data overlap, to generate second voice data that matches the first voice data.

The speech synthesizer according to claim 1, wherein the rule synthesis means processes the second voice data based on acoustic feature information of the first voice data and the second voice data at the connection boundary position obtained from the connection boundary calculation means.

The speech synthesizer according to claim 1, wherein the rule synthesis means generates the second voice data for the variable part and the fixed part preceding or following the variable part, in a unit of one breath group, one sentence, or the entire fixed part.

The speech synthesizer according to claim 1 or 2, wherein the recorded voice storage means stores, as the first voice data, voice data recorded in advance in units of one breath group or one sentence, including the fixed part and at least a part other than the fixed part.

The speech synthesizer according to claim 1 or 2, wherein the connection boundary position is calculated as a time in the first voice data and a time in the second voice data, and the voice data is cut and connected using the calculated times.

3. The speech synthesis apparatus according to claim 1, further comprising means for outputting the voice data synthesized by the connection synthesis means.

In a speech synthesizer that synthesizes text consisting of a fixed part and a variable part,
A recorded voice storage unit that stores the recorded voice data including the fixed portion recorded in advance;
A rule synthesizing unit that generates rule-synthesized speech data including the variable part and at least a part of the fixed part from the received text;
A connection boundary calculation unit that calculates a connection boundary position in a section where the recorded voice data and the rule-synthesized speech data overlap, based on acoustic feature information of the recorded voice data and the rule-synthesized speech data corresponding to the text, the acoustic feature information being at least one of a phoneme category, a fundamental frequency, a phoneme duration, a power, and a spectrum;
A connection synthesis unit that connects the recorded voice data and the rule synthesized voice data cut out by dividing at the connection boundary position, and generates synthesized voice data corresponding to the text;
The speech synthesis apparatus, wherein the connection boundary calculation unit selects the connection boundary position from a plurality of phoneme boundaries included in a section where the recorded speech data and the rule-synthesized speech data overlap.

The speech synthesizer according to claim 8, wherein the rule synthesizing means uses the acoustic feature information of the recorded voice data in the section, corresponding to the text, where the recorded voice data and the rule-synthesized voice data overlap, to generate rule-synthesized voice data that matches the recorded voice data.

The speech synthesizer according to claim 8, wherein the rule synthesizing means processes the rule-synthesized voice data based on acoustic feature information of the recorded voice data and the rule-synthesized voice data at the connection boundary position obtained from the connection boundary calculating means.

The speech synthesizer according to claim 8, wherein the rule synthesizing unit generates the second audio data for the variable part and the fixed part preceding or following the variable part, in a unit of one breath group, one sentence, or the entire fixed part.

The speech synthesizer according to claim 8 or 9, wherein the recorded voice storage means stores, as the recorded voice data, voice data recorded in advance in units of one breath group or one sentence, including the fixed part and at least a part other than the fixed part.

10. The connection boundary position is calculated as a time in the recorded voice data and a time in the rule synthesized voice data, and the voice data is disconnected and connected using the calculated time. The speech synthesizer described.

10. The speech synthesizer according to claim 8, further comprising means for outputting the synthesized speech data generated by the connection synthesis means.

In a speech synthesizer that synthesizes text consisting of a fixed part and a variable part,
A recorded voice storage unit that stores the recorded voice data including the fixed portion recorded in advance;
A rule synthesis parameter calculation unit that calculates, from the received text, a rule synthesis parameter including the variable part and at least a part of the fixed part, and generates acoustic feature information of the rule-synthesized speech;
A connection boundary calculation unit that calculates a connection boundary position in a section where the recorded voice data and the rule synthesis parameter overlap, using the acoustic feature information of the recorded speech and the acoustic feature information of the rule-synthesized speech;
A rule-synthesized speech data generation unit that generates rule-synthesized speech data using the acoustic feature information of the recorded speech, the acoustic feature information of the rule-synthesized speech, and the connection boundary position;
A connection synthesizer that connects the recorded voice data and the rule-synthesized voice data cut out at the connection boundary position and outputs synthesized voice data corresponding to the text;
Means for outputting the synthesized voice data,
The connection boundary calculation unit selects the connection boundary position from a plurality of phoneme boundaries included in a section where the recorded voice data and the rule synthesis parameter overlap,
The speech synthesizer characterized in that the acoustic feature information is at least one of a phoneme category, a fundamental frequency, a phoneme duration, power, and a spectrum.

In an apparatus for creating a synthesized speech by connecting a pre-recorded sound piece including a variable part and a sound piece including a fixed part,
A recorded voice storage unit for storing voice data composed of the previously recorded pieces;
An input analysis unit that creates, from the received input text, an intermediate symbol string of the variable part and an intermediate symbol string of the fixed part;
A recorded voice selection unit that selects, according to the input of the variable part, appropriate recorded voice data from a plurality of recorded voice data having the same fixed part;
A rule synthesizing unit that determines a range for generating rule-synthesized voice data, using the intermediate symbol string of the variable-part sound piece and the intermediate symbol string of the fixed-part sound piece obtained by the input analysis unit;
A connection boundary calculation unit that calculates a connection boundary position in an overlapping section between the recorded voice data and the rule-synthesized voice data, using the acoustic feature information of the recorded voice data and the acoustic feature information of the rule-synthesized voice data;
A connection synthesis unit that cuts the recorded voice data and the rule-synthesized voice data at the connection boundary position obtained from the connection boundary calculation unit, and connects the cut recorded voice data and rule-synthesized voice data to create synthesized voice data corresponding to the sound piece including the variable part; and
A sound piece connection unit that connects the sound pieces, based on the order of the sound pieces obtained from the input text, to generate output speech,
The connection boundary calculation unit selects the connection boundary position from a plurality of phoneme boundaries included in an overlapping section of the recorded voice data and the rule synthesized voice data,
The speech synthesizer characterized in that the acoustic feature information is at least one of a phoneme category, a fundamental frequency, a phoneme duration, power, and a spectrum.

In a speech synthesis method that synthesizes text consisting of a fixed part and a variable part,
Storing in advance recorded voice data including the fixed part,
Generating rule-synthesized speech data including the variable part and at least a part of the fixed part from the received text,
Selecting a connection boundary position from a plurality of phoneme boundaries included in a section where the recorded voice data and the rule-synthesized speech data overlap, based on acoustic feature information of the recorded voice data and the rule-synthesized speech data corresponding to the text, the acoustic feature information being at least one of a phoneme category, a fundamental frequency, a phoneme duration, a power, and a spectrum;
A speech synthesizing method comprising: connecting the recorded speech data segmented and cut out at the connection boundary position and the rule synthesized speech data to generate synthesized speech data corresponding to the text.
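The final cut-and-connect step of the method can be sketched as follows. The sketch assumes the connection boundary is given as one time in the recorded voice data and one time in the rule-synthesized speech data, as the dependent claims describe, and that waveforms are plain lists of samples at a shared sample rate. The function name `connect_at_boundary`, its parameters, and the hard cut without smoothing are illustrative assumptions, not the implementation described in the specification.

```python
def connect_at_boundary(recorded, synthesized, rec_time_s, syn_time_s,
                        sample_rate, recorded_first=True):
    """Cut both waveforms at their boundary times and join the pieces.

    recorded, synthesized: lists of samples at the same sample rate.
    rec_time_s, syn_time_s: boundary position expressed as a time in each
        waveform (the same phoneme boundary, located in both).
    recorded_first: True when the fixed (recorded) part precedes the
        variable part, False when it follows it.
    """
    rec_idx = round(rec_time_s * sample_rate)
    syn_idx = round(syn_time_s * sample_rate)
    if not 0 <= rec_idx <= len(recorded):
        raise ValueError("boundary time lies outside the recorded data")
    if not 0 <= syn_idx <= len(synthesized):
        raise ValueError("boundary time lies outside the synthesized data")
    if recorded_first:
        # Recorded speech up to the boundary, synthesized speech after it.
        return recorded[:rec_idx] + synthesized[syn_idx:]
    # Synthesized speech up to the boundary, recorded speech after it.
    return synthesized[:syn_idx] + recorded[rec_idx:]
```

Because the boundary is expressed separately for each waveform, the two signals do not need to be time-aligned in advance; only the same phoneme boundary has to be located in both.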