JP2005321520A - Voice synthesizer and its program

Info

Publication number: JP2005321520A
Application number: JP2004138533A
Authority: JP
Inventors: Takahiro Otsuka; 貴弘大塚; Koichi Tanigaki; 宏一谷垣; Yasushi Ishikawa; 泰石川; Yoichi Fujii; 洋一藤井; Katsushi Suzuki; 克志鈴木
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2004-05-07
Filing date: 2004-05-07
Publication date: 2005-11-17
Anticipated expiration: 2024-05-07
Also published as: JP4525162B2

Abstract

PROBLEM TO BE SOLVED: To improve the naturalness of speech information in a speech synthesizer that generates the speech information of a synthesized sentence, made by replacing the variable part of a reference sentence with a replacement phrase, by combining speech information obtained by rule-synthesizing the prosodic information of the replacement phrase with the speech information of the reference sentence, and that reads out the synthesized sentence using the generated speech information.

SOLUTION: The speech synthesizer is provided with a rule synthesizing means 4 which, when prescribed conditions are met, takes either the part of the synthesized sentence preceding the replacement phrase or the part following it as a rule synthesis extension part, and rule-synthesizes the speech information of the rule synthesis extension part together with the speech information of the replacement phrase.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a speech synthesizer that performs speech synthesis based on a sentence, and to a program therefor. In particular, it relates to a technique for synthesizing speech for a synthesized sentence formed by combining a reference sentence having a replaceable variable part with a phrase that replaces the variable part.

Conventional speech synthesis techniques for a synthesized sentence generated by combining the fixed part of a reference sentence with a replacement phrase include one that adjusts the prosody of the variable part and also modifies the prosody of the fixed part with reference to the prosody of the variable part (for example, Patent Document 1).

Patent Document 1: JP 11-38989 A

With conventional speech synthesis techniques for such synthesized sentences, the prosodic information of the replacement phrase is adjusted at the prosodic junction between the fixed part and the replacement phrase, and the F0 (fundamental frequency) information of the fixed part is smoothed. With this method, however, depending on the content of the variable part and the replacement phrase, the junction is not smooth and the naturalness of the prosody deteriorates. The present invention has been made to solve this problem. Its object is to generate natural prosody without degradation even when the variable part becomes diverse, and so to make the synthesized speech easy to listen to.

The speech synthesizer according to the present invention generates the speech information of a synthesized sentence, formed by replacing a variable part of a reference sentence with a replacement phrase, by combining speech information obtained by rule-synthesizing the prosodic information of the replacement phrase with the speech information of the reference sentence; smooths, within the speech information of the generated synthesized sentence, the speech information obtained by the rule synthesis by increasing or decreasing it by a predetermined smoothing amount; and synthesizes speech for the synthesized sentence using the smoothed speech information. The speech synthesizer comprises
extended smoothing means for further smoothing the speech information of the synthesized sentence by also increasing or decreasing, by the smoothing amount, the speech information of the part of the synthesized sentence that follows the replacement phrase.

Such a speech synthesizer may also be configured by combining a general-purpose computer with a speech synthesis program that causes the computer to perform the speech synthesis processing. That is, the speech synthesis program according to the present invention causes a computer to execute processing that generates the speech information of a synthesized sentence, formed by replacing a variable part of a reference sentence with a replacement phrase, by combining speech information obtained by rule-synthesizing the prosodic information of the replacement phrase with the speech information of the reference sentence; smooths, within the speech information of the generated synthesized sentence, the speech information obtained by the rule synthesis by increasing or decreasing it by a predetermined smoothing amount; and synthesizes speech for the synthesized sentence using the smoothed speech information. The program further causes the computer to execute
extended smoothing means for further smoothing the speech information of the synthesized sentence by also increasing or decreasing, by the smoothing amount, the speech information of the part of the synthesized sentence that follows the replacement phrase.

In this way, the speech synthesizer and program according to the present invention extend the rule synthesis range of the speech information in the synthesized sentence as appropriate, based on the relationship between the reference sentence and the replacement phrase, and adjust the rule-synthesized speech information. Natural prosody can therefore be generated at the junction between the fixed part and the rule synthesis range, and the synthesized speech becomes easy to listen to.

Next, embodiments of the present invention will be described with reference to the drawings.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a speech synthesizer according to Embodiment 1 of the present invention. The speech synthesizer 1 in the figure is the speech synthesizer according to Embodiment 1 and comprises a synthesized sentence generation unit 2 as synthesized sentence generation means, a phrase similarity determination unit 3 as phrase similarity determination means, a rule synthesis unit 4 as rule synthesis means, a local smoothing unit 5 as local smoothing means, and a speech generation unit 6 as speech generation means.

The synthesized sentence generation unit 2 generates a synthesized sentence by replacing part of a reference sentence with a replacement phrase, and then generates the prosodic information of the generated synthesized sentence according to rules. In this and the following description, the word "unit" refers to a dedicated element or circuit configured to provide the stated function. A unit may instead be configured by combining a general-purpose central processing unit (CPU) of a computer with a computer program that causes the CPU to execute equivalent processing.

A reference sentence handled by the synthesized sentence generation unit 2 is a sentence composed of fixed text and a variable part, for example 「近くの［駅］付近の地図を表示します」 ("Display a map of the area around the nearby [station]"). In this example, ［駅］ ([station]) is the variable part. A provisional phrase is assigned to the variable part, but the part is expected to be replaced with another phrase later, when a synthesized sentence is generated. By contrast, 「近くの」 ("nearby") and 「付近の地図を表示します」 ("display a map of the area around") are called fixed parts. Replacing the variable part of this reference sentence with a replacement phrase such as 「コンビニ」 ("convenience store") or 「ガソリンスタンド」 ("gas station") yields synthesized sentences such as 「近くのコンビニ付近の地図を表示します」 or 「近くのガソリンスタンド付近の地図を表示します」.

The purpose of the synthesized sentence generation unit 2 is to generate a synthesized sentence by character string manipulation and to generate the prosodic information of that sentence by rule. Such functions can readily be built from known techniques. The term "synthesized sentence generation unit 2" therefore covers every configuration that achieves this effect and is not limited to a specific one. For convenience of explanation, however, the following assumes, as one example whose details are shown in FIG. 2, that the unit comprises a replacement phrase acquisition unit 21 that acquires a replacement phrase input from outside the speech synthesizer 1, a reference sentence storage unit 22 that stores reference sentences, a character string operation unit 23, and a rule speech information generation unit 24.

Here, the replacement phrase acquisition unit 21 acquires, for example through a keyboard, the replacement phrase to be fitted into the variable part of the reference sentence. The reference sentence storage unit 22 stores information on reference sentences using a storage element or circuit, or a storage medium such as a CD-ROM or hard disk drive. The character string operation unit 23 identifies the position of the variable part in the reference sentence text stored in the reference sentence storage unit 22, for example by string analysis, and generates a character string in which the replacement phrase is substituted at that position.

FIG. 3 shows an example of the reference sentence data stored by the reference sentence storage unit 22 as information on reference sentences. As shown in the figure, the reference sentence data 30 comprises a plurality of records such as record 31 and record 32. As FIG. 3 shows for record 31, each record has the following fields (record items): ID 301, reference sentence text 302, prosodic information 303, F0 information 304, and time length information 305.

ID 301 is an identifier that uniquely identifies record 31. The reference sentence text 302 is the character string that record 31 represents. In this example, the string 「近くの［駅］付近の地図を表示します。」 is stored in the reference sentence text 302 field. The part enclosed by '［' and '］' is the variable part, that is, the substring to be replaced with a replacement phrase. The remaining substrings, 「近くの」 and 「付近の地図を表示します。」, are called fixed parts. The symbols '［' and '］' that mark the variable part may be other symbols or characters, including control characters. Alternatively, the reference sentence text 302 may be stored as plain text without start and end markers (for example, 「近くの駅付近の地図を表示します。」), with the start position of the variable part (the fourth character from the beginning) and its end position (also the fourth character from the beginning, since 「駅」 is a single character) held in separate fields of record 31.
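The two ways of marking the variable part described above can be converted into one another. The following is a minimal Python sketch, not part of the disclosed apparatus. The function name `split_variable_part` and its return layout are assumptions made for illustration. It derives the plain text and the 1-based start and end positions from the bracketed form.

```python
def split_variable_part(text, open_mark="［", close_mark="］"):
    """Convert bracketed reference text into (plain text, start, end).

    start and end are 1-based character positions of the variable part
    in the plain text, matching the counting used in the description.
    """
    open_idx = text.index(open_mark)               # 0-based index of the opening marker
    close_idx = text.index(close_mark, open_idx)   # 0-based index of the closing marker
    phrase = text[open_idx + 1:close_idx]          # phrase currently in the variable part
    if not phrase:
        raise ValueError("variable part is empty")
    plain = text[:open_idx] + phrase + text[close_idx + 1:]
    start = open_idx + 1                           # 1-based position of the first character
    end = open_idx + len(phrase)                   # 1-based position of the last character
    return plain, start, end


plain, start, end = split_variable_part("近くの［駅］付近の地図を表示します。")
print(plain, start, end)
```

For the example record, the start and end positions both come out as 4, in agreement with the positions stated in the text.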

The prosodic information 303 is a field that holds the prosodic information obtained when the reference sentence text 302 is read aloud. In this example it stores the number of morae and the accent type. The F0 (read "F zero") information 304 is a field that stores an array of F0 (fundamental frequency) values obtained when the character string in the reference sentence text 302 is read aloud, without reading the symbols that mark the variable part. The F0 information 304 holds the values for the reference sentence text 302 read as a whole, with no distinction between fixed and variable parts. In natural prosody, the speech information of the fixed parts before and after the variable part changes depending on which phrase actually occupies the variable part. The speech information (prosodic information, F0 information, time length information, and so on) for the reference sentence text 302 with 「駅」 assigned to the variable part therefore often differs from that with another phrase assigned.

The variable part of the reference sentence text 302 could instead be written abstractly, without assigning a concrete phrase such as 「駅」, for example as 「近くの＊付近の地図を表示します。」, where ＊ marks the variable part and is never read aloud. However, people do not read out a sentence like that in natural conversation or daily life, so such an abstract representation cannot supply the natural speech information on which the synthesis is based. For this reason, Embodiment 1 specifies a sentence that actually exists as the reference sentence and marks part of it as the variable part.

The time length information 305 is a field that stores the duration of each phoneme as a numerical value. Record 32 and subsequent records likewise have fields corresponding to ID 301, reference sentence text 302, prosodic information 303, F0 information 304, and time length information 305.
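As an illustration of the record layout described above, the following Python sketch models one record of the reference sentence data 30. The class name, field names, and the dictionary keyed by ID are assumptions. Only the ID value and the reference text come from the description; the prosody values are placeholders, and the F0 and duration arrays are left empty because the text gives no values for them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReferenceRecord:
    """One record of the reference sentence data 30 (field names are illustrative)."""
    record_id: str                    # ID 301
    text: str                         # reference sentence text 302, variable part in ［ ］
    mora_count: int                   # prosodic information 303: number of morae
    accent_type: int                  # prosodic information 303: accent type
    f0: List[float] = field(default_factory=list)         # F0 information 304
    durations: List[float] = field(default_factory=list)  # time length information 305


# Placeholder prosody values (0); real values would come from recorded speech.
record_31 = ReferenceRecord(record_id="001",
                            text="近くの［駅］付近の地図を表示します。",
                            mora_count=0, accent_type=0)

# Hypothetical in-memory stand-in for the reference sentence storage unit 22.
reference_data = {record_31.record_id: record_31}
```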

The description of the configuration of the speech synthesizer 1 in FIG. 1 now continues. The phrase similarity determination unit 3 determines the similarity between the replacement phrase and the phrase originally in the variable part of the reference sentence. Here, the similarity it judges is whether the two phrases are prosodically similar. The rule synthesis unit 4 rule-synthesizes speech information for some of the phrases in the synthesized sentence. Which part is rule-synthesized is decided from the determination result of the phrase similarity determination unit 3.

The local smoothing unit 5 performs smoothing so that the speech information in the range rule-synthesized by the rule synthesis unit 4 connects smoothly with the speech information of the rest of the synthesized sentence. The speech generation unit 6 generates a speech signal from the generated speech information and reproduces it as sound that a person can perceive by hearing.

Next, the operation of the speech synthesizer 1 will be described. FIG. 4 is a flowchart of that operation. Here it is assumed that 「ガソリンスタンド」 ("gas station") is specified from outside as the replacement phrase, together with information identifying the reference sentence data record to be combined with it (for example, an ID value or the reference sentence itself).

In response, the replacement phrase acquisition unit 21 of the synthesized sentence generation unit 2 acquires the replacement phrase 「ガソリンスタンド」 and the information identifying the reference sentence data record to be combined with it (here, ID number 001) (step S101). The character string operation unit 23 retrieves, from the reference sentence data 30 stored in the reference sentence storage unit 22, the content 「近くの［駅］付近の地図を表示します。」 of the reference sentence text 302 of record 31, which is identified by ID number 001. It replaces the variable part of this string with the replacement phrase 「ガソリンスタンド」 acquired by the replacement phrase acquisition unit 21 and generates the synthesized sentence 「近くのガソリンスタンド付近の地図を表示します。」 (step S102).
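Steps S101 and S102 amount to a lookup followed by a string substitution. The sketch below shows one possible way to do this in Python. The dictionary `REFERENCE_TEXTS` and the function `build_synthesized_sentence` are hypothetical names, and the use of a regular expression is an implementation choice not stated in the description.

```python
import re

# Hypothetical stand-in for the reference sentence storage unit 22.
REFERENCE_TEXTS = {"001": "近くの［駅］付近の地図を表示します。"}

# Accept both full-width and ASCII brackets as variable-part markers.
_VARIABLE_PART = re.compile(r"[［\[].*?[］\]]")


def build_synthesized_sentence(record_id, replacement):
    """Replace the variable part of the identified reference text (steps S101-S102)."""
    text = REFERENCE_TEXTS[record_id]
    # A function is passed as the replacement so that any backslashes
    # in the phrase are inserted literally.
    result, count = _VARIABLE_PART.subn(lambda _match: replacement, text, count=1)
    if count == 0:
        raise ValueError("reference text has no variable part")
    return result


print(build_synthesized_sentence("001", "ガソリンスタンド"))
```

With the example inputs this produces 「近くのガソリンスタンド付近の地図を表示します。」, the synthesized sentence named in the text.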

Next, the phrase similarity determination unit 3 determines the prosodic similarity between the replacement phrase and the phrase in the variable part (step S103). When the two are prosodically similar, their speech information is alike and they can be regarded as highly interchangeable. In that case the speech information around the variable part connects naturally with the speech information obtained by rule-synthesizing the replacement phrase, and joining them as they are yields speech that does not sound unnatural. When the two are not prosodically similar, the influence that the replacement phrase has on the prosody of the rest of the synthesized sentence differs from the influence that the variable-part phrase has on the rest of the reference sentence. Joining the prosody around the variable part to the prosody of the replacement phrase then fails to sound natural. The phrase similarity determination unit 3 therefore determines the prosodic similarity between the replacement phrase and the variable-part phrase, and the subsequent processing branches according to whether they are judged similar or dissimilar.

One example of a method for determining prosodic similarity is to compare the number of morae and the accent type of the replacement phrase with those of the variable-part phrase. For example, when the accent types are identical and the mora counts are close, the phrases are prosodically interchangeable and can be judged highly similar. Whether the mora counts are close can be decided by taking their difference and checking whether it is within a predetermined value. Suppose the rule is "similar" when the difference in mora count is 3 or less and "dissimilar" when it exceeds 3. For the replacement phrase 「ガソリンスタンド」 (ga-so-ri-n-su-ta-n-do) and the variable-part phrase 「駅」 (e-ki), the accent type is -4 in both cases and so matches, but the mora counts are 8 and 2 respectively, so the difference is 6 and the condition of 3 or less is not met. The replacement phrase 「ガソリンスタンド」 and the variable-part phrase 「駅」 are therefore judged prosodically dissimilar.
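The example judgment in step S103 can be written as a short predicate. This is a minimal sketch under the rule given in the text (identical accent type and a mora count difference of 3 or less). The names `Prosody` and `is_prosodically_similar` are assumptions, and the accent type value -4 is copied from the description as stated.

```python
from typing import NamedTuple


class Prosody(NamedTuple):
    mora_count: int
    accent_type: int


def is_prosodically_similar(replacement, original, max_mora_diff=3):
    """Similar when the accent types match and the mora counts differ by at most max_mora_diff."""
    same_accent = replacement.accent_type == original.accent_type
    mora_diff = abs(replacement.mora_count - original.mora_count)
    return same_accent and mora_diff <= max_mora_diff


gas_station = Prosody(mora_count=8, accent_type=-4)   # 「ガソリンスタンド」
station = Prosody(mora_count=2, accent_type=-4)       # 「駅」
print(is_prosodically_similar(gas_station, station))
```

The accent types match, but the mora counts differ by 6, so the result for this pair is `False`, which corresponds to the "dissimilar" judgment in the text.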

It is generally known that accent strength increases as the number of morae increases. Consequently, when the difference in mora count between the replacement phrase and the variable-part phrase is large, the difference in accent strength makes the error in the F0 information after replacement large. The similarity judgment based on accent type and mora count difference described above rests on this knowledge.

If step S103 determines that the replacement phrase and the variable part are prosodically similar, processing proceeds to step S105. If it determines that they are prosodically dissimilar, processing proceeds to step S105 by way of step S104. The following therefore first describes step S104, which is executed when the two are judged dissimilar, and then describes step S105.

Next, the rule synthesis unit 4 extends the range to be rule-synthesized (step S104). Because the reference sentence data 30 does not store speech information for the replacement phrase portion of the synthesized sentence, at least that portion must be rule-synthesized. The replacement phrase portion (「ガソリンスタンド」) is therefore the basic rule synthesis range. Here, moreover, step S103 has determined that the replacement phrase (「ガソリンスタンド」) and the variable part (「駅」) are prosodically dissimilar. To keep the speech information of the synthesized sentence natural even in this case, the rule synthesis range is extended beyond the replacement phrase portion. For example, the phrase immediately following the replacement phrase portion (such as 「付近の」) is also included in the rule synthesis range.

Because the surrounding portion is rule-synthesized along with the replacement phrase portion, the naturalness of the speech information of the synthesized sentence can be improved even when the replacement phrase and the variable part are not prosodically similar.

One example of a method for extending the rule synthesis range is to apply morphological analysis to the following part 「付近の地図を表示します。」 and set the first morpheme (here 「付近」) as the range to be smoothed. The amount of extension (the number of morphemes) may also be increased according to the similarity calculated by the phrase similarity determination unit 3 (the magnitude of a numerical value representing similarity). For example, when the phrase similarity determination unit 3 finds that the accent types of the replacement phrase and the variable-part phrase differ and the mora counts also differ by a certain number or more, the degree of influence on the prosodic information of the following part differs greatly, so two morphemes (here 「付近の」) or three morphemes (「付近の地図」) may be set as the range to be smoothed.
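The morpheme-based extension described above might be sketched as follows. Morphological analysis is assumed to have been done elsewhere, so the morpheme list is written by hand. The function names, the threshold `large_diff`, and the choice of two morphemes for the strongly dissimilar case are illustrative assumptions; the text only says that two or three morphemes may be used.

```python
def extension_morpheme_count(same_accent, mora_diff,
                             similar_limit=3, large_diff=5, large_count=2):
    """Return how many following morphemes to add to the rule synthesis range."""
    if same_accent and mora_diff <= similar_limit:
        return 0              # judged similar: step S104 is skipped
    if not same_accent and mora_diff >= large_diff:
        return large_count    # strongly dissimilar: extend further
    return 1                  # dissimilar: extend by the first morpheme


def rule_synthesis_range(replacement, following_morphemes, count):
    """Replacement phrase plus the first `count` morphemes that follow it."""
    return replacement + "".join(following_morphemes[:count])


# Hand-written segmentation of 「付近の地図を表示します。」 (an assumption).
following = ["付近", "の", "地図", "を", "表示", "し", "ます", "。"]

count = extension_morpheme_count(same_accent=True, mora_diff=6)
print(rule_synthesis_range("ガソリンスタンド", following, count))
```

Making the count a function of the similarity result keeps the extension decision separate from the string handling, so a different similarity measure can be swapped in without touching the range construction.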

Because whether to rule-synthesize is decided for each morpheme following the replacement phrase, smoothing is performed in units of morphemes, which often coincide with breaks in utterance, and more natural speech information is obtained. In addition, by extending the rule synthesis range further as the similarity between the replacement phrase and the variable part decreases, the length of the portion that needs rule synthesis is determined dynamically, and natural speech is obtained even when the change of phrase has a large influence.

The extension of the rule synthesis range is not limited to phrases that follow the replacement phrase. For example, depending on the vocabulary of the replacement phrases and on how the variable part is configured (what is chosen as the variable part when designing the reference sentence data 30), a phrase preceding the replacement phrase may be included in the rule synthesis range. Both the preceding and the following phrases may also be included.

The rule synthesis unit 4 then generates speech information by rule synthesis for the phrases in the rule synthesis range (step S105). A conventionally known speech synthesis method is used for the rule synthesis. For example, when the phrase to be rule-synthesized is 「ガソリンスタンド付近の」, the duration and F0 value of each phoneme g, a, s, o, r, i, N, s, u, t, … are generated by a conventional speech synthesis method. One example of a rule-based generation method is a control model called the point pitch model. Pitch (F0) tends to fall over the course of a sentence, so this model represents that declination pattern as a straight line and the accent component added on top of it as a trapezoid, and from these determines the pitch (F0) at the center point of each mora. The start and end points of the straight line, the height of the trapezoid, and so on are determined from the accent positions, mora counts, and similar properties of the synthesized sentence 「近くのガソリンスタンド付近の地図を表示します。」 generated by the character string operation unit 23. One approach is to describe in a table, in advance, the correspondence between accent position, mora count, and so on and the start and end points of the line, the trapezoid height, and so on.
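The point pitch idea, a falling straight line plus a trapezoidal accent component sampled at each mora center, can be illustrated with a small Python sketch. This is a simplified reading of the description and not the implementation of any particular synthesizer. The function names, the parameterization of the trapezoid by four corner times in mora units, and all numeric values are assumptions. In practice the parameters would be looked up from a table keyed by accent position and mora count.

```python
def trapezoid(t, t0, t1, t2, t3, height):
    """Accent component: rises over t0..t1, stays at `height` over t1..t2, falls over t2..t3."""
    if t <= t0 or t >= t3:
        return 0.0
    if t < t1:
        return height * (t - t0) / (t1 - t0)
    if t <= t2:
        return height
    return height * (t3 - t) / (t3 - t2)


def point_pitch_f0(n_mora, line_start, line_end, accent):
    """Return the F0 value at the center of each mora.

    line_start, line_end: F0 of the declination line at the start and end of the phrase.
    accent: (t0, t1, t2, t3, height), with times measured in mora units.
    """
    values = []
    for i in range(n_mora):
        t = i + 0.5                                                   # center of mora i
        baseline = line_start + (line_end - line_start) * t / n_mora  # declination line
        values.append(baseline + trapezoid(t, *accent))
    return values


# Hypothetical parameters: a 4-mora phrase falling from 200 Hz to 160 Hz
# with a 40 Hz accent component.
print(point_pitch_f0(4, 200.0, 160.0, (0.0, 1.0, 2.0, 3.0, 40.0)))
```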

Next, to smooth the speech information of the rule-synthesized synthesized sentence, the local smoothing unit 5 first calculates a smoothing amount (step S106) and then smooths the speech information of the synthesized sentence using that amount (step S107). The speech information rule-synthesized by the rule synthesis unit 4 does not take into account how the fixed part of the reference sentence connects to the replacement phrase, so it often sounds unnatural and is hard to listen to. The local smoothing unit 5 therefore smooths the speech information of the synthesized sentence to improve its naturalness.

The smoothing amount in step S106 is calculated as follows. As shown in FIG. 5, let B0 be the F0 information of the last part of the fixed part immediately before the replacement phrase, H0 the F0 information at the head of the replacement phrase, T0 the F0 information at the end of the replacement phrase, and A0 the F0 information of the first part of the fixed part immediately after the replacement phrase. The smoothing amount Δ (delta) is then calculated as a weighted average according to Equation (1).

In Equation (1), W1 and W2 are weighting coefficients that satisfy W1 + W2 = 1.
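Equation (1) itself is not reproduced in this text. The sketch below therefore uses an assumed form that is consistent with the description (a weighted average with W1 + W2 = 1 over the F0 gaps at the two junctions), namely Δ = W1·(B0 − H0) + W2·(A0 − T0). This form, the function name, and the default weight are assumptions, not the patent's stated formula.

```python
def smoothing_amount(b0, h0, t0, a0, w1=0.5):
    """Assumed reading of Equation (1): weighted average of the two junction gaps.

    b0: F0 at the end of the fixed part before the replacement phrase
    h0: F0 at the head of the replacement phrase
    t0: F0 at the end of the replacement phrase
    a0: F0 at the start of the fixed part after the replacement phrase
    w1: weight of the leading junction; w2 is derived so that w1 + w2 = 1
    """
    w2 = 1.0 - w1
    return w1 * (b0 - h0) + w2 * (a0 - t0)


# Hypothetical F0 values in Hz: the replacement phrase sits 20 Hz too low at both junctions.
delta = smoothing_amount(b0=200.0, h0=180.0, t0=150.0, a0=170.0)
print(delta)
```

Under this reading, a positive Δ means the replacement phrase should be raised to meet the surrounding fixed parts, and a negative Δ means it should be lowered.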

In Equation (1), W1 and W2 may be varied according to the types of phonemes in the variable part and the replacement phrase. This allows the smoothing to be tailored to the particular variable part and replacement phrase, which increases the naturalness of the synthesized speech and yields speech information that is easier to listen to.

The F0 information B0 of the last part of the fixed part immediately before the replacement phrase can be, for example, the F0 information of the last mora of that fixed part, but it is not limited to this. Because the F0 information of a mora adjacent to the variable part is often deformed by the variable part, B0 need not come from the last mora itself; it may instead be taken from a mora near the last mora that is less affected by the variable part. Similarly, the F0 information H0 at the head of the replacement phrase may be that of the head mora of the replacement phrase or that of a mora near it. The F0 information T0 at the end of the replacement phrase and the F0 information A0 of the first part of the fixed part immediately after the replacement phrase may likewise be selected from the vicinity of the final mora of the replacement phrase and of the first mora of the following fixed part, respectively.

In step S107, the speech information of the replacement phrase is smoothed by uniformly shifting the F0 information of the replacement phrase in this interval up or down by the Δ calculated in step S106. When the F0 information at the start and end of the replacement phrase is far from the F0 information of the fixed parts it connects to, shifting the whole replacement phrase reduces the F0 gap and thereby improves the naturalness of the prosody.

After the range of the subsequent part to be smoothed has been determined in this way, the local smoothing unit 5 shifts the speech information of that subsequent part by the same amount as the smoothing amount applied to the replacement phrase. The smoothing amount used here is, for example, the Δ already calculated with Equation (1). If each F0 value of the replacement phrase was increased by Δ, each F0 value of the subsequent part to be smoothed is also increased by Δ.
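The uniform shift of step S107, and its extension to the subsequent part, can be sketched as follows. The list-of-F0-values representation, the index convention, and the numbers are assumptions for illustration only.

```python
def shift_range(f0, start, end, delta):
    """Return a copy of the per-mora F0 list with moras start..end-1 shifted by delta."""
    return [v + delta if start <= i < end else v for i, v in enumerate(f0)]


# Hypothetical contour (Hz): mora 0 is fixed text, moras 1..3 are the replacement
# phrase, moras 4..5 are the subsequent part to be smoothed, mora 6 is fixed text.
f0 = [200.0, 180.0, 175.0, 150.0, 160.0, 155.0, 170.0]
delta = 20.0

replacement_only = shift_range(f0, 1, 4, delta)   # shift the replacement phrase alone
with_subsequent = shift_range(f0, 1, 6, delta)    # also shift the subsequent part by the same delta
print(replacement_only)
print(with_subsequent)
```

Using one Δ for both ranges keeps the junction between the replacement phrase and the subsequent part unchanged while moving both toward the surrounding fixed text.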

In the description above, B0, H0, T0, and A0 of Equation (1) are obtained at the junctions between the replacement phrase and the fixed parts before and after it, Δ is calculated from Equation (1), and this Δ is also used as the smoothing amount for the subsequent part. The method is not limited to this, however. For example, the replacement phrase and the subsequent part to be smoothed, as determined by the local smoothing unit 5, may be treated as a single combined range. B0, H0, T0, and A0 are then obtained at the junctions between this combined range and the fixed parts before and after it, Δ is calculated first, and the F0 information of the entire combined range is shifted by that Δ. With this method the smoothing amount reflects how the subsequent part connects to the rest of the fixed text, so an even more natural connection is possible.

In step S107, when the boundary of the rule synthesis range is a phrase boundary, the local smoothing unit 5 may additionally connect, with a straight line, the F0 information of a mora near the end of the rule synthesis range (for example, the last mora) and the F0 information of a mora near the beginning of the part that follows the range (for example, the second mora), thereby deforming the F0 information of the rule synthesis range and the following part for further smoothing. For example, when the rule synthesis range is "ガソリンスタンド付近の" ("near the gas station"), the boundary of the range is a phrase boundary, so the F0 information of the last mora of the range and that of the second mora of the following part "地図を表示します。" ("display a map") are connected with a straight line, and the rule synthesis range and the remaining range are deformed accordingly.

Because the local smoothing unit 5 smooths the speech information so that the part following the replacement phrase connects smoothly to the part of the synthesized sentence after it, a sufficiently natural synthesized voice is ultimately obtained for the synthesized sentence as a whole, even when simply shifting the replacement phrase and its subsequent part by the smoothing amount does not give sufficient naturalness.

The moras connected by the straight line are not limited to the last mora of the rule synthesis range and the second mora of the following part. Other moras may be used: for example, the second-to-last mora instead of the last mora, or the third mora of the following part instead of the second. In other words, nearby moras may be used. Here, "nearby" refers to moras close enough in position that they can be expected to have similar phonological properties, because in natural speech the speech information of moras located close to each other often takes similar values.

Needless to say, a straight line may be drawn not only between the replacement phrase and the part that follows it, but also between a mora near the beginning of the replacement phrase and a mora near the end of the part that precedes it.

Here, "connecting with a straight line" means, for example, interpolating the F0 values of the moras in between from the F0 information of the two reference moras at either end. Since the purpose of this processing is to deform the F0 information so that it changes smoothly, values on a continuous curve, for example, may be used instead of a straight line.
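The straight-line connection described above amounts to linear interpolation between two anchor moras. A minimal sketch, assuming the F0 contour is a list with one value per mora and that the two anchor indices are chosen by the caller (both are illustrative assumptions):

```python
def connect_with_line(f0, left, right):
    """Replace the F0 of moras strictly between `left` and `right` by linear interpolation."""
    if not 0 <= left < right < len(f0):
        raise ValueError("anchors must satisfy 0 <= left < right < len(f0)")
    out = list(f0)
    span = right - left
    for i in range(left + 1, right):
        # Value on the straight line through (left, f0[left]) and (right, f0[right]).
        out[i] = f0[left] + (f0[right] - f0[left]) * (i - left) / span
    return out


# Hypothetical contour (Hz): anchor at mora 0 (end of the rule synthesis range)
# and at mora 4 (a mora near the start of the following part).
print(connect_with_line([200.0, 230.0, 140.0, 190.0, 160.0], 0, 4))
```

Swapping the interpolation line for a smooth curve through the same two anchors would match the alternative mentioned in the text.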

The speech generation unit 6 generates speech from the speech information of the synthesized sentence obtained so far (step S109). This processing is the same as in the conventional art, so a detailed description is omitted.

As described above, the speech synthesizer 1 smooths not only the speech information of the replacement phrase but also that of the part immediately following it. It can therefore obtain speech information that produces more natural, easier-to-understand speech than when only the replacement phrase is smoothed.

Although the example above mainly describes smoothing the F0 information, the duration information may be modified in the same way. In that case the amount of modification, that is, the smoothing amount, is calculated from the difference between the average mora duration in the fixed part and the average mora duration in the replacement phrase, or in the replacement phrase combined with the part immediately following it. In addition, the duration of a fixed-part mora adjacent to the variable part is modified by replacing it with a weighted average of the rule-generated duration for that mora and the mora's duration in the fixed part. This yields fixed-part phoneme durations that suit the phoneme types of the variable part, so that the duration information joins naturally.
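The two duration adjustments described above can be sketched as follows. The text says only that the smoothing amount is calculated "from the difference" of the averages and that the adjacent mora uses a weighted average, so the sign convention, the weight parameter, and the millisecond values here are assumptions.

```python
from statistics import mean


def duration_smoothing_amount(fixed_durations, replaced_durations):
    """Difference between the mean mora duration of the fixed part and of the replaced range."""
    return mean(fixed_durations) - mean(replaced_durations)


def blend_adjacent_duration(rule_duration, fixed_duration, w_rule=0.5):
    """Weighted average of the rule-generated and the recorded duration of one fixed-part mora."""
    return w_rule * rule_duration + (1.0 - w_rule) * fixed_duration


# Hypothetical mora durations in milliseconds.
fixed = [100.0, 120.0]
replaced = [90.0, 110.0]
print(duration_smoothing_amount(fixed, replaced))
print(blend_adjacent_duration(rule_duration=80.0, fixed_duration=100.0))
```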

In the example above, the F0 information and the duration information are modified over the same range, but the two ranges may be determined independently. That is, different boundary values for the mora-count difference are set for the F0 information and for the duration information. Suppose, for example, that the boundary value is 1 for the F0 information and 3 for the duration information. When the variable part "駅" ("station") in "近くの「駅」付近の地図を表示します。" ("Display a map near the nearby [station].") is replaced with the replacement phrase "コンビニ" ("convenience store"), "コンビニ" has 4 moras against 2 for "駅", so the difference is 2. This exceeds the boundary value for the F0 information but not the one for the duration information. "駅" and "コンビニ" are therefore treated as dissimilar for the F0 information and as similar for the duration information, and the local smoothing unit 5 smooths the subsequent part for the F0 information only.

By handling the F0 information and the duration information independently in this way, the local smoothing unit 5 can shift only one of the two, the F0 information or the duration information, of the subsequent part by the smoothing amount. This configuration allows flexible smoothing, for example when the F0 information needs smoothing but the duration information is already sufficiently natural and needs no adjustment, or vice versa.
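The independent boundary values can be illustrated with the "駅" / "コンビニ" example. This sketch checks only the mora-count difference, as the example in the text does; the names and the dictionary return value are illustrative assumptions.

```python
F0_BOUNDARY = 1        # boundary value for F0 information (from the example)
DURATION_BOUNDARY = 3  # boundary value for duration information (from the example)


def exceeds_boundary(variable_moras, replacement_moras, boundary):
    """True when the mora-count difference exceeds the boundary (phrases treated as dissimilar)."""
    return abs(variable_moras - replacement_moras) > boundary


def subsequent_smoothing_targets(variable_moras, replacement_moras):
    """Decide, per feature, whether the subsequent part should also be smoothed."""
    return {
        "f0": exceeds_boundary(variable_moras, replacement_moras, F0_BOUNDARY),
        "duration": exceeds_boundary(variable_moras, replacement_moras, DURATION_BOUNDARY),
    }


# "駅" has 2 moras and "コンビニ" has 4, so the difference is 2.
print(subsequent_smoothing_targets(2, 4))
```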

Embodiment 2.
In the speech synthesizer of Embodiment 1, whether to perform the extended smoothing process is decided on the basis of the prosodic similarity between the replacement phrase and the phrase of the variable part. Alternatively, the extended smoothing process may be performed when, even after the speech information of the replacement phrase has been smoothed (shifted), the prosodic information still differs greatly at the junctions between the replacement phrase and the fixed parts before and after it. The speech synthesizer of Embodiment 2 has this feature.

FIG. 6 is a block diagram showing the configuration of a speech synthesizer according to Embodiment 2 of the present invention. Components with the same reference numerals as in FIG. 1 are the same as in Embodiment 1, and their description is omitted. In the figure, the forward difference calculation unit 7, serving as forward difference calculation means, calculates the difference in prosodic information (the forward difference) at the junction between the replacement phrase and the fixed part preceding it. Specifically, the forward difference calculation unit 7 obtains the F0 information of the head mora of the replacement phrase from the speech information obtained by shifting the rule-synthesized speech information by the smoothing amount, obtains the F0 information of the part of the synthesized sentence preceding the replacement phrase (the forward part), and calculates the difference between the two.

Next, the operation of the speech synthesizer 1 according to Embodiment 2 is described. FIG. 7 is a flowchart showing this operation. It differs from FIG. 4, the flowchart of Embodiment 1, only in steps S201 to S203 and S105-2, so the description below focuses on these steps. In Embodiment 2, as in Embodiment 1, the synthesized sentence generation unit 2 generates the synthesized sentence by step S102. The rule synthesis unit 4 then rule-synthesizes the speech information of the replacement phrase (step S201). The rule synthesis method is the same as in Embodiment 1.

The forward difference calculation unit 7 then calculates the difference in speech information at the junction between the replacement phrase and the fixed part preceding it (step S202). The difference used here is the difference in F0 information. For example, the unit obtains the F0 information of the first mora of the replacement phrase and that of the last mora of the preceding fixed part, and takes their difference as the forward difference.

Next, the forward difference calculation unit 7 compares the calculated forward difference with a predetermined threshold (step S203). If the forward difference exceeds the threshold, the connection between the replacement phrase and the rest of the sentence is unnatural, and processing proceeds to step S104. If the forward difference is less than or equal to the threshold, processing proceeds to step S106. Step S106 and the subsequent steps are the same as in Embodiment 1, so their description is omitted.

In step S104, the rule synthesis unit 4 extends the rule synthesis range as in Embodiment 1. It then rule-synthesizes the speech information of the extended range (step S105-2), using the same rule synthesis method as in step S201. Processing then continues from step S106, as in the case where the forward difference is less than or equal to the threshold.
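The branch in steps S202 to S203 can be sketched as a small decision function. The use of an absolute difference, the string labels, and the threshold value are assumptions for illustration; the text specifies only that the forward difference is compared with a predetermined threshold.

```python
def forward_difference(preceding_fixed_f0, replacement_f0):
    """Step S202: F0 gap between the last mora of the preceding fixed part and the first mora of the replacement phrase."""
    return abs(replacement_f0[0] - preceding_fixed_f0[-1])


def next_steps(preceding_fixed_f0, replacement_f0, threshold):
    """Step S203: choose the path through the flowchart of FIG. 7."""
    if forward_difference(preceding_fixed_f0, replacement_f0) > threshold:
        # Unnatural junction: extend the rule synthesis range and re-synthesize.
        return ["S104", "S105-2", "S106"]
    # Junction is acceptable: go straight to calculating the smoothing amount.
    return ["S106"]


# Hypothetical F0 values in Hz and a hypothetical threshold of 25 Hz.
print(next_steps([210.0, 200.0], [160.0, 150.0], threshold=25.0))
print(next_steps([210.0, 200.0], [190.0, 180.0], threshold=25.0))
```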

In this way, when the speech information is still not connected smoothly enough after the processing by the rule synthesis unit 4, the subsequent part is smoothed as well, so that natural speech information is obtained for the synthesized sentence as a whole and easy-to-understand speech can be generated.

In this example, whether to extend the range of phrases to be rule-synthesized is decided from the connection between the replacement phrase and the phrase preceding it. Alternatively, the range may be extended on the basis of the connection between the replacement phrase and the phrase following it. That is, in place of the forward difference calculation unit 7, backward difference calculation means may be used that obtains the F0 information of the terminal mora of the rule-synthesized replacement phrase, obtains the F0 information of the part of the synthesized sentence following the replacement phrase (the backward part), and calculates the difference between the F0 information of the terminal mora and that of the backward part (the backward difference).

Embodiment 3.
Instead of determining the range to be smoothed from the similarity between the replacement phrase and the phrase of the variable part as in Embodiment 1, the range may be determined by whether the replacement phrase and the phrase following it have a certain relationship. For example, when the replacement phrase "ガソリンスタンド" ("gas station") is followed by the phrase "付近" ("near"), the speech information of "付近" is always smoothed. The speech synthesizer of Embodiment 3 has this feature.

FIG. 8 is a block diagram showing the configuration of a speech synthesizer according to Embodiment 3 of the present invention. Components with the same reference numerals as in FIG. 1 are the same as in Embodiment 1, and their description is omitted. In the speech synthesizer 1 of the figure, the related phrase determination unit 8, serving as related phrase determination means, determines whether the replacement phrase and the word following it have a predetermined relationship.

Next, the operation of this speech synthesizer 1 is described. FIG. 9 is a flowchart showing the operation. It differs from FIG. 4, the flowchart of Embodiment 1, only in step S301, so the description below focuses on that step. By step S102, the synthesized sentence generation unit 2 has generated the synthesized sentence. In step S301, the related phrase determination unit 8 determines whether the phrase following the replacement phrase has a predetermined relationship with the replacement phrase. For this purpose, the related phrase determination unit 8 stores, in a storage device (not shown), the correspondence between replacement phrases and following phrases as a replacement-phrase correspondence list. On obtaining the synthesized sentence in step S301, the unit searches the list for the pair consisting of the replacement phrase and the following phrase and, if the pair is found, judges the following phrase to be a related phrase. If the following phrase is a related phrase, the rule synthesis unit 4 proceeds to step S104 and extends the rule synthesis range. Otherwise it proceeds directly to step S105 and synthesizes the speech information of the rule synthesis range.
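The lookup in step S301 can be sketched as a membership test on a stored set of (replacement phrase, following phrase) pairs. The set-of-tuples representation and the function names are assumptions for illustration; the single entry is taken from the example in the text.

```python
# Hypothetical replacement-phrase correspondence list.
RELATED_PHRASES = {
    ("ガソリンスタンド", "付近"),
}


def is_related_phrase(replacement, following, table=RELATED_PHRASES):
    """Step S301: is the following phrase registered as related to the replacement phrase?"""
    return (replacement, following) in table


def next_step(replacement, following):
    """Extend the rule synthesis range (S104) only for registered pairs; otherwise go to S105."""
    return "S104" if is_related_phrase(replacement, following) else "S105"


print(next_step("ガソリンスタンド", "付近"))
print(next_step("コンビニ", "付近"))
```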

As is clear from the above, the speech synthesizer 1 of Embodiment 3 rule-synthesizes the speech information of the following phrase together with the replacement phrase when the two have a special relationship. It can therefore handle cases where combining the replacement phrase with the following phrase produces a special accent, and can synthesize easy-to-understand speech with natural prosody.

The speech synthesizer and speech synthesis program according to the present invention are particularly applicable to speech synthesizers that synthesize speech for a synthesized sentence formed by combining a reference sentence having a replaceable variable part with a phrase that replaces the variable part.

FIG. 1 is a block diagram showing the configuration of the speech synthesizer according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing the detailed configuration of the speech synthesizer according to Embodiment 1 of the present invention.
FIG. 3 is a diagram showing an example of the reference sentence data stored as information about a reference sentence.
FIG. 4 is a flowchart showing the operation of the speech synthesizer according to Embodiment 1 of the present invention.
FIG. 5 is an explanatory diagram illustrating the method of calculating the smoothing amount in Embodiment 1 of the present invention.
FIG. 6 is a block diagram showing the configuration of the speech synthesizer according to Embodiment 2 of the present invention.
FIG. 7 is a flowchart showing the operation of the speech synthesizer according to Embodiment 2 of the present invention.
FIG. 8 is a block diagram showing the configuration of the speech synthesizer according to Embodiment 3 of the present invention.
FIG. 9 is a flowchart showing the operation of the speech synthesizer according to Embodiment 3 of the present invention.

Explanation of symbols

2 synthesized sentence generation unit,
3 phrase similarity determination unit,
4 rule synthesis unit,
5 local smoothing unit,
6 speech generation unit,
7 forward difference calculation unit,
8 related phrase determination unit.

Claims

1. A speech synthesizer that generates speech information for a synthesized sentence, obtained by replacing a variable part of a reference sentence with a replacement phrase, by combining speech information obtained by rule-synthesizing the prosodic information of the replacement phrase with speech information of the reference sentence, and that synthesizes speech reading out the synthesized sentence using the generated speech information,
the speech synthesizer comprising rule synthesis means that, when a predetermined condition is met, takes either the part of the synthesized sentence preceding the replacement phrase or the part following it as a rule synthesis extension part and rule-synthesizes the speech information of the rule synthesis extension part together with the speech information of the replacement phrase.

2. The speech synthesizer according to claim 1, further comprising phrase similarity determination means for determining whether the phrase of the variable part and the replacement phrase are prosodically similar,
wherein the rule synthesis means rule-synthesizes the speech information of the rule synthesis extension part when the phrase similarity determination means determines that the phrase of the variable part and the replacement phrase are not prosodically similar.

3. The speech synthesizer according to claim 2, wherein the phrase similarity determination means determines that the phrase of the variable part and the replacement phrase are prosodically similar when the difference between the number of moras of the phrase of the variable part and the number of moras of the replacement phrase is within a predetermined value and the accent type of the phrase of the variable part matches the accent type of the replacement phrase.

4. The speech synthesizer according to claim 3, wherein the rule synthesis means rule-synthesizes the speech information of a morpheme preceding or following the replacement phrase as the rule synthesis extension part.

5. The speech synthesizer according to claim 4, wherein the phrase similarity determination means calculates a numerical value indicating the similarity between the phrase of the variable part and the replacement phrase, and
the rule synthesis means increases or decreases the number of morphemes included in the rule synthesis extension part according to the numerical value.

6. The speech synthesizer according to claim 1, further comprising local smoothing means for further smoothing the speech information from a mora near the end of the replacement phrase to a mora near the beginning of the part following the replacement phrase.

7. The speech synthesizer according to claim 1, further comprising local smoothing means for further smoothing the speech information from a mora near the beginning of the replacement phrase to a mora near the end of the part preceding the replacement phrase.

8. The speech synthesizer according to claim 1, further comprising forward difference calculation means for calculating the difference between the F0 information of the head mora of the speech information rule-synthesized from the replacement phrase and the F0 information at the end of the part of the synthesized sentence preceding the replacement phrase,
wherein the rule synthesis means rule-synthesizes the speech information of the rule synthesis extension part when the difference calculated by the forward difference calculation means is equal to or greater than a predetermined value.

9. The speech synthesizer according to claim 1, further comprising backward difference calculation means for calculating the difference between the F0 information of the terminal mora of the speech information rule-synthesized from the replacement phrase and the F0 information at the head of the part of the synthesized sentence following the replacement phrase,
wherein the rule synthesis means rule-synthesizes the speech information of the rule synthesis extension part when the difference calculated by the backward difference calculation means is equal to or greater than a predetermined value.

10. The speech synthesizer according to claim 1, further comprising related phrase determination means for determining whether the phrase preceding or following the replacement phrase in the synthesized sentence is a predetermined phrase related to the replacement phrase,
wherein the rule synthesis means, when that phrase is a predetermined phrase related to the replacement phrase, rule-synthesizes the speech information with the preceding or following phrase as the rule synthesis extension part.

11. A speech synthesis program that causes a computer to execute processing that generates speech information for a synthesized sentence, obtained by replacing a variable part of a reference sentence with a replacement phrase, by combining speech information obtained by rule-synthesizing the prosodic information of the replacement phrase with speech information of the reference sentence, and that synthesizes speech reading out the synthesized sentence using the generated speech information,
the program causing the computer to function as rule synthesis means that takes either the part of the synthesized sentence preceding the replacement phrase or the part following it as a rule synthesis extension part and rule-synthesizes the speech information of the rule synthesis extension part together with the speech information of the replacement phrase.

12. The speech synthesis program according to claim 11, wherein the rule synthesis means decides, on the basis of a predetermined condition, whether to rule-synthesize the speech information of the rule synthesis extension part.