JP2013011902A

JP2013011902A - Method for generating speech by processing text by using non-language dependent rhythm markup and device for the same

Info

Publication number: JP2013011902A
Application number: JP2012201342A
Authority: JP
Inventors: Gregory P Kochanski; ピー．コチャンスキグレゴリー; Shii Chi-Rin; シィチーリン
Original assignee: Alcatel Lucent USA Inc
Current assignee: Nokia of America Corp
Priority date: 2000-09-05
Filing date: 2012-09-13
Publication date: 2013-01-17
Anticipated expiration: 2021-09-05
Also published as: JP5634466B2; JP2002091474A; JP5361104B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique for modeling speech having desired prosodic characteristics. SOLUTION: A set of tags defining prosodic characteristics is prepared, and selected tags are placed at appropriate locations in a body of text. Each tag imposes a constraint on the prosodic characteristics of the speech generated by processing the text. Processing the text and the tags produces a set of equations that is solved to generate a curve defining the prosodic characteristics across a phrase, and a further set of equations that is solved to generate a curve defining the prosodic characteristics of the individual words within the phrase. The data defined by the curves is used together with the text to generate speech having the prosodic characteristics defined by the tags. The set of tags is created by having a target speaker read a training text, which yields a training corpus reflecting the speaker's prosodic characteristics, and then analyzing the corpus to generate tags that model those characteristics.

Description

This application claims the benefit of U.S. Patent Application No. 60/230,204, filed on September 5, 2000, and U.S. Patent Application No. 60/236,002, filed on September 28, 2000, both of which are incorporated herein by reference in their entirety.
The present invention relates generally to improvements in the representation and modeling of phenomena that are continuous and subject to physiological constraints. In particular, the invention relates to the creation and use of a set of tags that define the characteristics of a signal, and to the processing of those tags to produce a signal having the characteristics the tags define.

A text-to-speech system receives text input, usually words and sentences, and converts it into spoken words and sentences. To do so, the system employs a model of a particular speaker's speech to build an inventory of speech units and prosodic models corresponding to each pronounceable unit of text. The prosodic features of speech are its rhythmic and intonational features. The system then assembles the speech units in the order given by the text and plays back the resulting sequence. A typical text-to-speech system performs text analysis to predict the phone sequence, duration modeling to predict the length of each phone, intonation modeling to predict the pitch contour, and signal processing to combine the results of the different analyses and modules into speech sound.

Many prior art text-to-speech systems infer prosodic information from the text from which synthetic speech is generated. Prosodic information includes the rhythm, pitch, accent, and volume of speech, among other features. Text usually contains very little information from which prosody can be inferred. Prior art text-to-speech systems therefore tend to be designed conservatively. A conservatively designed system produces neutral prosody whenever the correct prosody cannot be determined, on the theory that neutral prosody is preferable to incorrect prosody. As a result, the prosodic models also tend to be conservative and cannot model the prosodic variation found in natural speech. It is this variation that lets natural speech match any given pitch contour or convey a wide range of impressions, such as a personal speaking style or an emotion. The absence of such variation contributes significantly to the artificial sound of the speech produced by many prior art systems.

In many applications it is desirable to use a text-to-speech system that can carry on a dialogue. For example, a text-to-speech system can generate the speech for a telephone menu system that gives spoken responses to customer input. Such a system may include state information corresponding to concepts, goals, and intentions. For example, if the system produces a set of words that expresses a single proper noun, such as “Wells Fargo Bank”, the synthesized speech should carry the sound features that convey that the words form a single noun. In other cases, the speech may need to convey that a word is particularly important, or that a word requires confirmation. To convey the correct impression, the generated speech must have appropriate prosodic features. Prosodic features that can usefully be defined for generated speech include pitch, amplitude, and any other features needed to make the speech sound natural and convey the desired impression.

There is therefore a need for a system of tags that can define prosodic features in enough detail to model speech having desired characteristics, and for a system that processes the tags to generate speech having the characteristics they define.

The present invention recognizes the need for a system that generates speech with desired prosodic features. To this end, the system includes the generation and processing of a set of tags that can be used to model phenomena that are continuous and subject to physiological constraints. The prosodic features of speech are one example of such a phenomenon, and a set of tags can be created to represent the prosodic features of a particular speaker's speech or any other desired prosodic features. The tags can be applied to text at appropriate locations and define the prosodic features of the speech generated by processing that text. The set of tags defines prosodic features in enough detail that processing the tags together with the text accurately models speech having the prosodic features of the original speech from which the tags were created. Including this level of detail allows the tags to be language independent, because the tags can supply the information that would otherwise come from knowledge of the prosodic features of the language in use. A text-to-speech system that employs a set of tags according to the present invention can therefore generate correct prosody in any language, and can also generate correct prosody for text that mixes languages. For example, a text-to-speech system that employs the teachings of the present invention can correctly process a block of English text containing a French quotation, generating speech with correct prosodic features in the English portion and, likewise, correct prosodic features in the French portion.

To provide an accurate representation of speech, the tags preferably include information that defines compromises between tags. When the tags are processed, compromises are made on the basis of information within the tags that defines how they relate to one another, together with default information. Many speech units affect the characteristics of other speech units, and adjacent units in particular tend to influence each other. If the tags used to define adjacent units, such as syllables, words, or groups of words, contain conflicting instructions for assigning prosodic features, appropriate adjustments are made according to the priority of the information and the way conflicts and compromises are to be handled. For example, adjacent words or phrases may each be adjusted. Alternatively, if the tag information indicates that one of the adjacent words or phrases is dominant, the appropriate adjustment is made to the other word or phrase.
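The idea of a compromise between competing tags can be illustrated with a strength-weighted average. This is only a minimal sketch under stated assumptions: the function name `compromise` and the weighting rule are illustrative and are not the solver described in this document, which generates and solves sets of equations.

```python
def compromise(targets):
    """Blend competing prosodic targets.

    targets: list of (value, strength) pairs. The strength-weighted mean used
    here is an assumption made for illustration, not the document's method.
    """
    total_strength = sum(strength for _, strength in targets)
    if total_strength <= 0:
        raise ValueError("at least one target needs a positive strength")
    return sum(value * strength for value, strength in targets) / total_strength


# Two adjacent units ask for different levels (as fractions of the range).
balanced = compromise([(1.0, 1.0), (0.0, 1.0)])   # equal strengths: both adjust
dominated = compromise([(1.0, 3.0), (0.0, 1.0)])  # the stronger tag dominates
```

With equal strengths both units move toward the midpoint; when one tag is stronger, the result stays closer to that tag's target and the other unit absorbs most of the adjustment.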

A set of tags can be defined by training, that is, by analyzing the characteristics of a corpus of training text read by a particular speaker. Tags are then defined from the characteristics identified. For example, if the training corpus shows that the speaker has a fundamental frequency of 150 Hz and that the pitch of the speaker's speech rises by 50 Hz at the end of a question, tags can be defined that set the fundamental frequency of the generated speech to 150 Hz and raise the pitch by 50 Hz at the end of a question.
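As a rough illustration of the training idea, the sketch below averages hypothetical pitch measurements and emits a base-frequency tag. The function name `derive_tags`, the input lists, and the returned dictionary layout are assumptions for illustration; only the `<set base=.../>` form comes from this document.

```python
from statistics import mean


def derive_tags(statement_f0_hz, question_final_rise_hz):
    """Summarize training measurements (hypothetical inputs) as tag data."""
    base = round(mean(statement_f0_hz))          # speaker's typical pitch in Hz
    rise = round(mean(question_final_rise_hz))   # typical rise ending a question
    return {"base_tag": f"<set base={base}/>", "question_final_rise_hz": rise}


# Made-up measurements chosen to match the 150 Hz / 50 Hz example above.
measurements = derive_tags([148.0, 152.0, 150.0], [48.0, 52.0])
```

The question-final rise is returned as a plain number because the tag that would encode it depends on tag details described later in the document.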

Once the tags have been established, they can be entered into a body of text from which speech is to be generated. This can be done simply by using an editor to type the appropriate tags into the text. For example, to perform text-to-speech processing on the sentence “You are the weakest link”, establishing a base frequency of 150 Hz and placing an accent on the word “are”, tags can be added to the sentence as follows:
<set base=150/>You<stress strength=4 type=0.5 pos=* shape=-0.2s.03, -.1s.03, 0s0, 0.1s-0.1, 0.2s-0.1/>are<slope=-0.8/>the weakest link.

The result is a phrase curve with a pitch centered on approximately 150 Hz, an accent on the word “are”, and a pitch that falls from the end of the word “are” to the end of the sentence. When the data defined by the text and tags is supplied to a synthesizer, the way the synthesizer pronounces the sentence reflects the characteristics defined by the phrase curve. Further aspects of the tags and their operation are described below.
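The tagged sentence can be separated into a word sequence and a tag sequence before any prosody is computed. The following is a minimal sketch, assuming tags are delimited by “<” and “/>” and never contain those characters internally; the function name `split_tags_and_text` and the (word index, tag body) pairing are illustrative choices, not part of this document.

```python
import re

# The example sentence, with the spacing inside the tags normalized.
TAGGED = ("<set base=150/>You<stress strength=4 type=0.5 pos=* "
          "shape=-0.2s.03, -.1s.03, 0s0, 0.1s-0.1, 0.2s-0.1/>are"
          "<slope=-0.8/>the weakest link")

# A tag is everything between "<" and the next "/>".
TAG_RE = re.compile(r"<([^<>]*?)/>")


def split_tags_and_text(source):
    """Return (tags, words); each tag is paired with the index of the word it precedes."""
    tags, words = [], []
    position = 0
    for match in TAG_RE.finditer(source):
        words.extend(source[position:match.start()].split())
        tags.append((len(words), match.group(1).strip()))
        position = match.end()
    words.extend(source[position:].split())
    return tags, words


tags, words = split_tags_and_text(TAGGED)
```

Here `words` holds the five words of the sentence, and `tags` records that the base setting precedes “You”, the stress tag precedes “are”, and the slope tag precedes “the”.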

As an alternative to entering tags with an editor, tags can be placed automatically according to a programmed set of rules. An exemplary rule set defining the pitch of a declarative sentence might, for example, set a slope that declines over the course of the sentence and use a falling accent on the last word of the sentence. Applying these rules to a body of text establishes the appropriate tags for each declarative sentence in the text. Further rules may be employed to define other sentence types and functions. Other tags can be established and applied to the text, for example to define volume (amplitude) and accent (emphasis).
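The exemplary rule set for declarative sentences could be sketched as follows. The function name `tag_declarative` is hypothetical, and the default tag strings, in particular the attribute values of the accent tag, are placeholders chosen for illustration rather than values taken from this document.

```python
def tag_declarative(sentence,
                    slope_tag="<slope=-0.8/>",
                    accent_tag="<stress strength=2 type=0.5/>"):
    """Apply two rules to a declarative sentence (one ending in a period).

    Rule 1: start the sentence with a declining slope.
    Rule 2: put an accent tag (placeholder values) before the last word.
    Other sentence types are returned without tags.
    """
    text = sentence.strip()
    if not text.endswith("."):
        return text
    words = text[:-1].split()
    if not words:
        return text
    words[-1] = accent_tag + words[-1]
    return slope_tag + " ".join(words) + "."


tagged = tag_declarative("The bank is open.")
untouched = tag_declarative("Is the bank open?")
```

A fuller rule set would add branches for questions and other sentence functions, each emitting its own tags.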

Once the text body has been prepared with a set of tags, the tags are processed. First, a phrase curve is computed. The phrase curve represents a prosodic feature, such as pitch, computed over the span of a phrase. When text is processed with its accompanying tags, the phrase curve can conveniently be built one minor phrase at a time, where a minor phrase is a phrase, a subordinate clause, or a coordinate clause. A sentence typically contains one or more minor phrases. Boundaries are imposed to limit the ability of tags to affect preceding minor phrases, restricting that influence to a single minor phrase. Next, prosody is computed relative to the phrase curve. The prosodic features of individual words are computed, together with their effect on each phrase. This computation models, for example, the effect of accented words that appear within a phrase. After prosody has been computed relative to the phrase curve, a mapping is performed from linguistic attributes to observable acoustic features. The acoustic features are then applied to the speech generated by processing the text. Acoustic features can suitably be represented as a curve or a set of curves, each of which is a function of time with a particular value at a particular time. Because the speech is generated by machine, the time at which each speech component occurs is known. The prosodic feature appropriate to a particular speech component can therefore be expressed as the value of the curve at the time that component is known to occur. The speech components are supplied as input to the synthesizer, and the values of the observable acoustic features are supplied to the synthesizer as well to control the characteristics of the speech.
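Because each acoustic feature is a function of time and the timing of every speech component is known, the value handed to the synthesizer can be read off the curve at that time. The sketch below assumes the curve is stored as sorted (time, value) points and uses linear interpolation between them; both the representation and the function name `sample_curve` are assumptions for illustration.

```python
from bisect import bisect_right


def sample_curve(curve, t):
    """Linearly interpolate a curve given as sorted (time_s, value) points."""
    times = [point[0] for point in curve]
    if t <= times[0]:
        return curve[0][1]
    if t >= times[-1]:
        return curve[-1][1]
    i = bisect_right(times, t)           # first point strictly after t
    (t0, v0), (t1, v1) = curve[i - 1], curve[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical pitch curve in Hz and hypothetical speech-component start times.
pitch_curve = [(0.0, 150.0), (0.5, 170.0), (1.5, 130.0)]
component_times = [0.0, 0.25, 1.0]
pitch_at_components = [sample_curve(pitch_curve, t) for t in component_times]
```

Each entry of `pitch_at_components` is the pitch value that would accompany the corresponding speech component when it is passed to the synthesizer.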

A more complete understanding of the present invention, as well as further features and advantages of the invention, will be apparent from the following detailed description and the accompanying drawings.

The drawings, each according to the present invention, are as follows (figure numbers are not preserved in this text):
A diagram showing a process of text-to-speech processing.
A diagram showing an accent curve generated by processing tags.
Two graphs showing the effect of the <step> tag.
A graph showing the effect of the <slope> tag.
A graph showing the effect of the <phrase> tag.
Five diagrams showing the effects and interactions of <stress> tags.
A graph showing a compromise between tags.
A graph showing the effect of varying the strength of a tag.
A graph showing the effect of different values of the “pdroop” parameter used in the tags.
A graph showing the effect of different values of the “adroop” parameter used in the tags.
A graph showing the effect of different values of the “smooth” parameter used in the tags.
A graph showing the effect of different values of the “jittercut” parameter used in the tags.
A diagram showing the steps of a process of tag processing.
A graph showing an example of mapping linguistic coordinates to observable acoustic features.
A graph showing the effect of a nonlinear transformation performed in text-to-speech processing.
A graph showing the effect of different values of the “add” parameter used in the tags.
Eight graphs showing the modeling of exemplary data using the tags.
A diagram showing a process of creating and using tags.
A diagram showing an exemplary text-to-speech system.
A diagram showing a process of generating and using tags to define and generate motion.

The following description sets out techniques for specifying the prosodic features of speech according to the present invention. First, the overall process of text-to-speech processing is described. Next, a set of tags used to specify prosodic features is described. After the general structure and grammar of the tags, each category of tag, parameter, and value is described. The operation of several exemplary tags is then described, showing the effect of different parameters, the compromises between competing tags, and other representative properties of the tags. There follows a description of the processing of a text body containing tags, a description of how tags are created and used to generate speech having the prosodic features of a target speaker, and a description of a text-to-speech processing system according to the present invention.

FIG. 1 shows a process 100 of text-to-speech processing of a text body containing tags according to the present invention. In step 102, the text body is analyzed and the tags are extracted. In step 104, the tags are processed to determine the values of the acoustic features they define as functions of time, such as pitch and volume. In step 106, the text and the values determined for the acoustic features are converted into linguistic symbols to be supplied to a synthesizer. In step 108, the linguistic symbols are supplied as input to the synthesizer, which generates speech having the acoustic features defined by the tags.

Tags are placed within the text body, usually between words, to define the prosodic features desired in the speech generated by processing the text. Each tag imposes a set of constraints on the prosody. The <step> and <stress> tags include a “strength” parameter that defines their relationship to other tags. Tags frequently contain conflicting information, and the “strength” parameter determines how the conflicts are resolved. Further details of the “strength” parameter and its operation are given below.

The tags can suitably be defined in XML, the Extensible Markup Language. XML is a general-purpose format for structured documents on the World Wide Web and is described at www.w3.org/XML. It will be clear to those skilled in the art that the tags need not be implemented in XML syntax. Tags can be delimited by any arbitrary strings (rather than the “<” and “>” used in XML), and the internal structure of a tag need not follow the XML format; it can be any structure that allows the tag to be identified and the necessary attributes to be set. It will also be recognized that the tags need not be interleaved with the text in a single character stream. Tags and text can, for example, flow in two parallel data channels, provided there is a means of synchronizing each tag with its location in the corresponding text sequence.

Tags can also be used when no text is present and the input consists solely of a series of tags. Such input is appropriate, for example, when the tags are used to model muscle dynamics for a computer graphics application. As one example, tags could be used to control the fin movements of a simulated goldfish. In such a case there is no text from which the tags must be separated, and tag delimiters are needed only to separate one tag from the next.

Finally, it will be recognized that the tags need not be represented as a serial data stream; they can instead be represented as data structures in a computer's memory. For example, in a dialogue system in which a computer program generates the text and tags, it may be most efficient to pass a pointer or reference to a data structure that describes the text (if any), the tags, and the temporal relationships between the text and the tags. The data structure describing a tag then contains information equivalent to the XML description, possibly together with other information used, for example, for debugging, memory management, or other auxiliary purposes.
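An in-memory representation of this kind might look like the following. This is a minimal sketch: the class names `Tag` and `TaggedUtterance`, their fields, and the use of a word index to express the relationship between tags and text are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    """One tag: an action plus its attribute-value pairs."""
    action: str
    attributes: dict = field(default_factory=dict)


@dataclass
class TaggedUtterance:
    """Text plus tags, passed by reference instead of as a character stream."""
    words: list
    # Each entry is (word_index, Tag): the tag applies before words[word_index].
    tags: list = field(default_factory=list)


utterance = TaggedUtterance(
    words=["You", "are", "the", "weakest", "link"],
    tags=[(0, Tag("set", {"base": 150.0})),
          (2, Tag("slope", {"value": -0.8}))],
)
```

A dialogue system could build such an object directly and hand it to the tag processor, skipping the step of serializing and re-parsing markup.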

A set of tags according to the present invention is described below. In this description, literal strings are enclosed in quotation marks. As is standard in XML notation, “?” denotes an optional token, “*” denotes zero or more occurrences of a token, and “+” denotes one or more occurrences of a token. The grammar of a tag is expressed in the following format:
Tag = "<" tagname AttValue* "/>"
An exemplary tag is
<set base="200"/>
This tag sets the speaker's base frequency to 200 Hz. In this example, "<" marks the start of the tag; “set” is the action to be taken, namely setting the value of the specified attribute; “base” is the attribute whose value is to be set; “200” is the value to which the attribute “base” is to be set; and "/>" marks the end of the tag.
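The tag grammar can be exercised with a small parser. The sketch below assumes attribute values are written in straight double quotes, as in the `<set base="200"/>` example, and therefore does not handle unquoted forms; the function name `parse_tag` and the regular expressions are illustrative only.

```python
import re

# Tag = "<" tagname AttValue* "/>", with each AttValue written as name="value".
TAG_RE = re.compile(r'<\s*(\w+)((?:\s+\w+\s*=\s*"[^"]*")*)\s*/>')
ATT_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def parse_tag(text):
    """Split one tag into its action and a dict of attribute-value pairs."""
    match = TAG_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a tag: {text!r}")
    return match.group(1), dict(ATT_RE.findall(match.group(2)))


action, attributes = parse_tag('<set base="200"/>')
```

For the example tag, `action` is the tag name and `attributes` maps the attribute name to its value as a string; converting values to numbers is left to the code that interprets each tag.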

Each tag has two parts. The first part is an action, and the second part is a set of attribute-value pairs that control the details of the tag's operation. Most tags are self-contained “point” tags. To allow precision in defining when a tag takes effect, a tag may include a “move” attribute. This attribute allows the tag to be placed at the beginning of a word while deferring its effect to a position inside the word. The use and operation of the “move” attribute are described in more detail below.

Tags fall into one of four categories: (1) tags that set parameters, (2) tags that define the phrase curve or points from which the phrase curve is built, (3) tags that define word accents, and (4) tags that mark boundaries.

Parameters are set by the <set> tag, which has the grammar <set Att=value>, where “Att” is the attribute the tag controls and “value” is the numerical value of that attribute. The <set> tag accepts the following attributes:
max=value. This attribute sets the maximum value allowed, for example the maximum frequency, in hertz, to be generated when pitch is being controlled.
min=value. This attribute sets the minimum value allowed, for example the minimum frequency, in hertz, to be generated when pitch is being controlled.
smooth=value. This controls the response time of the mechanical system being simulated. When pitch is being controlled, this parameter sets the smoothing time of the pitch curve, in seconds, and thereby the width of a pitch step.
base=value. This sets the speaker's baseline, that is, the frequency in the absence of any tags.
range=value. This sets the speaker's pitch range in Hz.
pdroop=value. This sets the droop of the phrase curve toward the base frequency, expressed as an amount of droop per second.
adroop=value. This sets the rate at which the pitch trajectory droops toward the phrase curve, expressed as a droop rate per second.
add=value. This sets the nonlinearity in the mapping between the pitch trajectory over the span of a phrase and the pitch trajectory of the individual words that have a local influence on the phrase. If the value of “add” is equal to 1, a linear mapping is performed; that is, an accent has the same effect on pitch whether it lies in a high-pitch region or a low-pitch region. If the value of “add” is equal to 0, the effect of an accent is logarithmic, and a small accent changes the frequency by a larger amount when it rides on a high phrase curve. If the value of “add” is greater than 1, a slower-than-linear mapping is performed.
jitter=value. This sets the root mean square (RMS) magnitude of the pitch jitter, as a fraction of the speaker's range. Jitter is the amount of random pitch variation introduced to make the processed speech sound more natural.
jittercut=value. This sets the time scale of the pitch jitter, in seconds. Pitch jitter is correlated (1/f) noise over intervals shorter than “jittercut” and uncorrelated (white) noise over intervals longer than “jittercut”. A large value of “jittercut” yields longer, smoother pitch variations, while a small value yields short, irregular pitch changes.

Arguments supplied to the <set> tag are retained for each voice, even across phrase boundaries, until text-to-speech processing is complete.
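The persistence rule for <set> arguments can be pictured with a small sketch. The Python below is illustrative only: the `VoiceSettings` class, the regular expression, and the choice of a dictionary for storage are assumptions made for this example and are not part of the described system.

```python
import re

# Hypothetical pattern for tags such as <set smooth=0.08/> or <set base = 100>.
_SET_TAG = re.compile(r"<set\s+(\w+)\s*=\s*([-+]?(?:\d+\.?\d*|\.\d+))\s*/?>")


class VoiceSettings:
    """Holds the <set> attributes of one voice (illustrative container)."""

    def __init__(self):
        self.values = {}

    def apply(self, tag_text):
        """Record every <set Att=value> found in tag_text."""
        for name, value in _SET_TAG.findall(tag_text):
            self.values[name] = float(value)

    def phrase_boundary(self):
        """A phrase boundary leaves the settings untouched: they persist."""
        return self.values


settings = VoiceSettings()
settings.apply("<set smooth=0.08/> <set base = 100>")
settings.phrase_boundary()           # nothing is reset here
settings.apply("<set smooth=0.1/>")  # a later <set> overrides the earlier value
print(settings.values)
```

Only a later <set> for the same attribute changes a stored value; crossing a phrase boundary does not.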

The <step> tag takes several arguments and operates on the phrase curve. It has the form <step by=value | to=value | strength=value>. The attributes of the <step> tag are as follows.
by=value. Defines the size of the step as a fraction of the speaker's range. Steps in the phrase curve are smoothed over the "smooth" time, which is defined above.
to=value. The frequency that the step approaches, expressed as a fraction of the speaker's range.
strength=value. Controls how a particular <step> tag interacts with its neighboring tags. If the strength value is high, the tag dominates its neighbors; if the strength value is low, the neighbors dominate the tag.

The <slope> tag takes one argument and operates on the phrase curve. It has the form <slope rate=value "%"?>. The tag sets the rate at which the phrase curve rises or falls, expressed as a fraction of the speaker's range per second. If the "%" symbol is present, the value instead expresses the rise or fall as a fraction of the range per unit length of the minor phrase.

The <stress> tag defines prosody relative to the phrase curve. Each <stress> tag defines a preferred shape and a preferred height relative to the phrase curve. However, <stress> tags often define conflicting properties. When the <stress> tags are processed, the preferred shapes and heights they define are modified so that these properties compromise with one another and so that the requirement of a smooth pitch curve is met. The <stress> tag has the form <stress shape=(point ".")* point | strength=value | type=value>.

The "shape" parameter specifies, as a set of points, the ideal shape of the accent curve in the absence of any compromise with other stress tags or constraints.

The "strength" parameter defines the linguistic strength of the accent. An accent of zero strength has no effect on pitch. An accent with a strength much greater than 1 is followed exactly, provided no neighboring tag has comparable or greater strength. If a neighboring tag does have comparable or greater strength, the accent either compromises with the neighbor or is dominated by it, depending on the neighbor's strength. An accent with a strength of approximately 1 yields a pitch curve that is a smoothed version of the accent.

The "type" parameter controls whether the accent is defined by the mean value of its pitch curve or by its shape. The value of "type" matters when the accent must compromise with neighboring tags. If the accent is much stronger than its neighbors, both the shape and the mean pitch are preserved.

When a compromise is necessary, however, "type" determines which property is compromised. If "type" is 0, the accent keeps its shape at the expense of its mean pitch. If "type" is 1, the accent keeps its mean pitch at the expense of its shape. If "type" lies between 0 and 1, the accent compromises between shape and mean pitch, to a degree determined by the actual value of "type".
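One way to picture the role of "type" is as a weight that splits an accent's penalty into a mean-pitch part and a shape part. The sketch below is a toy model written for illustration: the function `accent_cost`, the quadratic penalty, and the linear use of `strength` are assumptions, not the equations of the described system.

```python
def accent_cost(pitch, target, strength, type_):
    """Toy penalty for a pitch segment that deviates from an accent's ideal.

    pitch, target: equal-length lists of samples (fractions of speaker range).
    type_ = 0 penalizes only shape errors; type_ = 1 only mean-pitch errors.
    """
    n = len(pitch)
    diffs = [p - t for p, t in zip(pitch, target)]
    mean_diff = sum(diffs) / n
    # Error in the average pitch of the segment.
    mean_cost = n * mean_diff ** 2
    # Error that remains after the average offset is removed (the shape).
    shape_cost = sum((d - mean_diff) ** 2 for d in diffs)
    return strength * (type_ * mean_cost + (1.0 - type_) * shape_cost)


target = [0.0, 0.75, 0.0]
shifted = [1.0, 1.75, 1.0]      # same shape, different mean pitch
flattened = [0.25, 0.25, 0.25]  # same mean pitch, different shape
print(accent_cost(shifted, target, 4, 0.0))    # shape kept: no penalty
print(accent_cost(flattened, target, 4, 1.0))  # mean kept: no penalty
```

Under this model a type-0 accent is indifferent to a uniform pitch shift, and a type-1 accent is indifferent to any distortion that preserves its mean pitch, which matches the qualitative behavior described above.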

The "point" parameter within the "shape" argument of the <stress> tag follows this syntax:
point = float (X"s" | X"p" | X"y" | X"w") value. A point on the accent curve is specified as a (time, frequency) pair, with the frequency expressed as a fraction of the speaker's range. X is measured in seconds (s), phonemes (p), syllables (y), or words (w). Because the accent curve is preferably constrained to be smooth, it need not be specified in great detail.

FIG. 2 is a graph 200 showing an exemplary accent curve 202 described by the stress tag <stress strength=10 type=0.5 shape=0.3s0, 0.15s0.3, 0s0.5, 0.15s0, 0.25s0/>. Processing the tag produces points 204 to 214 and a curve 202 fitted to those points. The fit of curve 202 to points 204 to 214 is preferably designed to produce a smooth curve that reflects the natural sound of human speech.
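As an illustration of the point syntax, the following sketch splits a "shape" value into (time, unit, frequency) triples. The function name `parse_shape`, the regular expression, and the use of a comma as the separator (taken from the example tags rather than from the grammar) are assumptions of this example.

```python
import re

_NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)"
# A point is: time, a unit letter (s, p, y or w), then the frequency value.
_POINT = re.compile(rf"^({_NUM})([spyw])({_NUM})$")


def parse_shape(shape):
    """Parse e.g. '0.15s0.3, 0s0.5' into [(0.15, 's', 0.3), (0.0, 's', 0.5)]."""
    points = []
    for item in shape.split(","):
        match = _POINT.match(item.strip())
        if match is None:
            raise ValueError(f"cannot parse point: {item!r}")
        time, unit, value = match.groups()
        points.append((float(time), unit, float(value)))
    return points


# The shape string of the FIG. 2 tag, exactly as printed.
points = parse_shape("0.3s0, 0.15s0.3, 0s0.5, 0.15s0, 0.25s0")
print(points)
```

A smooth curve would then be fitted through the parsed points; the fitting step is outside the scope of this sketch.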

In addition to the tags described above, a <phrase> tag is implemented that inserts a phrase boundary. The <phrase> tag is normally used to mark a minor phrase or a breath group. No pre-planning takes place across a <phrase> tag: the prosody defined before a <phrase> tag is entirely independent of any tags that occur after it.

As noted above, any tag may include a "move" attribute. The "move" attribute instructs the tag to defer its action until the point that the attribute specifies. The "move" attribute follows this syntax:
AttValue = position | other_attributes
where position = "move" "=" move_value,
move_value = "ell"? motion*, and
motion = (float | "b" | "c" | "e") ("r" | "w" | "y" | "p" | "s") | "*" | "?"

Motions are evaluated from left to right. The position is modeled as a cursor that starts at the tag, unless the move_value begins with "ell"; in that case, the last cursor position of the preceding tag is used as the starting point. Tags are normally placed within words, and the "move" attribute is used to position accents within a word. Motions can be specified in minor phrases (r), words (w), syllables (y), phonemes (p), or accents (*). Specifying motions in minor phrases and words is useful when the tags are collected at the beginning of a phrase. The rules for interpreting motions are as follows. A motion specified in minor phrases skips over any pauses between phrases. A motion specified in words skips over any pauses between words. A motion specified in syllables treats a pause as one syllable. A motion specified in phonemes treats a pause as one phoneme. When "b", "c", or "e" is used as a motion, the pointer moves to the beginning, center, or end, respectively, of the nearest phrase, word, syllable, or phoneme. A move specified in seconds moves the pointer by that number of seconds. The motion "*" (stressed) moves the pointer to the center of the next stressed syllable. A question mark (?) does not move the pointer; it restricts the motions that follow it so that they do not cross a word boundary. If an argument violates that constraint, either a warning message is issued or the offending tag is ignored.

An example of a tag that contains a "move" attribute is the following:
<step move=*0.5p by=1/>
This tag places a step in the pitch curve, with the steepest part of the step located 0.5 phonemes after the center of the first stressed syllable that follows the tag. The "move" attribute thus causes the tag to take effect at the desired point rather than at the location of the tag itself.
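To make the left-to-right evaluation concrete, the sketch below tokenizes the motion sequence of a move value. It is a hedged illustration: `parse_motions` and its tuple output format are invented for this example, the optional prefix of move_value is deliberately not handled, and the cursor movement itself (which needs phoneme and syllable timing) is not modeled.

```python
import re

# One motion: "*" or "?", or an amount (a float, or b/c/e) followed by a unit.
_MOTION = re.compile(r"\*|\?|([-+]?(?:\d+\.?\d*|\.\d+)|[bce])([rwyps])")


def parse_motions(move_value):
    """Split a motion string such as '*0.5p' into a list of (amount, unit)."""
    motions = []
    pos = 0
    while pos < len(move_value):
        match = _MOTION.match(move_value, pos)
        if match is None:
            raise ValueError(f"bad motion at offset {pos} in {move_value!r}")
        token = match.group(0)
        if token in ("*", "?"):
            motions.append((token, None))
        else:
            amount, unit = match.group(1), match.group(2)
            if amount not in ("b", "c", "e"):
                amount = float(amount)
            motions.append((amount, unit))
        pos = match.end()
    return motions


print(parse_motions("*0.5p"))
```

For the example tag, the result is a jump to the next stressed syllable followed by a move of 0.5 phonemes, evaluated in that order.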

FIGS. 3A to 3I illustrate the effects of various tags. FIG. 3A is a graph 300 showing curves 302 to 306, which result respectively from processing a single <step to> tag that sets one frequency, two <step to> tags that each set the same frequency, and two <step to> tags that set different frequencies. Curve 302 results from the tag <step strength=10 to=0.5/>. Curve 304 results from a first tag <step strength=10 to=0.5/>, followed by intervening text, followed by a second tag <step strength=10 to=0.5/>. Curve 306 results from a first tag <step strength=10 to=0.5/>, followed by intervening text, followed by a second tag <step strength=10 to=0/>.

A <step by> tag simply inserts a step into the pitch curve. The tag <step by=X/> directs that the pitch after the tag be X Hz higher than the pitch before it. The tag changes the pitch but does not force the pitch on either side of the tag to take any particular value, so <step by> tags tend not to conflict with other tags. For example, if a <step to=100/> tag is followed by a <step by=-50/> tag, the frequency before the <step by=-50/> tag is 100 Hz and the frequency after it is 50 Hz.
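The difference between absolute and relative steps can be sketched in a few lines. This is a simplified illustration that ignores smoothing, strength, and any compromise between tags; `apply_steps` and its input format are hypothetical.

```python
def apply_steps(tags, start=None):
    """Return the unsmoothed pitch level that holds after each tag.

    tags: list of ("to", value) or ("by", value) pairs in text order.
    """
    level = start
    levels = []
    for kind, value in tags:
        if kind == "to":
            level = value            # absolute target
        elif kind == "by":
            if level is None:
                raise ValueError("a relative step needs an existing level")
            level = level + value    # relative change
        else:
            raise ValueError(f"unknown step kind: {kind!r}")
        levels.append(level)
    return levels


print(apply_steps([("to", 100), ("by", -50)]))
```

Because a relative step only constrains the difference across the tag, it never competes with an earlier absolute target in this model.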

FIG. 3B is a graph 310 showing curves 312 and 314. Curve 312 results from the tag sequence <step to=0.1 strength=10/>... <step by=0.3 strength=10/>. Curve 314 results from the tag sequence <step to=0.1 strength=10/>... <step by=0.3 strength=10/>... <step by=0.3 strength=10/>. No compromise is needed in this example, because the constraints on the pitch curve do not conflict.

The <slope> tag also relates to the phrase curve. Depending on its argument, the <slope> tag makes the phrase curve slope up or down to the left of the tag, that is, on the side that precedes the tag in time. A slope tag replaces the current slope value. For example, the tag sequence <slope rate=1/>... <slope rate=0/> results in a slope of zero: the tag <slope rate=0/> replaces the slope set by the tag <slope rate=1/> and by any earlier tags.

FIG. 3C is a graph 320 containing curves 322 to 328. Curve 322 results from the tag <slope rate=0.8/>. Curve 324 results from the tag sequence <slope rate=0.8/>... <step by=0.1 strength=10>. Curve 326 results from the tag ... <slope rate=0.8>. Curve 328 results from the tag sequence <slope rate=0.8/>... <set slope=0.1/>. Curves 322 to 328 respectively represent a slope that starts at the phrase boundary, a slope delayed by 0.25 seconds, a slope on which a small step is placed, and a slope that rises and then falls. No compromise is needed, because a <slope> tag with a new value replaces any value imposed by a preceding <slope> tag.

FIG. 3D illustrates the effect of the <phrase> tag. Graph 330 shows a curve 332 representing a flat tone. Curve 332 is followed by a phrase boundary 334, which in turn is followed by curves 336 to 339 showing tones of various amplitudes. Graph 330 illustrates the effect of the tag sequence <stress strength=4 type=0.8 shape=0.1s0.3, 0.1s0.3/>... <phrase/>... <stress strength=4 type=0.1 shape=various/>. The <phrase> tag prevents the falling tone after 0.42 seconds from having any effect on the flat tone before 0.42 seconds.

The <phrase> tag marks a boundary at which pre-planning stops, and is preferably placed at a minor phrase boundary. A minor phrase is usually a phrase, or a subordinate or coordinate clause, that is smaller in scope than a full sentence. Typical human speech is characterized by planning or preparing prosody several syllables before it is produced. Preparation allows a speaker, for example, to compromise smoothly among difficult combinations of tones and to avoid going above or below a comfortable pitch range. The system for placing and processing tags according to the present invention can model this aspect of human speech production, and uses the <phrase> tag to control the scope of the preparation. That is, the placement of <phrase> tags controls the number of syllables over which compromise or other preparation takes place. The <phrase> tag acts as a one-way limiting element: tags before a <phrase> tag can affect what follows it, but tags after a <phrase> tag cannot affect what precedes it.
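The scoping role of <phrase> can be sketched as a partition of the tag stream into segments that are planned separately. The sketch below is illustrative only: `planning_segments` and the token format are assumptions, and the forward influence of earlier tags on later segments is not modeled.

```python
def planning_segments(tokens):
    """Split a token stream at <phrase/> tags into separately planned segments."""
    segments = [[]]
    for token in tokens:
        if token == "<phrase/>":
            segments.append([])      # pre-planning stops at the boundary
        else:
            segments[-1].append(token)
    return segments


stream = ["<stress A/>", "flat", "<phrase/>", "<stress B/>", "falling"]
print(planning_segments(stream))
```

Under this view, a compromise between competing tags is only ever computed among tags that fall within the same segment.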

FIGS. 3E to 3I illustrate the effect of the <stress> tag. The <stress> tag makes it possible to place an accent on a word or syllable. A <stress> tag always includes at least the following three elements. The first is the ideal "Platonic" shape of the accent, which is the shape the accent would have if it had no neighboring accents and were spoken very slowly. The second is the accent type. The third is the accent strength. Strong accents tend to keep their shape, while weak accents tend to be dominated by their neighbors.

The act of speaking is a compromise among these tendencies, and any system that seeks to model speech under these circumstances must also have a way to compromise among them. The "strength" argument of the <stress> tag controls the interaction between tags that represent conflicting requirements. FIG. 3E is a graph 340 showing the interaction of a flat tone of type 0.8 followed by a purely falling tone of type 0. The flat tone has type 0.8, a value close to 1, so it tends to keep its mean pitch at the expense of its shape. The falling tone has type 0, so it keeps its shape at the expense of its mean pitch. Curves 342A to 342G show the effect of the tag sequence <stress strength=4 type=0.8 shape=-0.1sY, 0.1sY/>... <stress strength=4 type=0 shape=-0.2s.03, -.1s.03, 0s0, 0.1s-0.1, 0.2s-0.1/>, where Y varies from -0.1 to 0.5 in steps of 0.1.

FIG. 3F is a graph 350 showing the interaction of a flat tone of type 0.8 followed by a falling tone of type 0.1. The flat tone has type 0.8, a value close to 1, so it tends to keep its mean pitch at the expense of its shape. The falling tone has type 0.1, so it shows a slight tendency to compromise its shape in order to maintain its pitch. Curves 352A to 352G show the effect of the tag sequence <stress strength=4 type=0.8 shape=-0.1sY, 0.1sY/>... <stress strength=4 type=0.1 shape=-0.2s.03, -.1s.03, 0s0, 0.1s-0.1, 0.2s-0.1/>, where Y varies from -0.1 to 0.5 in steps of 0.1. Because the falling tone shows a slight preference for pitch, curves 352A to 352G converge slightly in the region of the falling tone.

FIG. 3G is a graph 360 showing the interaction of a flat tone of type 0.8 followed by a falling tone of type 0.5. The flat tone has type 0.8, a value close to 1, so it tends to keep its mean pitch at the expense of its shape. The falling tone here has type 0.5, so it shows a strong tendency to maintain its pitch, which results in a compromise between pitch and shape. Curves 362A to 362G show the effect of the tag sequence <stress strength=4 type=0.8 shape=-0.1sY, 0.1sY/>... <stress strength=4 type=0.5 shape=-0.2s.03, -.1s.03, 0s0, 0.1s-0.1, 0.2s-0.1/>, where Y varies from -0.1 to 0.5 in steps of 0.1. Curves 362A to 362G still keep their shape, but they are squeezed strongly together in order to maintain pitch.

FIG. 3H is a graph 370 showing the interaction of a flat tone of type 0.8 followed by a falling tone of type 0.8. The flat tone has type 0.8, a value close to 1, so it tends to keep its mean pitch at the expense of its shape. The falling tone here also has type 0.8, so it shows a very strong tendency to maintain its pitch and only a weak tendency to maintain its shape. Curves 372A to 372G show the effect of the tag sequence <stress strength=4 type=0.8 shape=-0.1sY, 0.1sY/>... <stress strength=4 type=0.8 shape=-0.2s.03, -.1s.03, 0s0, 0.1s-0.1, 0.2s-0.1/>, where Y varies from -0.1 to 0.5 in steps of 0.1. In curves 372A to 372G, the shape preference can still force the pitch to drop near its midpoint, but the tendency to preserve shape is considerably reduced. When the first, flat tone has a low pitch, the pitch curve tends strongly to rise between the two tones so as to keep the correct pitch at the center of the second accent.

FIG. 3I is a graph 380 showing the interaction of a flat tone of type 0.8 followed by a falling tone of type 1. The flat tone has type 0.8, a value close to 1, so it tends to keep its mean pitch at the expense of its shape. The falling tone here has type 1, so it keeps its pitch and compromises its shape as far as necessary to maintain that pitch exactly. Curves 382A to 382G show the effect of the tag sequence <stress strength=4 type=0.8 shape=-0.1sY, 0.1sY/>... <stress strength=4 type=1 shape=-0.2s.03, -.1s.03, 0s0, 0.1s-0.1, 0.2s-0.1/>, where Y varies from -0.1 to 0.5 in steps of 0.1. Curves 382A to 382G show that the falling tone is now defined entirely by its pitch.
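The tag families behind FIGS. 3E to 3I differ only in the type of the falling tone and in the height Y of the flat tone, so they can be enumerated mechanically. The helper below is a sketch for illustration: `figure_tags` is a hypothetical name, and the first point of the falling shape is read as -0.2s.03, which is an assumption about the printed tag.

```python
def figure_tags(falling_type):
    """Return (flat_tag, falling_tag) pairs for Y = -0.1, 0.0, ..., 0.5."""
    pairs = []
    for i in range(-1, 6):
        y = i / 10  # build Y from integers to avoid accumulating float error
        flat = f"<stress strength=4 type=0.8 shape=-0.1s{y:g}, 0.1s{y:g}/>"
        falling = (
            f"<stress strength=4 type={falling_type:g} "
            "shape=-0.2s.03, -.1s.03, 0s0, 0.1s-0.1, 0.2s-0.1/>"
        )
        pairs.append((flat, falling))
    return pairs


# Falling-tone types used in FIGS. 3E, 3F, 3G, 3H and 3I respectively.
for falling_type in (0, 0.1, 0.5, 0.8, 1):
    print(figure_tags(falling_type)[0])
```

Each call yields seven tag pairs, one per curve A to G of the corresponding figure.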

Another example of compromise between tags can be seen when accents move close together. When two accents overlap, the result is lower than the sum of the two accents. Instead, a single accent forms that has the same size and shape as either individual accent but twice the strength.

FIG. 4 is a graph 400 showing the result of a fixed accent curve 402, with its peak at 0.83 s, and accent curves 404A to 404E that move progressively toward curve 402 until curve 404F overlaps curve 402. Curve 402 and curves 404A to 404E result from processing the tag sequence <stress strength=4 shape=-.15s0, -.1s0, -.05s.1, 0s.3, .05s.1, .1s0, .15s0 type=0.5/>... <stress strength=4 shape=-.15s0, -.1s0, -.05s.1, 0s.3, .05s.1, .1s0, .15s0 type=0.5/>. Curve 404F is the result of combining the curves represented by curve 402 and curve 404E. The peak of curve 404F is lower than the sum of the peaks of curve 402 and curve 404E.
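Why overlapping accents do not simply add can be seen in a toy weighted least-squares model at a single time point. This is an illustrative assumption rather than the described system's equations: `compromise`, the quadratic penalties, and the `rest_weight` term that pulls the pitch toward the phrase curve are all invented for the example.

```python
def compromise(targets, strengths, rest_weight=1.0):
    """Pitch offset p minimizing rest_weight*p**2 + sum(s_i*(p - t_i)**2)."""
    numerator = sum(s * t for s, t in zip(strengths, targets))
    denominator = rest_weight + sum(strengths)
    return numerator / denominator


single = compromise([0.3], [4])               # one accent peaking at 0.3
overlapped = compromise([0.3, 0.3], [4, 4])   # two coincident accents
doubled = compromise([0.3], [8])              # one accent, twice the strength
print(single, overlapped, doubled)
```

In this model two coincident accents behave exactly like one accent of double strength, and the combined peak stays well below twice the single peak, which mirrors the behavior reported for curve 404F.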

Every accent tag includes a "strength" parameter. The "strength" parameter of a tag governs how the accent defined by the tag interacts with neighboring accents. In general, strong accents, that is, accents defined by tags with a relatively high strength parameter, tend to keep their shape, while weak accents, with a relatively low strength parameter, tend to be dominated by their neighbors.

FIG. 5 is a graph 500 showing the interaction among a falling tone, a preceding strong high tone, and a following weak high tone, as the strength of the falling tone is varied. Curves 502 to 512 represent tone sequences in which the strength of the falling tone increases from 0 to 5 in steps of 1. They are generated by processing the tag sequence <stress strength=4 type=0.3 shape=-0.1s0.3, 0.1s0.3/>... <stress strength=X type=0.5 shape=-.15s.2, -.1s.2, 0s0, .1s-.2, .15s-.2/>... <stress strength=2.5 type=0.3, shape=-0.1s0.3, 0.1s0.3/>, where X varies from 0 to 5 in steps of 1. Curve 514 shows the falling tone following the strong flat tone, with no weak flat tone after it. The falling tone of strength 0, shown by curve 502, is completely dominated by its neighboring tags. Curves 504 to 512 show how the falling tone increasingly keeps its shape as its strength grows, disturbing the neighboring tags more and more. The shape of the falling tone in curve 512 is nearly the same as in curve 514, showing how a sufficiently strong falling tone dominates the weak flat tone that follows it.

Another factor that affects the phrase curve is droop, a systematic fall in pitch that often occurs over the course of a phrase. This factor is represented by the parameter pdroop, which sets the rate at which the phrase curve decays toward the speaker's base frequency. Points near a <step to> tag are relatively unaffected, especially when the tag has a high strength parameter. This is because the decay defined by the pdroop parameter acts over time, so relatively little decay occurs close to the point where the frequency is set. Points farther from the <step to> tag are affected more strongly.

The value of pdroop sets an exponential decay rate for the phrase curve, so a step decays over 1/pdroop seconds. A speaker's pitch trajectory is normally planned in advance; that is, continuous or intermittent adjustments are made in order to achieve a smooth pitch trajectory. To model this pre-planning, the pdroop parameter is able to cause decay in the phrase curve regardless of whether pdroop is set before or after the <step to> tag.

For example, FIG. 6 shows a graph 600 representing the occurrence of a positive <step to> tag 601 at the beginning of a phrase. The tag is <step to=0.5 strength=3 set pdroop=X/>, where X takes the values 0, 0.5, 1, and 2, producing phrase curves 602 to 608, respectively. Where the pdroop parameter used in the tags that define curves 604 to 608 is nonzero, the curves fall toward the base frequency of 100 Hz, at a droop rate that increases with the value of pdroop.
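The exponential droop can be written out directly. The sketch below is a simplified reading of the description: the function `phrase_curve`, the 100 Hz speaker range, and the interpretation of "to" as an offset above the base frequency are assumptions, and only the decay after the tag is modeled.

```python
import math


def phrase_curve(t, step_to, pdroop, base=100.0, speaker_range=100.0):
    """Phrase curve in Hz, t seconds after a <step to> tag placed at t = 0.

    base=100.0 matches the 100 Hz base frequency of FIG. 6;
    speaker_range is a hypothetical value chosen for illustration.
    """
    initial_offset = step_to * speaker_range
    # The offset decays exponentially with time constant 1/pdroop seconds.
    return base + initial_offset * math.exp(-pdroop * t)


for pdroop in (0, 0.5, 1, 2):  # the four values of X in FIG. 6
    print(pdroop, [round(phrase_curve(t, 0.5, pdroop), 1) for t in (0, 0.5, 1, 2)])
```

With pdroop equal to 0 the curve stays at its initial level, and with larger values it returns to the base frequency more quickly, as in curves 602 to 608.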

A parameter similar to pdroop is adroop. The adroop parameter pulls the pitch trajectory back toward the phrase curve, and can therefore limit the amount of pre-planning that is assumed when tags are processed. An accent that is more than 1/adroop seconds away from a given point has little effect on the local pitch trajectory around that point.

FIG. 7 is a graph 700 showing curves 702 to 708, which are generated by processing the tag sequence <set adroop=X/>... <set smooth=0.08/>... <step to=0 strength=3/>... <stress shape=-.1s0, -.05s0, .05s.3, .1s.3 strength=3 type=.5/>, where X takes the values 0, 1, 3, and 10, respectively. Here the pitch curve is a constant 100 Hz, and the adroop parameter makes curves 702 to 708 decay toward the pitch curve as the distance from the accent increases. The decay rate grows as the value of adroop increases.

FIG. 8 is a graph 800 showing curves 802 to 808, which represent accents with different smoothing times. Curves 802 to 808 are generated by processing the tag sequence <set smooth=X/>... <stress strength=4 shape=-.15s0, -.1s0, -.05s.1, 0s.3, -15s0, .1s0, -05s.1/>, where X takes the values 0.004, 0.10, 0.14, and 0.2, respectively. The "smooth" parameter is preferably set to the time a speaker normally takes to change pitch, for example when deliberately changing pitch in the middle of a prolonged vowel. Curve 808, with a "smooth" value of 0.2, is substantially over-smoothed relative to the shape of the accent.

FIG. 9 is a graph 900 illustrating the effect of the "jittercut" parameter. The "jittercut" parameter is used to introduce random variation into a phrase in order to produce more realistic speech. A human speaker never utters the same phrase or sentence in exactly the same way twice. The "jittercut" parameter makes it possible to introduce some of that variability.

Graph 900 shows curves 902 to 906, for which "jittercut" is set to 0.1, 0.3, and 1, respectively. The "jittercut" value used for curve 902 is on the scale of roughly the average word length and therefore produces considerable variation within a word. The value used for curve 906 is on the scale of a phrase; it produces variation across the phrase but little variation within a word.

FIG. 10 shows a process 1000 for processing tags and determining the values of the acoustic features that the tags define. Process 1000 may be employed as step 104 of process 100 of FIG. 1. Process 1000 proceeds by constructing one or more linear equations for the pitch at each time point and then solving the resulting set of equations. Each tag represents a constraint on the prosody, and processing each tag adds further equations to the set.

In steps 1002 to 1008, the step and slope tags are processed to create a set of constraints on the phrase curve, each expressed as a linear equation defined by a tag.

In step 1002, a linear equation is generated for each <step by> tag. Each equation has the form $p_{t+w} - p_{t-w} = \mathrm{stepsize}_t$, where $w = 1 + [\mathrm{smooth}/2\Delta t]$ is half the smoothing width and $t$ is the position of the tag. Each <step to> tag adds one equation of the form $p_t = \mathrm{target}$, where target is the value of the "to" argument.

In step 1004, a set of constraint equations is generated for each <slope> tag. One equation is added for each time t. Each equation has the form $p_{t+1} - p_t = \mathrm{slope}_t \cdot \Delta t$, where $p_t$ is the phrase curve, $\mathrm{slope}_t$ is the "rate" attribute of the preceding <slope> tag, and $\Delta t$ is the prosody computation interval, typically 10 ms.

The equations generated from <slope> tags relate each point to its neighbor. Solving them yields a continuous phrase curve, that is, a phrase curve without abrupt steps or jumps. Such a continuous phrase curve reflects actual human speech patterns, in which the rate of change is continuous because the vocal fold muscles do not respond instantaneously.

In step 1006, one equation is added at each point where "pdroop" is nonzero. Each such equation tends to pull the phrase curve toward zero. Each droop equation has strength $s^{[\mathrm{droop}]} = \mathrm{pdroop} \cdot \Delta t$. Each equation individually has a small effect, but the effects accumulate and eventually bring the phrase curve to zero.

In steps 1008 to 1012, the equations are solved. In total there are m + n equations in n unknowns. The value of m is the number of step tags plus (n − 1). All values of $p_t$ are unknowns. Because there are more equations than unknowns, the system is overdetermined, so it is necessary to find a single solution that best satisfies all of the equations together. Those familiar with solving systems of equations will recognize this as a "weighted least squares" problem, for which standard algorithms exist.

In step 1008, the preferred implementation expresses the equations in matrix form as $s \cdot a \cdot p = s \cdot b$. Here s is the m × m diagonal matrix of strengths, a (m × n) contains the coefficients of $p_t$ in the equations, and b (m × 1) contains the right-hand sides (constants) of the equations. p is an n × 1 column vector. Next, in step 1010, the equations are transformed into normal form, $a^t \cdot s^2 \cdot a \cdot p = a^t \cdot s^2 \cdot b$. The reason is that the left-hand side then contains a band-diagonal matrix ($a^t \cdot s^2 \cdot a$) with a narrow bandwidth. The bandwidth is no greater than w, which is normally much smaller than n or m. A narrow bandwidth matters because the cost of solving the equations scales as $w^2 n$ for a band-diagonal matrix, rather than as $n^3$ in the general case. In the present invention, this scaling reduces the computational cost by a factor of 1000 and guarantees that the number of CPU cycles needed to process each second of speech is constant. Finally, in step 1012, the equations are solved using matrix analysis. Those skilled in the art will recognize that steps 1008 to 1012 may be replaced by other algorithms that yield equivalent results.
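The normal form used in step 1010 follows from reading the weighted constraint set as a least-squares problem. A brief derivation, using the symbols a, s, b, and p defined above (the scalar E is introduced here only for illustration and is not a symbol from the original text):

```latex
\begin{aligned}
E(p) &= \lVert s\,(a\,p - b) \rVert^{2} = (a\,p - b)^{t}\, s^{2}\, (a\,p - b), \\
\frac{\partial E}{\partial p} &= 2\, a^{t} s^{2}\, (a\,p - b) = 0
\quad\Longrightarrow\quad a^{t} s^{2} a\, p = a^{t} s^{2} b .
\end{aligned}
```

Because s is diagonal, $s^t s = s^2$, which is why only $s^2$ appears in the result.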

As an example, assume a sampling interval dt = 0.01 s, smooth = 0.04 s, pdroop = 1, and the following tags:
<slope rate=1 pos=0s/>
<step to=0.3 strength=2 pos=0s/>
<step by=0.5 pos=0.04 strength=0.7/>
This yields the following set of equations, where "#" and the text that follows it on each line form a comment and are not part of the equation.
1: p0 = 0.3; s1 = 2 # step to
2: p6 - p2 = 0.5; s2 = 0.7 # step by
3: p1 - p0 = 0.01; s3 = 1 # slope
4: p2 - p1 = 0.01; s4 = 1 # slope
5: p3 - p2 = 0.01; s5 = 1 # slope
6: p4 - p3 = 0.01; s6 = 1 # slope
...
11: p0 = 0; s11 = 0.01 # pdroop
12: p1 = 0; s12 = 0.01 # pdroop
13: p2 = 0; s13 = 0.01 # pdroop
...
The matrix "a" is as follows:
 1  0  0  0  0  0  0  0  0
 0  0 -1  0  0  0  1  0  0
-1  1  0  0  0  0  0  0  0
 0 -1  1  0  0  0  0  0  0
 0  0 -1  1  0  0  0  0  0
...
 1  0  0  0  0  0  0  0  0
 0  1  0  0  0  0  0  0  0
 0  0  1  0  0  0  0  0  0
...
Here, each row corresponds to the left-hand side of one of the equations above, and each column corresponds to a time value.

The right-hand sides of the equations above give the "b" matrix. Each row of "b" corresponds to the right-hand side of one equation:
0.3
0.5
0.01
0.01
0.01
0.01
...
0
0
0
...
The diagonal elements of the strength matrix, $s_{i,i}$, are as follows:
[2 0.7 1 1 1 1 ... 0.01 0.01 0.01 ...]
Here, each entry corresponds to one equation.
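The worked example can be reproduced numerically. The following is a minimal sketch, not the patent's implementation: it assumes nine time points p0 to p8 (matching the nine columns of "a"), one slope equation per adjacent pair, and one droop equation per point, and it uses plain dense Gaussian elimination in place of the banded solver described in the text. All function names are hypothetical.

```python
DT = 0.01      # sampling interval in seconds
N = 9          # assumed number of unknowns, p0..p8
PDROOP = 1.0

def build_constraints():
    """Return (rows, rhs, strengths) for the example tags."""
    rows, rhs, strengths = [], [], []

    def add(coeffs, value, strength):
        row = [0.0] * N
        for index, coeff in coeffs.items():
            row[index] = coeff
        rows.append(row)
        rhs.append(value)
        strengths.append(strength)

    add({0: 1.0}, 0.3, 2.0)               # <step to=0.3 strength=2>
    add({6: 1.0, 2: -1.0}, 0.5, 0.7)      # <step by=0.5 strength=0.7>
    for t in range(N - 1):                # <slope rate=1>: p[t+1] - p[t] = rate * DT
        add({t + 1: 1.0, t: -1.0}, 1.0 * DT, 1.0)
    for t in range(N):                    # droop: p[t] = 0, strength pdroop * DT
        add({t: 1.0}, 0.0, PDROOP * DT)
    return rows, rhs, strengths

def normal_equations(rows, rhs, strengths):
    """Form the matrix a^t s^2 a and the vector a^t s^2 b."""
    n = len(rows[0])
    lhs = [[0.0] * n for _ in range(n)]
    vec = [0.0] * n
    for row, value, s in zip(rows, rhs, strengths):
        weight = s * s
        for i in range(n):
            if row[i] == 0.0:
                continue
            vec[i] += weight * row[i] * value
            for j in range(n):
                lhs[i][j] += weight * row[i] * row[j]
    return lhs, vec

def solve(lhs, vec):
    """Solve lhs * x = vec by Gaussian elimination with partial pivoting."""
    n = len(vec)
    m = [lhs[i][:] + [vec[i]] for i in range(n)]   # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

rows, rhs, strengths = build_constraints()
lhs, vec = normal_equations(rows, rhs, strengths)
p = solve(lhs, vec)
print([round(value, 4) for value in p])
```

Because the strong <step to> equation is the only one that fixes an absolute level, the solution keeps p0 close to 0.3, while the <step by> and <slope> equations compete over how much the curve rises between p2 and p6.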

To achieve natural-sounding speech, it is important to maintain continuity between minor phrases. This could be achieved by computing the entire sentence at once. However, that approach lets tags at the beginning of a phrase influence the pitch near the end of the preceding phrase, which is undesirable. In actual human speech, the pitch and accents at the beginning of a phrase do not affect the pitch near the end of the preceding phrase. Speakers tend to finish a phrase without regard to the pitch at the start of the next one, and then make any necessary pitch shift during the pause between phrases or at the beginning of the following phrase.

Continuity is therefore achieved by computing the prosody one minor phrase at a time. Rather than computing each phrase in complete isolation, however, the computation looks back at the values of $p_t$ near the end of the preceding phrase and substitutes them into the equations as known values.

The next stage of tag processing is the computation of the pitch curve. The pitch curve describes the pitch behavior of individual words and other smaller elements of a phrase, rather than of the phrase as a whole. The pitch trajectory is computed from the phrase curve and the <stress> tags. The algorithm described above for steps 1002 to 1012 is applied, but with a different set of equations.

In step 1014, a set of continuity equations of the form $e_{t+1} - e_t = 0$, together with a further set of smoothness equations of the form $-e_{t+1} + 2e_t - e_{t-1} = 0$, is applied at each point. Each equation has strength $s^{[\mathrm{smooth}]} = \pi/2 \cdot \mathrm{smooth}/\Delta t$. The smoothness equations imply that the pitch trajectory has no sharp corners. Mathematically, they keep the second derivative small. This requirement arises from the physical constraint that the muscles used to produce prosody all have nonzero mass, so they accelerate smoothly and cannot respond abruptly.
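The continuity and smoothness constraints of step 1014 are first and second differences of the pitch trajectory. The sketch below builds them as sparse rows; it assumes the strength is read as (π/2)·smooth/Δt and is shared by both kinds of equation, and the function names and tuple layout are hypothetical.

```python
import math

def smoothness_constraints(n, smooth, dt):
    """Build continuity and smoothness constraints for samples e[0..n-1].

    Returns a list of (coefficients, rhs, strength) tuples, where
    coefficients maps a sample index to its coefficient in the equation.
    """
    strength = math.pi / 2 * smooth / dt   # assumed reading of the strength formula
    constraints = []
    for t in range(n - 1):
        # continuity: e[t+1] - e[t] = 0
        constraints.append(({t + 1: 1.0, t: -1.0}, 0.0, strength))
    for t in range(1, n - 1):
        # smoothness: -e[t+1] + 2*e[t] - e[t-1] = 0 (discrete second derivative)
        constraints.append(({t + 1: -1.0, t: 2.0, t - 1: -1.0}, 0.0, strength))
    return constraints

def residual(coefficients, rhs, e):
    """Left-hand side minus right-hand side of one constraint for trajectory e."""
    return sum(c * e[i] for i, c in coefficients.items()) - rhs

constraints = smoothness_constraints(n=6, smooth=0.04, dt=0.01)
ramp = [float(t) for t in range(6)]        # a straight line has no curvature
second_difference_residuals = [residual(c, r, ramp) for c, r, _ in constraints[5:]]
```

A straight ramp violates the continuity equations (each first difference is 1) but satisfies every smoothness equation exactly, which illustrates that the smoothness group penalizes corners rather than slope.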

In step 1016, a set of n "droop" equations is applied. These equations affect the pitch trajectory in the same way that the droop equations described above affect the phrase curve. Each "droop" equation has strength $s^{[\mathrm{droop}]} = \mathrm{adroop} \cdot \Delta t$. Unlike the pdroop equations described above, which tend to pull the phrase curve toward zero, these equations cause the pitch trajectory to droop toward the phrase curve.

In steps 1018 to 1020, one equation is introduced for each <stress> tag. Each such equation constrains the shape of the pitch trajectory. In step 1018, the shape of the <stress> tag is first linearly interpolated to form a contiguous set of targets. An accent defined by shape = $t_0, x_0, t_1, x_1, t_2, x_2, \ldots, t_j, x_j$ is interpolated into $X_k, X_{k+1}, X_{k+2}, \ldots, X_J$, where $k = t_0/\Delta t$ is the index of the first point of the accent shape and $J = t_j/\Delta t$ is the index of the end of the accent. The equation that constrains the average pitch of the accent has strength $s^{[\mathrm{pos}]} = \mathrm{strength} \cdot \sin(\mathrm{type} \cdot \pi/2)$; the equation itself is not reproduced in this text.

As "type" increases from 0, the strength of this equation also increases, from 0 (meaning that the accent preserves its shape at the expense of the average pitch) to "strength" (meaning that the accent preserves the average pitch at the expense of its shape).
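The interpolation in step 1018 can be sketched as follows. The parser assumes that each shape entry is a time in seconds, the letter "s", and a value (as in "-.1s0, -.05s0, .05s.3, .1s.3"), and that the times are strictly increasing; both function names are hypothetical.

```python
def parse_shape(shape):
    """Parse entries such as '-.1s0, -.05s0, .05s.3, .1s.3' into (time, value) pairs."""
    points = []
    for item in shape.split(","):
        time_text, value_text = item.strip().split("s", 1)
        points.append((float(time_text), float(value_text)))
    return points

def interpolate_shape(points, dt):
    """Linearly interpolate (time, value) points onto a grid with spacing dt.

    Returns (k, targets), where k is the grid index of the first shape point
    and targets[i] is the target X at grid index k + i.
    Assumes strictly increasing times.
    """
    k = round(points[0][0] / dt)
    last = round(points[-1][0] / dt)
    targets = []
    segment = 0
    for index in range(k, last + 1):
        t = index * dt
        # advance to the segment that contains t
        while segment < len(points) - 2 and t > points[segment + 1][0]:
            segment += 1
        (t0, x0), (t1, x1) = points[segment], points[segment + 1]
        fraction = min(max((t - t0) / (t1 - t0), 0.0), 1.0)
        targets.append(x0 + fraction * (x1 - x0))
    return k, targets

points = parse_shape("-.1s0, -.05s0, .05s.3, .1s.3")
k, targets = interpolate_shape(points, dt=0.01)
```

With Δt = 0.01 s this shape spans grid indices −10 to 10, so the accent contributes 21 targets: flat at 0, a linear rise through the accent center, then flat at 0.3.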

In step 1020, a further equation is generated for each point of the accent, that is, for indices k through J. These equations define the shape of the accent. They are written in terms of the average value of the pitch trajectory over the accent and the shape of the accent; the equations and their symbols are not reproduced in this text. Subtracting the average ensures that these equations constrain only the shape of the accent and not whether the accent lies above or below the phrase curve. Each of these equations has a strength of $s^{[\mathrm{shape}]} = \mathrm{strength} \cdot \cos(\mathrm{type} \cdot \pi/2)/(J - k + 1)$. In step 1022, the equations are solved using matrix analysis similar to that described in the example above.
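The two strength formulas show how "type" trades shape against average pitch. A small sketch, using only the formulas for $s^{[\mathrm{pos}]}$ and $s^{[\mathrm{shape}]}$ given in the text (the function name and argument names are hypothetical):

```python
import math

def stress_strengths(strength, accent_type, k, j):
    """Return (s_pos, s_shape) for an accent spanning grid indices k..j.

    s_pos weights the single average-pitch equation; s_shape weights each of
    the (j - k + 1) per-point shape equations.
    """
    s_pos = strength * math.sin(accent_type * math.pi / 2)
    s_shape = strength * math.cos(accent_type * math.pi / 2) / (j - k + 1)
    return s_pos, s_shape

# An accent covering 21 grid points, evaluated at three values of "type".
for accent_type in (0.0, 0.5, 1.0):
    print(accent_type, stress_strengths(3.0, accent_type, k=-10, j=10))
```

At type = 0 all of the weight goes to the shape equations, at type = 1 it goes to the average-pitch equation, and intermediate values split the weight between the two.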

The constraint equations can be viewed as an equivalent optimization problem. The expression $e = (a \cdot p - b)^t \cdot s^2 \cdot (a \cdot p - b)$ takes its minimum at the same value of p that solves the constraint equations, so p can be determined by minimizing e. The expression for e can be split into segments by selecting groups of rows of a and b. These groups correspond to groups of constraint equations, and e becomes a sum, over the groups, of smaller versions of the same quadratic form. The continuity, smoothness, and droop equations can be placed in one group, which can be understood as relating to the effort required to produce speech with the desired prosodic features. The constraint equations arising from the tags can be placed in another group, which can be understood as relating to the avoidance of error, that is, to the production of clear and unambiguous speech. The value of e can then be understood as e = effort + error. Qualitatively, the "effort" term behaves like physiological effort: it is zero when the muscles are at rest in a neutral position and increases as muscle movements become faster and stronger. Similarly, the "error" term behaves like a communication error rate: it is minimal when the prosody exactly matches the ideal target and increases as the prosody departs from the ideal. As the prosody departs from the ideal, the listener is expected to become increasingly likely to misidentify the accent or tone shape. It is a reasonable assumption that human speech represents an attempt to minimize a combination of speaking effort and the likelihood of being misunderstood. Minimizing the error rate (that is, the chance that the speech is misinterpreted) is desirable, and reducing the effort of speaking is also a desirable goal. The minimization of e achieved by the techniques of the present invention can therefore be regarded as reflecting the tendencies and trade-offs of real human speech.
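The decomposition e = effort + error can be checked on a toy system. The sketch below evaluates the quadratic form separately over an "effort" group and an "error" group of rows; the three-point trajectory and the specific constraints are invented for illustration and do not come from the patent.

```python
def quadratic_cost(rows, rhs, strengths, p):
    """Return the sum over constraints of (s_i * (a_i . p - b_i))**2."""
    total = 0.0
    for row, value, s in zip(rows, rhs, strengths):
        residual = sum(c * x for c, x in zip(row, p)) - value
        total += (s * residual) ** 2
    return total

# Hypothetical three-point trajectory.
p = [0.0, 0.1, 0.3]

# "Effort" group: continuity-style constraints p[t+1] - p[t] = 0.
effort_rows = [[-1.0, 1.0, 0.0], [0.0, -1.0, 1.0]]
effort = quadratic_cost(effort_rows, [0.0, 0.0], [1.0, 1.0], p)

# "Error" group: a tag-style target p[2] = 0.5 with strength 2.
error_rows = [[0.0, 0.0, 1.0]]
error = quadratic_cost(error_rows, [0.5], [2.0], p)

e = effort + error
```

Evaluating the same function over all rows at once gives the same total, which is the sense in which e splits into a sum of smaller quadratic forms.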

After the pitch curve is computed, the process continues by mapping the linguistic concepts represented by the phrase curve and the pitch curve onto observable acoustic features. The mapping can be achieved by assuming a statistical correlation between the predicted time-varying emphasis $e_t$ and observable features that can be generated in, or for, the speech signal. Because $e_t$ is generally a vector, the mapping can be performed by multiplying $e_t$ by a matrix M of statistical correlations.

In step 1024, the matrix M is derived from the <set range> tag. Next, in step 1026, $e_t \cdot M$ is computed. In step 1028, a nonlinear transformation is applied to the result of step 1026, $e_t \cdot M$, in order to adjust the prosodic features defined by the tags to human perception and expectations. The transformation is defined by the <set add> tag and is given by the function $f(x) = \mathrm{base} \cdot (1 + \gamma \cdot x)^{1/\mathrm{add}}$, where $\gamma = (1 + \mathrm{range}/\mathrm{base})^{\mathrm{add}} - 1$. The value of f(0) equals "base", and the value of f(1) equals "base + range".

The relationship between pitch measured as frequency and the perceived strength of an accent need not be linear. Moreover, the relationship between neural signals or muscle tension and pitch is not linear. When perceptual effects are most important and a human speaker adjusts accents so that they sound right, it is useful to view changes in pitch in terms of the smallest detectable change in frequency. The smallest detectable frequency change increases as frequency increases. According to one widely accepted view, the relationship between the smallest detectable frequency change and frequency is $DL \propto e^{\sqrt{f}}$, where DL is the smallest detectable frequency change, e is the base of the natural logarithm, and f is the frequency or pitch. In the tags and tag processing system of the present invention, this corresponds to a relationship between accent strength and frequency that lies midway between linear and exponential, as described by a <set add> tag with an "add" value of approximately 0.5. On the other hand, if the system is used to model speech on the premise that the speaker does not accommodate the listener, other values of "add" are possible, and values of "add" greater than 1 can be used. For example, if muscle tensions add, the pitch f0 is approximately equal to √(tension).

Each observable feature can have a different function, controlled by the appropriate component of the <set add> tag. The perception of amplitude is roughly similar to the perception of pitch, in that in both cases the perceived quantity increases slowly as the underlying observable changes. Both amplitude and pitch are expressed by inverse functions that increase approximately exponentially with the desired perceptual effect.

The function above, $f(x) = \mathrm{base} \cdot (1 + \gamma \cdot x)^{1/\mathrm{add}}$, moves smoothly between these behaviors: it describes linear behavior when "add" is 1, exponential behavior as "add" approaches 0, and behavior between linear and exponential when "add" lies between 0 and 1.
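The transformation can be written as a short function. This is a sketch under stated assumptions: the function and argument names are hypothetical, and the add = 0 case is handled by the analytic limit of the formula, base·(1 + range/base)^x, since the formula itself divides by "add".

```python
def perceptual_transform(x, base, pitch_range, add):
    """Evaluate f(x) = base * (1 + gamma * x) ** (1 / add).

    gamma = (1 + pitch_range / base) ** add - 1, so that f(0) = base and
    f(1) = base + pitch_range for any nonzero add.
    """
    if add == 0:
        # Limit of the formula as add -> 0: purely exponential interpolation.
        return base * (1.0 + pitch_range / base) ** x
    gamma = (1.0 + pitch_range / base) ** add - 1.0
    return base * (1.0 + gamma * x) ** (1.0 / add)

# Midpoint of the mapping for several "add" values, with base = range = 100 Hz.
for add in (0.0, 0.5, 1.0, 2.0):
    print(add, perceptual_transform(0.5, 100.0, 100.0, add))
```

All settings agree at x = 0 and x = 1 and differ only in how they interpolate between the two endpoints; with add = 1 the midpoint is halfway between base and base + range.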

FIG. 11 shows an example of mapping linguistic coordinates onto observable acoustic features, as described above in connection with steps 1024 to 1026 of FIG. 10. Graph 1102 shows a curve 1104 plotting surprise against emphasis. Graph 1106 shows a curve 1106 plotting pitch against amplitude. Curve 1104 is mapped onto curve 1106. The mapping is made possible by the matrix multiplication described above in connection with steps 1024 to 1026 of FIG. 10.

FIG. 12 is a graph 1200 showing the result of a nonlinear transformation similar to that described in connection with step 1028 of FIG. 10. Curves 1202 to 1208 represent the function f(x) for "add" values of 0.0, 0.5, 1.0, and 2.0, respectively. Curve 1202, with "add" equal to 0, shows an exponential relationship; curve 1206, with "add" equal to 1, shows a linear relationship; and curve 1208, with "add" equal to 2, shows a logarithmic relationship.

FIG. 13 is a graph 1300 showing the effect of accents on the pitch curve for different values of "add". Curves 1302A, 1304A, and 1306A show the effect of the tag sequence <set add=X/>... <slope rate=1/>, where X is 0 for curve 1302A, 0.5 for curve 1304A, and 1 for curve 1306A. Curve 1302A shows an exponential relationship, whereas curve 1306A shows a linear relationship. Because curve 1302A shows a logarithmic relationship between frequency and perceived pitch, the actual frequency increases linearly if a uniform slope of perceived pitch is desired.

Curves 1302B, 1304B, and 1306B show the effect of the tag sequence <set add=X/>... <slope rate=1/> with the addition of the tag sequence <stress strength=3 type=0.5 shape=-0.1s0, 0.05s0, 0s0.1, 0.05s0, 0.1s0/>... <stress strength=3 type=0.5 shape=-0.1s0, 0.05s0, 0s0.1, 0.05s0, 0.1s0/>. The value of X is 0 for curve 1302B, 0.5 for curve 1304B, and 1 for curve 1306B. The effect of the first accent is similar for curves 1302B, 1304B, and 1306B. This is because the first accent occurs at a relatively low frequency, where the different values of "add" have no particularly noticeable effect. Differences in the "add" value become more noticeable at high frequencies but are not particularly noticeable at low frequencies. The second accent, however, produces markedly different results for each of curves 1302B, 1304B, and 1306B. As the frequency increases, the lower the value of "add", the more strongly the accent displaces the frequency.

The following example illustrates the generation of a Mandarin Chinese sentence from the tags of the present invention. Mandarin Chinese is a tone language with four distinct lexical tones. Tones can be strong or weak, and the relative strength or weakness of a tone affects its shape and its interaction with neighboring tones. FIGS. 14A to 14H show how the pitch over a sentence varies in eight situations, with each of the four tones appearing in a strong context and in a weak context. The interaction of a tone with its neighbors can be expressed using tags that control the strength of the syllables in the sentence, as shown below.
Chinese word   English translation   Strength   Type
Shou-          radio                 1.5        0.5
Yin-           --                    1.0        0.2
Ji             --                    1.0        0.3
Duo            more                  1.1        0.5
ying-          should                0.8        0.2
gai            --                    0.8        0.3
deng           lamp                  1.0        0.5
bi-            comparatively         1.5        0.5
jiao           --                    1.0        0.3
duo            more                  1.0        0.5

The "strength" and "type" values were derived from training sentences containing the word shou1 yin1 ji1, where "1" denotes Mandarin tone 1, the level tone.

These tags are used for the four figures, FIGS. 14E to 14H (shou1 yin ji1), in which the four different tones occur on the second syllable of the sentence. For the short "Yan" sentences shown in FIGS. 14A to 14D, the three-syllable word "shou1 yin/ying ji" is replaced by the monosyllabic word "yan"; the rest of each sentence is the same. The tag for the syllable "Yan" has strength=1.5 and type=0.5, the same as "Shou", the strongest syllable of the three-syllable word "Shou yin ji".

FIG. 14A is a graph 1400 showing a curve 1402 that represents the modeling of the word "Yan1" in a sentence through the use and processing of tags according to the present invention. "Yan1" is the word "Yan" spoken with tone 1, the level tone. Curve 1404 represents data produced by a speaker uttering a sentence that begins with the word "Yan1". The monosyllabic word "Yan1" has a high strength, so its pitch curve shows only slight influence from nearby words.

FIG. 14B is a graph 1410 showing a curve 1412 that represents the modeling of the word "Yan2" in a sentence through the use and processing of tags according to the present invention. "Yan2" is the word "Yan" spoken with tone 2, the rising tone. Curve 1414 represents data produced by a speaker uttering a sentence that begins with the word "Yan2". The monosyllabic word "Yan2" has a high strength, so its pitch curve shows only slight influence from nearby words.

FIG. 14C is a graph 1420 showing a curve 1422 that represents the modeling of the word "Yan3" in a sentence through the use and processing of tags according to the present invention. "Yan3" is the word "Yan" spoken with tone 3, the low tone. Curve 1424 represents data produced by a speaker uttering a sentence that begins with the word "Yan3". The monosyllabic word "Yan3" has a high strength, so its pitch curve shows only slight influence from nearby words.

FIG. 14D is a graph 1430 showing a curve 1432 that represents the modeling of the word "Yan4" in a sentence through the use and processing of tags according to the present invention. "Yan4" is the word "Yan" spoken with tone 4, the falling tone. Curve 1434 represents data produced by a speaker uttering a sentence that begins with the word "Yan4". The monosyllabic word "Yan4" has a high strength, so its pitch curve shows only slight influence from nearby words.

FIG. 14E is a graph 1440 showing a curve 1442 that represents the modeling of the word “Shou1 yin1 ji1” in a sentence through the use and processing of tags according to the present invention. “Yin1” is the syllable “Yin” spoken with tone 1, the level tone. Curve 1444 represents data produced by a speaker speaking a sentence that begins with the word “Shou1 yin1 ji1”. The syllable “Yin1”, the middle syllable of a three-syllable word, has a weak strength, so the pitch curve shows strong influence from other nearby words.

FIG. 14F is a graph 1450 showing a curve 1452 that represents the modeling of the word “Shou1 yin2 ji1” in a sentence through the use and processing of tags according to the present invention. “Yin2” is the syllable “Yin” spoken with tone 2, the rising tone. Curve 1454 represents data produced by a speaker speaking a sentence that begins with the word “Shou1 yin2 ji1”. The syllable “Yin2”, the middle syllable of a three-syllable word, has a weak strength, so the pitch curve shows strong influence from other nearby words.

FIG. 14G is a graph 1460 showing a curve 1462 that represents the modeling of the word “Shou1 ying3 ji1” in a sentence through the use and processing of tags according to the present invention. “Ying3” is the syllable “Ying” spoken with tone 3, the low tone. Curve 1464 represents data produced by a speaker speaking a sentence that begins with the word “Shou1 ying3 ji1”. The syllable “Ying3”, the middle syllable of a three-syllable word, has a weak strength, so the pitch curve shows strong influence from other nearby syllables.

FIG. 14H is a graph 1470 showing a curve 1472 that represents the modeling of the word “Shou1 ying4 ji1” in a sentence through the use and processing of tags according to the present invention. “Ying4” is the syllable “Ying” spoken with tone 4, the falling tone. Curve 1474 represents data produced by a speaker speaking a sentence that begins with the word “Shou1 ying4 ji1”. The syllable “Ying4”, the middle syllable of a three-syllable word, has a weak strength, so the pitch curve shows strong influence from other nearby syllables.

It can be seen from the curves shown in FIGS. 14A-14H that the curves produced by modeling text processed with tags according to the present invention closely approximate the curves representing the words as actually spoken.

FIG. 15 shows the steps of a process 1500 for generating and using tags according to the present invention. In step 1502, a body of training text is selected. In step 1504, a target speaker reads the training text, producing a training corpus. In step 1506, the training corpus is analyzed to identify its prosodic features. In step 1508, a set of tags that models the prosodic features of the training corpus is generated, and the tags are placed in the training text so as to model the training corpus. In step 1510, the placement of tags in the training text is analyzed, and a rule set for placing tags in text is generated in order to model the prosodic features of the target speaker. In step 1512, tags are placed in a body of text on which text-to-speech processing is to be performed. Tag placement can be done manually, for example with a text editor, or automatically using the rule set established in step 1510. It will be appreciated that steps 1502-1510 are typically performed once or a few times for each target speaker, whereas step 1512 is performed whenever a body of text is to be prepared for text-to-speech processing.
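The automatic placement in step 1512 can be illustrated with a minimal sketch. It assumes a single hypothetical rule suggested by FIGS. 14B-14H: monosyllabic words receive a strong strength and the interior syllables of longer words receive a weak one. The function name `place_tags`, the tuple layout, the numeric strength values, and the treatment of the first and last syllables of a polysyllabic word as strong are all assumptions made for illustration and are not taken from the description.

```python
import re

# Hypothetical numeric strengths; the description only says "strong" and "weak".
STRONG, WEAK = 2.0, 0.5


def place_tags(words):
    """Return one (syllable, tone, strength) tag per syllable.

    words: list of words, each a string of space-separated pinyin
    syllables ending in a tone digit, e.g. "Shou1 yin1 ji1".
    """
    tagged = []
    for word in words:
        syllables = word.split()
        for i, syllable in enumerate(syllables):
            match = re.fullmatch(r"([A-Za-z]+)([1-4])", syllable)
            if match is None:
                raise ValueError(f"not a toned syllable: {syllable!r}")
            # Assumed rule: only interior syllables of a word are weak.
            is_interior = 0 < i < len(syllables) - 1
            strength = WEAK if is_interior else STRONG
            tagged.append((match.group(1), int(match.group(2)), strength))
    return tagged


tags = place_tags(["Yan2", "Shou1 yin1 ji1"])
```

A rule set derived in step 1510 would replace the hard-coded rule here; the sketch only shows the shape of the step, in which text goes in and a sequence of tags with strengths comes out.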

FIG. 16 shows a text-to-speech system 1600 according to the present invention. System 1600 includes a computer 1602 comprising a processing unit 1604, which includes memory 1606 and a hard disk 1608, together with a monitor 1610, a keyboard 1612, and a mouse 1614. Computer 1602 also includes a microphone 1616 and a loudspeaker 1618. Computer 1602 operates to implement a text input interface 1620 and a speech output interface 1622. Computer 1602 also provides a speech modeler 1624 adapted to receive text from text input interface 1620. Tags generated according to the present invention are placed in the text. Speech modeler 1624 processes the text and tags to generate speech having the prosodic features defined by the tags, and outputs that speech to loudspeaker 1618 through speech output interface 1622. Speech modeler 1624 may optionally include a prosody tag generation component 1626 adapted to generate a set of tags, and rules for applying the tags, in order to produce speech having prosodic features typical of a target speaker.

To generate a set of tags, prosody tag generation component 1626 analyzes a training corpus representing a reading of a training text by the target speaker, analyzes the prosodic features of the training corpus, and generates a set of tags that can be added to the training text to model the training corpus. Prosody tag generation component 1626 then places the tags in the training text, analyzes the tag placement, and creates a rule set for placing tags in text so as to model the characteristics of the target speaker's way of speaking.

Speech modeler 1624 may also optionally include a prosody evaluation component 1628, which is used to process the tags placed in text for which text-to-speech generation is desired. Prosody evaluation component 1628 generates a time series of the pitch or amplitude values defined by the tags.

The system for generating and processing tags described above is a solution to one aspect of a more general problem. Speaking is an act of muscular movement that balances two main goals: minimizing the effort required to move the muscles, and minimizing movement error, that is, the difference between the desired movement and the movement actually made. The system described above generally produces smooth changes in prosody even when the demands of adjacent tags conflict severely. Producing smooth changes reflects the reality of how muscular movements are made, and balances effort against movement error.

It will be appreciated that the system for generating and processing tags according to the present invention allows a user to create tags that define accents without restricting either the shape or the range of the accents being defined. The user is therefore free to create and place tags so as to define the accent shapes of different languages as well as variations within a single language. Speaker-specific accents can be defined for speech, and ornamental accents can be defined for music. Because no constraints on shape or range are imposed on the user's accent definitions, the definitions may result in physiologically implausible combinations of targets. The system for generating and processing tags according to the present invention accepts conflicting specifications and returns a smooth surface realization that satisfies all of the constraints.

Producing a smooth surface realization in the face of conflicting specifications helps to reproduce actual human speech accurately. The muscle movements that control prosody in actual human speech are smooth, because it takes time to move from one intended accent target to the next. Note also that when a section of the speech material is unimportant, the speaker may not make much effort to realize its targets. The surface realization of prosody can therefore be posed as an optimization problem that minimizes the sum of two functions. The first function is a physiological constraint G, which imposes a smoothness constraint by minimizing the first and second derivatives of the specified pitch p. The second function is a communication constraint R, which minimizes the sum of the errors τ between the realized pitch p and the target y. This constraint models the requirement that speech be precise enough for the listener to understand.

The errors are weighted by the tag's strength S_i, which indicates how important it is to satisfy the tag's specification. When a tag's strength is weak, the physiological constraint dominates, and smoothness becomes more important than accuracy. S_i controls the interaction of an accent tag with its neighboring tags through the smoothness requirement G: the stronger a tag is, the stronger its effect on neighboring tags. Tags also include parameters α and β, which control whether the error in shape or the error in the mean value of p_t matters most. These parameters are derived from the “type” parameter. The target y can be expressed as an accent component riding on top of the phrase curve.

The values of G, R, and τ are given by the following equations.

Tags are generally processed so as to minimize the sum of G and R. The above equations express, for the processing of tags that define prosody, the minimization of the combined effort and movement error.
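The minimization of G + R can be sketched numerically. The sketch below does not use the patent's equations for G and R. It assumes a simplified quadratic form: G is the sum of squared first differences plus a weighted sum of squared second differences of the pitch p, and R is the sum over tags of S_i squared times the squared error between p and the target y. The parameters α and β are omitted. The function name `realize_pitch`, the tag tuple layout, and the `curvature_weight` parameter are hypothetical. Under these assumptions the minimum is the solution of a linear least-squares problem.

```python
import numpy as np


def realize_pitch(n_frames, tags, curvature_weight=1.0):
    """Return the pitch track p that minimizes the assumed G + R.

    tags: list of (start, stop, target, strength); target is the desired
    pitch over frames start..stop-1, strength plays the role of S_i.
    """
    rows, rhs = [], []
    # G, first-derivative term: penalize p[t+1] - p[t].
    for t in range(n_frames - 1):
        row = np.zeros(n_frames)
        row[t], row[t + 1] = -1.0, 1.0
        rows.append(row)
        rhs.append(0.0)
    # G, second-derivative term: penalize p[t-1] - 2*p[t] + p[t+1].
    for t in range(1, n_frames - 1):
        row = np.zeros(n_frames)
        row[t - 1], row[t], row[t + 1] = 1.0, -2.0, 1.0
        rows.append(curvature_weight * row)
        rhs.append(0.0)
    # R: strength-weighted error between realized pitch and target.
    for start, stop, target, strength in tags:
        for t in range(start, stop):
            row = np.zeros(n_frames)
            row[t] = strength
            rows.append(row)
            rhs.append(strength * target)
    # Least squares minimizes the sum of squared residuals, i.e. G + R.
    p, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return p


# A high middle target between two low targets, with a strong or weak middle tag.
strong = realize_pitch(30, [(0, 10, 100.0, 3.0), (10, 20, 200.0, 3.0), (20, 30, 100.0, 3.0)])
weak = realize_pitch(30, [(0, 10, 100.0, 3.0), (10, 20, 200.0, 0.1), (20, 30, 100.0, 3.0)])
print(round(strong[15]), round(weak[15]))
```

With the strong middle tag the realized pitch reaches its target, while with the weak middle tag the smoothness term dominates and the pitch stays close to that of its neighbors, which is the qualitative behavior described for strong and weak tags.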

FIG. 17 shows a process 1700 for modeling motion phenomena that are continuous and subject to constraints such as muscle dynamics. In step 1702, a set of tags defining the desired motion components is created. In step 1704, tags are selected and placed to define the desired motion. In step 1706, the tags are analyzed to determine the motion they define. In step 1708, a time series of motions is identified that minimizes the combination of motion effort, that is, the effort required to produce the motion, and motion error, that is, the deviation from the motion defined by the tags. In step 1710, the identified time series of motions is generated. It will be appreciated that step 1702 is performed relatively rarely, when the set of tags defining the motions to be generated is created, whereas steps 1704-1710 are performed more often, whenever tags are employed to define and generate motion.

The above description has set out techniques for generating and using tags suited to describing and modeling phenomena that are continuous and subject to physiological constraints. A widely used application in which such techniques are useful is the description and modeling of the prosodic features of speech in text-to-speech generation, and a set of tags suited to modeling such features has been described. The effects of the tags and techniques for processing them have been presented, as have processes for generating, selecting, placing, and processing tags and a text-to-speech system that uses tags to generate speech having desired prosodic features. Finally, a process for generating and using tags to define and generate a sequence of motions has been described.

Although the present invention has been disclosed in the context of a presently preferred embodiment, it will be recognized that those skilled in the art may employ a wide variety of implementations consistent with the above description and the appended claims.

Claims

A method of text-to-speech processing, comprising:
Analyzing a training corpus representing a reading of a training text to create a set of tags from which tags are selected for placement in text to define the prosodic features of speech generated by processing the text;
Placing selected members of the set of tags in a text body in a desired sequence to generate the speech features defined by the sequence of tags; and
Processing the text body and the tags to generate speech having the prosodic features defined by the tags;
Wherein the training corpus is formed by selecting a body of the training text and receiving speech representing a reading of the training text by a target speaker.

The method of claim 1, wherein each tag imposes constraints on the prosodic features of the speech affected by the tag, or each tag includes parameters that specify an action to be taken together with attributes and associated values that provide information about the action to be taken, or each tag can include a parameter that identifies the location at which the influence of the tag appears.

The method of claim 2, wherein the set of tags includes tags that establish settings that remain unchanged unless modified by a subsequent tag, or the set of tags includes members that define the pitch behavior of the speech across a phrase, or the set of tags includes tags defining accents, which define pitch behavior having local influence within a phrase.

The method of claim 3, wherein the local influences within the phrase include individual words within the phrase, and the tag defining the accent includes a parameter defining the degree of nonlinearity of the accent's effect, such that an accent with higher nonlinearity has a greater effect when it appears in a higher-pitch region of the phrase than when it appears in a lower-pitch region of the phrase, and an accent with a linear effect has the same effect in every pitch region of the phrase.

The method of claim 3, wherein an accent whose effect is less than linear has a smaller effect in higher regions of the phrase's pitch and a greater effect in lower regions of the phrase's pitch; the set of tags includes a tag defining a phrase boundary, which marks a boundary between the regions that tags affect, such that tags following the tag marking the boundary do not affect components of speech preceding the tag marking the boundary; and each of the tags can include values defining a type and a strength, which define the tag's interaction with other tags.

Processing the text body and the tags comprises:
Extracting the tags from the text;
Creating a set of equations that define a phrase curve;
Solving the set of equations to generate the phrase curve;
Creating a set of equations that define a pitch curve;
Solving the set of equations to generate the pitch curve;
Mapping the linguistic concepts represented by the phrase curve and the pitch curve to observable acoustic features; and
Performing a nonlinear transformation to adjust the prosodic features defined by the tags to human perception and expectation.

A method of defining a set of tags that identify a target speaker's prosodic features, comprising:
Selecting a body of training text;
Receiving speech representing a reading of the training text by the target speaker, to form a training corpus;
Analyzing the training corpus to identify prosodic features of the training corpus; and
Creating a set of tags defining the identified prosodic features of the training corpus.

A method for placing tags in text for text-to-speech processing, comprising:
Placing tags in a body of training text to model the prosodic features of a training corpus generated by reading the training text;
Analyzing the placement of the tags in the training text and creating a rule set for the placement of tags in text; and
Applying the rules to text for which text-to-speech processing is desired, placing tags in the text so as to generate speech having the desired prosodic features;
Wherein the training corpus is formed by selecting the body of training text and receiving speech representing a reading of the training text by a target speaker.

A text-to-speech system that receives text input including text to be processed to generate speech and tags that define the prosodic features of the speech to be generated, comprising:
A prosody tag generation component that analyzes a training corpus to identify features exhibited by one or more readings by one or more target speakers and generates a set of tags that define the identified features;
A text input interface for receiving the text input;
A speech modeler operable to process the text input to generate speech having the prosodic features specified by the tags, wherein the speech generated by the speech modeler is similar to that of the one or more target speakers; and
A speech output interface for generating the speech output;
Wherein the training corpus is formed by selecting a body of training text and receiving speech representing a reading of the training text by the target speaker.