JP2007527555A

JP2007527555A - Prosodic speech text codes and their use in computerized speech systems

Info

Publication number: JP2007527555A
Application number: JP2007502054A
Authority: JP
Inventors: Marple, Gary; Park, Sue Ann; Wilson, H. Donald; Krebs, Nancy; Gary, Diane; Kur, Barry
Original assignee: LESSAC TECHNOLOGIES Inc
Current assignee: LESSAC TECHNOLOGIES Inc
Priority date: 2004-03-05
Filing date: 2005-03-07
Publication date: 2007-09-27
Also published as: CA2557079A1; US20070260461A1; WO2005088606B1; EP1726005A1; US7877259B2; EP1726005A4; KR20070004788A; WO2005088606A1; CN1938756A

Abstract

A method and system are disclosed for acoustically encoding text for use in synthesizing speech from text. The method comprises marking the text to be spoken with one or more graphical symbols that indicate to a speaker the desired prosody to impart to the spoken text so as to convey expressive meaning. The markings can comprise grapheme/phoneme pairs, each having a visible prosodic-indicating grapheme that can be used in written text and a corresponding digital phoneme that functions in the digital domain. The invention is useful for generating appealing, humanlike machine speech for a wide range of applications, including audio books and magazines, drama and other entertainment, voicemail systems, electronically enabled appliances, automobiles, computers, robotic assistants, games, and the like.

Description

The present invention relates to a method and computer system for providing synthesized or artificial speech from ordinary text input using novel prosodic speech text codes.

Synthetic, artificial, or machine speech has many useful applications, for example in audio books and magazines, drama and other entertainment, voicemail systems, electronically enabled appliances, automobiles, computers, robotic assistants, and games. As will become apparent below, the present invention is broadly applicable to implementations of any such system.

Known systems for generating artificial speech are generally described as either concatenative systems or formant systems. Concatenative artificial speech systems are used, for example, in interactive voicemail systems, where they produce acceptable human-sounding speech from prerecorded complete phrases or sentences. However, such systems are not suited to converting large volumes of unknown text, such as magazine articles and books, into speech. Formant systems, which synthesize small slices of voiced or voice-like sound "on the fly" as the text is machine-read or processed by a computer system, are better suited to such large volumes. Until recently, however, the output of such formant speech systems was mechanical, monotonous, and machine-like, and was not well received.

U.S. Pat. No. 5,748,838 to Stevens, assigned to Sensimetrics Corporation (Cambridge, Mass.), discloses a speech synthesis method that uses glottal modeling and applies mapping relationships to determine ten or fewer high-level parameters and convert them into 39 low-level parameters. Inputting these parameters into a speech synthesizer allows speech to be synthesized more simply than with prior-art systems, which required 50 to 60 parameters to represent any particular speech. Although Stevens's disclosure is useful for its intended purpose, the somewhat mechanical modeling with which Stevens dissects the voice does not yield speech output with an appealing, humanlike sound quality. Nor does Stevens provide or suggest any means of adding desirable prosody, or of controlling and modifying the prosody of synthetically or artificially generated speech.

As described in commonly owned U.S. Pat. No. 6,847,931 to Addison et al., copending U.S. patent application Ser. No. 10/334,658 (hereinafter Addison '658), and International Publication No. WO 2003/065349, text to be synthesized may be marked with voice-training notation that serves as a pronunciation guide for ease of understanding. Addison '658 provides expressive parsing for speech synthesis, uses trained speakers to generate a usable database of speech elements, and performs expressive synthesis of speech from text. However, neither the Lessac system nor any other known system provides a simple method of communicating the desired prosody to a speech synthesizer in a manner that allows the prosody of the output speech to be controlled.

"Good American Speech" by Margaret Prendergast McLean (E. P. Dutton & Co., Inc., 1952) (hereinafter "McLean") describes a notation system that marks text to indicate to a reader a desired intonation pattern, that is, the pitch changes in connected speech, adopted to avoid defects such as monotonous, peculiar, or dialectal intonation. Nothing in this pre-computer work, which makes no attempt to address computerized speech, suggests that McLean's intonation patterns could help solve today's problems in speech synthesis. Furthermore, McLean's intonation patterns provide no means of referencing pitch, which makes it difficult for different speakers to apply them consistently.

The foregoing description of background art includes insights, discoveries, understandings, or disclosures, or associations of disclosures, that were not known to the relevant art before the present invention but were provided by the invention. Some such contributions of the invention are specifically pointed out herein; other such contributions will be apparent from their context. The citation of a document herein is not an admission that the field of that document, which may differ considerably from that of the invention, is analogous to any field of the present invention.

There is accordingly a need for a simple method of communicating a desired prosody to a speech synthesizer in a manner that allows the prosody of the output speech to be controlled.

To meet these or other objects, the present invention provides a method of acoustically encoding text for use in synthesizing speech from text, the method comprising marking the text to be spoken with one or more graphical symbols that indicate to a speaker a desired prosody to impart to the spoken text. The invention also provides a speech synthesis method and system that include a prosodic code, or notation, for marking text with expressive meaning by specifying the appropriate prosody. The markings comprise grapheme/phoneme pairs, each having a visible prosodic-indicating grapheme that can be used in written text and a corresponding digital phoneme that functions in the digital domain.
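As a rough illustration of the grapheme/phoneme pairing, the Python sketch below keeps each mark's visible grapheme and its digital token in one record, so the same marked sequence can be rendered for a human reader or handed to a synthesizer. The mark names, the ASCII glyphs, and the token strings (`PITCH_UP` and so on) are hypothetical placeholders, not the notation or codes defined in this specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProsodicMark:
    """One hypothetical grapheme/phoneme pair."""
    name: str       # identifier used inside the program
    grapheme: str   # visible symbol for written or printed text
    phoneme: str    # digital token consumed by a synthesizer


# Illustrative inventory only; glyphs and tokens are placeholders.
MARKS = {
    m.name: m
    for m in (
        ProsodicMark("upglide", "./", "PITCH_UP"),
        ProsodicMark("downglide", ".\\", "PITCH_DOWN"),
        ProsodicMark("level_sustention", ".-", "PITCH_LEVEL"),
        ProsodicMark("prepared", "/", "PREPARED"),
    )
}


def to_visible(names):
    """Render a sequence of marks as human-readable graphemes."""
    return [MARKS[n].grapheme for n in names]


def to_digital(names):
    """Render the same sequence as digital-domain tokens."""
    return [MARKS[n].phoneme for n in names]
```

Keeping both halves of the pair in one record means a change to a mark's name cannot leave its visible and digital forms out of step.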

The prosody to be imparted can comprise one or more prosodic elements selected from the group consisting of tempo, intonation pattern, rhythm, musicality, amplitude, pauses for emphasis and breathing, and formal and informal articulation of words and phrases.

The method can comprise marking visible text with graphical prosodic symbols, or electronically marking electronic text with electronic versions of the graphical symbols. The electronically marked text can be displayed or printed as human-readable, graphically marked text.
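One possible way to make electronically marked text displayable or printable is sketched below in Python, assuming that Unicode combining characters are an acceptable stand-in for the hand-drawn marks: COMBINING LOW LINE (U+0332) for a single underline, COMBINING DOUBLE LOW LINE (U+0333) for a double underline, and COMBINING LONG SOLIDUS OVERLAY (U+0338) for the forward slash. The `render` helper and its index-to-mark mapping are illustrative, not part of the disclosed method.

```python
# Unicode combining characters standing in for the graphical marks.
COMBINING = {
    "single": "\u0332",    # COMBINING LOW LINE: one underline
    "double": "\u0333",    # COMBINING DOUBLE LOW LINE: two underlines
    "prepared": "\u0338",  # COMBINING LONG SOLIDUS OVERLAY: forward slash
}


def render(word, marks):
    """Return word with a combining mark placed after each marked letter.

    marks maps a character index in word to a key of COMBINING.
    """
    out = []
    for i, ch in enumerate(word):
        out.append(ch)
        if i in marks:
            out.append(COMBINING[marks[i]])
    return "".join(out)
```

For example, `render("whiskey", {3: "double"})` double-underlines only the S, in line with the s-blend example of FIG. 3, in which the S is played and the K is not. How the result looks on screen depends on the font's support for combining marks.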

In another aspect, the present invention provides a speech synthesizer controlled by acoustic coding variables input to it. The acoustic coding variables correspond to the prosodic specifications used to generate recorded human speech having a desired prosodic pronunciation, so that the synthesizer provides synthesized speech output embodying that prosodic pronunciation.

According to one embodiment of the invention, as described below, a novel notation system is provided for the phonetics, structure, and names of playable and unplayable consonants, together with the so-called four Lessac "neutrals". Using these, novel graphically marked text is composed.

Furthermore, the present invention provides novel procedures and systems useful in text-to-speech (sometimes referred to herein as "TTS") and speech or sound recognition applications. These procedures include one or more, or all, of the following:
generation of prosodic speech rules and their application in speech synthesis;
acoustic demonstrations of the prosodic speech rules;
an acoustic database library of prosodic speech elements;
examples of software for TTS; and
testing of TTS by listeners.

Some embodiments of the invention, the manner of making and using it, and the best mode for carrying it out are described in detail below by way of example, with reference to the accompanying drawings, in which like reference characters designate like elements throughout the several views.

Before the present invention, there was no known synthesizer "code" for specifying sounds, nor any measured set of acoustic representations for such a code to produce. The present invention therefore lets a skilled voice practitioner act, in effect, as the "synthesizer", pronouncing a sampling of prosodically marked text so that acoustic values for the desired pronunciations can be captured. These acoustic values are used to prepare a novel prosodic acoustic database library that can be used for speech synthesis according to the invention. Using the novel graphical marking symbols described herein together with the controlled database recording method described above, useful prosodic elements such as intonation patterns, rhythm, pauses for emphasis and breathing, and formal and informal articulation of words and phrases can beneficially be incorporated into synthetic or artificial speech.

To improve on the unattractive, mechanical sound quality typical of known synthesized speech output, the present invention provides systems, methods, and a novel text-encoding technique that supply controlled or standardized human speech input. This input is used to generate a database of acoustic elements that can be assembled mechanically into speech using a rule set adapted to select the appropriate acoustic elements, so as to produce humanlike speech output.

The human speech input and the rule set desirably embody the teachings of one or more expert voice practitioners. In one embodiment of the invention, the teachings of a recognized voice-training instructor are used.

Although Arthur Lessac's teachings for theater and public speaking are referred to herein by way of example, the teachings of other voice-training instructors or other schools of voice training may be used; in particular, it is understood that languages other than English often call for considerably different voice-training teachings. Such other voice-training techniques will be apparent from the teachings herein, but it is desirable to have a rule set that, much as musical notation does, yields consistent, easily understood speech output with appealing prosodic character. A voice practitioner is to be understood as an individual well educated in relevant voice training or instruction, for example a voice teacher, public speaker, or actor, who typically applies specialized vocal skills and knowledge to the ability to speak.

A voice practitioner trained in the Lessac method comes to think of speech as orchestral sound, that is, to think of the voice as music. Arthur Lessac's book "The Use and Training of the Human Voice" (Mayfield Publishing Company, 3rd edition, 1997) (hereinafter "Arthur Lessac's book"), in Part II beginning at page 61, identifies the speech parameters as the interaction of three elements: consonants, tonality, and structural energy. These are introduced, respectively, as the consonant "orchestra", the tonality of the voice itself as music, and structural energy as the interaction of structural elements with consonants and vowels. Arthur Lessac refers to these three speech parameters as the eNeRGy of the voice and notes that all of them are derived from the text being pronounced; that is, he points to factors such as the content of the text considered as a whole, the meanings and sounds of the words, their grammatical relationships, the syntax used, and the message the text conveys.

Although human speech is an analog sound, and a speaker can "play the voice as a continuous instrument", it is useful, when teaching the concept of continuously variable expressive speech described in Arthur Lessac's book (particularly at pages 149 and 170 to 173), to take discrete points within a continuous utterance; the book illustrates the values of such "points" within the continuous structural and tonal ranges.

The Lessac system provides a certain amount of alphanumeric notation that encodes the desired pronunciation, to facilitate understanding of individual speech elements such as distinct phonemes, diphones, and so-called M-ary phones. These speech elements are chiefly individual vowels and consonants, diphthongs, and consonant blends.

Referring now to FIG. 1, the marked text comprises a text line 10 of words marked for pronunciation according to Lessac notation, with a notation line 12 of alphanumeric symbols positioned immediately above text line 10. An individual familiar with the Lessac system, as described for example in Arthur Lessac's book, can understand the pronunciation instructions given by the notation in line 12 and apply them to pronounce the text of line 10 consistently from one individual to the next. Samples of such notation for use in practicing the present invention are shown in Tables A to E below. When the text is well marked and the speaker properly follows the marking instructions, clear and intelligible speech results. Even when clear and intelligible, however, the speech may be somewhat monotonous or mechanical, depending on the speaker or the speech source.

The usefulness of Lessac graphical notation for indicating intelligible pronunciation is described in U.S. patent application Ser. No. 10/334,658 to Addison et al., entitled "TEXT TO SPEECH" and filed Dec. 31, 2002; however, that application gives no examples and does not describe the specific embodiment of marking shown in FIG. 1.

Referring now to FIG. 2, according to the invention the graphical symbols shown are useful for indicating the pitch control required in pronouncing letters, diphthongs, syllables, or other speech elements so as to obtain a desired prosody in the course of speaking.

The prosodic codes used in the present invention are pronunciation codes that relate to the sequence of the text: the sequence of letters within a word, the sequence of words within a sentence, the sequence intrinsic to the sentence itself, the location of a sentence within the sequence of sentences in a paragraph, and the position of a paragraph within a sequence of paragraphs. Any one or more of these considerations can determine what is or is not an appropriate prosody, and which prosodic elements of stress, pitch, or timing are suitable to apply to the text. The appropriate prosody may not be apparent until the sequence is complete. The present invention enables an appropriate prosody to be applied to text in light of these considerations. The codes used in the invention are determined by articulatory sound-production principles and by context, and are modified for expressive meaning by specifying the appropriate prosody.
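To make the dependence on sequence concrete, the Python sketch below tags every letter with its paragraph, sentence, word, and letter index, so that a marking rule can consult any level of that hierarchy. It is a minimal sketch assuming that paragraphs are separated by blank lines, sentences end in `.`, `!`, or `?`, and words are runs of ASCII letters; real text would need a proper tokenizer.

```python
import re


def positions(text):
    """Yield (paragraph, sentence, word, letter, char) for every letter in text."""
    for p, para in enumerate(text.split("\n\n")):
        # Split after sentence-final punctuation that is followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", para.strip()) if s]
        for s, sentence in enumerate(sentences):
            for w, word in enumerate(re.findall(r"[A-Za-z]+", sentence)):
                for i, ch in enumerate(word):
                    yield (p, s, w, i, ch)
```

Because each tuple carries all four indices, a rule such as "treat the last word of a paragraph differently" can only be evaluated once the whole paragraph has been read, which mirrors the remark that the appropriate prosody may not be apparent until the sequence is complete.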

The symbols illustrated are an upglide 20, a downglide 22, two circumflexes 24A and 24B, and a level sustention 26. Each of graphical notations 20 to 26 has a dot on the left, such as dot 28, indicating the starting pitch, and a tail extending to the right of the dot, such as upward tail 30.

The shape of the tail indicates how the pitch changes as the speech element is articulated. The upward tail 30 of upglide 20 indicates a rising pitch. Downglide 22 has a downward tail 32 indicating a falling pitch, and level sustention 26 stays level, indicating a sustained, unchanging pitch. Circumflex 24A indicates a pitch that rises to a peak and then falls, and circumflex 24B indicates the reverse. Prosodic graphical symbols 20 to 26 can be placed at any convenient location adjacent to the text to be spoken, for example aligned in a line immediately above the text or, alternatively, below it. Although the text may be split or hyphenated to accommodate the graphical pronunciation symbols described herein, it is preferable to preserve the normal typed, keyed, or written appearance of the text.
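The five pitch glyphs can be read as contour shapes. The Python sketch below samples each shape as a short list of pitch targets that a synthesizer could interpolate between. The use of hertz, the linear glides, the half-sine circumflexes, and the size of the excursion `span_hz` are assumptions for illustration; the specification defines only the direction of pitch movement, not its size or curve.

```python
import math


def glide_contour(kind, start_hz, span_hz, n=5):
    """Sample n pitch targets (Hz) for one marked speech element.

    The dot of the glyph fixes start_hz; the tail fixes the direction of
    movement. span_hz is an assumed size for the pitch excursion.
    """
    if n < 2:
        raise ValueError("need at least two samples")
    ts = [i / (n - 1) for i in range(n)]  # normalized time from 0 to 1
    if kind == "upglide":                 # 20: rising pitch
        return [start_hz + span_hz * t for t in ts]
    if kind == "downglide":               # 22: falling pitch
        return [start_hz - span_hz * t for t in ts]
    if kind == "level_sustention":        # 26: unchanging pitch
        return [float(start_hz) for _ in ts]
    if kind == "circumflex_rise_fall":    # 24A: up to a peak, then down
        return [start_hz + span_hz * math.sin(math.pi * t) for t in ts]
    if kind == "circumflex_fall_rise":    # 24B: the reverse
        return [start_hz - span_hz * math.sin(math.pi * t) for t in ts]
    raise ValueError(f"unknown glyph: {kind}")
```

Both circumflexes return to the starting pitch at the end of the element; whether a real speaker does so is a modeling choice, not something the notation states.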

In the subsequent figures described below (FIG. 5 onward), a forward slash 36 through a letter indicates that the letter is only partially sounded, or "prepared", because the following consonant has a closely related or identical sound. In addition, a shallow U-shaped linking symbol 40, drawn like a hammock strung under and between the letters it links, indicates that letters separated from each other by other letters, and usually in adjacent words, should be sounded continuously with linked articulation. Use of linking symbol 40 to mark direct links is described in more detail below in connection with FIG. 9.

In general, according to one embodiment of the invention, when consonants are marked, a consonant preceding a vowel is sounded but is "not playable", because it is formed only very briefly as the voice flows directly into the vowel. Here, "playable" means that a desired prosodic effect can be created when the speaker, in articulating the consonant, uses a hold, a pause, or a change of pitch.

In this embodiment, silent consonants are not graphically marked but are left to the computer software. The last consonant before a pause for breath or interpretation is marked playable. The R trombone is not playable and is not marked, whether before any other consonant or in final position before a pause for breath or interpretation. This, too, is a feature the computer is programmed to understand.
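The general rules in the two preceding paragraphs can be sketched as a first pass over a phrase. In this Python sketch a phrase is a list of sound tokens, with `|` standing for a breath or interpretive pause. Treating upper-case letters as sounds, using `A E I O U` as the vowel set, and assuming silent consonants have already been removed are simplifying assumptions. Consonants that the general rules do not settle come back as `None`, to be decided by the instrument-specific rules.

```python
VOWELS = {"A", "E", "I", "O", "U"}  # simplified vowel set (assumption)


def playable_flags(tokens):
    """Apply the general playability rules to a tokenized phrase.

    Returns (consonant, flag) pairs: True = mark playable, False = never
    playable here, None = left to the instrument-specific rules.
    """
    flags = []
    for i, tok in enumerate(tokens):
        if tok == "|" or tok in VOWELS:
            continue
        # The end of the token list counts as a pause.
        nxt = tokens[i + 1] if i + 1 < len(tokens) else "|"
        if tok == "R":
            ok = False      # R trombone: never marked playable
        elif nxt in VOWELS:
            ok = False      # flows straight into the vowel
        elif nxt == "|":
            ok = True       # last consonant before a pause
        else:
            ok = None       # consonant before consonant: decided elsewhere
        flags.append((tok, ok))
    return flags
```

For instance, in `list("BAD")` the B runs into the vowel and is not playable, while the final D precedes the implied pause and is.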

Referring now to FIG. 3, the illustrated embodiment of prosodic graphical symbols includes marking consonants as follows:
a single underline marks playable percussion instruments, for example the timpani drumbeats D, B, and G and the snare-drum, bass-drum, and tom-tom drumbeats T, P, and K; and
a double underline marks the playable strings N, M, V, and Z, woodwinds L, NG, TH, and ZH, and voiceless sound effects F, S, SH, and th.
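The consonant families above amount to a lookup table from consonant to instrument family and underline count. In the Python sketch below, upper-case `TH` is taken to be the voiced sound and lower-case `th` the voiceless one, as the casing in the list suggests; that reading, and returning 0 for consonants outside the table, are assumptions. The table only says how to mark a consonant once it has been judged playable, not whether it is playable.

```python
# family name -> (member consonants, number of underlines when playable)
FAMILIES = {
    "timpani drumbeat": (["D", "B", "G"], 1),
    "snare/bass/tom-tom drumbeat": (["T", "P", "K"], 1),
    "string": (["N", "M", "V", "Z"], 2),
    "woodwind": (["L", "NG", "TH", "ZH"], 2),
    "voiceless sound effect": (["F", "S", "SH", "th"], 2),
}

# Flatten into consonant -> (family, underlines).
CONSONANT_TABLE = {
    c: (family, lines)
    for family, (members, lines) in FAMILIES.items()
    for c in members
}


def underline_for(consonant):
    """Underlines for a consonant already judged playable; 0 if not in the table."""
    _family, lines = CONSONANT_TABLE.get(consonant, (None, 0))
    return lines
```

R, which the specification treats as an unplayable trombone, falls outside the table and therefore gets no underline.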

Unmarked consonants are not playable; that is, they are not points at which, during articulation, a hold, pause, or change of pitch is to be used to create a desired prosodic effect.

Additional rules of the prosodic graphical notation of the invention for consonant blends include that the first letter of a consonant blend beginning a word is not marked. Consonant blends within a word can be marked as follows:

In the exemplary embodiment of graphical notation useful in practicing the invention illustrated in FIGS. 2 to 10, the letters, letter combinations, and digraphs described above as strings (N, M, V, and Z), woodwinds (L, NG, TH, and ZH), and voiceless sound effects (F, S, SH, and th) are marked "playable" with a double underline when they appear before any other consonant, provided the following consonant is neither the same consonant nor its cognate. When the same consonant or its cognate follows, the first consonant is marked "prepared" with a forward slash through it.
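The rule for the sustainable consonants can be sketched as a function of the consonant and the consonant that follows it. This Python sketch assumes that "cognate" means the voiced/voiceless partner made at the same place (V/F, Z/S, ZH/SH, TH/th, D/T, B/P, G/K); the specification does not list the pairs at this point, so the `COGNATES` table is an assumption.

```python
SUSTAINABLES = {"N", "M", "V", "Z", "L", "NG", "TH", "ZH", "F", "S", "SH", "th"}

# Assumed voiced/voiceless cognate pairs.
COGNATES = [("V", "F"), ("Z", "S"), ("ZH", "SH"), ("TH", "th"),
            ("D", "T"), ("B", "P"), ("G", "K")]


def are_cognates(a, b):
    """True if a and b form one of the assumed cognate pairs."""
    return (a, b) in COGNATES or (b, a) in COGNATES


def mark_sustainable(consonant, next_consonant):
    """Mark a string, woodwind, or sound-effect consonant before another consonant."""
    if consonant not in SUSTAINABLES:
        raise ValueError(f"{consonant!r} is not a string, woodwind, or sound effect")
    if next_consonant == consonant or are_cognates(consonant, next_consonant):
        return "prepared"          # forward slash through the letter
    return "double_underline"      # playable
```

For "ensnare", `mark_sustainable("N", "S")` and `mark_sustainable("S", "N")` both return `"double_underline"`, consistent with the FIG. 3 example in which the first N and the S are played.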

Desirably, the letters NG can be double-underlined where the G does not represent an oboe followed by a drumbeat. Where the part of a word that ends in NG shares the meaning of the whole word, the oboe letters are considered not to be followed by a drumbeat, as in the following examples:

FIG. 3 shows which consonants in s-blends occurring within various words are playable, that is, can be given a lengthened or emphasized pronunciation, or a musical tone, to enhance prosody. For example, in "whiskey" and "husky", the S is played but the K is not. The K is not silent; it is simply sounded quickly and easily, without a hold or pause. In "ensnare", the first N and the S are played, but the second N and the R are not. Two underlines with a "tail" 34 under a separate N, followed by single underlines for C and T, indicate that the playable N is played as an oboe but is followed by drumbeat consonants that are to be played, in this case a double-drumbeat consonant pair. A forward slash 36 with a ball 38 at its upper end, marked on the T of "dismantle", indicates that the TL is not played as a "wood-block click"; instead, the T is followed by the consonant L, which is playable, as noted by the double underline under the L.

As shown in FIG. 4, where NG represents an oboe followed by a drumbeat or other percussion instrument, the N is double-underlined with a tail 42, indicating that the N is playable as an oboe, while the G is single-underlined, indicating that it must be played as the G timpani drumbeat for the word to be articulated correctly and crisply. Also in FIG. 4, another word is marked with two underlines and a tail for the N, followed by an un-underlined G, indicating that the N is playable as an oboe but that the G must be articulated crisply as the "other percussion" cymbal DG for the word to be pronounced correctly.

A drumbeat can be marked playable with a single underline before a consonant that is unrelated to the drumbeat, that is, one made and felt at a different contact position of the tongue in the oral anatomy. Before an identical, cognate, or semi-related consonant, made so as to be felt at almost the same position, it is useful to mark the drumbeat "prepared" with a forward slash through it.
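The drumbeat rule turns on where the tongue makes contact, so a sketch needs a place-of-articulation table. The Python sketch below assumes a simplified General American table and treats "almost the same position" as "same place class"; both the table and that equivalence are assumptions, not definitions from the specification.

```python
# Assumed, simplified place-of-articulation classes.
PLACE = {
    "B": "bilabial", "P": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
    "TH": "dental", "th": "dental",
    "T": "alveolar", "D": "alveolar", "N": "alveolar",
    "L": "alveolar", "S": "alveolar", "Z": "alveolar",
    "SH": "postalveolar", "ZH": "postalveolar",
    "K": "velar", "G": "velar", "NG": "velar",
}
DRUMBEATS = {"D", "B", "G", "T", "P", "K"}


def mark_drumbeat(drum, next_consonant):
    """Mark a drumbeat consonant according to the consonant that follows it."""
    if drum not in DRUMBEATS:
        raise ValueError(f"{drum!r} is not a drumbeat")
    if next_consonant not in PLACE:
        raise ValueError(f"no place entry for {next_consonant!r}")
    if PLACE[drum] == PLACE[next_consonant]:
        return "prepared"           # identical, cognate, or semi-related
    return "single_underline"       # unrelated contact position: playable
```

Under this table T before D (cognates) and T before N (semi-related, same ridge contact) are both prepared, while K before M is playable.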

Referring now to FIG. 5, cymbals are marked "playable" with a single underline under each letter of the cymbal when they precede any other consonant except an identical or cognate one. Thus, for example, the DS in "heads" is playable in "heads back" but not in "heads south". In FIG. 5, as described above, direct links are marked with linking symbol 40: the DS in "heads back" is shown linked to the B by linking symbol 40, and the TS in "beats fast" is shown linked to the F by linking symbol 40.
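A sketch of the cymbal rule in Python follows, assuming that a cymbal's "identical" consonant is its closing sibilant S and that its cognate is Z. The specification names the exception class but does not enumerate it, so the `CYMBALS` and `COGNATE` tables are assumptions.

```python
CYMBALS = {"DS": "S", "TS": "S"}  # cymbal -> closing sibilant (assumed)
COGNATE = {"S": "Z", "Z": "S"}     # assumed voiced/voiceless partners


def mark_cymbal(cymbal, next_consonant):
    """Per-letter marks for a cymbal such as the DS in "heads".

    Each letter gets a single underline when the cymbal is playable;
    None means the cymbal is not marked playable.
    """
    final = CYMBALS[cymbal]
    if next_consonant in (final, COGNATE[final]):
        return [None] * len(cymbal)
    return ["single_underline"] * len(cymbal)
```

With these tables, "heads back" (`mark_cymbal("DS", "B")`) underlines both letters, while "heads south" (`mark_cymbal("DS", "S")`) leaves them unmarked, as in the FIG. 5 examples.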

Referring now to FIG. 6, the wood-block clicks DL and TL are marked as "playable" with a double underline when they occur before any other consonant, except a following initial L (because that L is already part of the wood-block click). Thus, for example, the DL in "middle" is playable in "middle school" but not in "middle life". As described above, the hammock 40 is marked to indicate a direct link. An "o" marked at the top of the forward slash 36 denotes a special version of the "prepare" mark, used only for wood-block clicks, indicating that the consonant is prepared and linked to the L. When an L follows, the L of the click is linked directly to that following L, so the end of the wood-block click may not be played as a sustainable L consonant.

Referring to FIG. 7, note that the consonant combinations GL, KL, BL, and PL are desirably not treated as wood-block clicks, because an indeterminate (unwritten) vowel is voiced between the two consonants. Thus, as shown, the L is playable but the preceding consonant is not.
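A minimal sketch of the wood-block click rules of FIGS. 6 and 7 follows. It assumes spelling-based matching and a silent final E (as in "middle"); the function names and the labels it returns, such as `"linked_to_next_l"`, are hypothetical names for the marks described above.

```python
# Hypothetical sketch of the wood-block click rules of FIGS. 6 and 7,
# matched on spelling for simplicity.

WOOD_BLOCKS = {"dl", "tl"}
NOT_WOOD_BLOCKS = {"gl", "kl", "bl", "pl"}  # an unwritten vowel intervenes
VOWEL_LETTERS = set("aeiouy")


def final_cluster(word: str) -> str:
    """Return the last consonant + L pair, ignoring a silent final E."""
    w = word.lower()
    if w.endswith("le"):
        w = w[:-1]  # "middle" -> "middl", so the pair reads "dl"
    return w[-2:]


def wood_block_mark(word: str, next_word: str) -> str:
    """Return a hypothetical label for how the word-final cluster is marked."""
    cluster = final_cluster(word)
    if cluster in NOT_WOOD_BLOCKS:
        return "l_only"  # L playable, preceding consonant not playable
    if cluster not in WOOD_BLOCKS:
        return "not_click"
    nxt = next_word.lower()
    if nxt.startswith("l"):
        return "linked_to_next_l"  # the click's L links directly to the next L
    if nxt and nxt[0] not in VOWEL_LETTERS:
        return "playable"  # double underline
    return "unmarked"  # case not covered by the rules described
```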

W, H, and Y are desirably not marked as playable, even when they occur in positions where other consonant instruments would be playable, because they are part of a vowel or diphthong, as shown in the examples below.

A useful notation according to the invention for the W and H of the common combination WH is to mark the letters "hw" above the WH, indicating that the H is sounded first and then the W, and that neither is played.

Referring to FIG. 8, when a "Y" or "W" sound occurs before another vowel, within a word or between words, Y and W connectives 50 and 52 are created to indicate that the continuity of the voice is maintained from one word to the next or from one syllable to the next. In this embodiment of the invention, the symbols used for the Y and W connectives 50 and 52 consist of a shallow, hammock-like U-loop running from beneath the Y or W to the following vowel, with a lowercase y or w marked in, near, or on the U. The U indicates the continuity to be maintained, and the letter y or w indicates the sound to be used, whether or not that letter is present in the written text input. For example, a y is sounded between the E and A of "create", and a w is sounded between the U and E of "cruel".
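One way a synthesizer front end might locate Y and W connectives is sketched below. It assumes, purely for illustration, that the glide can be chosen from the spelling of the first vowel (E or I gives y, O or U gives w). Spelling is only a rough proxy for vowel sounds, so digraphs such as "ee" or "ea" would produce false positives; the function `connectives` is hypothetical.

```python
# Hypothetical sketch of where Y and W connectives (50, 52) might be placed.
from __future__ import annotations

VOWELS = set("aeiou")
Y_TRIGGERS = set("ei")  # front vowels, as in the E of "create"
W_TRIGGERS = set("ou")  # rounded vowels, as in the U of "cruel"


def connectives(text: str) -> list[tuple[int, str]]:
    """Return (index, glide) pairs for vowel-to-vowel transitions in `text`.

    Spaces are skipped so the rule applies between words as well as inside them.
    """
    letters = [(i, c) for i, c in enumerate(text.lower()) if c.isalpha()]
    result = []
    for (i, a), (_, b) in zip(letters, letters[1:]):
        if a in VOWELS and b in VOWELS:
            if a in Y_TRIGGERS:
                result.append((i, "y"))
            elif a in W_TRIGGERS:
                result.append((i, "w"))
    return result
```

For example, `connectives("create")` flags index 2 (the E) with a y, and `connectives("go on")` flags the first O with a w across the word boundary.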

Referring now to FIG. 9, as described in co-pending patent application No. 10/334,658 by Addison et al., and in more detail in Arthur Lessac's book, the Lessac system identifies several ways in which consonants are linked to one or more additional letters or phonemes in a word or phrase when spoken. FIG. 9 shows some examples of how the desired pronunciation of such linked words is shown graphically in accordance with the invention.

FIG. 9 illustrates three kinds of spoken word linking used in the Lessac voice system: so-called "direct linking", "play and link", and "prepare and link".

In direct linking, the final consonant of one word is linked directly to the initial vowel of the next word. For example, "far above" is pronounced as the single word "farabove".

"Play and link" applies where there are two adjacent consonants made at different places in the mouth, such as a "k" followed by a "t". The first consonant, here the "k", is fully "played" (sounded or voiced); that is, it is completed before moving on to the second consonant, here the "t".

Prepare and link is used where there are two adjacent consonants made at the same place in the mouth or very close to each other, such as a "b" followed by another "b" or by a "p", as in "grab boxes" or "keep back". In this case the first consonant, or "drumbeat", is prepared; that is, it is not completed before moving on to the second drumbeat, which follows after a slight hesitation.
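The three link types can be sketched as a classifier over the place of articulation of the two sounds at a word boundary. The place table `PLACE` below is a simplified assumption (it treats only identical places as "same or nearby" and omits many consonants), and `link_type` is a hypothetical name.

```python
# Hypothetical sketch classifying the link at a word boundary into the
# three Lessac link types of FIG. 9, using an assumed place-of-articulation table.

PLACE = {
    "p": "lips", "b": "lips", "m": "lips",
    "f": "lip-teeth", "v": "lip-teeth",
    "t": "gum-ridge", "d": "gum-ridge", "n": "gum-ridge",
    "s": "gum-ridge", "z": "gum-ridge", "l": "gum-ridge",
    "k": "soft-palate", "g": "soft-palate",
}
VOWELS = set("aeiou")


def link_type(word: str, next_word: str) -> str:
    last, first = word.lower()[-1], next_word.lower()[0]
    if last in VOWELS:
        return "none"  # the link types above concern a final consonant
    if first in VOWELS:
        return "direct"  # e.g. "far above" -> "farabove"
    if last in PLACE and first in PLACE:
        if PLACE[last] == PLACE[first]:
            return "prepare-and-link"  # same place, e.g. "grab boxes"
        return "play-and-link"  # different places, e.g. final k before t
    return "unknown"  # consonant not in the simplified table
```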

The prosodic graphical notation used to indicate direct linking, shown in the top row of FIG. 9, will be apparent from the examples: it comprises a link symbol 40 drawn beneath and between the linked letters, typically joining one or more letters at or near the end of one word to a letter at or near the beginning of the next word. Direct linking indicates that the spoken momentum should carry from one linked letter to the next without any break, pause, or rest between the words.

In the play-and-link example shown in the middle row of FIG. 9, the first consonant is played but the second is not. Accordingly, the link symbol 40 is combined with a single or double underline beneath the first consonant.

In the prepare-and-link example in the bottom row of FIG. 9, the link is indicated by a forward slash on the first consonant, which is prepared, combined with a link symbol 40 joining it to the second consonant. In addition, as described above, playable consonants are underlined.

Referring now to FIG. 10, which shows two sets of prosodic graphical symbols: Example 1 is relatively simple and economical to implement, while Example 2 is more sophisticated, designed to facilitate the creation of high-quality synthesized speech output suitable for applications such as, without limitation, spoken-word books and magazines, drama, and other entertainment. The more detailed notation of Example 2 also reduces the speaker-to-speaker variability that can arise even when trained speakers are used, thereby promoting consistency of output.

The notation of Example 1 may be suitable for, without limitation, industrial applications such as voice communication with appliances, vehicles, manufacturing machines, low-end games, entertainment devices, and the like. Of course, either notation may be used for other purposes if desired.

In FIG. 10, Example 1 and Example 2 are shown applied to the same text on alternating lines for side-by-side comparison. As a comparison of the first two lines of FIG. 10 shows, in several combinations of "heads" with another linked word, the additional preparation marked by the forward slash 36 on the D of the DS combination provides a subtler and more appealing sound. In each case continuity from "heads" to the following word is maintained, but in Example 2 the D is heard more clearly because it is prepared in accordance with the additional mark. In a rendering that follows Example 1, the D may disappear.

Referring now to FIGS. 11 and 12, it will be understood that the invention provides and uses a set of graphical symbols that can indicate, or serve as a template for, attractive prosodic speech output in one or another of markedly different styles. FIG. 11 shows markup for rendering part of the Gettysburg Address in a prosodic style referred to as "reportorial", while the style shown in FIG. 12 is a more emotional, "human interest" style.

The text of the Address is rendered in text lines, such as text line 10. Above each text line, in notation lines such as notation line 12, the text is marked using Lessac structural and tonal energy vowel pronunciation notation and consonant energy pronunciation notation, including the marks described above for percussive and sustainable-tone consonants, wood-block clicks, and link symbols. The text is also marked with prosodic graphical symbols as described above, including individual letter underlines, rising curves, falling curves, hammocks, and the like, placed beneath the text so as not to interfere with the pronunciation notation. In addition, a so-called Y-buzz line 60 is added above the notation line 12, and further prosodic symbols are marked above this Y-buzz line 60. As described, for example, from page 122 onward in Arthur Lessac's book, the Y-buzz is the basis of the tonal vibrations conducted through the bones in a speaker's or singer's voice.

The desired intonation pattern is marked in a so-called prosodic pitch chart above the Y-buzz line 60, using small dots 62 and large dots 64 whose level relative to the Y-buzz line 60 indicates the desired pitch relative to the speaker's Y-buzz pitch. Dot size indicates the desired stress, that is, the relative amplitude of the indicated relative pitch: a small dot 62 indicates no extra stress, and a large dot 64 indicates that additional stress is desired. Optionally, dot size may indicate the desired degree of stress proportionally. Although it may be possible to speak at frequencies below the Y-buzz line 60, control of vocal tone and articulation may then be inadequate for a voice played as a controllable instrument.
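A pitch-chart dot might be represented as in the following sketch. Measuring a dot's height in steps above the Y-buzz pitch, treating one step as one equal-tempered semitone, clamping targets at the Y-buzz pitch, and using a 3 dB boost for a large dot are all assumptions made for illustration; `PitchDot`, `dot_to_hz`, and `dot_to_gain_db` are hypothetical names.

```python
# Hypothetical sketch of one prosodic pitch chart entry and its acoustic targets.
from dataclasses import dataclass


@dataclass
class PitchDot:
    syllable: str
    steps_above_ybuzz: float  # 0 means on the Y-buzz line 60
    large: bool  # large dot 64 = extra stress; small dot 62 = none


def dot_to_hz(dot: PitchDot, ybuzz_hz: float) -> float:
    """Convert a dot's height to a target frequency, never below the Y-buzz."""
    steps = max(dot.steps_above_ybuzz, 0.0)  # assumed clamp: control suffers below
    return ybuzz_hz * 2 ** (steps / 12)  # assumed: one step = one semitone


def dot_to_gain_db(dot: PitchDot, extra_stress_db: float = 3.0) -> float:
    """Map dot size to a relative amplitude boost (3 dB is a placeholder)."""
    return extra_stress_db if dot.large else 0.0
```

For a speaker whose Y-buzz pitch is taken to be 110 Hz, a dot twelve steps up maps to 220 Hz under these assumptions.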

Also in FIGS. 11 and 12, a circled letter B marks errors made by voice practitioner B when pronouncing the text according to the markup. Errors are identified by other voice practitioners, who listen to the recording and note where the rendering deviates from the markup it should follow. For example, in FIG. 12 practitioner B made two renderings that differed from those the markup required. The first was a failure to pronounce the drumbeat consonant D at the end of the word "engaged" (text line 4, word 4). The second was a failure to play the rising curve on the E and the drumbeat consonant D at the end of the word "dedicated" (text line 5, word 5).

The additional emotion and energy conveyed by the human-interest markup of FIG. 12 become readily apparent from a careful comparison of its prosodic pitch chart with the reportorial markup of FIG. 11. For example, many of the stress dots in FIG. 12 are larger dots 64, calling for greater stress or emphasis. Also, the rising curve 20 over "ago" in line 1 rises above the Y-buzz line 60, indicating a desired higher pitch. In the lowest line, the V of "conceived" is given an extra stress dot 64 and the circumflex 24B is not used. Other differences will be apparent on inspection of the figures.

It will be understood from FIGS. 2-10, and especially from FIGS. 11 and 12, that the invention provides a comprehensive text markup system. The system can yield a novel instruction or control document that embodies sophisticated pronunciation and prosodic symbols together with ordinary text, as an overlay that does not split or interrupt that text, and that provides a blueprint for an accurate, intelligible, appealing, and even melodious rendering of the text by a human voice or as machine speech.

Other graphical symbols suitable for marking text for speech, and for practicing the invention to achieve its objects, will be apparent to those skilled in the art from this disclosure. For example, various geometric symbols, schemes of geometric symbols, or animated graphemes may be used. However, simple symbols such as those described herein are intuitive, easy to apply when marking up scripts or other text, and, more importantly, readily understood by a trained speaker reading the marked text.

The prosodic graphical symbols shown in FIGS. 2-4 and other figures, and described herein, are used in various ways to promote human-sounding synthesized speech output, particularly formant speech output. For example, the notation is used by one or more persons, preferably several, who are trained to pronounce text accurately according to the markup voice codes described herein, in order to create a database of pronounced speech. The database contains pronounced speech that exemplifies accurate adherence to text marked with these voice codes. Alternatively or additionally, the prosodic graphical symbols of the invention can be rendered digitally and used in synthesizer software for the electronic markup of machine-spoken text, to facilitate or guide the introduction of prosodic elements, in the digital domain, into the output speech. The recorded speech database, corresponding to the graphical notation for letters, words, phrases, sentences, paragraphs, and longer texts, is digitized and analyzed to derive algorithms and other metrics that specify unique relationships between specific speech data and the specific text, with its associated graphical notation, to which that data corresponds. These can then be used to provide input parameters to a synthesizer, to re-create sounds that mimic human speech for particular text to be synthesized as speech with the specified prosody.

For ease of discussion, each text unit together with its associated voice-code graphical notation is regarded as a "grapheme", and each acoustic unit corresponding to a grapheme is identified as a "phoneme". An extended set of several hundred or several thousand (where "several" means "at least two") or more grapheme/phoneme pairs, associating pronunciations with pitch, amplitude, and the prosodic graphical notation of the invention, can be rendered digitally and used in synthesizer software for the electronic markup of machine-spoken text, to facilitate or guide the introduction of prosodic elements, in the digital domain, into the output speech.
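The grapheme/phoneme pairing can be pictured as a lookup table keyed on a text unit plus its marks. The sketch below uses hypothetical class and field names (`Grapheme`, `Phoneme`, `GraphemePhonemeTable`, `pitch_hz`, `amplitude`, `duration_ms`); it shows the data shape only, not how the acoustic values would be derived from the recorded database.

```python
# Hypothetical sketch of a grapheme/phoneme pair table for a synthesizer.
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Grapheme:
    text: str  # text unit, e.g. "ds"
    marks: tuple[str, ...]  # prosodic notation, e.g. ("single_underline", "link")


@dataclass(frozen=True)
class Phoneme:
    label: str
    pitch_hz: float
    amplitude: float  # relative, 0.0-1.0
    duration_ms: float


class GraphemePhonemeTable:
    """Maps each marked text unit to the acoustic unit a synthesizer should use."""

    def __init__(self) -> None:
        self._pairs: dict[Grapheme, Phoneme] = {}

    def add(self, g: Grapheme, p: Phoneme) -> None:
        self._pairs[g] = p

    def lookup(self, g: Grapheme) -> Phoneme | None:
        # The same letters with different marks are a different grapheme.
        return self._pairs.get(g)

    def __len__(self) -> int:
        return len(self._pairs)
```

Because the marks are part of the key, "ds" with a single underline and unmarked "ds" can map to different acoustic units.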

Those skilled in the art will understand that the particular prosodic graphical symbols shown in FIGS. 2-4 are merely illustrative, and that prosodic graphical symbols useful for practicing the invention according to the teachings herein may take many other forms. Further, the particular examples of symbols shown are adapted to the Lessac voice system. If desired, other prosodic graphical symbols may be used to implement other voice coaching or training methods, or to implement the Lessac system in other ways that help make machine speech more human in accordance with the invention, as those skilled in the art will understand from the teachings herein.

According to one embodiment of the invention, guidelines such as those described in the following paragraphs may be observed when preparing prosodic markup such as that shown in FIGS. 11 and 12.

Preparing the Marked Script
When preparing a marked script to be spoken, as illustrated in FIGS. 11-12, good page layout benefits the speaker. It makes it easier to take in the added symbols at the same time as the text, which helps guide the speaker, in accordance with the invention, toward consistent and attractive speech output. Such output is useful for building a database and may also be useful for computerized speech synthesis.

The text should preferably be generously spaced, for example with 3 cm or more above each line, to accommodate the pronunciation notation and prosodic graphical symbols to be added. A relatively large font helps accommodate the various notations and symbols and makes the marks easier to read and interpret accurately while speaking aloud. A font such as 14-point Lucida Bright semibold is one example of a suitable font.

In one embodiment of marking up a script according to the invention, every line of the script, including the last line on the page, ends with a vertical mark indicating a break or breath. Numbers are written out. Acronyms are written out in full if they are spoken as words rather than as letters. Usefully, a reference dictionary is designated for consistent pronunciation, for example Merriam-Webster's Collegiate Dictionary, 10th edition.

The reference dictionary offers a choice of pronunciations. The first pronunciation listed is used, but where the recording is not formal or "elevated" speech, the "shortened" pronunciation is used if one is listed.

An intonation pattern is the general step-by-step movement of pitch in connected speech. An inflection is a gliding change of pitch within a vowel or consonant, such as a so-called rising curve, level sustention, falling curve, or circumflex.

Usefully, in this embodiment of script preparation, the speaker explores the script aloud to find the intonations and inflections that convey the meaning of each sentence.

Two vertical lines are marked on the script to indicate a breath pause, and a single vertical line marks a break without a pause. Both affect how the final consonant of each resulting segment is handled.
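The two vertical-bar marks can be parsed mechanically. This sketch assumes the marks appear in the script as literal `|` and `||` characters; `split_marked_line` is a hypothetical helper. Text after the last bar is ignored, consistent with the convention that every script line ends with a vertical mark.

```python
# Hypothetical sketch: split a marked script line on its vertical-bar marks.
from __future__ import annotations

import re


def split_marked_line(line: str) -> list[tuple[str, str]]:
    """Return (segment_text, boundary) pairs; boundary is 'breath' or 'break'."""
    # "||" is tried before "|" so a double bar is captured as one mark.
    parts = re.split(r"(\|\||\|)", line)
    segments = []
    # With a capturing group, re.split alternates text, mark, text, mark, ...
    for text, bar in zip(parts[0::2], parts[1::2]):
        boundary = "breath" if bar == "||" else "break"
        segments.append((text.strip(), boundary))
    return segments
```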

In one useful embodiment of the invention, after these preparations, the consonants are marked first and then the vowels, for pronunciation and prosody as described above.

The next step is to draw the Y-buzz pitch line 60 immediately above the vowels, without obscuring them, to provide a reference for marking pitch. The pitch range of interest runs from the low Y-buzz range (that is, the Y-buzz line 60) up to the mid-call range, in which undistorted vowels can be voiced at mid-call, for example #3, R, and +Y, and optionally #4 and N.

If desired, additional pitch lines (not shown), for example a lower-middle register line and a true-middle register line, may be drawn above the Y-buzz line 60. Intonation dots 62, 64 or other suitable marks may be placed both on and between the pitch lines. If desired, other methods of indicating pitch range may be used, for example the three-line staff used to define pitch range in Daniel Jones's book "An Outline of English Phonetics".

To practice the illustrated embodiment described herein, the voice system practitioner should be a Lessac practitioner who, using the Y-buzz line 60 as a guide to his or her own pitch range, records what is perceived as the desired intonation and inflection according to the respective voice system and the desired prosody. As noted above, a dot may be marked for every syllable: small individual dots such as 62 for unstressed syllables, and very large dots 64 for stressed syllables.

For a human-interest recording such as that illustrated in FIG. 12, voice features known in the Lessac system as "structural NRG" and "focused tone" are used more abundantly. These may be marked by drawing circles around the larger dots or, if desired, by another suitable graphical indication of call focus on words that contain Lessac basic call words without vowel distortion. The terms used in this paragraph belong to the Lessac voice or speech system and will be understood from Arthur Lessac's book.

For example, as described in that book, structural NRG is a state of kinesthetic vocal energy ("NRG" in Lessac's terminology) related to facial expression and to the form, shape, and size of the vocal resonating chambers. Structural NRG is considered to relate to the color, warmth, and beauty of vocal tone.

Operative words may be marked with a double accent mark, for example before the primary stressed syllable, while secondary stressed syllables may take a single accent mark.

Here, an operative word is the word, in each successive phrase or other sense-group division of the text, that introduces a new idea carrying the argument forward as the sentence progresses.

According to this aspect of the invention, where careful and consistent voice recordings are to be made for a text-to-speech synthesis database, each "sense group" of words within the break and breath-pause divisions of the text desirably has an identified operative word. Occasionally there may be two operative words of equal importance.

Operative words can be identified using the vocal dynamics of Lessac training in various ways: for example, by marking them to be spoken at a higher pitch, by substantially lengthening their vowels and consonants, by adding focused tone or call resonance, or by a combination of these dynamics.

In one exemplary embodiment of the invention, the introduction of an argument begins with a declarative sentence in which all words except connecting words may or may not be marked as carrying approximately the same amount of stress or emphasis. The pitch may rise on the first emphasized word and then step down through the rest of the sentence to the Y-buzz range, with a falling curve on the last stressed syllable.
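The declarative pattern can be sketched as a function that assigns each word a level above the Y-buzz line. The starting level of 1, the peak of 4 steps, the one-step descent per later emphasized word, and landing at level 0 on the last emphasized word are all illustrative assumptions. The text places the falling curve on the last stressed syllable, which this hypothetical `declarative_contour` approximates with the last emphasized word.

```python
# Hypothetical sketch of a declarative pitch contour relative to the Y-buzz line.
from __future__ import annotations


def declarative_contour(words: list[str], emphasized: set[int],
                        peak_steps: int = 4) -> list[tuple[str, int, str]]:
    """Return (word, steps_above_ybuzz, inflection) for each word."""
    if not emphasized:
        return [(w, 0, "") for w in words]
    first, last = min(emphasized), max(emphasized)
    contour = []
    level = 1  # assumed starting level just above the Y-buzz line
    for i, w in enumerate(words):
        if i == first:
            level = peak_steps  # pitch rises on the first emphasized word
        elif i == last:
            level = 0  # arrive in the Y-buzz range for the final fall
        elif i in emphasized:
            level = max(level - 1, 1)  # step down on each later emphasis
        inflection = "falling" if i == last else ""
        contour.append((w, level, inflection))
    return contour
```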

In a script marked up for speech according to the invention, various inflections may be used with punctuation marks, as follows. Periods and semicolons may take a falling curve 22 on the last emphasized word. Commas and colons may take a rising curve 20 or a level sustention 26. Questions beginning with a question word (for example, who, what, where, when, how, or why) take a falling curve 22 on the last emphasized word; other questions, typically those expecting a "yes" or "no" answer, take a rising curve on the last emphasized word.
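The punctuation rules above map directly onto a small function. This is a sketch only: `inflection_for` is a hypothetical name, and it inspects just the final character and the first word, whereas a real system would also need to locate the last emphasized word that carries the inflection.

```python
# Hypothetical sketch mapping end punctuation to the inflection placed on the
# last emphasized word, following the rules listed above.

QUESTION_WORDS = {"who", "what", "where", "when", "how", "why"}


def inflection_for(sentence: str) -> str:
    s = sentence.strip()
    if not s:
        return "none"
    mark = s[-1]
    if mark in ".;":
        return "falling"  # falling curve 22
    if mark in ",:":
        return "rising or level"  # rising curve 20 or level sustention 26
    if mark == "?":
        first_word = s.split()[0].lower().strip("\"'(")
        if first_word in QUESTION_WORDS:
            return "falling"  # question-word questions
        return "rising"  # typically yes/no questions
    return "none"
```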

Other Voice Training Systems
As those skilled in the art will appreciate, the claimed invention may also be practiced in embodiments that employ rules, or voice training principles or practices, other than the Lessac method. One such example is the method of Kristin Linklater of the theater department at Columbia University. Information on Kristin Linklater's techniques, and on the techniques of other voice practitioners in the field whose rules may be used in practicing the invention if desired, can be found at www.columbia.edu/cu/news/media/00/kLinklater/ and www.kristinlinldater.com.

Prosodic Speech Rules and Their Application
The prosodic speech rules that can be used in the invention describe the articulation and coarticulation of a language and its various dialects. The example language referred to herein is American English, in its general educated dialect; it will be understood that other languages may also be used. Prosodic speech rules, at least some of which are known or can be derived from the Lessac texts, are applied to the text, which is then pronounced or synthesized using the novel acoustic codes described herein, so that a speaker familiar with the Lessac system can read the text aloud with the controlled pronunciation prescribed by the appropriate Lessac rules.

Examples of rules include prosody definitions that incorporate the use of random pauses, as described in one or more of the commonly owned applications, modified by breath pauses; rhythm; intonation patterns; word stress; word choice; and consonant "blends". All of these are derived directly from the text to be pronounced. These prosodic speech rules can also be adapted to other dialects and languages.

The acoustic markup codes used in the present invention can indicate how particular speech sounds are produced and which speech variables are used to produce them. The text to be pronounced, together with optional specified values for one or more variables of each code, can serve as prosodic instructions for a human speaker trained to follow the codes. In accordance with the present invention, the same or similar code variables, or their machine equivalents, are used to instruct a computer synthesizer to pronounce the text according to those prosodic instructions. The codes that control the production of sounds and variables represent quantifiable identifying information relating to the desired sound characteristics.

Speech variables that can be coded by the method of the present invention include the audible frequency, amplitude, pitch, and duration of the sound elements synthesized to represent particular phonemes or other speech elements. Specific variables that can be quantified to desired values include:
- the fundamental frequency of the voice;
- the upper and lower limits of the controllable pitch range;
- pitch change, expressed as change in frequency per unit time;
- amplitude change per unit time;
- combinations of amplitude and pitch change per unit time.
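The record below makes that list of quantifiable variables concrete for a single code. Pitch and amplitude change linearly over the sound element, and pitch is clamped to the controllable range. The class name, field names, units, and the linear trajectories are assumptions for illustration; the specification does not prescribe a data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcousticCode:
    """Hypothetical record of the quantifiable variables one code could carry."""
    f0_hz: float                 # fundamental voice frequency at onset
    pitch_floor_hz: float        # lower limit of the controllable pitch range
    pitch_ceiling_hz: float      # upper limit of the controllable pitch range
    pitch_slope_hz_per_s: float  # pitch change per unit time
    amp_slope_db_per_s: float    # amplitude change per unit time
    duration_s: float            # duration of the sound element

    def pitch_at(self, t: float) -> float:
        """Linear pitch trajectory, clamped to the element and to the pitch range."""
        t = min(max(t, 0.0), self.duration_s)
        raw = self.f0_hz + self.pitch_slope_hz_per_s * t
        return min(max(raw, self.pitch_floor_hz), self.pitch_ceiling_hz)

    def amplitude_db_at(self, t: float, onset_db: float = 0.0) -> float:
        """Linear amplitude trajectory in dB, relative to the onset level."""
        t = min(max(t, 0.0), self.duration_s)
        return onset_db + self.amp_slope_db_per_s * t


# A rising, slightly decaying element: 120 Hz rising at 200 Hz/s for 250 ms.
rising = AcousticCode(120.0, 90.0, 180.0, 200.0, -6.0, 0.25)
print(rising.pitch_at(0.25))  # 170.0
```

Combined amplitude-and-pitch changes are expressed here simply by setting both slopes on the same code.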

One useful relationship between a speech rule, an acoustic markup code, and one or more variable values is described here as an example; other possibilities will be apparent to those skilled in the art.

Both the spaces between words and commas within a sentence represent pauses in producing speech sounds. However, each kind of pause has a different character. According to one embodiment of the present invention, that character can be indicated by a different acoustic code for each kind of pause.

Pauses are useful to the listener. They make individual words easier to recognize, and they provide separation that helps the listener identify phrases. Each pause involves time as a variable. That time value, or duration, is generally measured in milliseconds of relative silence between the sounds bounding the pause, and it can vary with circumstances.

Where the written text has no commas, the pauses between words are part of the tempo of speech. They may then be determined by the overall rate of speech and by the rhythmic variation that crisply articulated words need. Those words bound each pause and are contained within the complete phrase.

Pauses may therefore be determined contextually by the prosody of the speech, for example excited, serious, reportorial, poetic, persuasive, or other prosody. A comma in the corresponding text indicates a phrase separation. The corresponding pause duration, that is, the time without utterance when the text is spoken, can vary from speaker to speaker according to prosody and other factors.

In natural human speech this pause is not a single value but varies in length. Sometimes it is longer, to take a fresh breath, to provide additional emphasis, or to serve as a point/counterpoint to the rhythm of the whole paragraph containing the sentence and its phrases. People tend to vary the length of pauses between phrases. If a machine rendering of a paragraph turns a speaker's varying pause durations into a constant millisecond value, the resulting speech is therefore likely to be perceived as machine speech rather than human speech.
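The idea of varied rather than constant pauses can be sketched as follows. Each pause gets a base duration for its kind (word space or comma), scaled for the prosody and randomly jittered. Every third comma pause is lengthened as a breath pause. All numeric values, prosody scale factors, and the breath-pause policy are hypothetical placeholders, since the text gives no figures.

```python
import random

# Hypothetical base durations in milliseconds; the source specifies none.
BASE_PAUSE_MS = {"word": 40.0, "comma": 250.0}
# Hypothetical tempo scaling per prosody (smaller = faster speech).
PROSODY_SCALE = {"reportorial": 1.0, "excited": 0.7, "serious": 1.3}


def pause_durations(pause_types, prosody="reportorial", jitter=0.25,
                    breath_every=3, breath_extra_ms=300.0, seed=None):
    """Return one duration (ms) per pause, varied instead of constant.

    Each pause is its kind's base value, scaled for the prosody and then
    multiplied by a random factor in [1 - jitter, 1 + jitter]. Every
    `breath_every`-th comma pause gets `breath_extra_ms` added for a breath.
    """
    rng = random.Random(seed)
    scale = PROSODY_SCALE[prosody]
    durations = []
    commas_seen = 0
    for kind in pause_types:
        ms = BASE_PAUSE_MS[kind] * scale * rng.uniform(1 - jitter, 1 + jitter)
        if kind == "comma":
            commas_seen += 1
            if breath_every and commas_seen % breath_every == 0:
                ms += breath_extra_ms  # longer pause to take a fresh breath
        durations.append(round(ms, 1))
    return durations


print(pause_durations(["word", "word", "comma", "word", "comma"], seed=7))
```

With `jitter=0.0` and breaths disabled, the function reproduces the constant-pause behavior that the text identifies as sounding mechanical.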

Example Script Marking Procedure
A team of four certified Lessac practitioners works with the 1,000 most frequently used words and 500 phrases and sentences of American English. This work creates the graphical symbol set and provides acoustic data on accurate, rule-conforming pronunciation for preparing an example prosodic acoustic library.

The practitioners review and refine the prosodic speech rules used. Desirably, marking instructions and a notation are developed for each rule, and a notation for prosody can also be developed. The rules are then applied to the sample words and sentences.

In one exemplary embodiment of a script marking procedure according to the present invention, each voice practitioner marks the words and sentences of a script for pronunciation according to the prosodic speech rules. The script is formatted as described herein. Usefully, it can contain at least about 1,000 words and 500 phrases that broadly represent the language of the text to be converted to speech. If desired, the words and phrases in the script can be limited to a specialized subset of the language, such as medical or scientific vocabulary or a regional dialect.

Each practitioner's marking is then checked by another team member to identify errors in applying the prosodic speech rules. An error-free, reconciled final marking of the 1,000 words and 500 phrases and sentences was prepared.
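Whether a script "broadly represents" the language can be checked roughly by comparing it with the most frequent words of a reference corpus. The sketch below assumes a plain-text corpus is available and uses a deliberately crude lowercase word tokenizer. The function names and the coverage measure are our own illustration, not part of the described procedure.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z']+")  # crude tokenizer: lowercase letters and apostrophes


def top_words(corpus_text, n):
    """The n most frequent words of the corpus, most frequent first."""
    counts = Counter(WORD_RE.findall(corpus_text.lower()))
    return [word for word, _ in counts.most_common(n)]


def script_coverage(script_text, corpus_text, n=1000):
    """Share of the corpus's top-n words that appear in the script,
    plus the missing words in frequency order."""
    wanted = top_words(corpus_text, n)
    if not wanted:
        return 1.0, []
    have = set(WORD_RE.findall(script_text.lower()))
    missing = [word for word in wanted if word not in have]
    return 1 - len(missing) / len(wanted), missing
```

The missing-word list gives the practitioners a concrete set of words to add before the script is marked.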

Using the reconciled final marking, each practitioner desirably reads aloud a sampling of words and sentences from the marked script. One or more of the other practitioners listen to the pronunciation and check for errors in following the prosodically marked text. This technique can be used to rehearse one or more speakers before a recording session or other spoken presentation.

The next step is to prepare recordings for building an acoustic database for speech synthesis according to the present invention. The script of words and sentences to be pronounced in a studio recording session is marked with a baseline reportorial prosody, for example as shown in FIG. 11. Each practitioner receives a copy of the final reconciled marking of the reportorial script. Each also receives a subset of the same sentences marked for a second prosody.

For the recording sessions, the practitioners use a studio with a "dry room" recording environment. The studio desirably meets precisely defined standards for analog/digital sampling rate and acoustic quality. In each studio session, an audio CD or other analog recording of each practitioner's pronunciation is prepared. A data CD or DVD is prepared alongside it, capturing the recorded pronunciations as WAV or other data files.

To ensure data quality, each practitioner's audio CD is given to another practitioner. That practitioner listens to the pronunciations and compares them with a complete, correctly marked copy of the script, noting any pronunciation that does not follow the marking. Where an error is found, that pronunciation is desirably removed from the WAV database, so that the database retains only correct articulation, intonation, and prosodic elements.
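The review step amounts to a filter over the recorded takes. Any take that a reviewer flagged is set aside, and only unflagged takes go into the WAV database. The `Take` record, its fields, and the shape of the reviewer reports are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Take:
    """One recorded pronunciation of one script item (hypothetical record)."""
    practitioner: str
    item_id: str   # identifier of the word, phrase, or sentence in the script
    wav_path: str  # hypothetical location of the recorded WAV file


def filter_takes(takes, error_reports):
    """Split takes into (kept, rejected) using reviewer error reports.

    error_reports maps (practitioner, item_id) to a list of reviewer notes.
    A missing key or an empty list means the take followed the marking.
    """
    kept, rejected = [], []
    for take in takes:
        notes = error_reports.get((take.practitioner, take.item_id), [])
        if notes:
            rejected.append(take)  # excluded from the WAV database
        else:
            kept.append(take)
    return kept, rejected
```

Keeping the rejected list, rather than discarding it, lets a practitioner re-record exactly the flagged items.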

Following these marking, speaking, and recording procedures yields a relatively error-free digitized database library of speech elements. The library can include phonemes, words, phrases, and sentences of a language or language subset, all conforming to the input pronunciation and prosodic rules. This allows a degree of consistency, so that a speech element library prepared by one group of practitioners can be compared with a similar library prepared by another group of similarly trained practitioners.

Prosodic Acoustic Library
For prosodic speech rules to be applied effectively to computerized speech, the present invention provides a graphical symbol set. The set uniquely links each prosodic speech rule to the particular text being pronounced and to the corresponding speech data when that text is pronounced correctly. A specific prosodic acoustic library is prepared for each language and its most widely used dialects.

Each specific prosodic acoustic library is envisaged to include:
- a comprehensive dictionary;
- the prosodic speech rules;
- graphical markup symbols representing the rules;
- examples of speech data for pronunciations that correctly follow the rules, together with the text of those examples.

A comprehensive prosodic acoustic library for a specific language and dialect is the foundation for deriving, and thus specifying, formant parameter values for articulation. Those values are uniquely associated with the applied prosodic speech rules.

An example prosodic acoustic library database according to one embodiment of the present invention includes the following:
a) A selection of words and sentences representing the text to be synthesized into speech.

b) A set of rules for computerized marking of text for intelligible pronunciation. These can include consonant, vowel, co-articulation, and pause rules.

c) Prosodic rules for the two prosodies described herein, "reportorial" and "human interest". These prosodies are marked, pronounced, and included. The rules specify values that vary over time, such as changes in pitch, volume, rhythm, speaking tempo, and word stress.

d) A copy of the reconciled manual marking of the prosodic speech rules as applied to the text.

e) Pronunciations of the marked text. These are WAV data files from the four practitioners covering all words and sentences pronounced in the "reportorial" prosody, plus examples of several sentences pronounced in the "human interest" prosody.

An example prosodic acoustic library database structure desirably includes WAV data, text, graphics, and numerical data. Example software statements, source-code modifications, and synthesizer specification values can also be added. One example prosodic acoustic library database can contain about 8 to 12 gigabytes of data.

Commercially available off-the-shelf relational databases cannot currently combine WAV data with text, graphics, audio CD, and numerical data. The present invention can therefore use a temporary database structure to validate a product design that combines WAV data with text, graphics, and numerical data. If desired, the architecture for assembling, storing, and processing the database components can be refined in light of the results obtained with the temporary structure. This can be useful in assembling a comprehensive database library containing text, graphics, audio, and numerical data.
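The sketch below shows one kind of temporary prototype structure. A single SQLite file keeps each take's raw WAV bytes in a BLOB column next to the item text, its marked version, the rules it exercises, and a numeric measurement. SQLite, the table layout, and every name here are our assumptions for illustration. This is not the structure used in the source.

```python
import sqlite3

# Hypothetical schema; table and column names are illustrative only.
SCHEMA = """
CREATE TABLE rule (rule_id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                   symbol TEXT NOT NULL);
CREATE TABLE item (item_id INTEGER PRIMARY KEY, text TEXT NOT NULL,
                   marked_text TEXT NOT NULL, prosody TEXT NOT NULL);
CREATE TABLE item_rule (item_id INTEGER REFERENCES item(item_id),
                        rule_id INTEGER REFERENCES rule(rule_id));
CREATE TABLE take (take_id INTEGER PRIMARY KEY,
                   item_id INTEGER NOT NULL REFERENCES item(item_id),
                   practitioner TEXT NOT NULL,
                   wav BLOB NOT NULL,   -- raw WAV file bytes
                   f0_mean_hz REAL);    -- example numeric measurement
"""


def open_library(path=":memory:"):
    """Create (or open) the prototype library and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_item(conn, text, marked_text, prosody, rule_ids):
    """Store one script item and link it to the rules it exercises."""
    cur = conn.execute(
        "INSERT INTO item (text, marked_text, prosody) VALUES (?, ?, ?)",
        (text, marked_text, prosody))
    item_id = cur.lastrowid
    conn.executemany("INSERT INTO item_rule VALUES (?, ?)",
                     [(item_id, rule_id) for rule_id in rule_ids])
    return item_id


def add_take(conn, item_id, practitioner, wav_bytes, f0_mean_hz=None):
    """Store one recorded pronunciation of an item."""
    conn.execute(
        "INSERT INTO take (item_id, practitioner, wav, f0_mean_hz) "
        "VALUES (?, ?, ?, ?)",
        (item_id, practitioner, wav_bytes, f0_mean_hz))


def takes_for_rule(conn, rule_name):
    """(text, practitioner, WAV size in bytes) for every take exercising a rule."""
    return conn.execute(
        "SELECT i.text, t.practitioner, length(t.wav) FROM take t "
        "JOIN item i ON i.item_id = t.item_id "
        "JOIN item_rule ir ON ir.item_id = i.item_id "
        "JOIN rule r ON r.rule_id = ir.rule_id "
        "WHERE r.name = ? ORDER BY t.take_id", (rule_name,)).fetchall()
```

The join table lets one item carry several rules, which matches the way a single marked sentence usually exercises several markings at once.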

Software Examples
A known speech synthesizer or synthesizer engine can have the following components.

Text input means. Examples are one or more data files, or a scanner with associated software and hardware for making text data available to the system in a suitable form.

A data processing unit and associated data memory for executing software and performing speech synthesis operations.

Speech synthesis software executable by the data processing unit. This software can also be a software engine for converting text data into speech data.

Audible output means. An example is an audio port capable of supplying an audio signal to loudspeakers or headphones, together with the associated hardware and software for outputting, in audio form, the speech data received from the speech synthesis software.

It will be understood that, if desired, the speech can be stored, transmitted, or distributed as an audio file, for example a WAV file, for playback at one or more times after it has been synthesized.
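The output side just described, synthesized audio stored as a WAV file for later playback, can be sketched with Python's standard `wave` module. The tone generator is only a placeholder for a real synthesizer. It renders (frequency, duration) segments as sine tones, with a frequency of 0 standing for a pause. The 16 kHz sample rate, the amplitude, and the function names are assumptions.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16_000  # assumed sample rate in Hz


def render_segments(segments, amplitude=0.3):
    """Render (frequency_hz, duration_s) segments as 16-bit mono samples.

    A frequency of 0 renders silence (a pause). Sine tones stand in for
    the output of an actual speech synthesizer.
    """
    samples = []
    phase = 0.0
    for freq, dur in segments:
        for _ in range(int(round(dur * SAMPLE_RATE))):
            value = amplitude * math.sin(phase) if freq > 0 else 0.0
            samples.append(int(value * 32767))
            phase += 2 * math.pi * freq / SAMPLE_RATE  # continuous phase
    return samples


def to_wav_bytes(samples):
    """Package samples as a complete in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit PCM
        w.setframerate(SAMPLE_RATE)
        w.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()


# 250 ms tone followed by a 100 ms pause; write to disk for later playback.
data = to_wav_bytes(render_segments([(220.0, 0.25), (0.0, 0.1)]))
with open("example.wav", "wb") as f:
    f.write(data)
```

Returning bytes rather than writing directly keeps the same output usable for storing, transmitting, or distributing the file.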

Conventionally, such known speech synthesizers have been developed to implement a specific, limited set of linguistic and synthesis rules. When their output is assembled from small speech components such as phonemes, words, or short phrases, it sounds mechanical, inhuman, and unappealing.

The present invention provides a novel speech synthesizer and speech synthesis software. Its source code is adapted to implement additional and/or alternative linguistic rules, so that the teachings herein can be applied with the novel text markup symbols and prosodic markings of the invention. The result is appealing, human-sounding speech output.

According to the present invention, the speech synthesis software can specify acoustic output values for a suitable speech synthesizer. That synthesizer produces sounds corresponding to the pronunciation rules applied to the text, as indicated by the phonetic notation and graphical symbols. The text can be marked to determine the pronunciation and prosody of the output speech.

Software Examples
Software suitable for practicing the invention can be provided by one or more persons skilled in the art who are familiar with formant text-to-speech ("TTS") engine software, for example engineers and/or computational linguists, and adapted to the purposes described herein. Suitable additional linguistic rules and synthesizer signal specifications can be added to a known speech software engine to build and test software that embodies or implements the invention.

For example, the sample prosodic acoustic library database described herein can be analyzed for co-articulations not currently specified in known formant TTS synthesizers. For those co-articulations, the analysis isolates the pronunciation markup symbols of the invention and the corresponding WAV data. The necessary elements can then be added to the known synthesizer.

The resulting speech synthesis program or programs can be used for machine generation of appealing or graceful speech from text. They can also help educate software engineers and others to gain a practical understanding of the Lessac or other voice training system implemented in the software, and of the novel prosodic speech rules used. They can also identify the items to be programmed to accommodate the further linguistic, phonetic, and prosodic rules and the novel acoustic signal parameters described herein.

Example software can be written manually by programming the markup of the text and specifying the associated speech values for the synthesizer's sound production. Once such samples exist, a larger dictionary can be programmed automatically using a computer system. That system takes the text to be synthesized directly as input. It applies the rules that the particular text requires in the context of particular words, sentences, and phrases, so as to specify hybrid formant and concatenative parameters and values.

The formant parameter values are those needed to produce the pronunciation and prosody specified by the text markup. They operate according to the particular marked pronunciation and/or prosodic rules. They also follow the characteristics of the voice to be output, such as its voice identity, fundamental frequency, and harmonics.
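A pitch marking could be turned into a synthesizer parameter by expanding it into a short piecewise-linear F0 (fundamental frequency) contour per marked syllable. The symbol names below are hypothetical labels for the pitch marks this specification uses: upward curve, downward curve, the two circumflex variants, and level sustention. The linear shapes and the base/span parameters are assumptions, not values from the source.

```python
def f0_contour(symbol, base_hz, span_hz, n=5):
    """Return n F0 targets (Hz) for one syllable carrying a pitch mark.

    symbol: 'rise', 'fall', 'rise-fall', 'fall-rise', or 'level'
    (hypothetical names for the upward curve, downward curve, circumflex
    variants, and level sustention). Contours range over
    [base_hz, base_hz + span_hz] and are piecewise linear.
    """
    if n < 2:
        raise ValueError("need at least two points")
    ts = [i / (n - 1) for i in range(n)]  # normalized time 0..1
    if symbol == "rise":
        shape = ts
    elif symbol == "fall":
        shape = [1 - t for t in ts]
    elif symbol == "rise-fall":
        shape = [1 - abs(2 * t - 1) for t in ts]  # peak at the midpoint
    elif symbol == "fall-rise":
        shape = [abs(2 * t - 1) for t in ts]      # trough at the midpoint
    elif symbol == "level":
        shape = [0.0] * n
    else:
        raise ValueError(f"unknown pitch symbol: {symbol}")
    return [base_hz + span_hz * s for s in shape]


print(f0_contour("rise-fall", 100.0, 40.0))  # [100.0, 120.0, 140.0, 120.0, 100.0]
```

A real system would derive base and span from the voice identity and prosody settings rather than passing them in by hand.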

Listener Testing
The present invention contemplates listener testing of the synthesized speech output to provide feedback for improving the product. Listeners can be polled on two questions:
- whether they perceive improved intelligibility and message comprehension;
- whether the samples sound better than conventional comparison products, for example Sensimetrics' HLSYN® or SENSYN® formant synthesizers.

Measurements of recognition, comprehension, and preference desirably use validated experimental design and data collection techniques, which may be known in the respective fields.

As will be apparent from the foregoing description, the text to be spoken can be marked with one, two or more, or all of the prosodic graphical symbols selected from the group consisting of:
a) graphical symbols indicating the pitch control needed to pronounce a letter, diphthong, syllable, or other speech element according to the desired prosody;
b) an upward curve indicating rising pitch;
c) a downward curve indicating falling pitch;
d) a circumflex indicating pitch that rises and then falls, or falls and then rises;
e) a level sustention indicating unchanging pitch;
f) a forward slash over a first letter that is followed by a consonant of closely related or identical sound, indicating that the first letter is prepared;
g) a shallow U-shaped liaison hammock indicating that letters separated from each other by other letters are pronounced continuously, without a pause between them;
h) a single underline marking as playable percussion instruments the timpani drumbeats D, B, and G and the snare drum, bass drum, and tom-tom drumbeats T, P, and K, unmarked consonants not being played;
i) a double underline marking as playable the strings N, M, V, and Z, the woodwinds L, NG, TH, and ZH, and the (voiceless) sound effects F, S, SH, and th, unmarked consonants not being played;
j) the letter combination "hw" marked over or adjacent to the letter combination WH in the text to be spoken, indicating that the H is sounded first, followed by the W, neither being played; and
k) Y and W connectives. These indicate that, where a "Y" or "W" occurs before another vowel, voice continuity is maintained from one word to the next or from one syllable to the next. Each connective is a hammock-like shallow U looping from beneath the Y or W to the following vowel, with a small Y or W letter marked in, near, or above the U.
These markings are subject to the rule that a consonant preceding a vowel is sounded but is not marked as playable.
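The underline rules above can be approximated automatically, provided the text has already been converted to phoneme-like tokens. Each consonant receives a single underline (percussion) or a double underline (strings, woodwinds, sound effects). Any consonant directly followed by a vowel is left unmarked, because it is sounded but not marked as playable. The token spellings (including "TH" voiced versus "th" voiceless) and the small vowel set are assumptions. Real input would come from a pronouncing dictionary rather than spelling.

```python
# Consonant families from the marking rules; the token spellings are assumed.
SINGLE_UNDERLINE = {"D", "B", "G", "T", "P", "K"}   # percussion
DOUBLE_UNDERLINE = {"N", "M", "V", "Z",             # strings
                    "L", "NG", "TH", "ZH",          # woodwinds
                    "F", "S", "SH", "th"}           # (voiceless) sound effects
VOWELS = {"a", "e", "i", "o", "u"}                  # hypothetical vowel tokens


def mark_playable(tokens):
    """Return (token, mark) pairs for one word.

    mark is '_' for a single underline, '=' for a double underline, and ''
    when the token is a vowel, is not a playable consonant, or is a
    consonant directly followed by a vowel (sounded but not marked).
    """
    marked = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt in VOWELS:
            mark = ""
        elif tok in SINGLE_UNDERLINE:
            mark = "_"
        elif tok in DOUBLE_UNDERLINE:
            mark = "="
        else:
            mark = ""
        marked.append((tok, mark))
    return marked


print(mark_playable(["S", "T", "o", "P"]))  # [('S', '='), ('T', ''), ('o', ''), ('P', '_')]
```

The sketch works within one word. Marking across word boundaries is left to the linking rules.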

Alternatively or additionally, the text to be spoken can be marked with one, two or more, or all of the prosodic graphical symbols selected from the group consisting of:
a) direct linking: a hammock-shaped liaison line under and between the linked letters, indicating that the momentum of speech carries from one linked letter to the next without an interruption, pause, or break between the words;
b) play and link: a liaison hammock combined with a single or double underline of a first consonant, indicating that the first consonant is played and the second consonant is not; and
c) prepare and link: a forward slash over a first consonant, indicating that it is prepared, combined with a liaison hammock to a second consonant, indicating the link between the first and second consonants.
Playable consonants are indicated by underlining.
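Choosing among the three linking marks for a pair of adjacent words can be sketched as a small decision function. The decision rules are assumptions:
- A word-final consonant before a word-initial vowel is treated as a direct link.
- Identical or cognate consonants (pairs such as T/D or S/Z, which we take to be "closely related" sounds) get prepare-and-link.
- Any other playable final consonant before a consonant gets play-and-link.

The token conventions follow the hypothetical phoneme-like tokens used for the underline rules.

```python
VOWELS = {"a", "e", "i", "o", "u"}  # hypothetical vowel tokens
PLAYABLE = {"D", "B", "G", "T", "P", "K", "N", "M", "V", "Z",
            "L", "NG", "TH", "ZH", "F", "S", "SH", "th"}
# Assumed "closely related" pairs: voiced/voiceless cognates.
COGNATES = {frozenset(pair) for pair in [("T", "D"), ("P", "B"), ("K", "G"),
                                         ("F", "V"), ("S", "Z"), ("SH", "ZH"),
                                         ("th", "TH")]}


def link_type(prev_word, next_word):
    """Return 'direct', 'prepare-and-link', 'play-and-link', or None.

    Words are non-empty lists of phoneme-like tokens.
    """
    last, first = prev_word[-1], next_word[0]
    if last in VOWELS:
        return None                    # no consonant to link from
    if first in VOWELS:
        return "direct"                # consonant carries into the vowel
    if last == first or frozenset((last, first)) in COGNATES:
        return "prepare-and-link"      # first consonant prepared, not exploded
    if last in PLAYABLE:
        return "play-and-link"         # first consonant played, second not
    return None


print(link_type(["H", "o", "T"], ["D", "o", "G"]))  # prepare-and-link
```

The order of the checks matters: a consonant before a vowel is always linked directly, even if it is playable.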

Several possible embodiments of marking instructions are described herein. They can be used in the present invention and, if desired, included in the prosodic acoustic library database. It is to be understood that the novel acoustic value codes, graphical symbol sets, and notations described herein are merely examples of codes that can be used or devised for the purposes of the invention, as will be apparent from this disclosure.

Further, the example acoustic value codes are described in the context of the English language. The invention nevertheless encompasses other coding systems for consistent pronunciation marking, devised for the particular needs of other languages, that embody the general principles herein. Such systems use speech rules suitably modified or adapted as needed for each language.

Accordingly, the method of the invention can be practiced in a language selected from the group consisting of:
- English, American English, French, Spanish, German, Japanese, Russian, Chinese, Arabic, and Hindi;
- written and spoken languages having a graphical symbol set and a rule-based grammar;
- any one dialect or specialized subset of the foregoing languages or of other languages, subsets, or dialects.

This will be apparent to those skilled in the art from the teachings herein.

It will be understood that Lessac or other voice training rules are particularly effective for pronouncing relatively small speech elements intelligibly. Such elements are individual letters and combinations of two or three letters. The prosodic rules described herein, in contrast, are generally useful for applying playing, pausing, stress, and other prosodic techniques to such letters or letter combinations. They do so in the context of larger speech elements such as whole words, phrases, sentences, or paragraphs.

In summary, the present invention provides a graphically displayable global rule set. It contains pronunciation rules for intelligibility and prosodic rules for rhythm and melody. When skilled voice practitioners apply the rule set to text to be spoken by a person or a machine, it can provide an unambiguous template for clear, appealing, melodious speech output.

Incorporated Disclosures
This incorporation by reference covers the entire disclosure of each of the following, wherever referred to in this specification or elsewhere in this patent application:
- each and every U.S. patent and patent application;
- each foreign and international patent application;
- each other published and unpublished patent application.

Exemplary embodiments of the invention have been described. It will be understood that many and various modifications will be apparent to those skilled in the art, or may become apparent as the art develops. Such modifications are intended to fall within the spirit and scope of the invention disclosed herein.

FIG. 1 shows several words and phrases marked using the Lessac phonetic notation for the structural NRG vowels.

FIG. 2 shows sample prosodic graphical symbols, according to an embodiment of the invention, useful for indicating desired pitch changes in text. One example is successive tonal pitch-change patterns within a prosodic intonation pattern associated with the text.

FIG. 3 shows sample prosodic graphical symbols, according to an embodiment of the invention, useful for indicating the desired pronunciation of consonants in consonant blends with s.

FIG. 4 shows sample prosodic graphical symbols, according to an embodiment of the invention, useful for indicating the desired pronunciation of consonant blends in which an "oboe" is followed by a "percussion instrument".

FIG. 5 shows sample prosodic graphical symbols useful for indicating the desired pronunciation of percussive consonant combinations with "cymbals".

FIG. 6 shows sample prosodic graphical symbols useful for indicating the desired pronunciation of consonant combinations with "woodblock clicks".

FIG. 7 shows sample prosodic graphical symbols useful for indicating the desired pronunciation of consonant combinations with a neutral vowel between the consonants.

FIG. 8 shows sample prosodic graphical symbols useful for indicating the desired pronunciation of consonant combinations with Y and W connectives.

FIG. 9 shows a sampling of prosodic graphical symbols useful for indicating the desired pronunciation based on articulatory considerations when words are linked in sequence, in this case in short phrases.

FIG. 10 shows two examples of the use of the prosodic graphical notation according to the invention. The examples apply word stress and intonation patterns for a specified prosody, in this case a "reportorial" prosody.

FIG. 11 illustrates one sample of marking, in reportorial style, using both the Lessac phonetic notation and the prosodic graphical notation illustrated in FIGS. 2 to 10.

FIG. 12 illustrates another sample of marking using both the Lessac phonetic notation and the prosodic graphical notation illustrated in FIGS. 2 to 10, this sample being in a human interest style.

Claims

A method of marking up text for use in synthesizing speech from the text, the method comprising:
marking the text to be spoken with one or more graphical symbols that indicate to a speaker the desired speech characteristics to be used in speaking the text,
wherein the graphical symbols serve as an acoustic code indicating a desired prosody to be imparted by the speaker to the spoken text.

The method of claim 1, wherein the prosody to be imparted comprises one or more prosodic elements selected from the group consisting of tempo, intonation pattern, rhythm, musicality, amplitude, pauses for emphasis and for breathing, and formal and informal articulation of words and phrases.

The method of claim 2, comprising marking visible text with graphical prosodic symbols, or electronically marking electronic text with electronic versions of the graphical symbols,
wherein the electronically marked text can be displayed or printed as human-readable, graphically marked text and is effective to communicate the desired prosody to a speech synthesizer in a manner that can control the prosody of the output speech.

The method of claim 1, 2, or 3, wherein the text to be spoken is marked with one, two, more, or all of the prosodic graphical symbols selected from the group consisting of: graphical symbols indicating the pitch control required to pronounce a letter, diphthong, syllable, or other speech element with the desired prosody, including an upward curve indicating rising pitch, a downward curve indicating falling pitch, a circumflex indicating pitch that rises and then falls or falls and then rises, and a level line indicating unchanged pitch; a forward slash on a first letter, indicating that the first letter is prepared and is followed by an identical or closely related consonant sound; a shallow U-shaped liaison hammock indicating that letters separated from each other by other letters are to be pronounced continuously, without a pause between them; a single underline marking the playable percussion consonants, including the timpani drumbeats D, B, and G and the snare-drum, bass-drum, and tom-tom drumbeats T, P, …, unmarked consonants not being played; a double underline marking the playable strings N, M, V, and Z, the woodwinds L, NG, TH, and ZH, and the (unvoiced) sound effects F, S, SH, and th, unmarked consonants not being played; the letters "hw" marked over or adjacent to the letter combination WH in the text to be spoken, indicating that the H is sounded first, followed by the W, and that neither is played; a Y connective and a W connective, indicating that when a "Y" or "W" occurs before another vowel, speech continuity is to be maintained from one word to the next or from one syllable to the next, each connective being shown as a small hammock-like U looping from beneath the Y or W to the following vowel, with a small letter y or w, respectively, marked in, near, or over the U; and the rule that a consonant before a vowel is spoken but is not marked as playable.
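The consonant-marking rules in the preceding claim amount to a lookup table plus one exception (no underline before a vowel), which is one way a visible mark can be paired with a digital code. The Python sketch below is illustrative only: the enum, the set names, and `underline_for` are hypothetical, and the sets contain only the letters the claim spells out.

```python
from enum import Enum


class Underline(Enum):
    """Hypothetical digital code for the visible consonant marking."""
    NONE = 0    # unmarked: spoken but not played
    SINGLE = 1  # percussion consonants (drumbeats)
    DOUBLE = 2  # strings, woodwinds, and sound effects


# Letters as listed in the claim. "TH" (woodwind) and "th" (unvoiced sound
# effect) are kept distinct by case, mirroring the claim text.
PERCUSSION = {"D", "B", "G", "T", "P"}
STRINGS = {"N", "M", "V", "Z"}
WOODWINDS = {"L", "NG", "TH", "ZH"}
SOUND_EFFECTS = {"F", "S", "SH", "th"}


def underline_for(consonant: str, followed_by_vowel: bool) -> Underline:
    """Return the underline marking for one consonant.

    A consonant directly before a vowel is spoken but not marked as
    playable, so it receives no underline.
    """
    if followed_by_vowel:
        return Underline.NONE
    if consonant in PERCUSSION:
        return Underline.SINGLE
    if consonant in STRINGS | WOODWINDS | SOUND_EFFECTS:
        return Underline.DOUBLE
    return Underline.NONE
```

For example, `underline_for("T", followed_by_vowel=False)` returns `Underline.SINGLE`, while the same T before a vowel returns `Underline.NONE`.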

The method of claim 1, 2, 3, or 4, wherein marking the text to be spoken comprises using one, two, more, or all of the prosodic graphical symbols selected from the group consisting of: a direct link, shown as a liaison hammock-like line under and between the linked letters, indicating that the speaking momentum should carry from one linked letter to the next without any interruption, pause, or break between the words; a play-and-link, shown as a liaison hammock combined with one or two underlines under the first consonant, indicating that the first consonant is played and the second consonant is not; and a prepare-and-link, shown as a liaison hammock combined with a forward slash on the first consonant, indicating that the first consonant is prepared and linked to the second consonant; and wherein playable consonants are indicated with an underline.
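The three linking marks can be contrasted by writing the choice between them as a function of the sounds on either side of a word boundary. The claim names the marks but not a selection rule, so the rule below (direct link before a vowel, prepare-and-link for identical or cognate consonants, play-and-link otherwise) and the `RELATED` pair table are assumptions made purely for illustration.

```python
from enum import Enum


class Link(Enum):
    DIRECT = "direct link"                  # hammock only
    PLAY_AND_LINK = "play and link"         # hammock + underline on first consonant
    PREPARE_AND_LINK = "prepare and link"   # hammock + slash on first consonant


VOWELS = set("AEIOU")

# Hypothetical table of "closely related" consonants (voiced/unvoiced cognates).
RELATED = {frozenset(pair) for pair in [("T", "D"), ("P", "B"), ("S", "Z"), ("F", "V")]}


def choose_link(last_sound: str, next_sound: str) -> Link:
    """Pick a linking mark for the boundary between two words.

    last_sound: final consonant of the first word (uppercase).
    next_sound: first sound of the second word (uppercase).
    """
    if next_sound in VOWELS:
        # Consonant runs straight into the next word's vowel.
        return Link.DIRECT
    if last_sound == next_sound or frozenset((last_sound, next_sound)) in RELATED:
        # Identical or closely related consonants: prepare the first one.
        return Link.PREPARE_AND_LINK
    # Dissimilar consonants: play the first, then link.
    return Link.PLAY_AND_LINK
```

Under these assumptions, "hot dog" (T then D) gets a prepare-and-link, while "an apple" (N then A) gets a direct link.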

The method of claim 1, 2, 3, 4, or 5, comprising placing the prosodic graphical symbols adjacent to the text to be spoken, aligning the prosodic graphical symbols in a line immediately above the text, selectively placing prosodic graphical symbols below the text, and placing prosodic graphical symbols both above and below the text.

The method of claim 1, 2, 3, 4, or 5, wherein marking the text to be spoken comprises: rendering the text in a line; marking, for ease of understanding, phonetic notation above the text and prosodic graphical symbols below the text; marking a pitch reference line above the phonetic-notation line; and marking additional prosodic symbols on the pitch reference line to indicate the desired pitch changes and emphasis.
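The four-tier layout in the preceding claim (pitch reference line, phonetic notation, text, prosodic symbols) can be approximated in plain text by padding each word's column to a common width. A minimal sketch; `render_layers` and the placeholder marks in the example are hypothetical and are not the notation shown in the figures.

```python
from typing import List


def render_layers(pitch_marks: List[str], phonetics: List[str],
                  words: List[str], symbols: List[str]) -> str:
    """Stack four annotation tiers, top to bottom, with one column per word.

    Tiers: pitch reference line, phonetic notation, the text itself, and
    prosodic symbols below the text. Each column is padded to its widest
    cell so that every mark sits directly over or under its word.
    """
    columns = list(zip(pitch_marks, phonetics, words, symbols))
    widths = [max(len(cell) for cell in col) for col in columns]

    def tier(cells: List[str]) -> str:
        return " ".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip()

    return "\n".join(tier(t) for t in (pitch_marks, phonetics, words, symbols))
```

Calling `render_layers([".", "o"], ["thuh", "KAT"], ["the", "cat"], ["", "_"])` lines up the large dot, the phonetic respelling, and the underline in the same column as "cat".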

The method of claim 7, wherein the pitch reference line is a Y-buzz pitch line, and the desired intonation pattern is marked on the Y-buzz pitch line using smaller dots to indicate less stress and larger dots to indicate greater stress, the dots being positioned at levels above the Y-buzz pitch line that indicate the desired pitch relative to the speaker's Y-buzz pitch.
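Read as data, each dot on the Y-buzz pitch line carries a height (pitch relative to the speaker's Y-buzz pitch) and a size (stress). The sketch below converts dots into fundamental-frequency and loudness targets. The scale factors are assumptions: the claim does not say how many semitones one level represents or how much louder a large dot should be, so `SEMITONES_PER_LEVEL` and `STRESS_GAIN_DB` are placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed scale factors; the claim does not quantify either one.
SEMITONES_PER_LEVEL = 1.0                      # one dot level = one semitone
STRESS_GAIN_DB = {"small": 0.0, "large": 3.0}  # extra loudness for large dots


@dataclass
class Dot:
    syllable: str
    level: int  # height above the Y-buzz pitch line (0 = on the line)
    size: str   # "small" (less stress) or "large" (greater stress)


def dot_targets(dots: List[Dot], ybuzz_hz: float) -> List[Tuple[str, float, float]]:
    """Convert dots into (syllable, target F0 in Hz, gain in dB) triples."""
    targets = []
    for d in dots:
        semitones = d.level * SEMITONES_PER_LEVEL
        f0 = ybuzz_hz * 2 ** (semitones / 12)  # equal-tempered semitone step
        targets.append((d.syllable, f0, STRESS_GAIN_DB[d.size]))
    return targets
```

With a Y-buzz pitch of 150 Hz, a dot twelve levels up maps to 300 Hz under the one-semitone-per-level assumption.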

The method of claim 1, 2, 3, 4, or 5, for automatically applying prosodic markings to text, comprising using at least one computational linguistics algorithm to identify, mark, and indicate the desired prosodic pronunciation.

The method of claim 9, comprising using code variables corresponding to the desired pronounced sounds to generate values for acoustic variables that can be used to specify the input to a speech synthesizer for outputting the marked text as synthesized speech.
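The claim leaves the synthesizer's input format open. As a stand-in, the sketch below emits W3C SSML 1.1, whose `<prosody>` element accepts a relative pitch change in semitones (`st`) and a relative volume in decibels (`dB`), and whose `<break>` element inserts a pause. SSML is not named in the source; the `MarkedWord` fields and `to_ssml` are hypothetical.

```python
from dataclasses import dataclass
from typing import List
from xml.sax.saxutils import escape


@dataclass
class MarkedWord:
    text: str
    pitch_st: float = 0.0  # pitch relative to the speaker's baseline, semitones
    gain_db: float = 0.0   # extra loudness for stressed words, dB
    pause_ms: int = 0      # pause after the word (emphasis or breath), ms


def to_ssml(words: List[MarkedWord], lang: str = "en-US") -> str:
    """Render prosodically marked words as an SSML 1.1 document string."""
    parts = []
    for w in words:
        parts.append(
            f'<prosody pitch="{w.pitch_st:+g}st" volume="{w.gain_db:+g}dB">'
            f"{escape(w.text)}</prosody>"
        )
        if w.pause_ms:
            parts.append(f'<break time="{w.pause_ms}ms"/>')
    body = " ".join(parts)
    return (
        '<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">{body}</speak>'
    )
```

Any synthesizer that honors these SSML attributes could consume the result directly; one that does not would need its own mapping from the same per-word values to its native parameters.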

The method of claim 1, 2, 3, 4, or 5, further comprising using an acoustic library of digitally recorded speech elements that were spoken with the prosody indicated by the graphical symbol markings.
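An acoustic library of the kind recited above can be pictured as recordings keyed by speech unit and prosody label. A minimal sketch under that assumption; the key structure, the `neutral` fallback label, and the file names in the example are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical library: (speech unit, prosody label) -> path of a recorded clip.
Library = Dict[Tuple[str, str], str]


def select_unit(library: Library, unit: str, prosody: str,
                fallback: str = "neutral") -> Optional[str]:
    """Return the recording of `unit` spoken with `prosody`.

    Falls back to the take recorded with the `fallback` prosody, and
    returns None if neither recording exists.
    """
    return library.get((unit, prosody)) or library.get((unit, fallback))
```

For instance, with `{("hello", "reportorial"): "hello_rep.wav", ("hello", "neutral"): "hello_neu.wav"}`, a request for a human-interest "hello" falls back to the neutral take.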

The method of claim 11, implemented in a language selected from the group consisting of English, American English, French, Spanish, German, Japanese, Russian, Chinese, Arabic, Hindi, a written language with a grammar based on a set of graphical symbols and rules, a colloquial language, and any idiom or specialized subset of any of said languages.

The method of claim 1, 2, 3, 4, or 5, comprising marking text with the prosodic graphical symbols and creating a database of pronounced speech, spoken by one or more people trained to pronounce the marked text correctly and including speech correctly pronounced in accordance with the text marked with the speech code, thereby facilitating synthesized speech output, optionally a human-sounding formant speech output.

The method of claim 13, comprising rendering the prosodic graphical symbols digitally and using the graphical symbols in synthesizer software for electronic markup of text to be spoken by a machine, thereby facilitating or guiding the introduction of prosodic elements into the output speech in the digital domain.

The method of claim 14, wherein recorded speech corresponding to the graphical notation for text comprising one or more of letters, words, phrases, sentences, paragraphs, and longer texts is digitized into a database and analyzed to provide algorithms or metrics specifying the relationship between specific text, the graphical notation associated with that text, and the corresponding specific audio data.
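As one concrete example of a metric relating notation to audio, the sketch below averages, for each intonation symbol, the pitch change and the duration measured in recordings of marked text. Pitch change is expressed in semitones as 12 × log2(end F0 / start F0). The record format and `pitch_change_by_symbol` are assumptions, and F0 extraction itself is out of scope.

```python
import math
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

Record = Tuple[str, float, float, float]  # (symbol, f0_start_hz, f0_end_hz, duration_ms)


def pitch_change_by_symbol(records: Iterable[Record]) -> Dict[str, Tuple[float, float]]:
    """Map each notation symbol to (mean pitch change in semitones, mean duration in ms)."""
    changes = defaultdict(list)
    durations = defaultdict(list)
    for symbol, f0_start, f0_end, dur in records:
        changes[symbol].append(12 * math.log2(f0_end / f0_start))
        durations[symbol].append(dur)
    return {s: (mean(changes[s]), mean(durations[s])) for s in changes}
```

Metrics of this sort could then seed synthesizer parameters, for example the typical rise a "rising curve" mark should produce.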

The method of claim 15, comprising using one or more of the provided algorithms or metrics to set input parameters for a speech synthesizer so as to re-create sounds that mimic human speech for specific text synthesized as speech with a specified prosody.

A speech synthesizer controlled by acoustic coding variables input to the speech synthesizer,
wherein the acoustic coding variables correspond to the prosodic specifications used to generate recorded human speech having a desired prosodic pronunciation, and the speech synthesizer provides synthesized speech output that realizes the desired prosodic pronunciation using the recorded human speech.