JP6435791B2 - Display control apparatus and display control method

Info

Publication number: JP6435791B2
Application number: JP2014228912A
Authority: JP
Inventors: 橘 誠
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2014-11-11
Filing date: 2014-11-11
Publication date: 2018-12-12
Anticipated expiration: 2034-11-11
Also published as: JP2016090966A

Description

The present invention relates to a technique for displaying synthesis information used for speech synthesis.

Speech synthesis techniques that synthesize a desired voice according to synthesis information specifying a pronunciation character, a pronunciation period, and a pitch for each note have been proposed. For example, Patent Document 1 discloses a configuration that displays a music information image (a piano-roll image area in which a pitch axis corresponding to pitch and a time axis corresponding to time are set), so that the user can generate or edit music information while visually checking the pitch, pronunciation character, and pronunciation period (start point, end point, and duration) of each note.

Patent Document 1: JP 2011-095396 A

In synthesized speech, the vowel is pronounced at the start point of the pronunciation period of the note specified by the synthesis information, and the pronunciation of the consonant precedes that start point. With the technique of Patent Document 1, however, only the pronunciation period specified by the synthesis information is displayed in the music information image, so the user cannot visually recognize the start point of the consonant's pronunciation. In view of these circumstances, an object of the present invention is to enable the user to visually grasp the start point of the pronunciation of a phoneme that is pronounced ahead of the start point of the pronunciation period specified by the synthesis information.

To solve the above problem, the display control apparatus of the present invention comprises display control means that refers to synthesis information for speech synthesis, which specifies a pronunciation character, a pronunciation period, and a pitch for each note, and causes a display device to display a note string image in which note images representing the notes are arranged on a time axis. When the pronunciation character of one note specified by the synthesis information includes a first phoneme and a second phoneme following the first phoneme, and the start point of the pronunciation of the first phoneme in the synthesized speech to which the synthesis information is applied precedes, on the time axis, the start point of the pronunciation period of that note, the display control means displays a preceding phoneme image, which indicates the start point of the pronunciation of the first phoneme on the time axis, in association with the note image of that note. With this configuration, the preceding phoneme image is displayed whenever the pronunciation character of a note includes a first phoneme and a second phoneme and the pronunciation of the first phoneme starts before the pronunciation period of the note. The user can therefore visually grasp the start point of the note's pronunciation period from the note image and the start point of the first phoneme's pronunciation from the preceding phoneme image. The first phoneme means either the single phoneme immediately before the second phoneme or any one of a plurality of phonemes preceding the second phoneme (typically the first of them).

In a preferred aspect of the present invention, the display control means displays a linear pitch transition image indicating the pitch transition of each note, together with the preceding phoneme image, in association with the note image of each note, and the preceding phoneme image is a linear image that is continuous with the pitch transition image and has the start point of the pronunciation of the first phoneme on the time axis as its end point. In this aspect, a linear preceding phoneme image whose end point is the start point of the pronunciation of the first phoneme is displayed continuously with the linear pitch transition image indicating the pitch transition of each note. The user can thus visually grasp the pitch transition of each note on the time axis from the pitch transition image and the start point of the pronunciation of the first phoneme from the end point of the preceding phoneme image. Moreover, because the preceding phoneme image and the pitch transition image are displayed as a linear image that is continuous on the time axis, the user can intuitively grasp both the start point of the first phoneme and the pitch transition of each note.

In a preferred aspect of the present invention, the display control means changes the position of the end point of the preceding phoneme image on the pitch axis according to the pitch of the immediately preceding note. By viewing the preceding phoneme image, the user can therefore visually grasp the start point of the pronunciation of the first phoneme, as described above, and can also intuitively grasp the change in pitch from the immediately preceding note.

In a preferred aspect of the present invention, the display control means sets the position of the end point of the preceding phoneme image to a predetermined initial position corresponding to the pitch of the one note when the interval between the one note and the immediately preceding note exceeds a threshold, and changes the position of the end point from the initial position when that interval is below the threshold. When notes follow one another on the time axis, the pitch of the preceding note tends to affect the pitch of the following note to a degree that depends on the interval between them. Specifically, if the interval between the two notes is long, the pitch of the preceding note has little influence on the pitch of the following note; if the interval is short, the influence is large. Setting the end point to the initial position when the interval exceeds the threshold, and moving it away from the initial position when the interval is below the threshold, has the advantage that the user can intuitively grasp how the interval between notes affects pitch.
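The threshold rule above can be illustrated with a minimal Python sketch. The function name, the `pull` factor, and the linear shift toward the previous note's pitch are assumptions made for illustration; the description only states that the end point stays at an initial position for long gaps and is moved away from it for short gaps.

```python
from typing import Optional


def end_point_pitch(note_pitch: float,
                    prev_pitch: Optional[float],
                    gap: float,
                    threshold: float,
                    pull: float = 0.5) -> float:
    """Pitch-axis position of end point E of the preceding phoneme image.

    Assumption: the initial position equals the note's own pitch, and a short
    gap shifts E linearly toward the previous note's pitch by `pull`.
    """
    initial = float(note_pitch)
    # The description does not say what happens when gap == threshold;
    # this sketch treats equality like "exceeds" and keeps the initial position.
    if prev_pitch is None or gap >= threshold:
        return initial
    return initial + pull * (prev_pitch - note_pitch)
```

With a previous note at pitch 64 and a current note at pitch 60, a gap longer than the threshold leaves E at 60, while a shorter gap moves E halfway toward 64 under the assumed `pull` of 0.5.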

In a preferred aspect of the present invention, the display control means displays the pitch transition image and the preceding phoneme image in mutually different display styles. For example, the pitch transition image and the preceding phoneme image may differ from each other in color, brightness, or saturation. This aspect allows the user to recognize, visually and intuitively, both the pitch transition of the note and the start point of the pronunciation of the first phoneme.

In a preferred aspect of the present invention, the display control means sets the duration of the pronunciation of the first phoneme according to the type of the first phoneme. This makes it possible to set a duration suited to the differences between phoneme types.

In a preferred aspect of the present invention, the synthesis information includes, for each note, control information corresponding to an instruction from the user, and the display control means sets the duration of the pronunciation of the first phoneme according to the control information, with a value corresponding to the type of the first phoneme as the upper limit. Because the duration of the first phoneme is set according to user-specified control information within the upper limit defined for each phoneme type, the duration can reflect the user's instruction while still respecting the differences between phoneme types.
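As a rough illustration of this aspect, the sketch below caps a user-controlled duration at a per-type upper limit. The table values, the type names used as keys, and the interpretation of the control information as a ratio between 0 and 1 are all assumptions; the description does not specify the form of the control information.

```python
# Hypothetical upper limits (seconds) per phoneme type; values are illustrative only.
UPPER_LIMIT_SEC = {
    "fricative": 0.12,
    "nasal": 0.06,
}


def first_phoneme_duration(phoneme_type: str, control: float) -> float:
    """Duration of the first phoneme, scaled by the user's control value.

    Assumption: `control` is a ratio; values outside [0, 1] are clamped so the
    result never exceeds the upper limit for the phoneme type.
    """
    upper = UPPER_LIMIT_SEC[phoneme_type]
    ratio = min(max(control, 0.0), 1.0)
    return upper * ratio
```

Clamping the ratio rather than the result keeps the upper limit as the single place where type-dependent behavior is defined.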

The display control apparatus according to each of the above aspects can be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to editing synthesis information and generating audio signals, or by cooperation between a general-purpose processing unit such as a CPU (Central Processing Unit) and a program. The program of the present invention can be provided stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium. An optical recording medium (optical disc) such as a CD-ROM is a good example, but any known type of recording medium, such as a semiconductor or magnetic recording medium, may be used. The program can also be provided by distribution over a communication network and installed on a computer. The present invention is also specified as a method of operating the display control apparatus according to each of the above aspects (a display control method).

Brief description of the drawings (in order):
Block diagram of the speech synthesizer 100 according to the first embodiment.
Schematic diagram of the synthesis information S of the first embodiment.
Schematic diagram of the editing screen 40 displayed by the display control unit 24 of the first embodiment.
Enlarged schematic diagram of a note image 54 that includes a first phoneme and a second phoneme.
Explanatory diagram of the phoneme type information F of the first embodiment.
Flowchart showing the operation of the display control unit 24 of the first embodiment.
Schematic diagram of the synthesis information S of the second embodiment.
Schematic diagram of the editing screen 40 displayed by the display control unit 24 of the second embodiment.
Explanatory diagram of the phoneme type information F of the second embodiment.
Schematic diagram of the editing screen 40 displayed by the display control unit 24 of the third embodiment.
Schematic diagram of the editing screen 40 displayed by the display control unit 24 of the third embodiment.
Schematic diagram of the editing screen 40 displayed by the display control unit 24 of the fourth embodiment.
Schematic diagram of a note image 54 of a modification.
Schematic diagram of a note image 54 of a modification.
Schematic diagram of a note image 54 of a modification.

<First Embodiment>
FIG. 1 is a block diagram of a speech synthesizer 100 according to the first embodiment of the present invention. The speech synthesizer 100 is a signal processing device that generates voices such as speech and singing by unit-concatenation speech synthesis and, as shown in FIG. 1, is realized by a computer system comprising an arithmetic processing device 12, a storage device 14, a display device 16, an input device 17, and a sound emitting device 18. The first embodiment assumes that an audio signal Z of the singing voice of a specific piece of music (hereinafter the "synthesized music") is generated.

The display device 16 (for example, a liquid crystal display) displays images as instructed by the arithmetic processing device 12. The input device 17 is a device (for example, a mouse or keyboard) that receives instructions from the user. The sound emitting device 18 (for example, headphones or a speaker) emits sound waves corresponding to the audio signal Z generated by the arithmetic processing device 12.

The storage device 14 stores a program PGM executed by the arithmetic processing device 12 and various data used by the arithmetic processing device 12. A known recording medium such as a semiconductor or magnetic recording medium, or a combination of several types of recording media, is used as the storage device 14. The storage device 14 of the first embodiment stores a speech unit group L, synthesis information S, and phoneme type information F, as described below.

The speech unit group L is a set of speech units (a speech synthesis library) collected in advance from recorded speech of a specific speaker. Each speech unit is either a single phoneme (for example, a vowel or a consonant), which is the smallest unit of linguistic meaning, or a phoneme chain (for example, a diphone or a triphone) in which multiple phonemes are connected. A speech unit is represented by a sample sequence of a time-domain speech waveform or by a time series of frequency-domain spectra calculated for each frame of the speech waveform.

As illustrated in FIG. 2, the synthesis information S is time-series data that specifies the singing voice of the synthesized music. For each note of the synthesized music, it specifies, in time series, a pronunciation character X1, a pronunciation period X2, and a pitch X3 (for example, a note number). The pronunciation character X1 is a code representing a syllable (mora) consisting of a single vowel or a combination of a consonant and a vowel. The pronunciation period X2 is the time length (note value) of the note and is defined, for example, by the start time of pronunciation and either the duration or the end time. As the above description shows, the synthesis information S can also be described as time-series data specifying the score of the synthesized music.
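One way to picture the synthesis information S is as a time-ordered list of note records. The following Python sketch is illustrative only: the class name, field names, and the use of seconds and MIDI note numbers are assumptions, not details taken from the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Note:
    lyric: str       # pronunciation character X1 (a mora such as "su" or "a")
    start: float     # start of pronunciation period X2, in seconds (assumed unit)
    duration: float  # length of pronunciation period X2, in seconds
    pitch: int       # pitch X3 as a note number (assumed MIDI numbering)

    @property
    def end(self) -> float:
        """End time of the pronunciation period X2."""
        return self.start + self.duration


# Synthesis information S: notes listed in time order.
synthesis_info: List[Note] = [
    Note("su", 0.0, 0.5, 60),
    Note("na", 0.5, 0.5, 62),
    Note("a", 1.0, 1.0, 64),
]
```

Storing the start time and duration and deriving the end time matches one of the two ways the description says the pronunciation period can be defined.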

By executing the program PGM stored in the storage device 14, the arithmetic processing device 12 (CPU) realizes a plurality of functions (an information editing unit 22, a display control unit 24, and a speech synthesis unit 28) for generating the audio signal Z representing the waveform of the synthesized sound. A configuration in which the functions of the arithmetic processing device 12 are distributed over multiple integrated circuits, or in which a dedicated electronic circuit (DSP) realizes some of the functions, may also be adopted.

The display control unit 24 causes the display device 16 to display various images. The display control unit 24 of the first embodiment refers to the synthesis information S and causes the display device 16 to display the editing screen 40 of FIG. 3, with which the user checks and edits the content (note sequence) of the synthesized music.

FIG. 3 shows an example of the editing screen 40. As illustrated in FIG. 3, the editing screen 40 includes a note string image N in which images 54 representing the notes specified by the user (hereinafter "note images") are arranged on a time axis. The note string image N includes a piano-roll coordinate plane with a time axis (horizontal axis) and a pitch axis (vertical axis) that intersect each other. The position of a note image 54 along the pitch axis is chosen according to the pitch X3 of the note, and its position and display length along the time axis are chosen according to the pronunciation period X2 of the note. By operating the input device 17 while viewing the editing screen 40, the user can add a new note image 54 or move, lengthen, or shorten an existing note image 54.
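The mapping from a note to its position on the piano-roll plane can be sketched as follows. The scale factors, the top-of-screen pitch, and the convention that screen y grows downward are assumptions chosen for illustration.

```python
from typing import Tuple


def note_rect(start: float,
              duration: float,
              pitch: int,
              px_per_sec: float = 100.0,
              row_height: float = 10.0,
              top_pitch: int = 127) -> Tuple[float, float, float, float]:
    """Return (x, y, width, height) of a note image on a piano-roll plane.

    Time runs along the horizontal axis and pitch along the vertical axis.
    Assumption: y = 0 is the row of `top_pitch`, and y increases downward.
    """
    x = start * px_per_sec             # position from the start of period X2
    width = duration * px_per_sec      # display length from the length of X2
    y = (top_pitch - pitch) * row_height  # row selected by pitch X3
    return (x, y, width, row_height)
```

For example, a note starting at 1.0 s with a duration of 0.5 s and pitch 60 would occupy a 50-pixel-wide rectangle starting at x = 100 under these assumed scales.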

The information editing unit 22 of FIG. 1 manages the synthesis information S. Specifically, the information editing unit 22 generates and edits the synthesis information S in response to user instructions to the input device 17. For example, when the user adds a note image 54 to the note string image N, or moves a note image 54 or lengthens or shortens it on the time axis, the information editing unit 22 updates the synthesis information S to reflect the edits made on the editing screen 40.

The display control unit 24 causes the display device 16 to display the pronunciation character X1 specified by the user together with the note image 54 (for example, superimposed on the note image 54 as in FIG. 3). The editing screen 40 of FIG. 3 shows note images 54 in which the pronunciation characters X1 "su", "na", "a", "ru", and "a", arranged on the time axis, are assigned to five notes. As this example shows, the pronunciation characters X1 include characters consisting of a single vowel ("a [a]") and characters consisting of a combination of a consonant and a vowel ("su [s-u]", "na [n-a]", "ru [r-u]").

The speech synthesis unit 28 generates the audio signal Z by speech synthesis processing that uses the speech unit group L and the synthesis information S stored in the storage device 14. The speech synthesis unit 28 of the first embodiment includes a pitch transition generation unit 282, which generates the change over time in the pitch of each note specified by the synthesis information S (hereinafter the "pitch transition"). For example, the pitch transition generation unit 282 sets the pitch transition so that the pitch changes smoothly between successive notes on the time axis. The speech synthesis unit 28 sequentially selects from the speech unit group L the speech units corresponding to the pronunciation characters X1 that the synthesis information S specifies in time series, adjusts the pitch of each speech unit to follow the pitch transition generated by the pitch transition generation unit 282, stretches or shrinks each unit according to the pronunciation period X2, and concatenates the units to generate the audio signal Z. The speech synthesis unit 28 synthesizes speech so that the vowel phoneme of the pronunciation character X1 coincides with the start point of the note (the start point of the note image 54). Specifically, for a pronunciation character X1 consisting of a combination of a consonant and a vowel, the pronunciation of the consonant phoneme starts before the start point of the note's pronunciation period, and the pronunciation of the vowel starts at that start point. For a pronunciation character X1 consisting of a single vowel, speech is synthesized so that the start point of the vowel phoneme coincides with the start point of the note.
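The timing rule described above (vowel aligned to the note start, consonant starting earlier) can be sketched in a few lines. The function name and the explicit `consonant_len` parameter are assumptions; the sketch handles only the one-phoneme and consonant-plus-vowel cases of the first embodiment.

```python
from typing import List, Tuple


def schedule_phonemes(phonemes: List[str],
                      note_start: float,
                      consonant_len: float) -> List[Tuple[str, float]]:
    """Return (phoneme, onset time) pairs for one note.

    `phonemes` is e.g. ["s", "u"] for "su" or ["a"] for a single vowel.
    The vowel always starts at `note_start`; a leading consonant starts
    `consonant_len` earlier.
    """
    if len(phonemes) == 1:
        # Single vowel: its onset coincides with the start of the note.
        return [(phonemes[0], note_start)]
    # Consonant + vowel: the consonant is pronounced ahead of the note start.
    return [(phonemes[0], note_start - consonant_len),
            (phonemes[-1], note_start)]
```

The gap between the two onsets is exactly the interval that the note image alone does not show, which is what the preceding phoneme image is meant to make visible.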

As illustrated in FIG. 3, the display control unit 24 of the first embodiment displays a transition image TR in association with each note image 54. The transition image TR differs depending on whether the pronunciation character X1 consists of a single vowel ("a [a]") or of a combination of a consonant and a vowel ("su [s-u]", "na [n-a]", "ru [r-u]"). Specifically, when the pronunciation character X1 consists of a single vowel, as with "a [a]" in FIG. 3, the transition image TR consists of a linear pitch transition image PC showing the pitch transition (pitch curve) that the pitch transition generation unit 282 generated for the note. When the pronunciation character X1 consists of a combination of a consonant and a vowel (for example, "su [s-u]", which includes the consonant first phoneme "s" and the vowel second phoneme "u"), the transition image TR consists, as illustrated in FIGS. 3 and 4, of a preceding phoneme image G, which indicates the start point of the pronunciation of the first phoneme on the time axis, and the pitch transition image PC. The second phoneme is a phoneme that follows the first phoneme. When the pronunciation character X1 consists of just two phonemes, as in the first embodiment, the second phoneme is the phoneme immediately after the first phoneme.

FIG. 4 illustrates the transition image TR when the pronunciation character X1 consists of a combination of a first phoneme (consonant) and a second phoneme (vowel). The display control unit 24 displays, in association with the note image 54, a transition image TR that includes the preceding phoneme image G, which indicates the start point of the pronunciation of the first phoneme "s", and the pitch transition image PC, which represents the pitch transition of the note. As illustrated in FIG. 4, the preceding phoneme image G is a linear image that has the start point of the pronunciation of the first phoneme "s" as its end point E and is continuous with the pitch transition image PC that follows it. Although a pitch cannot actually be specified for the consonant that is the first phoneme, the transition image TR displays, for convenience, the preceding phoneme image G and the pitch transition image PC as a linear image that is continuous on the time axis. The description above used the pronunciation character "su [s-u]", displayed first in the note string image N of FIG. 3, as an example, but the display control unit 24 likewise displays a transition image TR including a preceding phoneme image G and a pitch transition image PC for the other pronunciation characters that include a first phoneme and a second phoneme ("na [n-a]", "ru [r-u]").

The display control unit 24 of this embodiment displays the pitch transition image PC and the preceding phoneme image G that make up the transition image TR in mutually different display styles. For example, as illustrated in FIG. 4, the display control unit 24 displays the preceding phoneme image G and the pitch transition image PC with different line thicknesses. By viewing the preceding phoneme image G and the pitch transition image PC displayed in different styles on the editing screen 40, the user can intuitively grasp both the start point E of the pronunciation of the first phoneme and the pitch transition of each note of the note string image N on the time axis.

In FIG. 4, the duration TA, defined by the end point E of the preceding phoneme image G and the start point of the note image 54 (the start point of the note's pronunciation), is the time length of the pronunciation of the first phoneme "s". The duration TB, defined by the start point and end point of the note image 54, is the time length of the pronunciation of the second phoneme, which is a vowel (that is, the pronunciation period X2). The display control unit 24 sets the duration TA of the first phoneme according to the type of the first phoneme. For example, FIG. 3 shows a case in which different durations TA are set for the first phoneme types [s], [n], and [r]. The phoneme type information F stored in the storage device 14 is used to set the duration TA for each phoneme type.

FIG. 5 illustrates the phoneme type information F. The phoneme type information F is a data table that specifies a duration TA (TA1, TA2, ...) for each phoneme type. FIG. 5 lists the following phoneme types as examples: semivowels (/w/, /y/), nasals (/m/, /n/), liquids (/r/), plosives (/t/, /k/, /p/), affricates (/ts/), and fricatives (/s/, /f/). The pronunciation duration TA specified by the phoneme type information F differs from one phoneme type to another. For example, depending on the algorithm used for the stretching process, the durations TA of affricates and fricatives can be set to tend to be longer than those of plosives and liquids.

The display control unit 24 refers to the phoneme type information F and sets the duration TA associated with the type of the first phoneme contained in the pronunciation character X1 of each note as the duration TA of pronunciation of that note's first phoneme. In the example of FIG. 3, the display control unit 24 sets the duration TA6 for the first phoneme [s] of the first note, the duration TA2 for the first phoneme [n] of the second note, and the duration TA3 for the first phoneme [r] of the fourth note. The display control unit 24 then causes the display device 16 to display, in association with the note image 54 and the pitch transition image PC of each note, a preceding phoneme image G that spans the duration TA from the end point E (the time point that precedes the start point of the note by the duration TA set as described above) to the pitch transition image PC.
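The lookup described above can be sketched as follows. This is only an illustrative sketch: the numeric durations, the dictionary names, and the function `preceding_phoneme_span` are assumptions introduced here, since the description gives no concrete values for TA1, TA2, and so on.

```python
# Hypothetical durations TA in seconds per phoneme type.
# The description only says that TA differs per type; these numbers are made up.
DURATION_BY_TYPE = {
    "semivowel": 0.04,
    "nasal": 0.05,
    "liquid": 0.03,
    "plosive": 0.03,
    "affricate": 0.08,
    "fricative": 0.09,
}

# Phoneme-to-type mapping following the types listed for FIG. 5.
TYPE_BY_PHONEME = {
    "w": "semivowel", "y": "semivowel",
    "m": "nasal", "n": "nasal",
    "r": "liquid",
    "t": "plosive", "k": "plosive", "p": "plosive",
    "ts": "affricate",
    "s": "fricative", "f": "fricative",
}


def preceding_phoneme_span(first_phoneme, note_start):
    """Return (end point E, note start) on the time axis for image G.

    E lies before the note's start point by the duration TA that the
    table assigns to the type of the first phoneme.
    """
    ta = DURATION_BY_TYPE[TYPE_BY_PHONEME[first_phoneme]]
    return (note_start - ta, note_start)
```

With these assumed values, a note starting at 1.0 s whose pronunciation character begins with [s] would have its preceding phoneme image start 0.09 s earlier.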

FIG. 6 is a flowchart outlining the operation of the display control unit 24 according to the first embodiment. The processing of FIG. 6 is started by an interrupt generated at predetermined intervals while the editing screen 40 is displayed on the display device 16, for example in response to a user instruction to the input device 17 (an instruction to edit the synthesis information S).

By operating the input device 17 as appropriate while viewing the editing screen 40, the user can place a note image 54 at any position in the note string image N to instruct the addition of a new note (hereinafter "target note") and can specify the pronunciation character X1 of the target note. The display control unit 24 determines whether the user has instructed the addition of a target note by operating the input device 17 (SA1). When a target note has been added (SA1: YES), the display control unit 24 determines whether the pronunciation character X1 that the user specified for the target note includes a first phoneme and a second phoneme (SA2). If it does (SA2: YES), the display control unit 24 identifies the duration TA associated with the type of the first phoneme in the phoneme type information F and generates a preceding phoneme image G whose end point E is the time point that precedes the start point of the target note by the duration TA (SA3). The display control unit 24 then generates a pitch transition image PC indicating the pitch transition of each note generated by the pitch transition generation unit 282 (SA4), and displays a transition image TR, which contains the preceding phoneme image G of the first phoneme and the pitch transition image PC, in association with the note image 54 (SA5). If, on the other hand, the pronunciation character X1 of the note consists of, for example, a single phoneme (SA2: NO), the display control unit 24 displays the pitch transition image PC as the transition image TR in association with the note image 54 without generating a preceding phoneme image G (SA4-SA5).
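The branch structure of steps SA2 to SA5 can be sketched roughly as below. The function name, the dictionary-based return value, and the test "two or more phonemes" used for SA2 are assumptions made for illustration; the actual drawing (SA5) is left to the caller.

```python
def build_transition_image(phonemes, note_start, pitch_curve, duration_of):
    """Assemble the parts of transition image TR for a newly added note.

    phonemes    -- phonemes of pronunciation character X1, e.g. ["s", "a"]
    note_start  -- start point of the note on the time axis
    pitch_curve -- pitch transition values for the note (stand-in for PC)
    duration_of -- hypothetical mapping from first phoneme to duration TA
    """
    parts = {}
    # SA2: simplified check that X1 contains a first and a second phoneme.
    if len(phonemes) >= 2:
        # SA3: end point E precedes the note start by TA.
        ta = duration_of[phonemes[0]]
        parts["G"] = (note_start - ta, note_start)
    # SA4: the pitch transition image is generated in both branches.
    parts["PC"] = list(pitch_curve)
    # SA5: the caller displays the returned parts with the note image.
    return parts
```

A note whose pronunciation character is a single vowel yields only the "PC" entry, which mirrors the SA2: NO branch.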

As understood from the above description, in the first embodiment, when the pronunciation character X1 of a note specified by the synthesis information S includes a first phoneme and a second phoneme, and the start point of pronunciation of the first phoneme in the synthesized speech to which the synthesis information S is applied precedes the start point of the note's pronunciation period (the start point of the note image 54), the display control unit 24 displays a preceding phoneme image G indicating the start point of pronunciation of the first phoneme on the time axis. The user can therefore visually grasp, from the preceding phoneme image G, the start point of pronunciation of the first phoneme, which is pronounced ahead of the start point of each note's pronunciation period X2.

Because the preceding phoneme image G is a linear image that is continuous with the pitch transition image PC and has the start point of pronunciation of the first phoneme as its end point E, the user can visually recognize the start point E of pronunciation of the first phoneme and intuitively grasp the pitch transition of the notes that succeed one another on the time axis. In the first embodiment, the preceding phoneme image G and the pitch transition image PC are displayed in different styles, so the user can clearly distinguish the pitch transition of a note from the start point of pronunciation of the first phoneme. In addition, because the display control unit 24 selects the duration TA of pronunciation of the first phoneme according to the type of the first phoneme (the type of consonant), a duration TA best suited to the differences among phoneme types can be set.

<Second Embodiment>
A second embodiment of the present invention is described below. In the first embodiment, the duration TA of the first phoneme of the pronunciation character X1 is set to a fixed value identified from the phoneme type information F. In the second embodiment, the duration TA of the first phoneme is variably controlled according to an instruction from the user. In each of the embodiments illustrated below, elements whose operation and function are the same as in the first embodiment are denoted by the reference signs used in the description of the first embodiment, and their detailed description is omitted as appropriate.

FIG. 7 is a schematic diagram of the synthesis information S of the second embodiment. As illustrated in FIG. 7, the synthesis information S of the second embodiment specifies, for each note, control information V in addition to the same information as in the first embodiment (pronunciation character X1, pronunciation period X2, pitch X3). The control information V of this embodiment is a parameter used for setting the duration TA and is variably set according to an instruction from the user.

FIG. 8 is an explanatory diagram of the setting of the control information V. The editing screen 40 of FIG. 8 is an image in which a control variable designation screen C is added below a note string image N similar to that of the first embodiment. The control variable designation screen C is a screen on which the user specifies the value of the control information V for each note. FIG. 8 illustrates a case where the value of the control information V of each note is displayed as a bar graph. By operating the input device 17 as appropriate, the user can specify a desired value within a predetermined range for the control information V of each note. In the second embodiment, the duration TA of pronunciation of the first phoneme is set according to the control information V.

FIG. 9 is an explanatory diagram of the phoneme type information F of the second embodiment. The phoneme type information F is a data table in which an initial value L (L1, L2, L3, L4, L5, L6, ...) and an upper limit value A (A1, A2, A3, A4, A5, A6, ...) of the duration TA are set for each phoneme type. The initial value L and the upper limit value A each differ for each phoneme type.

For the first phoneme contained in the pronunciation character X1 of each note, the display control unit 24 identifies the initial value L and the upper limit value A corresponding to the type of the first phoneme from the phoneme type information F, and calculates, as the duration TA of the first phoneme, a value obtained by adjusting the initial value L according to the control information V within a range below the upper limit value A. Specifically, FIG. 8 illustrates a situation in which the control information V of four notes for which a common pronunciation character X1 is specified is set to different values. As understood from FIG. 8, the initial value L is adjusted according to the control information V, within the range of the upper limit value A, so that the duration TA becomes shorter as the value of the control information V increases. The preceding phoneme image G is displayed according to the duration TA in the same manner as in the first embodiment.
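One possible way to realize this adjustment is sketched below. The description states only that TA shortens as V grows and stays within the upper limit A; the exponential scaling, the 0 to 127 value range, and the center value 64 are all assumptions chosen for illustration.

```python
def first_phoneme_duration(initial, upper, v, v_center=64, v_max=127):
    """Adjust initial value L by control value V, capped at upper limit A.

    The scaling rule is hypothetical: V == v_center leaves L unchanged,
    smaller V lengthens the duration, and larger V shortens it.
    """
    if not 0 <= v <= v_max:
        raise ValueError("control value out of range")
    scale = 2.0 ** ((v_center - v) / v_center)
    # Never exceed the per-type upper limit A.
    return min(upper, initial * scale)
```

Under these assumptions a larger V never yields a longer duration, and very small V values saturate at the upper limit for the phoneme type.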

The second embodiment achieves the same effects as the first embodiment. In addition, because the duration TA of the first phoneme is variably set according to the control information V specified by the user, the second embodiment has the advantage that a preceding phoneme image G reflecting the user's intention can be displayed. Furthermore, because the upper limit value A of the duration TA is set for each phoneme type, there is the further advantage that a preceding phoneme image G that visually expresses the characteristics of the phoneme (how long or short its duration is) can be displayed.

<Third Embodiment>
FIG. 10 is an explanatory diagram of the editing screen 40 displayed by the display control unit 24 of the third embodiment. As understood from FIG. 10, the display control unit 24 of the third embodiment changes the position on the pitch axis of the end point E of the preceding phoneme image G according to the pitch of the immediately preceding note. Specifically, the higher the pitch X3 of the immediately preceding note is relative to the pitch X3 of a given note, the further toward the high-pitch side of the pitch axis (the side closer to the pitch X3 of the immediately preceding note) the end point E of that note's preceding phoneme image G is positioned. FIG. 10, for example, assumes a case where the second and fourth notes have the same pitch X3, the first note has a higher pitch X3 than the third note, and the pitch difference between the first and second notes is larger than the pitch difference between the third and fourth notes. Accordingly, as illustrated in FIG. 10, the display control unit 24 positions the end point E of the preceding phoneme image G associated with the second note image 54 in the note string image N higher on the pitch axis than the end point E of the preceding phoneme image G associated with the fourth note image 54.

Likewise, the display control unit 24 displays the end point E of the preceding phoneme image G of a given note so that the lower the pitch X3 of the immediately preceding note is relative to the pitch X3 of that note, the further toward the low-pitch side of the pitch axis (the side closer to the pitch X3 of the immediately preceding note) the end point E is positioned. FIG. 11 assumes a case where the second and fourth notes have the same pitch X3, the first note has a lower pitch X3 than the third note, and the pitch difference between the first and second notes is larger than the pitch difference between the third and fourth notes. Accordingly, as illustrated in FIG. 11, the display control unit 24 positions the end point E of the preceding phoneme image G associated with the second note image 54 lower on the pitch axis than the end point E of the preceding phoneme image G associated with the fourth note image 54.
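The dependence of the end point E on the preceding note's pitch can be illustrated with a simple linear rule. The rule itself, the function name, and the `pull` parameter are assumptions; the description specifies only the direction of the shift and that a larger pitch difference produces a larger shift.

```python
def end_point_pitch(note_pitch, prev_pitch, pull=0.5):
    """Pitch-axis position of end point E for a note's image G.

    E is moved from the note's own pitch toward the pitch of the
    immediately preceding note by a hypothetical fraction `pull`.
    """
    return note_pitch + pull * (prev_pitch - note_pitch)
```

For a situation like FIG. 10, with pitches given as assumed MIDI-style note numbers 67, 60, 62, 60, the second note's end point would land at 63.5 and the fourth note's at 61.0, so the second is the higher of the two.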

The third embodiment achieves the same effects as the first embodiment. In addition, in the third embodiment, the position on the pitch axis of the end point E of the preceding phoneme image G changes according to the pitch X3 of the immediately preceding note. This has the advantage that the user can intuitively grasp the pitch transition (pitch difference) from the immediately preceding note.

<Fourth Embodiment>
In the third embodiment, the position on the pitch axis of the end point E of the preceding phoneme image G is changed according to the pitch X3 of the immediately preceding note. When there is a sufficient interval from the immediately preceding note, however, it may be unnecessary (or even preferable not) to present the user with the relationship to the pitch of the immediately preceding note (the pitch difference from the immediately preceding note). In view of this, the display control unit 24 of the fourth embodiment switches, according to the interval from the immediately preceding note, whether to change the position of the end point E of the preceding phoneme image G according to the pitch X3 of the immediately preceding note.

FIG. 12 is an explanatory diagram of the editing screen 40 displayed by the display control unit 24 of the fourth embodiment. In the note string image N of FIG. 12, the pronunciation character "あ [a]" of each of the first and third notes consists of a single vowel phoneme, and the pronunciation character "さ [s-a]" of each of the second and fourth notes consists of a combination of a consonant and a vowel.

When the section length TC between two successive notes exceeds a threshold, the display control unit 24 positions the end point E of the preceding phoneme image G of the later note at an initial position corresponding to that note's pitch X3 (hereinafter "initial position"). When the section length TC between the later note and the immediately preceding note is sufficiently long, the correlation between the two notes is considered to be low. Accordingly, as illustrated in FIG. 12, when the section length TC2 exceeds the threshold Dth (TC2 > Dth), the display control unit 24 positions the end point E of the preceding phoneme image G associated with the fourth note image 54 at the initial position corresponding to the pitch X3.

When, on the other hand, the section length TC between two successive notes is below the threshold, the display control unit 24 moves the end point E of the later note's preceding phoneme image G away from the initial position corresponding to the note's pitch X3. When the section length TC between the later note and the immediately preceding note is sufficiently short, the correlation between the two notes is considered to be high. Accordingly, as illustrated in FIG. 12, when the section length TC1 is below the threshold Dth (TC1 < Dth), the display control unit 24 positions the end point E of the preceding phoneme image G associated with the second note image 54 according to the pitch X3 of the immediately preceding note, for example on an extension of the pitch transition image PC of the immediately preceding note. This has the advantage that the user can visually and intuitively grasp how the interval between successive notes (section length TC) affects the pitch of each note.
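The switching between the two cases can be sketched as a single threshold test. The function name, the linear `pull` rule for the shifted position, and the treatment of a gap exactly equal to the threshold (handled here like the long-gap case) are assumptions; the description covers only TC > Dth and TC < Dth and places the shifted end point, for example, on an extension of the preceding note's pitch transition image.

```python
def end_point_pitch_with_gap(note_pitch, prev_pitch, gap, threshold, pull=0.5):
    """Pitch-axis position of end point E, depending on section length TC.

    gap       -- section length TC between the preceding note and this note
    threshold -- threshold Dth
    pull      -- hypothetical fraction by which E moves toward prev_pitch
    """
    if gap >= threshold:
        # Long gap: low correlation, so E stays at the initial position.
        return note_pitch
    # Short gap: E is shifted according to the preceding note's pitch.
    return note_pitch + pull * (prev_pitch - note_pitch)
```

With an assumed threshold of 0.5 s, a note that follows its predecessor after 0.1 s gets a shifted end point, while the same note after a 1.0 s rest keeps the initial position.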

<Modifications>
Each of the embodiments described above can be modified in various ways. Specific modifications are illustrated below. Two or more modifications freely selected from the following examples may be combined as appropriate.

(1) Each of the embodiments described above illustrates a linear preceding phoneme image G whose end point E is the start point of pronunciation of the first phoneme, but the preceding phoneme image G is not limited to this example as long as it allows the user to see the start point E of pronunciation of the first phoneme.

FIG. 13 is an explanatory diagram of preceding phoneme images G according to modifications. As illustrated in region (a) of FIG. 13, the line type of the preceding phoneme image G is not limited to the examples in the embodiments described above. As illustrated in region (b) of FIG. 13, the preceding phoneme image G may be a dot-shaped image located at the start point E of pronunciation of the first phoneme rather than a linear image. In this form as well, the user can visually grasp the start point of pronunciation of the first phoneme and the duration TA of the first phoneme defined by the end point E and the start point of the note image 54. As illustrated in region (c) of FIG. 13, the preceding phoneme image G may be a rectangular figure extending forward from the note image 54. As illustrated in region (d) of FIG. 13, a line segment placed at the start point of pronunciation of the first phoneme on the time axis and parallel to the pitch axis (for example, a line segment as long as the height of the note image 54) may also be displayed as the preceding phoneme image G. As understood from regions (b) and (d) of FIG. 13, the preceding phoneme image G need not be continuous with the pitch transition image PC or the note image 54. Besides the above examples, the preceding phoneme image G and the pitch transition image PC may, for example, be displayed in a common style.

(2) Each of the embodiments described above illustrates a linear preceding phoneme image G whose end point E is the start point of pronunciation of a first phoneme that precedes, on the time axis, the start point of a note's pronunciation period (the start point of the note image 54). A configuration that additionally displays a following phoneme image H after the end point of pronunciation of the second phoneme (the end point of the note image 54) may also be adopted. For example, depending on how speech units are segmented and how they are stretched or shortened, a speech unit may continue to sound past the end point of the pronunciation period X2. Accordingly, as illustrated in FIG. 14, a following phoneme image H representing the lingering sound of the second phoneme (that is, an image representing the sound produced after the end point of the pronunciation period X2) may be displayed so as to follow the end point of pronunciation of the second phoneme of the pronunciation character "す [s-u]" (the end point of the note image 54). Specifically, the following phoneme image H is a linear image extending from the end point of the pitch transition image PC (the end point of the pronunciation period X2) to the end point of pronunciation of the speech unit.

(3) As illustrated in FIG. 15, an additional image B representing an additional pitch change such as vibrato, other than the pitch transition determined from the time series of a plurality of notes, may be displayed on the display device 16 together with the pitch transition image PC in association with each note image 54.

(4) In each of the embodiments described above, the pitch transition image PC represents a pitch transition that excludes additional pitch changes such as vibrato. The pitch transition image PC may, however, represent a pitch transition that includes additional pitch changes (typically pitch changes used as singing expression) other than the pitch change determined solely from the time series of a plurality of notes. Examples of additional pitch changes include, besides the vibrato mentioned above, pitch bend and portamento (ascending or descending). In that case, the pitch transition image PC has a shape that reflects the characteristics (parameters) of the additional pitch change, such as the depth and rate of the vibrato or pitch bend.

(5) The second embodiment illustrates phoneme type information F in which a fixed upper limit value A is associated with the initial value L of the duration TA, and a configuration in which the duration TA of the first phoneme is variably set by adjusting the initial value L, within the range of the upper limit value A, according to the specified control information V. In a configuration that calculates the duration TA by multiplying the initial value L by a variable coefficient corresponding to the control information V (hereinafter "stretch factor"), however, the maximum value of the stretch factor may instead be set for each phoneme type. According to the configuration in which the initial value L is multiplied by the maximum factor, the duration TA can differ according to the duration of each phoneme of the speech unit even when the stretch factor is the same, so the duration TA can be varied more diversely than in the configuration of the second embodiment, which sets an upper limit value A of the duration TA.
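A minimal sketch of the stretch-factor variant follows. How the stretch factor is derived from the control information V is not specified, so the function simply takes the factor as an argument; the function name and the clamping to a per-type maximum are assumptions.

```python
def stretched_duration(initial, factor, max_factor):
    """Duration TA as initial value L times a stretch factor.

    factor     -- stretch factor derived from control information V
                  (the derivation is not specified; passed in directly here)
    max_factor -- hypothetical per-type maximum of the stretch factor
    """
    if factor <= 0:
        raise ValueError("stretch factor must be positive")
    # The factor is limited to the maximum set for the phoneme type.
    return initial * min(factor, max_factor)
```

Because the result scales with L, two phonemes that share the same stretch factor still end up with different durations when their initial values differ.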

(6) In each of the embodiments described above, the pitch transition image PC of the pitch transition generated by the pitch transition generation unit 282 of the speech synthesis unit 28 is displayed, but generation of the audio signal Z by the speech synthesis unit 28 is not necessarily required for displaying the pitch transition image PC. That is, processing such as adjusting the pitch of each speech unit by applying the pitch transition generated by the pitch transition generation unit 282 and connecting the speech units can be omitted if the only aim is to display the pitch transition image PC. A configuration that omits the processing required to generate the audio signal Z has the advantage of reducing the processing load (computation time and the like).

(7) In each of the embodiments described above, the preceding phoneme image G and the pitch transition image PC are displayed each time the user adds a note (target note), but the trigger for updating the transition image TR is not limited to this example (addition of a note). For example, the transition image TR of each note already placed on the editing screen 40 may be added or updated when the user instructs an update by an operation on the editing screen 40 (for example, an operation on a "redraw" button), or automatically without requiring an explicit instruction from the user (for example, at predetermined intervals or each time the user instructs some edit).

(8) Each of the embodiments described above illustrates a unit-concatenation speech synthesis unit 28 that uses speech units, but any known technique may be adopted for speech synthesis using the synthesis information S. For example, the singing voice of the synthesized song specified by the synthesis information S may be synthesized using a probabilistic model such as a hidden Markov model (HMM).

(9) Each of the embodiments described above illustrates a Japanese pronunciation character X1, but the language of the pronunciation character X1 (the language of the speech to be synthesized) is arbitrary. For example, the present invention can also be applied to generating speech in any language, such as English, Spanish, Chinese, or Korean. Depending on the language, a pronunciation character X1 in which a plurality of consonants ([s], [t], [r]) precede the second phoneme [i] (a vowel) is also conceivable, as in the English word "string". In that case, one of the consonants located ahead of the second phoneme (typically the leading consonant [s]) may be treated as the first phoneme, and a preceding phoneme image G expressing the start point of pronunciation of that first phoneme may be displayed. As understood from the above description, the second phoneme is expressed as a phoneme located behind the first phoneme and is not limited to the phoneme immediately following the first phoneme.
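The selection of the first and second phonemes in a consonant cluster can be sketched as below. The five-vowel inventory, the phoneme spelling, and the function name are simplifying assumptions; the sketch takes the leading consonant as the first phoneme, which the text describes as the typical choice.

```python
# Simplified, hypothetical vowel inventory used only for this illustration.
VOWELS = {"a", "i", "u", "e", "o"}


def split_first_and_second(phonemes):
    """Return (first phoneme, second phoneme) for a pronunciation character.

    The second phoneme is the first vowel found. The first phoneme is the
    leading consonant if any consonant precedes that vowel, else None.
    """
    for index, phoneme in enumerate(phonemes):
        if phoneme in VOWELS:
            first = phonemes[0] if index > 0 else None
            return first, phoneme
    # No vowel found: neither phoneme can be identified in this sketch.
    return None, None
```

For a phoneme sequence such as s, t, r, i, ng, the sketch returns [s] as the first phoneme and [i] as the second, even though [t] and [r] lie between them.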

(10) In each of the embodiments described above, the storage device 14 that stores the speech unit group L and the synthesis information S is mounted in the speech synthesizer 100, but a configuration in which an external device independent of the speech synthesizer 100 (for example, a server device) stores the speech unit group L and the synthesis information S may also be adopted. The speech synthesizer 100 acquires the speech unit group L or the synthesis information S via, for example, a communication network and executes the editing processing and the speech synthesis processing. As understood from the above description, an element that stores the speech unit group L or the synthesis information S is not an essential element of the speech synthesizer 100.

DESCRIPTION OF SYMBOLS: 100 ... speech synthesizer, 12 ... arithmetic processing device, 14 ... storage device, 16 ... display device, 17 ... input device, 18 ... sound emitting device, 22 ... information editing unit, 24 ... display control unit, 28 ... speech synthesis unit, 40 ... editing screen, 54 ... note image, 282 ... pitch transition generation unit, G ... preceding phoneme image, N ... note string image, PC ... pitch transition image, E ... end point, C ... control variable designation screen.

Claims

A display control device comprising display control means that refers to synthesis information for speech synthesis specifying a pronunciation character, a pronunciation period, and a pitch for each note, and that causes a display device to display a note string image in which note images each representing a note are arranged on a time axis, and a linear pitch transition image that is arranged in association with the note image of each note and indicates the pitch transition of each note,
wherein, when the pronunciation character of one note specified by the synthesis information includes a first phoneme and a second phoneme behind the first phoneme, and the start point of pronunciation of the first phoneme in the synthesized speech to which the synthesis information is applied precedes, on the time axis, the start point of the pronunciation period of the one note, the display control means displays, in association with the note image of the one note, a linear preceding phoneme image that is continuous with the pitch transition image and has the start point of pronunciation of the first phoneme as its end point, and changes the position of the end point of the preceding phoneme image on the pitch axis according to the pitch of the immediately preceding note.

The display control device according to claim 1, wherein the display control means sets the time length of pronunciation of the first phoneme according to the type of the first phoneme.

The display control device according to claim 1 or claim 2, wherein the synthesis information includes, for each note, control information according to an instruction from the user, and the display control means sets the time length of pronunciation of the first phoneme according to the control information, subject to an upper limit value corresponding to the type of the first phoneme.

A display control method realized by a computer system, the method comprising:
referring to synthesis information for speech synthesis specifying a pronunciation character, a pronunciation period, and a pitch for each note, and displaying on a display device a note string image in which note images each representing a note are arranged on a time axis, and a linear pitch transition image that is arranged in association with the note image of each note and indicates the pitch transition of each note; and
when the pronunciation character of one note specified by the synthesis information includes a first phoneme and a second phoneme behind the first phoneme, and the start point of pronunciation of the first phoneme in the synthesized speech to which the synthesis information is applied precedes, on the time axis, the start point of the pronunciation period of the one note, displaying, in association with the note image of the one note, a linear preceding phoneme image that is continuous with the pitch transition image and has the start point of pronunciation of the first phoneme as its end point, and changing the position of the end point of the preceding phoneme image on the pitch axis according to the pitch of the immediately preceding note.
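The placement of the end point of the preceding phoneme image recited above can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the Note class, the MAX_DURATION table and its values, the control ratio in the range 0 to 1, and the midpoint rule for the pitch-axis position are all hypothetical. The claims state only that the position changes according to the pitch of the immediately preceding note and that the time length depends on the type of the first phoneme.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical upper limits (seconds) on consonant duration per phoneme type.
MAX_DURATION = {"s": 0.12, "k": 0.06, "t": 0.05}
DEFAULT_MAX_DURATION = 0.08


@dataclass
class Note:
    start: float                  # start of the pronunciation period (seconds)
    end: float                    # end of the pronunciation period (seconds)
    pitch: float                  # pitch, e.g. a MIDI note number
    first_phoneme: Optional[str]  # leading consonant, or None if vowel-initial


def first_phoneme_duration(phoneme: str, control: float = 1.0) -> float:
    """Duration of the first phoneme, capped by a per-type upper limit.

    `control` is an assumed user-set ratio; it is clamped to [0, 1] so the
    result never exceeds the upper limit for the phoneme type.
    """
    upper = MAX_DURATION.get(phoneme, DEFAULT_MAX_DURATION)
    ratio = min(max(control, 0.0), 1.0)
    return upper * ratio


def preceding_phoneme_end_point(
    note: Note, previous: Optional[Note], control: float = 1.0
) -> Optional[Tuple[float, float]]:
    """Return (time, pitch) of the end point, or None if nothing precedes."""
    if note.first_phoneme is None:
        return None
    duration = first_phoneme_duration(note.first_phoneme, control)
    if duration <= 0.0:
        return None  # the first phoneme does not precede the note start
    time = note.start - duration
    # Assumption: place the end point midway between the two pitches so that
    # it moves with the pitch of the immediately preceding note.
    pitch = note.pitch if previous is None else (previous.pitch + note.pitch) / 2
    return time, pitch
```

With a note starting at 1.0 s whose first phoneme is [s], the end point lies 0.12 s before the note start under these assumed limits, and its pitch-axis position shifts when the pitch of the preceding note changes.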