JP2016004189A - Synthetic information management device

Info

Publication number: JP2016004189A
Application number: JP2014125138A
Authority: JP
Inventors: Tatsuya Iriyama (入山 達也)
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2014-06-18
Filing date: 2014-06-18
Publication date: 2016-01-12
Anticipated expiration: 2034-06-18
Also published as: JP6439288B2

Abstract

PROBLEM TO BE SOLVED: To allow a user to adjust the positional relationship (separation/continuity) between phonemes while visually and intuitively confirming the positional relationship between adjacent phonemes. SOLUTION: An instruction reception unit 22 receives an instruction from a user. A display control unit 24 displays a time series of phoneme symbols of a plurality of phonemes corresponding to the pronunciation content designated by synthesis information. In accordance with the instruction received by the instruction reception unit 22, it also displays a connection indicator between the phoneme symbol of a first phoneme and the phoneme symbol of a second phoneme immediately following the first phoneme among the plurality of phonemes, the connection indicator indicating the positional relationship on the time axis between the first phoneme and the second phoneme. An information management unit 26 edits the synthesis information so that the first phoneme and the second phoneme have, on the time axis, the positional relationship indicated by the connection indicator.

Description

The present invention relates to a technique for managing synthesis information applied to speech synthesis.

Segment-connection (concatenative) speech synthesis techniques, which synthesize speech with desired pronunciation content by connecting a plurality of speech segments to one another, have been proposed in the past. For example, Patent Literature 1 discloses a technique for synthesizing speech for pronunciation characters arbitrarily designated by a user.

Patent Literature 1: JP 2012-022121 A

However, with the technique of Patent Literature 1, the user can only designate the sounding period and the pronunciation characters for each note, and cannot adjust the continuity (degree of separation or proximity) between the phonemes corresponding to the pronunciation characters. Consequently, there is a problem that audibly unnatural or indistinct speech is synthesized in which adjacent phonemes are much closer together or farther apart than the user intended, and a further problem that speech with a distinctive or characteristic expression in line with the user's intention and preference cannot be synthesized. In view of these circumstances, an object of the present invention is to realize speech synthesis in which the user can adjust the positional relationship (separation/proximity) between phonemes while visually and intuitively confirming the positional relationship between adjacent phonemes.

In order to solve the above problems, a synthesis information management device according to a first aspect of the present invention manages synthesis information that designates the pronunciation content of synthesized speech, and comprises: instruction receiving means for receiving an instruction from a user; display control means for causing a display device to display a time series of phoneme symbols of a plurality of phonemes corresponding to the pronunciation content designated by the synthesis information, the display control means displaying, in accordance with the instruction received from the user by the instruction receiving means, a connection indicator between the phoneme symbol of a first phoneme among the plurality of phonemes and the phoneme symbol of a second phoneme immediately following the first phoneme, the connection indicator indicating the positional relationship on the time axis between the first phoneme and the second phoneme; and information management means for editing the synthesis information so that the first phoneme and the second phoneme have, on the time axis, a positional relationship corresponding to the connection indicator. In this configuration, a connection indicator indicating the positional relationship on the time axis is displayed between the phoneme symbol of the first phoneme and the phoneme symbol of the second phoneme in accordance with the instruction received from the user, and the synthesis information is edited so that the first phoneme and the second phoneme have, on the time axis, a positional relationship corresponding to the connection indicator. The user can therefore adjust the positional relationship between phonemes while visually and intuitively confirming the positional relationship between adjacent phonemes.

In a preferred example of the synthesis information management device according to the first aspect, the display control means displays, as the connection indicator and in accordance with the instruction received from the user by the instruction receiving means, a proximity indicator indicating proximity between the first phoneme and the second phoneme between the phoneme symbol of the first phoneme and the phoneme symbol of the second phoneme, and the information management means edits the synthesis information so that the first phoneme and the second phoneme approach each other on the time axis in accordance with the proximity indicator. In this aspect, a connection indicator indicating proximity between the first phoneme and the second phoneme is displayed between the two phoneme symbols in accordance with the instruction received from the user, and the synthesis information is edited so that the first phoneme and the second phoneme approach each other on the time axis in accordance with the proximity indicator. The user can therefore bring phonemes closer together while visually and intuitively confirming the proximity of adjacent phonemes. In addition, because the user can adjust the degree of continuity between phonemes according to intention and preference, speech with a distinctive or characteristic expression can be synthesized.

In a preferred example of the synthesis information management device according to the first aspect, the display control means displays, as the connection indicator and in accordance with the instruction received from the user by the instruction receiving means, a separation indicator indicating separation between the first phoneme and the second phoneme between the phoneme symbol of the first phoneme and the phoneme symbol of the second phoneme, and the information management means edits the synthesis information so that the first phoneme and the second phoneme move apart on the time axis in accordance with the separation indicator. In this aspect, a separation indicator indicating separation between the first phoneme and the second phoneme is displayed between the two phoneme symbols in accordance with the instruction received from the user, and the synthesis information is edited so that the first phoneme and the second phoneme move apart on the time axis in accordance with the separation indicator. The user can therefore separate phonemes while visually and intuitively confirming the separation of adjacent phonemes. In addition, because the user can adjust the degree of continuity between phonemes according to intention and preference, speech with a distinctive or characteristic expression can be synthesized.

In a preferred example of the synthesis information management device according to the first aspect, the display control means displays, in accordance with the instruction received from the user by the instruction receiving means, an index value between the phoneme symbol of the first phoneme among the plurality of phonemes and the phoneme symbol of the second phoneme immediately following the first phoneme, the index value indicating the degree of proximity or separation between the first phoneme and the second phoneme on the time axis, and the information management means updates the synthesis information so that the first phoneme and the second phoneme approach each other or move apart on the time axis in accordance with the index value. In this aspect, an index value indicating the degree of proximity or separation between the first phoneme and the second phoneme is displayed between the two phoneme symbols in accordance with the instruction received from the user, and the synthesis information is edited so that the first phoneme and the second phoneme approach each other or move apart on the time axis in accordance with the index value. The user can therefore adjust the positional relationship between phonemes while visually and intuitively confirming the degree of proximity or separation of adjacent phonemes. In addition, because the user can adjust the degree of continuity between phonemes according to intention and preference, speech with a distinctive or characteristic expression can be synthesized.

A synthesis information management device according to a second aspect of the present invention manages synthesis information that designates the pronunciation content of synthesized speech, and comprises: display control means for causing a display device to display a time series of phoneme symbols of a plurality of phonemes corresponding to the pronunciation content designated by the synthesis information, the display control means displaying, between the phoneme symbol of a first phoneme among the plurality of phonemes and the phoneme symbol of a second phoneme immediately following the first phoneme, an operation indicator for designating the degree of separation on the time axis between the first phoneme and the second phoneme; instruction receiving means for receiving an operation of the operation indicator from a user; and information management means for updating the synthesis information so that the first phoneme and the second phoneme are separated on the time axis in accordance with the operation amount of the operation indicator. In this aspect, an operation indicator for designating the degree of separation between the first phoneme and the second phoneme is displayed between the two phoneme symbols, and the synthesis information is edited so that the first phoneme and the second phoneme are separated on the time axis in accordance with the operation amount of the operation indicator. The user can therefore adjust the positional relationship between phonemes while visually and intuitively confirming the degree of separation of adjacent phonemes. For example, the degree of separation may be increased as the operation amount of the operation indicator increases. In addition, because the user can adjust the degree of continuity between phonemes according to intention and preference, speech with a distinctive or characteristic expression can be synthesized.
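The relationship between operation amount and degree of separation in the second aspect can be illustrated with a small sketch. The text only says that a larger operation amount may give a larger separation; the linear mapping, the function name `separation_from_operation`, and the limits `max_amount` and `max_gap_ms` below are all assumptions made for illustration.

```python
def separation_from_operation(amount: float,
                              max_amount: float = 100.0,
                              max_gap_ms: float = 80.0) -> float:
    """Map an operation amount to a separation in milliseconds.

    Assumed behavior: the amount is clamped to [0, max_amount] and scaled
    linearly, so a larger operation amount never yields a smaller gap.
    """
    clamped = min(max(amount, 0.0), max_amount)
    return max_gap_ms * clamped / max_amount


half_gap = separation_from_operation(50.0)
```

Any monotonically non-decreasing mapping would satisfy the description equally well; the linear form is simply the easiest to read.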

The synthesis information management device according to each of the above aspects may be realized by hardware (an electronic circuit) such as a DSP (Digital Signal Processor) dedicated to editing synthesis information and generating audio signals, or by cooperation between a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) and a program. The program of the present invention may be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known type of recording medium, such as a semiconductor recording medium or a magnetic recording medium, may be used. The program of the present invention may also be provided by distribution over a communication network and installed on a computer. The present invention is also specified as a method of operating the synthesis information management device according to each of the aspects described above (a synthesis information management method).

A block diagram of the speech synthesizer 100 according to the first embodiment.
A schematic diagram of the synthesis information.
A schematic diagram of part of a piece of music.
A schematic diagram of the editing screen.
An explanatory diagram of a speech segment.
A flowchart of the overall operation of the speech synthesizer 100 according to the first embodiment.
A flowchart of the editing process according to the first embodiment.
An explanatory diagram of a display example of the separation indicator CS.
An explanatory diagram of a display example of the proximity indicator CC.
A waveform diagram of the audio signal V generated by the speech synthesis unit 28.
A flowchart of the speech synthesis process according to the first embodiment.
An explanatory diagram of the adjustment (separation) of the interval between phonemes.
An explanatory diagram of the adjustment (proximity) of the interval between phonemes.
An explanatory diagram of another form of the display example of the separation indicator CS.
A flowchart of the editing process according to the second embodiment.
An explanatory diagram of a display example of the separation indicator CS and the index value I.
A flowchart of the speech synthesis process according to the second embodiment.
An explanatory diagram of the fusion of speech segments.
A flowchart of the editing process according to the third embodiment.
An explanatory diagram of a display example of the operation indicator 44.
A flowchart of the speech synthesis process of the third embodiment.
An explanatory diagram of a display example of the operation indicator 44.
An explanatory diagram of a display example of the operation indicator 44.
An explanatory diagram of a display example of the operation indicator 44.
An explanatory diagram of a display example of the operation indicator 44.
An explanatory diagram of a display example of the operation indicator 44.
An explanatory diagram of a display example of the operation indicator 44.
An explanatory diagram of a display example of the operation indicator 44.
An explanatory diagram of the editing of a control variable in a comparative example.
An example of a mezzo staccato score.

<First Embodiment>
FIG. 1 is a block diagram of a speech synthesizer 100 according to the first embodiment of the present invention. The speech synthesizer 100 of the first embodiment is a signal processing device that generates an audio signal V of the singing voice of an arbitrary piece of music (hereinafter referred to as the "synthesized music") by segment-connection speech synthesis, in which a plurality of speech segments are connected. In the first embodiment, the audio signal V is generated with the mutual positional relationship between phonemes adjacent on the time axis adjusted in accordance with an instruction received from the user.

As illustrated in FIG. 1, the speech synthesizer 100 is realized by a computer system (for example, an information processing device such as a mobile phone or a personal computer) comprising an arithmetic processing device 10, a storage device 12, a display device 14, an input device 16, and a sound emitting device 18. The display device 14 (for example, a liquid crystal display panel) displays images as instructed by the arithmetic processing device 10. The input device 16 is operating equipment (for example, a pointing device such as a mouse, or a keyboard) that the user operates to give various instructions to the speech synthesizer 100, and includes, for example, a plurality of operators operated by the user. A touch panel formed integrally with the display device 14 may also be employed as the input device 16. The sound emitting device 18 (for example, a speaker or headphones) reproduces sound corresponding to the audio signal V.

The storage device 12 stores the program executed by the arithmetic processing device 10 and various data used by the arithmetic processing device 10. Any known recording medium, such as a semiconductor recording medium or a magnetic recording medium, or a combination of several types of recording media, may be employed as the storage device 12. The storage device 12 of the first embodiment stores a speech segment group L and synthesis information S, as described below.

The speech segment group L is a set of speech segments P (a speech synthesis library) collected in advance from recorded speech of a specific speaker. As illustrated in FIG. 5, the speech segment group L in the first embodiment includes phoneme chains (diphones), each connecting a phoneme pA and a phoneme pB. The phoneme pB (rear phoneme) is located after the phoneme pA (front phoneme). Each speech segment P is expressed as a sample sequence of a speech waveform in the time domain, or as a time series of frequency-domain spectra calculated for each frame of the speech waveform. In the following description, silence is treated as one phoneme for convenience and is denoted by the symbol "Sil".
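As a rough illustration of the diphone idea, the sketch below pairs each phoneme pA with its successor pB, treating silence as the phoneme "Sil" as in the description. The function name `to_diphones`, the padding of both ends with silence, and the phoneme symbols used are assumptions for illustration and are not taken from the specification.

```python
from typing import List, Tuple

SIL = "Sil"  # silence, treated as one phoneme for convenience


def to_diphones(phonemes: List[str]) -> List[Tuple[str, str]]:
    """Return (pA, pB) pairs for a phoneme sequence.

    Assumption: the sequence is padded with silence at both ends so that
    the first and last phonemes also belong to a diphone.
    """
    padded = [SIL] + list(phonemes) + [SIL]
    return list(zip(padded[:-1], padded[1:]))


# Hypothetical phoneme symbols for the syllable "see".
diphones = to_diphones(["s", "i:"])
```

Each resulting pair names one speech segment P that would be looked up in the speech segment group L.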

For each combination of two phonemes, the speech segment group L of the first embodiment stores several types of speech segments P that differ in the positional relationship between the phonemes. For example, for the speech segments corresponding to a specific combination of a phoneme pA and a phoneme pB, the speech segment group L contains: a speech segment P in which the interval between the phoneme pA and the phoneme pB is set to a standard reference value (hereinafter the "standard segment P0"); a speech segment P in which the interval between the phoneme pA and the phoneme pB exceeds the reference value, that is, the phonemes are farther apart than in the standard segment P0 (hereinafter the "separated segment P1"); and a speech segment P in which the interval between the phoneme pA and the phoneme pB is below the reference value, that is, the phonemes are closer together than in the standard segment P0 (hereinafter the "proximity segment P2"). However, segments with different inter-phoneme relationships (the separated segment P1 and the proximity segment P2) are not prepared in advance for every combination of two phonemes. For some two-phoneme combinations, the speech segment group L contains no speech segment other than the standard segment P0.
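One way to picture this library is a table keyed by the phoneme pair and the spacing variant. The sketch below is a toy stand-in: the names `Spacing`, `SegmentGroup`, `add`, and `select` are hypothetical, waveform data is replaced by string labels, and falling back to the standard segment P0 when a variant is missing is an assumption, since this passage does not say how a missing variant is handled.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Tuple


class Spacing(Enum):
    STANDARD = "P0"   # interval between pA and pB at the reference value
    SEPARATED = "P1"  # interval above the reference value
    PROXIMATE = "P2"  # interval below the reference value


@dataclass
class SegmentGroup:
    """Toy stand-in for speech segment group L; waveforms are replaced by labels."""
    segments: Dict[Tuple[str, str, Spacing], str] = field(default_factory=dict)

    def add(self, pa: str, pb: str, spacing: Spacing, data: str) -> None:
        self.segments[(pa, pb, spacing)] = data

    def select(self, pa: str, pb: str, wanted: Spacing) -> Tuple[Spacing, str]:
        """Return the requested variant, or P0 if it is missing (assumed fallback)."""
        if (pa, pb, wanted) in self.segments:
            return wanted, self.segments[(pa, pb, wanted)]
        return Spacing.STANDARD, self.segments[(pa, pb, Spacing.STANDARD)]


group = SegmentGroup()
group.add("s", "i:", Spacing.STANDARD, "s-i:/P0")
group.add("s", "i:", Spacing.SEPARATED, "s-i:/P1")
group.add("t", "u:", Spacing.STANDARD, "t-u:/P0")  # only P0 exists for this pair
```

Returning the spacing that was actually found lets a caller detect that only P0 was available and decide for itself how to realize the requested interval.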

As illustrated in FIG. 2, the synthesis information S is time-series data designating the singing voice of the synthesized music. For each note of the synthesized music it designates, in time series, a pitch X1 (for example, a note number), a sounding period X2, and a speech code X3. The sounding period X2 is the time length (note value) of the note and is defined, for example, by a start time T1 and a time length (duration) T2 of the sounding. A configuration in which the sounding period X2 is defined by the start time T1 and an end time (so that the time length between the two can be calculated as the time length T2) is also suitable. As understood from the above, the synthesis information S can also be described as time-series data designating the score of the synthesized music. The speech code X3 designates the pronunciation content of the speech to be synthesized (that is, the lyrics of the synthesized music). Specifically, the speech code X3 is information designating the speech unit (for example, a syllable or a mora) pronounced for one note of the synthesized music, and includes a pronunciation character QA corresponding to the speech unit and a phoneme symbol QB for each phoneme constituting the speech unit. The pronunciation character QA corresponds to the characters (graphemes) that make up the lyrics of the synthesized music. In the first embodiment, in accordance with an instruction received from the user, inter-phoneme information QC is added immediately after an arbitrary phoneme (first phoneme) in the time series of phonemes. The inter-phoneme information QC defines the positional relationship between that phoneme and the phoneme immediately following it (second phoneme); specifically, it defines the separation or proximity of the first phoneme and the second phoneme on the time axis.
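The structure of the synthesis information S can be sketched as a small data model. The class and field names below (`SpeechCode`, `Note`, `inter_phoneme`), the use of integer ticks for time, and the representation of the inter-phoneme information QC as a mapping from the index of the first phoneme to `"separate"` or `"close"` are all assumptions for illustration; the specification does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SpeechCode:
    """Speech code X3 for one note."""
    character: str        # pronunciation character QA
    phonemes: List[str]   # phoneme symbols QB, in time order
    # Inter-phoneme information QC (assumed encoding): index of the first
    # phoneme -> relation to the phoneme that immediately follows it.
    inter_phoneme: Dict[int, str] = field(default_factory=dict)


@dataclass
class Note:
    pitch: int            # pitch X1, e.g. a note number
    start: int            # start time T1 of sounding period X2, in ticks (assumed unit)
    duration: int         # time length T2 of sounding period X2, in ticks
    speech: SpeechCode    # speech code X3

    @property
    def end(self) -> int:
        """End time, for the variant that defines X2 by start and end times."""
        return self.start + self.duration


# Hypothetical note for the syllable "see" with its two phonemes marked as separated.
see = Note(pitch=64, start=480, duration=240,
           speech=SpeechCode("see", ["s", "i:"], {0: "separate"}))
synthesis_information: List[Note] = [see]
```

Storing the start time and duration and deriving the end time mirrors the first definition of the sounding period X2 in the text; the alternative definition would simply store the end time instead.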

The arithmetic processing device 10 (CPU) of FIG. 1 executes the program stored in the storage device 12 to realize a plurality of functions for editing the synthesis information S and generating the audio signal V (an instruction receiving unit 22, a display control unit 24, an information management unit 26, and a speech synthesis unit 28). A configuration in which the functions of the arithmetic processing device 10 are distributed over several devices, or in which a dedicated electronic circuit (for example, a DSP) realizes some of the functions of the arithmetic processing device 10, may also be employed. The instruction receiving unit 22, the display control unit 24, and the information management unit 26 are realized, for example, by music editing software (an editor), and the speech synthesis unit 28 is realized, for example, by speech synthesis software (a speech synthesis engine). However, the way the functions of the arithmetic processing device 10 are divided among pieces of software is not limited to this example.

The instruction receiving unit 22 receives instructions from the user according to operations on the input device 16. The display control unit 24 causes the display device 14 to display various images. Specifically, the display control unit 24 of the first embodiment causes the display device 14 to display the editing screen 40 of FIG. 4, with which the user checks the content of the synthesized music designated by the synthesis information S. The editing screen 40 is a piano-roll coordinate plane on which a time axis (horizontal axis) and a pitch axis (vertical axis) that intersect each other are set.

For each note designated by the synthesis information S, the display control unit 24 arranges a note image 42, a pronunciation character QA, and phoneme symbols QB in time series on the editing screen 40. FIG. 4 illustrates the editing screen 40 in which the speech units "I", "wan-", "ted", "to", and "see" of the lyrics (character string) "I wanted to see" of the music shown in FIG. 3 are assigned to five notes as speech codes X3. In the illustrated example, a speech code X3 corresponding to a string of several characters is assigned to one note, but a speech code X3 corresponding to a single character may instead be assigned to one note. The note image 42 is an image representing a note of the synthesized music. Specifically, the position of the note image 42 in the direction of the pitch axis is set according to the pitch X1 designated by the synthesis information S. The position of the note image 42 in the direction of the time axis is set according to the start time T1 of the sounding period X2 designated by the synthesis information S, and the display length (size) of the note image 42 in the direction of the time axis is set according to the time length T2 of the sounding period X2. That is, the longer the time length T2 of a note, the longer the display length of its note image 42 on the time axis. The pronunciation character QA and the phoneme symbols QB are arranged inside the note image 42. As understood from the above, the editing screen 40 is an image in which the pronunciation characters QA corresponding to the lyrics (pronunciation content) of the synthesized music and the phoneme symbols QB of the phonemes are arranged in time series. The positions of the pronunciation character QA and the phoneme symbols QB may be changed as appropriate. For example, one or both of the pronunciation character QA and the phoneme symbols QB may be arranged near (outside) the note image 42. A configuration that omits the display of the pronunciation character QA, or one that omits the display of the phoneme symbols QB, may also be employed.
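The placement rule for a note image 42 can be sketched as a mapping from the pitch X1, start time T1, and time length T2 to a rectangle on the piano roll. The function name `note_rect`, the scale factors `px_per_tick` and `row_height`, the choice of `top_pitch`, and the convention that higher pitches are drawn nearer the top are assumptions for illustration.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x: float       # left edge along the time axis, in pixels
    y: float       # top edge along the pitch axis, in pixels
    width: float   # display length along the time axis
    height: float  # height of one pitch row


def note_rect(pitch: int, start: int, duration: int,
              px_per_tick: float = 0.25,
              row_height: float = 12.0,
              top_pitch: int = 127) -> Rect:
    """Place a note image from pitch X1, start time T1 and time length T2.

    Assumed layout: time grows to the right, and higher pitches sit on
    rows closer to the top of the screen.
    """
    x = start * px_per_tick
    width = duration * px_per_tick  # a longer T2 gives a longer note image
    y = (top_pitch - pitch) * row_height
    return Rect(x, y, width, row_height)


rect = note_rect(pitch=60, start=480, duration=240)
```

Because the width depends only on the time length T2, doubling T2 doubles the display length, which matches the statement that longer notes have longer note images.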

By operating the input device 16 as appropriate while checking the editing screen 40 of FIG. 4, the user can instruct the addition or movement of a note image 42 and the addition or change of a pronunciation character QA. By operating the input device 16, the user can also specify, between the phoneme symbol QB of any phoneme (first phoneme) in the time series of phoneme symbols QB displayed on the editing screen 40 and the phoneme symbol QB of the phoneme immediately following it (second phoneme), the positional relationship on the time axis between the first phoneme and the second phoneme.

The information management unit 26 of FIG. 1 edits the synthesis information S in response to instructions from the user on the editing screen 40. For example, in response to an instruction to move a note image 42 along the pitch axis, the information management unit 26 changes the pitch X1 of the note corresponding to that note image 42 in the synthesis information S. Likewise, the information management unit 26 changes the start time T1 of the sounding period X2 of the corresponding note according to the position of the note image 42 along the time axis, and changes the duration T2 of that sounding period X2 according to the display length of the note image 42 on the time axis. That is, an instruction to change the display length of a note image 42 is equivalent to an instruction to change the duration T2 of the sounding period X2. When the pronunciation character QA of a note is changed, the information management unit 26 changes the pronunciation character QA corresponding to that note in the synthesis information S and updates the phoneme symbols QB of the note to match the new pronunciation character QA. In addition, the information management unit 26 adds, immediately after the phoneme symbol QB of the first phoneme, inter-phoneme information QC indicating the positional relationship instructed by the user, so that the first phoneme and the second phoneme take that positional relationship on the time axis.
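One way to picture the inter-phoneme information QC is as an optional tag stored with the first phoneme and rendered after its phoneme symbol QB. The sketch below is hypothetical: the `Phoneme` class, the `set_relation` and `render` helpers, and the string values "separate" and "close" are illustrative names, and the specification does not describe how the synthesis information S is actually stored.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Phoneme:
    symbol: str               # phoneme symbol QB
    qc: Optional[str] = None  # inter-phoneme information QC: "separate", "close", or None


def set_relation(phonemes: List[Phoneme], index: int, relation: str) -> None:
    """Attach QC to the phoneme at `index` (the first phoneme of the pair)."""
    if relation not in ("separate", "close"):
        raise ValueError("relation must be 'separate' or 'close'")
    if not 0 <= index < len(phonemes) - 1:
        raise IndexError("the first phoneme must have a following phoneme")
    phonemes[index].qc = relation


def render(phonemes: List[Phoneme]) -> str:
    """Show each symbol followed by its connection indicator, if any."""
    marks = {"separate": ".", "close": "-\\"}  # dot, and hyphen plus backslash
    return " ".join(p.symbol + marks.get(p.qc, "") for p in phonemes)
```

For example, tagging [n] in the sequence [n], [t], [I] with "separate" renders as `n. t I`, with the dot appearing immediately after the first phoneme.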

The speech synthesis unit 28 of FIG. 1 generates a speech signal V using the speech unit group L and the synthesis information S stored in the storage device 12. Specifically, the speech synthesis unit 28 sequentially selects from the speech unit group L the speech units P corresponding to the speech code X3 of each note specified by the synthesis information S, adjusts each speech unit P to the pitch X1 and the sounding period X2, and concatenates the units to generate the speech signal V of the singing voice. The speech signal V generated by the speech synthesis unit 28 is supplied to the sound emitting device 18, which plays back the singing voice of the synthesized song.

FIG. 6 is a flowchart outlining the operation of the speech synthesis apparatus 100 according to the first embodiment. The processing of FIG. 6 starts, for example, when the user instructs editing of the synthesis information S. The arithmetic processing device 10 (display control unit 24, information management unit 26) causes the display device 14 to display the editing screen 40 corresponding to the synthesis information S (SA1), and executes an editing process SA2 that edits the synthesis information S according to the instructions the instruction reception unit 22 receives from the user and updates the editing screen 40 to reflect the edits. After executing the editing process SA2, the arithmetic processing device 10 determines whether the user has instructed speech synthesis (SA3). If speech synthesis has been instructed (SA3: YES), it executes a speech synthesis process SA4 that generates the speech signal V of the singing voice specified by the synthesis information S. If speech synthesis has not been instructed (SA3: NO), the speech synthesis process SA4 is not executed.

The arithmetic processing device 10 determines whether the user has instructed the end of processing (SA5). If the end of processing has not been instructed (SA5: NO), the arithmetic processing device 10 executes the editing process SA2 again. If the end of processing has been instructed (SA5: YES), the arithmetic processing device 10 ends the processing of FIG. 6.
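The control flow of FIG. 6 (SA1 through SA5) can be traced with a short loop. This is a sketch under assumptions: `run_editor` and the dictionary keys `"synthesize"` and `"quit"` are hypothetical stand-ins for the user's instructions, and each list entry represents one pass through the editing process.

```python
def run_editor(events):
    """Trace steps SA1-SA5 for a list of hypothetical user events."""
    log = ["SA1:display"]                 # SA1: display the editing screen 40
    for event in events:
        log.append("SA2:edit")            # SA2: edit S and refresh the screen
        if event.get("synthesize"):       # SA3: was speech synthesis instructed?
            log.append("SA4:synthesize")  # SA4: generate the speech signal V
        if event.get("quit"):             # SA5: was the end of processing instructed?
            break                         # SA5: YES ends the processing
    return log
```

Calling `run_editor([{"synthesize": False}, {"synthesize": True, "quit": True}])` records two editing passes, with synthesis performed only on the second pass before the loop ends.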

For the note image 42 of a note on the editing screen 40 that contains a desired speech code X3, the user can instruct, immediately after the phoneme symbol QB of a first phoneme contained in the speech code X3, that the first phoneme and the second phoneme be separated or brought closer together on the time axis. FIG. 7 is a flowchart of the processing executed within the editing process SA2 of FIG. 6 when an instruction to separate the first and second phonemes or bring them closer is received from the user. When the instruction reception unit 22 receives such an instruction, the display control unit 24 displays a connection indicator C immediately after the phoneme symbol QB of the first phoneme on the editing screen 40 (SB1). The information management unit 26 adds to the synthesis information S inter-phoneme information QC that specifies separation or proximity, so that the first phoneme [n] and the second phoneme [t] are separated or brought closer on the time axis according to the separation indicator CS or the proximity indicator CC (SB2). Specific examples of the processing by the display control unit 24 and the information management unit 26 are detailed below.

<Separation of consecutive phonemes>
FIG. 8 illustrates one form of the connection indicator (reproducing part of FIG. 4). By operating the input device 16 as appropriate, the user can instruct that the phonemes corresponding to any two consecutive phoneme symbols QB in the time series of phoneme symbols QB displayed on the editing screen 40 be separated on the time axis. For example, the user can select any one phoneme symbol QB and then instruct separation between the phoneme of that symbol (first phoneme) and the immediately following phoneme (second phoneme). FIG. 8 assumes that the instruction reception unit 22 has received from the user an instruction to separate, on the time axis, the phoneme [n] (first phoneme) contained in the speech code X3 of "wan-" and the phoneme [t] (second phoneme) contained in the speech code X3 of "ted". On receiving this instruction, the display control unit 24 displays, as the connection indicator C, a separation indicator CS (a dot ".") representing the separation of the first and second phonemes on the time axis, immediately after the phoneme symbol QB of the first phoneme [n] on the editing screen 40. In accordance with the separation indicator CS, the information management unit 26 adds, immediately after the first phoneme [n], inter-phoneme information QC specifying the separation of the first phoneme [n] and the second phoneme [t] on the time axis. As understood from the above, in the first embodiment the time series of phoneme symbols QB and the connection indicator C (separation indicator CS) corresponding to the user's instruction are displayed on the editing screen 40, so the user can adjust the degree of continuity between the first phoneme [n] and the second phoneme [t] (here, separate them) while checking the phonemes contained in the lyrics of the synthesized song.

The display form of the separation indicator CS is arbitrary. For example, besides the symbol illustrated in FIG. 8, any character string, image, or the like may be displayed as the separation indicator CS. In the above description, the speech code X3 of "wan-" containing the first phoneme [n] and the speech code X3 of "ted" containing the second phoneme [t] correspond to separate note images 42, but a configuration in which the first and second phonemes are contained in a single note interval (note image 42) may also be adopted. For example, as illustrated in FIG. 14, the connection indicator C may be displayed between a first phoneme [t] and a second phoneme [I].

<Proximity of consecutive phonemes>
FIG. 9 illustrates one form of the connection indicator (reproducing part of FIG. 4). FIG. 9 assumes that the instruction reception unit 22 has received from the user an instruction to bring closer together, on the time axis, the phoneme [n] (first phoneme) contained in the speech code X3 of "wan-" and the phoneme [t] (second phoneme) contained in the speech code X3 of "ted". On receiving this instruction, the display control unit 24 displays, as the connection indicator C, a proximity indicator CC (a hyphen "-" and a backslash "\") representing the proximity of the first and second phonemes on the time axis, immediately after the phoneme symbol QB of the first phoneme [n] on the editing screen 40. In accordance with the proximity indicator CC, the information management unit 26 adds, immediately after the first phoneme [n], inter-phoneme information QC specifying the proximity of the first phoneme [n] and the second phoneme [t] on the time axis. As understood from the above, in the first embodiment the time series of phoneme symbols QB and the connection indicator C (proximity indicator CC) corresponding to the user's instruction are displayed on the editing screen 40, so the user can adjust the degree of continuity between the first phoneme [n] and the second phoneme [t] (here, bring them closer) while checking the phonemes contained in the lyrics of the synthesized song. Besides the symbols illustrated in FIG. 9, any character string, image, or the like may be adopted as the display form of the proximity indicator CC.

FIG. 10 is a waveform diagram of the speech signal V generated by the speech synthesis unit 28 in the speech synthesis process SA4. Specifically, FIG. 10 illustrates the waveform of the speech signal V generated from synthesis information S with the content illustrated in FIG. 4. The portion enclosed by a rectangle in FIG. 10 indicates the section of the speech unit group L that contains the phoneme [n] and the phoneme [t]. The following description focuses on the interval between the phoneme [n] and the phoneme [t].

FIG. 11 is a flowchart of a specific example of the speech synthesis process SA4 illustrated in FIG. 6. On starting the speech synthesis process SA4, the speech synthesis unit 28 determines whether inter-phoneme information QC is attached to the leading phoneme of the speech unit to be selected according to each speech code X3 that the synthesis information S specifies for each note (SC1). If no inter-phoneme information QC is attached (SC1: NO), the speech synthesis unit 28 selects the standard unit P0 corresponding to the speech code X3 from the speech unit group L (SC2). If inter-phoneme information QC is attached (SC1: YES), the speech synthesis unit 28 determines whether it specifies separation or proximity (SC3).

For example, when inter-phoneme information QC specifying separation is set as illustrated in FIG. 8 (SC3: separation), the speech synthesis unit 28 determines whether the speech unit group L contains a separated-type speech unit P for [n.t] (separated unit P1), in which the interval between the first phoneme [n] and the second phoneme [t] is wider than a reference value (SC4). If the speech unit group L contains the separated unit P1 (SC4: YES), the speech synthesis unit 28 selects the separated unit P1 from the speech unit group L (SC5). If the speech unit group L does not contain the separated unit P1 (SC4: NO), the speech synthesis unit 28 selects from the speech unit group L the speech unit P for [n-t], in which the interval between the first phoneme [n] and the second phoneme [t] is set to the reference value (standard unit P0), and lengthens the interval between the first phoneme [n] and the second phoneme [t] in the standard unit P0 (SC6). Specifically, as illustrated in FIG. 12, the interval D from the end of the first phoneme [n] to the start of the second phoneme [t] is extended to D1 (D1 > D). The interval D1 is, for example, the initial interval D multiplied by a predetermined ratio (> 1).

On the other hand, when inter-phoneme information QC specifying proximity is set as illustrated in FIG. 9 (SC3: proximity), the speech synthesis unit 28 determines whether the speech unit group L contains a proximity-type speech unit P for [n-\t] (proximity unit P2), in which the interval between the first phoneme [n] and the second phoneme [t] is narrower than the reference value (SC7). If the speech unit group L contains the proximity unit P2 (SC7: YES), the speech synthesis unit 28 selects the proximity unit P2 (SC8). If the speech unit group L does not contain the proximity unit P2 (SC7: NO), the speech synthesis unit 28 selects the standard unit P0 from the speech unit group L and shortens the interval between the first phoneme [n] and the second phoneme [t] in the standard unit P0 (SC9). Specifically, as illustrated in FIG. 13, the interval D from the end of the first phoneme [n] to the start of the second phoneme [t] is shortened to D2 (D2 < D). The interval D2 is, for example, the initial interval D multiplied by a predetermined ratio (< 1).
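The selection logic of steps SC1 through SC9 can be sketched as below. The sketch rests on several assumptions: the speech unit group L is modeled as a dictionary keyed by notations such as "n-t", "n.t", and "n-\t"; a unit is reduced to a name and its interval D; and the ratios 1.5 and 0.5 are hypothetical, since the specification only requires a ratio above 1 for stretching and below 1 for shortening.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    gap: float  # interval D between the two phonemes, in seconds


def select_unit(library, first, second, qc, stretch=1.5, shrink=0.5):
    """Pick or derive the speech unit for the phoneme pair (first, second)."""
    standard = library[first + "-" + second]     # standard unit P0
    if qc is None:                               # SC1: NO -> SC2
        return standard
    if qc == "separate":                         # SC3: separation
        key, ratio = first + "." + second, stretch
    elif qc == "close":                          # SC3: proximity
        key, ratio = first + "-\\" + second, shrink
    else:
        raise ValueError("qc must be None, 'separate', or 'close'")
    if key in library:                           # SC4 / SC7: YES
        return library[key]                      # SC5 / SC8: use the stored unit
    return Unit(key, standard.gap * ratio)       # SC6 / SC9: rescale the interval D
```

With a library holding only "n-t" (D = 0.04) and "n.t" (D = 0.09), a separation request returns the stored separated unit, while a proximity request falls back to the standard unit with its interval scaled by the assumed shrink ratio.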

The speech synthesis unit 28 adjusts the speech units P selected, lengthened, or shortened in the steps described above (SC2, SC5, SC6, SC8, SC9) to the pitch X1 and the sounding period X2 specified by the synthesis information S, and concatenates them to generate the speech signal V (SC10). The speech synthesis unit 28 then determines whether all phonemes contained in the synthesis information S have been processed (SC11). If processing is not complete (SC11: NO), processing returns to step SC1 and the same processing is executed for the other phonemes specified by the synthesis information S. If processing is complete (SC11: YES), the speech synthesis unit 28 ends the speech synthesis process SA4.

As described above, the first embodiment displays, in response to an instruction from the user, the time series of phoneme symbols QB and a connection indicator C (separation indicator CS or proximity indicator CC) that specifies the separation or proximity of the first phoneme [n] and the second phoneme [t] contained in the speech codes X3. The user can therefore adjust the degree of continuity between the first phoneme [n] and the second phoneme [t] (separate them or bring them closer) while visually and intuitively checking, on the editing screen 40, the phonemes (phoneme symbols QB) arranged in time series. According to the first embodiment, the user can synthesize speech with a distinctive or characteristic expression that matches the user's own intentions and preferences, and can also avoid synthesizing speech that sounds unnatural or unclear because consecutive phonemes are excessively close together or far apart.

Furthermore, in the first embodiment, when the speech unit group L does not contain the separated unit P1 (the speech unit P1 for [n.t]), in which the interval between the first and second phonemes is at or above the reference value, or the proximity unit P2 (the speech unit P2 for [n-\t]), in which that interval is at or below the reference value, the interval (section D) between the first phoneme [n] and the second phoneme [t] in the existing standard unit P0 (the speech unit for [n-t]) is lengthened or shortened. It is therefore unnecessary to store, for every combination of two phonemes, multiple kinds of speech units with different positional relationships between the phonemes (separated units P1, proximity units P2), which has the advantage of reducing the storage capacity required of the storage device 12.

<Second Embodiment>
A second embodiment of the present invention is described below. The first embodiment illustrated a configuration in which the first and second phonemes are separated or brought closer according to the connection indicator C (separation indicator CS, proximity indicator CC). In the second embodiment, an index value indicating the degree of separation or proximity between the first and second phonemes is received from the user, and the first and second phonemes are separated or brought closer according to that index value.

FIG. 15 is a flowchart of a specific example of the processing executed within the editing process SA2 when an instruction to separate the first and second phonemes or bring them closer, together with an index value, is received from the user. In the editing process SA2 of the second embodiment, the processing of SB2 in the editing process SA2 of the first embodiment is replaced by the processing of SD1 and SD2. When the instruction reception unit 22 receives an index value following an instruction for separation or proximity between phonemes, the display control unit 24 displays an index value I indicating the degree of separation or proximity near the connection indicator C that immediately follows the phoneme symbol QB of the first phoneme on the editing screen 40 (SD1). The information management unit 26 adds to the synthesis information S inter-phoneme information QC that specifies separation or proximity together with the index value I, so that the first phoneme [n] and the second phoneme [t] are separated or brought closer on the time axis according to the connection indicator C (separation indicator CS or proximity indicator CC) and the index value I (SD2). Specific examples of the processing by the display control unit 24 and the information management unit 26 in the second embodiment are detailed below.

FIG. 16 illustrates a display example of the connection indicator C and the index value I. By operating the input device 16 as appropriate, the user can specify an index value I that defines the degree of separation or proximity, on the time axis, of the phonemes corresponding to consecutive phoneme symbols QB in the time series of phoneme symbols QB displayed on the editing screen 40. For example, the user can select any one phoneme symbol QB and then specify an index value indicating the degree of separation or proximity between the phoneme of that symbol (first phoneme) and the immediately following phoneme (second phoneme). The index value is an integer that relatively defines the degree of separation (proximity) between the first and second phonemes in the range from 0 to 100. The correspondence between the numerical value of the index value I and the degree of separation may be determined arbitrarily. For example, when the index value I is at its maximum (100), that is, when the degree of separation is greatest, the first and second phonemes are separated on the time axis to the same extent as when only the separation indicator CS is specified. Conversely, when the index value I is at its minimum (0), that is, when the degree of separation is smallest, the first and second phonemes are brought together on the time axis to the same extent as when only the proximity indicator CC is specified. FIG. 16 assumes that the instruction reception unit 22 has received from the user an instruction to separate, on the time axis, the phoneme [n] (first phoneme) contained in the speech code X3 of "wan-" and the phoneme [t] (second phoneme) contained in the speech code X3 of "ted", together with an index value I (80). On receiving these instructions, the display control unit 24 displays, immediately after the phoneme symbol QB of the first phoneme [n] on the editing screen 40, the separation indicator CS representing the separation of the first and second phonemes on the time axis as the connection indicator C, and displays the index value I (80) immediately after the separation indicator CS. In accordance with the separation indicator CS and the index value I (80), the information management unit 26 adds, immediately after the first phoneme [n], inter-phoneme information QC specifying the separation of the first phoneme [n] and the second phoneme [t] on the time axis and the degree of that separation (index value I). As understood from the above, in the second embodiment the time series of phoneme symbols QB, the connection indicator C (separation indicator CS) corresponding to the user's instruction, and the index value I are displayed on the editing screen 40, so the user can adjust the degree of continuity between the first phoneme [n] and the second phoneme [t] (here, separate them) while checking both the phonemes contained in the lyrics of the synthesized song and the interval between consecutive phonemes.

FIG. 17 is a flowchart of a specific example of the speech synthesis process SA4 in the second embodiment. In the speech synthesis process SA4 of the second embodiment, the processing from SC4 to SC9 in the speech synthesis process SA4 of the first embodiment is replaced by the processing of SE1 and SE2. When inter-phoneme information QC specifying separation is set (SC3: separation), the speech synthesis unit 28 selects the standard unit P0 and the separated unit P1 corresponding to the speech code X3 from the speech unit group L, and fuses the standard unit P0 and the separated unit P1 according to the index value I (SE1).

FIG. 18 illustrates the fusion of speech units in the second embodiment. After selecting the standard unit P0 for [n-t] and the separated unit P1 for [n.t] from the speech unit group L, the speech synthesis unit 28 fuses the separated unit P1 for [n.t] and the standard unit P0 for [n-t] at a ratio determined by the index value I (80). Specifically, the separated unit P1 and the standard unit P0 are fused at a ratio of 8:2, so that the proportion of the separated unit P1 corresponds to the index value (80), to generate a speech unit P3 for [n.(80)t]. As understood from the above, the index value I defines the degree of separation between the first phoneme [n] and the second phoneme [t] as the proportion (80/100) of the separated unit P1 in the fusion of the separated unit P1 and the standard unit P0. As illustrated in FIG. 18, the interval D3 from the end of the first phoneme [n] to the start of the second phoneme [t] in the speech unit P3 is set to a duration that depends on the index value I, within the range between the inter-phoneme interval D1 of the separated unit P1 and the inter-phoneme interval D of the standard unit P0.

Returning to FIG. 17, when inter-phoneme information QC specifying proximity is set (SC3: proximity), the speech synthesis unit 28 selects the standard unit P0 and the proximity unit P2 from the speech unit group L and fuses the standard unit P0 and the proximity unit P2 according to the index value I (SE2). The fusion of the standard unit P0 and the proximity unit P2 is not illustrated, but when, for example, an index value I (80) is specified, the proximity unit P2 and the standard unit P0 are fused at a ratio of 8:2, as in the example of FIG. 18, so that the proportion of the proximity unit P2 corresponds to the index value I (80), generating a speech unit P for [n-\(80)t]. The speech synthesis unit 28 adjusts the speech units P selected or generated in the steps described above (SC2, SE1, SE2) to the pitch X1 and the sounding period X2 specified by the synthesis information S, and concatenates them to generate the speech signal V (SC10). The subsequent processing is the same as in the first embodiment and its description is omitted.
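The effect of the fusion in SE1 and SE2 on the inter-phoneme interval can be sketched as a weighted average. Linear interpolation is an assumption here: the specification states only that the fused interval lies between the intervals of the two source units and depends on the index value I. The function name `fused_gap` and the example intervals are likewise hypothetical, and the sketch covers the interval only, not the fusion of the waveforms themselves.

```python
def fused_gap(d_standard, d_variant, index_value):
    """Interval of the fused unit, assuming linear blending of the intervals.

    d_standard:  interval D of the standard unit P0
    d_variant:   interval D1 of the separated unit P1 or D2 of the proximity unit P2
    index_value: index value I, from 0 to 100 (proportion of the variant unit)
    """
    if not 0 <= index_value <= 100:
        raise ValueError("index value I must be between 0 and 100")
    weight = index_value / 100  # e.g. I = 80 gives the 8:2 ratio
    return weight * d_variant + (1 - weight) * d_standard
```

Under this assumption, with D = 0.04 and D1 = 0.09 an index value of 80 gives D3 = 0.8 × 0.09 + 0.2 × 0.04 = 0.08, which lies between D and D1 as FIG. 18 requires.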

As described above, in the second embodiment, a time series of a plurality of phoneme symbols QB, a connection indicator C (separation indicator CS or proximity indicator CC), and an index value I specifying the degree of separation or proximity are displayed in accordance with an instruction from the user. The user can therefore finely adjust the degree of continuity between the first phoneme [n] and the second phoneme [t] (separating them or bringing them closer) according to the index value I, while visually and intuitively confirming the plurality of phonemes (phoneme symbols QB) arranged in time series on the editing screen 40. Specifically, two effects are especially pronounced in the second embodiment: the user can synthesize speech with a distinctive or characteristic expression that matches his or her own intentions and preferences, and it is possible to avoid synthesizing speech that sounds unnatural or unclear because adjacent phonemes are excessively close together or far apart.

<Third Embodiment>
A third embodiment of the present invention is described below. The second embodiment illustrated a configuration in which the user specifies, as a numerical value, the index value that defines the degree of separation or proximity between the first phoneme and the second phoneme. In the third embodiment, a time series of a plurality of phoneme symbols QB and an operation indicator (slider) are displayed on the editing screen 40, and an index value indicating the degree of separation between the first phoneme and the second phoneme is set according to the operation amount of the operation indicator.

FIG. 19 is a flowchart of a specific example of the part of the editing process SA2 that is executed when an instruction to display the operation indicator is received from the user. The display control unit 24 of the third embodiment displays the operation indicator 44 on the edge corresponding to the end point of the sounding period of the note image 42 that contains the phoneme symbol QB of the first phoneme included in the phonetic code X3 (SF1). The instruction reception unit 22 waits until an operation on the operation indicator 44 is received from the user (SF2: NO). When an operation specifying the degree of separation is received from the user via the operation indicator 44 (SF2: YES), the information management unit 26 adds inter-phoneme information QC specifying the operation amount M to the synthesis information S, so that the first phoneme [n] and the second phoneme [t] are separated on the time axis according to the operation amount M of the operation indicator 44 (SF3). Specific examples of the processing by the display control unit 24 and the information management unit 26 in the third embodiment are described in detail below.

FIG. 20 is an explanatory diagram of a display example of the operation indicator 44. The operation indicator 44 is placed on the edge corresponding to the end point of the sounding period of the note image 42 that corresponds, on the editing screen 40, to the phonetic code X3 containing the first phoneme [n]. The user can move the operation indicator 44 along that edge (that is, in the direction of the pitch axis) within the range from the upper side to the lower side of the note image 42. The correspondence between the operation amount M of the operation indicator 44 and the degree of separation may be chosen freely. In the third embodiment, the point O on the upper side of the note image 42, where the operation indicator 44 is initially placed, serves as the starting point, and the degree of separation increases as the operation amount M from the starting point O downward along the pitch axis increases. By sliding the operation indicator 44 downward along the pitch axis from the starting point O, the user can specify, in a single operation, both that the first phoneme [n] and the second phoneme [t] are to be separated on the time axis and the degree of that separation. FIG. 20 assumes a case in which an instruction is given to separate the phoneme [n] (first phoneme) included in the phoneme symbols QB of "wan-" and the phoneme [t] (second phoneme) included in the phoneme symbols QB of "ted" by a degree corresponding to the operation amount M (40). When this instruction is received, the display control unit 24 changes the shape of the note image 42 containing the first phoneme [n], according to the operation amount M, so that a notch is formed that is bounded by a straight line connecting a point on the edge corresponding to the end point of the sounding period and a point on the upper side. As understood from FIG. 20, the notch grows as the operation amount M increases.

When the instruction from the user is received, the information management unit 26 adds an index value I corresponding to the operation amount M to the inter-phoneme information QC. The index value I of the third embodiment is an integer that specifies, on a relative scale from 0 to 100, the degree of separation between the first phoneme and the second phoneme. For example, when the index value I is at its maximum (100), that is, when the degree of separation is greatest, the first phoneme and the second phoneme are separated on the time axis to the same extent as when only the separation indicator CS is specified. When the index value I is at its minimum (0), that is, when the degree of separation is smallest, the positional relationship between the first phoneme and the second phoneme on the time axis is not adjusted. When the index value I is above the minimum and below the maximum (0 < I < 100), the first phoneme and the second phoneme are separated on the time axis according to the index value I.
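The editing step that turns a slider operation into inter-phoneme information can be sketched as follows. The class and function names are hypothetical stand-ins for the synthesis information S and the inter-phoneme information QC, and the linear mapping from the operation amount M to the index value I is an assumption; the description leaves that correspondence open.

```python
from dataclasses import dataclass, field


@dataclass
class InterphonemeInfo:
    """Hypothetical stand-in for the inter-phoneme information QC."""
    first_phoneme: str
    second_phoneme: str
    index_value: int


@dataclass
class SynthesisInfo:
    """Hypothetical stand-in for the synthesis information S."""
    interphoneme: list = field(default_factory=list)


def index_from_operation(m: float, note_height: float) -> int:
    """Map the operation amount M (distance below point O) to I in 0..100.

    A linear mapping over the height of the note image is assumed.
    """
    if note_height <= 0:
        raise ValueError("note_height must be positive")
    ratio = max(0.0, min(1.0, m / note_height))
    return round(ratio * 100)


def apply_slider(s: SynthesisInfo, first: str, second: str,
                 m: float, note_height: float) -> InterphonemeInfo:
    """Add inter-phoneme information for the phoneme pair to S."""
    qc = InterphonemeInfo(first, second, index_from_operation(m, note_height))
    s.interphoneme.append(qc)
    return qc


s = SynthesisInfo()
qc = apply_slider(s, "n", "t", m=40.0, note_height=100.0)  # I = 40
```

Clamping keeps the index value inside 0 to 100 when the slider is confined to the note image, which matches the movable range described for the third embodiment.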

FIG. 21 is a flowchart of the speech synthesis process SA4 of the third embodiment. In the speech synthesis process SA4 of the third embodiment, the processes SC3, SE1, and SE2 of the speech synthesis process SA4 of the second embodiment are replaced with the processes SG1 to SG3. The speech synthesis unit 28 determines whether the index value I specified by the inter-phoneme information QC is equal to either the maximum value (100) or the minimum value (0) (SG1). When the index value I is the maximum value (I = 100), the speech synthesis unit 28 selects the separation-type unit P1. When the index value I is the minimum value (I = 0), the speech synthesis unit 28 selects the standard unit P0. When the index value I is neither the maximum nor the minimum value (0 < I < 100) (SG1: NO), the speech synthesis unit 28 selects the separation-type unit P1 and the standard unit P0 from the speech unit group L and fuses them at a ratio corresponding to the index value I. When, as illustrated in FIG. 20, the index value is set to I = 40 according to the operation amount M (40), the speech synthesis unit 28 fuses the separation-type unit P1 and the standard unit P0 at a ratio of P1:P0 = 4:6, so that the proportion of the separation-type unit P1 corresponds to the index value I (SG2). The speech synthesis unit 28 adjusts the speech units selected or generated in SC2, SG2, and SG3 to the pitch X1 and the sounding period X2 designated by the synthesis information S, and then connects them to one another to generate the speech signal V (SC10). The subsequent processing is the same as in the first embodiment described above, and its description is omitted.
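The branching on the index value I described above can be sketched as a small function that returns which units are used and in what proportion. The function name and the dictionary representation are hypothetical; how the units themselves are fused is not modeled here.

```python
def plan_unit(index_value: int) -> dict:
    """Return the units to use and their proportions for an index value I.

    Keys "P1" and "P0" name the separation-type and standard units.
    """
    if index_value == 100:
        return {"P1": 1.0}          # maximum: separation-type unit only
    if index_value == 0:
        return {"P0": 1.0}          # minimum: standard unit only
    if 0 < index_value < 100:
        w1 = index_value / 100      # proportion of P1 follows I
        return {"P1": w1, "P0": 1.0 - w1}
    raise ValueError("the third embodiment assumes 0 <= I <= 100")


plan = plan_unit(40)  # P1:P0 = 4:6, as in the FIG. 20 example
```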

As described above, in the third embodiment the index value I can be specified by moving the operation indicator 44. Compared with the second embodiment, in which the index value I is specified as a number, this has the advantage that the user can specify the index value I intuitively and easily.

In the third embodiment, the operation indicator 44 is placed on the edge corresponding to the end point of the sounding period of the note image 42 that corresponds to the phonetic code X3 containing the first phoneme [n] on the editing screen 40, and is moved within the range from the upper side to the lower side of that note image 42. As illustrated by the region enclosed by the dotted line in FIG. 22, it is also possible to enlarge the display near the boundary between adjacent note images 42 to help the user move the operation indicator 44.

<Modifications>
Each of the embodiments described above can be modified in various ways. Specific modifications are illustrated below. Two or more aspects freely selected from the following examples may be combined as appropriate.

(1) The third embodiment illustrated a configuration in which the degree of separation between the first phoneme [n] and the second phoneme [t] is specified according to the amount M by which the user operates the operation indicator 44. A scale indicating the operation amount M may be added to this configuration. Specifically, as illustrated in FIG. 23, a scale 46 may be displayed near the operation indicator 44. This aspect has the advantage that the user can easily see the degree of separation.

(2) The third embodiment illustrated a configuration in which the movable range of the operation indicator 44 extends from the upper side to the lower side of the note image 42 containing the first phoneme, and the index value I (degree of separation) corresponding to the operation amount M is an integer in the range from 0 to 100. The movable range of the operation indicator 44 may, however, extend outside the area of the note image 42 containing the first phoneme (below the lower side of the note image 42, or above its upper side, in the pitch axis direction).

For example, as illustrated in FIG. 24, when the operation indicator 44 is moved below the lower side of the note image 42 containing the first phoneme in the pitch axis direction, the index value I may be set to a value exceeding 100 (I > 100). When the index value I exceeds 100, the positional relationship between the first phoneme and the second phoneme is adjusted so that the degree of separation becomes extremely large (that is, so that the interval between the adjacent phonemes is longer than the interval between the first phoneme and the second phoneme in the separation-type unit P1). FIG. 24 assumes a case in which the operation indicator 44 has been moved below the lower side of the note image 42 containing the first phoneme [n] and the index value I corresponding to the operation amount M is set to 150. When the degree of separation between phonemes is sufficiently large in this way, the speech synthesis unit 28 generates a speech unit P of [n Sil t] by inserting a silent phoneme "Sil" between the first phoneme [n] and the second phoneme [t] of the standard unit P0, and combines the speech unit P of [n Sil t] with the standard unit P0 of [n-t] to generate a speech unit corresponding to the index value I (150).

Similarly, as illustrated in FIG. 25, when the operation indicator 44 is moved above the upper side of the note image 42 containing the first phoneme in the pitch axis direction, the index value I may be set to a negative number (I < 0) according to the operation amount M. When the index value I is negative, the positional relationship between the first phoneme and the second phoneme is adjusted so that the degree of proximity becomes extremely large. FIG. 25 takes the sound units "got" and "up", with the phoneme [t] included in the phoneme symbols QB "gh Q t" as the first phoneme and the phoneme [V] included in the phoneme symbols QB "V p" as the second phoneme, and assumes a case in which the operation indicator 44 has been moved above the upper side of the note image 42 containing the first phoneme [t] and the index value I corresponding to the operation amount M is set to -100. When the degree of proximity between phonemes is sufficiently large in this way, the speech synthesis unit 28 brings the first phoneme [t] and the second phoneme [V] closer together on the time axis, so that the interval between them is shorter than the interval between the first phoneme and the second phoneme in the proximity-type unit P2. Because the first phoneme [t] is a plosive, the speech synthesis unit 28 generates a speech unit intermediate between the speech unit P of [t-V] and the speech unit P of [Q-V], so as to shorten the interval on the time axis between the first phoneme [t] and the second phoneme [V]. This configuration makes it possible to produce a colloquial pronunciation in which a phoneme is dropped (assimilated), as in "go up (gh Q V p)". A configuration that assigns a specific speech unit to the index value I (-100), or one that assigns a predetermined fusion ratio to it, may also be added.
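The extended slider range of modification (2) can be sketched as follows. The names are hypothetical, the unclamped linear mapping from the operation amount M to the index value I is an assumption, and the returned strings merely label the handling described above for each range of I.

```python
def extended_index(m: float, note_height: float) -> int:
    """Map M (positive below point O, negative above it) to an index value I.

    Unlike the third embodiment, the result is not clamped to 0..100.
    A linear mapping is assumed.
    """
    if note_height <= 0:
        raise ValueError("note_height must be positive")
    return round(100 * m / note_height)


def synthesis_regime(index_value: int) -> str:
    """Label the handling described in the text for each range of I."""
    if index_value < 0:
        return "bring phonemes closer than in proximity-type unit P2"
    if index_value == 0:
        return "use standard unit P0 unchanged"
    if index_value < 100:
        return "fuse separation-type unit P1 with standard unit P0"
    if index_value == 100:
        return "use separation-type unit P1"
    return "insert silent phoneme Sil and combine with standard unit P0"


i_below = extended_index(150.0, 100.0)   # slider below the lower side: I = 150
i_above = extended_index(-100.0, 100.0)  # slider above the upper side: I = -100
```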

(3) As illustrated in FIG. 26, in a configuration that allows the operation indicator 44 to be operated outside the area of the note image 42 containing the first phoneme, the boundary line between the note image 42 containing the first phoneme [t] and the note image 42 of "V p" containing the second phoneme [V] may be inclined with respect to the pitch axis when the operation indicator 44 is moved above the upper side of the note image 42 of the first phoneme in the pitch axis direction. FIG. 26 assumes a case in which the operation indicator 44 has been moved above the upper side of the note image 42 containing the first phoneme [t] and the index value I is set to a negative number (I < 0). In this configuration, the angle of the boundary line between the note images 42 changes according to the sign of the index value I (a negative index value I is thereby highlighted), which has the advantage that the user can intuitively grasp the sign of the index value I.

(4) The third embodiment illustrated a configuration in which the degree of separation between the first phoneme and the second phoneme is specified according to the operation amount M of the operation indicator 44 in the pitch axis direction. In addition, a configuration may be adopted in which characteristics of the synthesized sound (volume, timbre, intonation, and so on) are specified according to the operation amount M of the operation indicator 44 in the time axis direction. For example, as illustrated in FIGS. 27 and 28, the positional relationship between the first phoneme and the second phoneme can be adjusted according to the operation amount M1 of the operation indicator 44 in the pitch axis direction, while the characteristics of each phoneme are controlled according to the operation amount M2 of the operation indicator 44 in the time axis direction. The user can specify both the positional relationship between the first phoneme and the second phoneme and the characteristics of the synthesized speech in a single operation, while visually and intuitively confirming the plurality of phonemes (phonetic codes X3) arranged in time series on the time axis.

As illustrated in FIG. 29, a configuration has previously been proposed (hereinafter the "comparative example") in which the user can freely specify, by operating a control variable specification screen 70, the temporal change of a control variable representing a characteristic of the synthesized speech specified on the editing screen 40. In the comparative example, the temporal change of the control variable itself does not change even when a note image 42 is moved to change the sounding period X2 (sounding time or duration) of a note. To give the moved note the same characteristic as before the change, the user therefore had to correct the temporal change of the control variable to match the change in the sounding period X2. In contrast, in the configuration described above, in which the characteristics of the synthesized sound (volume, timbre, intonation, and so on) are specified according to the operation amount M2 of the operation indicator 44 in the time axis direction, the characteristics are controlled note by note by operating the operation indicator 44 of each note image 42, and the characteristics of a note are maintained even when its note image 42 is moved in the time axis direction or the pitch axis direction. The user therefore does not need to readjust the characteristics of each note, which simplifies the user's operation. The characteristics of each note are likewise maintained when a note image 42 containing either the first phoneme or the second phoneme is moved in the time axis direction or the pitch axis direction.
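The difference between the comparative example and the per-note configuration can be sketched as a data-structure choice. All names are hypothetical, the step-wise control curve is a simplification, and the `characteristic` field stands in for whatever value the operation amount M2 sets (for example, volume).

```python
import bisect
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Note:
    onset: float           # start time in seconds
    duration: float        # length in seconds
    pitch: int             # pitch as a note number (an assumption)
    characteristic: float  # per-note value set via the operation amount M2


def move(note: Note, dt: float = 0.0, dpitch: int = 0) -> Note:
    """Move a note in time and/or pitch; its characteristic travels with it."""
    return replace(note, onset=note.onset + dt, pitch=note.pitch + dpitch)


def curve_at(points: list, t: float) -> float:
    """Comparative example: a step-wise control curve indexed by absolute time.

    points is a list of (time, value) pairs sorted by time.
    """
    times = [p[0] for p in points]
    i = bisect.bisect_right(times, t) - 1
    return points[max(i, 0)][1]


note = Note(onset=1.0, duration=0.5, pitch=60, characteristic=0.9)
moved = move(note, dt=2.0)

# A curve drawn for the note's original position does not follow the note.
curve = [(0.0, 0.3), (1.0, 0.9), (2.0, 0.3)]
before = curve_at(curve, note.onset)   # value under the original position
after = curve_at(curve, moved.onset)   # value under the new position differs
```

In this sketch the moved note keeps `characteristic` unchanged, whereas the time-indexed curve would have to be redrawn by the user to restore the same value.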

(5) As illustrated in FIG. 30, a musical articulation (mezzo staccato) is known in which successive notes are slightly separated while the phrase as a whole is sounded smoothly. To reproduce a mezzo staccato singing voice with a conventional configuration, in which the sounding period X2 of each note is specified only by adjusting the length of the note image 42 in the time axis direction, the user must individually adjust the length of each note image 42 on the time axis so that the note images are arranged at suitable intervals. According to the embodiments described above, the positional relationship (separation/proximity) of the phonemes can be specified independently of the length of the note image 42 on the time axis. The note images 42 can therefore be specified as written in the score, without regard to the gaps between them, and the intervals between phonemes can then be adjusted by adding connection indicators C or operating the operation indicator 44, making it possible to reproduce a subtle expression such as the mezzo staccato illustrated in FIG. 30.

(6) The embodiments described above illustrated a configuration in which the pronunciation characters QA and the phoneme symbols QB are placed inside the note image 42 displayed on the editing screen 40, and the connection indicator C (separation indicator CS or proximity indicator CC) is displayed immediately after the phoneme symbol QB of the first phoneme [n]. A configuration in which the display of the phoneme symbols QB is omitted may also be adopted. In that configuration, the connection indicator C and the index value I corresponding to the user's instruction are displayed, for example, immediately after the pronunciation character QA corresponding to the first phoneme.

(7) The embodiments described above illustrated a configuration in which the index value I is an integer that specifies the degree of separation (proximity) between the first phoneme and the second phoneme on a relative scale from 0 to 100. A configuration with a narrower range of the index value I (for example, 0.0 to 1.0) or a wider range (for example, 0 to 300) may also be adopted. Alternatively, the range of the index value I may be defined as relative values around a reference value of 0 (for example, -2.0 to 2.0).

(8) The embodiments described above illustrated a unit-concatenation speech synthesis process SA4 using speech units P, but any known technique may be used for speech synthesis that applies the synthesis information S generated by the editing process SA2. For example, the singing voice of the synthesized song designated by the synthesis information S may be synthesized using a probabilistic model such as a hidden Markov model (HMM). For example, the speech synthesis unit 28 calculates the temporal transition of pitch (pitch curve) from the pitch X1 and the sounding period X2 of the synthesis information S, generates a basic signal whose pitch changes along that transition (for example, a sine wave signal representing the sound produced by the vocal cords), and generates the speech signal V by applying to the basic signal a filter process (for example, a filter process approximating resonance in the oral cavity) corresponding to the phonetic code X3 designated by the synthesis information S after execution of the editing process SA2.
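The pitch curve, basic signal, and filter steps named above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the method of the embodiments: the piecewise-constant pitch curve, the two-pole resonator used as the filter, and all function names and parameter values are choices made for this sketch, and no probabilistic model is involved.

```python
import math


def pitch_curve(notes: list, sample_rate: int) -> list:
    """Per-sample frequency in Hz from (frequency_hz, duration_s) pairs.

    A piecewise-constant curve is assumed; a real pitch curve would be smoother.
    """
    curve = []
    for freq, dur in notes:
        curve.extend([freq] * int(round(dur * sample_rate)))
    return curve


def basic_signal(curve: list, sample_rate: int) -> list:
    """Sine wave whose pitch follows the curve, built by phase accumulation."""
    phase = 0.0
    out = []
    for f in curve:
        phase += 2 * math.pi * f / sample_rate
        out.append(math.sin(phase))
    return out


def resonator(signal: list, center_hz: float, bandwidth_hz: float,
              sample_rate: int) -> list:
    """Two-pole resonator standing in for a filter that approximates resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius < 1
    a1 = 2 * r * math.cos(2 * math.pi * center_hz / sample_rate)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = (1 - r) * x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out


rate = 8000
curve = pitch_curve([(220.0, 0.01), (440.0, 0.01)], rate)
voiced = resonator(basic_signal(curve, rate), center_hz=700.0,
                   bandwidth_hz=100.0, sample_rate=rate)
```

In a fuller sketch the resonator settings would be chosen per phonetic code X3; here a single hypothetical center frequency is used.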

(9) The embodiments described above illustrated speech units P (diphones) in which two phonemes are connected, but speech units P in which three or more phonemes are connected may also be used. In a speech unit P of three or more phonemes, the first phoneme of the speech unit P corresponds to the phoneme pA of the embodiments, and the last phoneme of the speech unit P corresponds to the phoneme pB of the embodiments.

(10) In the embodiments described above, the storage device 12 that stores the speech unit group L and the synthesis information S is mounted in the speech synthesizer 100, but a configuration may also be adopted in which an external device independent of the speech synthesizer 100 (for example, a server device) stores the speech unit group L or the synthesis information S. The speech synthesizer 100 then acquires the speech unit group L or the synthesis information S via, for example, a communication network and executes the editing process SA2 and the speech synthesis process SA4. As understood from the above description, an element that stores the speech unit group L or the synthesis information S is not an essential element of the speech synthesizer 100.

(11) The embodiments described above illustrated the generation of a speech signal V of the singing voice of a synthesized song, but the present invention can also be applied to generating a speech signal V of speech other than singing (for example, conversational speech). The pitch X1, which is suited to singing voice synthesis, may therefore be omitted from the synthesis information S. As understood from the above description, the synthesis information S illustrated in the above aspects is comprehensively expressed as information designating the pronunciation content of the speech to be synthesized. Because the need to control the presence or absence of sound change individually for each phoneme becomes particularly apparent when synthesizing a singing voice, the present invention is especially well suited to singing voice synthesis.

(12) The embodiments described above illustrated the synthesis of English speech, but the language of the speech to be synthesized is arbitrary. The present invention can also be applied to generating speech in any language, such as Japanese, Spanish, Chinese, or Korean.

DESCRIPTION OF SYMBOLS: 100 ... speech synthesizer, 10 ... arithmetic processing device, 12 ... storage device, 14 ... display device, 16 ... input device, 18 ... sound emitting device, 22 ... instruction reception unit, 24 ... display control unit, 26 ... information management unit, 28 ... speech synthesis unit, 40 ... editing screen, 42 ... note image, 44 ... operation indicator.

Claims

A synthesis information management device for managing synthesis information that designates the pronunciation content of synthesized speech, the device comprising:
instruction receiving means for receiving an instruction from a user;
display control means for displaying, on a display device, a time series of phoneme symbols of a plurality of phonemes corresponding to the pronunciation content designated by the synthesis information, and for displaying, in accordance with an instruction received from the user by the instruction receiving means, a connection indicator between the phoneme symbol of a first phoneme among the plurality of phonemes and the phoneme symbol of a second phoneme immediately following the first phoneme, the connection indicator indicating a positional relationship on the time axis between the first phoneme and the second phoneme; and
information management means for editing the synthesis information so that the first phoneme and the second phoneme have, on the time axis, the positional relationship corresponding to the connection indicator.

The synthesis information management device according to claim 1, wherein the display control means displays, as the connection indicator, a proximity indicator indicating proximity of the first phoneme and the second phoneme between the phoneme symbol of the first phoneme and the phoneme symbol of the second phoneme, in accordance with an instruction received from the user by the instruction receiving means, and
the information management means edits the synthesis information so that the first phoneme and the second phoneme approach each other on the time axis in accordance with the proximity indicator.

The synthesis information management device according to claim 1, wherein the display control means displays, as the connection indicator, a separation indicator indicating separation of the first phoneme and the second phoneme between the phoneme symbol of the first phoneme and the phoneme symbol of the second phoneme, in accordance with an instruction received from the user by the instruction receiving means, and
the information management means edits the synthesis information so that the first phoneme and the second phoneme are separated on the time axis in accordance with the separation indicator.

The synthesis information management device according to claim 2 or 3, wherein the display control means displays, in accordance with an instruction received from the user by the instruction receiving means, an index value between the phoneme symbol of the first phoneme and the phoneme symbol of the second phoneme, the index value indicating the degree of proximity or separation of the first phoneme and the second phoneme on the time axis, and
the information management means updates the synthesis information so that the first phoneme and the second phoneme approach or separate on the time axis in accordance with the index value.

A synthesis information management device for managing synthesis information that designates the pronunciation content of synthesized speech, the device comprising:
display control means for displaying, on a display device, a time series of phoneme symbols of a plurality of phonemes corresponding to the pronunciation content designated by the synthesis information, and for displaying, between the phoneme symbol of a first phoneme among the plurality of phonemes and the phoneme symbol of a second phoneme immediately following the first phoneme, an operation indicator for designating the degree of separation of the first phoneme and the second phoneme on the time axis;
instruction receiving means for receiving an operation of the operation indicator from a user; and
information management means for updating the synthesis information so that the first phoneme and the second phoneme are separated on the time axis in accordance with an operation amount of the operation indicator.
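
The editing behaviour recited in the claims above can be illustrated with a minimal Python sketch. It is not the implementation described in this publication: the names `Connection`, `Phoneme`, and `apply_connection`, the representation of each phoneme as a start and end time in seconds, and the choice to shift every later phoneme together with the second phoneme are all assumptions made for this example. The parameter `amount` stands in for the index value or the operation amount of the operation indicator.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Connection(Enum):
    PROXIMITY = "proximity"    # bring the two phonemes closer together
    SEPARATION = "separation"  # move the two phonemes apart


@dataclass
class Phoneme:
    symbol: str
    start: float  # seconds on the time axis (assumed representation)
    end: float


def apply_connection(phonemes: List[Phoneme], index: int,
                     kind: Connection, amount: float) -> List[Phoneme]:
    """Return an edited copy in which phoneme index+1 (and, by assumption,
    every later phoneme) is shifted relative to phoneme index."""
    if not 0 <= index < len(phonemes) - 1:
        raise IndexError("index must refer to a phoneme that has a successor")
    if amount < 0:
        raise ValueError("amount must be non-negative")

    first, second = phonemes[index], phonemes[index + 1]
    gap = second.start - first.end
    if kind is Connection.SEPARATION:
        shift = amount
    else:
        # Close the gap by at most its current size so the phonemes never overlap.
        shift = -min(amount, max(gap, 0.0))

    edited = phonemes[: index + 1]  # the slice is a new list; the input is not mutated
    edited += [Phoneme(p.symbol, p.start + shift, p.end + shift)
               for p in phonemes[index + 1:]]
    return edited


# Hypothetical time series: a 0.25 s gap separates "s" from "t".
series = [Phoneme("s", 0.0, 0.25), Phoneme("t", 0.5, 0.75), Phoneme("A", 0.75, 1.0)]
apart = apply_connection(series, 0, Connection.SEPARATION, 0.25)
closer = apply_connection(series, 0, Connection.PROXIMITY, 0.5)
```

Shifting the later phonemes together with the second phoneme keeps their mutual positions unchanged, so a single connection indicator affects only the boundary it is displayed on. Clamping the proximity shift to the existing gap is likewise a simplification chosen for this sketch.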