JP2015049253A

JP2015049253A - Voice synthesizing management device

Info

Publication number: JP2015049253A
Application number: JP2013178514A
Authority: JP
Inventors: Tatsuya Iriyama; Makoto Tachibana
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2013-08-29
Filing date: 2013-08-29
Publication date: 2015-03-16
Anticipated expiration: 2033-08-29
Also published as: JP6152753B2

Abstract

PROBLEM TO BE SOLVED: To reduce the work burden on a user who sets variables applied to speech synthesis. SOLUTION: A display control part causes a display device to display an adjustment image 60 containing an adjustment region 62 in which are set a first axis A1, representing the mixing ratio between a first voice and a second voice used when synthesizing the voice to be synthesized, and a second axis A2, representing the volume of the voice to be synthesized. An instruction accepting part accepts, from a user, designation of an instruction point X1 and an instruction point X2 in the adjustment region 62. An information management part generates control information representing a temporal change in the mixing ratio from the value corresponding to the instruction point X1 on the first axis A1 to the value corresponding to the instruction point X2, and a temporal change in the volume from the value corresponding to the instruction point X1 on the second axis A2 to the value corresponding to the instruction point X2.

Description

The present invention relates to a technique for managing variables applied to speech synthesis.

Unit-concatenation speech synthesis techniques, which synthesize a desired voice using a set of speech units collected from pre-recorded speech (hereinafter a "speech library"), have been proposed. For example, Patent Document 1 discloses a technique that mixes (morphs) speech units from two speech libraries corresponding to voices of different voice qualities, thereby synthesizing a voice whose quality differs from that of either existing speech library.

Patent Document 1: Japanese Patent Laid-Open No. 9-50295

The volume of each speech unit may differ from one speech library to another. The volume of a synthesized voice generated by mixing multiple speech units therefore varies with the mixing ratio of those units. For example, if the proportion of speech units from a library that tends to be loud is decreased over time while the proportion of speech units from a library that tends to be quiet is increased over time, the volume of the synthesized voice decreases over time. To keep the volume of the synthesized voice constant, the user must adjust the temporal change in volume so that it tracks the temporal change in the mixing ratio, which imposes a heavy work burden on the user.

The description above focuses on volume for convenience, but the same applies to acoustic characteristics other than volume. For example, if the proportion of speech units from a library whose pitch tends to be perceived as high (a clear, bright-sounding voice) is decreased over time while the proportion of speech units from a library whose pitch tends to be perceived as low (for example, an indistinct, dark-sounding voice) is increased over time, the pitch perceived by the listener falls over time. To keep the perceived pitch of the synthesized voice constant, the user must adjust the temporal change in pitch so that it tracks the temporal change in the mixing ratio. In view of these circumstances, an object of the present invention is to reduce the work burden on a user who sets variables applied to speech synthesis.

To solve the above problem, a speech synthesis management device according to a first aspect of the present invention comprises: display control means for causing a display device to display an adjustment image including an adjustment region in which are set a first axis, indicating the mixing ratio between a first voice and a second voice used when synthesizing a synthesis target voice, and a second axis, indicating a first characteristic variable relating to an acoustic characteristic of the synthesis target voice; instruction receiving means for receiving, from a user, designation of a first instruction point and a second instruction point in the adjustment region; and information management means for generating control information indicating a temporal change in the mixing ratio from the value corresponding to the first instruction point on the first axis to the value corresponding to the second instruction point, and a temporal change in the first characteristic variable from the value corresponding to the first instruction point on the second axis to the value corresponding to the second instruction point. In this configuration, the user designates the first and second instruction points within a single adjustment region that carries both the mixing-ratio axis and the first-characteristic-variable axis, and control information describing the temporal change of both quantities between the two points is generated. The user can therefore specify the temporal change in the mixing ratio and the temporal change in the first characteristic variable in parallel. Compared with a conventional configuration in which the two temporal changes must be specified individually, this has the advantage of reducing the user's work burden.

In a preferred aspect of the present invention, the information management means generates, for the period from a first time point to a second time point on the time axis of the synthesis target voice, control information indicating the temporal change in the mixing ratio from the value corresponding to the first instruction point on the first axis to the value corresponding to the second instruction point, and the temporal change in the first characteristic variable from the value corresponding to the first instruction point on the second axis to the value corresponding to the second instruction point. In this aspect, the temporal changes in the mixing ratio and the first characteristic variable can be adjusted for a specific period of the synthesis target voice, namely the period from the first time point to the second time point. Furthermore, in a configuration in which the instruction receiving means receives designation of the first and second time points from the user, the temporal changes in the mixing ratio and the first characteristic variable can be adjusted for whatever period of the synthesis target voice the user desires.
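The generation of control information over the period from the first time point to the second time point can be illustrated with a minimal sketch. It assumes a straight-line path between the two instruction points and clamps times outside the period to the nearest end point; both choices, and the names `InstructionPoint` and `control_value`, are illustrative assumptions rather than part of the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionPoint:
    """A point in the adjustment region (hypothetical name)."""
    ratio: float   # position on the first axis (mixing ratio)
    volume: float  # position on the second axis (first characteristic variable)


def control_value(x1, x2, t1, t2, t):
    """Return (ratio, volume) at time t for a path from x1 (at t1) to x2 (at t2).

    Assumption: the path is a straight line and progress along it is linear in time.
    """
    if t2 <= t1:
        raise ValueError("t2 must be later than t1")
    # Assumption: times outside [t1, t2] take the value of the nearest end point.
    u = min(max((t - t1) / (t2 - t1), 0.0), 1.0)
    ratio = x1.ratio + u * (x2.ratio - x1.ratio)
    volume = x1.volume + u * (x2.volume - x1.volume)
    return ratio, volume
```

With a single pair of points, both variables are obtained from one call, which mirrors the idea that the user specifies the two temporal changes in parallel.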

In a preferred aspect of the present invention, the display control means causes the display device to display an adjustment image in which each point in the adjustment region is set to a different display mode, and a variable image including a transition image whose display mode changes, from the first time point to the second time point on the time axis, from the display mode at the first instruction point in the adjustment region to the display mode at the second instruction point. Because this transition image is shown on the display device, the user can grasp visually and intuitively how the mixing ratio and the first characteristic variable change from the first time point to the second time point. Furthermore, in a configuration in which the display control means displays a variable image that expresses, by the shape of the transition image, the temporal change in a second characteristic variable relating to an acoustic characteristic of the synthesis target voice, the displayed content can be simplified compared with a configuration that displays the temporal change in the second characteristic variable separately from the transition image.
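One way to picture the transition image is as a strip of colors sampled along the path between the two instruction points. The sketch below is only an assumption about what a "display mode" might be: it blends two hypothetical base colors according to the first-axis value and scales brightness by the second-axis value. The text does not specify this color scheme, and the function names are illustrative.

```python
def point_color(ratio, level, color_a=(255, 0, 0), color_b=(0, 0, 255)):
    """RGB color of a point in the adjustment region (assumed scheme).

    ratio: first-axis value in [0, 1]; level: second-axis value in [0, 1].
    """
    return tuple(
        round(level * ((1.0 - ratio) * a + ratio * b))
        for a, b in zip(color_a, color_b)
    )


def transition_colors(x1, x2, steps):
    """Colors along a straight path from x1 to x2, each given as (ratio, level)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    colors = []
    for i in range(steps):
        u = i / (steps - 1)  # 0.0 at the first time point, 1.0 at the second
        ratio = x1[0] + u * (x2[0] - x1[0])
        level = x1[1] + u * (x2[1] - x1[1])
        colors.append(point_color(ratio, level))
    return colors
```

Drawing these colors left to right between the first and second time points would give a gradient whose ends match the colors under the two instruction points in the adjustment region.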

In a preferred aspect of the present invention, the display control means causes the display device to display an adjustment image in which each point in the adjustment region is set to a different display mode, and a score image in which note images, each representing a note of the synthesis target voice, are arranged in a score region in which a time axis and a pitch axis are set. The display control means sets the display mode at each point on the time axis within each note image to the display mode at the corresponding point on the path from the first instruction point to the second instruction point in the adjustment region. In this aspect, the note images that represent the notes also serve to display the temporal changes in the mixing ratio and the first characteristic variable, so the user can easily grasp the relationship between those temporal changes and each note. A specific example of this aspect is described later as, for example, the third embodiment.

In a preferred aspect of the present invention, the display control means causes the display device to display an adjustment image in which each point in the adjustment region is set to a different display mode, and a score image in which note images, each representing a note of the synthesis target voice, and auxiliary images corresponding to the respective note images are arranged in a score region in which a time axis and a pitch axis are set. The display control means sets the display mode at each point on the time axis within each auxiliary image to the display mode at the corresponding point on the path from the first instruction point to the second instruction point in the adjustment region. In this aspect, the auxiliary image corresponding to each note image is used to display the temporal changes in the mixing ratio and the first characteristic variable, so the user can easily grasp the relationship between those temporal changes and each note. A specific example of this aspect is described later as, for example, the fourth embodiment.

A speech synthesis management device according to a second aspect of the present invention comprises: display control means for causing a display device to display an adjustment image including an adjustment region in which a reference point is set for each of N types of voices (N being a natural number of 3 or more) used to synthesize a synthesis target voice; instruction receiving means for receiving, from a user, designation of a first instruction point and a second instruction point in the adjustment region; and information management means for generating control information indicating, for the mixing ratio of the N types of voices used when synthesizing the synthesis target voice, a temporal change from the value corresponding to the first instruction point in the adjustment region to the value corresponding to the second instruction point. In this configuration, the user designates the first and second instruction points within an adjustment region in which a reference point is set for each of the N types of voices, and control information describing the temporal change in the mixing ratio of the N types of voices between the two points is generated. The user can therefore specify the temporal change in the mixing ratio while visually confirming, in the adjustment region, how the voices relate to one another. Compared with a configuration in which the mixing ratio of each of the N types of voices must be specified individually, this has the advantage of reducing the user's work burden.
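How a position in the adjustment region maps to a mixing ratio for N voices is not stated in this passage. Purely as an illustration, the sketch below assumes normalized inverse-distance weighting: the closer an instruction point lies to a voice's reference point, the larger that voice's share, and the shares sum to 1. The weighting rule and the name `mixing_ratios` are assumptions, not the disclosed method.

```python
import math


def mixing_ratios(point, reference_points, eps=1e-9):
    """Return one mixing ratio per reference point for a 2-D instruction point.

    Assumption: ratios are inverse-distance weights normalized to sum to 1.
    """
    distances = [math.dist(point, ref) for ref in reference_points]
    # A point lying on a reference point selects that voice exclusively.
    for i, d in enumerate(distances):
        if d < eps:
            return [1.0 if j == i else 0.0 for j in range(len(distances))]
    weights = [1.0 / d for d in distances]
    total = sum(weights)
    return [w / total for w in weights]
```

Evaluating this function at the first and second instruction points gives the start and end values between which the N-way mixing ratio would change over time.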

The speech synthesis management device according to each of the above aspects may be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to tasks such as generating the control information, or by cooperation between a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) and a program. The program of the present invention may be provided stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known type of recording medium, such as a semiconductor recording medium or a magnetic recording medium, may be used. The program of the present invention may also be provided by distribution over a communication network and installed on a computer. The present invention is also specified as a method of operating the speech synthesis management device according to each of the aspects described above (a speech synthesis management method).

FIG. 1 is a block diagram of a speech synthesizer according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram of the mixing process.
FIG. 3 is a schematic diagram of synthesis information.
FIG. 4 is a flowchart of the speech synthesis process.
FIG. 5 is a schematic diagram of an edit image.
FIG. 6 is a flowchart of the operation of the speech synthesizer.
FIG. 7 is a flowchart of the editing process.
FIG. 8 is a schematic diagram of an edit image.
FIG. 9 is a schematic diagram showing the relationship between the adjustment region and each instruction point.
FIG. 10 is a schematic diagram showing the relationship between the adjustment region and each instruction point.
FIG. 11 is a schematic diagram of an edit image.
FIG. 12 is a schematic diagram of an edit image in a second embodiment.
FIG. 13 is a schematic diagram of an edit image in a third embodiment.
FIG. 14 is a schematic diagram of an edit image in a fourth embodiment.
FIG. 15 is an explanatory diagram of the mixing process in a fifth embodiment.
FIG. 16 is a schematic diagram of an edit image in the fifth embodiment.
FIG. 17 is an explanatory diagram of the mixing ratio in the fifth embodiment.
FIG. 18 is a schematic diagram of an adjustment image in a sixth embodiment.
FIG. 19 is a schematic diagram of a modification of the path between instruction points.
FIG. 20 is a schematic diagram of a modification of the path between instruction points.
FIG. 21 is a schematic diagram of a modification of the adjustment region.
FIG. 22 is a schematic diagram of the adjustment region in a modification of the fifth embodiment.
FIG. 23 is a schematic diagram of the adjustment region in a modification of the fifth embodiment.

<First Embodiment>
FIG. 1 is a block diagram of a speech synthesizer 100 according to the first embodiment of the present invention. The speech synthesizer 100 generates a speech signal S of an arbitrary voice by unit-concatenation speech synthesis, which joins a plurality of speech units. Specifically, the speech synthesizer 100 of the first embodiment is a signal processing device that generates a speech signal S of the singing voice of an arbitrary song (hereinafter the "synthesized song"). It is realized by a computer system (for example, an information processing device such as a mobile phone or a personal computer) comprising an arithmetic processing device 10, a storage device 12, a display device 14, an input device 16, and a sound emitting device 18.

The display device 14 (for example, a liquid crystal display panel) displays images as instructed by the arithmetic processing device 10. The input device 16 is an operating device (for example, a keyboard or a pointing device such as a mouse) that the user operates to give various instructions to the speech synthesizer 100, and includes, for example, a plurality of controls operated by the user. A touch panel formed integrally with the display device 14 may also be used as the input device 16. The sound emitting device 18 (for example, a speaker or headphones) reproduces sound corresponding to the speech signal S. The D/A converter that converts the speech signal S from digital to analog is omitted from the figure for convenience.

The storage device 12 stores the program executed by the arithmetic processing device 10 and the various data it uses. Any known recording medium, such as a semiconductor recording medium or a magnetic recording medium, or a combination of several types of recording media, may be used as the storage device 12. The storage device 12 of the first embodiment stores speech libraries L (LA, LB) and synthesis information Q.

A speech library L is a set of speech units P (PA, PB) collected in advance from the voice of a specific speaker. Each speech unit P is either a phoneme (for example, a vowel or a consonant), which is the smallest unit that distinguishes linguistic meaning, or a phoneme chain (for example, a diphone or a triphone) in which multiple phonemes are joined. Each speech unit P is represented as a sequence of samples of the speech waveform in the time domain, or as a time series of frequency-domain spectra computed for each frame of the speech waveform.

As illustrated in FIG. 1, the storage device 12 of the first embodiment stores a plurality of speech libraries L (LA, LB). The speech library LA is a set of speech units PA extracted from a first voice, and the speech library LB is a set of speech units PB extracted from a second voice. The first voice and the second voice differ in voice quality (timbre). Specifically, the first voice (each speech unit PA) and the second voice (each speech unit PB) are voices uttered by different speakers, or voices uttered by a single speaker using different voice qualities.

In the first embodiment, as illustrated in FIG. 2, speech units P (PA, PB) corresponding to the phonetic content of the voice to be synthesized (hereinafter the "synthesis target voice") are selected sequentially from both the speech library LA and the speech library LB, and a speech unit PS is generated by mixing the speech unit PA selected from the speech library LA with the speech unit PB selected from the speech library LB at a mixing ratio R (hereinafter the "mixing process"). The mixing process (morphing) computes a variable pS relating to the voice quality of the speech unit PS as a weighted sum, according to the mixing ratio R, of a variable pA relating to the voice quality of the speech unit PA and a variable pB relating to the voice quality of the speech unit PB, as expressed, for example, by formula (a) below. An example of a variable relating to voice quality is a feature quantity that defines the envelope of the speech spectrum.
pS = (1 - R) · pA + R · pB ……(a)

The mixing ratio R corresponds to the dominance of each speech unit P (the degree to which it is reflected in the mixed speech unit PS) in the mixing process of a speech unit PA in the speech library LA and a speech unit PB in the speech library LB. Specifically, when the mixing ratio R is at its minimum (for example, 0), the mixed speech unit PS matches the speech unit PA; the larger the mixing ratio R, the smaller the influence of the speech unit PA on the speech unit PS; and when the mixing ratio R is at its maximum (for example, 1), the mixed speech unit PS matches the speech unit PB. As this description shows, the voice quality of the speech unit PS can be set, according to the mixing ratio R, to a quality intermediate between those of the speech unit PA and the speech unit PB. The speech signal S is generated by joining, along the time axis, the speech units PS generated sequentially by the mixing process. Any known technique other than formula (a) may also be used for mixing the speech unit PA and the speech unit PB.
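Formula (a) and the end-point behavior of the mixing ratio R can be checked with a short sketch. It assumes that the voice-quality variables pA and pB are available as equal-length lists of numbers (for example, sampled spectral-envelope values); that representation and the name `mix_segments` are illustrative assumptions.

```python
def mix_segments(p_a, p_b, ratio):
    """Apply formula (a) element-wise: pS = (1 - R) * pA + R * pB."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    if len(p_a) != len(p_b):
        raise ValueError("p_a and p_b must have the same length")
    return [(1.0 - ratio) * a + ratio * b for a, b in zip(p_a, p_b)]
```

With `ratio` equal to 0 the result equals `p_a`, with `ratio` equal to 1 it equals `p_b`, and intermediate values give an intermediate voice quality, as described above.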

The synthesis information Q in FIG. 1 specifies the synthesis target voice. As illustrated in FIG. 3, the synthesis information Q of the first embodiment includes song information QM and control information QC. The song information QM is time-series data specifying the content of the synthesized song; for each note of the synthesized song it specifies a pitch q1, a sounding period q2, and a speech code q3. The pitch q1 is, for example, a note number conforming to the MIDI (Musical Instrument Digital Interface) standard. The sounding period q2 is the period over which the note sounds, defined by, for example, its start time and duration (or its end time). The speech code q3 specifies the phonetic content of the synthesis target voice (that is, the lyrics of the synthesized song). For example, the characters (graphemes) that make up the lyrics of the synthesized song, or the phoneme symbols of the phonemes corresponding to each character, are specified as the speech code q3.

The control information QC in FIG. 3 specifies the temporal change of variables applied to the speech synthesis process. The control information QC of the first embodiment specifies the temporal change in the mixing ratio R of the speech units PA and PB and the temporal change in a first characteristic variable. The first characteristic variable is a variable (feature quantity) relating to an acoustic characteristic of the synthesis target voice. In the first embodiment, the volume V is used as an example of the first characteristic variable. The control information QC of the first embodiment specifies the temporal change in the mixing ratio R and the volume V within a specific period (hereinafter the "control period") of the synthesis target voice (synthesized song).
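The structure of the synthesis information Q can be sketched as a small set of data classes. The field names, the use of seconds for times, and the representation of the control information as a start and end pair of (R, V) values over a control period are assumptions made for illustration; the text only states what each item specifies.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Note:
    """One note of the song information QM."""
    pitch: int       # q1: MIDI note number
    start: float     # q2: start time of sounding (seconds, assumed unit)
    duration: float  # q2: duration of sounding (seconds, assumed unit)
    code: str        # q3: lyric character or phoneme symbols

    @property
    def end(self) -> float:
        """End time of sounding, derived from start and duration."""
        return self.start + self.duration


@dataclass
class ControlInfo:
    """Control information QC for one control period (assumed layout)."""
    period: Tuple[float, float]       # (start, end) of the control period
    start_value: Tuple[float, float]  # (mixing ratio R, volume V) at the start
    end_value: Tuple[float, float]    # (mixing ratio R, volume V) at the end


@dataclass
class SynthesisInfo:
    """Synthesis information Q = song information QM + control information QC."""
    notes: List[Note] = field(default_factory=list)
    controls: List[ControlInfo] = field(default_factory=list)
```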

The arithmetic processing device 10 (CPU) in FIG. 1 executes the program stored in the storage device 12 and thereby realizes a plurality of functions (an instruction receiving unit 22, a display control unit 24, an information management unit 26, and a speech synthesis unit 28) for editing the synthesis information Q and generating the speech signal S. A configuration in which the functions of the arithmetic processing device 10 are distributed across multiple devices, or in which a dedicated electronic circuit (for example, a DSP) realizes some of the functions of the arithmetic processing device 10, may also be adopted.

The speech synthesis unit 28 generates the speech signal S by a speech synthesis process that uses the speech libraries L and the synthesis information Q stored in the storage device 12. FIG. 4 is a flowchart of the speech synthesis process. On starting the process, the speech synthesis unit 28 sequentially selects, from both the speech library LA and the speech library LB, the speech units P (PA, PB) corresponding to the speech code q3 that the song information QM of the synthesis information Q specifies for each note (SA1).

The speech synthesis unit 28 generates a speech unit PS by executing the mixing process on the speech unit PA selected from the speech library LA and the speech unit PB selected from the speech library LB, applying the mixing ratio R that the control information QC of the synthesis information Q specifies for the current time (SA2). The speech synthesis unit 28 also adjusts the volume of the mixed speech unit PS to the volume V that the control information QC specifies for the current time (SA3). The speech synthesis unit 28 then adjusts each speech unit PS produced by the mixing process (SA2) and the volume adjustment (SA3) to the pitch q1 and the sounding period q2 specified by the song information QM of the synthesis information Q (SA4), and joins the adjusted speech units PS to generate the speech signal S (SA5).
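Steps SA1 to SA5 can be laid out as a toy pipeline. This is a sketch under several assumptions: each library is a dictionary from speech code to a list of numbers, the volume adjustment is a simple gain, the control values are sampled at each note's start time, and the pitch and duration conversion of SA4 is reduced to recording the target values, since the actual conversion method is not described here. All names are hypothetical.

```python
def synthesize(notes, library_a, library_b, control_at):
    """Toy version of SA1-SA5.

    notes:      iterable of (pitch, start, duration, code) tuples
    library_a:  dict mapping a speech code to the unit PA (list of floats)
    library_b:  dict mapping a speech code to the unit PB (list of floats)
    control_at: function mapping a time to (mixing ratio R, volume V)
    """
    signal = []
    for pitch, start, duration, code in notes:
        # SA1: select the units for this note from both libraries.
        unit_a = library_a[code]
        unit_b = library_b[code]
        # Assumption: the control values are taken at the note's start time.
        ratio, volume = control_at(start)
        # SA2: mixing process with formula (a).
        mixed = [(1.0 - ratio) * a + ratio * b for a, b in zip(unit_a, unit_b)]
        # SA3: volume adjustment, modeled here as a plain gain.
        scaled = [volume * x for x in mixed]
        # SA4: placeholder; a real system would convert pitch and duration here.
        adjusted = {"pitch": pitch, "start": start,
                    "duration": duration, "data": scaled}
        # SA5: join the adjusted units in time order.
        signal.append(adjusted)
    return signal
```

For example, with `library_a = {"a": [1.0, 1.0]}`, `library_b = {"a": [3.0, 3.0]}` and a constant control of `(0.5, 0.5)`, a single note with code `"a"` yields data `[1.0, 1.0]`.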

The display control unit 24 of FIG. 1 causes the display device 14 to display various images. The display control unit 24 of the first embodiment causes the display device 14 to display the edited image 30 of FIG. 5, with which the user checks and edits the content of the synthesized music designated by the synthesis information Q. As illustrated in FIG. 5, the edited image 30 includes a score image 40 and a variable image 50.

The score image 40 includes a piano-roll coordinate plane (hereinafter the "score area") 42 in which a time axis (horizontal axis) and a pitch axis (vertical axis) intersecting each other are set, and represents the content of the synthesized music designated by the music information QM of the synthesis information Q. Specifically, the display control unit 24 places a note image 44 representing each note of the synthesized music in the score area 42. The position of a note image 44 along the pitch axis is set according to the pitch q1 designated by the music information QM, and its position and display length along the time axis are set according to the sounding period q2 designated by the music information QM. The voice code q3 designated by the music information QM is attached to each note image 44. FIG. 5 illustrates a case in which the characters (lyrics of the synthesized music) and phoneme symbols designated by the voice code q3 are placed inside the note image 44.

The variable image 50 includes an area (hereinafter the "variable area") 52 in which a time axis (horizontal axis) is set, and represents the temporal change of the mixing ratio R applied to the mixing process (SA2) and the temporal change of the volume V applied to the volume adjustment (SA3). The time axis of the variable area 52 is shared with the time axis of the score area 42. The specific content of the variable image 50 is described later.

The instruction receiving unit 22 receives instructions from the user according to operations on the input device 16. For example, the user can instruct editing of the synthesis information Q by operating the input device 16 as appropriate while checking the edited image 30. The information management unit 26 manages the synthesis information Q stored in the storage device 12. Specifically, the information management unit 26 updates the synthesis information Q (music information QM, control information QC) according to the editing instruction that the instruction receiving unit 22 received from the user.

FIG. 6 is a flowchart of the general operation of the speech synthesizer 100 of the first embodiment. The process of FIG. 6 starts when triggered by an instruction from the user to the input device 16. When the process starts, the display control unit 24 causes the display device 14 to display the edited image 30 of FIG. 5 corresponding to the synthesis information Q stored in the storage device 12 (SB1). The instruction receiving unit 22 then determines whether an instruction to edit the synthesis information Q has been received from the user (SB2).

When the instruction receiving unit 22 has received an instruction to edit the synthesis information Q (SB2: YES), an editing process is executed that includes updating of the edited image 30 by the display control unit 24 and updating of the synthesis information Q by the information management unit 26 (SB3). For example, when the user instructs addition of a note, the display control unit 24 adds a note image 44 at the position designated by the user in the score area 42, and the information management unit 26 adds the information of the designated note (q1 to q3) to the music information QM of the synthesis information Q. When the user instructs movement of an existing note image 44 or its expansion or contraction along the time axis, the display control unit 24 changes the position or display length of the note image 44 according to the instruction, and the information management unit 26 changes the pitch q1 or sounding period q2 of the edited note in the music information QM accordingly. When the user instructs a change of the voice code q3 of a note, the display control unit 24 changes the display of the voice code q3 of that note, and the information management unit 26 changes the voice code q3 of that note in the music information QM, each according to the instruction. When editing of the synthesis information Q has not been instructed (SB2: NO), the editing process is not executed.

When the above processing is complete, the instruction receiving unit 22 determines whether an instruction for speech synthesis (generation of the audio signal S) has been received from the user (SB4). When it has (SB4: YES), the speech synthesis unit 28 generates the audio signal S by executing the speech synthesis process of FIG. 4 using the speech libraries L (LA, LB) and the synthesis information Q (SB5). When speech synthesis has not been instructed (SB4: NO), the speech synthesis process is not executed. The instruction receiving unit 22 also determines whether an instruction to end the process has been received from the user (SB6). When ending has not been instructed (SB6: NO), the process returns to step SB1 and the subsequent steps are repeated; when ending has been instructed (SB6: YES), the process of FIG. 6 ends.

By operating the input device 16 as appropriate, the user can instruct editing of the temporal change of the mixing ratio R and the volume V. FIG. 7 is a flowchart of the editing process (SB3) that the arithmetic processing device 10 executes when the instruction receiving unit 22 has received an instruction to edit the temporal change of the mixing ratio R and the volume V (SB2: YES).

When editing of the mixing ratio R and the volume V is instructed, the display control unit 24 causes the display device 14 to display the adjustment image 60, as illustrated in FIG. 8 (SC1). The adjustment image 60 is an image with which the user edits the temporal change of the mixing ratio R and the temporal change of the volume V within a control period of the synthesis target speech (synthesized music).

The adjustment image 60 includes an adjustment region 62 in which a first axis A1 (horizontal axis) and a second axis A2 (vertical axis) intersecting each other are set. The first axis A1 is a coordinate axis indicating the value of the mixing ratio R between the speech unit PA (first speech) and the speech unit PB (second speech), and the second axis A2 is a coordinate axis indicating the value of the volume V of the synthesis target speech (the mixed speech unit PS). "Voice color A", displayed at the negative end (left end) of the first axis A1, denotes the voice quality of the first speech, and "voice color B", displayed at the positive end (right end) of the first axis A1, denotes the voice quality of the second speech.

Each point in the adjustment region 62 is displayed in a display mode (a visually distinguishable image property such as hue, saturation, or lightness) that differs according to its position. The actual adjustment region 62 is a color image containing many colors, but because color images cannot be used in patent drawings, FIG. 8 for convenience represents hue within the adjustment region 62 by shades of gray (grayscale) and represents lightness (gradation) within the adjustment region 62 by the density of halftone dots. Specifically, the distribution of hue from blue to red is represented by a distribution of gray levels from low to high, and the distribution of lightness from dark to light is represented by halftone-dot density from high (dense) to low (sparse). That is, as can be understood from FIG. 8, the adjustment region 62 of the first embodiment is an image that changes continuously from red (gray level: high) to blue (gray level: low) from the negative (left) end to the positive (right) end of the first axis A1, and changes continuously from low gradation (halftone dots: dense) to high gradation (halftone dots: sparse) from the negative side to the positive side of the second axis A2.

While checking the score image 40 of FIG. 8, the user can freely designate a plurality of time points T (T1, T2) on the time axis of the variable area 52 by operating the input device 16 as appropriate. The instruction receiving unit 22 receives from the user the designation of the plurality of time points T (T1, T2) on the time axis (SC2). The time point T1 corresponds to the start of the control period in which the mixing ratio R and the volume V of the synthesis target speech change, and the time point T2 corresponds to the end of the control period. As illustrated in FIG. 8, the display control unit 24 displays in the variable area 52 each time point T that the instruction receiving unit 22 received from the user (SC3). While checking the time series of note images 44 in the score image 40 as needed, the user designates the time points T1 and T2 so that the portion of the synthesized music in which the mixing ratio R and the volume V should change falls within the control period. The user can also move each time point T along the time axis by operating the input device 16.

While checking the adjustment image 60 of FIG. 8, the user can designate, within the adjustment region 62, a plurality of points (hereinafter "instruction points") X corresponding to the respective time points T in the variable area 52 by operating the input device 16 as appropriate. The instruction receiving unit 22 sequentially receives from the user the designation of the plurality of instruction points X (X1, X2) in the adjustment region 62 (SC4). As illustrated in FIG. 8, the display control unit 24 displays in the adjustment region 62 each instruction point X whose designation was received and a path C connecting two instruction points X designated one after the other (SC5). The path C of the first embodiment is a straight line connecting the two instruction points X. Display of the path C may be omitted.

One instruction point X is a coordinate point corresponding to a pair of values of the mixing ratio R and the volume V. That is, the position of the instruction point X along the first axis A1 corresponds to the value of the mixing ratio R, and its position along the second axis A2 corresponds to the value of the volume V. The value of the mixing ratio R increases as the instruction point X approaches the positive end (voice color B) of the first axis A1, and the value of the volume V increases as the instruction point X approaches the positive end of the second axis A2.
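
As a small illustration of this correspondence, the sketch below converts a position inside a rectangular adjustment region into a pair (R, V). The pixel coordinate convention (origin at the top-left corner, y growing downward) and the normalization of both variables to the range 0 to 1 are assumptions made for the example; the specification does not fix either. The function name `point_to_values` is hypothetical.

```python
from typing import Tuple


def point_to_values(x: float, y: float, width: float, height: float) -> Tuple[float, float]:
    """Map a position in the adjustment region to (mixing ratio R, volume V).

    Assumed convention: x = 0 is the voice color A end of axis A1 and
    x = width is the voice color B end; y = 0 is the top of the region,
    so the positive end of axis A2 is at the top.
    """
    if width <= 0 or height <= 0:
        raise ValueError("region must have positive size")
    ratio = min(max(x / width, 0.0), 1.0)           # grows toward voice color B
    volume = min(max(1.0 - y / height, 0.0), 1.0)   # grows toward the top edge
    return ratio, volume
```

With a 400 by 200 region, for example, the point (100, 50) gives R = 0.25 and V = 0.75 under these assumptions.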

FIG. 9 is a schematic diagram of the adjustment region 62 in which a plurality of instruction points X (X1, X2) have been designated. For convenience, FIG. 9 omits the variation of the display mode within the adjustment region 62. The instruction point X1 designated by the user corresponds to the values of the mixing ratio R and the volume V at the time point T1 (the start of the control period). That is, as illustrated in FIG. 9, the value r1 corresponding to the instruction point X1 on the first axis A1 is the value of the mixing ratio R at the time point T1, and the value v1 corresponding to the instruction point X1 on the second axis A2 is the value of the volume V at the time point T1. The instruction point X2, on the other hand, corresponds to the values of the mixing ratio R and the volume V at the time point T2 (the end of the control period). That is, as illustrated in FIG. 9, the value r2 corresponding to the instruction point X2 on the first axis A1 is the value of the mixing ratio R at the time point T2, and the value v2 corresponding to the instruction point X2 on the second axis A2 is the value of the volume V at the time point T2. As understood from the above, the instruction points X1 and X2 together represent a temporal change of the mixing ratio R that transitions continuously from the value r1 to the value r2 between the time points T1 and T2, and a temporal change of the volume V that transitions continuously from the value v1 to the value v2 between the time points T1 and T2.

By operating the input device 16 as appropriate, the user can select any instruction point X in the adjustment region 62 (hereinafter the "selected instruction point X") and instruct playback of the speech corresponding to it. When the instruction receiving unit 22 receives the selection, the speech synthesis unit 28 generates an audio signal S by a speech synthesis process that applies the mixing ratio R and the volume V corresponding to the selected instruction point X. Specifically, the speech synthesis unit 28 selects, from both the speech library LA and the speech library LB, the speech units P (PA, PB) corresponding to specific utterance content (for example, characters chosen in advance independently of the voice code q3 designated by the synthesis information Q) (SA1). It then executes the mixing process (SA2) applying the value of the mixing ratio R corresponding to the selected instruction point X and the volume adjustment (SA3) applying the value of the volume V corresponding to the selected instruction point X, thereby generating an audio signal S of a predetermined pitch and sounding period (SA4, SA5), which is played back from the sound emitting device 18. That is, while actually listening to synthesized speech to which the mixing ratio R and the volume V of each instruction point X are applied, the user can adjust the position of each instruction point X in the adjustment region 62 so that the desired synthesized speech is generated. For example, the volume V can be adjusted according to the mixing ratio R so as to suppress the change in perceived loudness of the synthesized speech (a change linked to the temporal change of the mixing ratio R) that arises from the difference in recording volume between the speech unit PA and the speech unit PB.

As illustrated in FIG. 8, the display control unit 24 places in the variable area 52 of the variable image 50 a transition image 54 corresponding to the instruction points X designated in the adjustment region 62 (SC6). The transition image 54 is a band-shaped image extending along the time axis and represents the temporal change of the mixing ratio R and the volume V from the time point T1 to the time point T2. The display control unit 24 of the first embodiment generates the transition image 54 so that its display mode at each time point changes continuously, from the time point T1 to the time point T2 on the time axis of the variable area 52, from the display mode at the instruction point X1 in the adjustment region 62 to the display mode at the instruction point X2. That is, the display mode of the transition image 54 at the time point T1 matches the display mode at the instruction point X1 in the adjustment region 62, and its display mode at the time point T2 matches the display mode at the instruction point X2. The display mode of the transition image 54 at any time point t between T1 and T2 matches the display mode at the point on the path C from X1 to X2 in the adjustment region 62 that corresponds to that time point t. By checking the variable image 50, the user can therefore visually grasp the temporal change of the mixing ratio R and the volume V from the time point T1 to the time point T2.
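
One way to picture how the transition image 54 is derived from the adjustment region 62 is the sketch below, which samples the straight path from X1 to X2 and looks up a color for each sample. It is a sketch under assumptions: R and V are normalized to 0 to 1, hue runs from red at the voice color A end to blue at the voice color B end, lightness grows with V over an arbitrarily chosen range of 0.2 to 0.8, and the band is rendered as one color per column. The names `display_color` and `transition_image` are hypothetical.

```python
import colorsys
from typing import List, Tuple

Color = Tuple[float, float, float]


def display_color(ratio: float, volume: float) -> Color:
    """Assumed display mode of the adjustment region at (R, V), as RGB in 0..1."""
    hue = (2.0 / 3.0) * ratio        # 0 is red (voice color A), 2/3 is blue (voice color B)
    lightness = 0.2 + 0.6 * volume   # darker for low V, lighter for high V
    return colorsys.hls_to_rgb(hue, lightness, 1.0)


def transition_image(x1: Tuple[float, float], x2: Tuple[float, float], columns: int) -> List[Color]:
    """Colors of the band from T1 (first column) to T2 (last column)."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    colors = []
    for i in range(columns):
        u = i / (columns - 1) if columns > 1 else 0.0  # position along path C
        ratio = x1[0] + u * (x2[0] - x1[0])
        volume = x1[1] + u * (x2[1] - x1[1])
        colors.append(display_color(ratio, volume))
    return colors
```

The first and last columns reproduce the colors at X1 and X2, and every intermediate column takes the color of the corresponding point on the path C.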

The information management unit 26 updates the control information QC of the synthesis information Q so that it reflects the content of the adjustment image 60 and the variable image 50 (SC7). Specifically, the control information QC is updated so that the mixing ratio R and the volume V transition along the path C, from the time point T1 to the time point T2 on the time axis, from the values corresponding to the instruction point X1 to the values corresponding to the instruction point X2. That is, the information management unit 26 updates the control information QC so that, between the time points T1 and T2, the mixing ratio R transitions continuously from the value r1 corresponding to the instruction point X1 on the first axis A1 to the value r2 corresponding to the instruction point X2, and the volume V transitions continuously from the value v1 corresponding to the instruction point X1 on the second axis A2 to the value v2 corresponding to the instruction point X2.
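
Because the path C of the first embodiment is a straight line, the values designated by the control information QC inside the control period can be obtained by linear interpolation. The sketch below shows this under the assumption that time advances uniformly along the path; the function name `control_values` and the choice to reject times outside the control period are illustrative only, since the specification does not describe the values outside it.

```python
from typing import Tuple


def control_values(
    t: float,
    t1: float,
    t2: float,
    x1: Tuple[float, float],
    x2: Tuple[float, float],
) -> Tuple[float, float]:
    """Return (R, V) at time t, where x1 = (r1, v1) and x2 = (r2, v2)."""
    if not t1 < t2:
        raise ValueError("T1 must precede T2")
    if not t1 <= t <= t2:
        raise ValueError("t lies outside the control period")
    u = (t - t1) / (t2 - t1)            # 0 at T1, 1 at T2
    ratio = x1[0] + u * (x2[0] - x1[0])
    volume = x1[1] + u * (x2[1] - x1[1])
    return ratio, volume
```

For instance, with T1 = 1.0, T2 = 2.0, X1 = (0.75, 0.25) and X2 = (0.25, 0.5), the midpoint t = 1.5 yields R = 0.5 and V = 0.375.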

As described above, in the first embodiment, instruction points X (X1, X2) are set according to instructions from the user in the adjustment region 62, in which the first axis A1 indicating the mixing ratio R of the speech units PA and PB and the second axis A2 indicating the volume V of the synthesis target speech are set. Control information QC is then generated that designates a temporal change of the mixing ratio R transitioning from the value r1 corresponding to the instruction point X1 on the first axis A1 to the value r2 corresponding to the instruction point X2, and a temporal change of the volume V transitioning from the value v1 corresponding to the instruction point X1 on the second axis A2 to the value v2 corresponding to the instruction point X2. With this configuration, the user can, while checking the relationship between the mixing ratio R and the volume V, designate the temporal change of the volume V in parallel with designating the temporal change of the mixing ratio R (that is, designate both temporal changes at once). This has the advantage of reducing the workload of the user who designates the variables (mixing ratio R and volume V) applied to the speech synthesis process.

Furthermore, in the first embodiment the control information QC is generated so that the mixing ratio R and the volume V transition from the values at the instruction point X1 to the values at the instruction point X2 between the time points T1 and T2 on the time axis, so the user can designate the temporal change of the mixing ratio R and the volume V for a limited, specific period of the synthesis target speech (synthesized music). In addition, because the time points T1 and T2 that delimit the control period are set variably according to instructions from the user, there is the further advantage that the temporal change of the mixing ratio R and the volume V can be designated for whatever period of the synthesis target speech the user desires.

For example, assume that the volume of the second speech of voice color B (speech unit PB) is larger than that of the first speech of voice color A (speech unit PA), and consider changing the synthesis target speech over time from the second speech (voice color B) to the first speech (voice color A). When, as illustrated in FIG. 10, the instruction point X1 located on the positive side (voice color B side) of the first axis A1 and the instruction point X2 located on the negative side (voice color A side) have the same position on the second axis A2, the value of the volume V is kept substantially constant from the time point T1 to the time point T2. Consequently, during the control period in which the voice quality of the synthesized speech transitions from voice color B of the second speech to voice color A of the first speech, the volume of the synthesized speech decreases over time because of the volume difference between the speech units P at the time of recording. On the other hand, when the instruction point X2 is located on the positive side of the instruction point X1 on the second axis A2, as illustrated in FIG. 9, the value of the volume V increases over time from the time point T1 to the time point T2. Consequently, during the control period in which the voice quality of the synthesized speech transitions from voice color B of the second speech to voice color A of the first speech, the volume of the synthesized speech is kept substantially constant. That is, the effect of the volume difference between the speech units P at the time of recording is reduced.
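
The two cases can be checked numerically with a toy model. The sketch below assumes, purely for illustration, that the level of the mixed unit is the R-weighted average of the recorded levels of PA and PB, that the volume V acts as a plain gain, and that the recorded levels are 0.5 for voice color A and 1.0 for voice color B; none of these figures comes from the specification. The function name `output_level` is hypothetical.

```python
def output_level(u: float, level_a: float, level_b: float, v_start: float, v_end: float) -> float:
    """Output level at normalized time u (0 at T1, 1 at T2) for a B-to-A transition."""
    ratio = 1.0 - u                                   # R falls from 1 (B) to 0 (A)
    raw = (1.0 - ratio) * level_a + ratio * level_b   # assumed level of the mixed unit
    volume = v_start + u * (v_end - v_start)          # V moves linearly from v1 to v2
    return raw * volume


times = (0.0, 0.5, 1.0)
# Case of FIG. 10: X1 and X2 at the same height, so V stays at 1.0.
flat = [output_level(u, 0.5, 1.0, 1.0, 1.0) for u in times]
# Case of FIG. 9: X2 above X1, so V rises from 1.0 to 2.0.
rising = [output_level(u, 0.5, 1.0, 1.0, 2.0) for u in times]
print(flat)    # [1.0, 0.75, 0.5]
print(rising)  # [1.0, 1.125, 1.0]
```

With constant V the level halves over the control period, whereas with rising V it begins and ends at the same level and deviates only slightly in between, which corresponds to "substantially constant" in this toy model.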

In the first embodiment, the points in the adjustment region 62 are set to mutually different display modes, and the display device 14 displays a transition image 54 whose display mode changes, from the time point T1 to the time point T2 on the time axis, from the display mode at the instruction point X1 in the adjustment region 62 to the display mode at the instruction point X2. This has the further advantage that the user can visually and intuitively grasp the temporal change of the mixing ratio R and the volume V from the time point T1 to the time point T2.

The above description illustrated two instruction points X (X1, X2) in the adjustment region 62 and two time points T (T1, T2) in the variable area 52, but as illustrated in FIG. 11, three or more time points T (T1, T2, T3) in the variable area 52 and three or more instruction points X (X1, X2, X3) in the adjustment region 62 can also be set. The display mode of the transition image 54 that the display control unit 24 places in the variable area 52 changes continuously from the display mode at the instruction point X1 to that at the instruction point X2 between the time points T1 and T2, and continuously from the display mode at the instruction point X2 to that at the instruction point X3 between the time points T2 and T3. The information management unit 26 updates the control information QC so that the mixing ratio R and the volume V transition along the path C12 from the values at the instruction point X1 to the values at the instruction point X2 between the time points T1 and T2, and along the path C23 from the values at the instruction point X2 to the values at the instruction point X3 between the time points T2 and T3.
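
With three or more instruction points, the same idea becomes a piecewise interpolation: the segment containing the time t is located first, and R and V are then interpolated along that segment's path (C12, C23, and so on). The sketch below assumes straight paths, strictly increasing time points, and uniform progress along each path; the name `control_values_piecewise` is hypothetical.

```python
from bisect import bisect_right
from typing import Sequence, Tuple


def control_values_piecewise(
    t: float,
    times: Sequence[float],
    points: Sequence[Tuple[float, float]],
) -> Tuple[float, float]:
    """Return (R, V) at time t for time points T1..Tn and instruction points X1..Xn."""
    if len(times) != len(points) or len(times) < 2:
        raise ValueError("need at least two time points, one per instruction point")
    if any(a >= b for a, b in zip(times, times[1:])):
        raise ValueError("time points must be strictly increasing")
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the control period")
    # Index of the segment [T_i, T_(i+1)] that contains t (the last one for t = Tn).
    i = min(bisect_right(times, t) - 1, len(times) - 2)
    u = (t - times[i]) / (times[i + 1] - times[i])
    (r0, v0), (r1, v1) = points[i], points[i + 1]
    return r0 + u * (r1 - r0), v0 + u * (v1 - v0)
```

For time points (0.0, 2.0, 4.0) and instruction points (0.0, 0.5), (1.0, 0.5), (0.5, 1.0), the call at t = 3.0 falls on the second path and returns (0.75, 0.75).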

Second Embodiment
A second embodiment of the present invention is described below. In each of the embodiments illustrated below, elements whose operation and function are the same as in the first embodiment are denoted by the reference signs used in the description of the first embodiment, and detailed description of each is omitted as appropriate.

As in the first embodiment, the control information QC of the synthesis information Q in the second embodiment designates the temporal change of the mixing ratio R and of a first characteristic variable (the volume V); in addition, it designates the temporal change of a second characteristic variable. Like the first characteristic variable, the second characteristic variable is a variable (feature quantity) relating to an acoustic characteristic of the synthesis target speech. The second embodiment uses a volume U as an example of the second characteristic variable. The volume V (first characteristic variable) and the volume U (second characteristic variable) are the same kind of acoustic characteristic, but the volume V is adjusted in consideration of its relationship to the temporal change of the mixing ratio R, whereas the volume U is adjusted in consideration of its relationship to the notes of the synthesized music (the temporal change of volume linked to the progression of the music). For example, the volume U can be varied as a musical expression as the synthesized music progresses, while the volume V is adjusted according to the mixing ratio R so as to suppress the change in perceived loudness of the synthesized speech (a change linked to the temporal change of the mixing ratio R) that arises from the difference in recording volume between the speech unit PA and the speech unit PB.

FIG. 12 is a schematic diagram of the edited image 30 in the second embodiment. In the variable area 52 of the variable image 50 in the second embodiment, a time axis similar to that of the first embodiment and a numerical axis AY (vertical axis) intersecting the time axis are set. The numerical axis AY is a coordinate axis indicating the value of the volume U.

The variable image 50 of the second embodiment expresses the temporal changes of the mixing ratio R and the volume V by the display mode (hue, brightness, and the like) of the transition image 54, as in the first embodiment, and additionally expresses the temporal change of the volume U specified by the control information QC by the shape of the transition image 54. Specifically, the temporal change of the volume U is expressed by the outline 56 located at the upper edge of the transition image 54. FIG. 12 illustrates a case in which the outline 56 of the transition image 54 is a polyline expressing the temporal change of the volume U. The value on the numerical axis AY that corresponds to the point on the outline 56 at an arbitrary time t on the time axis is the value of the volume U at that time t.

By operating the input device 16 as appropriate, the user can instruct editing (deformation) of the outline 56 of the transition image 54. The display control unit 24 deforms the outline 56 of the transition image 54 according to the instruction that the instruction receiving unit 22 has received from the user, and the information management unit 26 updates the temporal change of the volume U specified by the control information QC of the composite information Q according to that instruction. Specifically, the information management unit 26 updates the temporal change of the volume U to the temporal change expressed by the outline 56 after deformation by the display control unit 24.

The second embodiment achieves the same effects as the first embodiment. In addition, in the second embodiment, the temporal changes of the mixing ratio R and the volume V are expressed by the display mode (hue, brightness, and the like) of the transition image 54, and the temporal change of the volume U within the synthesized music is expressed by the shape of the transition image 54. Compared with a configuration in which the temporal change of the volume U is displayed separately from the transition image 54, this has the advantage that the user can check the temporal change of the volume U, in addition to those of the mixing ratio R and the volume V, on a simpler display.

<Third Embodiment>
FIG. 13 is a schematic diagram of the edited image 30 in the third embodiment. In the first embodiment, the temporal changes of the mixing ratio R and the volume V are expressed by the transition image 54 arranged in the variable area 52, which is separate from the score area 42. In the third embodiment, the temporal changes of the mixing ratio R and the volume V are expressed using the note images 44 arranged in the score area 42. The content of the adjustment image 60 is the same as in the first embodiment.

As understood from FIG. 13, the display control unit 24 of the third embodiment controls the display mode of each note image 44 within the control period so that it changes continuously, from time T1 to time T2 on the time axis, from the display mode at the instruction point X1 in the adjustment region 62 to the display mode at the instruction point X2. For example, among the note images 44 in the score area 42, the note image 44 of the note that contains time T1 has, at time T1, the same display mode as the instruction point X1 in the adjustment region 62. Similarly, the note image 44 of the note that contains time T2 has, at time T2, the same display mode as the instruction point X2 in the adjustment region 62. The display mode at an arbitrary time t of the note image 44 of the note that contains that time is set to the display mode of the point corresponding to time t on the path C from the instruction point X1 to the instruction point X2 in the adjustment region 62.

In the variable area 52 of the variable image 50, on the other hand, a time axis and a numerical axis AY (vertical axis) are set as in the second embodiment, and a transition line 58 expressing the temporal change of the volume U specified by the control information QC is displayed. The transition line 58 in FIG. 13 is a polyline expressing the temporal change of the volume U and corresponds to the outline 56 of the transition image 54 in the second embodiment. The variable image 50 may be omitted.

The third embodiment achieves the same effects as the first embodiment. In addition, in the third embodiment, the temporal changes of the mixing ratio R and the volume V are expressed by the display mode of the note images 44 arranged in the score area 42, which has the advantage that the user can easily grasp the relationship between those temporal changes and the individual notes of the synthesized music.

<Fourth Embodiment>
FIG. 14 is a schematic diagram of the edited image 30 in the fourth embodiment. As illustrated in FIG. 14, the score area 42 of the fourth embodiment contains note images 44, each representing a note of the synthesized music, and an auxiliary image 46 corresponding to each note image 44. A note image 44 and its corresponding auxiliary image 46 are arranged close to each other (that is, at positions from which the user can tell which auxiliary image 46 belongs to which note image 44). The display length of an auxiliary image 46 on the time axis is the same as that of the corresponding note image 44.

In the third embodiment, the temporal changes of the mixing ratio R and the volume V are expressed using the note images 44 in the score area 42. In the fourth embodiment, the auxiliary images 46, which are separate from the note images 44, are used for this purpose. That is, each auxiliary image 46 serves as an image that assists the display of the variables (mixing ratio R and volume V) relating to the note represented by the note image 44. The content of the adjustment image 60 is the same as in the first embodiment. In the variable area 52 of the variable image 50, a transition line 58 expressing the temporal change of the volume U specified by the control information QC is displayed, as in the third embodiment. The variable image 50 may be omitted.

As understood from FIG. 14, the display control unit 24 of the fourth embodiment controls the display mode of each auxiliary image 46 within the control period so that it changes continuously, from time T1 to time T2 on the time axis, from the display mode at the instruction point X1 in the adjustment region 62 to the display mode at the instruction point X2. For example, the auxiliary image 46 of the note that contains time T1 has, at time T1, the same display mode as the instruction point X1 in the adjustment region 62, and the auxiliary image 46 of the note that contains time T2 has, at time T2, the same display mode as the instruction point X2. The display mode at an arbitrary time t of the auxiliary image 46 of the note that contains that time is set to the display mode of the point corresponding to time t on the path C from the instruction point X1 to the instruction point X2 in the adjustment region 62.

The fourth embodiment achieves the same effects as the first embodiment. In addition, in the fourth embodiment, the temporal changes of the mixing ratio R and the volume V are expressed by the display mode of the auxiliary image 46 corresponding to each note image 44 in the score area 42. As in the third embodiment, this has the advantage that the user can easily grasp the relationship between those temporal changes and the individual notes of the synthesized music.

The third embodiment, which uses the note images 44 to display the mixing ratio R and the volume V, does not need the auxiliary images 46 of the fourth embodiment. Compared with the fourth embodiment, it therefore has the advantage that the content of the score image 40 is simpler (the total number of display elements is smaller). The fourth embodiment, which uses auxiliary images 46 separate from the note images 44 to display the mixing ratio R and the volume V, allows the display mode of the note images 44 to be chosen independently of the temporal changes of the mixing ratio R and the volume V (that is, there is more freedom in choosing the display mode of the note images 44). This has the advantage that, for example, the voice code q3 attached to each note image 44 remains easy to read.

<Fifth Embodiment>
The first to fourth embodiments illustrate a mixing process for two types of speech (a first speech and a second speech), but a mixing process that mixes the speech units P of three or more types of speech with different voice qualities is also conceivable. When the temporal change of the mixing ratio R is adjusted in a situation where the speech units P of three or more types of speech are mixed, the user has to decide the mixing ratio R while weighing the ratio of every speech against the others, which places a heavy work burden on the user. The fifth embodiment is a form for reducing the work burden of a user who adjusts the temporal change of the mixing ratio R of three or more types of speech.

FIG. 15 is an explanatory diagram of the mixing process in the fifth embodiment. As illustrated in FIG. 15, three speech libraries L (LA, LB, LC) are stored in the storage device 12 of the fifth embodiment. The speech library LC is a set of speech units PC extracted from a third speech whose voice quality (timbre) differs from that of the first speech of the speech library LA and the second speech of the speech library LB. In the mixing process (SA3) executed by the speech synthesis unit 28, a speech unit PA selected from the speech library LA, a speech unit PB selected from the speech library LB, and a speech unit PC selected from the speech library LC are mixed at the mixing ratio R. The mixing ratio R of the fifth embodiment comprises a ratio (weight) λA of the first speech, a ratio λB of the second speech, and a ratio λC of the third speech. As expressed by equation (b) below, for example, the mixing process calculates a variable pS relating to the voice quality of the mixed speech unit PS by weighted addition of the variable pA of the speech unit PA, the variable pB of the speech unit PB, and the variable pC of the speech unit PC according to the mixing ratio R (λA, λB, λC). The sum of the ratios λA, λB, and λC is 1, for example.
pS = λA·pA + λB·pB + λC·pC ……(b)
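The weighted addition of equation (b) can be sketched in a few lines of Python. This is only an illustration: the function name `mix_units`, the representation of the variables pA, pB, and pC as equal-length lists of numbers, and the normalization of the ratios to a sum of 1 are assumptions made here, not details given in the description.

```python
def mix_units(p_a, p_b, p_c, ratio):
    """Weighted addition of equation (b): pS = λA·pA + λB·pB + λC·pC.

    p_a, p_b, p_c are hypothetical voice-quality variables of the speech
    units PA, PB, PC (equal-length sequences of floats).
    ratio is the mixing ratio R as a tuple (λA, λB, λC).
    """
    lam_a, lam_b, lam_c = ratio
    total = lam_a + lam_b + lam_c
    if total <= 0:
        raise ValueError("at least one ratio must be positive")
    # Assumption: normalize so that the ratios sum to 1, matching the
    # example in the text where λA + λB + λC = 1.
    lam_a, lam_b, lam_c = lam_a / total, lam_b / total, lam_c / total
    return [lam_a * a + lam_b * b + lam_c * c
            for a, b, c in zip(p_a, p_b, p_c)]


# Half of voice A, a quarter each of voices B and C.
p_s = mix_units([1.0, 2.0], [3.0, 4.0], [5.0, 6.0], (0.5, 0.25, 0.25))
print(p_s)  # [2.5, 3.5]
```

Normalizing inside the function lets a caller pass relative ratios such as 10:5:5 directly.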

As understood from the above description, the voice quality of the speech unit PS can be set to a voice quality intermediate between those of the speech units PA, PB, and PC according to the mixing ratio R. The operation of generating the audio signal S by connecting the speech units PS, which are generated one after another by the mixing process, along the time axis is the same as in the first embodiment. The control information QC of the composite information Q specifies the temporal change of the mixing ratio R (λA, λB, λC) applied to the mixing process. In the fifth embodiment, the temporal change of the volume V is omitted. Any known technique other than equation (b) may also be used for the mixing process of the fifth embodiment.

FIG. 16 is a schematic diagram of the edited image 30 in the fifth embodiment. The edited image 30 of the fifth embodiment comprises a score image 40 similar to that of the first embodiment, together with an adjustment image 60 and a variable image 50 for adjusting the temporal change of the mixing ratio R. The adjustment image 60 of the fifth embodiment contains an adjustment region 62 in which reference points G (GA, GB, GC), one for each speech used in the speech synthesis process, are set apart from one another. The reference point GA (timbre A) corresponds to the first speech (speech unit PA), the reference point GB (timbre B) corresponds to the second speech (speech unit PB), and the reference point GC (timbre C) corresponds to the third speech (speech unit PC).

Each point in the adjustment region 62 is displayed in a display mode that differs according to its position. The actual adjustment region 62 is a color image containing many colors, but for convenience FIG. 16 represents the hues in the adjustment region 62 by shades of gray (grayscale). Specifically, an image in which a continuous distribution of hues ranging over blue, green, and red is arranged in order of wavelength around the center of the adjustment region 62 is suitable as the adjustment region 62.

As in the first embodiment, the instruction receiving unit 22 receives from the user instructions for a plurality of time points T (T1, T2, T3) on the time axis of the variable area 52 (SC2), and the display control unit 24 displays each time point T in the variable area 52 (SC3). The instruction receiving unit 22 also receives from the user instructions for a plurality of instruction points X (X1, X2, X3) in the adjustment region 62 (SC4), and the display control unit 24 displays in the adjustment region 62 each instruction point X received by the instruction receiving unit 22, together with the paths C (C12, C23) that connect each pair of instruction points X designated in succession (SC5).

Each instruction point X of the fifth embodiment is a coordinate point corresponding to one set of values of the mixing ratio R (λA, λB, λC). Specifically, the value of each ratio λ of the mixing ratio R is determined from the position of the instruction point X in the adjustment region 62 so that the closer the instruction point X is to a reference point G, the larger the ratio λ (λA, λB, λC) of the speech corresponding to that reference point G. Specific examples of the relationship between the position of the instruction point X and the values of the ratios λ are listed below.

FIG. 17 is a schematic diagram of the adjustment region 62 of the fifth embodiment. For convenience, FIG. 17 omits the differences in display mode among the points in the adjustment region 62. The relationship between each position σ (σ1, σ2, σ3, σ4, ...) in FIG. 17 and the relative ratio (λA:λB:λC) of the ratios λ corresponding to an instruction point X at that position σ is, for example, as follows.
(1) Position σ1 (reference point GA) λA:λB:λC = 10:0:0
(2) Position σ2 (reference point GB) λA:λB:λC = 0:10:0
(3) Position σ3 (reference point GC) λA:λB:λC = 0:0:10
(4) Position σ4 λA:λB:λC = 0:10:10
(5) Position σ5 λA:λB:λC = 5:5:0
(6) Position σ6 λA:λB:λC = 5:0:5
(7) Position σ7 λA:λB:λC = 10:5:5
The relationship between the instruction point X and the mixing ratio R (the ratios λ) is not limited to the above examples. For example, the value of each ratio λ may be determined from the position of the instruction point X in the adjustment region 62 so that the ratio λ of each speech is inversely proportional to the distance from the reference point G of that speech to the instruction point X.
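The inverse-distance rule mentioned at the end of the preceding paragraph can be sketched as follows. The coordinates of the reference points GA, GB, and GC (here the corners of a roughly equilateral triangle), the function name `ratios_from_point`, and the normalization of the ratios to a sum of 1 are assumptions chosen for illustration; the description does not fix any of them.

```python
import math

# Hypothetical coordinates of the reference points GA, GB, GC in the
# adjustment region (an assumption; the text only says they are apart).
REFERENCE_POINTS = {
    "A": (0.0, 1.0),
    "B": (-0.866, -0.5),
    "C": (0.866, -0.5),
}


def ratios_from_point(x, refs=REFERENCE_POINTS):
    """Return ratios λ that are inversely proportional to the distance
    from each reference point G to the instruction point x."""
    dists = {name: math.dist(x, g) for name, g in refs.items()}
    # An instruction point placed exactly on a reference point selects
    # that voice alone (avoids division by zero).
    for name, d in dists.items():
        if d == 0.0:
            return {n: (1.0 if n == name else 0.0) for n in refs}
    weights = {name: 1.0 / d for name, d in dists.items()}
    total = sum(weights.values())
    # Assumption: normalize so that λA + λB + λC = 1.
    return {name: w / total for name, w in weights.items()}


print(ratios_from_point((0.0, 1.0)))   # voice A only
print(ratios_from_point((0.0, 0.0)))   # roughly equal thirds
```

Note that this rule is the alternative described in the text, not the enumerated examples (1) to (7): under pure inverse-distance weighting, a point midway between GA and GB still yields a small nonzero λC, whereas example (5) assigns 5:5:0.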

As in the first embodiment, the display control unit 24 arranges, in the variable area 52 of the variable image 50, a transition image 54 that reflects the instruction points X designated in the adjustment region 62 (SC6). Specifically, as understood from FIG. 16, the display mode of the transition image 54 changes continuously from the display mode at the instruction point X1 in the adjustment region 62 to the display mode at the instruction point X2 between time T1 and time T2 on the time axis, and changes continuously from the display mode at the instruction point X2 to the display mode at the instruction point X3 between time T2 and time T3.

The information management unit 26 updates the control information QC of the composite information Q so that each ratio λ of the mixing ratio R changes over time according to the values at the instruction points X in the adjustment region 62 (SC7). Specifically, the control information QC is updated so that the mixing ratio R (each ratio λ) changes along the path C12, from time T1 to time T2, from the value corresponding to the instruction point X1 in the adjustment region 62 to the value corresponding to the instruction point X2, and changes along the path C23, from time T2 to time T3, from the value corresponding to the instruction point X2 to the value corresponding to the instruction point X3.
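One way to picture the update in step SC7 is to compute, for any time t, the point on the straight paths C12 and C23 that corresponds to t, and then convert that point into ratios λ. The sketch below covers only the first part. It assumes that elapsed time maps linearly onto distance along each straight path, which the description does not state, and the name `position_at` is hypothetical.

```python
def position_at(t, times, points):
    """Return the point on the polyline path through `points` that
    corresponds to time t.

    times  -- strictly increasing time points, e.g. [T1, T2, T3]
    points -- instruction points, e.g. [X1, X2, X3], as (x, y) tuples
    Assumption: position moves linearly in time along each segment.
    """
    if len(times) != len(points) or len(times) < 2:
        raise ValueError("need at least two matching times and points")
    if any(b <= a for a, b in zip(times, times[1:])):
        raise ValueError("times must be strictly increasing")
    # Outside the control period, hold the first or last point.
    if t <= times[0]:
        return points[0]
    if t >= times[-1]:
        return points[-1]
    segments = zip(zip(times, points), zip(times[1:], points[1:]))
    for (t0, p0), (t1, p1) in segments:
        if t0 <= t <= t1:
            u = (t - t0) / (t1 - t0)  # fraction of the segment elapsed
            return tuple(a + u * (b - a) for a, b in zip(p0, p1))


times = [0.0, 2.0, 4.0]                        # T1, T2, T3
points = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]  # X1, X2, X3
print(position_at(1.0, times, points))  # (1.0, 0.0), halfway along C12
print(position_at(3.0, times, points))  # (2.0, 1.0), halfway along C23
```

The returned point would then be passed to whatever rule maps a position in the adjustment region 62 to the ratios λA, λB, and λC.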

As described above, in the fifth embodiment, instruction points X (X1, X2, X3) are set according to instructions from the user in the adjustment region 62, in which a reference point G (GA, GB, GC) is set for each speech used in the mixing process, and the control information QC specifies the temporal change of the mixing ratio R (λA, λB, λC) between the instruction points X. With this configuration, the user can specify the temporal change of the mixing ratio R while visually checking, in the adjustment region 62, how the speeches used in the mixing process relate to one another (that is, the relationship among their ratios λ). This has the advantage of reducing the work burden of a user who adjusts the temporal change of the mixing ratio R.

The fifth embodiment may also adopt, as in the second embodiment, a configuration in which the temporal change of the volume U specified by the control information QC is expressed by the shape of the transition image 54 (the shape of the outline 56). The fifth embodiment may likewise adopt a configuration in which the change in display mode between the instruction points X in the adjustment region 62 is expressed by the display mode of the note images 44 in the score area 42, as in the third embodiment, or by the display mode of the auxiliary images 46 in the score area 42, as in the fourth embodiment.

<Sixth Embodiment>
FIG. 18 is a schematic diagram of the adjustment image 60 that the display control unit 24 of the sixth embodiment displays on the display device 14. The adjustment image 60 of the sixth embodiment contains an adjustment region 62 in which a first axis A1 (horizontal axis) and a second axis A2 (vertical axis) that intersect each other are set. The first axis A1 and the second axis A2 are coordinate axes indicating the values of different types of characteristic variables. A characteristic variable is a variable relating to the acoustic characteristics of the synthesis target speech and is applied to the speech synthesis process performed by the speech synthesis unit 28. Examples of characteristic variables include the clarity of the voice (brightness, clearness), the strength of the breath component (breathiness), the degree of male or female voice (gender factor), small changes in pitch (pitch bend), volume (dynamics), and the strength of articulation (velocity). The first axis A1 is a coordinate axis indicating the value of a first characteristic variable selected from these examples, and the second axis A2 is a coordinate axis indicating the value of a second characteristic variable of a different type from the first characteristic variable.

As in the first embodiment, each point in the adjustment region 62 is displayed in a display mode that differs according to its position, and a plurality of instruction points X are set in the adjustment region 62 according to instructions from the user. Also as in the first embodiment, a plurality of time points T are set in the variable area 52 of the variable image 50 according to instructions from the user, and a transition image 54 is arranged whose display mode changes between the time points T on the time axis in the same way as it changes between the instruction points X in the adjustment region 62.

The control information QC of the composite information Q specifies the temporal changes of the first characteristic variable and the second characteristic variable. The information management unit 26 updates the control information QC so that the values of the first and second characteristic variables change over time according to the values at the instruction points X in the adjustment region 62. Specifically, the control information QC is updated so that, from time T1 to time T2, the value of the first characteristic variable changes continuously from the value on the first axis A1 corresponding to the instruction point X1 to the value corresponding to the instruction point X2, and the value of the second characteristic variable changes continuously from the value on the second axis A2 corresponding to the instruction point X1 to the value corresponding to the instruction point X2.
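As a rough illustration of this update, the sketch below maps the coordinates of two instruction points onto two characteristic variables and samples both variables between time T1 and time T2. Everything specific in it is an assumption: the `Axis` class, the normalized coordinates in the range 0 to 1, the value range 0 to 127, the linear transition, and the representation of the control information as a list of `(time, value1, value2)` tuples are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Axis:
    """Hypothetical model of an axis (A1 or A2) of the adjustment region."""
    name: str
    minimum: float
    maximum: float

    def value_at(self, coord):
        """Map a normalized coordinate in [0, 1] to a variable value."""
        coord = min(1.0, max(0.0, coord))  # clamp against rounding error
        return self.minimum + coord * (self.maximum - self.minimum)


def control_information(axis1, axis2, x1, x2, t1, t2, steps):
    """Sample both variables from point x1 (at t1) to point x2 (at t2).

    Assumption: the transition is linear in time; the text only
    requires it to be continuous.
    """
    if steps < 1 or t2 <= t1:
        raise ValueError("need steps >= 1 and t2 > t1")
    samples = []
    for i in range(steps + 1):
        u = i / steps
        px = x1[0] + u * (x2[0] - x1[0])  # position along axis A1
        py = x1[1] + u * (x2[1] - x1[1])  # position along axis A2
        samples.append((t1 + u * (t2 - t1),
                        axis1.value_at(px),
                        axis2.value_at(py)))
    return samples


a1 = Axis("brightness", 0.0, 127.0)   # first characteristic variable
a2 = Axis("breathiness", 0.0, 127.0)  # second characteristic variable
for row in control_information(a1, a2, (0.0, 1.0), (1.0, 0.0), 0.0, 2.0, 2):
    print(row)
```

Keeping the axis scaling separate from the sampling loop makes it straightforward to assign a different pair of characteristic variables to the two axes.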

As described above, in the sixth embodiment, instruction points X (X1, X2) are set according to instructions from the user in the adjustment region 62, in which the first axis A1 of the first characteristic variable and the second axis A2 of the second characteristic variable are set, and the temporal changes of the first and second characteristic variables are set so that each variable changes from the value corresponding to the instruction point X1 to the value corresponding to the instruction point X2. With this configuration, the user can specify the temporal changes of both variables while checking the relationship between the first characteristic variable and the second characteristic variable. This has the advantage of reducing the work burden of a user who specifies the characteristic variables applied to the speech synthesis process.

<Modifications>
Each of the above embodiments can be modified in various ways. Specific modifications are illustrated below. Two or more modes freely selected from the following examples may be combined as appropriate.

(1) In each of the above embodiments, the instruction points X in the adjustment region 62 are connected by a straight path C, but the path C is not limited to a straight line. For example, a configuration may be adopted in which an interpolation curve fitted to three instruction points X (X1, X2, X3) is set as the path C, as illustrated in FIG. 19, or in which a curve drawn freely by the user between the instruction points X (a free curve) is set as the path C, as illustrated in FIG. 20.

(2) In each of the above embodiments, the brightness and hue are varied according to the position of each point in the adjustment region 62, but the display mode that is varied according to position is not limited to brightness and hue. For example, as illustrated in FIG. 21, a pattern such as shading or hatching (a fill pattern) may be varied according to the position of each point in the adjustment region 62. Also, although the adjustment region 62 is a color image in each of the above embodiments, the adjustment region 62 may be a monochrome image in which the brightness (gradation) is varied according to the position of each point.

(3) In the first to fourth embodiments, the volume V is exemplified as the first characteristic variable adjusted together with the mixing ratio R, but the first characteristic variable is not limited to the volume V. Likewise, in the second embodiment, the volume U is exemplified as the second characteristic variable whose time change is expressed by the shape of the transition image 54, but the second characteristic variable is not limited to the volume U. For example, as illustrated in the sixth embodiment, the clarity of the voice, the strength of the breath component, the degree of male or female voice, minute changes in pitch, and the like may be selected as the first characteristic variable of the first to fourth embodiments or as the second characteristic variable of the second embodiment.

(4) Each of the foregoing embodiments focuses on the time change of the variables from the time point T1 to the time point T2, but the present invention is similarly applicable when, for example, the time change of the variables is adjusted over the entire section of the synthesis target speech (synthesized music). That is, it is not essential to limit the period in which the time change of the variables is adjusted to a specific period (control period) of the synthesis target speech. Accordingly, the designation of each time point T by the user may be omitted. Moreover, a configuration in which each time point T on the time axis is set according to an instruction from the user is not essential. Specifically, each time point T may be set in the synthesis target speech (synthesized music) by a predetermined method that does not require an instruction from the user. For example, if a singing section (for example, a phrase) of the synthesized music is detected by a known method and the start point and end point of the singing section are set as the time points T, the voice quality of the singing voice can be made different between the earlier part and the later part of the singing section.
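The automatic setting of the time points T can be sketched as follows. The text refers only to "a known method" for detecting singing sections, so this example substitutes a simple assumed heuristic: notes are grouped into one phrase until a rest of at least `min_rest` seconds occurs. The function name `phrase_boundaries`, the `(onset, duration)` note format, and the threshold are all hypothetical.

```python
def phrase_boundaries(notes, min_rest=0.5):
    """Return (start, end) pairs usable as time points T for each phrase.

    notes    -- list of (onset, duration) in seconds, sorted by onset
                (assumed to be derived from the sounding periods of the notes)
    min_rest -- assumed minimum silence, in seconds, that separates phrases
    """
    phrases = []
    start = end = None
    for onset, duration in notes:
        if start is None:
            # First note opens the first phrase.
            start, end = onset, onset + duration
        elif onset - end >= min_rest:
            # A long enough rest closes the current phrase and opens a new one.
            phrases.append((start, end))
            start, end = onset, onset + duration
        else:
            # The note continues the current phrase.
            end = max(end, onset + duration)
    if start is not None:
        phrases.append((start, end))
    return phrases
```

Each returned pair could serve as the time points T1 and T2 of one control period, with the indication points X1 and X2 assigned to its start and end.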

(5) In the fifth embodiment, the reference points G are set at the corners of the adjustment region 62, but the reference points G (GA, GB, GC) may instead be set inside the adjustment region 62, as illustrated in FIG. 22. In the example of FIG. 22, a triangular region 68 whose vertices are the reference points G is defined inside the adjustment region 62. In a configuration in which an indication point X can be set both inside and outside the region 68 defined by the reference points G, as in the example of FIG. 22, the content of the mixing process may be made different between the inside and the outside of the region 68.

For example, assume that the indication point X1 is located inside the region 68 and the indication point X2 is located outside the region 68, as illustrated in FIG. 22. At each time point t, within the period from the time point T1 to the time point T2, that corresponds to the section of the path C12 inside the region 68, the speech synthesis unit 28 mixes the speech units P (PA, PB, PC) of all three types of voice at the mixing ratio R, as in the fifth embodiment. On the other hand, at each time point t, within the period from the time point T1 to the time point T2, that corresponds to the section of the path C12 outside the region 68, the speech synthesis unit 28 mixes, at the mixing ratio R, the speech units P (PA, PC) corresponding to the two reference points G (GA, GC) close to the indication point X2. In the example of FIG. 22, the indication point X2 and the indication point X3 are both located outside the region 68, so that the speech unit PA and the speech unit PC are mixed at the mixing ratio R in the period from the time point T2 to the time point T3.
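The switch between three-voice and two-voice mixing can be sketched as below. This is an illustration under stated assumptions, not the method of the embodiments: the inside test and the three-voice weights use barycentric coordinates, and the two-voice weights use inverse distance, neither of which is specified in the text. The names `barycentric`, `mixing_weights`, and `select_near` are hypothetical.

```python
import math


def barycentric(p, a, b, c):
    """Barycentric coordinates of point p with respect to triangle a, b, c."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    wa = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    wb = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return wa, wb, 1.0 - wa - wb


def mixing_weights(p, refs, select_near=None):
    """Return a dict mapping voice name to mixing weight for a path point p.

    refs        -- dict of exactly three reference points, e.g. {"A": (x, y), ...}
    select_near -- point used to choose the two nearest reference points when
                   p is outside the triangle (for example the indication
                   point X2); defaults to p itself
    """
    names = list(refs)
    weights = barycentric(p, *(refs[n] for n in names))
    if min(weights) >= 0.0:
        # Inside (or on the edge of) region 68: mix all three voices.
        return dict(zip(names, weights))
    # Outside region 68: mix only the two voices whose reference points
    # are closest to the selection point.
    q = p if select_near is None else select_near
    near = sorted(names, key=lambda n: math.dist(q, refs[n]))[:2]
    d0, d1 = (math.dist(p, refs[n]) for n in near)
    total = d0 + d1
    # Assumed inverse-distance weighting: the closer point gets more weight.
    return {near[0]: d1 / total, near[1]: d0 / total}
```

For a path point outside the triangle, passing the indication point X2 as `select_near` keeps the same pair of voices for the whole outside section of the path C12, which matches the description above.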

(6) In the fifth embodiment, the mixing of three types of voice (a first voice, a second voice, and a third voice) is exemplified, but the number of types of voice to be mixed is arbitrary; for example, four or more types of voice may be mixed. FIG. 23 is a schematic diagram of the adjustment region 62 for the case in which five types of voice are mixed. As illustrated in FIG. 23, five reference points G (GA, GB, GC, GD, GE), one for each voice, are set on the circumference of the circular adjustment region 62.
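One way to place the reference points on the circumference is sketched below. The text states only that the points lie on the circumference of the circular adjustment region; the equal angular spacing, the starting angle at the top of the circle, and the function name `reference_points_on_circle` are assumptions made for illustration.

```python
import math


def reference_points_on_circle(count, center=(0.0, 0.0), radius=1.0):
    """Place `count` reference points G at equal angles on a circle.

    Assumption: equal spacing, starting from the top of the circle and
    proceeding counterclockwise.
    """
    cx, cy = center
    points = []
    for k in range(count):
        angle = math.pi / 2 + 2 * math.pi * k / count
        points.append((cx + radius * math.cos(angle),
                       cy + radius * math.sin(angle)))
    return points


# Five voices as in FIG. 23; the labels follow the reference points GA to GE.
layout = dict(zip(["GA", "GB", "GC", "GD", "GE"],
                  reference_points_on_circle(5)))
```

With this layout every reference point is at the same distance from the center, so an indication point at the center is equidistant from all five voices.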

(7) Each of the foregoing embodiments exemplifies the speech synthesizer 100, which executes both the management of the synthesis information Q (the display control unit 24 and the information management unit 26) and the generation of the speech signal S (the speech synthesis unit 28), but the present invention can also be specified as a speech synthesis management device that manages the synthesis information Q. In the speech synthesis management device, the presence or absence of the speech synthesis unit 28 does not matter. The speech synthesizer 100 or the speech synthesis management device may also be realized by a server device that communicates with a terminal device such as a mobile phone. In that case, the instruction receiving unit 22 receives, from the terminal device via a communication network, an instruction given by the user to the terminal device, and the display control unit 24 causes the display device of the terminal device to display the edited image 30, for example by transmitting image data of the edited image 30 to the terminal device. The speech synthesis unit 28 transmits the speech signal S generated by the speech synthesis process to the terminal device.

(8) Each of the foregoing embodiments exemplifies the generation of the speech signal S of the singing voice of a synthesized piece of music, but the present invention is also applicable to the generation of a speech signal S of speech other than a singing voice (for example, conversational speech). Accordingly, the music information QM (pitch q1, sounding period q2) of the synthesis information Q may be omitted. Further, although each of the foregoing embodiments exemplifies the synthesis of Japanese speech, the language of the speech to be synthesized is arbitrary. For example, the present invention is applicable to the generation of speech in any language, such as English, Spanish, Chinese, or Korean.

DESCRIPTION OF SYMBOLS 100 ... speech synthesizer, 10 ... arithmetic processing device, 12 ... storage device, 14 ... display device, 16 ... input device, 18 ... sound emission device, 22 ... instruction receiving unit, 24 ... display control unit, 26 ... information management unit, 28 ... speech synthesis unit, 30 ... edited image, 40 ... score image, 42 ... score area, 44 ... note image, 46 ... auxiliary image, 50 ... variable image, 52 ... variable area, 54 ... transition image, 56 ... outline, 58 ... transition line, 60 ... adjustment image, 62 ... adjustment region.

Claims

A speech synthesis management device comprising: display control means for causing a display device to display an adjustment image including an adjustment region in which a first axis indicating a mixing ratio between a first voice and a second voice at the time of synthesis of a synthesis target speech and a second axis indicating a first characteristic variable relating to an acoustic characteristic of the synthesis target speech are set;
instruction receiving means for receiving, from a user, an instruction of a first indication point and a second indication point in the adjustment region; and
information management means for generating control information indicating a time change of the mixing ratio from a numerical value corresponding to the first indication point on the first axis to a numerical value corresponding to the second indication point, and a time change of the first characteristic variable from a numerical value corresponding to the first indication point on the second axis to a numerical value corresponding to the second indication point.

The speech synthesis management device according to claim 1, wherein the information management means generates the control information designating, for a period from a first time point to a second time point on a time axis of the synthesis target speech, the time change of the mixing ratio from the numerical value corresponding to the first indication point on the first axis to the numerical value corresponding to the second indication point, and the time change of the first characteristic variable from the numerical value corresponding to the first indication point on the second axis to the numerical value corresponding to the second indication point.

The speech synthesis management device according to claim 2, wherein the display control means causes the display device to display:
the adjustment image, in which the points in the adjustment region are set to mutually different display modes; and
a variable image including a transition image whose display mode changes, from the first time point to the second time point on the time axis, from the display mode at the first indication point in the adjustment region to the display mode at the second indication point.

The speech synthesis management device according to claim 3, wherein the display control means causes the display device to display the variable image in which a time change of a second characteristic variable relating to an acoustic characteristic of the synthesis target speech is represented by the shape of the transition image.

The speech synthesis management device according to claim 2, wherein the display control means causes the display device to display:
the adjustment image, in which the points in the adjustment region are set to mutually different display modes; and
a score image in which note images, each representing a note of the synthesis target speech, are arranged in a score area in which a time axis and a pitch axis are set, and sets the display mode of each point on the time axis in each note image to the display mode at the corresponding point on the path from the first indication point to the second indication point in the adjustment region.