JP6583756B1

JP6583756B1 - Speech synthesis apparatus and speech synthesis method

Info

Publication number: JP6583756B1
Application number: JP2018166693A
Authority: JP
Inventors: 恵一徳田; 圭一郎大浦; 和寛中村
Original assignee: Techno Speech Inc
Current assignee: Techno Speech Inc
Priority date: 2018-09-06
Filing date: 2018-09-06
Publication date: 2019-10-02
Anticipated expiration: 2038-09-06
Also published as: JP2020042056A

Abstract

[Problem] To provide a technique that allows synthesized speech to be edited in real time. [Solution] A speech synthesis apparatus includes: a display control unit that displays a speech target of the synthesized speech to be synthesized in a predetermined display area; a synthesis unit that synthesizes speech for the speech target using specified parameters; an instruction input unit that moves a specific position within the display area, the position being selected by operating a device prepared in advance; and an editing unit that edits the parameters according to at least one of the coordinates, the moving speed, and the movement shape of the specified position. The parameters relate to the manner of expression of the speech target and include a first parameter that adjusts at least the speaking speed of the speech target and a second parameter that adjusts an element of the speech target other than the speaking speed. [Selected drawing] FIG. 5

Description

The present invention relates to a speech synthesis apparatus and a speech synthesis method.

A known type of conventional speech synthesis apparatus synthesizes speech from text information that is to be spoken (for example, Patent Document 1). This technology is used for live activities, such as concerts performed by virtual singers, and for live streaming on video distribution sites.

Patent Document 1: JP 2015-102727 A

In such live performances, changing the way of singing or speaking to suit the venue and the audience creates a sense of unity with the audience and provides value that only a live performance can offer. However, changing the way of singing or speaking in real time requires cumbersome editing of parameters and the like, which has made it difficult to actually use speech synthesis in live performances. A speech synthesis technique that can be edited in real time has therefore been desired.

The present invention has been made to solve the above problem and can be realized in the following forms. According to one aspect of the present invention, a speech synthesis apparatus is provided. The speech synthesis apparatus includes: a display control unit that displays a speech target of the synthesized speech to be synthesized in a predetermined display area, using at least one of the characters constituting the speech target and a notation for uttering those characters; a synthesis unit that synthesizes speech for the speech target using specified parameters; an instruction input unit that moves a specific position within the display area, the position being selected by operating a device prepared in advance; an editing unit that edits the parameters according to one or more of the coordinates, the moving speed, and the movement shape of the specified position, including at least the moving speed; and a playback unit that plays back the synthesized speech. The display control unit scrolls the speech target at a predetermined scroll speed. The parameters include a first parameter that adjusts at least the speaking speed of the speech target, and the first parameter is determined according to the difference between the scroll speed and the moving speed. The synthesis unit successively synthesizes speech using parameters, obtained from the editing unit, that reflect the operation of the device, and the synthesized speech is successively played back by the playback unit. With this apparatus, synthesized speech can be edited and played back in real time, for example by tracing the speech target. The present invention can also be realized in the following forms.
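The text states only that the first parameter is determined according to the difference between the scroll speed and the moving speed; it does not give a formula. The following is a minimal sketch of one possible reading, assuming the text scrolls left at a constant speed and the pointer velocity is signed (positive to the right). The function name `speech_rate`, the pixel-per-second units, and the normalization by the scroll speed are all assumptions made for illustration.

```python
def speech_rate(scroll_speed: float, pointer_velocity: float) -> float:
    """Relative speaking rate; 1.0 corresponds to the predetermined tempo.

    scroll_speed: leftward scroll speed of the text in px/s (must be > 0).
    pointer_velocity: signed pointer velocity in px/s, positive to the right.
    Both the units and the normalization are assumptions of this sketch.
    """
    if scroll_speed <= 0:
        raise ValueError("scroll_speed must be positive")
    # Velocity of the pointer relative to the scrolling text: the text moves
    # at -scroll_speed, so the difference is pointer_velocity + scroll_speed.
    relative = pointer_velocity + scroll_speed
    return relative / scroll_speed
```

Under this reading, a stationary pointer yields a rate of 1.0, a pointer moving right speaks faster, and a pointer moving left at exactly the scroll speed holds the current position (rate 0.0).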

(1) According to one aspect of the present invention, a speech synthesis apparatus is provided. The speech synthesis apparatus includes: a display control unit that displays a speech target of the synthesized speech to be synthesized in a predetermined display area; a synthesis unit that synthesizes speech for the speech target using specified parameters; an instruction input unit that moves a specific position within the display area, the position being selected by operating a device prepared in advance; and an editing unit that edits the parameters according to at least one of the coordinates, the moving speed, and the movement shape of the specified position. The parameters include a first parameter that adjusts at least the speaking speed of the speech target. With this apparatus, synthesized speech can be edited in real time, for example by tracing the speech target.

(2) In the speech synthesis apparatus of the above aspect, the synthesis unit may synthesize the speech using an acoustic model whose acoustic parameters have been learned by a statistical method. With this apparatus, synthesized speech can be generated from a small amount of data.

(3) The speech synthesis apparatus of the above aspect may further include a playback unit that plays back the synthesized speech, and the synthesis unit may successively synthesize speech using parameters, obtained from the editing unit, that reflect the operation of the device, with the synthesized speech being played back by the playback unit. With this apparatus, synthesized speech can be edited and played back in real time.

(4) In the speech synthesis apparatus of the above aspect, the display control unit may be capable of displaying a pointer, and the instruction input unit may realize the movement of the specific position as movement of the position of the pointer. With this apparatus, parameters can be edited more visually.

(5) In the speech synthesis apparatus of the above aspect, the device may be a pointing device that moves the pointer displayed by the display control unit. With this apparatus, parameters can be edited visually.

(6) In the speech synthesis apparatus of the above aspect, the device may have the display area. With this apparatus, synthesized speech can be edited using, for example, a touch panel that can acquire the coordinates and pressure of a contact.

(7) In the speech synthesis apparatus of the above aspect, the editing unit may edit, according to the pressure sensed by the device, a second parameter that relates to the manner of expression of the speech target and adjusts an element of the speech target other than the speaking speed. With this apparatus, a plurality of parameters can be edited simultaneously, so synthesized speech can be edited in real time.

The present invention can be realized in various forms. For example, it can be realized as a speech synthesis system that uses the speech synthesis apparatus described above, a method executed in an information processing apparatus to realize the functions of the speech synthesis apparatus or the speech synthesis system, a computer program, a server apparatus for distributing the computer program, or a non-transitory storage medium storing the computer program.

FIG. 1 is an explanatory diagram showing an overview of the speech synthesis apparatus.
FIG. 2 is an example of a display area displayed by the display control unit.
FIG. 3 is a diagram showing an example of the various acoustic parameters modeled by the acoustic model.
FIG. 4 is a flowchart showing synthesized speech playback processing in a live performance using the speech synthesis apparatus.
FIG. 5 is a flowchart showing speech synthesis processing.
FIG. 6 is a diagram showing an example of a trajectory traced by the instruction input unit.
FIG. 7 is a diagram showing another example of a trajectory traced by the instruction input unit.
FIG. 8 is an explanatory diagram showing an overview of the speech synthesis apparatus in the second embodiment.
FIG. 9 is a flowchart of speech synthesis processing in the second embodiment.
FIG. 10 is an example of a display area displayed by the display control unit in the second embodiment.

A. First embodiment:
FIG. 1 is an explanatory diagram showing an overview of a speech synthesis apparatus 100 according to an embodiment of the present invention. The speech synthesis apparatus 100 includes a synthesis unit 10, a playback unit 20, a display control unit 30, an instruction input unit 40, an editing unit 50, and a control unit 60.

The synthesis unit 10 includes an acoustic model 11 and a synthesis engine 12. The synthesis unit 10 synthesizes speech based on text information that is the speech target. In this embodiment, the synthesis engine 12 synthesizes speech using the acoustic model 11, whose acoustic parameters have been learned by a statistical method, and parameters 13 edited by the editing unit 50 described later. More specifically, the synthesized speech is generated using a hidden Markov model (hereinafter also referred to as HMM) or a deep neural network (hereinafter also referred to as DNN). The acoustic parameters used for training the acoustic model are described in detail later.

The playback unit 20 outputs the synthesized speech generated by the synthesis unit 10 to a speaker 70.

The display control unit 30 displays the speech target of the synthesized speech in a predetermined display area 31. FIG. 2 is an example of the display area 31 displayed by the display control unit 30. The display area 31 includes display areas 31A, 31B, and 31C, each with a different speech target. In this embodiment, the display area 31 is provided on a touch panel 80, which is a device prepared in advance. The touch panel 80 can acquire the coordinates and the pressure of the contact made by a touch pen 85. The information that the touch panel 80 can acquire is not limited to coordinates and pressure; for example, the tilt of the touch pen 85 may also be acquired. The display control unit 30 is not limited to displaying on the touch panel 80 and may display on an ordinary display or on a display included in the speech synthesis apparatus 100. In this embodiment, for example, the speech target displayed in the display area 31A is 「こんにちは」 ("hello"), the speech target displayed in the display area 31B is 「ありがとう」 ("thank you"), and the speech target displayed in the display area 31C is 「さようなら」 ("goodbye").

The instruction input unit 40 moves a specific position within a display area selected from the display areas 31A to 31C through operation of the touch panel 80 and the touch pen 85, which are devices prepared in advance. The instruction input unit 40 may also move the specific position within the selected display area through operation with a mouse or a finger, or through input from a non-contact device such as a hand-tracking device. The specific position may be represented visually, for example by displaying a pointer in the display area. The movement of the specific position may also be represented as a trajectory of the positions traced with the touch pen 85.

The editing unit 50 edits parameters relating to the manner of expression of the speech target based on the specified position. In this embodiment, the parameters include a first parameter that adjusts at least the speaking speed of the speech target (hereinafter, "speech rate parameter"), a second parameter that adjusts the volume, which is an element other than the speaking speed (hereinafter, "volume parameter"), and a third parameter that adjusts the pitch, which is also an element other than the speaking speed (hereinafter, "pitch parameter"). Other parameters for elements other than the speaking speed include the speaking style, such as emotion and mannerisms, the depth and period of vibrato, the speaker interpolation ratio, and a gender parameter (ranging from a young, feminine voice to an elderly, masculine voice). The parameters preferably include the speech rate parameter. The speech target may also be made changeable, even during utterance, by moving the specified position across the display areas 31A to 31C.

The control unit 60 is configured as a computer including a CPU and a memory. By executing a control program stored in the memory, the CPU controls the synthesis unit 10, the playback unit 20, the display control unit 30, the instruction input unit 40, and the editing unit 50 to carry out the speech synthesis processing described later.

FIG. 3 is a diagram showing an example of the various acoustic parameters modeled by the acoustic model. The fundamental frequency is generally handled as the logarithmic fundamental frequency pt, and its related parameters include the voiced/unvoiced distinction and the first derivative (Δpt) and second derivative (Δ²pt) of the logarithmic fundamental frequency. These are sometimes called sound source information. Unvoiced segments have no value of the logarithmic fundamental frequency pt, so the voiced/unvoiced distinction is made by a method such as assigning a predetermined constant to unvoiced segments. The spectral parameters include the mel-cepstrum ct and its first derivative (Δct) and second derivative (Δ²ct). These are sometimes called spectral information. In addition to the sound source information and spectral information, singing expression information is handled when a singing voice is synthesized.

The singing expression information includes the period V1ft and amplitude V1at of the pitch vibrato and the period V2ft and amplitude V2at of the loudness vibrato, modeled in units such as phonemes or frames. The period and amplitude of the pitch vibrato and the period and amplitude of the loudness vibrato each also have a corresponding first derivative (Δ) and second derivative (Δ²), but for convenience of illustration these derivatives are omitted from FIG. 3. Among the above parameters, the first and second derivatives of each parameter, beginning with the mel-cepstrum ct, are used to take variation over time into account. Taking these dynamic features into account makes the transitions between sounds smooth during speech synthesis. A description of speech synthesis methods using dynamic features is omitted.
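The text does not say how the first and second derivatives are computed. As an illustration only, the sketch below uses simple central differences with edge replication, one common choice in statistical parametric synthesis; the window coefficients and the function name `deltas` are assumptions, not details taken from this document.

```python
import math

def deltas(x):
    """Return (first, second) difference sequences for a parameter track.

    Assumed windows: delta = 0.5 * (x[t+1] - x[t-1]),
    delta-delta = x[t+1] - 2 * x[t] + x[t-1], with edges replicated.
    """
    n = len(x)
    d1, d2 = [], []
    for t in range(n):
        prev = x[max(t - 1, 0)]        # replicate the first frame at the start
        nxt = x[min(t + 1, n - 1)]     # replicate the last frame at the end
        d1.append(0.5 * (nxt - prev))
        d2.append(nxt - 2.0 * x[t] + prev)
    return d1, d2

# Example: a rising log fundamental frequency track over four voiced frames.
log_f0 = [math.log(f) for f in (200.0, 210.0, 220.0, 230.0)]
delta_f0, delta2_f0 = deltas(log_f0)
```

The same helper could be applied per dimension to the mel-cepstrum or to the vibrato period and amplitude tracks.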

When a DNN is used as the acoustic model, the spectrum may be modeled instead of the mel-cepstrum ct, or the speech waveform itself may be modeled in place of the above acoustic parameters.

FIG. 4 is a flowchart showing synthesized speech playback processing in a live performance using the speech synthesis apparatus 100 of this embodiment. First, the control unit 60 acquires the data to be played back (step S100). The data to be played back is text data of a speech target or score information of a song, created in advance. The data to be acquired may be specified directly by the user, or may be acquired automatically based on a set list that lists the songs to be performed and the content to be spoken in order. The processing of step S100 may be omitted, and the speech target may instead be entered on the spot using a keyboard or the like. Next, the control unit 60 controls the instruction input unit 40 to display the data acquired in step S100 (step S110).

After the data is displayed, the control unit 60 performs speech synthesis processing (step S120). The speech synthesis processing is described later. Finally, the control unit 60 determines whether the live performance has ended (step S130). The end of the live performance can be determined, for example, by whether the last song and spoken content in the set list have been played back. If it is determined that the live performance has not ended (step S130: NO), the processing returns to step S100 and the next data is acquired.
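The flow of FIG. 4 can be sketched as a loop over a set list. This is a schematic illustration only: the function name `run_live` and the callback parameters `display` and `synthesize` are hypothetical stand-ins for the display step and the speech synthesis processing, and the end condition uses the set-list example given in the text.

```python
def run_live(set_list, display, synthesize):
    """Play every set-list entry in order, following steps S100 to S130."""
    played = []
    for item in set_list:       # S100: acquire the next data to play back
        display(item)           # S110: show the speech target
        synthesize(item)        # S120: speech synthesis processing
        played.append(item)
    # S130: the loop ends once the last entry of the set list has been played.
    return played

shown, spoken = [], []
result = run_live(["greeting", "song 1"], shown.append, spoken.append)
```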

FIG. 5 is a flowchart showing the speech synthesis processing in this embodiment. The speech synthesis processing synthesizes speech in response to the user's operation of the touch pen 85. More specifically, it synthesizes speech in response to the movement of the position specified within the display area 31 by operating the touch pen 85. The instruction input unit 40 first determines whether there is an operation with the touch pen 85 (step S200). If there is no operation with the touch pen 85 (step S200: NO), the instruction input unit 40 determines whether there has been an end instruction (step S240). If there is an operation with the touch pen 85 (step S200: YES), the instruction input unit 40 acquires the movement of the specific position in the display area 31 and the coordinates of its trajectory (step S210).

The display area 31A is described as an example with reference to FIGS. 6 and 7. The trajectories are traced from left to right, in the direction of the arrows. The synthesis unit 10 generates parameters so that the utterance proceeds in the direction of the arrow of the trajectory. When the speech target is specified in the reverse direction, parameters may be generated so that the speech target is played back in reverse. As shown in FIG. 6, the trajectory L1 starts at a point about one third of the way from the left edge of the roughly rectangular region of the display area 31A in which 「こ」 ("ko") is displayed. The editing unit 50 generates parameters so that playback starts from a point about one third of the way into the predetermined utterance length of 「こ」. As shown in FIG. 7, when the instruction input unit 40 traces the trajectory L2 and then the trajectory L3, the editing unit 50 generates parameters so that 「おん」 ("on") and 「ちは」 ("chiha"), the speech targets that the instruction input unit 40 specified by tracing in one direction, are uttered.
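The FIG. 6 example, in which a trajectory that starts one third of the way across a character starts playback one third of the way into that character's utterance, amounts to a linear mapping from horizontal position to time. The sketch below assumes each character occupies a rectangular box and has a known utterance duration; the function name `start_offset` and its parameters are hypothetical.

```python
def start_offset(x: float, box_left: float, box_width: float, duration: float) -> float:
    """Map the trajectory start x to a playback offset within one character.

    box_left and box_width describe the character's box in pixels;
    duration is the predetermined utterance length in seconds.
    """
    if box_width <= 0:
        raise ValueError("box_width must be positive")
    fraction = (x - box_left) / box_width
    fraction = min(max(fraction, 0.0), 1.0)   # clamp to the character's box
    return fraction * duration

# A touch one third of the way across a 90 px box, for a 0.3 s utterance.
offset = start_offset(40.0, 10.0, 90.0, 0.3)
```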

Next, the editing unit 50 edits the parameters of the synthesized speech from the trajectory coordinates acquired in step S210 (FIG. 5, step S220). In this embodiment, the editing unit 50 determines the speech target whose parameters are to be edited according to the coordinates of the specified position, the speech rate parameter according to the moving speed of the specified position, the pitch parameter according to the movement shape of the specified position, and the volume parameter according to the strength of the contact (pressure) with which the touch pen 85 operates the touch panel 80. More specifically, when the specified position moves quickly, the speech rate parameter is set so that the utterance takes less time than when it moves slowly. When the specified position moves upward in the vertical direction of the display area 31, the pitch parameter is set so that the pitch rises, and when it moves downward, so that the pitch falls. When the pressure on the touch panel 80 is strong, the volume parameter is set so that the utterance is louder than when the pressure is weak. The assignment of parameters is not limited to this example; for instance, the editing unit 50 may determine the speech rate parameter or the volume parameter according to the movement shape of the specified position. In addition, in this embodiment, regardless of the moving speed or movement shape of the specified position, the editing unit 50 sets the pitch higher when the upper part of the display area 31 in the vertical direction is selected than when the lower part is selected.
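The mapping described for step S220 can be sketched as a function of two consecutive trajectory samples. This is an illustration under stated assumptions: the `Sample` type, the reference speed `ref_speed`, the `semitones_per_px` scale, and the convention that `y` grows upward are all hypothetical and are not given in the text.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    t: float         # time in seconds
    x: float         # horizontal position in pixels
    y: float         # vertical position in pixels, assumed to grow upward
    pressure: float  # pen pressure, assumed normalized to 0..1

def edit_parameters(prev: Sample, cur: Sample,
                    ref_speed: float = 100.0,
                    semitones_per_px: float = 0.05) -> dict:
    """Derive speech rate, pitch change, and volume from pen movement."""
    dt = cur.t - prev.t
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    speed = abs(cur.x - prev.x) / dt          # faster tracing -> faster speech
    return {
        "rate": speed / ref_speed,
        "pitch_shift": (cur.y - prev.y) * semitones_per_px,  # up raises pitch
        "volume": min(max(cur.pressure, 0.0), 1.0),          # harder -> louder
    }

params = edit_parameters(Sample(0.0, 0.0, 0.0, 0.5), Sample(0.5, 100.0, 10.0, 0.8))
```

Because all three values come from the same pair of samples, a single stroke adjusts speech rate, pitch, and volume at once.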

Next, the synthesis unit 10 synthesizes speech using the parameters edited in step S220 and plays it back (step S230). Finally, the synthesis unit 10 determines whether there has been an end instruction (step S240). The end instruction is, for example, the user pressing an end button. If there has been an end instruction (step S240: YES), the synthesis unit 10 ends the speech synthesis processing. If there has been no end instruction (step S240: NO), the synthesis unit 10 returns to the processing of step S200.
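Steps S200 to S240 form a polling loop. The sketch below models it over a pre-recorded list of events so that it can run without hardware; the event encoding (`None` for no pen operation, the string `"end"` for an end instruction) and the callbacks `edit` and `synthesize` are assumptions made for illustration.

```python
def synthesis_loop(events, edit, synthesize):
    """Process pen events until an end instruction arrives."""
    outputs = []
    for event in events:
        if event == "end":       # S240: YES, stop the processing
            break
        if event is None:        # S200: NO, nothing to synthesize this cycle
            continue
        params = edit(event)                 # S210 and S220: trajectory -> parameters
        outputs.append(synthesize(params))   # S230: synthesize and play back
    return outputs

# Toy callbacks: the "parameters" are just numbers here.
chunks = synthesis_loop([1, None, 2, "end", 3], lambda e: e * 2, lambda p: p + 1)
```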

With the speech synthesis apparatus 100 of this embodiment described above, the speech rate parameter, the pitch parameter, and the volume parameter can be edited simultaneously by tracing the speech target, so synthesized speech can be edited in real time.

In this embodiment, the synthesis unit 10 synthesizes speech using an acoustic model whose acoustic parameters have been learned by a statistical method. Synthesized speech can therefore be generated without recording speech for each speech target.

In this embodiment, the editing unit 50 determines the volume parameter according to the pressure with which the touch pen 85 operates the touch panel 80. The loudness can therefore be edited intuitively.

B. Second embodiment:
FIG. 8 is an explanatory diagram showing an overview of a speech synthesis apparatus 100A in the second embodiment. The speech synthesis apparatus 100A of the second embodiment shown in FIG. 8 differs from the first embodiment in that it includes a storage unit 90 in which an accompaniment is stored; the rest of the configuration is the same. The accompaniment stored in the storage unit 90 is read out by the playback unit 20 and output to the speaker 70. The accompaniment stored in the storage unit 90 is, for example, performance data of accompaniment sounds created in accordance with the MIDI standard.

FIG. 9 is a flowchart of the speech synthesis processing in the second embodiment. The speech synthesis processing of the second embodiment shown in FIG. 9 differs from that of the first embodiment in that it starts at the same time as playback of the accompaniment and in that the synthesized speech is a singing voice. Because the configuration of the speech synthesis apparatus of the second embodiment is the same as that of the first embodiment, its description is omitted.

The speech synthesis apparatus 100 first plays back the accompaniment stored in the storage unit 90 (step S300). The speech synthesis processing in the second embodiment may be triggered, for example, by the user pressing a play button, and the synthesized speech may be played back at the same time as the accompaniment. The playback speed of the accompaniment is preferably a predetermined playback speed, regardless of the editing of the synthesized speech described later. Alternatively, by associating the playback timing of the accompaniment with the playback timing of the speech target in advance, the playback speed and playback timing of the accompaniment may be changed in response to changes in the speech rate parameter of the synthesized speech so that the accompaniment is played back in synchronization with the synthesized speech. A video may be played back instead of the accompaniment. Next, the instruction input unit 40 determines whether there is an operation with the touch pen 85 (step S310). If there is no operation with the touch pen 85 (step S310: NO), the instruction input unit 40 determines whether there has been an end instruction (step S350). If there is an operation with the touch pen 85 (step S310: YES), the instruction input unit 40 acquires the coordinates of the specific position in the display area 31, the movement of the specific position, and the coordinates of its trajectory (step S320).

FIG. 10 shows an example of the display area 31D displayed by the display control unit 30 in the second embodiment. In this embodiment, the speech target is 「さいたさ」 ("saitasa"). The display control unit 30 of this embodiment may display a pointer P, and the instruction input unit 40 treats movement of the specific position as movement of the position of the pointer P. The touch pen 85 is a pointing device that moves the pointer P. The speech target in the display area 31D scrolls to the left (in the direction of the arrow) at a speed corresponding to the tempo as the synthesized speech is played. The reference position p0 is the playback position when playback proceeds at the predetermined tempo, and it is fixed. Alternatively, the scroll speed may be kept constant regardless of the tempo, and the width of the content displayed in the display area 31D may be changed instead.

Next, the editing unit 50 edits the parameters of the synthesized speech based on the motion and coordinates of the pointer P acquired in step S320 (FIG. 9, step S330). For example, the editing unit 50 edits the parameters in response to the following operations of the pointer P.

<Operation 1> The pointer P selects a specific position
<Operation 2> The pointer P traces the display area 31D to the left
<Operation 3> The pointer P traces the display area 31D to the right

In operation 1, the editing unit 50 sets the playback position according to where the position specified by the pointer P (hereinafter, specific position p1) lies relative to the reference position p0. More specifically, when the specific position p1 is to the right of the reference position p0, the playback position is set so as to seek ahead by an amount corresponding to the size of the offset. Playback may jump directly from the current playback location to the specific position p1, or the parameters may be set so that the span from the current playback location to the specific position p1 is interpolated and the speech target is played smoothly. The speech target between the current playback location and the specific position p1 may also be played discontinuously.
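The seek behavior of operation 1 can be illustrated with a minimal Python sketch. The function names, the `units_per_pixel` scale factor, and the use of linear interpolation are assumptions made for illustration; the specification does not state how the offset is converted into a playback position or which interpolation is used. Seeking backward when p1 lies to the left of p0 is likewise assumed by symmetry.

```python
from typing import List


def seek_target(current_pos: float, p0_x: float, p1_x: float,
                units_per_pixel: float) -> float:
    """Return the new playback position for a selection at p1_x.

    current_pos     : current playback position (e.g. in beats; assumed unit)
    p0_x, p1_x      : horizontal coordinates of reference position p0 and
                      specific position p1, in pixels
    units_per_pixel : hypothetical scale from pixels to playback units
    """
    # A positive offset (p1 right of p0) seeks ahead by a proportional amount.
    return current_pos + (p1_x - p0_x) * units_per_pixel


def interpolate_positions(current_pos: float, target_pos: float,
                          steps: int) -> List[float]:
    """Return `steps` intermediate playback positions ending at target_pos.

    Linear interpolation stands in for the "smooth" variant; the direct-jump
    variant corresponds to steps == 1.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    delta = target_pos - current_pos
    return [current_pos + delta * i / steps for i in range(1, steps + 1)]
```

For example, with p0 at x = 100 and a selection at x = 150, a scale of 0.5 units per pixel moves the playback position 25 units ahead.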

In operation 2, when the pointer P traces the display area 31D to the left more slowly than the scroll speed of the speech target, the editing unit 50 sets the speech speed parameter so that the speech target is uttered over a longer time than at the predetermined tempo. When the pointer P traces faster than the scroll speed of the speech target, the editing unit 50 sets the speech speed parameter so that playback runs in reverse. In other words, the speech speed parameter is determined by the difference between the scroll speed of the speech target and the speed at which the pointer P traces to the left.

In operation 3, when the pointer P traces the display area 31D to the right, the editing unit 50 sets the speech speed parameter so that the speech target is uttered over a shorter time than at the predetermined tempo, according to the tracing speed.
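The rules for operations 2 and 3 can be read as a single relation between the scroll speed and the pointer's horizontal velocity. The following Python sketch is one possible interpretation, not the method fixed by the specification: the function name, the sign convention for `pointer_vx`, and the normalization by the scroll speed are assumptions. A result of 1.0 stands for the predetermined tempo, values between 0 and 1 for slower speech, values above 1 for faster speech, and negative values for reverse playback.

```python
def speech_rate(scroll_speed: float, pointer_vx: float) -> float:
    """Return a relative speech rate from scroll speed and pointer velocity.

    scroll_speed : leftward scroll speed of the speech target (px/s, > 0)
    pointer_vx   : signed horizontal pointer velocity (px/s);
                   positive = rightward, negative = leftward (assumed)
    """
    if scroll_speed <= 0:
        raise ValueError("scroll_speed must be positive")
    leftward_speed = -pointer_vx
    # Difference between the scroll speed and the leftward tracing speed,
    # normalized so that a stationary pointer yields the predetermined tempo.
    return (scroll_speed - leftward_speed) / scroll_speed
```

With a scroll speed of 100 px/s, tracing left at 50 px/s gives 0.5 (slower), tracing left at 150 px/s gives -0.5 (reverse), and tracing right at 50 px/s gives 1.5 (faster).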

Subsequently, the synthesis unit 10 synthesizes the synthesized speech using the parameters edited in step S330 and plays it (step S340). The synthesis unit 10 then determines whether an end instruction has been given (step S350). The end instruction is, for example, the user pressing an end button. If there is no end instruction (step S350: NO), the synthesis unit 10 determines whether playback of the accompaniment is complete (step S360). If playback is not complete (step S360: NO), the process repeats from step S310. If playback is complete (step S360: YES), the speech synthesis process ends. Even when the synthesis unit 10 determines in step S360 that accompaniment playback is complete, if input to the instruction input unit 40 is still continuing, it may keep playing the synthesized speech until that input stops. For example, the last lyric of the synthesized speech may be sustained as a long tone.
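The control flow of steps S310 to S360 can be sketched as a polling loop. This is a simplified Python illustration under stated assumptions: the notion of a discrete "tick", the `(sample, end_requested)` tuple, and the `edit` and `synthesize` callables are hypothetical stand-ins for the instruction input unit 40, the editing unit 50, and the synthesis unit 10. Starting the accompaniment (step S300) and the optional long-tone extension after the accompaniment ends are not modeled.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple


def run_synthesis_loop(
    ticks: Iterable[Tuple[Optional[Any], bool]],
    accompaniment_ticks: int,
    edit: Callable[[Any], Any],
    synthesize: Callable[[Any], Any],
) -> List[Any]:
    """Run the loop and return the synthesized chunks in playback order.

    ticks               : per-tick (pointer_sample or None, end_requested)
    accompaniment_ticks : number of ticks until the accompaniment finishes
    edit                : maps a pointer sample to parameters (S320, S330)
    synthesize          : maps parameters to an audio chunk (S340)
    """
    played: List[Any] = []
    for n, (sample, end_requested) in enumerate(ticks, start=1):
        if sample is not None:                  # S310: YES
            params = edit(sample)               # S320, S330
            played.append(synthesize(params))   # S340
        if end_requested:                       # S350: YES
            break
        if n >= accompaniment_ticks:            # S360: YES
            break
    return played
```

Passing the units in as callables keeps the loop independent of any particular synthesis engine, which is a design choice of this sketch rather than something the specification prescribes.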

In this embodiment, while there is no input to the instruction input unit 40, the playback unit 20 plays the synthesized speech muted until input to the instruction input unit 40 resumes; alternatively, speech synthesis may be performed so that playback continues using a predetermined speech speed parameter. The synthesis unit 10 may also perform speech synthesis so that playback of the synthesized speech stops partway. For example, when there has been no input to the instruction input unit 40 for a predetermined time, the synthesis unit 10 may synthesize the speech so that it fades out before the scheduled playback is complete, and then stop playback of the synthesized speech. The playback unit 20 may also mute playback before the scheduled playback of the synthesized speech is complete. The accompaniment may continue playing, or it may be stopped or muted at the same time as the synthesized speech.
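The fade-out variant described above can be sketched as a gain curve over the time since the last input. In this Python sketch the function name, the `timeout` and `fade_duration` parameters, and the linear fade shape are all assumptions; the specification says only that the speech fades out after a predetermined time without input.

```python
def idle_gain(idle_time: float, timeout: float, fade_duration: float) -> float:
    """Return the output gain (0.0 to 1.0) for the synthesized speech.

    idle_time     : seconds since the last input to the instruction input unit
    timeout       : the predetermined time without input before fading starts
    fade_duration : length of the (assumed linear) fade-out in seconds
    """
    if fade_duration <= 0:
        raise ValueError("fade_duration must be positive")
    if idle_time <= timeout:
        return 1.0
    faded = (idle_time - timeout) / fade_duration
    # Once the gain reaches 0.0, playback of the synthesized speech can stop.
    return max(0.0, 1.0 - faded)
```

With a 1-second timeout and a 2-second fade, the gain is 1.0 at 0.5 s of idle time, 0.5 at 2 s, and 0.0 from 3 s onward.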

According to the speech synthesizer 100 of this embodiment described above, the speech speed parameter can be edited by tracing the speech target being played, so the synthesized speech can be edited and played in real time.

C. Other embodiments:
In the above embodiments, the synthesis unit 10 performs speech synthesis using an acoustic model whose acoustic parameters were learned by a statistical method. Instead, the synthesis unit 10 may perform speech synthesis using a waveform concatenation method.

In the above embodiments, the synthesis unit 10 may also store the generated parameter values and, at the next playback of the synthesized speech, perform speech synthesis and playback using the stored parameter values. Here, the "next" playback of the synthesized speech means playing the synthesized speech after the speech synthesis process has ended.

In the above embodiments, the editing unit 50 determines the parameter values included in each parameter according to the shape, speed, and pressure of the trajectory traced via the instruction input unit 40. Instead, the synthesis unit 10 may determine the parameter values included in each parameter using the tilt of the touch pen 85 or values from a previously prepared device such as an external controller.

In the above embodiments, the display control unit 30 may display a plurality of speech areas with different pitches, corresponding for example to different utterance contents or to the parts of a musical score. In this form, selecting each speech area via the instruction input unit 40 makes it possible to synthesize various voices offset from one another. An input made via the instruction input unit 40 to one speech area may also be reflected in the other instruction input units 40.

In the second embodiment, the editing unit 50 sets the playback position according to the specific position p1. Instead, the editing unit 50 may set the speech speed parameter according to the specific position p1. For example, when the specific position p1 is to the right of the reference position p0, the speech speed parameter may be set so that the speech speed increases with the size of the offset.
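This position-based variant can be sketched as a mapping from the offset between p1 and p0 to a relative speech rate. The Python sketch below is illustrative only: the function name, the linear mapping, the `gain_per_pixel` factor, and the `min_rate` clamp are assumptions, as is slowing down when p1 lies to the left of p0, which the specification does not address.

```python
def rate_from_offset(p0_x: float, p1_x: float, gain_per_pixel: float,
                     min_rate: float = 0.0) -> float:
    """Return a relative speech rate from the offset of p1 against p0.

    p0_x, p1_x     : horizontal coordinates of p0 and p1 in pixels
    gain_per_pixel : hypothetical rate increase per pixel of rightward offset
    min_rate       : lower bound applied to the result (assumed)
    """
    # 1.0 corresponds to the predetermined tempo; a rightward offset raises it.
    return max(min_rate, 1.0 + gain_per_pixel * (p1_x - p0_x))
```

For example, with a gain of 0.01 per pixel, a selection 50 pixels to the right of p0 yields a rate of about 1.5.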

In the second embodiment, the editing unit 50 sets the speech speed parameter according to the tracing speed of the pointer P; the scroll speed of the speech target may be changed along with it.

The present invention is not limited to the embodiments described above and can be realized in various configurations without departing from its spirit. For example, the technical features in the embodiments that correspond to the technical features of each aspect described in the Summary of the Invention may be replaced or combined as appropriate in order to solve the problems described above or to achieve some or all of the effects described above. Technical features that are not described as essential in this specification may be deleted as appropriate.

DESCRIPTION OF SYMBOLS: 10 ... synthesis unit, 11 ... acoustic model, 12 ... synthesis engine, 13 ... parameters, 20 ... playback unit, 30 ... display control unit, 31, 31A, 31B, 31C, 31D ... display area, 40 ... instruction input unit, 50 ... editing unit, 60 ... control unit, 70 ... speaker, 80 ... touch panel, 85 ... touch pen, 90 ... storage unit, 100, 100A ... speech synthesizer, L1, L2, L3 ... trajectory, P ... pointer, p0 ... reference position, p1 ... specific position

Claims

A speech synthesizer comprising:
a display control unit that displays a speech target of synthesized speech to be synthesized in a predetermined display area, using at least one of characters constituting the speech target and a notation for speaking the characters;
a synthesis unit that performs speech synthesis of the speech target using specified parameters;
an instruction input unit that moves a specific position in the display area, the specific position being selected by operating a device prepared in advance;
an editing unit that edits the parameters according to one or more of the coordinates, moving speed, and movement shape of the specified position, including at least the moving speed; and
a playback unit that plays back the synthesized speech, wherein
the display control unit scrolls the speech target at a predetermined scroll speed,
the parameters include a first parameter that adjusts at least the speed of speech of the speech target,
the first parameter is determined according to the difference between the scroll speed and the moving speed, and
the synthesis unit sequentially performs speech synthesis using the parameters that are acquired from the editing unit and reflect the operation of the device, and sequentially plays back the resulting synthesized speech from the playback unit.

The speech synthesizer according to claim 1,
wherein the synthesis unit performs speech synthesis using an acoustic model in which acoustic parameters of the synthesized speech have been learned by a statistical method.

The speech synthesizer according to claim 1 or 2,
wherein the display control unit is capable of displaying a pointer, and
the instruction input unit realizes the movement of the specific position as movement of the position of the pointer.

The speech synthesizer according to claim 3,
wherein the device is a pointing device that moves the pointer displayed by the display control unit.

The speech synthesizer according to any one of claims 1 to 3,
The speech synthesizer, wherein the device is a touch panel having the display area.

The speech synthesizer according to any one of claims 1 to 5,
wherein the editing unit edits, according to a pressure sensitivity of the device, a second parameter that adjusts an element other than the speed of speech of the speech target.

The speech synthesizer according to any one of claims 1 to 6,
wherein the synthesis unit performs speech synthesis in parallel with the playback unit playing an accompaniment, and sequentially plays back the resulting synthesized speech from the playback unit.

The speech synthesizer according to any one of claims 1 to 7,
wherein the display control unit displays the speech target in the predetermined display area using the characters constituting the speech target and a notation for speaking the characters.

The speech synthesizer according to any one of claims 1 to 8,
wherein the display control unit displays, in the display area, at least one of each character constituting the speech target and each notation for speaking the character, with a length corresponding to the time taken to speak it.

A speech synthesis method comprising:
a display step of displaying a speech target of synthesized speech to be synthesized in a predetermined display area, using at least one of characters constituting the speech target and a notation for speaking the characters;
a speech synthesis step of performing speech synthesis of the speech target using specified parameters;
a moving step of moving a specific position in the display area, the specific position being selected by operating a device prepared in advance;
a scrolling step of scrolling the speech target at a predetermined scroll speed;
an editing step of editing the parameters according to one or more of the coordinates, moving speed, and movement trajectory of the specified position, including at least the moving speed; and
a playback step of sequentially performing, in the speech synthesis step, speech synthesis using the parameters that are acquired in the editing step and reflect the operation of the device, and sequentially playing back the resulting synthesized speech, wherein
the parameters include a first parameter that adjusts at least the speed of speech of the speech target, and the first parameter is determined according to the difference between the scroll speed and the moving speed.