JP2012103654A - Voice synthesizer and program

Info

Publication number: JP2012103654A
Application number: JP2010266776A
Authority: JP
Inventors: Eiji Akazawa; 英治赤澤; Hidenori Kenmochi; 秀紀劔持
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2010-10-12
Filing date: 2010-11-30
Publication date: 2012-05-31
Anticipated expiration: 2030-11-30
Also published as: JP2015163982A; JP5879682B2

Abstract

PROBLEM TO BE SOLVED: To provide a technology capable of easily assigning a note sequence to a character sequence. SOLUTION: A user touches a touch screen 31 to enter a character sequence into a text box 101. A voice synthesizer 100 displays the entered character sequence on the touch screen 31 as entered character images 104. When the user traces a finger across the touch screen 31, a pitch curve 103 is displayed and the entered character sequence is pronounced from a speaker 111 in accordance with the shape of the pitch curve 103.

Description

The present invention relates to a speech synthesizer and a program.

Techniques are known in which, when the pitches and durations of a plurality of notes (hereinafter referred to as a "note string") are input as data, acoustic effects such as continuous pitch changes and vibrato are applied to the input note string in response to mouse and keyboard operations (for example, Patent Documents 1 to 4).

Patent Document 1: Japanese Patent Laid-Open No. 10-143155
Patent Document 2: Japanese Patent No. 3781167
Patent Document 3: Japanese Patent No. 3620405
Patent Document 4: Japanese Patent Laid-Open No. 2002-372972

A song associates a character string, namely the lyrics, with the note string described above. The techniques described in Patent Documents 1 to 4, however, process only the note string and do not take into account the relationship between the lyric character string and the note string.
It is an object of the present invention to provide a technique that makes it easy to assign a note string to a character string.

The present invention provides a speech synthesizer comprising: character string acquisition means for acquiring a character string composed of a plurality of characters; character string display means for causing display means to display each character constituting the acquired character string; figure display means for, when a figure in a coordinate system having a first axis representing time and a second axis representing pitch is specified by a user, causing the display means to display the figure in association with each character constituting the character string; assignment means for assigning a pitch and a duration to each character based on the coordinate values of the position, in the displayed figure, that corresponds to each character constituting the displayed character string; and speech synthesis means for synthesizing speech data that causes each character constituting the character string to be pronounced at the pitch and duration assigned by the assignment means.

In a preferred aspect, the speech synthesizer may include pronunciation length dictionary storage means that stores, for a plurality of words, either the length of the pronunciation time of each character constituting a word or the ratio of the pronunciation time of each character to the pronunciation time of the whole word. The assignment means may then assign a duration to each character based on a character-string duration, which is the duration specified by the user for pronouncing the entire character string, and on the pronunciation time lengths or ratios stored in the pronunciation length dictionary storage means for the characters constituting that character string.

In another preferred aspect, the speech synthesizer may include index display means for causing the display means to display an index that serves as a guide when a pitch or a duration is assigned to each character constituting the character string. The assignment means may then correct, in accordance with the index displayed by the index display means, the coordinate values of the position in the figure that corresponds to each character constituting the character string, and assign a pitch and a duration to each character based on the corrected coordinate values.

In yet another preferred aspect, the speech synthesizer may include acoustic effect storage means that stores, in association with each of a plurality of figure shapes, an acoustic effect to be applied when a character is pronounced. When the user specifies a figure superimposed on the figure already displayed on the display means, the figure display means may cause the display means to display the superimposed figure. The assignment means may then identify, among the figure shapes stored in the acoustic effect storage means, a shape whose similarity to the superimposed figure exceeds a threshold, and assign the acoustic effect stored in association with the identified shape to the character displayed at the position corresponding to the coordinate values of the superimposed figure.

The present invention also provides a program for causing a computer to realize: a character string acquisition function of acquiring a character string composed of a plurality of characters; a character string display function of causing display means to display each character constituting the acquired character string; a figure display function of, when a figure in a coordinate system having a first axis representing pitch and a second axis representing time is specified by a user, causing the display means to display the figure in association with each character constituting the character string; an assignment function of assigning a pitch and a duration to each character based on the coordinate values of the figure that correspond to each character constituting the displayed character string; and a speech synthesis function of synthesizing speech data that causes each character constituting the character string to be pronounced at the pitch and duration assigned by the assignment function.

According to the present invention, a note string can easily be assigned to a character string.

Block diagram showing the hardware configuration of the speech synthesizer according to the embodiment of the present invention
Diagram showing the contents of the pronunciation dictionary DB
Diagram showing the contents of the shortest pronunciation time DB
Diagram showing the contents of the voice DB
Diagram showing the contents of the acoustic effect DB
Block diagram showing the functional configuration of the speech synthesizer
Diagram showing the appearance and display contents of the speech synthesizer
Processing flow diagram of the speech synthesizer
Schematic diagram for explaining the trajectory analysis processing and the character spacing control processing
Schematic diagram for explaining the speech synthesis processing
Schematic diagram for explaining the speech synthesis processing
Schematic diagram for explaining the speech synthesis processing
Diagram showing the contents of the voice DB
Schematic diagram for explaining Modification 10
Schematic diagram explaining the correction function for pitch
Schematic diagram explaining the correction function for duration
Diagram showing the contents of the initial value pronunciation dictionary DB
Diagram showing the display contents of the speech synthesizer according to Modification 13

An embodiment of the present invention will be described below.
<Embodiment>
<Configuration>
FIG. 1 is a block diagram showing the hardware configuration of a speech synthesizer 100 according to an embodiment of the present invention. The speech synthesizer 100 includes a control unit 10, a storage unit 20, a UI (User Interface) unit 30, and an audio output unit 40, which are connected to one another via a bus. The speech synthesizer 100 synthesizes speech data based on a character string and on voice information, which is information about pronunciation that includes a note string, and outputs speech based on the synthesized speech data. In the present embodiment, the speech synthesizer 100 is a smartphone. The control unit 10 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. The CPU reads a control program stored in the ROM or the storage unit 20, loads it into the RAM, and executes it, thereby controlling each unit of the speech synthesizer 100 via the bus. The RAM also functions as a work area when the CPU processes data.

The storage unit 20 stores an application program for causing a computer to function as a speech synthesizer (hereinafter referred to as the "speech synthesis application"). When the control unit 10 executes the speech synthesis application, the functions shown in FIG. 6, described later, are realized in the speech synthesizer 100. The storage unit 20 also holds a pronunciation dictionary DB (Database) 21, a shortest pronunciation time DB 22, a voice DB 23, and an acoustic effect DB 24. The pronunciation dictionary DB 21 consists of a plurality of pronunciation records, which serve as reference data when pronunciation times are assigned to a character string input by the user. The shortest pronunciation time DB 22 consists of a plurality of shortest pronunciation time records, each of which assigns to a single character, such as "あ" (a) or "い" (i), the minimum length of time required to pronounce that character.

The voice DB 23 consists of a plurality of voice records, which are data about the speech synthesized according to the user's input. For a character string input by the user (referred to as the input character string), the voice records associate a note string and an acoustic effect with each character constituting the character string. The voice records are generated according to the user's input and registered in the voice DB 23. The acoustic effect DB 24 consists of a plurality of acoustic effect records, each of which associates a predetermined figure shape (referred to as an acoustic effect figure) with a type of acoustic effect. For example, a waveform figure whose peaks occur at intervals within a certain range is associated with the acoustic effect called vibrato.

FIG. 2 is a diagram showing the contents of the pronunciation dictionary DB 21. Each pronunciation record in the pronunciation dictionary DB 21 consists of a plurality of items: an identification ID, a character string, a character count, and reference assigned durations. The identification ID uniquely identifies each pronunciation record and is, for example, a six-digit number. The character string is a word predetermined as a target of pronunciation. The character count is the number of characters constituting the character string. The reference assigned durations are the lengths of pronunciation time assigned to the characters of the character string, listed in order from the first character. The duration of each character is determined in advance based on how long the character string takes to pronounce with natural intonation. For example, the pronunciation record with the identification ID "000001" indicates that, for the four-character string "おはよう" (ohayou, "good morning"), the characters "お" (o), "は" (ha), and "よ" (yo) are each pronounced for 0.2 seconds and the character "う" (u) is pronounced for 0.1 seconds. Because these values of 0.2 seconds and 0.1 seconds express the ratio of the pronunciation times of the characters constituting the character string, the pronunciation dictionary DB 21 is an example of pronunciation dictionary storage means that stores, for a plurality of words, the ratio of the pronunciation time of each character constituting a word to the pronunciation time of the whole word.
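The record layout above, together with the preferred aspect in which a user-specified character-string duration is divided among the characters, can be illustrated with a minimal Python sketch. The class name `PronunciationRecord`, the function `scale_durations`, and the choice of plain proportional scaling are assumptions made for illustration; the text states only that the stored values act as ratios.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PronunciationRecord:
    """One row of the pronunciation dictionary DB 21 (field names are hypothetical)."""
    record_id: str                           # e.g. a six-digit identification ID
    text: str                                # the predetermined word
    reference_durations: Tuple[float, ...]   # seconds per character, in order

    @property
    def char_count(self) -> int:
        return len(self.text)


def scale_durations(record: PronunciationRecord, total_duration: float) -> List[float]:
    """Split total_duration among the characters in proportion to the
    reference assigned durations (assumed interpretation of 'ratio')."""
    base = sum(record.reference_durations)
    return [total_duration * d / base for d in record.reference_durations]


# The example record from FIG. 2: "おはよう" with 0.2, 0.2, 0.2 and 0.1 seconds.
OHAYOU = PronunciationRecord("000001", "おはよう", (0.2, 0.2, 0.2, 0.1))

# Stretching the word to 1.4 seconds keeps the 2:2:2:1 ratio,
# giving roughly 0.4, 0.4, 0.4 and 0.2 seconds.
durations = scale_durations(OHAYOU, 1.4)
```

Storing ratios rather than absolute times lets the same record serve any total length the user draws.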

FIG. 3 is a diagram showing the contents of the shortest pronunciation time DB 22. Each shortest pronunciation time record in the shortest pronunciation time DB 22 consists of a plurality of items: an identification ID, a character, and a shortest pronunciation time. The identification ID uniquely identifies each shortest pronunciation time record and is, for example, a four-digit number. The character is, in the case of hiragana, one of the characters from "あ" (a) to "ん" (n). Characters are not limited to hiragana and may be kanji, numerals, alphabetic letters, or the like. The shortest pronunciation time is the minimum length of time required to pronounce the character, and is determined in advance based on, for example, the shortest time at which the character was experimentally found to be audible. For example, FIG. 3 indicates that the character "あ" requires a pronunciation time of at least 0.05 seconds. The shortest pronunciation time DB 22 is shortest pronunciation time storage means that stores, for a plurality of characters, the shortest duration with which each character can be pronounced. For some characters (for example, the vowels "あ", "い", "う", "え", and "お", and the moraic nasal "ん"), a shortest pronunciation time record need not be registered in the shortest pronunciation time DB 22.

FIG. 4 is a diagram showing the contents of the voice DB 23. Each voice record in the voice DB 23 consists of a plurality of items: a character order ID, an input character, a pitch, a duration, and an acoustic effect. The character order ID uniquely identifies each voice record and also indicates the order of the input characters; it is, for example, a four-digit number. The input characters are the individual characters of the character string that the user input as lyrics. The pitch is the height of the sound with which the character is pronounced and is expressed as a frequency. The duration is the time taken to pronounce the character. The acoustic effect is the type of acoustic effect applied to the character. In the following, a character to which an acoustic effect is applied may be described as having that effect "applied". For example, when the character string corresponding to the voice records shown in FIG. 4 is pronounced, the characters are pronounced in the order "こ" (ko), "ん" (n), "に" (ni), "ち" (chi), "は" (wa), according to the character order IDs. In FIG. 4, for example, the character "こ" is pronounced at a pitch of 406 Hz for 0.3 seconds with vibrato applied. Although FIG. 4 shows only the voice records for the character string "こんにちは" (konnichiwa, "hello"), the voice DB 23 in practice also contains voice records for each character of every other character string.
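As a rough illustration of the voice record layout described for FIG. 4, the following Python sketch models one record per character and recovers the playback order from the character order ID. The names `VoiceRecord` and `playback_order` are hypothetical. Only the values for "こ" (406 Hz, 0.3 seconds, vibrato) come from the text; every other pitch and duration below is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VoiceRecord:
    """One row of the voice DB 23 (field names are hypothetical)."""
    order_id: str             # unique ID that also encodes the character order
    char: str                 # one character of the input lyrics
    pitch_hz: float           # pitch expressed as a frequency
    duration_s: float         # time taken to pronounce the character
    effect: Optional[str] = None  # type of acoustic effect, if any


def playback_order(records: List[VoiceRecord]) -> List[VoiceRecord]:
    """Return the records sorted by character order ID (zero-padded strings)."""
    return sorted(records, key=lambda r: r.order_id)


# Records stored out of order on purpose; values other than "こ" are placeholders.
records = [
    VoiceRecord("0003", "に", 440.0, 0.2),
    VoiceRecord("0001", "こ", 406.0, 0.3, "vibrato"),
    VoiceRecord("0005", "は", 440.0, 0.2),
    VoiceRecord("0002", "ん", 440.0, 0.2),
    VoiceRecord("0004", "ち", 440.0, 0.2),
]

lyrics = "".join(r.char for r in playback_order(records))
```

Sorting on the ID string works here only because the IDs are zero-padded to a fixed width.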

FIG. 5 is a diagram showing the contents of the acoustic effect DB 24.
Each acoustic effect record in the acoustic effect DB 24 consists of a plurality of items: an identification ID, an acoustic effect figure, and an acoustic effect. The identification ID is a number that uniquely identifies each acoustic effect record and is, for example, a three-digit number. The acoustic effect figure is data representing the shape of a figure, and the shape differs from one acoustic effect record to another. The acoustic effect is the type of acoustic effect applied to a character when it is pronounced. For example, as shown in FIG. 5, the acoustic effect record whose identification ID is "001" and whose acoustic effect figure is a wave-shaped figure with peaks at intervals within a certain range is associated with the acoustic effect "vibrato". The acoustic effect DB 24 is acoustic effect storage means that stores, in association with each of a plurality of figure shapes, the acoustic effect to be applied when a character is pronounced.
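The lookup described in the summary, in which a drawn figure is compared with the stored acoustic effect figures and an effect is assigned only when the similarity exceeds a threshold, might be sketched as follows. The text does not say how similarity is computed. This sketch assumes a normalized correlation between resampled height profiles, and the template shape, the threshold of 0.8, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence


def resample(ys: Sequence[float], n: int) -> List[float]:
    """Linearly interpolate a height profile to n evenly spaced samples (n >= 2)."""
    if len(ys) == 1:
        return [ys[0]] * n
    out = []
    for i in range(n):
        pos = i * (len(ys) - 1) / (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(ys) - 1)
        frac = pos - lo
        out.append(ys[lo] * (1 - frac) + ys[hi] * frac)
    return out


def similarity(a: Sequence[float], b: Sequence[float], n: int = 64) -> float:
    """Pearson correlation of two profiles; insensitive to offset and scale."""
    ra, rb = resample(a, n), resample(b, n)
    ma, mb = sum(ra) / n, sum(rb) / n
    ca = [v - ma for v in ra]
    cb = [v - mb for v in rb]
    na = math.sqrt(sum(v * v for v in ca))
    nb = math.sqrt(sum(v * v for v in cb))
    if na == 0 or nb == 0:
        return 0.0  # a flat profile matches nothing
    return sum(x * y for x, y in zip(ca, cb)) / (na * nb)


def match_effect(drawn: Sequence[float],
                 templates: Dict[str, Sequence[float]],
                 threshold: float = 0.8) -> Optional[str]:
    """Return the effect whose template is most similar, if above the threshold."""
    best_name, best_score = None, threshold
    for name, shape in templates.items():
        score = similarity(drawn, shape)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical template: four wave periods stand in for the vibrato figure.
TEMPLATES = {"vibrato": [math.sin(2 * math.pi * 4 * i / 63) for i in range(64)]}

# A smaller, vertically shifted wave drawn with fewer sample points.
drawn_wave = [0.5 * math.sin(2 * math.pi * 4 * i / 39) + 3.0 for i in range(40)]
# A straight rising stroke, which should not match any effect.
drawn_ramp = [float(i) for i in range(40)]
```

Correlation is used here because it ignores where on the screen, and how tall, the user drew the wave; it does not tolerate a horizontal shift of the peaks, so a practical matcher would need something more robust.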

Returning to FIG. 1, the UI unit 30 includes buttons (not shown) and a touch screen 31. When the user operates a button or the touch screen 31, the UI unit 30 supplies a signal corresponding to the operation to the control unit 10. The control unit 10 controls the entire speech synthesizer 100 based on the received signal. The touch screen 31 has a structure in which a light-transmissive touch sensor is laminated on the screen of a display device. While viewing the image displayed on the display device, the user inputs instructions to the speech synthesizer 100 by touching the touch screen 31 with a finger, tracing a finger across it, and so on.

The audio output unit 40 includes a DAC (Digital Analog Converter), an amplifier, and a speaker. The audio output unit 40 converts the digital speech data supplied from the control unit 10 into an analog audio signal with the DAC, amplifies that signal with the amplifier, and outputs sound corresponding to the amplified analog audio signal from the speaker.

In the speech synthesizer 100, the user inputs, via the touch screen 31, a character string to be pronounced (that is, lyrics) and then inputs, in the form of a figure, voice information indicating how the character string is to be pronounced. This voice information represents the character-string duration, pitch, duration, and acoustic effect for the input character string. The character-string duration is the time taken to pronounce the entire character string and corresponds to the sum of the durations assigned to the individual characters. The speech synthesizer 100 synthesizes speech data based on the character string and the voice information, and outputs speech based on the synthesized speech data.

FIG. 6 is a block diagram showing the functional configuration of the speech synthesizer 100. Character string acquisition means 11 acquires a character string composed of a plurality of characters, input by the user via the touch screen 31, and stores it in the RAM. Reference duration specifying means 12 searches the pronunciation dictionary DB 21 using the input character string stored in the RAM and, on identifying the corresponding pronunciation record, stores that record in the RAM.

Display control means 13 controls what is displayed on the touch screen 31 in response to operations the user performs through the UI unit 30. For example, for the input character string stored in the RAM, the display control means 13 causes the touch screen 31 to display an image representing each character of the input character string (referred to as an input character image). In this way, the display control means 13 functions as character string display means 13A, which causes the touch screen 31, serving as display means, to display each character constituting the acquired character string. The display control means 13 also causes the touch screen 31 to display the trajectory of the fingertip position when the user traces a finger across the touch screen 31. As described later with reference to FIG. 7, the touch screen 31 has a display area in which a coordinate system consisting of a first axis representing time and a second axis representing pitch is set, and the user draws a trajectory in this display area by tracing a finger across the touch screen 31. This trajectory corresponds to a figure representing the voice information described above, that is, a figure representing the character-string duration, pitch, duration, and acoustic effect for the input character string. As described later with reference to FIG. 7, it is displayed on the touch screen 31 in association with the input character images. Hereinafter, this figure is referred to as a pitch curve. In this way, the display control means 13 functions as figure display means 13B, which causes the touch screen 31 to display the figure specified by the user (pitch curve 103), in a coordinate system having a first axis representing time and a second axis representing pitch, in association with the input character images 104 representing the characters of the input character string.

Character spacing control means 14 controls the spacing between the input character images 104 in response to an operation in which the user drags an input character image displayed on the touch screen 31 to change its display position. It stores the result of this control (referred to as the character spacing control content) in the RAM and also passes it to the display control means 13. Here, an input character image 104 is the image of a character displayed according to what the user has entered in the text box 101 (see FIG. 7). Dragging here means selecting an input character image 104 by touching it with a fingertip on the touch screen 31 and moving the fingertip while keeping it in contact. The character spacing control content includes numerical values representing the distances between the input character images 104 displayed on the touch screen 31. These values are always 0 or greater. When the value representing the distance between adjacent input character images (referred to as the character spacing value) in the character spacing control content is 0, the display control means 13 displays those input character images 104 on the touch screen 31 in a joined state. Here, "joined" means that the rectangle surrounding one input character image 104 touches the rectangle surrounding the adjacent input character image 104 (see FIG. 7).

When the character spacing value is not 0, the display control means 13 displays the input character images 104 in a separated state, spaced apart by a distance corresponding to the character spacing value. Here, "separated" means that the rectangle surrounding one input character image 104 is apart from the rectangle surrounding the adjacent input character image 104. In other words, the display control means 13, functioning as the character string display means 13A, changes the position of each character displayed on the touch screen 31 according to the user's instructions and displays each character at its new position (see FIG. 7). Trajectory analysis means 15 analyzes the relationship between the pitch curve described above and each input character image 104, calculates the character-string duration, pitch, duration, and acoustic effect for the input character string as the analysis result, and stores this result in the RAM. In doing so, the trajectory analysis means 15 calculates the pitch and duration for each character based on the coordinate values, in the coordinate system on the touch screen 31, of the part of the figure (pitch curve 103) that corresponds to each input character image 104.
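One way to picture the trajectory analysis is to slice the pitch curve by the horizontal extent of each input character image and summarize the slice. The passage does not give the actual calculation, so the following Python sketch is an assumption throughout: the curve is taken as a list of (x, y) screen points, each character is given a hypothetical (left, right) span on the time axis, the span width stands in for the duration, and the mean height of the curve within the span stands in for the pitch coordinate.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]   # (x, y) in screen coordinates
Span = Tuple[float, float]    # (left, right) extent of one input character image


def analyze_curve(curve: Sequence[Point],
                  char_spans: Sequence[Span]) -> List[Tuple[float, Optional[float]]]:
    """For each character span, return (width on the time axis, mean curve height).

    The mean height is None when the curve never passes over the character.
    Converting width to seconds and height to a frequency is left to the caller.
    """
    results = []
    for left, right in char_spans:
        ys = [y for x, y in curve if left <= x <= right]
        mean_y = sum(ys) / len(ys) if ys else None
        results.append((right - left, mean_y))
    return results


# A rising stroke sampled at four points, and two separated character images.
curve = [(0.0, 10.0), (5.0, 20.0), (10.0, 30.0), (15.0, 40.0)]
spans = [(0.0, 5.0), (10.0, 15.0)]
analysis = analyze_curve(curve, spans)
```

Under these assumptions, widening the gap between two character images changes which part of the curve each character samples, which is consistent with the spacing control feeding into the assignment.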

Voice record generation means 16 generates voice records using as input parameters the input character string, the pronunciation record, the character spacing control content, and the pitch curve analysis result from the trajectory analysis means 15, all stored in the RAM, together with the contents of the shortest pronunciation time DB 22. In doing so, the voice record generation means 16 assigns a pitch, a duration, and an acoustic effect to each character constituting the input character string. The voice record generation means 16 stores the generated voice records in the voice DB 23. On receiving notification from the voice record generation means 16 that processing is complete, speech synthesis means 17 synthesizes speech data based on the voice records stored in the voice DB 23 and causes the audio output unit 40 to output speech based on that speech data.

As described above, the reference duration specifying means 12, the character spacing control means 14, the trajectory analysis means 15, and the voice record generation means 16 cooperate to function as assignment means 18, which assigns a pitch and a duration to each character based on the coordinate values of the pitch curve, the figure corresponding to each character constituting the displayed character string.

<Operation>
Next, the operation of the speech synthesizer 100 will be described with reference to FIGS. 7 and 8.
FIG. 7 is a diagram showing the appearance and display contents of the speech synthesizer 100, and FIG. 8 is a processing flow diagram of the speech synthesizer 100.
As shown in FIG. 7, the speech synthesizer 100 has a housing 110, a touch screen 31, and a speaker 111. The touch screen 31 and the speaker 111 are provided on the housing 110. The touch screen 31 displays a text box 101, a pronunciation reference line 102, a pitch curve 103, input character images 104, a play button image 105, a return button image 106, and assigned character images 109. In the region of FIG. 7 where the pitch curve 103 is displayed, the X axis (first axis) represents time, which advances from the negative direction toward the positive direction. In the same region, the Y axis (second axis) represents pitch: the pitch becomes higher toward the positive direction and lower toward the negative direction. In the present embodiment, a pitch range of one octave is assigned between the minimum and maximum coordinate values on the Y axis.

The text box 101 is the area into which a character string is input. When the user touches the area of the touch screen 31 corresponding to the text box 101, a keyboard image is displayed on the touch screen 31. The user inputs a character string into the text box 101 by touching this keyboard image. The pronunciation reference line 102 is a straight line representing the reference pitch at which the input character string is pronounced. For example, when the user touches the touch screen 31 so as to trace the pronunciation reference line 102, the input character string is pronounced at the pitch predetermined for the pronunciation reference line 102 (for example, 440 Hz). As described above, the pitch curve 103 determines the character string sound length, pitch, sound length, and acoustic effect for the input character string. The figure drawn on the touch screen 31 by the user is displayed on the touch screen 31 as the pitch curve 103.

The input character images 104 are as described above. Also as described above, the user can drag an input character image 104 on the touch screen 31 to change the display position of each input character image 104, combining or separating them. When two adjacent input character images 104 are combined, the sound length of the earlier-input of the two corresponding characters becomes the shortest pronunciation time. When two adjacent input character images 104 are separated, the sound length of the earlier-input character becomes longer than in the combined case; that is, the character is pronounced with its sound prolonged. The input character images 104 are displayed directly below the pitch curve 103 (in the negative Y direction), laid out along the horizontal width (the length in the X-axis direction) of the pitch curve 103. If the user redraws the pitch curve 103 without changing the input character string, the display control means 13 repositions the input character images 104 to fit the horizontal width of the new pitch curve 103 while preserving the ratio of the X-axis intervals between them.

When the user touches the play button image 105, the input character string is pronounced from the speaker 111 in accordance with the input pitch curve 103. When the user touches the return button image 106, an input character library is displayed on the touch screen 31 in tree form. The input character library contains a plurality of combinations of a character string input by the user and the pitch curve 103 input for that character string. For example, when the user touches the return button image 106, character strings such as “こんにちは” (“hello”), “こんばんは” (“good evening”), and “おやすみなさい” (“good night”) are displayed in tree form. When the user selects one of the displayed character strings via the touch screen 31, a screen like that of FIG. 7 is displayed on the touch screen 31 for the selected character string.

The assigned character images 109 are the input character images 104 with the height in the Y-axis direction, which represents pitch, reflected in their position. As shown in FIG. 7, each assigned character image 109 has the same display position in the X-axis direction as its corresponding input character image 104. Its display position in the Y-axis direction is directly below the intersection of the pitch curve 103 with the perpendicular that extends in the positive Y direction, relative to the pronunciation reference line 102, from the center of the rectangle surrounding the corresponding input character image 104. If the user redraws the pitch curve 103 without changing the input character string, the display control means 13 changes the X-axis display position of each assigned character image 109 to fit the horizontal width (the length in the X-axis direction) of the new pitch curve 103 while preserving the ratio of the X-axis intervals between the assigned character images 109, and changes the Y-axis display position of each assigned character image 109 according to the height of the new pitch curve 103 in the Y-axis direction.

In FIG. 8, when the user inputs a character string into the text box 101 (step S1; YES), the character string acquisition means 11 stores the input character string in the RAM (step S2). For example, as shown in FIG. 7, when the user inputs the character string “こんにちは” (“konnichiwa”) into the text box 101, the character string acquisition means 11 stores the input character string “こんにちは” in the RAM. Next, the reference sound length specifying means 12 searches the pronunciation dictionary DB 21 using the input character string, identifies the corresponding pronunciation record, and stores the identified pronunciation record in the RAM (step S3). Here, the reference sound length specifying means 12 searches the pronunciation dictionary DB 21 shown in FIG. 2 using the character string “こんにちは” and, as a result, stores the pronunciation record whose identification ID is “000002” in the RAM.

Next, the display control means 13 displays the input character images 104 on the touch screen 31 based on the input character string (here, “こんにちは”) (step S4). In doing so, the display control means 13 displays the input character string in a manner based on the reference assigned sound lengths in the pronunciation record stored in the RAM. Specifically, as shown in FIG. 2, for the character string “こんにちは”, a pronunciation time of “0.2 seconds” is assigned to each of the characters “こ” (ko), “に” (ni), “ち” (chi), and “は” (ha), and a pronunciation time of “0.1 seconds” is assigned to the character “ん” (n). As described earlier, these values of 0.2 seconds and 0.1 seconds represent the ratio of the pronunciation times of the characters constituting the character string, so the display control means 13 displays adjacent input character images 104 on the touch screen 31 separated by distances corresponding to this ratio. As a result, the input character images 104 are displayed as shown in FIG. 7. If no character string matching the input character string exists in the pronunciation dictionary DB 21, so that no pronunciation record for the input character string is stored in the RAM, the display control means 13 displays the input character images 104 for the characters of the input character string at equal intervals.
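The ratio-based layout of step S4 can be sketched as follows. This is only an illustration under stated assumptions: the function name `layout_x_positions`, the idea that each character occupies a horizontal slot proportional to its reference sound length, and the pixel width are all hypothetical and are not specified in the description.

```python
def layout_x_positions(n_chars, width, ratios=None):
    """Return the left X coordinate of each input character image.

    ratios: reference assigned sound lengths from the pronunciation record,
            or None when no record was found (characters are then equally spaced).
    Assumption: each character gets a slot whose width is proportional to its ratio.
    """
    if ratios is None:
        ratios = [1.0] * n_chars          # no pronunciation record: equal intervals
    if len(ratios) != n_chars:
        raise ValueError("one ratio per character is required")
    total = sum(ratios)
    positions, x = [], 0.0
    for r in ratios:
        positions.append(x)
        x += width * r / total            # advance by this character's share of the width
    return positions


# "こんにちは" with the FIG. 2 values, laid out over a hypothetical 90-pixel width
xs = layout_x_positions(5, 90.0, [0.2, 0.1, 0.2, 0.2, 0.2])
```

With these values the slot for “ん” is half as wide as the others, which mirrors the 0.1 to 0.2 ratio in the pronunciation record.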

Next, when the display position of an input character image 104 is changed by a user operation, the character interval control means 14 controls the intervals between the input character images 104, stores the character interval control content in the RAM, and inputs the character interval control content to the display control means 13 so that it controls the display of the input character images 104 (step S5).

FIG. 9 is a schematic diagram for explaining the character interval control process.
The details of step S5 will be described with reference to FIG. 9. When the user drags the input character image 104 “ん” (n), shown with a broken-line rectangle, toward the position shown with a solid-line rectangle, the following processing is performed. The character interval control means 14 calculates the distance β between the input character image 104 “ん” and the input character image 104 “に” (ni) as the character interval value, stores the calculated value in the RAM, and inputs it to the display control means 13. The display control means 13 changes the display position of the input character image 104 “ん” based on the input character interval value; that is, it moves the input character image 104 “ん” from the position shown with the broken-line rectangle to the position shown with the solid-line rectangle. As a result, in FIG. 9, the input character images 104 “こ” (ko) and “ん” are combined, and they are separated from the input character images 104 “に”, “ち” (chi), and “は” (ha).

When the voice record generating means 16 assigns a sound length to the input character “ん” (n), it calculates a sound length corresponding to the distance β, so that a longer sound length is assigned than the one corresponding to the distance γ before the change. As a result, the input character “ん” is pronounced as a prolonged sound, like “んーーー”. Because the sound length of the whole character string is fixed as the character string sound length, when the sound length calculated for “ん” becomes longer, the sound length calculated for the input character “こ” (ko) becomes correspondingly shorter than before the images were combined. However, the sound length assigned to “こ” never falls below the shortest pronunciation time stored for “こ” in the shortest pronunciation time DB 22. That is, the voice record generating means 16 assigns to each character constituting the character string a sound length no shorter than the shortest sound length stored in the shortest pronunciation time DB 22, which serves as the shortest pronunciation time storage means.

After step S5, if the user does not touch the play button image 105 (step S6; NO) and does not draw a figure on the touch screen 31, that is, if no pitch curve 103 is input (step S7; NO), the process returns to step S4 and the above processing is repeated.

On the other hand, after step S5, if the user does not touch the play button image 105 (step S6; NO) but draws a figure on the touch screen 31, that is, if a pitch curve 103 is input (step S7; YES), the trajectory analyzing means 15 analyzes the input pitch curve 103 (step S9). Specifically, in step S9 the trajectory analyzing means 15 analyzes the relationship between the input pitch curve 103 and each input character image 104, identifies the character string sound length for the input character string and the pitch, sound length, and acoustic effect for each character constituting the input character string, and stores them in the RAM.

The processing of step S9 will now be described in more detail. First, the trajectory analyzing means 15 calculates the character string sound length, which is the sound length assigned to the pronunciation of the whole character string, according to the speed at which the user's fingertip moved when the pitch curve 103 was input (that is, the time required for the input from the start to the end of the pitch curve 103). The faster the user draws the figure on the touch screen 31, the shorter the character string sound length; the slower the user draws the trajectory, the longer the character string sound length. For example, if the trajectory analyzing means 15 determines from the drawing speed of the pitch curve 103 that the character string sound length is 3 seconds and this is assigned to the input character string, the whole character string is pronounced over 3 seconds. In the example of FIG. 7, the character string “こんにちは” (“konnichiwa”) is pronounced over 3 seconds.
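As a rough illustration of this first part of step S9, the character string sound length can be derived from the timestamps of the touch samples. The sample format `(t_seconds, x, y)` and the function name are assumptions made for this sketch, as is the choice to use the elapsed drawing time directly as the sound length; the description only states that the sound length depends on the time required for the input.

```python
def string_sound_length(stroke):
    """Return the elapsed time, in seconds, between the first and last touch sample.

    stroke: list of (t_seconds, x, y) tuples in drawing order (hypothetical format).
    Assumption: the character string sound length equals the drawing time.
    """
    if len(stroke) < 2:
        raise ValueError("a pitch curve needs at least two samples")
    return stroke[-1][0] - stroke[0][0]


# A stroke drawn between t = 10.0 s and t = 13.0 s
stroke = [(10.0, 0, 40), (11.5, 60, 75), (13.0, 120, 50)]
length = string_sound_length(stroke)  # 3.0 seconds, as in the example above
```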

Next, the trajectory analyzing means 15 obtains the pitch of each character of the input character string. Specifically, in the region where the pitch curve 103 is input, the trajectory analyzing means 15 first virtually draws, for each input character image 104, a perpendicular (called an input character line) extending in the positive Y direction, relative to the pronunciation reference line 102, from the center of the rectangle surrounding that image. Then, taking the pitch assigned to the pronunciation reference line 102 as the reference, the trajectory analyzing means 15 calculates the pitch of each character constituting the input character string according to the Y coordinate value of the intersection of the pitch curve 103 with the input character line. If the Y coordinate value of the intersection is larger than that of the pronunciation reference line 102, the pitch of the input character for that intersection is higher than the pitch assigned to the pronunciation reference line 102. Conversely, if the Y coordinate value of the intersection is smaller than that of the pronunciation reference line 102, the pitch of the input character for that intersection is lower than the pitch assigned to the pronunciation reference line 102.
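Finding the intersection of the pitch curve with an input character line can be sketched as below. The sketch assumes the pitch curve is held as a polyline of `(x, y)` points and uses linear interpolation between neighbouring points; both the representation and the function name `curve_y_at` are hypothetical, since the description does not say how the curve is stored.

```python
def curve_y_at(points, x):
    """Return the Y value where the vertical line at `x` crosses the polyline.

    points: list of (x, y) samples of the pitch curve in drawing order.
    Returns None when the line does not cross the curve at all.
    Only the first crossing is reported (an assumption for curves that double back).
    """
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 == x1:
            continue                      # vertical segment: no unique Y value
        if min(x0, x1) <= x <= max(x0, x1):
            t = (x - x0) / (x1 - x0)      # position of x within this segment, 0..1
            return y0 + t * (y1 - y0)
    return None


curve = [(0.0, 0.0), (10.0, 20.0), (20.0, 10.0)]
y_mid = curve_y_at(curve, 5.0)    # 10.0: halfway along the first segment
y_out = curve_y_at(curve, 25.0)   # None: the character line misses the curve
```

Returning `None` for a miss leaves it to the caller to decide what to do with a curve that crosses no input character line.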

Next, the trajectory analyzing means 15 obtains the sound length of each character of the input character string. Specifically, the trajectory analyzing means 15 calculates the proportion of the sound length to assign to each character by normalizing the reference assigned sound lengths in the pronunciation record stored in the RAM, taking the length of time of the character string sound length as 1.

The details of step S9 will now be described with reference to FIG. 9. FIG. 9 shows the touch screen 31 enlarged, with part of the display contents omitted for convenience of explanation. Although FIG. 9 shows input character lines 107, these are not actually displayed on the touch screen 31. The intersections A, B, C, D, and E are the points where the input character lines 107 cross the pitch curve 103, and each has an X coordinate value and a Y coordinate value. For example, when the trajectory analyzing means 15 calculates the pitch of the input character “こ” (ko), the Y coordinate value of the intersection A is smaller than that of the pronunciation reference line 102, so the trajectory analyzing means 15 obtains the difference length α by subtracting the Y coordinate value of the intersection A from the Y coordinate value of the pronunciation reference line 102. Taking the Y coordinate value of the pronunciation reference line 102 as the reference, the trajectory analyzing means 15 then calculates, for the input character “こ”, a pitch that is lower than the pitch of the pronunciation reference line 102 (here, for example, 440 Hz) by the amount corresponding to the difference length α.

On the other hand, when the trajectory analyzing means 15 calculates the pitch of the input character “に” (ni), for example, the Y coordinate value of the intersection C is larger than that of the pronunciation reference line 102, so the trajectory analyzing means 15 obtains the difference length α' by subtracting the Y coordinate value of the pronunciation reference line 102 from the Y coordinate value of the intersection C. Taking the Y coordinate value of the pronunciation reference line 102 as the reference, the trajectory analyzing means 15 then calculates, for the input character “に”, a pitch that is higher than the pitch of the pronunciation reference line 102 by the amount corresponding to the difference length α'.
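One possible way to turn the difference lengths α and α' into pitches is sketched below. The description states only that one octave is assigned between the minimum and maximum Y coordinate values and that the reference line has a predetermined pitch such as 440 Hz; the exponential mapping used here, in which a Y offset equal to the full axis height corresponds to one octave, is an assumption of this sketch, and `pitch_hz` is a hypothetical name.

```python
def pitch_hz(y, y_ref, y_min, y_max, ref_hz=440.0):
    """Map the Y coordinate of an intersection to a frequency in Hz.

    Assumption: pitch varies exponentially with Y, and the full height
    (y_max - y_min) of the drawing region spans exactly one octave.
    Points above the reference line give higher pitches, points below give lower ones.
    """
    height = y_max - y_min
    if height <= 0:
        raise ValueError("y_max must be greater than y_min")
    octaves = (y - y_ref) / height        # signed difference length, in octaves
    return ref_hz * 2.0 ** octaves


# Reference line in the middle of a region whose Y values run from 0 to 100
on_line = pitch_hz(50.0, 50.0, 0.0, 100.0)   # 440.0 Hz
below = pitch_hz(30.0, 50.0, 0.0, 100.0)     # lower than 440 Hz, like intersection A
above = pitch_hz(70.0, 50.0, 0.0, 100.0)     # higher than 440 Hz, like intersection C
```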

Next, as described above, the sound length of each character constituting the character string is determined from the character string sound length and the ratio of the sound lengths of the characters based on the content of the pronunciation dictionary DB 21. Returning to the example of FIG. 7, the input character string is “こんにちは”, the character string sound length is “3 seconds”, and, as shown in FIG. 2, the sound length of each character is “0.2 seconds” for “こ” (ko), “に” (ni), “ち” (chi), and “は” (ha) and “0.1 seconds” for “ん” (n). The reference values sum to 0.9, so scaling them to a total of 3 seconds gives 3 × 0.2 / 0.9 ≈ 0.67 and 3 × 0.1 / 0.9 ≈ 0.33. The trajectory analyzing means 15 therefore normalizes against the character string sound length of “3 seconds” and assigns a sound length of “0.67 seconds” to each of “こ”, “に”, “ち”, and “は” and “0.33 seconds” to “ん” (rounded at the third decimal place). The decimal place at which rounding is performed may be changed as appropriate in the design. In this way, the trajectory analyzing means 15 calculates the character string sound length, which is the sound length for pronouncing the whole character string, based on the time required for the input from the start to the end of the figure, and assigns a sound length to each character based on the calculated character string sound length and the ratio stored in the pronunciation dictionary storage means for each character constituting the character string.
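The normalization in this example can be reproduced with a few lines of Python. The function name `assign_sound_lengths` is hypothetical. Note that Python's built-in `round` rounds exact ties to the nearest even digit rather than always rounding half up; that difference does not affect the values in this example.

```python
def assign_sound_lengths(reference_lengths, string_sound_length, ndigits=2):
    """Scale the reference sound lengths so that they fill the string sound length.

    reference_lengths: per-character values from the pronunciation record (used as a ratio).
    string_sound_length: total duration derived from the pitch curve, in seconds.
    """
    total = sum(reference_lengths)
    if total <= 0:
        raise ValueError("reference lengths must sum to a positive value")
    return [round(string_sound_length * r / total, ndigits) for r in reference_lengths]


# "こ", "ん", "に", "ち", "は" with the FIG. 2 values and a 3-second pitch curve
lengths = assign_sound_lengths([0.2, 0.1, 0.2, 0.2, 0.2], 3.0)
# -> [0.67, 0.33, 0.67, 0.67, 0.67]
```

Because each value is rounded independently, the rounded lengths may not add up to exactly 3 seconds; how any such remainder is handled is not stated in the description.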

Now suppose the user has changed the display position of the input character image 104 “ん” (n) from that of FIG. 7 to that of FIG. 9. The trajectory analyzing means 15 acquires the X coordinate value of the intersection B for the input character image 104 “ん” after the change. It also acquires the X coordinate value of the intersection C for the input character image 104 “に” (ni), which is adjacent to “ん” in the positive X direction. The trajectory analyzing means 15 then calculates the difference between the X coordinate values of the intersections B and C by subtracting the X coordinate value of the intersection C from that of the intersection B. Based on the calculated difference, the trajectory analyzing means 15 assigns to the input character “ん” a longer sound length than before its display position was changed. As a result, as described above, “ん” is pronounced as a prolonged sound. Meanwhile, unless the pitch curve 103 is changed, the character string sound length remains the one based on the already input pitch curve 103, so as the sound length of “ん” becomes longer, the sound length of the input character “こ” (ko) becomes shorter.
The sound length assigned to “こ” is shortened by exactly the amount by which the sound length of “ん” has increased. Even so, “こ” is assigned at least the shortest pronunciation time stored for it in the shortest pronunciation time DB 22.
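The trade-off between the lengthened character and its predecessor can be sketched as follows. The description does not say what happens when the requested lengthening would push the preceding character below its shortest pronunciation time; this sketch assumes the lengthening is simply capped in that case. The function name `stretch` and the way the extra time is supplied are hypothetical.

```python
def stretch(lengths, minimums, index, extra):
    """Lengthen lengths[index] by up to `extra` seconds, taken from the preceding character.

    lengths:  current sound length of each character, in seconds.
    minimums: shortest pronunciation time of each character, in seconds.
    Assumption: the transfer is capped so the preceding character keeps its minimum.
    The total (the character string sound length) is unchanged.
    """
    if index < 1:
        raise ValueError("the first character has no preceding character")
    donor = index - 1
    available = max(lengths[donor] - minimums[donor], 0.0)
    moved = min(extra, available)
    result = list(lengths)
    result[index] += moved
    result[donor] -= moved
    return result


lengths = [0.75, 0.25, 0.5]       # illustrative values, e.g. for "こ", "ん", "に"
minimums = [0.25, 0.125, 0.25]
stretched = stretch(lengths, minimums, 1, 0.25)   # -> [0.5, 0.5, 0.5]
capped = stretch(lengths, minimums, 1, 1.0)       # -> [0.25, 0.75, 0.5]
```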

Further, when the user draws another figure at a position overlapping the already drawn figure (pitch curve 103), the trajectory analyzing means 15 identifies, from the acoustic effect DB 24, which stores the shapes of a plurality of figures, an acoustic effect figure whose similarity to the overlapping figure is equal to or greater than a predetermined threshold. The trajectory analyzing means 15 then stores in the RAM the type of acoustic effect associated with the identified acoustic effect figure as the type of acoustic effect for the character displayed at the position corresponding to the Y coordinate value of the overlapping figure. Any known method may be used to obtain the similarity between figures. At this time, the display control means 13 also displays on the touch screen 31 the figure that the user drew over the already input pitch curve 103.

After step S9, the display control means 13, functioning as the figure display means 13B, displays the pitch curve 103 on the touch screen 31 in association with the input character images 104 (step S10). Next, the voice record generating means 16 generates a voice record by assigning a pitch, a sound length, and an acoustic effect to each character constituting the input character string, based on the input character string, the pronunciation record, the character interval control content, and the analysis results for the pitch curve 103 stored in the RAM and on the shortest pronunciation time DB 22, and registers the generated voice record in the voice DB 23 (step S11). In doing so, the voice record generating means 16 assigns a sound length to each character, based on the proportion of the sound length assigned to each character, so that the sound length of the whole character string equals the character string sound length. As described above, the sound length assigned to each character never falls below the shortest pronunciation time stored for that character in the shortest pronunciation time DB 22.
If the calculated character string sound length is less than the total time that would result from assigning the shortest pronunciation time to every character of the input character string, the voice record generating means 16 accumulates the shortest pronunciation times of the characters from the beginning, in the order in which the characters were input. Once the accumulated result exceeds the calculated character string sound length, the voice record generating means 16 excludes the subsequent characters from pronunciation and assigns them no sound length.
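The accumulation rule for an overly short character string sound length can be sketched as follows. The description leaves one boundary detail open, namely whether the character whose shortest pronunciation time pushes the running total past the limit is itself still pronounced. This sketch assumes it is not, so the pronounced characters always fit within the character string sound length. The function name is hypothetical.

```python
def pronounceable_count(minimums, string_sound_length):
    """Return how many leading characters can be pronounced at their shortest times.

    minimums: shortest pronunciation time of each character, in input order.
    Assumption: a character is dropped as soon as adding its shortest time would
    exceed the character string sound length; all later characters are dropped too.
    """
    running = 0.0
    count = 0
    for m in minimums:
        if running + m > string_sound_length:
            break                         # this and all later characters get no sound length
        running += m
        count += 1
    return count


minimums = [0.25, 0.25, 0.25, 0.25]
kept_short = pronounceable_count(minimums, 0.6)   # 2: only the first two characters fit
kept_all = pronounceable_count(minimums, 1.0)     # 4: every character fits
```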

The voice synthesizing means 17 then synthesizes voice data based on the content of the voice record stored in the voice DB 23 (step S12). That is, the voice synthesizing means 17 synthesizes voice data that causes each character constituting the character string to be pronounced at the pitch and sound length assigned by the assigning means 18. When synthesizing the voice data, the voice synthesizing means 17 connects the pitch assigned to a given character and the pitch assigned to the character input next by means of a pitch bend. The voice synthesizing means 17 also synthesizes the voice data for each applicable character with the assigned acoustic effect reflected. The voice output unit 40 outputs voice based on this voice data (step S13).
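The description does not specify the shape of the pitch bend that joins two adjacent characters. As one illustrative possibility only, the sketch below glides between the two pitches in equal musical steps (geometric interpolation of frequency); the function name `pitch_glide` and the number of steps are assumptions.

```python
def pitch_glide(from_hz, to_hz, steps):
    """Return steps + 1 frequencies moving smoothly from from_hz to to_hz.

    Assumption: the glide is uniform on a logarithmic frequency scale, so each
    step changes the pitch by the same musical interval.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if from_hz <= 0 or to_hz <= 0:
        raise ValueError("frequencies must be positive")
    ratio = to_hz / from_hz
    return [from_hz * ratio ** (i / steps) for i in range(steps + 1)]


# Glide from one character at 440 Hz up to the next at 880 Hz in two steps
glide = pitch_glide(440.0, 880.0, 2)   # starts at 440.0, ends at 880.0
```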

On the other hand, if the user touches the play button image 105 after step S5 (step S6; YES), that is, if no pitch curve 103 has been input (steps S9 to S11 are not performed), the voice record generating means 16 registers the input character string stored in the RAM in the voice DB 23. The voice record generating means 16 then updates the item “sound length” in the voice DB 23 according to the reference assigned sound lengths in the pronunciation record stored in the RAM, and updates the item “pitch” in the voice DB 23 with the same pitch as the pronunciation reference line 102 (step S8). For example, since the input character string in the example of FIG. 7 is “こんにちは”, the voice record generating means 16 updates the item “sound length” in the voice DB 23 according to the content shown in FIG. 2, with “0.2 seconds” for the characters “こ” (ko), “に” (ni), “ち” (chi), and “は” (ha) and “0.1 seconds” for the character “ん” (n). Likewise, since the pitch of the pronunciation reference line 102 in the example of FIG. 7 is “440 Hz”, the voice record generating means 16 updates the item “pitch” with “440 Hz” for each of “こ”, “ん”, “に”, “ち”, and “は”. The process then proceeds to step S12.
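The fallback of step S8 amounts to filling every character with its reference sound length and the pitch of the reference line. A minimal sketch follows; the dictionary keys and the function name are hypothetical and do not reflect the actual layout of the voice DB 23.

```python
def default_voice_record(chars, reference_lengths, reference_pitch_hz=440.0):
    """Build per-character entries for the case where no pitch curve was drawn.

    Each character keeps its reference assigned sound length and receives the
    pitch of the pronunciation reference line.
    """
    if len(chars) != len(reference_lengths):
        raise ValueError("one reference length per character is required")
    return [
        {"char": c, "sound_length": length, "pitch_hz": reference_pitch_hz}
        for c, length in zip(chars, reference_lengths)
    ]


record = default_voice_record("こんにちは", [0.2, 0.1, 0.2, 0.2, 0.2])
```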

図１０〜図１２は、音声合成処理を説明するための模式図である。
なお、音声合成装置１００には、ピッチカーブ１０３が描かれた際の処理を設定するモードである「描画モード」が存在する。「描画モード」には、初期設定で設定されている「上書き描画モード」と、「連続描画モード」とがある。「上書き描画モード」は、タッチパネル３１にピッチカーブ１０３が既に描かれ、表示された状態において、利用者が新たにピッチカーブ１０３を描くと、既に描かれていたピッチカーブ１０３が消去され、新たに描かれた際の軌跡に応じてピッチカーブ１０３が表示されるモードである。一方、「連続描画モード」は、タッチパネル３１にピッチカーブ１０３が既に描かれ、表示された状態において、既に描かれたピッチカーブ１０３と重ならないタッチパネル３１の領域に利用者が新たにピッチカーブ１０３を描くと、既に描かれていたピッチカーブ１０３は変更されないまま、新たに描かれた際の軌跡に応じてピッチカーブ１０３が追加で表示されるモードである。この描画モードは、利用者が、ＵＩ部３０を通じて適宜変更することが可能である。描画モードは、「上書き描画モード」、及び「連続描画モード」に限らず、他の設定内容を選択可能としてもよい。 FIGS. 10 to 12 are schematic diagrams for explaining the speech synthesis process.
Note that the speech synthesizer 100 has a “drawing mode”, which determines how a newly drawn pitch curve 103 is processed. The “drawing mode” is either the “overwrite drawing mode”, which is the initial setting, or the “continuous drawing mode”. In the “overwrite drawing mode”, when a pitch curve 103 is already drawn and displayed on the touch panel 31 and the user draws a new pitch curve 103, the existing pitch curve 103 is erased and a pitch curve 103 is displayed according to the newly drawn trajectory. In the “continuous drawing mode”, on the other hand, when a pitch curve 103 is already drawn and displayed on the touch panel 31 and the user draws a new pitch curve 103 in an area of the touch panel 31 that does not overlap the existing pitch curve 103, the existing pitch curve 103 is left unchanged and a pitch curve 103 following the newly drawn trajectory is displayed in addition. The user can change the drawing mode as appropriate through the UI unit 30. The drawing mode is not limited to the “overwrite drawing mode” and the “continuous drawing mode”; other settings may also be selectable.

図１０〜図１２において、「お」、「は」、「よ」、及び「う」という入力文字画像１０４が表示されている。図１０、及び図１１は、描画モードが「連続描画モード」である場合の例である。図１０は、入力文字画像群１０８Ａにおいて全ての入力文字画像１０４が結合された場合を示している。図１０では、入力されたピッチカーブ１０３ａに応じて、「おはよう」という音声が出力される。図１０において右側に表示されたピッチカーブ１０３ｂは、入力文字線と交差しないため、何も処理が行われない。あるいは、このようなピッチカーブ１０３ｂは、入力文字列に応じた音声の出力を繰り返すように、ＵＩ部３０を通じて利用者が設定可能としてもよい。上述した、入力文字列に応じた音声の出力を繰り返す場合、軌跡分析手段１５は、入力文字列の繰り返しを意味するフラグを立てた状態（例えば「１」の値）でＲＡＭに記憶する。音声レコード生成手段１６は、このフラグを参照し、フラグが入力文字列を繰り返すことを意味する値（ここでは「１」）を取っている場合、登録済みの音声レコードと同一の内容を、音声ＤＢ２３に追加で登録する。あるいは、音声レコード生成手段１６が、予め定められた、入力文字列を繰り返して発音させることを表す記号（例えば「＊」）を、音声ＤＢ２３における項目「入力文字」に対して追加で登録することで、音声合成手段１７が、入力文字列が繰り返し発音されるように音声データを合成するようにしてもよい。 In FIGS. 10 to 12, input character images 104 of “o”, “ha”, “yo”, and “u” are displayed. FIGS. 10 and 11 show examples in which the drawing mode is the “continuous drawing mode”. FIG. 10 shows a case where all the input character images 104 are combined in the input character image group 108A. In FIG. 10, the voice “ohayō” (“good morning”) is output according to the input pitch curve 103a. Since the pitch curve 103b displayed on the right side in FIG. 10 does not intersect any input character line, no processing is performed for it. Alternatively, the user may be able to set, through the UI unit 30, such a pitch curve 103b to repeat the output of the voice corresponding to the input character string. When the output is repeated in this way, the trajectory analyzing means 15 stores in the RAM a flag indicating repetition of the input character string in the set state (for example, the value “1”). The voice record generating means 16 refers to this flag and, when the flag has the value that means the input character string is repeated (here, “1”), additionally registers the same contents as the already registered voice records in the voice DB 23. Alternatively, the voice record generating means 16 may additionally register a predetermined symbol (for example, “*”) indicating that the input character string is to be pronounced repeatedly in the item “input character” of the voice DB 23, so that the voice synthesizing means 17 synthesizes voice data in which the input character string is pronounced repeatedly.

図１１では、「お」及び「は」が結合された入力文字画像群１０８Ｂと、「よ」及び「う」が結合された入力文字画像群１０８Ｃとが分離されている。また、入力文字画像群１０８Ｂに対するピッチカーブ１０３ｃと、入力文字画像群１０８Ｃに対するピッチカーブ１０３ｄとは連続しておらず、軌跡が途切れた状態となっている。このような場合、入力されたピッチカーブ１０３に応じて音声が出力される際には、軌跡の途切れに応じて、「おは□□よう」というように、「おは」という音声と「よう」という音声との間に無音の期間（無音期間という）が生じる（ここで□は無音期間を意味する）。具体的には、軌跡分析手段１５は、軌跡が途切れた箇所におけるＸ軸方向の長さに基づいて、この無音期間の長さを算出する。そして軌跡分析手段１５は、算出した無音期間の時間の長さと、無音期間を挟む２つの入力文字（すなわちここでは、「は」と「よ」）とを対応付けて、ＲＡＭに記憶する。音声レコード生成手段１６は、ＲＡＭに記憶されたこの内容に従って、無音期間に相当する音声レコードを生成する。図１３は、音声ＤＢの内容を表す図である。ここで、文字順ＩＤが「０００３」の音声レコードが、無音期間に該当する音声レコードである。無音期間に該当する音声レコードを区別するには、例えば入力文字としてスペースを割り当ててもよい。図１３においては便宜上、スペースを「△」で表している。図１３における音声レコードに基づいて音声合成手段１７が合成した音声データが、音声として出力される際には、「おは」と発音された後に０．５秒の無音期間が生じ、次いで「よう」と発音されることとなる。 In FIG. 11, an input character image group 108B in which “o” and “ha” are combined and an input character image group 108C in which “yo” and “u” are combined are separated. Further, the pitch curve 103c for the input character image group 108B and the pitch curve 103d for the input character image group 108C are not continuous, and the trajectory is interrupted. In such a case, when voice is output according to the input pitch curve 103, a period of silence (referred to as a silent period) occurs between the voice “oha” and the voice “yō” in accordance with the break in the trajectory, as in “oha□□yō” (where □ denotes the silent period). Specifically, the trajectory analyzing means 15 calculates the length of the silent period based on the length in the X-axis direction of the portion where the trajectory is interrupted. The trajectory analyzing means 15 then stores the calculated duration of the silent period in the RAM in association with the two input characters on either side of the silent period (here, “ha” and “yo”). The voice record generating means 16 generates a voice record corresponding to the silent period according to the contents stored in the RAM. FIG. 13 is a diagram showing the contents of the voice DB. Here, the voice record whose character order ID is “0003” is the voice record corresponding to the silent period. To distinguish a voice record corresponding to a silent period, a space may be assigned as its input character, for example. In FIG. 13, the space is represented by “△” for convenience. When the voice data synthesized by the voice synthesizing means 17 based on the voice records in FIG. 13 is output as voice, “oha” is pronounced, a silent period of 0.5 seconds follows, and then “yō” is pronounced.
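
The silent-period handling described above can be sketched in Python as follows. The stroke structure, the field names, and the conversion factor `seconds_per_px` are assumptions for illustration; the text only states that the length of the silent period is derived from the X-axis length of the gap and that a space may be stored as the input character.

```python
SILENCE_CHAR = " "  # a space marks a silent record, as suggested in the text


def records_with_silence(strokes, seconds_per_px):
    """Insert a silent record wherever the drawn trajectory is interrupted.

    `strokes` is a left-to-right list of dicts with the X extent of each
    continuous curve and the (character, sound length) pairs it covers.
    """
    records = []
    prev_end = None
    for stroke in strokes:
        if prev_end is not None:
            gap_px = stroke["x_start"] - prev_end
            if gap_px > 0:
                records.append({"char": SILENCE_CHAR,
                                "length_s": gap_px * seconds_per_px})
        for ch, length_s in stroke["chars"]:
            records.append({"char": ch, "length_s": length_s})
        prev_end = stroke["x_end"]
    # Number the records in pronunciation order (hypothetical ID format).
    for order, rec in enumerate(records, start=1):
        rec["order_id"] = f"{order:04d}"
    return records
```

For two strokes covering “おは” and “よう” separated by a 50-pixel gap at 0.01 seconds per pixel, the third record (order ID “0003”) is a 0.5-second silence, matching the example of FIG. 13.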

図１２は、描画モードが「上書き描画モード」である場合の例である。図１２では、「お」及び「は」が結合された入力文字画像群１０８Ｄと、「よ」及び「う」が結合された入力文字画像群１０８Ｅとが分離されている。また、入力文字画像群１０８Ｄに対するピッチカーブ１０３ｅと、入力文字画像群１０８Ｅに対するピッチカーブ１０３ｆとは、途切れずに連続した軌跡を描いている。さらにピッチカーブ１０３ｅおよびピッチカーブ１０３ｆに対して、ビブラートの音響効果が適用される波型の図形が重ね合わせて描かれている。このように、表示制御手段１３は、表示手段であるタッチスクリーン３１に表示されている図形（すなわちピッチカーブ１０３）に対して、重ね合わせる図形が利用者によって指定されると、この重ね合わせる図形をタッチスクリーン３１に表示させる。この波型の図形のＹ座標値は、「お」、「は」、「よ」、「う」のそれぞれの表示位置に対応している。従って、図１２では、入力されたピッチカーブ１０３に応じて、「お〜は〜〜〜よ〜う〜」というように、「は」に対する発音が延ばされた状態で、各文字に対する発音にビブラートがかかった状態で音声が出力される。 FIG. 12 shows an example in which the drawing mode is the “overwrite drawing mode”. In FIG. 12, an input character image group 108D in which “o” and “ha” are combined and an input character image group 108E in which “yo” and “u” are combined are separated. Further, the pitch curve 103e for the input character image group 108D and the pitch curve 103f for the input character image group 108E form one continuous trajectory without interruption. In addition, a wave-shaped figure, to which the vibrato acoustic effect is assigned, is drawn superimposed on the pitch curve 103e and the pitch curve 103f. In this way, when the user designates a figure to be superimposed on the figure (that is, the pitch curve 103) displayed on the touch screen 31, which is the display means, the display control means 13 displays the superimposed figure on the touch screen 31. The Y coordinate values of this wave-shaped figure correspond to the respective display positions of “o”, “ha”, “yo”, and “u”. Accordingly, in FIG. 12, voice is output according to the input pitch curve 103 as “o~ ha~~~ yo~ u~”, with the pronunciation of “ha” prolonged and with vibrato applied to the pronunciation of each character.

このように、音声合成装置１００によれば、文字列が入力された後に、利用者が軌跡を描くようにタッチスクリーン３１に触れることで、文字列に対して音符列の割り当てを容易に行うことが可能となる。 As described above, according to the speech synthesizer 100, after a character string is input, the user touches the touch screen 31 so as to draw a trajectory, thereby easily assigning a note string to the character string. Is possible.

以上の実施形態は次のように変形可能である。尚、以下の変形例は適宜組み合わせて実施しても良い。
＜変形例１＞
実施形態においては、音声合成装置１００の例としてタッチスクリーン３１を備えたスマートフォンを挙げていたが、これに限ったものではない。音声合成装置１００は、タッチスクリーン３１を備えていなくてもよい。例えば、音声合成装置１００は、ＵＩ部３０としてマウス、キーパッド、またはペンタブレットを有していてもよい。また、音声合成装置１００は、ＰＤＡ（Personal Digital Assistant）、携帯ゲーム機、携帯音楽プレーヤ、あるいはＰＣ（Personal Computer）であってもよい。音声合成装置１００がＰＣである場合、ディスプレイ上に表示された内容に対して利用者がマウスを用いて描いた結果が、ピッチカーブ１０３として認識されたり、入力文字画像１０４に対する間隔制御として認識されたりする。 The above embodiment can be modified as follows. In addition, you may implement the following modifications suitably combining.
<Modification 1>
In the embodiment, a smartphone provided with the touch screen 31 is given as an example of the speech synthesizer 100, but the speech synthesizer 100 is not limited to this. The speech synthesizer 100 need not include the touch screen 31. For example, the speech synthesizer 100 may include a mouse, a keypad, or a pen tablet as the UI unit 30. The speech synthesizer 100 may also be a PDA (Personal Digital Assistant), a portable game machine, a portable music player, or a PC (Personal Computer). When the speech synthesizer 100 is a PC, what the user draws with the mouse on the content shown on the display is recognized as the pitch curve 103, or as spacing control for the input character images 104.

＜変形例２＞
実施形態においては、音響効果の例として、ビブラートを挙げたが、これに限ったものではない。例えば、図９に表されるように、入力済みのピッチカーブ１０３に対して丸で囲むような軌跡を重ね合わせて描くと、対応する文字の発音がファルセットでなされるようにしてもよい。この他にも、ピッチカーブ１０３の描き方や重ね合わせて描く図形の形状に応じて、対応する文字に様々な音響効果を割り当てるようにしてもよい。 <Modification 2>
In the embodiment, vibrato is given as an example of the acoustic effect, but the acoustic effect is not limited to this. For example, as shown in FIG. 9, when a trajectory that encircles part of the already input pitch curve 103 is drawn over it, the corresponding character may be pronounced in falsetto. In addition, various acoustic effects may be assigned to the corresponding characters according to how the pitch curve 103 is drawn or the shape of the figure drawn over it.

＜変形例３＞
実施形態において、ピッチカーブ１０３は利用者により入力されるものとしていたが、これに限らず、特定の形状を持つ複数のピッチカーブ１０３をプリセットデータとして記憶部２０が記憶していてもよい。例えば、標準語、関西弁、東北弁といった方言の抑揚に対応したピッチカーブ１０３がプリセットデータとして記憶部２０に記憶されている場合、利用者がＵＩ部３０を通じて、このプリセットデータから特定のピッチカーブ１０３を指定できるようにしてもよい。要するに、利用者が、音高を表す第１軸および時間を表す第２軸を有する座標系における図形を指定できればよい。 <Modification 3>
In the embodiment, the pitch curve 103 is input by the user. However, the present invention is not limited to this, and the storage unit 20 may store a plurality of pitch curves 103 having specific shapes as preset data. For example, when pitch curves 103 corresponding to the intonation of speech varieties such as standard Japanese, the Kansai dialect, and the Tohoku dialect are stored in the storage unit 20 as preset data, the user may be allowed to designate a specific pitch curve 103 from the preset data through the UI unit 30. In short, it suffices that the user can designate a figure in a coordinate system having a first axis representing pitch and a second axis representing time.

＜変形例４＞
実施形態においては、Ｙ軸における最小の座標値と最大の座標値との間に、１オクターブの音高が割り当てられているものとしたが、この音高は、これに限ったものではない。例えば、利用者がＵＩ部３０を介して設定することにより、Ｙ軸に割り当てられる音高の幅を狭く、あるいは広く、変更することが可能としてもよい。例えば、利用者が、Ｙ軸における最小の座標値と最大の座標値との間に、２オクターブの音高を設定した場合を考える。また、このとき、発音基準線１０２の音高が「２６１Ｈｚ」であったとする。この場合、発音基準線１０２の音高を中心として、発音基準線１０２のＹ軸正方向には、「２６１Ｈｚ」より１オクターブ高い「５２３Ｈｚ」の音高を持つ仮想的な発音基準線が存在する。また、発音基準線１０２のＹ軸負方向には、「２６１Ｈｚ」より１オクターブ低い「１３０Ｈｚ」の音高を持つ仮想的な発音基準線が存在する。実施形態においては、軌跡分析手段１５は、常に発音基準線１０２のＹ座標値と交差点のＹ座標値との差分長から、ある入力文字についての音高を算出していた。しかし、上述のように、タッチパネル３１に表示される発音基準線１０２以外に、仮想的な発音基準線が存在する場合は、Ｙ軸方向において交差点と最も近い発音基準線のＹ座標値を基準として、入力文字の音高を算出するようにしてもよい。 <Modification 4>
In the embodiment, a pitch range of one octave is assigned between the minimum and maximum coordinate values on the Y axis, but the range is not limited to this. For example, the user may be able to narrow or widen the pitch range assigned to the Y axis by a setting made via the UI unit 30. Consider a case where the user sets a pitch range of two octaves between the minimum and maximum coordinate values on the Y axis, and assume that the pitch of the pronunciation reference line 102 is “261 Hz”. In this case, centered on the pitch of the pronunciation reference line 102, a virtual pronunciation reference line having a pitch of “523 Hz”, one octave above “261 Hz”, exists in the positive Y-axis direction from the pronunciation reference line 102. Likewise, a virtual pronunciation reference line having a pitch of “130 Hz”, one octave below “261 Hz”, exists in the negative Y-axis direction from the pronunciation reference line 102. In the embodiment, the trajectory analyzing means 15 always calculates the pitch for an input character from the difference between the Y coordinate value of the pronunciation reference line 102 and the Y coordinate value of the intersection. However, when virtual pronunciation reference lines exist in addition to the pronunciation reference line 102 displayed on the touch panel 31, as described above, the pitch of the input character may instead be calculated with reference to the Y coordinate value of the pronunciation reference line closest to the intersection in the Y-axis direction.
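
The nearest-reference-line calculation can be sketched in Python as follows. The exponential mapping from Y offset to frequency, the parameter `px_per_octave`, and the function names are assumptions for illustration; this passage does not give the formula that converts the Y difference into a pitch.

```python
def reference_lines(ref_y, ref_hz, px_per_octave):
    """Displayed reference line plus one virtual line an octave below and above.

    Each entry is (Y coordinate, pitch in Hz); larger Y means higher pitch.
    """
    return [
        (ref_y - px_per_octave, ref_hz / 2),
        (ref_y, ref_hz),
        (ref_y + px_per_octave, ref_hz * 2),
    ]


def pitch_at(y, lines, px_per_octave):
    """Pitch for an intersection at `y`, measured from the nearest line."""
    line_y, line_hz = min(lines, key=lambda line: abs(y - line[0]))
    # Assumed mapping: a fixed number of pixels per octave.
    return line_hz * 2 ** ((y - line_y) / px_per_octave)
```

With a reference pitch of 261.63 Hz, the virtual lines fall at about 523.3 Hz and 130.8 Hz, which appear to correspond to the “523 Hz” and “130 Hz” quoted above. Under this particular exponential mapping every reference line gives the same result for a given `y`; the choice of the nearest line would only change the outcome under a different mapping, such as one that is linear in Hz.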

＜変形例５＞
実施形態においては、入力文字画像１０４に対して利用者がドラッグの操作を行うことで、各入力文字画像１０４を結合及び分離させることを可能としたが、入力文字画像１０４に対する操作は、これに限ったものではない。例えば、利用者が、或る入力文字画像１０４の右辺あるいは左辺に触れて、これをドラッグすることにより、この入力文字画像１０４がＸ軸方向において表示される長さ（入力文字画像１０４の横幅）を変更可能としてもよい。この場合、音声レコード生成手段１６は、変更された入力文字画像１０４の横幅に応じて、横幅が長いほど長い音長を、横幅が短いほど短い音長を、該当する文字に割り当てる。 <Modification 5>
In the embodiment, the user can combine and separate the input character images 104 by dragging them, but the operations on the input character image 104 are not limited to this. For example, the user may be able to change the length over which an input character image 104 is displayed in the X-axis direction (the width of the input character image 104) by touching its right or left side and dragging it. In this case, the voice record generating means 16 assigns to the corresponding character a sound length according to the changed width of the input character image 104: the wider the image, the longer the sound length, and the narrower the image, the shorter the sound length.

＜変形例６＞
実施形態においては、音声合成手段１７は、或る文字に割り当てられた音高と、この文字の次に入力された文字に割り当てられた音高とを、ピッチベンドによって繋ぐ処理を施していたが、これに限ったものではない。例えば、音声合成手段は、ピッチベンドを施さずに、音声ＤＢ２３に記憶された、各文字に割り当てられた音高のみに従って音声データを合成するようにしてもよい。 <Modification 6>
In the embodiment, the voice synthesizing means 17 connects the pitch assigned to a certain character and the pitch assigned to the character input next by pitch bend, but the processing is not limited to this. For example, the voice synthesizing means may synthesize voice data according only to the pitch assigned to each character as stored in the voice DB 23, without applying pitch bend.

＜変形例７＞
実施形態においては、入力文字列を構成する全ての文字に対して最短発音時間が割り当てられた場合の合計時間と比較して、算出された文字列音長が上記合計時間に満たない場合、軌跡分析手段１５は、各文字における最短発音時間を、入力された順序に従って先頭から積算した。そして、軌跡分析手段１５は、この積算の結果が、算出された文字列音長を越えた時点で、以降の文字を発音対象としないような制御を行っていた。これに限らず、軌跡分析手段１５は、上述の合計時間に満たないような速度でピッチカーブ１０３が入力された場合、タッチスクリーン３１に表示するピッチカーブ１０３における軌跡の長さを予め制限するようにしてもよい。 <Modification 7>
In the embodiment, when the calculated character string sound length is shorter than the total time that would result from assigning the shortest pronunciation time to every character of the input character string, the trajectory analyzing means 15 accumulates the shortest pronunciation time of each character from the beginning in the order of input. When the accumulated result exceeds the calculated character string sound length, the trajectory analyzing means 15 performs control so that the subsequent characters are not pronounced. The processing is not limited to this: when the pitch curve 103 is input at a speed such that the character string sound length falls short of the above total time, the trajectory analyzing means 15 may limit in advance the length of the trajectory of the pitch curve 103 displayed on the touch screen 31.
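
The accumulation rule of the embodiment can be sketched in Python as follows. Times are held in integer milliseconds to avoid rounding issues, and the 100 ms minimum in the usage note is hypothetical. The text does not say whether the character that pushes the running total past the limit is itself pronounced; this sketch assumes it is dropped.

```python
def pronounceable_chars(chars, min_length_ms, string_length_ms):
    """Return the leading characters that fit into the string sound length.

    `min_length_ms` maps each character to its shortest pronunciation time.
    """
    total_min = sum(min_length_ms[ch] for ch in chars)
    if string_length_ms >= total_min:
        return list(chars)  # everything fits; nothing is cut
    kept, accumulated = [], 0
    for ch in chars:
        accumulated += min_length_ms[ch]
        if accumulated > string_length_ms:
            break  # assumption: this and all later characters are dropped
        kept.append(ch)
    return kept
```

For example, with a minimum of 100 ms per character, the five characters of “こんにちは” need 500 ms in total, so a string sound length of 350 ms keeps only “こ”, “ん”, and “に”.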

＜変形例８＞
また、利用者が音高を直感的に分かるように、タッチパネル３１を正面から見た場合の左側に鍵盤の画像を表示しても良い。 <Modification 8>
Further, a keyboard image may be displayed on the left side when the touch panel 31 is viewed from the front so that the user can intuitively understand the pitch.

＜変形例９＞
実施形態においては、利用者が、入力済みのピッチカーブ１０３に対してさらに別の図形を重ねて描くと、この別の図形に応じた音響効果が発音時に適用されるようになっていたが、これを以下のようにしてもよい。音声合成装置１００は、入力済みのピッチカーブ１０３に対して利用者が別の図形を重ねて描く際の処理のモードを表す「追加入力モード」を記憶部２０に記憶する。「追加入力モード」には、「音響効果モード」と「音符列変更モード」が存在する。「音響効果モード」は、実施形態において説明したとおりであって、上述した、重ねて描かれた別の図形に応じた音響効果が、該当する入力文字の発音に際して適用される。「音符列変更モード」では、利用者が、入力済みのピッチカーブ１０３における特定の箇所に触れてドラッグすると、表示制御手段１３が、ピッチカーブ１０３の該当する箇所について、ドラッグの内容に応じて表示態様を変更する。例えば、利用者が、ピッチカーブ１０３の特定の箇所に触れて、Ｙ軸正方向にドラッグすると、ピッチカーブ１０３における該当の箇所の座標値がＹ軸正方向に移動するとともに、ピッチカーブ１０３における該当の箇所の周辺についての表示態様が、この移動に伴ってＹ軸正方向に曲線を描くように表示される。また、利用者がピッチカーブ１０３の特定の箇所に触れて、Ｙ軸負方向にドラッグすると、ピッチカーブ１０３における該当の箇所の座標値がＹ軸負方向に移動するとともに、ピッチカーブ１０３における該当の箇所の周辺についての表示態様が、この移動に伴ってＹ軸負方向に曲線を描くように表示される。利用者は、ＵＩ部３０を通じて、「追加入力モード」を適宜変更することが可能である。そして、軌跡分析手段１５は、該当する箇所の変更後の座標値に従って、この座標値に対応する入力文字に割り当てられる音高及び音長を算出し、音声レコード生成手段１６は、算出された結果に基づいて該当する入力文字に音高及び音長を割り当てる。 <Modification 9>
In the embodiment, when the user draws another figure on the pitch curve 103 that has already been input, the acoustic effect corresponding to the other figure is applied during pronunciation. This may be as follows. The speech synthesizer 100 stores in the storage unit 20 an “additional input mode” that represents a processing mode when the user draws another graphic on the input pitch curve 103. The “additional input mode” includes a “sound effect mode” and a “note string change mode”. The “acoustic effect mode” is as described in the embodiment, and the above-described acoustic effect according to another figure drawn in an overlapping manner is applied when the corresponding input character is pronounced. In the “musical note change mode”, when the user touches and drags a specific part of the input pitch curve 103, the display control unit 13 displays the corresponding part of the pitch curve 103 according to the content of the drag. Change the aspect. For example, when the user touches a specific part of the pitch curve 103 and drags in the positive direction of the Y axis, the coordinate value of the corresponding part in the pitch curve 103 moves in the positive direction of the Y axis and the corresponding value in the pitch curve 103. The display mode for the periphery of the point is displayed so as to draw a curve in the positive direction of the Y-axis with this movement. When the user touches a specific part of the pitch curve 103 and drags in the negative Y-axis direction, the coordinate value of the corresponding part in the pitch curve 103 moves in the negative Y-axis direction and The display mode for the periphery of the location is displayed so as to draw a curve in the negative Y-axis direction with this movement. The user can appropriately change the “additional input mode” through the UI unit 30. 
Then, the trajectory analyzing means 15 calculates the pitch and the sound length to be assigned to the input character corresponding to the changed coordinate value of the relevant portion, and the voice record generating means 16 assigns the pitch and the sound length to that input character based on the calculated result.

＜変形例１０＞
実施形態においては、ピッチカーブ１０３と入力文字線１０７との交差する座標値に従って、各々の文字の音高及び音長が算出されていたが、発音に際しての音声情報はこれに限ったものではない。例えば、ピッチカーブ１０３の形状がＸ軸に対して平坦な箇所には、より多くの文字が発音対象となるように、また、ピッチカーブ１０３の形状がＸ軸に対して急峻な箇所には、より少ない文字が発音対象となるように、各々の文字の発音開始位置が算出されてもよい。具体的には、以下のとおりである。 <Modification 10>
In the embodiment, the pitch and the sound length of each character are calculated according to the coordinate values at which the pitch curve 103 intersects the input character lines 107, but the voice information used for pronunciation is not limited to this. For example, the pronunciation start position of each character may be calculated so that more characters are pronounced where the pitch curve 103 is flat with respect to the X axis and fewer characters are pronounced where the pitch curve 103 is steep with respect to the X axis. Specifically, this works as follows.

図１４は、変形例１０を説明するための模式図である。
図１４は、タッチパネル３１の一部を拡大したものである。図１４における入力文字「こ」を例に挙げると、軌跡分析手段１５は、交差点Ａにおけるピッチカーブ１０３に対する接線Ｌ１ａの傾きを算出すると、さらにこの傾きの絶対値を算出してＲＡＭに記憶させる。ここで、矩形が破線で表された「こ」という入力文字画像１０４ａ、破線で表された入力文字線１０７ａ、及び交差点Ａは、変形例９における処理によって表示位置が変更される前の状態を表している。軌跡分析手段１５は、他の入力文字「ん」についても、上述した、傾きの絶対値を算出する。ここで、交差点における傾きの絶対値が大きいほど、その交差点において、ピッチカーブ１０３が急峻、つまりそのピッチカーブの形状がＸ軸に対して直交した状態に近いことを表している。一方、傾きの絶対値が小さいほど、その交差点において、ピッチカーブ１０３が平坦、つまりそのピッチカーブの形状がＸ軸に対して平行に近いことを表している。 FIG. 14 is a schematic diagram for explaining the tenth modification.
FIG. 14 is an enlarged view of a part of the touch panel 31. Taking the input character “ko” in FIG. 14 as an example, the trajectory analyzing means 15 calculates the slope of the tangent L1a to the pitch curve 103 at the intersection A, then calculates the absolute value of this slope and stores it in the RAM. Here, the input character image 104a of “ko” whose rectangle is drawn with a broken line, the input character line 107a drawn with a broken line, and the intersection A represent the state before the display position is changed by the processing in Modification 9. The trajectory analyzing means 15 likewise calculates the absolute value of the slope for the other input character, “n”. The larger the absolute value of the slope at an intersection, the steeper the pitch curve 103 is there, that is, the closer the shape of the pitch curve is to being perpendicular to the X axis. Conversely, the smaller the absolute value of the slope, the flatter the pitch curve 103 is at the intersection, that is, the closer the shape of the pitch curve is to being parallel to the X axis.

そして軌跡分析手段１５は、傾きの絶対値が予め定められた閾値を超える場合には、該当する入力文字画像の座標値を基準として、Ｘ軸における正方向あるいは負方向のいずれかにおいて、接線の傾きの絶対値が上述の閾値以下となる直近の座標値を求め、この求められた座標値におけるＸ軸が取る値を、該当する入力文字の発音開始位置として算出する。一方、傾きの絶対値が予め定められた閾値を超えない場合には、軌跡分析手段１５は、入力文字線１０７とピッチカーブ１０３との交差点における座標値におけるＸ軸が取る値を、該当する入力文字の発音開始位置として算出する。つまり、接線の傾きの絶対値が閾値に向かって大きくなる箇所、すなわちピッチカーブ１０３が急峻な箇所には、入力文字の発音開始位置が割り当てられない可能性が高くなる。結果として、上述した急峻な箇所では、ピッチカーブ１０３が平坦な箇所と比較して、より少ない文字が発音対象とされることになる。 When the absolute value of the inclination exceeds a predetermined threshold value, the trajectory analyzing unit 15 uses the coordinate value of the corresponding input character image as a reference, and determines whether the tangent line is in the positive direction or the negative direction on the X axis. The most recent coordinate value at which the absolute value of the inclination is equal to or less than the above threshold is obtained, and the value taken by the X axis in the obtained coordinate value is calculated as the pronunciation start position of the corresponding input character. On the other hand, if the absolute value of the inclination does not exceed a predetermined threshold value, the trajectory analyzing means 15 uses the value taken by the X axis at the coordinate value at the intersection of the input character line 107 and the pitch curve 103 as the corresponding input. Calculated as the character's pronunciation start position. That is, there is a high possibility that the pronunciation start position of the input character is not assigned to a location where the absolute value of the tangential slope increases toward the threshold, that is, a location where the pitch curve 103 is steep. As a result, in the steep portion described above, fewer characters are targeted for pronunciation compared to the portion where the pitch curve 103 is flat.

例えば、図１４において、交差点Ａにおける接線Ｌ１ａの傾きの絶対値が、上述した閾値を超えるため、「こ」という入力画像文字１０４ａの発音開始位置が、Ｘ軸において負方向に移動した位置（すなわち時間軸において前方の位置）となる。具体的には、交差点Ａ’において接線Ｌ１ａ’の傾きの絶対値が閾値以下となるため、軌跡分析手段１５は、入力文字表示画像１０４ａを、入力文字画像１０４ａ’の表示位置まで移動させる。そして、入力文字「ん」の発音開始位置が、交差点Ｂにおける座標値におけるＸ軸が取る値と同一に算出された場合、入力文字「ん」に対して、入力文字「こ」が時間軸において前方へ移動するため、入力文字「こ」に割り当てられる音長が、移動前と比較して長いものとなる。 For example, in FIG. 14, since the absolute value of the slope of the tangent L1a at the intersection A exceeds the above-described threshold, the pronunciation start position of the input image character 104a “ko” has moved in the negative direction on the X axis (ie, Forward position on the time axis). Specifically, since the absolute value of the slope of the tangent L1a 'is equal to or less than the threshold value at the intersection A', the trajectory analyzing unit 15 moves the input character display image 104a to the display position of the input character image 104a '. When the pronunciation start position of the input character “n” is calculated to be the same as the value taken by the X axis in the coordinate value at the intersection B, the input character “ko” is on the time axis with respect to the input character “n”. Since it moves forward, the sound length assigned to the input character “ko” is longer than before the movement.
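
The slope-threshold rule of Modification 10 can be sketched in Python as follows. The curve is assumed to be available as sampled `(x, y)` points with strictly increasing X; estimating the tangent by finite differences, searching by sample index, and preferring the negative X direction on ties are assumptions for illustration.

```python
def slopes(points):
    """Finite-difference estimate of the tangent slope at each sample."""
    n = len(points)
    result = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        (x0, y0), (x1, y1) = points[lo], points[hi]
        result.append((y1 - y0) / (x1 - x0))
    return result


def onset_x(points, i, threshold):
    """X value used as the pronunciation start for the intersection at index i."""
    s = slopes(points)
    if abs(s[i]) <= threshold:
        return points[i][0]  # flat enough: keep the intersection itself
    # Steep: look outward for the nearest sample whose slope is small enough.
    for d in range(1, len(points)):
        for j in (i - d, i + d):  # negative X direction is tried first
            if 0 <= j < len(points) and abs(s[j]) <= threshold:
                return points[j][0]
    return points[i][0]  # no flat sample exists; fall back to the intersection
```

On a curve that is flat, rises steeply, and flattens again, an intersection in the steep part is moved to the nearest flat sample, which lengthens the time available to that character when the move is toward earlier X values, as in the example of FIG. 14.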

＜変形例１１＞
実施形態では、軌跡分析手段１５が、ピッチカーブ１０３の始端から終端に至るまでの入力に要した時間に応じて、文字列全体の発音時に割り当てる音長である文字列音長を算出していたが、文字列音長の算出方法はこれに限らない。軌跡分析手段１５は、ピッチカーブ１０３の始端から終端に至るまでのそのピッチカーブ１０３上の距離や、ピッチカーブ１０３の始端のＸ座標値と終端のＸ座標値との差の大きさに基づいて文字列音長を算出してもよい。例えば、ピッチカーブ１０３の始端から終端に至るまでのそのピッチカーブ１０３上の距離が大きいほど文字列音長も大きくなったり、ピッチカーブ１０３の始端のＸ座標値と終端のＸ座標値との差が大きいほど文字列音長も大きくなるといった具合である。このように、文字列音長の算出方法には種々のものがあるが、文字列を構成する各文字の音長の算出に関しては、軌跡分析手段１５は各文字に対応する図形（ピッチカーブ）の座標値に基づき音高及び音長を算出する。 <Modification 11>
In the embodiment, the trajectory analyzing means 15 calculates the character string sound length, which is the sound length assigned to the pronunciation of the entire character string, according to the time taken to input the pitch curve 103 from its start point to its end point, but the method of calculating the character string sound length is not limited to this. The trajectory analyzing means 15 may calculate the character string sound length based on the distance along the pitch curve 103 from its start point to its end point, or on the magnitude of the difference between the X coordinate values of the start point and the end point of the pitch curve 103. For example, the longer the distance along the pitch curve 103 from its start point to its end point, the longer the character string sound length; or the larger the difference between the X coordinate values of the start point and the end point, the longer the character string sound length. Although there are thus various methods of calculating the character string sound length, for each character constituting the character string the trajectory analyzing means 15 calculates the pitch and the sound length based on the coordinate values of the figure (pitch curve) corresponding to that character.
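
The two alternative measures in Modification 11 can be sketched in Python as follows. The proportional conversion with `seconds_per_px` and the `method` switch are assumptions for illustration; the text only requires that a larger distance gives a longer character string sound length.

```python
import math


def arc_length(points):
    """Distance travelled along the sampled pitch curve."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def x_span(points):
    """Difference between the X coordinates of the start and end points."""
    return abs(points[-1][0] - points[0][0])


def string_length_s(points, seconds_per_px, method="arc"):
    """Character string sound length from the chosen measure (hypothetical scaling)."""
    measure = arc_length(points) if method == "arc" else x_span(points)
    return measure * seconds_per_px
```

For the points `(0, 0)`, `(3, 4)`, `(6, 0)` the arc length is 10 and the X span is 6, so a curve with large pitch excursions yields a longer sound length under the arc-length measure than under the X-span measure.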

＜変形例１２＞
実施形態においては、音声合成手段１７が音声データを合成する際に、或る文字に割り当てられた音高と、この文字の次に入力された文字に割り当てられた音高とを、ピッチベンドによって繋ぐ処理を施していたが、これに限らず、割り当て手段１８が、補正機能と称するものを実現することで、入力文字の各々に、上記補正機能によって定まる所定の音階に従った音高を割り当てるようにしてもよい。また、割り当て手段１８は、入力文字の各々に、上記補正機能によって定まる所定の時間の長さに従った音長を割り当てるようにしてもよい。つまり、割り当て手段１８が実現する補正機能には、音高に対する補正機能と、音長に対する補正機能とがある。 <Modification 12>
In the embodiment, when synthesizing voice data, the voice synthesizing means 17 connects the pitch assigned to a certain character and the pitch assigned to the character input next by pitch bend. The processing is not limited to this: the assigning means 18 may implement what is called a correction function and thereby assign to each input character a pitch that follows a predetermined scale determined by the correction function. The assigning means 18 may also assign to each input character a sound length that follows a predetermined length of time determined by the correction function. In other words, the correction functions implemented by the assigning means 18 include a correction function for pitch and a correction function for sound length.

図１５（ａ）及び図１５（ｂ）は、音高に対する補正機能を説明する模式図である。
図１５（ａ）及び図１５（ｂ）においては、タッチスクリーン３１上に、メニューボタン画像１１２が表示されている。利用者がメニューボタン画像１１２に触れると、制御部１０が、タッチスクリーン３１に、利用者が実行可能な機能の選択肢（機能選択肢という）をリスト形式で表示する。利用者が、表示された機能選択肢から望みのものを選択すると、制御部１０は、選択された機能を実行する。ここで、タッチスクリーン３１に表示される機能選択肢には、制御部１０によって実現される、「音高の補正」及び「音長の補正」が含まれており、利用者は、両者の機能について「ＯＮ／ＯＦＦ」を設定することで、これらの機能を実現するか否かを選択することができる。 FIGS. 15A and 15B are schematic diagrams for explaining a correction function for pitch.
In FIG. 15A and FIG. 15B, a menu button image 112 is displayed on the touch screen 31. When the user touches the menu button image 112, the control unit 10 displays on the touch screen 31, in list form, the functions that the user can execute (referred to as function options). When the user selects the desired one from the displayed function options, the control unit 10 executes the selected function. The function options displayed on the touch screen 31 include “pitch correction” and “sound length correction”, both implemented by the control unit 10, and the user can choose whether to enable each of them by setting it to “ON” or “OFF”.

図１５（ａ）は、音高に対して補正が行われる前の表示状態を表しており、「こんにちは」という入力文字列に対して、傾斜したピッチカーブ１０３が入力されている様子を例示している。この状態で利用者が再生ボタン画像１０５に触れると、実施形態で説明したように、入力文字列における隣り合う文字同士がピッチベンドによって繋げられた音声データが合成される。図１５（ａ）の状態において、利用者が機能選択肢における「音高の補正」を「ＯＮ」に設定すると、割り当て手段１８が音高に対する補正を行った結果、図１５（ｂ）のような表示状態となる。図１５（ｂ）において、タッチスクリーン３１には、表示制御手段１３によって、音高を表すピアノロールを模した横縞模様の画像が、背景画像として表示されている。ここで、黒色の横縞画像は黒鍵を表し、白色の横縞画像は白鍵を表す。また、各横縞画像には、割り当て手段１８によって、ピッチ方向（Ｙ軸方向）における縦幅及び時間軸方向（Ｘ軸方向）における横幅の全域にわたり、１つの音高が割り当てられている。これらの各横縞画像は、割り当て手段１８が入力文字列１０４を構成する各文字に音高を割り当てるときの指標となる。 FIG. 15A shows the display state before the pitch is corrected, and illustrates a state in which a sloping pitch curve 103 has been input for the input character string “こんにちは” (“konnichiwa”). When the user touches the play button image 105 in this state, voice data in which adjacent characters of the input character string are connected by pitch bend is synthesized, as described in the embodiment. When the user sets “pitch correction” in the function options to “ON” in the state of FIG. 15A, the assigning means 18 corrects the pitch, and the display changes to the state shown in FIG. 15B. In FIG. 15B, the display control means 13 displays on the touch screen 31, as a background image, a horizontally striped image imitating a piano roll that represents pitch. Here, the black stripe images represent black keys, and the white stripe images represent white keys. The assigning means 18 assigns one pitch to each stripe image over its entire height in the pitch direction (Y-axis direction) and its entire width in the time-axis direction (X-axis direction). Each of these stripe images serves as an index when the assigning means 18 assigns a pitch to each character constituting the input character string 104.

At this time, the assigning means 18 changes each pitch that it assigned in FIG. 15(a), according to the position on the pitch curve 103 corresponding to each input character image 104, to the pitch on the piano scale closest to that pitch. In other words, the assigning means 18 corrects, according to the stripes serving as the index, the coordinate value of the position on the pitch curve 103 corresponding to each character of the input character image 104, and assigns a pitch to each character based on the corrected coordinate value. Accordingly, the display control means 13 controls the display positions so that each assigned character image 109 is displayed overlapping the stripe closest to it in the pitch direction. In the example of FIG. 15(b), if the lowest white-key stripe on the Y axis is "C3", the assigned character image 109 for "こ" (ko) is assigned the pitch "D#3", the assigned character image 109 for "に" (ni) is assigned the pitch "G3", and so on. Besides controlling the display positions of the assigned character images 109 in this way, the display control means 13 also changes how the pitch curve 103 is displayed. Specifically, as shown in FIG. 15(b), the display control means 13 displays the pitch curve 103 as a staircase that follows the positions of the assigned character images 109 and the stripes. That is, the display control means 13 changes, according to the stripes serving as the index, the coordinate value of the position corresponding to each assigned character image 109 on the pitch curve 103 designated by the user, and displays the pitch curve 103 with the changed coordinate values. As a result, each character is pronounced at its assigned pitch for the duration of its assigned sound length.
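The pitch correction described above can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: it assumes that all stripes have the same height, that `y` is measured upward from the bottom edge of the lowest stripe, and that the lowest stripe is C3 as in the FIG. 15(b) example. The names `snap_pitch`, `stripe_index`, and `stripe_to_note` are hypothetical.

```python
# Hypothetical sketch: snap a pitch-axis coordinate to the nearest piano-roll stripe.
# Assumptions (not from the source): equal stripe heights, y measured upward from
# the bottom edge of the lowest stripe, lowest stripe = C3.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def stripe_index(y, stripe_height):
    """Index of the stripe containing y, which is also the stripe whose center is nearest."""
    return max(0, int(y // stripe_height))


def stripe_to_note(index, base_octave=3):
    """Name of the semitone represented by a stripe, counting up from C of base_octave."""
    octave = base_octave + index // 12
    return f"{NOTE_NAMES[index % 12]}{octave}"


def snap_pitch(y, stripe_height):
    """Return (note name, corrected y) for a point on the pitch curve."""
    k = stripe_index(y, stripe_height)
    corrected_y = (k + 0.5) * stripe_height  # center of the chosen stripe
    return stripe_to_note(k), corrected_y
```

With 10-pixel stripes, a character at `y = 34` lands on the fourth stripe (D#3) and one at `y = 72` on the eighth (G3), which matches the pitches given for "こ" and "に" in the example. Drawing the corrected `y` values as horizontal segments gives the staircase-shaped curve.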

Also, in the state of FIG. 15(b), when the user selects an assigned character image 109 and drags it in the pitch direction (Y-axis direction), the display control means 13 displays that assigned character image 109 so that it overlaps the stripe closest to the drag end point in the pitch direction. Along with this, the display control means 13 also changes the shape of the pitch curve 103 in the pitch direction. As a result, the assigning means 18 assigns the pitch corresponding to that stripe to the assigned character image 109. In other words, the assigning means 18 corrects, according to the stripes serving as the index, the coordinate value of the position on the pitch curve 103 corresponding to each character of the assigned character image 109, and assigns a pitch to each character based on the corrected coordinate value.

Also, when the pitch curve 103 is entered at a speed exceeding a predetermined threshold, the control unit 10 does not correct the portion of the pitch curve 103 that was entered at that speed. For that portion, as in the embodiment, the assigning means 18 joins the pitch assigned to a character and the pitch assigned to the next character by pitch bend. The same applies when the user changes a specific part of an already entered pitch curve 103 by dragging it, as described in Modification 9, at a speed exceeding the predetermined threshold.
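A minimal sketch of the speed test described above, assuming the touch screen delivers timestamped samples `(t, x, y)` of the traced curve. The sample format, the units, and the name `fast_segments` are assumptions for illustration; the source only states that a predetermined threshold exists.

```python
import math


def fast_segments(samples, speed_threshold):
    """Flag each segment between consecutive samples as fast (True) or slow (False).

    samples: list of (timestamp_seconds, x, y) tuples in input order (assumed format).
    Segments flagged True would be left uncorrected and joined by pitch bend;
    segments flagged False would be snapped to the piano-roll stripes.
    """
    flags = []
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dt = t1 - t0
        distance = math.hypot(x1 - x0, y1 - y0)
        # A zero or negative time step is treated as infinitely fast.
        speed = distance / dt if dt > 0 else math.inf
        flags.append(speed > speed_threshold)
    return flags
```

Measuring speed per segment rather than per stroke lets a single trace mix corrected and uncorrected ranges, which is what the passage describes.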

Note that the display control means 13 and the assigning means 18 may apply the above processing to the assigned character images 109 and the pitch curve 103 that are already displayed at the moment "pitch correction" is set to "ON", or they may apply it only to assigned character images 109 and pitch curves 103 entered after "pitch correction" has been set to "ON". The timing of this processing may be predetermined in the speech synthesizer 100, or may be changeable by the user via the touch screen 31.

FIGS. 16(a) and 16(b) are schematic diagrams explaining the sound-length correction function.
FIGS. 16(a) and 16(b) show the state in which the user has set "sound length correction" in the function options to "ON". When "sound length correction" is set to "ON", the display control means 13 displays a time axis scale 113 in the upper part of the touch screen 31 (toward the positive Y-axis direction). The time axis scale 113 represents time elapsing in the positive X-axis direction. In the example of FIG. 16, one division of the time axis scale 113 represents 0.1 seconds, but a predetermined length of time other than 0.1 seconds, or a measure or a beat, may be associated with one division. When measures or beats are associated, the user may be allowed to set the meter, such as "quadruple time" or "3/4 time", via the touch screen 31. The time axis scale 113 serves as the index that the assigning means 18 uses when assigning a sound length to each character of the input character string 104.

In FIG. 16(a), when the user selects the assigned character image 109 for "ち" (chi) and drags it along the trajectory indicated by D1, the display control means 13 changes the display position of the assigned character image 109 according to this drag trajectory. In drag trajectory D1, the assigned character image 109 for "ち" is dragged toward the assigned character image 109 for "に" (ni). Accordingly, the display control means 13 changes the display position of the assigned character image 109 for "ち" to the position shown in FIG. 16(b). At this time, the display position of the assigned character image 109 is restricted to positions that correspond to one unit (one division) of the time axis scale 113. That is, when the display position of the assigned character image 109 falls between one division of the time axis scale 113 and the adjacent division, it is moved to the position of the nearer division. Along with this change of display position, the assigning means 18 shortens the sound length assigned to the assigned character image 109 for "に" and lengthens the sound length assigned to the assigned character image 109 for "ち". In other words, the assigning means 18 corrects, according to the time axis scale 113 serving as the index, the coordinate value of the position on the pitch curve 103 corresponding to each character of the assigned character image 109, and assigns a sound length to each character based on the corrected coordinate value. Accordingly, the display control means 13 changes the display position of the input character image 104. Also, when the user selects and drags the left edge (or right edge) of the rectangle representing an input character image 104, the assigning means 18 changes the sound length assigned to that input character image 104. For example, when the user drags along the trajectory indicated by drag trajectory D2 in FIG. 16(b), the assigning means 18 shortens the sound length assigned to the input character image 104 for "こ" (ko) and lengthens the sound length assigned to the input character image 104 for "ん" (n). Accordingly, the display control means 13 changes the display position of the assigned character image 109.
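The snapping to the time axis scale can be sketched as below. This is a sketch under stated assumptions rather than the patent's method: each character is modeled by the boundary times on either side of it, the division is 0.1 seconds as in FIG. 16, and a moved boundary is clamped so that both neighboring characters keep at least one division. That clamp and the names `snap_to_tick`, `move_boundary`, and `durations` are hypothetical.

```python
def snap_to_tick(t, tick=0.1):
    """Move a time value to the nearest division of the time axis scale."""
    return round(t / tick) * tick


def move_boundary(boundaries, i, new_time, tick=0.1):
    """Move interior boundary i to new_time, snapped to the scale.

    boundaries: sorted times; character k sounds from boundaries[k] to boundaries[k + 1].
    The clamp (an assumption) keeps at least one division for each neighbor.
    """
    snapped = snap_to_tick(new_time, tick)
    lowest = boundaries[i - 1] + tick
    highest = boundaries[i + 1] - tick
    moved = list(boundaries)
    moved[i] = min(max(snapped, lowest), highest)
    return moved


def durations(boundaries):
    """Sound length of each character, derived from consecutive boundaries."""
    return [end - start for start, end in zip(boundaries, boundaries[1:])]
```

For five characters of 0.2 seconds each, `move_boundary([0.0, 0.2, 0.4, 0.6, 0.8, 1.0], 3, 0.47)` moves the boundary between the third and fourth characters to about 0.5 seconds, so the third character ("に") shortens to about 0.1 seconds and the fourth ("ち") lengthens to about 0.3 seconds.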

For convenience, "pitch correction" and "sound length correction" have been described as separate functions using separate drawings, but the control unit 10 may execute both functions simultaneously in parallel. Also, instead of displaying the background image imitating a piano roll over the entire time-axis direction (X-axis direction), the display control means 13 may display the background image imitating a piano roll only at the left edge of the touch screen 31. According to Modification 12 described above, the user can intuitively assign a note sequence to a character string.

<Modification 13>
In the embodiment, the storage unit 20 includes the pronunciation dictionary DB 21, which consists of a plurality of pronunciation records containing the ratio of the pronunciation times of the characters constituting a character string, and the assigning means 18 determines the sound length assigned to each character based on the character string sound length and the pronunciation records. However, the invention is not limited to this. Instead of the pronunciation dictionary DB 21, the storage unit 20 may include an initial value pronunciation dictionary DB whose pronunciation records contain the absolute pronunciation time of each character.

FIG. 17 is a diagram showing the contents of the initial value pronunciation dictionary DB.
Each pronunciation record in the initial value pronunciation dictionary DB consists of several items: an identification ID, a character, and an initial value pronunciation time. The identification ID uniquely identifies each pronunciation record and consists of, for example, a four-digit number. The character is a single character predetermined as a character to be pronounced. The initial value pronunciation time is the initial value of the pronunciation time assigned in advance to the character of each pronunciation record. The initial sound length of each character is determined in advance from the experimentally obtained length of time taken when that character is pronounced with natural intonation. For example, in FIG. 17, a pronunciation time of "0.3 seconds" is assigned in advance as the initial value to each of the characters "あ" (a), "い" (i), "う" (u), and "え" (e).
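One way to picture such a record is the sketch below. The field names, the in-memory list, and the specific ID values are assumptions made for illustration; only the three items per record and the 0.3-second value for "あ", "い", "う", and "え" come from the description of FIG. 17.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PronunciationRecord:
    record_id: str             # identification ID, e.g. a four-digit number
    character: str             # the single character to be pronounced
    initial_duration_s: float  # initial value pronunciation time, in seconds


# The ID values are hypothetical; the durations follow the FIG. 17 example.
INITIAL_VALUE_DB = [
    PronunciationRecord("0001", "あ", 0.3),
    PronunciationRecord("0002", "い", 0.3),
    PronunciationRecord("0003", "う", 0.3),
    PronunciationRecord("0004", "え", 0.3),
]

# Index by character for constant-time lookup.
BY_CHARACTER = {record.character: record for record in INITIAL_VALUE_DB}


def initial_durations(text):
    """Look up the initial sound length of each character (KeyError if a character is absent)."""
    return [BY_CHARACTER[ch].initial_duration_s for ch in text]
```

Because each record stores an absolute time rather than a ratio, the lookup needs no character string sound length, which is the difference from the pronunciation dictionary DB 21 of the embodiment.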

FIGS. 18(a) and 18(b) are diagrams showing what the speech synthesizer 100 displays in Modification 13.
FIG. 18(a) shows the state immediately after the user has entered a character string in the text box 101 and before the user has entered a pitch curve 103. In FIG. 18(a), the character string "あたま" (atama) has been entered, and the assigning means 18 has assigned initial value pronunciation times of the same length, as sound lengths, to the characters "あ" (a), "た" (ta), and "ま" (ma) based on the pronunciation records of the initial value pronunciation dictionary DB. Also in FIG. 18(a), the display control means 13 displays a default pitch curve 103 based on the width of the input character images 104 in the X-axis direction and the position of the pronunciation reference line 102 in the Y-axis direction.

FIG. 18(b) shows the state after the user has entered a pitch curve 103 from the state of FIG. 18(a). In FIG. 18(b), the display control means 13 has changed the display position of each input character image 104 in the pitch direction (Y-axis direction) according to the shape of the pitch curve 103. Because an initial value pronunciation time has already been assigned to each input character image 104, the assigning means 18 does not, unlike in the embodiment, assign to each input character image 104 a sound length based on the character string sound length and the pronunciation records containing pronunciation time ratios according to the shape of the pitch curve 103. However, the following may be done when the length from the left edge of the input character image 104 for the first character of the input character string to the right edge of the input character image 104 for the last character (that is, the width of the input character images 104 in the time-axis direction (X-axis direction)) is shorter than the length of the entered pitch curve 103 in the time-axis direction (X-axis direction). In this case, the assigning means 18 assigns to the input character image 104 for the last character of the input character string a sound length that extends to the end of the entered pitch curve 103. In FIG. 18(b), the assigning means 18 has assigned to the input character image 104 for "ま" (ma), the last character of the input character string, a sound length that extends to the end of the entered pitch curve 103.
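The extension of the last character can be sketched as follows. The sketch assumes that X-axis lengths have already been converted to seconds and that the feature is controlled by a simple flag; the function name `extend_last_to_curve_end` and its parameters are hypothetical.

```python
def extend_last_to_curve_end(sound_lengths, curve_length_s, enabled=True):
    """Stretch the last character so the characters span the whole pitch curve.

    sound_lengths: initial sound length of each character, in seconds.
    curve_length_s: length of the entered pitch curve along the time axis, in seconds.
    Returns a new list; the input is left unchanged.
    """
    total = sum(sound_lengths)
    if not enabled or not sound_lengths or total >= curve_length_s:
        return list(sound_lengths)
    extended = list(sound_lengths)
    extended[-1] += curve_length_s - total  # only the last character grows
    return extended
```

For three characters of 0.3 seconds each and a 1.5-second curve, the first two characters keep 0.3 seconds and the last one grows to about 0.9 seconds. When the characters already span the curve, nothing changes.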

Note that the user may be allowed to turn "ON/OFF" the processing in which the assigning means 18 assigns, to the input character image 104 for the last character of the input character string, a sound length that extends to the end of the pitch curve 103. Also, the pronunciation time that the assigning means 18 assigns to each input character image 104 when the user enters the pitch curve 103 is only an initial value. After entering the pitch curve 103, the user can therefore change the sound length assigned to each character by dragging the left edge (or right edge) of the rectangle representing an input character image 104, or an assigned character image 109, in the time-axis direction (X-axis direction). Also, by changing a setting via the touch screen 31, the user may be allowed to switch from the sound length assignment based on the initial value pronunciation dictionary DB of this modification to the sound length assignment of the embodiment, which is based on the pronunciation dictionary DB 21 storing the pronunciation time ratio of each character. Conversely, by changing a setting via the touch screen 31, the user may be allowed to switch from the sound length assignment based on the pronunciation dictionary DB 21 to the sound length assignment based on the initial value pronunciation dictionary DB of this modification. According to Modification 13 described above, each character is assigned, as an initial value and regardless of user operation, the sound length it has when pronounced with natural intonation. Both the pronunciation dictionary DB 21 of the embodiment and the initial value pronunciation dictionary DB correspond to the pronunciation length dictionary storage means.

<Modification 14>
The hardware configuration of the speech synthesizer 100 is not limited to that described with reference to FIG. 1. The speech synthesizer 100 may have any hardware configuration that can implement the functions shown in FIG. 5. For example, the speech synthesizer 100 may have dedicated hardware (circuits) corresponding to each of the functional elements shown in FIG. 5.

<Modification 15>
The program for the speech synthesis application described in the above embodiment may be provided stored on a computer-readable recording medium such as a magnetic recording medium (magnetic tape, magnetic disk (HDD, FD (Flexible Disk)), etc.), an optical recording medium (optical disk (CD (Compact Disk), DVD (Digital Versatile Disk)), etc.), a magneto-optical recording medium, or a semiconductor memory (flash ROM, etc.). The program may also be downloaded via a network such as the Internet.

Description of symbols: 10 ... control unit, 11 ... character string acquisition means, 12 ... reference sound length specifying means, 13 ... display control means, 14 ... character spacing control means, 15 ... trajectory analysis means, 16 ... voice record generation means, 17 ... speech synthesis means, 18 ... assigning means, 20 ... storage unit, 21 ... pronunciation dictionary DB, 22 ... shortest pronunciation time DB, 23 ... voice DB, 24 ... acoustic effect DB, 30 ... UI unit, 31 ... touch screen, 40 ... audio output unit, 100 ... speech synthesizer, 101 ... text box, 102 ... pronunciation reference line, 103, 103a to 103f ... pitch curve, 104, 104a, 104a', 104b ... input character image, 105 ... play button image, 106 ... back button image, 107, 107a, 107a', 107b ... input character line, 108A to 108E ... input character image group, 109 ... assigned character image, 110 ... housing, 111 ... speaker, 112 ... menu button image, 113 ... time axis scale, A to E, A' ... intersection, L1a, L1a', L2 ... tangent line, α, α' ... difference length, β, γ ... distance

Claims

A speech synthesizer comprising:
character string acquisition means for acquiring a character string composed of a plurality of characters;
character string display means for displaying each character constituting the acquired character string on display means;
graphic display means for, when a graphic in a coordinate system having a first axis representing time and a second axis representing pitch is designated by a user, displaying the graphic on the display means in association with each character constituting the character string;
assigning means for assigning a pitch and a sound length to each character based on the coordinate value of the position, in the displayed graphic, corresponding to each character constituting the displayed character string; and
speech synthesis means for synthesizing speech data that causes each character constituting the character string to be pronounced at the pitch and sound length assigned by the assigning means.

The speech synthesizer according to claim 1, further comprising pronunciation length dictionary storage means for storing, for each of a plurality of words, the length of the pronunciation time of each character constituting the word, or the ratio of the pronunciation time of each character constituting the word to the pronunciation time of the word as a whole,
wherein the assigning means assigns a sound length to each character based on a character string sound length, which is the sound length used when the entire character string is pronounced and is designated by the user, and on the length of the pronunciation time or the ratio of the pronunciation times stored in the pronunciation length dictionary storage means for each character constituting the character string.

The speech synthesizer according to claim 1 or 2, further comprising index display means for displaying on the display means an index for assigning a pitch or a sound length to each character constituting the character string,
wherein the assigning means corrects the coordinate value of the position in the graphic corresponding to each character constituting the character string according to the index displayed by the index display means, and assigns a pitch and a sound length to each character based on the corrected coordinate value.

The speech synthesizer, further comprising acoustic effect storage means for storing, in association with each of a plurality of graphic shapes, an acoustic effect applied when a character is pronounced,
wherein, when a graphic superimposed on the graphic displayed on the display means is designated by the user, the graphic display means displays the superimposed graphic on the display means, and
the assigning means identifies, among the plurality of graphic shapes stored in the acoustic effect storage means, a shape whose similarity to the superimposed graphic exceeds a threshold, and assigns the acoustic effect stored in association with the identified shape to the character displayed at the position corresponding to the coordinate value of the superimposed graphic.

A program for causing a computer to realize:
a character string acquisition function of acquiring a character string composed of a plurality of characters;
a character string display function of displaying each character constituting the acquired character string on display means;
a graphic display function of, when a graphic in a coordinate system having a first axis representing pitch and a second axis representing time is designated by a user, displaying the graphic on the display means in association with each character constituting the character string;
an assignment function of assigning a pitch and a sound length to each character based on the coordinate value of the graphic corresponding to each character constituting the displayed character string; and
a speech synthesis function of synthesizing speech data that causes each character constituting the character string to be pronounced at the pitch and sound length assigned by the assignment function.