JP5953743B2 - Speech synthesis apparatus and program

Info

Publication number: JP5953743B2
Application number: JP2011286728A
Authority: JP
Inventors: 治大島; 資司永田
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2011-12-27
Filing date: 2011-12-27
Publication date: 2016-07-20
Anticipated expiration: 2031-12-27
Also published as: JP2013134476A

Description

The present invention relates to a speech synthesizer and a program.

A speech synthesizer is known that, when a character string such as lyrics and the pitches and durations of a plurality of notes (hereinafter referred to as a “note sequence”) are input as data, synthesizes a singing voice according to the character string and the notes (see, for example, Patent Document 1). For such an apparatus, a technique has also been proposed in which voice features are extracted from a singer's singing voice and the synthesized speech is edited using the extracted features (see, for example, Patent Document 2).

Patent Document 1: JP 2006-259768 A
Patent Document 2: JP 2009-217141 A

In a speech synthesizer of the kind described above, to confirm whether the synthesized speech has been edited as intended, the user has had to play back the synthesized speech after inputting the singing voice. Generating synthesized speech from features extracted from a voice requires various complex processes, which impose a heavy processing load and take time. The user therefore has to wait until the synthesized speech can be played back, and editing of the synthesized speech may not proceed smoothly.
The present invention has been made in view of the above background. Its object is to provide a technique that makes it easy to check the synthesized speech to be generated in an apparatus that generates synthesized speech using features (attributes) extracted from a voice.

To solve the above problem, the present invention provides a speech synthesizer comprising: a receiving unit that receives attribute data indicating attributes, including pitch and volume, of voice data; a speech synthesis unit that generates synthesized speech based on the attribute data received by the receiving unit; and a sound signal generation unit that, before the speech synthesis unit generates the synthesized speech, generates a periodic sound signal based on the pitch and volume indicated by the attribute data received by the receiving unit, the sound signal generation unit generating the periodic sound signal by processing that takes less time than the processing by which the speech synthesis unit generates the synthesized speech.

In a preferred aspect of the present invention, the speech synthesizer may further comprise: a second receiving unit that receives lyrics data indicating lyrics and musical score data associated with the lyrics; and an association unit that associates the pitch indicated by the attribute data received by the receiving unit with the musical score data received by the second receiving unit and, based on the result of that association, associates the lyrics data with pitch data representing the pitch. In this aspect, the speech synthesis unit may generate the synthesized speech based on the attribute data received by the receiving unit and on the lyrics data and pitch data associated by the association unit.

In a further preferred aspect of the present invention, the speech synthesizer may comprise a voice analysis unit that analyzes the voice data for attributes including pitch and volume and supplies attribute data indicating the analysis result to the receiving unit.

The present invention also provides a program for causing a computer to realize: a receiving function of receiving attribute data indicating attributes, including pitch and volume, of voice data; a speech synthesis function of generating synthesized speech based on the received attribute data; and a sound signal generation function of generating, before the speech synthesis function generates the synthesized speech, a periodic sound signal based on the pitch and volume indicated by the received attribute data, the sound signal generation function generating the periodic sound signal by processing that takes less time than the processing by which the speech synthesis function generates the synthesized speech.

According to the present invention, in an apparatus that generates synthesized speech using features (attributes) extracted from a voice, the synthesized speech to be generated can be checked easily.

FIG. 1 is a block diagram showing an example of the hardware configuration of the speech synthesizer.
FIG. 2 is a diagram showing an example of the content of singing score data.
FIG. 3 is a diagram showing an example of a screen displayed on the display unit.
FIG. 4 is a block diagram showing an example of the functional configuration of the speech synthesizer.
FIG. 5 is a flowchart showing the flow of processing performed by the control unit.

<Embodiment>
<Configuration>
FIG. 1 is a block diagram showing an example of the hardware configuration of a speech synthesizer 100 according to an embodiment of the present invention. The speech synthesizer 100 synthesizes speech based on musical score data that includes a character string and a phoneme sequence, and outputs the synthesized speech. The speech synthesizer 100 includes a control unit 10, a storage unit 20, an operation unit 30, a display unit 40, an audio processing unit 60, a microphone 61, and a speaker 62, which are connected to one another via a bus 70. The control unit 10 includes a CPU (Central Processing Unit), a RAM (Random Access Memory), a ROM (Read Only Memory), and the like. In the control unit 10, the CPU reads a computer program stored in the ROM or the storage unit 20, loads it into the RAM, and executes it, thereby controlling each unit of the speech synthesizer 100. The operation unit 30 includes various operating elements and outputs to the control unit 10 an operation signal representing the operation performed by the user. The display unit 40 includes, for example, a liquid crystal panel and displays various images under the control of the control unit 10.

The microphone 61 outputs an analog audio signal representing the collected sound to the audio processing unit 60. The audio processing unit 60 has an A/D (Analog/Digital) converter; it converts the analog audio signal output by the microphone 61 into digital audio data and outputs the data to the control unit 10, which acquires it. The audio processing unit 60 also has a D/A (Digital/Analog) converter; it converts digital audio data received from the control unit 10 into an analog audio signal and outputs the signal to the speaker 62. The speaker 62 emits sound based on the analog audio signal received from the audio processing unit 60. This embodiment describes the case where the microphone 61 and the speaker 62 are included in the speech synthesizer 100. Alternatively, the audio processing unit 60 may be provided with an input terminal and an output terminal, with an external microphone connected to the input terminal and an external speaker connected to the output terminal, each via an audio cable. This embodiment also describes the case where the audio signals input from the microphone 61 and output to the speaker 62 are analog audio signals, but digital audio data may be input and output instead. In that case, the audio processing unit 60 does not need to perform A/D or D/A conversion. The same applies to the operation unit 30 and the display unit 40: an external output terminal may be provided and an external monitor connected to it.

The storage unit 20 is storage means for storing various data, for example an HDD or a nonvolatile memory. As illustrated, the storage unit 20 has a Timbre database 21, a phoneme template database 22, a singing score data storage area 23, a singing voice data storage area 24, and an analysis result data storage area 25. The Timbre database 21 is a collection of speech parameters for different phoneme names and pitches. The control unit 10 refers to this database when it synthesizes speech from singing score data. The speech parameters can be classified into four kinds, for example: the envelope of the excitation waveform spectrum, the excitation resonance, the formants, and the difference spectrum. These four speech parameters are obtained by decomposing the spectral envelope of the harmonic components (the original spectrum) obtained by analyzing an actual human voice or the like (the original voice). The voice at a given time can be expressed by a set of speech parameters (excitation spectrum, excitation resonance, formants, and difference spectrum), and even for the same voice, the speech parameters that express it differ if the pitch differs. The Timbre database 21 is indexed by phoneme name and pitch. The control unit 10 can therefore read the speech parameters at a given time t using the data belonging to the phoneme track and the pitch track of the singing score data as keys.

The phoneme template database 22 stores phoneme template data. Phoneme template data is applied to the transition sections between one phoneme and the next in the singing score data. When a person utters two phonemes in succession, the sound does not change abruptly but shifts gradually. For example, when the vowel “e” is pronounced immediately after the vowel “a” with no break, “a” is pronounced first and the sound passes through an intermediate pronunciation between “a” and “e” before becoming “e”. To synthesize singing so that the joints between phonemes sound natural, it is therefore preferable to hold, in some form, speech information on the joint portion for each combination of phonemes that can occur in a given language. With this in mind, the amounts of variation in the speech parameters and in the pitch over a phoneme transition section are prepared as template data, and applying this template data to the phoneme transition sections in the singing score data yields synthesized speech that is closer to actual singing.

The phoneme template data consists of a sequence of digital values obtained by sampling, at a fixed interval Δt, the speech parameter P and the pitch variation Pitch, each expressed as a function of time t, together with the section length T (sec.) of the speech parameter P and the pitch variation Pitch. It can be expressed by equation (A) below, where t = 0, Δt, 2Δt, 3Δt, ..., T.
[Equation 1]
Template = [P(t), Pitch(t), T] ... (A)
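
Equation (A) can be pictured as a small record holding the two sampled sequences and the section length. The following is a minimal sketch, assuming P(t) is a list of numbers per sample and that T is a multiple of Δt; the names `PhonemeTemplate` and `sample_template` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PhonemeTemplate:
    """Illustrative container for Template = [P(t), Pitch(t), T]."""
    params: List[List[float]]  # P(t) sampled at t = 0, dt, 2*dt, ..., T
    pitch_delta: List[float]   # Pitch(t) sampled at the same instants
    length_sec: float          # section length T in seconds


def sample_template(p: Callable[[float], List[float]],
                    pitch: Callable[[float], float],
                    length_sec: float,
                    dt: float) -> PhonemeTemplate:
    """Sample both functions every dt seconds, including both endpoints."""
    n = int(round(length_sec / dt)) + 1
    times = [i * dt for i in range(n)]
    return PhonemeTemplate([p(t) for t in times],
                           [pitch(t) for t in times],
                           length_sec)


# Example: a 0.1 s transition sampled every 25 ms gives five samples.
tpl = sample_template(lambda t: [t, 1.0 - t], lambda t: 2.0 * t, 0.1, 0.025)
```

Storing T alongside the samples lets a consumer stretch or compress the template to fit a transition section of a different length, although how the apparatus does this is not described here.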

The singing score data storage area 23 stores singing score data representing a melody composed of a sequence of phonemes. The singing score data includes attribute data (phoneme data, sounding timing data, pitch data, and the like) representing the features of each phoneme (its sounding timing, the change in pitch over time, its phoneme identity, and the like).

FIG. 2 is a conceptual diagram showing an example of the content of singing score data. The singing score data consists of a plurality of tracks: a phoneme track and a pitch track. The phoneme track records phoneme data representing the phonemes and sounding timing data indicating the sounding start timing and sounding end timing of each phoneme. In the example shown in FIG. 2, the phoneme “sa” is sounded from time t1 to time t2, and the phoneme “i” is sounded from time t2 to time t3. In the following, for convenience of explanation, the “sounding start timing” and the “sounding end timing” are referred to collectively as the “sounding timing” when there is no need to distinguish between them. The pitch track records pitch data indicating the change over time in the fundamental frequency (pitch) of the voice to be sounded at each time.
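
The two-track layout described above can be sketched as a pair of lists with time-based lookups. This is only an illustration under stated assumptions: times are in seconds, pitch is stored as (time, Hz) breakpoints held constant until the next breakpoint, and the names `PhonemeEvent`, `SingingScore`, `phoneme_at`, and `pitch_at` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PhonemeEvent:
    phoneme: str
    start: float  # sounding start timing (s)
    end: float    # sounding end timing (s)


@dataclass
class SingingScore:
    phoneme_track: List[PhonemeEvent]
    pitch_track: List[Tuple[float, float]]  # (time in s, pitch in Hz), sorted by time

    def phoneme_at(self, t: float) -> Optional[str]:
        """Phoneme sounding at time t, or None if nothing is sounding."""
        for ev in self.phoneme_track:
            if ev.start <= t < ev.end:
                return ev.phoneme
        return None

    def pitch_at(self, t: float) -> Optional[float]:
        """Most recent pitch value at or before t (step interpolation)."""
        current = None
        for time, hz in self.pitch_track:
            if time > t:
                break
            current = hz
        return current


# "sa" from t1 = 0.0 to t2 = 0.5, then "i" from t2 = 0.5 to t3 = 1.0
score = SingingScore(
    phoneme_track=[PhonemeEvent("sa", 0.0, 0.5), PhonemeEvent("i", 0.5, 1.0)],
    pitch_track=[(0.0, 220.0), (0.5, 246.9)],
)
key = (score.phoneme_at(0.25), score.pitch_at(0.25))
```

A (phoneme, pitch) pair such as `key` is the kind of value the description says is used to read speech parameters for a time t from the Timbre database 21.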

The singing score data may be stored in advance in the singing score data storage area 23 of the storage unit 20, or may be generated by the control unit 10 executing a predetermined application program in response to a user operation. The singing score data is an example of lyrics data indicating lyrics and of musical score data associated with the lyrics.
FIG. 3 shows an example of a screen displayed on the display unit 40 when the control unit 10 performs singing score data generation processing. The control unit 10 displays a screen such as the one illustrated in FIG. 3 and prompts the user to input singing score data. In the figure, a singing score data editing screen 600 has an event display area 601 that displays note data in piano roll format. A scroll bar 606 for scrolling the event display area 601 up and down is provided on its right side. A scroll bar 607 for scrolling the event display area 601 left and right is provided below it.

A keyboard display 602 imitating a piano keyboard (a coordinate axis indicating pitch) is displayed on the left side of the event display area 601, and a measure display 604 indicating the measure position from the beginning of the piece is displayed above the event display area 601. Reference numeral 603 denotes a piano roll display area, which displays each item of note data as a horizontally long rectangle (bar) at the pitch indicated by the keyboard display 602 and the time position indicated by the measure display 604. The left end of the bar indicates the utterance start timing, the length of the bar indicates the utterance duration, and the right end of the bar indicates the utterance end timing.

The user moves the mouse pointer to the position on the display screen corresponding to the desired pitch and time position and clicks to specify the utterance start position. The user then drags to form a note data bar (hereinafter referred to as a “note bar”) extending from the utterance start position to the utterance end position in the event display area 601, and releases the mouse button. For example, to form the note bar 611, the user positions the mouse pointer at the start of the first beat of the 53rd measure, clicks, and drags to a point one beat later.

In this way, the user inputs singing score data using the operation unit 30 while checking the screen displayed on the display unit 40. The control unit 10 generates singing score data according to the signals output from the operation unit 30 and stores the generated singing score data in the singing score data storage area 23.

The singing voice data storage area 24 of the storage unit 20 stores voice data representing a voice waveform, for example in WAVE or MP3 (MPEG Audio Layer-3) format, that represents the singing voice sung by the user (hereinafter referred to as “singing voice data”). The analysis result data storage area 25 stores analysis result data (attribute data) indicating the result of the control unit 10 analyzing the singing voice data for a plurality of attributes. In this embodiment, the control unit 10 analyzes the singing voice data to detect the pitch, power, and spectrum of the voice, and stores data indicating the detection results in the analysis result data storage area 25 as analysis result data.

Next, an example of the functional configuration of the speech synthesizer 100 is described with reference to the block diagram shown in FIG. 4. In FIG. 4, a speech synthesis unit 11, an analysis unit 12, a singing score data correction unit 13, and a confirmation sound generation unit 14 are realized by the CPU of the control unit 10 reading a computer program stored in the ROM or the storage unit 20, loading it into the RAM, and executing it. The CPU of the control unit 10 is an example of the speech synthesis unit 11, the analysis unit 12, the singing score data correction unit 13, and the confirmation sound generation unit 14. The speech synthesis unit 11 reads singing score data from the singing score data storage area 23 and generates, from the read singing score data, speech waveform data representing the corresponding speech waveform. More specifically, in this embodiment, the speech synthesis unit 11 refers to the pitch data, sounding timing data, phoneme data, and the like included in the singing score data, reads the speech parameters corresponding to the pitch and the phoneme from the Timbre database 21 while referring to the phoneme template database 22, and generates digital speech waveform data using the read speech parameters. The speech synthesis unit 11 also performs various control processes, such as starting and stopping singing synthesis and specifying the tempo, but these are the same as in conventional singing synthesis techniques and their detailed description is omitted here. In the following, for convenience of explanation, the speech waveform data generated from singing score data is referred to as “synthesized speech data”.

The synthesized speech represented by the synthesized speech data generated by the speech synthesis unit 11 may sound mechanical and unnatural. Even when it does not sound unnatural, the user may want to adjust it to a desired way of singing (intonation and the like). In this embodiment, therefore, the control unit 10 takes in a singing voice from the user and uses this singing voice to correct the synthesized speech data.

The analysis unit 12 analyzes the singing voice data for a plurality of attributes including pitch, and outputs analysis result data indicating the analysis result. In this embodiment, the analysis unit 12 analyzes the voice data and detects its pitch, power, and spectrum. The spectrum is detected using, for example, an FFT (Fast Fourier Transform). The analysis unit 12 stores the data indicating the analysis result in the analysis result data storage area 25.
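
The analysis step above can be sketched for a single frame of samples. This is a minimal illustration and not the patent's method: the description only says that pitch, power, and spectrum are detected and that an FFT may be used for the spectrum. Here, as an assumption, pitch is estimated by a plain autocorrelation search and power is taken as the RMS amplitude; the spectrum is omitted for brevity. The name `analyze_frame` and the search range `fmin`/`fmax` are hypothetical.

```python
import math
from typing import List, Tuple


def analyze_frame(frame: List[float], sample_rate: int,
                  fmin: float = 80.0, fmax: float = 1000.0) -> Tuple[float, float]:
    """Return (pitch_hz, rms_power) for one non-empty frame of samples."""
    # Power: root mean square of the samples.
    power = math.sqrt(sum(x * x for x in frame) / len(frame))

    # Pitch: lag with the largest autocorrelation within [fmin, fmax].
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    pitch = sample_rate / best_lag if best_lag else 0.0  # 0.0 means no pitch found
    return pitch, power


# A 200 Hz sine with amplitude 0.5, sampled at 8 kHz for 50 ms.
sr = 8000
frame = [0.5 * math.sin(2.0 * math.pi * 200.0 * n / sr) for n in range(400)]
pitch_hz, rms = analyze_frame(frame, sr)
```

Running such a function over successive frames of the singing voice data would give per-frame pitch and power sequences of the kind the later correction and confirmation-sound steps consume.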

The singing score data correction unit 13 corrects the pitch data and the sounding timing data included in the singing score data based on the analysis result data. The singing score data correction unit 13 has a receiving unit 131 that receives the analysis result data, a second receiving unit 132 that receives the singing score data (lyrics data and musical score data), and an association unit 133 that associates the analysis result data received by the receiving unit 131 with the singing score data received by the second receiving unit 132 and, based on the result of that association, associates the lyrics data with the pitch data. More specifically, the association unit 133 first obtains the correspondence between the synthesized speech and the user's singing voice based on the singing score data and the analysis result data. The voice represented by the singing voice data (hereinafter the “singing voice”) and the voice represented by the synthesized speech data (hereinafter the “synthesized speech”) may be offset from each other in time. For example, if the user deliberately shifts where a phrase starts or ends when singing, the singing voice runs ahead of or behind the synthesized speech. So that the two can be associated even when they are offset in this way, time normalization (DTW: Dynamic Time Warping), which stretches and compresses the time axis of the synthesized speech data, is performed to align the time axes of the two. In this embodiment, DP (dynamic programming), for example, may be used as the technique for performing this DTW.
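
The DTW alignment mentioned above can be sketched with the textbook dynamic programming recurrence. This is a minimal sketch under assumptions, not the patent's implementation: the two inputs are taken to be per-frame pitch sequences, the local cost is the absolute difference, and the names `dtw_path`, `synth`, and `sung` are hypothetical.

```python
from typing import List, Tuple


def dtw_path(a: List[float], b: List[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Align sequences a and b; return (total cost, list of (i, j) index pairs)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimum cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# The sung melody reaches the second note one frame earlier than the synthesized one.
synth = [220.0, 220.0, 247.0, 262.0]
sung = [220.0, 247.0, 247.0, 262.0]
total, path = dtw_path(synth, sung)
```

Each pair in `path` links a frame of the synthesized speech to the frame of the singing voice it corresponds to, which is the correspondence the later pitch and timing corrections rely on.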

The association unit 133 corrects the singing score data based on the detected differences. More specifically, the association unit 133 corrects the pitch data and the sounding timing data that make up the singing score data in the direction that removes the difference between the synthesized speech data and the singing voice data. For pitch, based on the pitch of the singing voice data, the pitch of the synthesized speech data, and the corresponding points between the singing voice and the synthesized speech, the association unit 133 corrects the value of the pitch data included in the singing score data so that the difference between the pitch of the singing voice and the corresponding pitch of the synthesized speech becomes smaller. As for the amount of correction, the value of the pitch data may, for example, be corrected so that the pitch of the synthesized speech matches the pitch of the singing voice, or so that the difference between the two becomes roughly half of the detected difference. It may also be corrected so that the difference between the pitch of the singing voice and the pitch of the synthesized speech is no greater than a predetermined threshold. In short, it suffices for the association unit 133 to correct the value of the pitch data included in the singing score data so that the difference between the pitch of the synthesized speech and the pitch of the singing voice becomes smaller.
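
The three correction amounts described above (full match, roughly half, within a threshold) can be written as simple arithmetic on one pitch value. The sketch below assumes that pitches are compared in Hz and that the pitch of the synthesized speech equals the score's pitch data value; neither assumption is stated in the description, and the function names are hypothetical.

```python
import math


def correct_pitch(score_hz: float, sung_hz: float, ratio: float = 1.0) -> float:
    """Move the score pitch toward the sung pitch.

    ratio=1.0 makes the two match; ratio=0.5 leaves half of the original gap.
    """
    return score_hz + ratio * (sung_hz - score_hz)


def correct_pitch_to_threshold(score_hz: float, sung_hz: float,
                               threshold_hz: float) -> float:
    """Move the score pitch only as far as needed to bring the gap within the threshold."""
    diff = sung_hz - score_hz
    if abs(diff) <= threshold_hz:
        return score_hz  # already close enough, leave unchanged
    return sung_hz - math.copysign(threshold_hz, diff)


matched = correct_pitch(220.0, 230.0)                        # gap removed
halved = correct_pitch(220.0, 230.0, ratio=0.5)              # gap halved
limited = correct_pitch_to_threshold(220.0, 230.0, 2.0)      # gap reduced to 2 Hz
```

A smaller `ratio` keeps more of the original score and only nudges it toward the user's singing, which may be preferable when the sung input is itself imprecise.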

The association unit 133 also corrects the value of the sounding timing data included in the singing score data so that the difference between the sounding timing detected from the singing voice data and the sounding timing detected from the synthesized speech data becomes smaller. The amount of this correction is handled in the same way as for the pitch correction described above; for example, the value of the sounding timing data may be corrected so that the sounding timing of the synthesized speech matches that of the singing voice. The association unit 133 updates the content of the singing score data storage area 23 with the singing score data in which each item of attribute data has been corrected. The singing score data stored in the singing score data storage area 23 is referred to when the speech synthesis unit 11 performs speech synthesis processing.

Between the time the user inputs a singing voice and the time the synthesized speech corrected by that singing voice is played back, the processing by the singing score data correction unit 13 and the speech synthesis unit 11 described above must be carried out. Because this processing takes a certain amount of time, the user has to wait until it finishes. If the user corrects the synthesized speech repeatedly, the user has to wait after every correction in order to check the corrected speech, and editing of the synthesized speech may not proceed smoothly. In this embodiment, therefore, in response to a user operation, the confirmation sound generation unit 14 generates and outputs a sound signal based on the pitch indicated by the analysis result data (hereinafter referred to as a “confirmation sound signal”), which makes it easy to check the analysis result of the input voice.

The confirmation sound generation unit 14 receives the analysis result data generated by the analysis unit 12 and generates a periodic confirmation sound signal based on the pitch indicated by the received analysis result data. In this embodiment, the confirmation sound generation unit 14 generates a sine wave whose frequency corresponds to the pitch indicated by the analysis result data. The confirmation sound generation unit 14 supplies the generated confirmation sound signal to the audio processing unit 60 and causes the speaker 62 to emit a sound corresponding to the signal (hereinafter referred to as a “confirmation sound”).
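
A sine-wave confirmation sound of this kind can be sketched as follows. The sketch assumes that the analysis result is available as one pitch value (Hz) and one volume value (linear amplitude, 0 to 1) per fixed-length frame, and that a pitch of 0 marks an unvoiced frame to be rendered as silence. These representations, the default `sample_rate` and `frame_sec`, and the name `confirmation_tone` are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def confirmation_tone(pitch_hz: Sequence[float], volume: Sequence[float],
                      sample_rate: int = 44100,
                      frame_sec: float = 0.01) -> List[float]:
    """Render per-frame pitch and volume as a sine wave, one frame after another."""
    samples_per_frame = int(round(sample_rate * frame_sec))
    two_pi = 2.0 * math.pi
    phase = 0.0  # carried across frames so that pitch changes do not cause clicks
    out: List[float] = []
    for f0, amp in zip(pitch_hz, volume):
        step = two_pi * f0 / sample_rate  # phase advance per sample
        for _ in range(samples_per_frame):
            out.append(amp * math.sin(phase) if f0 > 0 else 0.0)
            phase = (phase + step) % two_pi
    return out


# One voiced frame at 440 Hz followed by one unvoiced frame, at 8 kHz.
tone = confirmation_tone([440.0, 0.0], [0.5, 0.5], sample_rate=8000, frame_sec=0.01)
```

The work per output sample is one sine evaluation and one multiplication, with no database lookup, alignment, or spectral processing, which is consistent with the description's point that the confirmation sound takes far less computation than full singing synthesis.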

<Operation>
FIG. 5 is a flowchart showing the flow of the synthesized speech correction process performed by the speech synthesizer 100. When the user issues an instruction to edit the synthesized speech via the operation unit 30 (step S100; Yes), the control unit 10 first waits for a singing voice to be input (step S102; No). When the user inputs a singing voice (step S102; Yes), the control unit 10 analyzes the input singing voice and generates analysis result data indicating the analysis result (step S104).

Next, the control unit 10 determines, according to a user operation, whether to play back the confirmation sound (step S106). For example, the control unit 10 may display a button for generating the confirmation sound on the display unit 40 and determine that the confirmation sound is to be played back when the button is clicked. When it is determined that the confirmation sound is not to be played back (step S106; No), the control unit 10 proceeds to step S110 without performing step S108. When it is determined that the confirmation sound is to be played back (step S106; Yes), the control unit 10 generates a periodic confirmation sound signal based on the pitch indicated by the analysis result data (step S108) and causes the speaker 62 to emit the sound represented by the generated signal.

The analysis result of a voice contains subtle pitch changes, and it can be hard to grasp what such changes actually sound like without listening to them. The confirmation sound played back in step S108 is not the synthesized speech that is finally generated, but it does express the pitch of that synthesized speech, so by listening to it the user can grasp intuitively what kind of speech will be generated. The process of generating the confirmation sound (that is, the process performed by the confirmation sound generation unit 14) requires less computation than the process of generating the synthesized speech (that is, the process performed by the singing score data correction unit 13 and the speech synthesis unit 11 described above) and completes in a short time, so the user does not have to wait each time the analysis result is checked.

Returning to the description of FIG. 5, the user can choose whether to generate the synthesized speech or to input the singing voice again. The user makes this choice by operating the operation unit 30, and the control unit 10 determines, according to the user's operation, whether to generate the synthesized speech (step S110). When it is determined that the synthesized speech is to be generated (step S110; Yes), the control unit 10 performs the processing of the singing score data correction unit 13 and the speech synthesis unit 11 described above to generate synthesized speech data (step S112). That is, the control unit 10 corrects the singing score data based on the analysis result data and generates synthesized speech data from the corrected singing score data by referring to the Timbre database 21 and the phoneme template database 22. When it is determined that the singing voice is to be input again (step S110; No), the control unit 10 returns to step S100 and waits for a correction instruction to be input.
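
The flow of steps S102 to S112 can be sketched as a simple loop. The Python below is only an outline under stated assumptions: the function run_edit_session and its six callable parameters are hypothetical stand-ins for the microphone input, the analysis unit, the user's choices on the operation unit, the confirmation sound generation unit, and the speech synthesis unit. The wait for an edit instruction in step S100 is folded into the call to record.

```python
def run_edit_session(record, analyze, wants_preview, play_preview,
                     wants_synthesis, synthesize):
    """Outline of the correction flow of FIG. 5 (all callables are hypothetical)."""
    while True:
        voice = record()                  # S100/S102: wait for a singing voice
        analysis = analyze(voice)         # S104: generate analysis result data
        if wants_preview():               # S106: user asked for the confirmation sound?
            play_preview(analysis)        # S108: cheap, plays almost immediately
        if wants_synthesis():             # S110: generate the synthesized speech?
            return synthesize(analysis)   # S112: the expensive step, run only once accepted
        # S110; No: loop back and let the user sing again
```

The point of this arrangement is that the expensive call to synthesize runs only after the user has accepted a take, while every rejected take costs only the analysis and the confirmation sound.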

<Modifications>
The embodiment described above can be modified as follows. The following modifications may also be combined as appropriate.

<Modification 1>
In the embodiment described above, the control unit 10 generates, as the confirmation sound, a sine wave whose frequency corresponds to the pitch indicated by the analysis result data, but the confirmation sound signal generated by the control unit 10 is not limited to this. For example, the control unit 10 may generate a sine wave whose frequency corresponds to the pitch indicated by the analysis result data and whose amplitude corresponds to the volume (power) indicated by the analysis result data. The control unit 10 may also apply a predetermined modulation process to a sine wave whose frequency corresponds to the indicated pitch so as to distort the waveform. The control unit 10 may also use, as the confirmation sound signal, a sound signal obtained by combining the frequency component corresponding to the indicated pitch with specific harmonic components of it, such as the second and third harmonics. Further, the control unit 10 may generate the confirmation sound signal by using the following equation (B) to combine the harmonic components, up to the n-th harmonic, of the frequency component F0 corresponding to the pitch indicated by the analysis result. In equation (B), POW is the power, and a is either a constant or a value that imitates formants, derived from the spectral peak information obtained by analyzing the singing voice data. When a is a constant, a confirmation sound signal resembling humming is generated; when a value imitating formants is used as a, a confirmation sound signal resembling the user's singing voice is generated.
[Equation 2]
Σ sin(n·F0) * (a·POW) …(B)
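
One possible reading of equation (B) is sketched below in Python. Equation (B) does not state the time dependence or whether a varies per harmonic, so the following are assumptions of this example: the k-th harmonic is taken as sin(2π·k·F0·t), and a is supplied as one weight per harmonic. The function name harmonic_sample and its parameters are hypothetical.

```python
import math


def harmonic_sample(t, f0_hz, power, weights):
    """One sample of a harmonic sum in the spirit of equation (B).

    t       : time in seconds
    f0_hz   : fundamental frequency F0 in Hz
    power   : POW from the analysis result
    weights : assumed per-harmonic values of a; weights[0] applies to the
              fundamental, weights[1] to the second harmonic, and so on
    """
    return sum(
        math.sin(2.0 * math.pi * k * f0_hz * t) * (a_k * power)
        for k, a_k in enumerate(weights, start=1)
    )
```

Passing equal weights corresponds to the case where a is a constant, and passing weights shaped like a spectral envelope corresponds to the case where a imitates formants.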

As described above, the confirmation sound signal may be a sine wave whose frequency corresponds to the pitch indicated by the analysis result data, or it may be, for example, a sound signal obtained by combining the frequency component corresponding to that pitch with its harmonic components. In short, it suffices that the control unit 10 generates a periodic sound signal based on the pitch indicated by the analysis result data. Although the embodiment described above uses a sine wave as the confirmation sound signal, the confirmation sound signal is not limited to this and may be a sound signal with another simple waveform, such as a triangular wave or a rectangular wave. A confirmation sound signal representing the timbre of a musical instrument may also be generated using a known technique such as musical instrument sound synthesis. The confirmation sound signal may be any sound signal that is generated from the analysis result data of the singing voice data by a process with a light processing load.
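
The simple waveforms mentioned above differ only in how one cycle is shaped. The following Python sketch, with the hypothetical function periodic_sample, shows one conventional way to compute a sine, rectangular, or triangular sample from a phase measured in cycles. The particular phase alignment of each shape is a choice made for this example.

```python
import math


def periodic_sample(phase_cycles, shape="sine"):
    """Return one sample in [-1, 1] for a phase given in cycles.

    The integer part of phase_cycles is ignored, so the result repeats
    every full cycle.
    """
    x = phase_cycles % 1.0
    if shape == "sine":
        return math.sin(2.0 * math.pi * x)
    if shape == "square":
        # Rectangular wave: high for the first half cycle, low for the second.
        return 1.0 if x < 0.5 else -1.0
    if shape == "triangle":
        # Triangular wave: +1 at the start of the cycle, -1 at the midpoint.
        return 4.0 * abs(x - 0.5) - 1.0
    raise ValueError("unknown shape: " + shape)
```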

<Modification 2>
In the embodiment described above, the control unit 10 corrects the pitch data and the sounding timing data included in the singing score data, but the attribute data to be corrected is not limited to these. For example, the control unit 10 may detect a difference in sound quality or voice quality and correct the sound quality or voice quality. In this case, the singing score data may include sound quality data and voice quality data indicating the sound quality and voice quality, and the control unit 10 may detect formants from the singing voice data and from the synthesized speech data and correct the sound quality data and voice quality data so that the difference between the detected formants becomes smaller.

Thus, the attribute data representing the voice attributes that the control unit 10 corrects may be the pitch data indicating the temporal change in pitch or the sounding timing data described in the embodiment above, or it may be phoneme data, sound quality data, or voice quality data. Other examples include data representing the velocity (strength) of a sound and data representing the manner of vibrato. The attribute data corrected by the control unit 10 may be any data that represents an attribute of the voice.

The embodiment described above deals with the case where the control unit 10 corrects the singing score data based on the analysis result of the singing voice data, but this is not a limitation. Speech synthesis may instead be performed using the analysis result of the singing voice data itself as the singing score data.

In the embodiment described above, the control unit 10 analyzes the singing voice data for a plurality of attributes including pitch and generates analysis result data indicating the analysis result. The control unit 10 need not analyze a plurality of attributes, however, and may analyze only the pitch and generate attribute data indicating the analyzed pitch.

<Modification 3>
In the embodiment described above, the control unit 10 reads the singing score data from the singing score data storage area 23, but the way in which the speech synthesis unit 11 acquires the singing score data is not limited to this. For example, the singing score data may be received via a communication network such as the Internet. Alternatively, the user may perform an operation for inputting singing score data using the operation unit 30, and the control unit 10 may generate the singing score data according to the signal output from the operation unit 30. Any arrangement may be used as long as the control unit 10 acquires the singing score data.

In the embodiment described above, singing score data is used as the lyric data and score data, but the structure of the lyric data and score data is not limited to the one illustrated there. Data of any structure may be used as long as the correspondence between notes and lyrics and the attributes of the notes can be identified. The embodiment describes an example in which the lyrics (character string) and the score data are separate data sets, but the lyrics may be part of the score data.

The details of the speech synthesis process are likewise not limited to those described in the embodiment. Any process may be used as long as, given notes and phonetic symbols (characters), it synthesizes speech corresponding to those notes and phonetic symbols.

In the embodiment described above, the synthesized speech is corrected by inputting the singing voice again (see steps S112 to S102 in FIG. 5), but the way of correcting the synthesized speech is not limited to this. For example, the user may perform an operation for correcting the singing score data using the operation unit 30, and the control unit 10 may correct the singing score data according to the content of that operation.
In the embodiment described above, the control unit 10 analyzes the singing voice of a singer, but it may instead evaluate the sound of a musical instrument played by a performer. The term "voice" as used in the present embodiment covers various kinds of sound, including voices produced by humans and the sounds of musical instruments being played.

<Modification 4>
In the embodiment described above, the control unit 10 analyzes the singing voice data and generates analysis result data indicating the analysis result. The control unit 10 need not generate the analysis result data itself, however, and may instead acquire the analysis result from another device (for example, a server device connected via a communication network). In this case, the analysis result data acquired by the control unit 10 may be data indicating a plurality of attributes including pitch, or data indicating only the pitch.

<Modification 5>
The hardware configuration of the speech synthesizer 100 is not limited to that described with reference to FIG. 1. The speech synthesizer 100 may have any hardware configuration that can implement the functions shown in FIG. 4. For example, the speech synthesizer 100 may have dedicated hardware (circuits) corresponding to each of the functional elements shown in FIG. 4.

<Modification 6>
In the embodiment described above, two or more devices connected via a communication network may share the functions of the speech synthesizer 100, so that a system comprising those devices realizes the speech synthesizer 100 of the embodiment. For example, the system may consist of a computer device that includes a microphone, a speaker, a display device, an operation unit, and the like, and a server device that performs the voice analysis processing, connected to each other via a communication network. In this case, for example, the computer device may convert the sound picked up by the microphone into an audio signal and transmit it to the server device, and the server device may analyze the received audio signal and transmit the analysis result to the computer device.

<Modification 7>
Besides a speech synthesizer, the present invention can also be understood as a method for realizing these functions or as a program for causing a computer to realize a speech synthesis function. Such a program may be provided on a recording medium, such as an optical disc, on which it is stored, or may be provided by being downloaded to a computer via the Internet or the like and then installed for use.

DESCRIPTION OF SYMBOLS
10: control unit, 20: storage unit, 21: Timbre database, 22: phoneme template database, 23: singing score data storage area, 24: singing voice data storage area, 25: analysis result data storage area, 30: operation unit, 40: display unit, 60: sound processing unit, 61: microphone, 62: speaker, 70: bus, 100: speech synthesizer

Claims

A speech synthesizer comprising:
a receiving unit that receives attribute data indicating attributes of speech data, the attributes including pitch and volume;
a speech synthesis unit that generates synthesized speech based on the attribute data received by the receiving unit; and
a sound signal generation unit that, before the speech synthesis unit generates the synthesized speech, generates a periodic sound signal based on the pitch and volume indicated by the attribute data received by the receiving unit, the sound signal generation unit generating the periodic sound signal by processing that takes less time than the processing by which the speech synthesis unit generates the synthesized speech.

The speech synthesizer, further comprising:
a second receiving unit that receives lyric data indicating lyrics and score data associated with the lyrics; and
an associating unit that associates the pitch indicated by the attribute data received by the receiving unit with the score data received by the second receiving unit and, based on the result of that association, associates the lyric data with pitch data representing the pitch,
wherein the speech synthesis unit generates the synthesized speech based on the attribute data received by the receiving unit and on the lyric data and pitch data associated by the associating unit.

The speech synthesizer according to claim 1 or 2, further comprising a speech analysis unit that analyzes the speech data for attributes including pitch and volume and supplies attribute data indicating the analysis result to the receiving unit.

A program for causing a computer to realize:
a receiving function of receiving attribute data indicating attributes of speech data, the attributes including pitch and volume;
a speech synthesis function of generating synthesized speech based on the received attribute data; and
a sound signal generation function of generating, before the speech synthesis function generates the synthesized speech, a periodic sound signal based on the pitch and volume indicated by the received attribute data, the sound signal generation function generating the periodic sound signal by processing that takes less time than the processing by which the speech synthesis function generates the synthesized speech.