JP5136128B2

JP5136128B2 - Speech synthesizer

Info

Publication number: JP5136128B2
Application number: JP2008062706A
Authority: JP
Inventors: 卓朗曽根
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2008-03-12
Filing date: 2008-03-12
Publication date: 2013-02-06
Anticipated expiration: 2028-03-12
Also published as: JP2009217141A

Description

The present invention relates to a speech synthesizer.

Techniques have been proposed for synthesizing a singing voice based on a human voice from an input melody and lyrics. For example, Patent Document 1 proposes a technique that uses spectral modeling synthesis (SMS): SMS analysis is performed on phonemes or chains of two or more phonemes to build a database, and a singing voice is synthesized by connecting the SMS data of the required phonemes or phoneme chains. Patent Documents 2 to 5 propose techniques for performing more natural singing synthesis.
JP 2002-202790 A; JP 2002-202788 A; JP 2003-323188 A; JP 2004-264676 A; JP 2004-4440 A

Synthesized singing generated by singing synthesis can sound mechanical and unnatural. Users may also want to adjust the intonation and voice quality of the synthesized singing themselves to suit their preferences. For this reason, some systems let the user adjust parameter values to control pitch bend and velocity, apply various effects, and so on, so that the synthesized singing becomes the voice the user wants.

However, adjusting such parameter values often depends on rules of thumb, and the user has to repeat trial and error to obtain the desired synthesized singing. Inexperienced users in particular often find it difficult to reach the desired result.
The present invention has been made against this background, and its object is to provide a technique that allows a synthesized singing sound to be easily corrected to the form the user desires.

To solve the above problem, the present invention provides a speech synthesizer comprising: singing score data acquisition means for acquiring singing score data that represents a melody composed of a sequence of phonemes and that includes feature data representing features of each phoneme; first speech waveform data acquisition means for acquiring first speech waveform data representing a speech waveform; second speech waveform data generation means for generating, from the singing score data acquired by the singing score data acquisition means, second speech waveform data representing a speech waveform corresponding to that singing score data; association means for associating the first speech waveform data and the second speech waveform data with each other in the time-axis direction; first feature detection means for analyzing the first speech waveform data and detecting the features according to the analysis result; second feature detection means for analyzing the second speech waveform data and detecting the features according to the analysis result; feature data correction means for correcting, according to the association result of the association means, the feature data included in the singing score data acquired by the singing score data acquisition means so that the difference, at corresponding locations, between the features of the first speech waveform data detected by the first feature detection means and the features of the second speech waveform data detected by the second feature detection means becomes smaller; third speech waveform data generation means for generating, from the singing score data corrected by the feature data correction means, third speech waveform data representing a speech waveform corresponding to that singing score data; and output means for outputting the third speech waveform data generated by the third speech waveform data generation means.
In a preferred aspect of the present invention, the speech synthesizer may further comprise: second association means for associating the first speech waveform data and the third speech waveform data generated by the third speech waveform data generation means with each other in the time-axis direction; third feature detection means for analyzing the third speech waveform data and detecting the features according to the analysis result; and second feature data correction means for correcting, according to the association result of the second association means, the feature data included in the singing score data corrected by the feature data correction means so that the difference, at corresponding locations, between the features of the first speech waveform data detected by the first feature detection means and the features of the third speech waveform data detected by the third feature detection means becomes smaller. In this case, the third speech waveform data generation means may generate, as the third speech waveform data, speech waveform data representing a speech waveform corresponding to the singing score data corrected by the feature data correction means or by the second feature data correction means.

In a further preferred aspect of the present invention, the features may include at least one of: the pronunciation timing of each phoneme constituting the melody, the temporal change in pitch, the phoneme identity of each phoneme constituting the melody, the speech spectrum, the volume, the sound quality, and the voice quality.

In a further preferred aspect of the present invention, the speech synthesizer may comprise singing score data acquisition control means that, when the singing score data corrected by the feature data correction means satisfies a predetermined condition, supplies that singing score data to the singing score data acquisition means.

In a further preferred aspect of the present invention, the singing score data may be divided into a plurality of time sections and may include section correspondence data indicating the correspondence among those time sections. For at least one of the time sections, the feature data correction means corrects, according to the association result of the association means, the feature data included in the singing score data acquired by the singing score data acquisition means so that the difference, at corresponding locations, between the features of the first speech waveform data detected by the first feature detection means and the features of the second speech waveform data detected by the second feature detection means becomes smaller. Based on the section correspondence data, it may also correct the feature data included in the singing score data for the other time sections corresponding to that time section, using the same manner of correction as in that time section.

In a further preferred aspect of the present invention, the first speech waveform data acquisition means may acquire, as the first speech waveform data, speech data representing sound picked up by sound pickup means.
In a further preferred aspect of the present invention, the feature data correction means may correct the feature data included in the singing score data acquired by the singing score data acquisition means in a plurality of different correction modes.
In a further preferred aspect of the present invention, the feature data correction means may correct the feature data included in the singing score data acquired by the singing score data acquisition means so that the difference, at corresponding locations, between the features of the first speech waveform data detected by the first feature detection means and the features of the second speech waveform data detected by the second feature detection means becomes approximately half, or becomes a predetermined threshold value.
In a preferred aspect of the present invention, the speech synthesizer may comprise selection means for selecting one of the plurality of correction modes according to information output from a user interface, and the third speech waveform data generation means may generate the third speech waveform data from singing score data that includes the feature data corrected in the correction mode selected by the selection means.
In another preferred aspect of the present invention, the first feature detection means and the second feature detection means may detect the phoneme identity by executing at least one of formant detection, cepstrum detection, and speech recognition processing.
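The idea of reducing the model-versus-synthesis difference to roughly half can be illustrated with a small sketch. It is not taken from the patent: the function name `correct_feature`, the `ratio` parameter, and the pitch values are hypothetical, and it assumes the three sequences are already time-aligned and that a change in a score value carries over one-to-one into the synthesized feature.

```python
def correct_feature(score_values, model_values, synth_values, ratio=0.5):
    """Shift each score feature toward the model voice.

    ratio=1.0 aims to remove the detected difference completely,
    ratio=0.5 aims to leave roughly half of it (assumed linear response).
    """
    if not (len(score_values) == len(model_values) == len(synth_values)):
        raise ValueError("sequences must be aligned to the same length")
    corrected = []
    for score, model, synth in zip(score_values, model_values, synth_values):
        difference = model - synth          # detected feature difference
        corrected.append(score + ratio * difference)
    return corrected


# Hypothetical pitch values in Hz at three aligned positions.
pitch_score = [440.0, 440.0, 494.0]
pitch_model = [446.0, 452.0, 494.0]
pitch_synth = [440.0, 440.0, 494.0]

half = correct_feature(pitch_score, pitch_model, pitch_synth, ratio=0.5)
full = correct_feature(pitch_score, pitch_model, pitch_synth, ratio=1.0)
```

Producing several candidates with different `ratio` values is one possible way to obtain the "plurality of different correction modes" from which a user could choose.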

According to the present invention, a synthesized singing sound can be easily corrected to the form the user desires.

<A: Configuration>
FIG. 1 is a block diagram illustrating the hardware configuration of a speech synthesizer 1 according to an embodiment of the invention. The speech synthesizer 1 is a device that performs singing synthesis (speech synthesis) from data representing a melody and lyrics (hereinafter "singing score data") using a database created in advance. In the figure, a CPU (Central Processing Unit) 11 reads a computer program stored in a ROM (Read Only Memory) 12 or a storage unit 14, loads it into a RAM (Random Access Memory) 13, and executes it, thereby controlling each unit of the speech synthesizer 1. The storage unit 14 is storage means for the computer programs executed by the CPU 11 and for various data, and is, for example, a hard disk device. The storage unit 14 may instead be a CD-ROM device, a magneto-optical disk (MO) device, a digital versatile disc (DVD) device, or the like. The display unit 15 includes a liquid crystal display or the like and, under the control of the CPU 11, displays various screens such as a menu screen for operating the speech synthesizer 1. The operation unit 16 includes a mouse and a keyboard and outputs a signal corresponding to the user's operation. The microphone 17 picks up sound and outputs an audio signal (analog signal) representing the picked-up sound. The audio processing unit 18 includes a DAC and an ADC; it converts the audio signal (analog signal) output from the microphone 17 into digital data by A/D conversion and outputs the digital data to the CPU 11. The audio processing unit 18 also converts digital data supplied from the CPU 11 into an analog signal by D/A conversion and supplies it to the speaker 19. The speaker 19 emits sound with an intensity corresponding to the analog signal output from the audio processing unit 18.

This embodiment describes the case where the microphone 17 and the speaker 19 are included in the speech synthesizer 1. Alternatively, the audio processing unit 18 may be provided with an input terminal and an output terminal, with an external microphone connected to the input terminal via an audio cable. Similarly, an external speaker may be connected to the output terminal via an audio cable. This embodiment also describes the case where the audio signal input from the microphone 17 to the audio processing unit 18 and the audio signal output from the audio processing unit 18 to the speaker 19 are analog audio signals, but digital audio data may be input and output instead. In that case, the audio processing unit 18 does not need to perform A/D or D/A conversion. The same applies to the display unit 15: an external output terminal may be provided and an external monitor connected to it.

As shown in the figure, the storage unit 14 has a Timbre database 141, a phoneme template database 142, a singing score data storage area 143, a corrected singing score data storage area 144, and a model voice data storage area 145. The Timbre database 141 is a collection of voice parameters for different phoneme names and pitches. The CPU 11 refers to this database when performing speech synthesis from singing score data. The voice parameters can be classified into, for example, four types: the envelope of the excitation waveform spectrum, the excitation resonance, the formants, and the difference spectrum. These four voice parameters are obtained by decomposing the spectral envelope (original spectrum) of the harmonic component obtained by analyzing actual human speech or the like (original speech). The voice at a given time can be expressed by a set of voice parameters (excitation spectrum, excitation resonance, formants, and difference spectrum), and even for the same voice, the voice parameters expressing it differ if the pitch differs. The Timbre database 141 is indexed by phoneme name and pitch. The CPU 11 can therefore read out the voice parameters at a given time t by using the data belonging to the phoneme track and the pitch track of the singing score data as keys.
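A database indexed by phoneme name and pitch, as described for the Timbre database 141, could be modelled as follows. This is a minimal sketch under stated assumptions: the class and field names are hypothetical, the parameter values are placeholders, and choosing the entry with the nearest stored pitch is an illustrative policy that the patent text does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceParams:
    """One set of the four voice parameters named in the text."""
    excitation_envelope: tuple
    excitation_resonance: tuple
    formants: tuple
    difference_spectrum: tuple


class TimbreDatabase:
    """Voice parameters keyed by phoneme name and pitch (hypothetical layout)."""

    def __init__(self):
        self._entries = {}  # phoneme name -> list of (pitch_hz, VoiceParams)

    def add(self, phoneme, pitch_hz, params):
        self._entries.setdefault(phoneme, []).append((pitch_hz, params))

    def lookup(self, phoneme, pitch_hz):
        # Assumption: return the entry whose stored pitch is closest.
        candidates = self._entries[phoneme]
        return min(candidates, key=lambda entry: abs(entry[0] - pitch_hz))[1]


# Placeholder parameter sets for the phoneme "a" at two pitches.
low = VoiceParams((1.0,), (2.0,), (700.0, 1200.0), (0.0,))
high = VoiceParams((1.5,), (2.5,), (750.0, 1250.0), (0.1,))
db = TimbreDatabase()
db.add("a", 220.0, low)
db.add("a", 440.0, high)
params_at_t = db.lookup("a", 400.0)  # keys come from the phoneme and pitch tracks
```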

The phoneme template database 142 stores phoneme template data. This phoneme template data is applied to the transition sections between one phoneme and the next in the singing score data. When a person utters two phonemes in succession, the sound does not change abruptly but shifts gradually. For example, when the vowel "e" is pronounced immediately after the vowel "a" without a break, "a" is pronounced first and then changes to "e" by way of a sound lying between "a" and "e". Therefore, to synthesize singing in which the joints between phonemes sound natural, it is preferable to hold, in some form, speech information about the joining portion for each combination of phonemes that can occur in a given language. With this in mind, the variation of the voice parameters and of the pitch in a phoneme transition section is prepared as template data, and applying this template data to the phoneme transition sections in the singing score data yields synthesized speech closer to actual singing.

The phoneme template data consists of a sequence of digital values obtained by sampling the voice parameter P and the pitch variation Pitch, both expressed as functions of time t, at a constant interval Δt, together with the section length T (sec.) of the voice parameter P and the pitch Pitch. It can be expressed by the following formula (A), where t = 0, Δt, 2Δt, 3Δt, ..., T.
[Equation 1]
Template = [P(t), Pitch(t), T] ... (A)
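Formula (A) can be read as a record holding two sampled sequences and a section length. The sketch below illustrates that reading; the names `PhonemeTemplate` and `sample_template`, the linear interpolation between two formant sets, and all numeric values are hypothetical and serve only as an example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PhonemeTemplate:
    params: List[List[float]]  # P(t) sampled at t = 0, dt, 2dt, ..., T
    pitch: List[float]         # Pitch(t) sampled at the same instants
    length: float              # section length T in seconds


def sample_template(param_fn: Callable[[float], List[float]],
                    pitch_fn: Callable[[float], float],
                    length: float, dt: float) -> PhonemeTemplate:
    """Sample P(t) and Pitch(t) every dt seconds over [0, length]."""
    count = int(round(length / dt)) + 1
    times = [i * dt for i in range(count)]
    return PhonemeTemplate([param_fn(t) for t in times],
                           [pitch_fn(t) for t in times],
                           length)


# Hypothetical transition from "a" to "e": two formant frequencies move
# linearly, and the pitch variation is zero throughout.
T = 0.1
formants_a = [800.0, 1200.0]
formants_e = [500.0, 1900.0]
template = sample_template(
    lambda t: [a + (e - a) * t / T for a, e in zip(formants_a, formants_e)],
    lambda t: 0.0,
    length=T,
    dt=0.025,
)
```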

Next, the singing score data storage area 143 stores singing score data that represents a melody composed of a sequence of phonemes and that includes feature data (phoneme data, pronunciation timing data, pitch data, and so on) representing the features of each phoneme (the pronunciation timing of each phoneme, the temporal change in pitch, the phoneme identity of each phoneme, and so on).

FIG. 2(a) is a conceptual diagram showing an example of the content of singing score data. The singing score data consists of a plurality of tracks: a phoneme track and a pitch track. The phoneme track records phoneme data representing each phoneme, together with pronunciation timing data indicating the pronunciation start timing and pronunciation end timing of each phoneme. In the example shown in FIG. 2(a), the phoneme "sa" is sounded from time t1 to time t2, and the phoneme "i" is sounded from time t2 to time t3. In the following, for convenience, "pronunciation start timing" and "pronunciation end timing" are both referred to as "pronunciation timing" when they need not be distinguished. The pitch track records pitch data indicating the temporal change in the fundamental frequency (pitch) of the voice to be sounded at each time.
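The two-track layout of FIG. 2(a) can be sketched as a small data structure. The class names, the step-wise pitch lookup, and the concrete times and frequencies are assumptions made for illustration; the patent only states what each track records.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class PhonemeEvent:
    phoneme: str   # phoneme data, e.g. "sa"
    start: float   # pronunciation start timing (seconds)
    end: float     # pronunciation end timing (seconds)


@dataclass
class SingingScore:
    phoneme_track: list = field(default_factory=list)  # list of PhonemeEvent
    pitch_track: list = field(default_factory=list)    # sorted (time, hz) pairs

    def phoneme_at(self, t):
        """Return the phoneme sounding at time t, or None."""
        for event in self.phoneme_track:
            if event.start <= t < event.end:
                return event.phoneme
        return None

    def pitch_at(self, t):
        """Return the most recent pitch value at or before time t, or None."""
        times = [point[0] for point in self.pitch_track]
        index = bisect_right(times, t) - 1
        return self.pitch_track[index][1] if index >= 0 else None


# "sa" from t1 to t2 and "i" from t2 to t3, with hypothetical values.
t1, t2, t3 = 0.0, 0.5, 1.0
score = SingingScore(
    phoneme_track=[PhonemeEvent("sa", t1, t2), PhonemeEvent("i", t2, t3)],
    pitch_track=[(t1, 440.0), (t2, 494.0)],
)
```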

The singing score data may be stored in advance in the singing score data storage area 143 of the storage unit 14, or may be generated by the CPU 11 executing a predetermined application program in response to a user operation.
FIG. 2(b) shows an example of a screen displayed on the display unit 15 when the CPU 11 performs the singing score data generation process. The CPU 11 displays a screen such as the one illustrated in FIG. 2(b) and prompts the user to input singing score data. In the figure, the singing score data editing screen 600 includes an event display area 601 that displays note data in piano roll format. A scroll bar 606 for scrolling the event display area 601 up and down is provided on its right side, and a scroll bar 607 for scrolling it left and right is provided below it.

On the left side of the event display area 601, a keyboard display 602 imitating a piano keyboard (a coordinate axis indicating pitch) is shown, and above the event display area 601, a measure display 604 indicating the measure position from the beginning of the piece is shown. Reference numeral 603 denotes a piano roll display area, which displays note data as horizontally long rectangles (bars) at the pitch indicated by the keyboard display 602 and the time position indicated by the measure display 604. The left end of a bar indicates the utterance start timing, the length of the bar indicates the utterance duration, and the right end of the bar indicates the utterance end timing.

The user moves the mouse pointer to the position on the screen corresponding to the desired pitch and time position and clicks to specify the utterance start position. The user then drags to form, in the event display area 601, a note data bar (hereinafter "note bar") running from the utterance start position to the utterance end position, and releases the mouse button. For example, to form the note bar 611, the user positions the mouse pointer at the start of the first beat of the 53rd measure, presses the mouse button, and drags to one beat later.

In this way, the user inputs singing score data with the operation unit 16 while checking the screen displayed on the display unit 15. The CPU 11 generates singing score data according to the signal output from the operation unit 16 and stores the generated singing score data in the singing score data storage area 143.

Next, the corrected singing score data storage area 144 of the storage unit 14 stores corrected singing score data, which the CPU 11 generates by applying the singing score data correction process to the singing score data. The singing score data correction process executed by the CPU 11 is described later, so its details are omitted here.

Next, the model voice data storage area 145 of the storage unit 14 stores voice data (first speech waveform data) that represents a speech waveform in, for example, WAVE format or MP3 (MPEG Audio Layer-3) format, and that represents a singing voice sung by the user or another person. In the following, for convenience, the voice data stored in the model voice data storage area 145 is called "model voice data". The model voice data is preferably data representing a singing voice that matches the user's preferences (a favorite way of singing, a favorite way of adding intonation, and so on).

Next, an example of the functional configuration of the speech synthesizer 1 is described with reference to the block diagram shown in FIG. 3. By executing the singing synthesis program stored in the ROM 12 or the storage unit 14, the CPU 11 serves as the singing synthesis unit 111, the voice reproduction unit 112, the singing comparison unit 113, and the singing score data correction unit 114.

The singing synthesis unit 111 reads singing score data from the singing score data storage area 143 and generates, from that data, speech waveform data (second speech waveform data) representing the speech waveform corresponding to the singing score data. More specifically, in this embodiment the singing synthesis unit 111 refers to the pitch data, pronunciation timing data, phoneme data, and so on included in the singing score data, reads the voice parameters corresponding to the pitch and phoneme from the Timbre database 141 while referring to the phoneme template database 142, and generates digital speech waveform data using the voice parameters it has read. The singing synthesis unit 111 also performs various control processes such as starting and stopping singing synthesis and specifying the tempo; these are the same as in conventional singing synthesis techniques, so their details are omitted here. In the following, for convenience, the speech waveform data generated from the singing score data is called "synthesized voice data".

The synthesized voice represented by the synthesized voice data generated by the singing synthesis unit 111 may be mechanical and unnatural. Even when it is not unnatural, the user may want to correct it to a preferred way of singing (intonation and so on). In this embodiment, therefore, the synthesized voice data is corrected by the processing of the singing score data correction unit 114 described below.

The voice reproduction unit 112 reads the model voice data stored in the model voice data storage area 145 and reproduces it. The singing comparison unit 113 compares the model voice data with the singing voice data (that is, the synthesized voice data), detects deviations in singing timing and deviations in pitch and pitch change (curve) between the two, and outputs data representing the detected differences to the singing score data correction unit 114. In the example shown in FIG. 3, a mechanism and an operation system for controlling the singing synthesis unit 111 and the voice reproduction unit 112 in synchronization are actually required, but they are omitted to keep the figure simple.

The processing performed by the singing comparison unit 113 is described in detail below with reference to the drawings. First, the singing comparison unit 113 detects the pitch, power, and spectrum of the model voice data and of the synthesized voice data in units of frames of a predetermined time length. The spectrum is detected using, for example, an FFT (Fast Fourier Transform).
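Frame-wise detection of power and spectrum can be sketched as follows. This is only an illustration under stated assumptions: a naive discrete Fourier transform stands in for the FFT so that the example needs no external library, the strongest spectral bin is used as a crude stand-in for pitch detection (the patent does not say how pitch is detected), and the function names, frame length, and test signal are hypothetical.

```python
import cmath
import math


def split_frames(samples, frame_len):
    """Cut the signal into consecutive, non-overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def magnitude_spectrum(frame):
    """Magnitude of the DFT for bins 0..N/2 (naive O(N^2) transform)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        spectrum.append(abs(total))
    return spectrum


def frame_power(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def analyze(samples, frame_len, sample_rate):
    """Return power, spectrum, and peak frequency for every frame."""
    results = []
    for frame in split_frames(samples, frame_len):
        spectrum = magnitude_spectrum(frame)
        # Crude pitch stand-in: frequency of the strongest non-DC bin.
        peak_bin = max(range(1, len(spectrum)), key=spectrum.__getitem__)
        results.append({
            "power": frame_power(frame),
            "spectrum": spectrum,
            "peak_hz": peak_bin * sample_rate / len(frame),
        })
    return results


# Two frames of a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(128)]
frames = analyze(signal, frame_len=64, sample_rate=sample_rate)
```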

The singing comparison unit 113 also determines the correspondence between the two on the basis of the detected spectra. The voice represented by the model voice data (hereinafter "model voice") and the voice represented by the synthesized voice data (hereinafter "synthesized voice") may be shifted in time relative to each other. For example, when the model singer deliberately shifts the start or end of a phrase, the model voice and the synthesized voice are offset forward or backward in time. So that the two can be associated even in such cases, time normalization (DTW: Dynamic Time Warping), which stretches and compresses the time axis of the synthesized voice data, is performed to bring the two time axes into agreement. In this embodiment, DP (dynamic programming) is used as the method for performing DTW. Specifically, the processing is as follows.

The singing comparison unit 113 forms a coordinate plane such as that shown in FIG. 4 (hereinafter the "DP plane") in the RAM 13. The vertical axis of the DP plane corresponds to the parameters (cepstra) obtained by applying an inverse Fourier transform to the logarithm of the absolute value of the spectrum of each frame of the model voice data, and the horizontal axis corresponds to the parameters (cepstra) obtained in the same way from each frame of the synthesized voice data. In FIG. 4, a1, a2, a3, ..., an are the frames of the model voice data arranged along the time axis, and b1, b2, b3, ..., bn are the frames of the synthesized voice data arranged along the time axis. The spacing of a1, a2, a3, ..., an on the vertical axis and the spacing of b1, b2, b3, ..., bn on the horizontal axis both correspond to the time length of one frame. Each lattice point of the DP plane is associated with a DP matching score, which is a value indicating the Euclidean distance between the corresponding parameter of a1, a2, a3, ... and the corresponding parameter of b1, b2, b3, .... For example, the lattice point located by a1 and b1 is associated with a value indicating the Euclidean distance between the parameter obtained from the first frame of the model voice data and the parameter obtained from the first frame of the synthesized voice data. After forming a DP plane of this structure, the singing comparison unit 113 searches all paths from the lattice point located by a1 and b1 (the start) to the lattice point located by an and bn (the end), accumulates for each path the DP matching scores of the lattice points it passes through between start and end, and finds the smallest accumulated value. The path with the smallest accumulated DP matching score is taken as the measure of expansion and contraction when the time axis of the frames of the synthesized voice data is stretched or compressed to match the time axis of the model voice data.
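The search for the path with the smallest accumulated DP matching score can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function names `euclidean` and `dp_match` are hypothetical, each frame is assumed to be given as a plain list of parameter values, and the set of allowed moves (right, up, diagonal) is an assumption, since the description does not state it. Rather than enumerating every path explicitly, the sketch uses the usual dynamic-programming recurrence, which yields the same minimum.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def dp_match(model, synth):
    """Return (smallest accumulated score, path) on the DP plane.

    model[i] and synth[j] are per-frame parameter vectors (for example
    cepstra). The path is a list of (i, j) lattice points running from
    (0, 0) to (len(model) - 1, len(synth) - 1).
    Assumption: a path may step right, up, or diagonally.
    """
    n, m = len(model), len(synth)
    # DP matching score at every lattice point.
    score = [[euclidean(a, b) for b in synth] for a in model]
    acc = [[math.inf] * m for _ in range(n)]   # accumulated scores
    back = [[None] * m for _ in range(n)]      # predecessor of each point
    acc[0][0] = score[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append((acc[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:
                candidates.append((acc[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((acc[i][j - 1], (i, j - 1)))
            best, prev = min(candidates)
            acc[i][j] = best + score[i][j]
            back[i][j] = prev
    # Trace the best path back from the end point to the start point.
    path = []
    node = (n - 1, m - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    path.reverse()
    return acc[n - 1][m - 1], path
```

For three model frames `[[0.0], [1.0], [2.0]]` and four synthesized frames `[[0.0], [1.0], [1.0], [2.0]]`, the sketch pairs the second model frame with both the second and third synthesized frames, which is the same situation as the a2, b2, b3 case in the description.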

The singing comparison unit 113 then identifies, on the DP plane, the path whose accumulated DP matching score is smallest, and performs alignment processing, that is, processing that stretches and compresses the time axis of the synthesized voice data according to the identified path. Specifically, it rewrites the time stamp of each frame of the synthesized voice data so that the DP matching score of each lattice point on the identified path represents the Euclidean distance between parameters obtained from frames that occupy the same position on the time axis, and then associates the frames that share a position on the time axis with one another as a set, one after another. For example, the path drawn on the DP plane of FIG. 4 advances from the start point located by a1 and b1 to the lattice point located by a2 and b2, diagonally up and to the right. Because frames a2 and b2 occupy the same position on the time axis from the outset, the time stamp of frame b2 need not be rewritten. The path then advances from the lattice point located by a2 and b2 to the lattice point located by a2 and b3, immediately to its right. In this case frame b3, like frame b2, must occupy the same position on the time axis as frame a2, so the time stamp paired with frame b3 is replaced with one that is one frame earlier. As a result, frame a2 and frames b2 and b3 are associated as a set of frames that share a position on the time axis. This replacement of time stamps and association of frames is carried out over the whole frame range from b1 to bn. Consequently, even if the sounding timing of the synthesized singing and the sounding timing of the model voice differ, frames (phonemes) that share a position on the aligned time axis can be associated with each other. This is how DP matching works.
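The rewriting of time stamps along the identified path can be illustrated with a short sketch. It rests on several assumptions not stated in the description: the path is given as a list of (model frame index, synthesized frame index) pairs counted from zero, a time stamp is simply the frame index multiplied by the frame length, and a synthesized frame that appears more than once on the path keeps its first pairing. The name `align_timestamps` is hypothetical.

```python
def align_timestamps(path, frame_len):
    """Rewrite the time stamps of synthesized frames along a DP path.

    path      : list of (i, j) lattice points, where i indexes model
                frames and j indexes synthesized frames.
    frame_len : time length of one frame (any unit, e.g. milliseconds).

    Returns (stamps, groups):
      stamps[j] is the rewritten time stamp of synthesized frame j,
      groups[i] lists the synthesized frames associated with model frame i.
    Assumption: if frame j occurs several times on the path, only its
    first pairing is used.
    """
    stamps = {}
    groups = {}
    for i, j in path:
        if j not in stamps:
            # Give frame j the time-axis position of model frame i.
            stamps[j] = i * frame_len
            groups.setdefault(i, []).append(j)
    return stamps, groups
```

With the path `[(0, 0), (1, 1), (1, 2), (2, 3)]` and a frame length of 10, the third synthesized frame (index 2) moves from 20 to 10, one frame earlier, and is grouped with the second synthesized frame under the second model frame.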

FIG. 5 shows an example of the association between the model voice and the synthesized voice. FIG. 5(a) is an example of a graph of the change over time in the pitch of the synthesized voice, and FIG. 5(b) is an example of a graph of the change over time in the pitch of the model voice. The figure shows the sounding timing t11 of the synthesized voice associated with the sounding timing t21 of the model voice, and the sounding timing t12 of the synthesized voice associated with the sounding timing t22 of the model voice.

Returning to the description of FIG. 3, the singing score data correction unit 114 corrects the singing score data on the basis of the differences detected by the singing comparison unit 113. More specifically, it corrects the pitch data and the sounding timing data that make up the singing score data in the direction that eliminates the difference between the synthesized voice data and the model voice data. For pitch, the singing score data correction unit 114 uses the pitch of the model voice data, the pitch of the synthesized voice data, and the corresponding positions of the model voice and the synthesized voice to correct the values of the pitch data included in the singing score data so that the difference between the pitch of the model voice and the corresponding pitch of the synthesized voice becomes smaller. As for the amount of correction, the pitch data may, for example, be corrected so that the pitch of the synthesized voice matches the pitch of the model voice, or so that the difference between the two becomes roughly half of the detected difference. Alternatively, the pitch data may be corrected so that the difference between the pitch of the model voice and the pitch of the synthesized voice is at or below a predetermined threshold. In short, it suffices for the singing score data correction unit 114 to correct the values of the pitch data included in the singing score data so that the difference between the pitch of the synthesized voice and the pitch of the model voice becomes smaller.
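The three correction amounts mentioned above (match, roughly half, within a threshold) can be sketched for a single aligned position as follows. The function name `correct_pitch`, the mode names, and the choice of unit are hypothetical. The sketch also makes a simplifying assumption that is not stated in the description: changing the pitch value in the score by some amount changes the synthesized pitch by the same amount.

```python
import math


def correct_pitch(score_pitch, model_pitch, synth_pitch,
                  mode="match", threshold=0.0):
    """Return a corrected pitch-data value for one aligned position.

    mode "match"     : remove the whole detected difference.
    mode "half"      : remove half of the detected difference.
    mode "threshold" : remove just enough that the remaining difference
                       is no larger than `threshold`.
    Assumption: the synthesized pitch follows the score value one to one.
    """
    diff = model_pitch - synth_pitch
    if mode == "match":
        delta = diff
    elif mode == "half":
        delta = diff / 2
    elif mode == "threshold":
        if abs(diff) <= threshold:
            delta = 0.0  # already close enough, leave the score as it is
        else:
            delta = diff - math.copysign(threshold, diff)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return score_pitch + delta
```

For a score value of 440.0, a model pitch of 446.0 and a synthesized pitch of 440.0, the three modes give 446.0, 443.0, and (with a threshold of 2.0) 444.0 respectively.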

The singing score data correction unit 114 also corrects the values of the sounding timing data included in the singing score data so that the difference between the sounding timing detected from the model voice data and the sounding timing detected from the synthesized voice data becomes smaller. The amount of correction is chosen in the same way as for the pitch correction described above; for example, the sounding timing data may be corrected so that the sounding timing of the synthesized voice matches the sounding timing of the model voice.
FIG. 6 shows an example of the content of the corrected singing score data. As shown in the figure, the pitch and the sounding start timing and sounding end timing of each phoneme are corrected according to the model voice.
The singing score data correction unit 114 stores the singing score data whose feature data have been corrected in the corrected singing score data storage area 144 as corrected singing score data.

<B: Operation>
Next, the operation of this embodiment will be described. When the user uses the operation unit 16 to request correction of the singing score data, the CPU 11 first performs the processing of the singing synthesis unit 111 described above in response to the signal output from the operation unit 16. That is, the CPU 11 generates synthesized voice data from the singing score data stored in the singing score data storage area 143, referring to the Timbre database 141 and the phoneme template database 142.

Next, the CPU 11 performs the processing of the voice playback unit 112 and the singing comparison unit 113 described above. That is, the CPU 11 reads the model voice data from the model voice data storage area 145 and plays it back, associates the model voice data being played back with the synthesized voice data in the time axis direction, and detects and compares the features of each voice (pitch, sounding timing of each phoneme, and so on).

Next, the CPU 11 performs the processing of the singing score data correction unit 114 described above. That is, on the basis of the comparison result, the CPU 11 corrects the feature data included in the singing score data so that the difference between the model voice and the synthesized voice becomes smaller. The CPU 11 stores the singing score data whose feature data have been corrected in the corrected singing score data storage area 144 as corrected singing score data.

When the CPU 11 has finished the singing score data correction processing, the user uses the operation unit 16 to request singing synthesis. In response to the signal output from the operation unit 16, the CPU 11 generates, from the singing score data stored in the corrected singing score data storage area 144, synthesized voice data (third voice waveform data) representing the voice waveform corresponding to that singing score data. This processing is the same as that performed by the singing synthesis unit 111 described above, so a detailed description is omitted here.
The CPU 11 supplies the generated synthesized voice data to the voice processing unit 18, which causes the speaker 19 to emit it as sound. The speaker 19 thus emits the voice represented by the singing score data corrected on the basis of the model voice.

As described above, according to this embodiment, the difference between the model voice and the synthesized voice is analyzed automatically and the synthesized voice data is corrected automatically, so that synthesized singing of the desired quality can be generated more easily. That is, according to this embodiment, the CPU 11 generates synthesized voice data from the singing score data, compares the generated synthesized voice data with model voice data representing an actual singing voice, and corrects the singing score data according to the comparison result so that the difference between the two becomes smaller. Because the model voice data used for the comparison represents an actual singing voice, the synthesized singing represented by the corrected singing score data is closer to actual singing and therefore more natural.

In addition, in this embodiment, using as the model voice data a voice sung in a manner that suits the user's taste (intonation, singing technique, and so on) makes the synthesized singing represented by the resulting corrected singing score data closer to the user's taste. Thus, according to this embodiment, the user can make the synthesized singing suit his or her own taste simply by preparing suitable model voice data, without laborious work such as adjusting various parameters.

<C: Modifications>
An embodiment of the present invention has been described above, but the present invention is not limited to that embodiment and can be carried out in various other forms. Examples are given below. The following aspects may be combined as appropriate.
(1) In the embodiment described above, the CPU 11 corrects the singing score data and stores the result in the corrected singing score data storage area 144 as corrected singing score data. The invention is not limited to this; as illustrated in FIG. 7, the CPU 11 may instead overwrite the singing score data storage area 143. In this case, the CPU 11 may carry out the correction again using the corrected singing score data.
In the example shown in FIG. 7, the CPU 11 stores the corrected singing score data in the singing score data storage area 143. The CPU 11 then reads the corrected singing score data from the singing score data storage area 143 (that is, acquires the corrected singing score data), generates synthesized voice data from it, compares the generated synthesized voice data with the model voice data again, and corrects the singing score data again using the comparison result.
Repeating the correction of the singing synthesis parameters in this way brings the singing score data closer to the model voice and improves the singing quality.
For example, if the user's own singing voice is stored as the model voice, repeated correction brings the synthesized singing closer to the user's singing voice.
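The repeated cycle of synthesis, comparison, and correction in modification (1) might look like the following loop. This is a toy sketch under loudly stated assumptions: `toy_synthesize` is a hypothetical stand-in for the singing synthesis unit that simply renders every note 2.0 units flat, the model and synthesized pitches are assumed to be already aligned note by note, each round removes half of the detected difference, and the stopping rule (tolerance and maximum number of rounds) is not taken from the description.

```python
def toy_synthesize(score_pitches):
    """Hypothetical synthesizer: renders every note 2.0 units flat."""
    return [p - 2.0 for p in score_pitches]


def refine(score, model, synthesize, tolerance=1.0, max_rounds=10):
    """Repeat synthesis, comparison, and correction of the pitch data.

    Returns (corrected score, number of correction rounds performed).
    Assumption: score, model, and the synthesized output are aligned
    lists of the same length.
    """
    score = list(score)
    for rounds in range(max_rounds):
        synth = synthesize(score)                      # generate synthesized voice
        diffs = [m - s for m, s in zip(model, synth)]  # compare with model voice
        if max(abs(d) for d in diffs) <= tolerance:
            return score, rounds                       # close enough, stop
        # Correct the score by half of each detected difference.
        score = [p + d / 2 for p, d in zip(score, diffs)]
    return score, max_rounds
```

Calling `refine([440.0, 494.0], [448.0, 490.0], toy_synthesize)` stops after four correction rounds. The corrected score ends up slightly above the model pitches because it compensates for the flat rendering of the toy synthesizer, which is the point of comparing against the synthesized voice rather than against the score itself.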

(2) In the embodiment described above, a user interface 115 may be provided, as illustrated in FIG. 8, so that the user can select the manner of correction. In this case, for example, the CPU 11 may generate several sets of feature data that differ in the degree of correction (for example, one that makes the pitch of the synthesized voice match the pitch of the model voice, one that halves the difference between the two, and so on) and display a list of the generated feature data on the display unit 15. The user selects the desired manner of correction from the displayed list, and the CPU 11 corrects the singing score data according to the selection.

(3) In the embodiment described above, model voice data representing a singing voice recorded in advance is stored beforehand in the model voice data storage area 145, and the CPU 11 reads the model voice data from that area. The invention is not limited to this; as illustrated in FIG. 9, a voice sung by the user may be input to the speech synthesizer 1 in real time. In the example shown in FIG. 9, the user's singing voice is picked up by the microphone 17, converted into a voice signal (voice data), and output to the singing comparison unit 113. The singing comparison unit 113 compares the voice data representing the voice picked up by the microphone 17 with the synthesized voice data generated by the singing synthesis unit 111.

(4) In the example shown in FIG. 9, accompaniment data may additionally be played back. FIG. 10 shows an example of the functional configuration of the speech synthesizer 1 when accompaniment data is played back. In this example, an accompaniment data storage area 146 (shown by a chain line in FIG. 1) for storing accompaniment data is provided in the storage unit 14, and accompaniment data is stored in this area in advance.
The accompaniment playback unit 118 reads the accompaniment data from the accompaniment data storage area 146 and plays it back, and the voice mixing unit 119 mixes the signal representing the accompaniment sound supplied from the accompaniment playback unit 118 with the voice signal supplied from the microphone 17 and outputs the result to the speaker 19. The speaker 19 thus emits the accompaniment sound together with the user's singing voice picked up by the microphone. Accompaniment playback and singing synthesis must be synchronized, which requires a control mechanism, but that mechanism is omitted from the drawings to keep them from becoming cluttered.
Playing back the accompaniment while the singing voice is being picked up in this way allows the user to sing in time with the synthesized singing represented by the singing score data.

(5) In the example shown in FIG. 10, the voice input through the microphone may additionally be recorded. FIG. 11 shows an example of the functional configuration of the speech synthesizer 1 in this case. In the example shown in FIG. 11, the voice recording and playback unit 117 stores the voice data output from the microphone 17 in the model voice data storage area 145. In this case, the speech synthesizer 1 can correct the singing score data repeatedly using the recorded model voice.

(6) In the embodiment described above, the CPU 11 corrects the pitch data and the sounding timing data included in the singing score data, but the feature data to be corrected are not limited to these. For example, the CPU 11 may detect mistakes in the lyrics. In this case, the CPU 11 detects the phonemic difference between the model voice data and the synthesized voice data and corrects the phoneme data of the singing score data so that the difference becomes smaller. The phonemic difference may be detected, for example, by detecting differences in formants or cepstra between the model voice data and the synthesized voice data, or by applying speech recognition to the model voice data to detect its phonemes.

(7) The CPU 11 may also detect differences in sound quality or voice quality and correct the sound quality or voice quality. In this case, the singing score data is configured to include sound quality data and voice quality data indicating the sound quality and voice quality, and the CPU 11 may detect formants from the model voice data and the synthesized voice data and correct the sound quality data and voice quality data so that the difference between the detected formants becomes smaller.

Thus, the feature data corrected by the CPU 11 may be the pitch data indicating the change in pitch over time or the sounding timing data described in the embodiment above, or may be phoneme data, sound quality data, or voice quality data. As a further example, it may be data representing the velocity (loudness) of a sound. In short, the feature data corrected by the CPU 11 may be of any kind as long as it indicates a feature of the melody or of the phonemes that make up the melody.

(8) In the embodiment described above, the singing score data may include data indicating the structure of the piece, and when correcting the singing score data the CPU 11 may correct the corresponding passages of the piece at the same time. For example, when a parameter at some point in the first verse is corrected, the corresponding points in the second and third verses may be corrected in the same way. In this case, the singing score data is divided into a plurality of time sections and is configured to include section correspondence data indicating the correspondence among those time sections. The CPU 11 corrects the values of the feature data included in the singing score data for at least one of the time sections in the same manner as in the embodiment described above. Then, on the basis of the section correspondence data of the singing score data, the CPU 11 corrects the values of the feature data included in the singing score data for the other time sections that correspond to the corrected time section, in the same manner as for that time section.
In this way, for example, the correction of the singing score data for the second and third verses can be completed as soon as the first verse has been corrected, which shortens the processing time needed for correction.
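Carrying a correction over to corresponding time sections can be sketched as below. The data layout is hypothetical: sections are given as index ranges into a flat list of per-note feature values, and the section correspondence data is a mapping from the corrected section to the sections that correspond to it. The sketch further assumes that corresponding sections contain the same number of notes and that "in the same manner" means applying the same per-note offset.

```python
def propagate_corrections(original, corrected, sections, correspondence):
    """Apply the corrections of one section to its corresponding sections.

    original, corrected : per-note feature values before and after the
                          correction of the source section(s).
    sections            : dict name -> (start, end) index range, end exclusive.
    correspondence      : dict source section name -> list of the names of
                          the sections that correspond to it.
    Assumption: corresponding sections have equal length, and the same
    per-note offset is applied to each of them.
    """
    result = list(corrected)
    for source, targets in correspondence.items():
        s0, s1 = sections[source]
        # Offset that the correction applied to each note of the source.
        offsets = [corrected[k] - original[k] for k in range(s0, s1)]
        for name in targets:
            t0, t1 = sections[name]
            if t1 - t0 != len(offsets):
                raise ValueError(f"section {name!r} differs in length")
            for n, offset in enumerate(offsets):
                result[t0 + n] = original[t0 + n] + offset
    return result
```

For example, with two three-note verses, correcting only the first verse and then calling the function with `{"verse1": ["verse2"]}` as the correspondence yields a score in which the second verse carries the same offsets as the first.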

(9) In the embodiment described above, the singing synthesis unit 111 reads the singing score data from the singing score data storage area 143, but the way in which the singing synthesis unit 111 acquires the singing score data is not limited to this. For example, the singing score data may be received over a communication network such as the Internet, or the user may use the operation unit 16 to enter the singing score data, with the CPU 11 generating the singing score data according to the signal output from the operation unit 16. Any arrangement may be used as long as the CPU 11 acquires the singing score data.

(10) In the embodiment described above, the speech synthesizer 1 corrects singing score data generated in advance by using the model voice data; instead, the singing score data may be generated from the model voice data. FIG. 12 shows an example of the functional configuration of the speech synthesizer 1 in this case. In the figure, the singing synthesis unit 111, the singing comparison unit 113, and the singing score data correction unit 114 are the same as those shown in FIG. 3 for the embodiment described above, so their description is omitted here. The singing analysis unit 116 analyzes the voice data output from the microphone 17 and detects the pitch in units of frames of a predetermined time length. The singing analysis unit 116 also analyzes the voice data output from the microphone 17 and detects the sounding timing of each phoneme. The singing analysis unit 116 then generates pitch data indicating the detected pitch and sounding timing data indicating the detected sounding timing, generates singing score data containing the generated pitch data and sounding timing data, and stores the generated singing score data in the singing score data storage area 143.
In this way, there is no need to prepare singing score data in advance; the voice input from the microphone 17 is analyzed and the singing score data is generated automatically.

In this aspect, the voice input to the microphone 17 may also be subjected to speech recognition to extract the lyrics (a sequence of phonemes), and phoneme data may be generated from the extraction result.
Also in this aspect, the voice input to the microphone 17 may be subjected to speech recognition to detect formants, and sound quality data and voice quality data may be generated.

(11) The program executed by the CPU 11 of the speech synthesizer 1 in the embodiment described above can be provided recorded on a computer-readable recording medium such as a magnetic recording medium (magnetic tape, magnetic disk, etc.), an optical recording medium (optical disc, etc.), a magneto-optical recording medium, or a semiconductor memory. The program can also be downloaded to the speech synthesizer 1 over a network such as the Internet.

A block diagram showing an example of the hardware configuration of the speech synthesizer.
A diagram showing an example of the content of the singing score data.
A diagram showing an example of a screen displayed on the display unit.
A block diagram showing an example of the functional configuration of the speech synthesizer.
A diagram showing DP matching.
A diagram showing an example of the correspondence between the model voice and the synthesized voice.
A diagram showing an example of the content of the corrected singing score data.
A block diagram showing an example of the functional configuration of the speech synthesizer.
A block diagram showing an example of the functional configuration of the speech synthesizer.
A block diagram showing an example of the functional configuration of the speech synthesizer.
A block diagram showing an example of the functional configuration of the speech synthesizer.
A block diagram showing an example of the functional configuration of the speech synthesizer.
A block diagram showing an example of the functional configuration of the speech synthesizer.

Explanation of symbols

1 ... speech synthesizer, 11 ... CPU, 12 ... ROM, 13 ... RAM, 14 ... storage unit, 15 ... display unit, 16 ... operation unit, 17 ... microphone, 18 ... voice processing unit, 19 ... speaker, 111 ... singing synthesis unit, 112 ... voice playback unit, 113 ... singing comparison unit, 114 ... singing score data correction unit, 115 ... user interface, 116 ... singing analysis unit, 117 ... voice recording and playback unit, 118 ... accompaniment playback unit, 119 ... voice mixing unit, 141 ... Timbre database, 142 ... phoneme template database, 143 ... singing score data storage area, 144 ... corrected singing score data storage area, 145 ... model voice data storage area, 146 ... accompaniment data storage area.

Claims

A speech synthesizer comprising:
singing score data acquiring means for acquiring singing score data that represents a melody composed of a sequence of phonemes and that includes feature data representing features of each phoneme;
first voice waveform data acquiring means for acquiring first voice waveform data representing a voice waveform;
second voice waveform data generating means for generating, from the singing score data acquired by the singing score data acquiring means, second voice waveform data representing a voice waveform corresponding to that singing score data;
association means for associating the first voice waveform data and the second voice waveform data with each other in a time axis direction;
first feature detecting means for analyzing the first voice waveform data and detecting the features according to the analysis result;
second feature detecting means for analyzing the second voice waveform data and detecting the features according to the analysis result;
feature data correcting means for correcting, in accordance with the association result of the association means, the feature data included in the singing score data acquired by the singing score data acquiring means so that the difference, at corresponding portions, between the features of the first voice waveform data detected by the first feature detecting means and the features of the second voice waveform data detected by the second feature detecting means becomes smaller;
third voice waveform data generating means for generating, from the singing score data corrected by the feature data correcting means, third voice waveform data representing a voice waveform corresponding to that singing score data;
output means for outputting the third voice waveform data generated by the third voice waveform data generating means;
second association means for associating the first voice waveform data and the third voice waveform data generated by the third voice waveform data generating means with each other in the time axis direction;
third feature detecting means for analyzing the third voice waveform data and detecting the features according to the analysis result; and
second feature data correcting means for correcting, in accordance with the association result of the second association means, the feature data included in the singing score data corrected by the feature data correcting means so that the difference between the features of the first voice waveform data detected by the first feature detecting means and the features of the third voice waveform data detected by the third feature detecting means becomes smaller,
wherein the third voice waveform data generating means generates, as the third voice waveform data, voice waveform data representing a voice waveform corresponding to the singing score data corrected by the feature data correcting means or to the singing score data corrected by the second feature data correcting means.
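
The correction loop recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that the only feature is a per-frame pitch value, that the time-axis association is done by dynamic time warping (DTW) on those pitch values, and that a hypothetical `synthesize` stand-in returns the score pitch unchanged. The names `dtw`, `synthesize`, `correct_score`, `refine`, and the `rate` parameter are all illustrative.

```python
def dtw(a, b):
    """Return (total cost, alignment path) between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = abs(a[i - 1] - b[j - 1]) + step
    # Backtrack from the end to recover which frames were associated.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


def synthesize(score_pitch):
    """Hypothetical stand-in for the synthesizer: one pitch value per frame."""
    return list(score_pitch)


def correct_score(score_pitch, reference_pitch, rate=1.0):
    """One correction pass: move the score toward the aligned reference."""
    synthesized = synthesize(score_pitch)
    _, path = dtw(reference_pitch, synthesized)
    sums = [0.0] * len(synthesized)
    counts = [0] * len(synthesized)
    for i, j in path:
        sums[j] += reference_pitch[i]
        counts[j] += 1
    corrected = []
    for j, value in enumerate(score_pitch):
        target = sums[j] / counts[j]  # mean of reference frames aligned to j
        corrected.append(value + rate * (target - synthesized[j]))
    return corrected


def refine(score_pitch, reference_pitch, passes=2, rate=0.5):
    """Repeat synthesis, association, and correction (second correction pass)."""
    for _ in range(passes):
        score_pitch = correct_score(score_pitch, reference_pitch, rate)
    return score_pitch
```

Calling `refine(score, reference, passes=2)` corresponds loosely to applying the feature data correcting means once and the second feature data correcting means once, with the result of each pass re-synthesized before the next comparison.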

The speech synthesizer according to claim 1, wherein the features include at least one of: the sounding timing of each phoneme constituting the melody, the temporal change in pitch, the phonology of each phoneme constituting the melody, the voice spectrum, the volume, the sound quality, and the voice quality.

The speech synthesizer according to claim 1 or 2, further comprising singing score data acquisition control means for supplying the singing score data corrected by the feature data correcting means to the singing score data acquiring means when that singing score data satisfies a predetermined condition.

The speech synthesizer according to any one of claims 1 to 3, wherein the singing score data is divided into a plurality of time sections and includes section correspondence data indicating a correspondence relationship among the plurality of time sections, and
the feature data correcting means corrects, for at least one of the plurality of time sections and in accordance with the association result of the association means, the feature data included in the singing score data acquired by the singing score data acquiring means so that the difference, at corresponding portions, between the features of the first voice waveform data detected by the first feature detecting means and the features of the second voice waveform data detected by the second feature detecting means becomes smaller, and additionally
corrects, based on the section correspondence data, the feature data included in the singing score data for another time section corresponding to that time section, in the same correction mode as was applied to that time section.
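
As a rough illustration of how a correction made in one time section might be carried over to its corresponding sections, the following Python sketch applies the same per-frame offsets to every section listed in the correspondence data. It assumes that corresponding sections hold the same number of frames and that "the same correction mode" can be approximated by reusing the offsets; the function name `propagate_correction` and the dictionary layout of `correspondence` are hypothetical.

```python
def propagate_correction(sections, corrected_index, corrected_section, correspondence):
    """Copy the correction applied to one section onto its corresponding sections.

    sections: list of per-section feature lists (for example, pitch per frame)
    corrected_index: index of the section that was corrected directly
    corrected_section: corrected feature values for that section
    correspondence: dict mapping a section index to the indices that correspond to it
    """
    original = sections[corrected_index]
    # Per-frame change introduced by the direct correction.
    offsets = [c - o for c, o in zip(corrected_section, original)]
    result = [list(section) for section in sections]
    result[corrected_index] = list(corrected_section)
    for other in correspondence.get(corrected_index, []):
        # Assumption: corresponding sections have the same frame count.
        result[other] = [v + d for v, d in zip(sections[other], offsets)]
    return result
```

For example, if the first and third sections are repeats of the same phrase, correcting the first section and passing `{0: [2]}` as the correspondence data updates the third section in the same way while leaving the second untouched.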

The speech synthesizer according to any one of claims 1 to 4, wherein the first voice waveform data acquiring means acquires, as the first voice waveform data, audio data representing sound collected by sound collecting means.

The speech synthesizer according to any one of claims 1 to 5, wherein the feature data correcting means corrects the feature data included in the singing score data acquired by the singing score data acquiring means in a plurality of different correction modes.

The speech synthesizer according to any one of claims 1 to 6, wherein the feature data correcting means corrects the feature data included in the singing score data acquired by the singing score data acquiring means so that the difference, at corresponding portions, between the features of the first voice waveform data detected by the first feature detecting means and the features of the second voice waveform data detected by the second feature detecting means becomes approximately half or becomes a predetermined threshold.
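
The two partial-correction targets named in this claim can be sketched for a single feature value as follows. This is only an illustration under the assumption that a change in the score value changes the synthesized feature by the same amount; the function names `halve_difference` and `limit_difference` are hypothetical.

```python
def halve_difference(score_value, reference_value, synthesized_value):
    """Move the score value so the remaining difference is about half."""
    diff = reference_value - synthesized_value
    return score_value + 0.5 * diff


def limit_difference(score_value, reference_value, synthesized_value, threshold):
    """Move the score value so the remaining difference equals the threshold.

    If the difference is already within the threshold, nothing is changed.
    """
    diff = reference_value - synthesized_value
    if abs(diff) <= threshold:
        return score_value
    excess = abs(diff) - threshold
    return score_value + excess if diff > 0 else score_value - excess
```

Stopping short of a full correction in this way keeps part of the original score's character while still moving the synthesized voice toward the reference.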

The speech synthesizer according to claim 6, further comprising selecting means for selecting one of the plurality of correction modes according to information output from a user interface,
wherein the third voice waveform data generating means generates the third voice waveform data from the singing score data including the feature data corrected in the correction mode selected by the selecting means.

The speech synthesizer according to claim 1, wherein the features include phonology, and
the first feature detecting means and the second feature detecting means detect the phonology by executing at least one of formant detection, cepstrum detection, and speech recognition processing.
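
Of the three options listed, cepstrum detection is the simplest to illustrate. The Python sketch below computes the real cepstrum of one frame (the inverse DFT of the log magnitude spectrum) with a naive transform from the standard library. It is an assumption-laden illustration, not the method of the specification: the function name `real_cepstrum`, the small constant added before the logarithm, and the use of a direct O(N²) transform are all choices made here for brevity.

```python
import cmath
import math


def real_cepstrum(frame):
    """Real cepstrum of a frame: inverse DFT of the log magnitude spectrum."""
    n = len(frame)
    spectrum = [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # A tiny constant guards against log(0) for silent frequency bins.
    log_magnitude = [math.log(abs(s) + 1e-12) for s in spectrum]
    return [
        sum(log_magnitude[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
        for q in range(n)
    ]
```

In typical use, the low-quefrency coefficients of such a cepstrum describe the spectral envelope, which could then be compared between the first and second voice waveform data to judge whether the same phoneme is being sounded.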