JP2010169889A

JP2010169889A - Voice synthesis device and program

Info

Publication number: JP2010169889A
Application number: JP2009012300A
Authority: JP
Inventors: Hiroshi Kayama; 啓嘉山
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2009-01-22
Filing date: 2009-01-22
Publication date: 2010-08-05
Anticipated expiration: 2029-01-22
Also published as: JP5176981B2

Abstract

PROBLEM TO BE SOLVED: To synthesize natural speech in unit-concatenation speech synthesis even when fewer speech units are stored in the database than in conventional methods.

SOLUTION: From song data indicating the notes that make up a song and lyrics data indicating its lyrics, a singing synthesis score (speech synthesis instruction) is created that indicates the speech units used to synthesize the singing voice of the song, their timing, and the pitch of the singing voice. When a speech unit indicated by the singing synthesis score includes a devoiced vowel, a predetermined waveform of an unvoiced pronunciation of a vowel of the target language is converted into a waveform whose spectral envelope is similar to that of the speech unit in which the vowel is voiced, and the converted waveform is used for synthesizing the singing voice as the waveform of the speech unit including the devoiced vowel.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to speech synthesis technology, and more particularly to unit-concatenation speech synthesis, in which speech is synthesized by connecting a plurality of speech units.

One example of this type of speech synthesis technology is unit-concatenation singing synthesis. In unit-concatenation singing synthesis, speech unit data defining the waveforms of the various speech units that serve as material for a singing voice, such as single phonemes and phoneme-to-phoneme transitions, is generally stored in a database in advance. To synthesize a singing voice that sings given lyrics to a given melody, the speech unit data of the speech units making up the lyrics is read from the database, each unit is pitch-converted to match the melody, and the units are then concatenated to synthesize data representing the waveform of the singing voice (see Patent Documents 1 to 3).

Patent Document 1: JP 2007-240564 A
Patent Document 2: JP 2006-17946 A
Patent Document 3: JP 2000-3200 A

To synthesize natural speech by unit-concatenation speech synthesis, as many speech units as possible must be stored in the database. Taking into account the type of phoneme (voiced, unvoiced, vowel dropped, and so on), the combinations of preceding and following phonemes, voice quality, emotion, and the like, the number of speech units that should be stored becomes enormous. Consequently, when a portable terminal such as a portable game machine, a PDA (Personal Digital Assistant), or a mobile phone is to perform speech synthesis, the number of speech units to be stored becomes a serious problem, because such terminals lack large-capacity storage and the size of the data they can hold is limited.

The present invention has been made in view of the above problem, and its object is to provide a technique that enables natural speech to be synthesized by unit-concatenation speech synthesis even when the number of speech units stored in the database is smaller than before.

To solve the above problem, the present invention provides a speech synthesis device comprising: a speech unit database storing speech unit data that includes waveform data indicating the waveforms of various speech units; storage means storing, for each of all or some of the vowels of the target language of speech synthesis, a devoicing template representing the waveform of an unvoiced pronunciation; unit selection means for selecting, from the speech unit database, the speech unit data corresponding to the speech units used for speech synthesis, wherein, when a speech unit including a devoiced vowel is used for speech synthesis, the unit selection means selects the speech unit data of the speech unit in which that vowel is voiced; devoicing conversion means which, when the speech unit data selected by the unit selection means was selected for a speech unit including a devoiced vowel, processes the waveform indicated by one of the devoicing templates into a waveform having a spectral envelope similar to that of the waveform indicated by the speech unit data, converts the speech unit data into speech unit data indicating the processed waveform, and outputs it, and which, when the speech unit data was selected for any other speech unit, outputs it unchanged; and unit concatenation means for adjusting and concatenating the waveform data included in the speech unit data output from the devoicing conversion means and outputting the result.

With such a speech synthesis device, even if the speech unit data of a speech unit including a devoiced vowel is not stored in the speech unit database, that speech unit data is generated from the spectral envelope of the waveform of the speech unit in which the vowel is voiced and the spectrum of the waveform indicated by the devoicing template, and is used for speech synthesis. In other words, this speech synthesis device can synthesize speech as natural as that of the prior art even when speech units containing devoiced vowels are excluded from the database. In another aspect, the present invention may be provided as a program that causes a computer to function as the unit selection means, the devoicing conversion means, and the unit concatenation means.
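By way of illustration only, the envelope-matching idea can be sketched as a per-bin gain applied to the template's amplitude spectrum. The sketch assumes that one frame is held as a list of amplitude values per frequency bin and that smoothed envelopes of both the template and the voiced unit are already available. The function name `match_envelope`, the `floor` parameter, and the per-bin division are assumptions made for this example, not the processing disclosed in this publication.

```python
def match_envelope(template_amp, template_env, target_env, floor=1e-9):
    """Scale each bin of the template amplitude spectrum so that its
    envelope follows target_env (hypothetical per-bin equalization)."""
    if not (len(template_amp) == len(template_env) == len(target_env)):
        raise ValueError("all spectra must have the same number of bins")
    shaped = []
    for amp, t_env, v_env in zip(template_amp, template_env, target_env):
        # Gain that maps the template envelope onto the target envelope.
        # The floor avoids division by zero in empty bins.
        gain = v_env / max(t_env, floor)
        shaped.append(amp * gain)
    return shaped


# Example: a two-bin template reshaped to follow the voiced unit's envelope.
shaped = match_envelope([1.0, 2.0], [1.0, 2.0], [0.5, 4.0])
```

The fine structure of the template (its noise-like character) is kept, while the overall spectral shape is borrowed from the voiced unit.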

In a more preferred aspect, when a speech unit including a devoiced vowel and a speech unit not including a devoiced vowel are used consecutively for speech synthesis, the devoicing conversion means converts the speech unit data so that, within the speech unit including the devoiced vowel, the waveform indicated by the speech unit data selected by the unit selection means and the processed waveform are crossfaded. This aspect prevents the joint between the speech unit including the devoiced vowel and the speech unit not including one from sounding unnatural.
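A crossfade of this kind can be sketched as a frame-by-frame weighted mix. The sketch assumes that both waveforms are available as equal-length lists of frames (each frame a list of values) and that a mixing weight between 0 and 1 is supplied per frame. The function name `crossfade_frames` and the linear mixing rule are illustrative assumptions.

```python
def crossfade_frames(original_frames, processed_frames, weights):
    """Mix two frame sequences; weight 0 keeps the original frame,
    weight 1 keeps the processed (devoiced) frame."""
    if not (len(original_frames) == len(processed_frames) == len(weights)):
        raise ValueError("frame sequences and weights must have equal length")
    mixed = []
    for orig, proc, w in zip(original_frames, processed_frames, weights):
        # Linear interpolation between corresponding values of the two frames.
        mixed.append([(1.0 - w) * a + w * b for a, b in zip(orig, proc)])
    return mixed


# Example: fade from the selected (voiced) waveform to the processed one.
faded = crossfade_frames([[1.0], [1.0], [1.0]],
                         [[0.0], [0.0], [0.0]],
                         [0.0, 0.5, 1.0])
```

Reversing the weight sequence gives the opposite transition, from the processed waveform back to the selected one.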

In a still more preferred aspect, the devoicing conversion means adds, to the speech unit data converted so that the waveform indicated by the speech unit data selected by the unit selection means and the processed waveform are crossfaded, a breath sound corresponding to the mixing ratio of the crossfade, and outputs the result. Adding a so-called non-harmonic component such as a breath sound in accordance with the mixing ratio makes it possible to synthesize more natural-sounding speech.

Another aspect of the present invention provides a speech synthesis device comprising: a speech unit database storing speech unit data that includes waveform data indicating the waveforms of various speech units; storage means storing, for each of all or some of the vowels of the target language of speech synthesis, a voicing template indicating the waveform of a voiced pronunciation; unit selection means for selecting, from the speech unit database, the speech unit data corresponding to the speech units used for speech synthesis, wherein, when a speech unit including a voiced vowel is used for speech synthesis, the unit selection means selects the speech unit data of the speech unit in which that vowel is devoiced; voicing conversion means which, when the speech unit data selected by the unit selection means was selected for a speech unit including a voiced vowel, processes the waveform indicated by one of the voicing templates into a waveform having a spectral envelope similar to that of the waveform indicated by the speech unit data, converts the speech unit data into speech unit data indicating the processed waveform, and outputs it, and which, when the speech unit data was selected for any other speech unit, outputs it unchanged; and unit concatenation means for adjusting and concatenating the waveform data included in the speech unit data output from the voicing conversion means and outputting the result. This aspect also provides a program that causes a computer to function as the unit selection means, the voicing conversion means, and the unit concatenation means.

With such a speech synthesis device or program, even if the speech unit data of a speech unit including a voiced vowel is not stored in the speech unit database, that speech unit data is generated from the spectral envelope of the waveform of the speech unit in which the vowel is pronounced unvoiced and the spectrum of the waveform indicated by the voicing template, and is used for speech synthesis. Which of the two kinds of speech units, those including devoiced vowels or those including voiced vowels, is excluded from the database may be decided according to the target language. For example, for a language in which the former (speech units including devoiced vowels) occur more frequently than the latter (speech units including voiced vowels), the former may be stored in the database; in the opposite case, the latter may be stored.

FIG. 1 is a diagram showing a configuration example of a singing synthesis device according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of how the pronunciation content (lyrics) is input.
FIG. 3 is a diagram for explaining an example of the unit composition of speech including a voiced vowel and of speech including an unvoiced vowel.
FIG. 4 is a diagram for explaining the configuration of the speech synthesis program 65 stored in the flash memory 6 of the singing synthesis device.
FIG. 5 is a diagram for explaining the processing executed by the devoicing conversion means 654.
FIG. 6 is a diagram showing an example of the spectral envelope of a synthesis frame, the devoicing EQ characteristic, the equalizing characteristic used in the devoicing conversion, the amplitude spectrum of the devoicing template, and the amplitude spectrum of a speech unit devoiced with that equalizing characteristic.

An embodiment of the present invention will now be described with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a singing synthesis device that is an embodiment of the speech synthesis device according to the present invention. The singing synthesis device is a portable terminal capable of outputting sound, such as a mobile phone or a portable game machine, on which a speech synthesis program has been installed. In FIG. 1, the CPU (Central Processing Unit) 1 is the control center that controls each part of the singing synthesis device. The ROM (Read Only Memory) 2 is a read-only memory storing a control program, such as a loader, for controlling the basic operation of the device. The display unit 3 is, for example, a liquid crystal display and its drive circuit, and displays the operating state of the device, input data, messages to the user, and the like. The operation unit 4 is a means for letting the user input various kinds of information, and consists of a plurality of controls (for example, a start button and cursor keys on a portable game machine, or a numeric keypad on a mobile phone), a touch panel, and the like. The interface group 5 includes a network interface for data communication with other devices over a network and drivers for exchanging data with external recording media such as a UMD (Universal Media Disc) or CD-ROM (Compact Disc Read-Only Memory). The flash memory 6 is a nonvolatile storage device (storage means) for storing information such as various programs and databases. The RAM 7 is a volatile memory used as a work area by the CPU 1. In accordance with commands given through the operation unit 4, the CPU 1 reads a program from the flash memory 6 into the RAM 7 and executes it. The sound system 8 is a means for outputting the speech synthesized by the singing synthesis device. It includes a D/A converter that converts a digital audio signal representing the waveform of the synthesized speech (for example, sampling data of that waveform) into an analog audio signal, an amplifier that amplifies the analog audio signal, a speaker that outputs the amplifier's output signal as sound, and so on.

The information stored in the flash memory 6 includes a song editing program 61, song data 62, a devoicing template 63, a speech unit database 64, and a speech synthesis program 65. The song data 62 includes note data representing the series of notes that make up a song, lyrics data representing the lyrics to be pronounced with the notes, and other information such as dynamics information for giving the song musical expression. The song data 62 is created for each song and stored in the flash memory 6.

The song editing program 61 is a program executed by the CPU 1 to edit the song data 62. In a preferred aspect, the song editing program 61 causes the display unit 3 to display a GUI (Graphical User Interface) consisting of an image of a piano keyboard. The user specifies a note by operating the operation unit 4 on the image of the desired key of the displayed keyboard, and enters the lyrics to be pronounced with that note by operating the operation unit 4. The lyrics may be entered in kana, as shown in FIG. 2(A), or in phonetic symbols, as shown in FIG. 2(B). When lyrics are entered in phonetic symbols, devoicing of a vowel can be specified unit by unit by appending the phonetic symbol "_0", which indicates devoicing of the vowel, as shown in FIG. 2(C). The song editing program 61 receives from the operation unit 4 the information on each note and the lyrics to be pronounced with it, associates note data with lyrics data for each note, and stores them in the flash memory 6 as the song data 62. The user can also add dynamics information and the like to the song data 62 by operating the operation unit 4. Instead of having all of the song data 62 entered through the operation unit 4 in this way, a keyboard may be connected to the singing synthesis device, note data may be generated by detecting the user's operation of that keyboard, and the lyrics corresponding to the note data may be entered through the operation unit 4. Alternatively, song data 62 created on another device may be input to the singing synthesis device through the interface group 5 and stored in the flash memory 6; in that case the song editing program 61 need not be stored in the flash memory 6.
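The "_0" devoicing mark lends itself to a simple parsing step. The sketch below assumes that phonetic-symbol lyrics arrive as a space-separated string such as "s u_0"; that input format, the constant `DEVOICE_MARK`, and the function `parse_phonetic_lyric` are assumptions for illustration, since the exact notation of FIG. 2 is not reproduced here.

```python
DEVOICE_MARK = "_0"  # suffix marking a devoiced vowel, as in "u_0"


def parse_phonetic_lyric(text):
    """Split a space-separated phonetic string into
    (phoneme, devoiced) pairs."""
    parsed = []
    for token in text.split():
        if token.endswith(DEVOICE_MARK):
            # Strip the mark and remember that devoicing was requested.
            parsed.append((token[:-len(DEVOICE_MARK)], True))
        else:
            parsed.append((token, False))
    return parsed


# Example: "su" with its vowel marked for devoicing.
phonemes = parse_phonetic_lyric("s u_0")
```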

The note data for one note includes information indicating the note's onset time, pitch, and length. The lyrics data defines, for each note, the lyrics to be pronounced with that note. The song data 62 arranges the note data and lyrics data for the individual notes in time series, in order of occurrence from the start of the song; within the song data 62, note data and lyrics data are associated note by note.
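One possible in-memory layout for such song data is sketched below. The class name `Note`, its field names, the use of seconds for time, and MIDI note numbers for pitch are all assumptions chosen for the example; the publication does not specify a storage format.

```python
from dataclasses import dataclass


@dataclass
class Note:
    onset: float     # onset time from the start of the song (seconds assumed)
    pitch: int       # pitch as a MIDI note number (assumed representation)
    duration: float  # note length (seconds assumed)
    lyric: str       # lyrics to be pronounced with this note


def build_song_data(notes):
    """Return the notes in time series, ordered by onset time."""
    return sorted(notes, key=lambda note: note.onset)


# Example: two notes entered out of order are arranged chronologically.
song = build_song_data([
    Note(onset=0.5, pitch=62, duration=0.5, lyric="ki"),
    Note(onset=0.0, pitch=60, duration=0.5, lyric="s u_0"),
])
```

Keeping the lyric inside each note record reflects the note-by-note association between note data and lyrics data.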

The speech synthesis program 65 is a program that causes the CPU 1 to execute processing for synthesizing speech (a singing voice, in this embodiment) according to the song data 62. In a preferred aspect, the speech synthesis program 65 and the song editing program 61 are downloaded, for example from a site on the Internet, through an appropriate member of the interface group 5 and installed in the flash memory 6. In another aspect, the speech synthesis program 65 and related programs are distributed recorded on a computer-readable recording medium such as a CD-ROM or UMD. In this aspect, the programs are read from the recording medium through an appropriate member of the interface group 5 and installed in the flash memory 6.

The speech unit database 64 is a collection of speech unit data representing the various speech units that serve as material for a singing voice, such as phoneme-to-phoneme transitions (for example, transitions from a consonant to a vowel or from one vowel to another) and sustained vowels. The speech unit data is created from speech units extracted from waveforms of speech actually produced by people. The speech unit database 64 provides a group of speech unit data, obtained from each singer's singing waveforms, for each of several singers with different voice qualities, such as a male singer, a female singer, a singer with a clear voice, and a singer with a husky voice. When singing is synthesized by the speech synthesis program 65, the user can operate the operation unit 4 to select, from among these groups, the group of speech unit data to be used.

As noted above, it is preferable for the speech unit database 64 to store as much speech unit data as possible. In this embodiment, however, the speech unit database 64 is stored in the flash memory 6, so the amount of speech unit data stored in it must be kept to a minimum, because flash memory generally has a smaller storage capacity than a hard disk or the like. In this embodiment, this reduction is achieved by excluding from storage the speech unit data of every speech unit that includes an unvoiced pronunciation (hereinafter, a devoiced vowel) of any vowel (for standard Japanese, for example, the five vowels "a", "i", "u", "e", and "o"). Speech units including a devoiced vowel comprise not only units consisting of a single phoneme (the devoiced vowel) but also, for example, transitions from a consonant to a devoiced vowel and transitions from a devoiced vowel to silence.

In conventional unit-concatenation singing synthesis such as that shown in Patent Document 1, speech unit data indicating the waveform and other properties is stored in the speech unit database even for speech units including devoiced vowels, because speech synthesis is impaired when those units are missing. FIG. 3(A) shows the unit composition of the pronunciation of "su" in which the vowel is not devoiced (that is, the vowel is voiced), and FIG. 3(B) shows the unit composition of the pronunciation of "su" in which the vowel is devoiced. In FIG. 3, the phonetic symbol [u] denotes the voiced vowel "u", and the phonetic symbol [u_0] denotes the devoiced vowel "u".

As is clear from FIG. 3(B), synthesizing the pronunciation of "su" with the vowel "u" devoiced requires three units: the transition from the consonant [s] to the devoiced vowel [u_0], the devoiced vowel [u_0] itself, and the transition from the devoiced vowel [u_0] to silence [sil]. When, as in this embodiment, the speech unit data of these units is not stored in the speech unit database 64, conventional unit-concatenation speech synthesis cannot synthesize "su" with a devoiced vowel. This embodiment avoids the problem by synthesizing the speech unit data of the transition from the consonant [s] to the devoiced vowel [u_0], of the devoiced vowel [u_0], and of the transition from the devoiced vowel [u_0] to silence [sil] from the devoicing template 63 and the speech unit data of the transition from the consonant [s] to the non-devoiced vowel [u], of the vowel [u], and of the transition from the vowel [u] to silence. Here, the devoicing template 63 is data indicating the waveform of a sound produced by pronouncing any one of the vowels of the target language unvoiced (for example, "a" if the target language is Japanese); it may be sampling data of that waveform or data representing its frequency spectrum and phase spectrum. The synthesis processing that uses the devoicing template 63 is described in detail later.
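The substitution described for "su" can be sketched as a lookup that falls back to the voiced counterpart of each unit and flags it for devoicing conversion. The unit names ("s-u", "u", "u-sil"), the dictionary standing in for the speech unit database, and both function names are hypothetical and serve only to illustrate the idea.

```python
DEVOICE_MARK = "_0"


def units_for_syllable(consonant, vowel, devoiced):
    """List the three units of a consonant-vowel syllable followed by silence."""
    v = vowel + DEVOICE_MARK if devoiced else vowel
    return [f"{consonant}-{v}", v, f"{v}-sil"]


def select_units(required, database):
    """Return (database_key, needs_devoicing) for each required unit.
    Units containing a devoiced vowel are looked up under the name of
    their voiced counterpart and flagged for devoicing conversion."""
    selected = []
    for name in required:
        needs_devoicing = DEVOICE_MARK in name
        key = name.replace(DEVOICE_MARK, "")
        if key not in database:
            raise KeyError(key)
        selected.append((key, needs_devoicing))
    return selected


# Example: a database holding only voiced units still serves devoiced "su".
database = {"s-u": b"...", "u": b"...", "u-sil": b"..."}
required = units_for_syllable("s", "u", devoiced=True)
selected = select_units(required, database)
```

Only the flagged units would then be passed through the devoicing conversion; the others would be used unchanged.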

Each item of speech unit data stored in the speech unit database 64 includes waveform data indicating the waveform of the speech unit. The waveform data may be a sequence of samples obtained by sampling the waveform of the speech unit at a predetermined sampling rate, or it may be data indicating the per-frame spectrum (amplitude spectrum and phase spectrum) obtained by dividing the sample sequence into frames of fixed length and applying an FFT (fast Fourier transform). Each item of speech unit data also includes segmentation data indicating the types of the phonemes making up the speech unit and the start time of each phoneme.
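The per-frame spectral representation can be illustrated with the following sketch. For self-containment it uses a naive discrete Fourier transform rather than an FFT (the two give the same spectrum; the FFT is simply faster), and it assumes non-overlapping frames with no window function. The function names and these framing choices are assumptions for the example.

```python
import cmath
import math


def split_frames(samples, frame_len):
    """Split a sample sequence into consecutive frames of fixed length,
    dropping any incomplete frame at the end."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def frame_spectrum(frame):
    """Return (amplitudes, phases) for bins 0..N/2 of one frame,
    computed with a naive DFT."""
    n = len(frame)
    amplitudes, phases = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        amplitudes.append(abs(acc))
        phases.append(cmath.phase(acc))
    return amplitudes, phases


# Example: one cycle of a cosine over 8 samples has its energy in bin 1.
frame = [math.cos(2 * math.pi * t / 8) for t in range(8)]
amplitudes, phases = frame_spectrum(frame)
```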

In this embodiment, a singing voice corresponding to an arbitrary melody is synthesized by applying pitch conversion to the waveform data included in the speech unit data. The method disclosed in Patent Document 1, for example, may be used for this pitch conversion. Pitch conversion requires information on the pitch of the waveform data being converted. In one preferred aspect, therefore, to facilitate pitch conversion during singing synthesis, the pitch of each speech unit is calculated frame by frame, and unit pitch data indicating the pitch in each frame is stored in the speech unit database 64 as part of the speech unit data. In another preferred aspect, to facilitate computation during singing synthesis, the envelope of the amplitude spectrum of each speech unit is also determined frame by frame, and spectral envelope data indicating the spectral envelope in each frame is stored in the speech unit database 64 as part of the speech unit data, in addition to the unit pitch data.

Next, the configuration of the speech synthesis program 65 will be described.

FIG. 4 is a diagram for explaining the configuration of the speech synthesis program 65. The speech synthesis program 65 causes the CPU 1 to execute so-called unit-concatenation speech synthesis (singing synthesis, in this embodiment) and, as shown in FIG. 4, includes speech synthesis instruction generation means 651, unit selection means 652, pitch conversion means 653, devoicing conversion means 654, and unit concatenation means 655. In this embodiment the CPU 1 synthesizes the singing voice by executing the programs corresponding to the speech synthesis instruction generation means 651 and the other means, but the device may be configured so that a plurality of processors share these programs and execute them in parallel. Some of the programs, such as the speech synthesis instruction generation means 651, may also be implemented as electronic circuits.

The speech synthesis instruction generation means 651 is a program that generates a speech synthesis instruction 660 from the song data 62 specified by operating the operation unit 4. The speech synthesis instruction 660 is a so-called singing synthesis score and includes a phoneme data track 661, a pitch data track 662, a devoicing coefficient track 663, and other data tracks 664. These data tracks share a common time axis. The phoneme data track 661 indicates the plurality of speech units used to synthesize the singing voice for one song and the position of each speech unit on the time axis (specifically, its start timing and duration). The pitch data track 662 indicates the pitch of the singing voice to be synthesized. The devoicing coefficient track 663 is a data track in which a devoicing coefficient w is written for each speech unit indicated by the phoneme data track 661, indicating whether its vowel is to be devoiced. The devoicing coefficient track 663 can be generated in various ways.
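A speech synthesis instruction with tracks on a shared time axis might be laid out as in the sketch below. The class names `UnitEvent` and `SynthesisScore`, their fields, the use of seconds, and the helper `unit_at` are assumptions for illustration; the publication does not define a concrete data format.

```python
from dataclasses import dataclass, field


@dataclass
class UnitEvent:
    unit: str        # speech unit name (hypothetical naming, e.g. "s-u")
    start: float     # start timing on the shared time axis (seconds assumed)
    duration: float  # duration of the unit (seconds assumed)


@dataclass
class SynthesisScore:
    phoneme_track: list = field(default_factory=list)    # UnitEvent items
    pitch_track: list = field(default_factory=list)      # (time, pitch) pairs
    devoicing_track: list = field(default_factory=list)  # (time, w) pairs
    other_tracks: dict = field(default_factory=dict)     # e.g. dynamics

    def unit_at(self, t):
        """Return the name of the unit sounding at time t, or None."""
        for event in self.phoneme_track:
            if event.start <= t < event.start + event.duration:
                return event.unit
        return None


# Example: two consecutive units and a lookup on the shared time axis.
score = SynthesisScore(phoneme_track=[
    UnitEvent("s-u", 0.0, 0.1),
    UnitEvent("u", 0.1, 0.3),
])
current = score.unit_at(0.2)
```

Because every track is indexed by the same time values, a pitch or devoicing coefficient can be looked up for whichever unit is sounding at a given instant.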

In the first mode, a value indicating that the vowel is to be devoiced (for example, 1) is set for each speech unit whose vowel is marked for devoicing in the lyrics stored in the song data 62 in association with the notes, and a value indicating that the vowel is not to be devoiced (for example, 0) is set for every other speech unit. This mode has the advantage that the devoicing coefficient track 663 is easy to generate. However, when a speech unit containing a devoiced vowel is adjacent to a speech unit whose vowel is not devoiced, the devoicing coefficient w changes discontinuously from 0 to 1 or from 1 to 0 at the joint between the two units, and this discontinuity tends to cause noise.

In the second mode, by contrast, the devoicing coefficient w is set for each frame constituting a speech unit. When a speech unit containing a devoiced vowel is adjacent to a speech unit whose vowel is not devoiced, the devoicing coefficient w of each frame of the unit containing the devoiced vowel is set to a fractional value between 0 and 1 so that the coefficient changes gradually from 0 to 1, or from 1 to 0, within that unit. Because a fractional coefficient is set frame by frame, generating the devoicing coefficient track 663 takes more effort than in the first mode, but noise at the joint between the two units is avoided. In this embodiment, the devoicing coefficient track 663 is generated in the second mode in order to avoid noise and achieve more natural speech synthesis.
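The difference between the two modes can be sketched as follows. This is an illustration under assumptions: the function name, the linear shape of the ramp, and the fade lengths are hypothetical, since the specification only requires that w change gradually within the unit.

```python
def devoicing_ramp(n_frames, fade_in=0, fade_out=0):
    """Per-frame devoicing coefficients w for one unit containing a devoiced vowel.

    fade_in:  number of leading frames over which w rises from 0 toward 1
              (used when the preceding unit is not devoiced).
    fade_out: number of trailing frames over which w falls from 1 toward 0
              (used when the following unit is not devoiced).
    With both set to 0 the result is the step-like first mode.
    The linear ramp is an assumption made for this sketch.
    """
    w = [1.0] * n_frames
    for i in range(min(fade_in, n_frames)):
        w[i] = min(w[i], i / fade_in)
    for i in range(min(fade_out, n_frames)):
        j = n_frames - 1 - i
        w[j] = min(w[j], i / fade_out)
    return w
```

For a five-frame unit preceded by a voiced unit, `devoicing_ramp(5, fade_in=2)` yields `[0.0, 0.5, 1.0, 1.0, 1.0]`, so the coefficient no longer jumps at the joint.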

The speech synthesis instruction generation means 651 generates the pitch data track 662 basically in accordance with the note data and, where vibrato, portamento, or legato is specified, in accordance with those instructions. If the pitch data track followed the note data exactly, however, the pitch would change in steps and the singing would sound unnatural. In this embodiment, therefore, the pitch indicated by the pitch data track 662 is varied so that it moves naturally in the sections where the pitch switches. The other data tracks 664 are created from the dynamics information and the like contained in the song data 62.

The unit selection means 652, the pitch conversion means 653, the devoicing conversion means 654, and the unit connection means 655 together generate singing voice data, that is, waveform data representing the waveform of the singing voice, in accordance with the speech synthesis instruction 660. The generation of singing voice data from the speech synthesis instruction 660 may start after the speech synthesis instruction 660 for the whole song has been generated, or may start slightly after generation of the speech synthesis instruction 660 begins.

The unit selection means 652 is a program that reads from the speech unit database 64 the speech unit data of the speech unit designated in the phoneme data track 661 of the speech synthesis instruction 660 and outputs it to the pitch conversion means 653. In this embodiment, however, speech unit data for speech units containing a devoiced vowel is not stored in the speech unit database 64. Therefore, when the speech unit designated in the phoneme data track 661 contains a devoiced vowel, the unit selection means 652 reads from the speech unit database 64 the speech unit data of the speech unit in which that devoiced vowel is replaced by its voiced pronunciation, and outputs it to the pitch conversion means 653. For example, when the designated speech unit is [s-u_0], the unit selection means 652 reads the speech unit data of [s-u] from the speech unit database 64 and outputs it to the pitch conversion means 653. Likewise, the speech unit data of [u] is read when [u_0] is designated, and that of [u-sil] when [u_0-sil] is designated. In addition, when passing speech unit data to the pitch conversion means 653, the unit selection means 652 adjusts its duration to the duration of the speech unit designated in the speech synthesis instruction 660.
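The substitution of a voiced unit for a devoiced one can be sketched as a simple name mapping. The function names and the dictionary standing in for the speech unit database 64 are assumptions for illustration; only the "_0" marker and the example unit names come from the description.

```python
DEVOICED_MARK = "_0"  # marker appended to a devoiced vowel, as in [s-u_0]


def voiced_unit_name(unit):
    """Map a unit name with a devoiced vowel to its voiced counterpart."""
    return unit.replace(DEVOICED_MARK, "")


def select_unit(unit, database):
    """Look up unit data, falling back to the voiced unit for devoiced vowels.

    `database` is a plain dict used here as a stand-in for the speech unit
    database 64; its real storage format is not specified in this text.
    """
    return database[voiced_unit_name(unit)]
```

With this mapping, `voiced_unit_name("u_0-sil")` returns `"u-sil"`, matching the examples given above. Duration adjustment is omitted from the sketch.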

The pitch conversion means 653 is a program that applies pitch conversion to the waveform data contained in the speech unit data output from the unit selection means 652 so that the waveform data corresponds to the pitch designated in the pitch data track 662. More specifically, when the waveform data is, for example, a sequence of samples obtained by sampling the waveform of the speech unit at a predetermined sampling rate, the pitch conversion means 653 performs an FFT (fast Fourier transform) on the sample sequence in frames of a predetermined number of samples and obtains the amplitude spectrum and phase spectrum of the speech unit for each frame. It then stretches or compresses the amplitude spectrum of each frame along the frequency axis so that it corresponds to the pitch designated in the pitch data track 662. Near the frequencies of the fundamental and its harmonics, the stretching or compression is made nonlinear so that the local shape of the original spectrum is preserved, and the result is taken as the amplitude spectrum after pitch conversion. The level of the stretched or compressed amplitude spectrum is also adjusted so that the spectral envelope before pitch conversion is maintained after pitch conversion. As for the phase spectrum, the original phase spectrum may be used unchanged after pitch conversion, but it is preferable to use a phase spectrum corrected to match the compression or stretching of the amplitude spectrum along the frequency axis. Needless to say, when the waveform data contained in the speech unit data output from the unit selection means 652 already represents the amplitude spectrum and phase spectrum of each frame of the speech unit, the compression or stretching along the frequency axis can be performed without the FFT.

The devoicing conversion means 654 applies to the waveform data contained in the speech unit data passed from the pitch conversion means 653 a devoicing conversion process that depends on the value of the devoicing coefficient w written in the devoicing coefficient track 663, and outputs the result to the unit connection means 655. The devoicing conversion process is described below, taking devoicing of the vowel [u] as an example.

FIG. 5 is a flowchart showing the flow of the devoicing conversion process executed by the devoicing conversion means 654. The process is executed each time speech unit data is passed from the pitch conversion means 653. As shown in FIG. 5, the speech unit data passed from the pitch conversion means 653 is first written to and stored in a predetermined area of the RAM 7 (step SA010). As described above, the waveform data contained in speech unit data that has undergone pitch conversion by the pitch conversion means 653 represents the spectrum (amplitude spectrum and phase spectrum) of the speech unit for each of its frames. In the following, the data representing the spectrum of one frame is called frame data.

Next, for each frame of the speech unit data written to the RAM 7 in step SA010, the devoicing conversion means 654 determines whether the devoicing coefficient w is 0 (step SA020). If the result is "Yes", it proceeds to step SA060. If the result of step SA020 is "No", the devoicing conversion means 654 generates a waveform in which the vowel is devoiced (hereinafter, the devoiced waveform) in accordance with Equations 1 and 2, from the spectral envelope of the waveform of the frame represented by the frame data corresponding to that devoicing coefficient w (hereinafter, the target frame) and the spectrum of the waveform indicated by the devoicing template 63 (step SA030).

Equation 1 expresses the EQ characteristic of the equalization applied to the spectral envelope of the amplitude spectrum of the target frame. The spectral envelope of this EQ characteristic (the functional relationship between frequency ω and amplitude E) is the graph drawn with a dotted line in FIG. 6(A). In Equation 1, E_0, E_1, f_0, f_1, and f_2 are parameters that define the EQ characteristic; suitable values may be determined by experiment or the like. The equalization is applied to the spectral envelope of the amplitude spectrum of the target frame so that it blends better with the amplitude spectrum indicated by the devoicing template 63. In Equation 2, S(ω) is the spectral envelope of the amplitude spectrum of the target frame, T(ω) is the amplitude spectrum of the waveform indicated by the devoicing template 63, and E(ω) is the spectral envelope of the amplitude spectrum of the EQ characteristic (see Equation 1). W(ω) calculated according to Equation 2 represents the amplitude spectrum of the devoiced waveform. As Equation 2 makes clear, the devoiced waveform is generated by processing the waveform indicated by the devoicing template 63 into a waveform whose spectral envelope is similar to that of the waveform indicated by the speech unit data to be devoiced.

In FIG. 6(B), the amplitude spectrum T(ω) of the devoicing template 63 is drawn with a thick line, and the amplitude spectrum W(ω) of the devoiced waveform with a thin line. The dotted line in FIG. 6(B) is the spectral envelope of the amplitude spectrum W(ω) of the target frame after devoicing conversion (that is, E(ω) + S(ω)). In general, the timbre of a sound is determined by the spectral envelope of its amplitude spectrum, and whether the sound is voiced or unvoiced is determined by the spectral structure of that amplitude spectrum. The amplitude spectrum W(ω) calculated according to Equation 2 has a spectral envelope that approximates the spectral envelope S(ω) of the sound [u] and a spectral structure substantially equal to that of the devoicing template 63 (the spectral structure of T(ω)). The amplitude spectrum W(ω) therefore represents the sound of the devoiced vowel [u_0].
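The idea of giving the template the envelope E(ω) + S(ω) while keeping its spectral structure can be sketched numerically. Equation 2 itself is not reproduced in this text, so the sketch rests on two loudly stated assumptions: all spectra are log amplitudes in dB, and the template spectrum has been normalized so that its own envelope is flat at 0 dB. Under those assumptions, adding the three terms bin by bin imposes the target envelope. The function name is hypothetical.

```python
def devoiced_spectrum_db(template_db, eq_db, envelope_db):
    """Impose the envelope E + S on a flat-envelope template, bin by bin.

    ASSUMPTION: inputs are log-amplitude (dB) spectra of equal length and the
    template's own envelope is flat at 0 dB. This is an illustrative guess at
    the kind of operation described, not the patent's Equation 2.
    """
    if not (len(template_db) == len(eq_db) == len(envelope_db)):
        raise ValueError("all spectra must have the same number of bins")
    return [t + e + s for t, e, s in zip(template_db, eq_db, envelope_db)]
```

The template contributes only the bin-to-bin structure of an unvoiced sound, while the sum of the EQ characteristic and the target frame's envelope sets the overall spectral shape, which is what the description attributes to W(ω).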

Next, the CPU 1 mixes the amplitude spectrum W(ω) obtained in step SA030 with the original amplitude spectrum P(ω) of the target frame in accordance with Equation 3 (step SA040). Of the frame data contained in the speech unit data stored in the RAM 7, it then replaces the frame data of the target frame with frame data representing the result M(ω) of Equation 3 (step SA050). This mixing is performed so that no discontinuity arises at the joint between a speech unit containing a devoiced vowel and a speech unit that does not contain one.
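The mixing of step SA040 can be sketched as a weighted blend controlled by w. Equation 3 is not reproduced in this text; the linear form M = w·W + (1 − w)·P used below is an assumption, chosen because the description elsewhere calls the effect a crossfade. The function name is hypothetical.

```python
def mix_spectra(devoiced, original, w):
    """Blend the devoiced spectrum W with the original spectrum P.

    ASSUMPTION: a linear crossfade M = w * W + (1 - w) * P per frequency bin.
    w = 0 returns the original frame, w = 1 the fully devoiced frame.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie between 0 and 1")
    if len(devoiced) != len(original):
        raise ValueError("spectra must have the same number of bins")
    return [w * d + (1.0 - w) * p for d, p in zip(devoiced, original)]
```

With fractional values of w set frame by frame, successive frames move gradually from P to W, which is how the discontinuity at the joint is avoided.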

The CPU 1 then determines whether all the frame data contained in the speech unit data stored in the RAM 7 has been processed (step SA060). If the result is "No", it repeats the processing from step SA020. If the result of step SA060 is "Yes", the CPU 1 outputs the speech unit data stored in the RAM 7 to the unit connection means 655 (step SA070) and ends the devoicing conversion process for the corresponding speech unit.
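The control flow of steps SA010 to SA070 can be summarized as a loop over frames. In this sketch the per-frame conversion (steps SA030 to SA050) is passed in as a callable, `convert_frame`, which is a hypothetical parameter introduced for illustration; the list copy stands in for the buffer in the RAM 7.

```python
def devoicing_conversion(frames, w_track, convert_frame):
    """Apply the per-frame devoicing flow of FIG. 5 to one speech unit.

    frames:        frame data of the unit (any per-frame representation)
    w_track:       devoicing coefficient w for each frame
    convert_frame: hypothetical callable (frame, w) -> new frame covering
                   steps SA030 to SA050
    """
    if len(frames) != len(w_track):
        raise ValueError("one devoicing coefficient is needed per frame")
    buffer = list(frames)                         # SA010: store the unit data
    for i, w in enumerate(w_track):               # SA060: loop until all frames are done
        if w == 0:                                # SA020: w == 0, leave frame untouched
            continue
        buffer[i] = convert_frame(buffer[i], w)   # SA030 to SA050
    return buffer                                 # SA070: hand over to unit connection
```

Frames whose coefficient is 0 pass through unchanged, so a unit without a devoiced vowel is output as it was received.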

The unit connection means 655 then adjusts the waveform data contained in the speech unit data processed by the devoicing conversion means 654 so that the series of speech units is joined smoothly in the final singing voice, converts the adjusted waveform data into a time-domain digital audio signal by IFFT (inverse fast Fourier transform), and outputs the signal to the sound system 8.

As described above, the singing synthesizer according to this embodiment can synthesize a singing voice as natural as that of the conventional unit connection singing synthesis disclosed in Patent Document 1 and elsewhere, even though speech unit data for speech units containing devoiced vowels is not stored in the speech unit database 64. In other words, this embodiment makes it possible to synthesize a natural singing voice while reducing the number of speech units held in the database by the number of speech units containing devoiced vowels.

An embodiment of the present invention has been described above. The embodiment may of course be modified as follows.
(1) In the embodiment described above, the devoicing template 63 for one of the vowels of the target language is stored in the flash memory 6. Devoicing templates for all the vowels of the target language may be stored instead, or templates for several vowels chosen arbitrarily from among them. In short, it suffices that devoicing templates are stored in the flash memory 6 for some or all of the vowels of the target language. With templates stored for several vowels, more natural speech synthesis becomes possible as follows. If a devoicing template for the vowel contained in the speech unit to be devoiced is stored in the flash memory 6, that template is used for the devoicing conversion. If not, the conversion uses the stored template of the vowel whose acoustic features are closest to those of the vowel to be devoiced. For example, when devoicing templates for the vowels [a], [i], and [u] are stored in the flash memory 6, each of these three vowels is converted with its own template, and the vowels [e] and [o] are converted with whichever of the three templates is acoustically closest.
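The template choice in modification (1) can be sketched as a lookup with a nearest-neighbor fallback. The specification does not say how acoustic closeness is measured, so the sketch takes the distance measure as a caller-supplied function; `pick_template` and `distance` are hypothetical names.

```python
def pick_template(vowel, templates, distance):
    """Return the devoicing template to use for `vowel`.

    templates: dict mapping a vowel to its stored template
    distance:  hypothetical callable (vowel_a, vowel_b) -> float giving the
               acoustic distance between two vowels; how it is computed is
               left open here because the text does not specify it
    """
    if not templates:
        raise ValueError("at least one template must be stored")
    if vowel in templates:
        return templates[vowel]            # a matching template exists
    nearest = min(templates, key=lambda v: distance(vowel, v))
    return templates[nearest]              # acoustically closest stored vowel
```

Only the fallback depends on the distance measure, so a stored vowel always receives its own template regardless of how closeness is defined.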

(2) In the embodiment described above, speech unit data for speech units containing devoiced vowels is synthesized by the devoicing conversion process of the devoicing conversion means 654, which makes it unnecessary to hold those speech units in the database. Conversely, speech units containing devoiced vowels may be held in the database and speech units containing the same vowels pronounced with voice excluded from it. In that case, a voicing template, which is data representing the waveform of a vowel (for example, "a") pronounced with voice, is stored in the flash memory 6 in place of the devoicing template 63, and speech unit data representing the waveform of the speech unit containing the voiced vowel is generated from the waveform indicated by the voicing template and the spectral envelope of the amplitude spectrum of the waveform of the speech unit containing the devoiced vowel. Specifically, the singing synthesizer is provided, in place of the devoicing conversion means 654, with voicing conversion means that processes the waveform indicated by the voicing template into a waveform whose spectral envelope is similar to that of the waveform of the speech unit containing the devoiced vowel, converts the speech unit data of that speech unit into speech unit data representing the processed waveform, and outputs it. Which of the two kinds of speech unit is held in the database may be decided according to the target language. For a language in which speech units containing the voiced vowel appear more frequently, it is preferable to store those speech units in the speech unit database; conversely, for a language in which speech units containing the devoiced vowel appear more frequently, it is preferable to store the latter.

(3) In the embodiment described above, when a speech unit containing a devoiced vowel is adjacent to a speech unit whose vowel is not devoiced, devoicing coefficients between 0 and 1 are set frame by frame in the devoicing coefficient track so that the devoiced sound and the non-devoiced sound are crossfaded. To make the onset of the sound clear at a switch from voiced to unvoiced or from unvoiced to voiced, a maximum value may be set for the length of the crossfade period. Where the user enters the devoicing coefficient track by operating the operation unit 4 or the like, the crossfade period designated in the entered track may be longer than this maximum. In that case the crossfade period is shortened by moving its end time earlier while leaving its start time unchanged. Where the user enters the devoicing coefficient track, a minimum value may also be set in advance for the length of the crossfade period; when a shorter crossfade period is designated, either its start time or its end time is adjusted so that the period equals the minimum. This is because too short a crossfade period may cause noise at the joint between a speech unit containing a devoiced vowel and a speech unit whose vowel is not devoiced.
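The limits on the crossfade period in modification (3) amount to clamping an interval. The sketch below is illustrative: the function name and the use of frame counts are assumptions, and for the minimum case it moves the end time, which is one of the two options the text allows.

```python
def clamp_crossfade(start, end, min_len, max_len):
    """Clamp a user-specified crossfade period [start, end) to allowed lengths.

    Too long:  the start is kept and the end is moved earlier, as described.
    Too short: the end is moved later (a choice made for this sketch; the
               text permits adjusting either the start or the end).
    Times are in frames, which is an assumed unit.
    """
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    length = end - start
    if length > max_len:
        end = start + max_len
    elif length < min_len:
        end = start + min_len
    return start, end
```

For example, with a maximum of 20 frames, a period from frame 10 to frame 50 becomes frame 10 to frame 30.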

(4) For the pitch data track 662, the pitch at note transitions may be given movement by a method other than the one described in the embodiment. For example, the user may give the pitch movement by operating the operation unit 4.

(5) In the embodiment described above, lyrics can be entered either as kana or as phonetic symbols. Alternatively, for example, singing synthesis may always be performed without vowel devoicing when the lyrics are entered as kana, and vowel devoicing may be controlled according to the input when the lyrics are entered as phonetic symbols. Devoicing may also be indicated in kana input by appending the control characters (_0), as in phonetic symbol input. For example, to synthesize the sound of "す" (su) with a devoiced vowel, the user enters "す_0".
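Reading the (_0) control characters out of kana input can be sketched with a small tokenizer. This is a simplified illustration: it treats every single character as one syllable, so two-character kana such as "きゃ" are not handled, and the function name is hypothetical.

```python
import re

# One character, optionally followed by the devoicing control characters "_0".
_TOKEN = re.compile(r"(.)(_0)?")


def parse_kana_lyrics(text):
    """Split kana lyrics into (kana, devoiced) pairs.

    Simplifying assumption: one character per syllable.
    """
    return [(kana, mark == "_0") for kana, mark in _TOKEN.findall(text)]
```

For the input "です_0" the function returns the pairs ("で", False) and ("す", True), that is, only the final syllable is marked for devoicing.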

(6) The unit connection means 655 may convert the waveform data, which is frequency-domain information (amplitude spectrum and phase spectrum), into a digital audio signal, which is time-domain information, and then apply smoothing to that digital audio signal. For example, the time-domain digital audio signals obtained by IFFT from the last n items of waveform data of the preceding speech unit and the first n items of waveform data of the following speech unit may be crossfaded to produce the final digital audio signal.
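A time-domain crossfade of the kind mentioned in modification (6) can be sketched on plain sample lists. The sketch assumes the two signals have already been obtained by IFFT and overlaps them sample by sample with linear gains; the overlap in samples, the gain curve, and the function name are assumptions for illustration.

```python
def crossfade_join(prev, nxt, n):
    """Join two time-domain signals, crossfading the last n samples of `prev`
    with the first n samples of `nxt`.

    ASSUMPTION: linear gains over n overlapping samples; the text does not
    specify the fade curve.
    """
    n = min(n, len(prev), len(nxt))
    if n <= 0:
        return list(prev) + list(nxt)
    head = list(prev[:len(prev) - n])
    mixed = []
    for i in range(n):
        g = (i + 1) / (n + 1)                      # gain of the following unit
        mixed.append((1.0 - g) * prev[len(prev) - n + i] + g * nxt[i])
    return head + mixed + list(nxt[n:])
```

The joined signal is n samples shorter than the two inputs placed end to end, because the overlapping region is shared by both units.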

(7) In the embodiment described above, the amplitude spectrum W(ω) obtained in step SA030 and the original amplitude spectrum P(ω) are mixed according to Equation 3 in accordance with the value of the devoicing coefficient w. In addition, a non-harmonic component such as breath noise may be added according to the value of the devoicing coefficient w. Specifically, the value b′ calculated according to Equation 4 is used as the breath noise coefficient in place of the breath noise coefficient b (a fractional value from 0 to 1), which indicates how much breath noise is added.

(8) The speech synthesis instruction need only be time-series information on the speech synthesis parameters; it is not limited to a whole song and may cover only part of a song.

(9) In the embodiment described above, the speech unit database 64, a collection of the speech unit data of the speech units used as material for speech synthesis, is stored in the flash memory 6 of the singing synthesizer. Alternatively, the speech unit database 64 may be written to and distributed on a computer-readable recording medium such as an SD memory card or a UMD, and the CPU 1 may read the speech unit data corresponding to the speech unit designated in the speech synthesis instruction 660 by accessing the speech unit database 64 on the recording medium via the interface group 5. In this arrangement the storage capacity of the storage device in the singing synthesizer is not an issue. However, because the storage capacity of a UMD or SD memory card is generally smaller than that of a hard disk or the like, reducing the number of speech units in the database by the technique described in the embodiment remains well worthwhile.

(10) In the embodiment described above, the present invention is applied to the synthesis of a singing voice (that is, a voice singing along with a song), but it may of course be applied to the synthesis of other speech, such as spoken utterances. In that case as well, the speech synthesis instruction generation means 651 generates a speech synthesis instruction in which information designating the speech units used for synthesis is arranged in time series, the pitch conversion means 653 applies pitch conversion to each item of speech unit data output from the unit selection means 652, the devoicing conversion means 654 applies the devoicing conversion process to the pitch-converted speech unit data, and the unit connection means 655 then adjusts, connects, and outputs the waveform data contained in each item of speech unit data.

(11) In the embodiment described above, operating the CPU 1 in accordance with the speech synthesis program 65 causes it to function as the speech synthesis instruction generation means 651, the unit selection means 652, the pitch conversion means 653, the devoicing conversion means 654, and the unit connection means 655. However, where the speech synthesis instruction 660 is supplied from another device via an appropriate one of the interface group 5, the CPU 1 need not function as the speech synthesis instruction generation means 651. Likewise, where no pitch is designated for each speech unit making up the synthesized speech, the CPU 1 need not function as the pitch conversion means 653. In other words, the speech synthesis instruction generation means 651 and the pitch conversion means 653 are not essential to the speech synthesis characteristic of the present invention; it suffices that the CPU 1 can function as the unit selection means 652, the devoicing conversion means 654, and the unit connection means 655.

DESCRIPTION OF SYMBOLS 1 ... CPU, 2 ... ROM, 3 ... display unit, 4 ... operation unit, 5 ... interface group, 6 ... flash memory, 61 ... song editing program, 62 ... song data, 63 ... devoicing template, 64 ... speech unit database, 65 ... speech synthesis program, 651 ... speech synthesis instruction generation means, 652 ... unit selection means, 653 ... pitch conversion means, 654 ... devoicing conversion means, 655 ... unit connection means, 7 ... RAM.

Claims

A speech synthesizer comprising:
a speech unit database storing speech unit data that includes waveform data representing the waveforms of various speech units;
storage means storing, for each of all or some of the vowels of the language targeted for speech synthesis, a devoicing template representing the waveform of that vowel pronounced without voice;
unit selection means for selecting, from the speech unit database, the speech unit data corresponding to each speech unit to be used for speech synthesis, wherein, when a speech unit including a devoiced vowel is to be used for speech synthesis, the unit selection means selects from the speech unit database the speech unit data of the speech unit in which that vowel is voiced;
devoicing conversion means which, when the speech unit data selected by the unit selection means was selected for a speech unit including a devoiced vowel, processes the waveform represented by one of the devoicing templates into a waveform having a spectral envelope similar to that of the waveform represented by the speech unit data, converts the speech unit data into speech unit data representing the processed waveform, and outputs it, and which, when the speech unit data was selected for any other speech unit, outputs it unchanged; and
unit connection means for connecting and outputting the waveform data included in each item of speech unit data output from the devoicing conversion means.

The speech synthesizer according to claim 1, wherein, when speech units including a devoiced vowel are used in succession for speech synthesis, the devoicing conversion means converts the speech unit data for those speech units so that the waveform represented by the speech unit data selected by the unit selection means and the processed waveform are cross-faded.

The speech synthesizer according to claim 2, wherein the devoicing conversion means adds, to the speech unit data converted so that the waveform represented by the speech unit data selected by the unit selection means and the processed waveform are cross-faded, a breath sound corresponding to the mixing ratio of the crossfade, and outputs the result.

A speech synthesizer comprising:
a speech unit database storing speech unit data that includes waveform data representing the waveforms of various speech units;
storage means storing, for each of all or some of the vowels of the language targeted for speech synthesis, a voicing template representing the waveform of that vowel pronounced with voice;
unit selection means for selecting, from the speech unit database, the speech unit data corresponding to each speech unit to be used for speech synthesis, wherein, when a speech unit including a voiced vowel is to be used for speech synthesis, the unit selection means selects from the speech unit database the speech unit data of the speech unit in which that vowel is devoiced;
voicing conversion means which, when the speech unit data selected by the unit selection means was selected for a speech unit including a voiced vowel, processes the waveform represented by one of the voicing templates into a waveform having a spectral envelope similar to that of the waveform represented by the speech unit data, converts the speech unit data into speech unit data representing the processed waveform, and outputs it, and which, when the speech unit data was selected for any other speech unit, outputs it unchanged; and
unit connection means for connecting and outputting the waveform data included in each item of speech unit data output from the voicing conversion means.

A program causing a computer device to function as:
unit selection means for selecting the speech unit data corresponding to each speech unit to be used for speech synthesis from a speech unit database storing speech unit data that includes waveform data representing the waveforms of various speech units, wherein, when a speech unit including a devoiced vowel is to be used for speech synthesis, the unit selection means selects from the speech unit database the speech unit data of the speech unit in which that vowel is voiced;
devoicing conversion means which, when the speech unit data selected by the unit selection means was selected for a speech unit including a devoiced vowel, processes the waveform of the voiceless pronunciation of one of the vowels of the language targeted for speech synthesis into a waveform having a spectral envelope similar to that of the waveform represented by the speech unit data, converts the speech unit data into speech unit data representing the processed waveform, and outputs it, and which, when the speech unit data was selected for any other speech unit, outputs it unchanged; and
unit connection means for connecting, while adjusting, and outputting the waveform data included in each item of speech unit data output from the devoicing conversion means.

A program causing a computer device to function as:
unit selection means for selecting the speech unit data corresponding to each speech unit to be used for speech synthesis from a speech unit database storing speech unit data that includes waveform data representing the waveforms of various speech units, wherein, when a speech unit including a voiced vowel is to be used for speech synthesis, the unit selection means selects from the speech unit database the speech unit data of the speech unit in which that vowel is devoiced;
voicing conversion means which, when the speech unit data selected by the unit selection means was selected for a speech unit including a voiced vowel, processes the waveform of the voiced pronunciation of one of the vowels of the language targeted for speech synthesis into a waveform having a spectral envelope similar to that of the waveform represented by the speech unit data, converts the speech unit data into speech unit data representing the processed waveform, and outputs it, and which, when the speech unit data was selected for any other speech unit, outputs it unchanged; and
unit connection means for connecting, while adjusting, and outputting the waveform data included in each item of speech unit data output from the voicing conversion means.