JP2015082028A

JP2015082028A - Singing synthetic device and program

Info

Publication number: JP2015082028A
Application number: JP2013219805A
Authority: JP
Inventors: Takeshi Tsuchiya; Takehiko Kawahara; Junya Ura
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2013-10-23
Filing date: 2013-10-23
Publication date: 2015-04-27
Also published as: WO2015060340A1

Abstract

PROBLEM TO BE SOLVED: To allow a singer to broaden the expression of singing and to experience a new kind of singing. SOLUTION: The singing synthesis device comprises: a pitch detection part 104 for detecting the pitch of an input voice; a volume detection part 108 for detecting the volume of the input voice; and a voice synthesis part 140 which, when lyrics data defining lyrics and the vocalization timing of the lyrics is supplied according to the progress of a musical performance, synthesizes a singing voice based on the lyrics data according to the pitch detected by the pitch detection part 104 and the volume detected by the volume detection part 108.

Description

The present invention relates to a singing synthesis device and a singing synthesis program for synthesizing a singing voice.

Conventionally, the following technique has been known for converting a singer's singing (voice) into another person's singing. Formant sequence data obtained when a specific person (for example, the original singer) sang is stored in advance. When the singer's singing voice is converted, formants based on the original singer's formant sequence are shaped to match the pitch and volume of that singing voice, and a singing voice is synthesized from them (see, for example, Patent Document 1).

Patent Document 1: JP-A-10-268895

In the above technique, however, formants based on the original singer's formant sequence data are shaped, so the influence of the original singer's way of singing inevitably remains in the output singing voice.
The present invention has been made in view of these circumstances. One of its objects is to provide a singing synthesis device and a singing synthesis program in which, when an input voice such as a singer's singing is output as a singing voice of a different voice quality, no influence of an original singer's way of singing remains in the output singing voice.

To achieve the above object, a singing synthesis device according to one aspect of the present invention comprises: a pitch detection unit that detects the pitch of singing in an input voice; a volume detection unit that detects the volume of the input voice; and a voice synthesis unit that, when lyrics data defining lyrics and the singing timing of the lyrics is supplied as a performance progresses, synthesizes a singing voice based on the lyrics data according to the pitch detected by the pitch detection unit and the volume detected by the volume detection unit.

According to this aspect, a singing voice based on the lyrics data is synthesized at the detected pitch and volume, so the notion of an original singer's way of singing does not arise. In addition, the singing voice is synthesized with a voice quality different from the singer's while reflecting the pitch and volume of the singer's singing, so the singer can broaden the expression of singing and experience a new kind of singing.
In a preferred aspect, the voice synthesis unit synthesizes the singing voice based on a library of speech units.
The voice synthesis unit may synthesize the singing voice at, for example, the same pitch as the pitch detected by the pitch detection unit, or at a pitch shifted from the detected pitch by a predetermined relationship. Likewise, the voice synthesis unit may synthesize the singing voice at the same volume as the volume detected by the volume detection unit, at a volume having a predetermined relationship to the detected volume, or according to the detected volume only when that volume exceeds a threshold.

The above aspect may further comprise a sound source unit that generates accompaniment sound as the performance progresses, and an output unit that outputs the accompaniment sound, the input voice, and the singing voice. With this configuration, the input voice, the singing voice synthesized by the voice synthesis unit, and the accompaniment sound that follows the progress of the performance are all output, so the singer can experience a new kind of singing.

In the above aspect, the voice synthesis unit may synthesize the singing voice while changing the singing timing of the lyrics data according to the volume detected by the volume detection unit. With this configuration, the singer can control the timing of the synthesized lyrics to some extent instead of following exactly the timing defined by the lyrics data, which makes it possible to vary the timing of the synthesized singing in an improvised (ad-lib) manner.
The aspects of the present invention can be conceived not only as a singing synthesis device but also as a program that causes a computer to function as the singing synthesis device.

FIG. 1 is a functional block diagram showing the configuration of the singing synthesis device according to the first embodiment.
FIG. 2 is a diagram showing lyrics data and related data in the singing synthesis device.
FIG. 3 is a flowchart showing the singing voice synthesis process in the singing synthesis device.
FIG. 4 is a diagram showing an output example of the singing voice in the singing synthesis device.
FIG. 5 is a functional block diagram showing the configuration of the singing synthesis device according to the second embodiment.
FIG. 6 is a diagram showing an output example of the singing voice in the singing synthesis device.
FIG. 7 is a functional block diagram showing the configuration of the singing synthesis device according to the third embodiment.

Embodiments of the present invention are described below with reference to the drawings.

<First Embodiment>
FIG. 1 is a functional block diagram showing the configuration of the singing synthesis device 10 according to the first embodiment.
In this figure, the singing synthesis device 10 is a computer such as a notebook or tablet computer, and includes a voice input unit 102, a pitch detection unit 104, a volume detection unit 108, an operation unit 112, a control unit 120, a database 130, a voice synthesis unit 140, a sound source unit 160, and speakers 172 and 174.
Of these functional blocks, the voice input unit 102, the operation unit 112, the voice synthesis unit 140, and the speakers 172 and 174, for example, are built from hardware, whereas the pitch detection unit 104, the volume detection unit 108, the control unit 120, the database 130, and the sound source unit 160 are built by a CPU (Central Processing Unit, not shown) executing a pre-installed application program.
Although not specifically shown, the singing synthesis device 10 also has a display unit so that the user can check the status and settings of the device.

The voice input unit 102, whose details are omitted here, consists of a microphone that converts the singing voice of the singer (user) into an electrical singing voice signal, an LPF (low-pass filter) that cuts the high-frequency components of the converted singing voice signal, and an A/D converter that converts the filtered singing voice signal into a digital signal.
The pitch detection unit 104 performs frequency analysis on the singing voice signal converted into a digital signal (the input voice) and outputs, in substantially real time, pitch data indicating the pitch (frequency) obtained by the analysis. The frequency analysis can use an FFT (Fast Fourier Transform) or any other known method.
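As a rough illustration of what such a pitch detection step might look like, the following minimal sketch estimates the fundamental frequency of a single frame by autocorrelation, used here as one of the "other known methods" instead of an FFT so that the example needs only the standard library. The function name `detect_pitch`, the search range `fmin`/`fmax`, and the test signal are assumptions made for this example and do not come from the patent.

```python
import math

def detect_pitch(samples, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation.

    Hypothetical helper: the search range fmin..fmax is an assumption.
    Returns None when no positive correlation peak is found.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself delayed by `lag` samples.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

# Usage: a 200 Hz sine sampled at 8 kHz (synthetic stand-in for the input voice).
sr = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / sr) for n in range(800)]
pitch_hz = detect_pitch(frame, sr)
```

A real-time implementation would run this on successive short frames of the digitized signal; the frame length chosen here is arbitrary.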

The volume detection unit 108 outputs, in substantially real time, volume data indicating the singer's volume, for example by passing the amplitude envelope of the digitized singing voice signal through a low-pass filter.
The operation unit 112 receives operations by the singer, such as selecting the song to sing, and supplies information indicating the operation to the control unit 120.
The database 130 stores music data for a plurality of songs. The music data for one song consists of accompaniment data, which defines the accompaniment sound of the song in one or more tracks, and lyrics data, which indicates the lyrics of the song.
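The envelope smoothing described for the volume detection unit 108 can be sketched as follows. This is only an illustrative sketch: it rectifies the signal and smooths it with a one-pole low-pass filter, and the function name `volume_envelope`, the smoothing coefficient `alpha`, and the test signal are assumptions rather than details given in the patent.

```python
def volume_envelope(samples, alpha=0.05):
    """Rectify the signal and smooth it with a one-pole low-pass filter.

    alpha (0 < alpha <= 1) is an assumed smoothing coefficient:
    smaller values give a slower, smoother volume curve.
    """
    level = 0.0
    envelope = []
    for x in samples:
        # Move the running level a fraction alpha toward the rectified sample.
        level += alpha * (abs(x) - level)
        envelope.append(level)
    return envelope

# Usage: a quiet passage followed by a loud one (synthetic square-like signal).
quiet = [0.01, -0.01] * 200
loud = [0.8, -0.8] * 200
env = volume_envelope(quiet + loud)
```

The resulting curve settles near 0.01 during the quiet passage and near 0.8 during the loud one, which is the kind of slowly varying value that can be compared against a threshold.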

In addition to managing the database 130, the control unit 120 functions as a sequencer while a performance is in progress.
Functioning as a sequencer, the control unit 120 interprets the accompaniment data in the music data read from the database 130 and supplies musical tone information, which defines the tones to be generated, to the sound source unit 160 in chronological order from the start of the performance as the performance progresses. Accompaniment data conforming to the MIDI standard, for example, is used. Under the MIDI standard, accompaniment data is defined as a combination of events and durations that indicate the time interval between events. The control unit 120 therefore supplies musical tone information indicating the content of an event to the sound source unit 160 each time the time indicated by a duration elapses. In other words, the control unit 120 advances the performance of the song by interpreting the accompaniment data and supplying musical tone information to the sound source unit 160.

When interpreting the accompaniment data, the control unit 120 also accumulates the durations from the start of the performance. From this accumulated value, the control unit 120 can track the progress of the performance, that is, which part of the song is currently being played.
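The duration bookkeeping described above can be illustrated with a short sketch. It assumes, purely for the example, that each event is stored as a pair of a duration (the time in ticks to wait before the event) and a message; the function name `schedule` and the message strings are hypothetical and are not actual MIDI messages.

```python
def schedule(events):
    """Turn (duration, message) pairs into (absolute_time, message) pairs.

    Each duration is assumed to be the time, in ticks, between the
    previous event and this one.
    """
    elapsed = 0
    timeline = []
    for duration, message in events:
        elapsed += duration  # accumulated duration since the performance started
        timeline.append((elapsed, message))
    return timeline

# Usage: two notes of 480 ticks each (illustrative data only).
accompaniment = [
    (0, "note_on C4"),
    (480, "note_off C4"),
    (0, "note_on D4"),
    (480, "note_off D4"),
]
timeline = schedule(accompaniment)
```

The accumulated value in `elapsed` plays the role of the performance position that the control unit uses to tell which part of the song is being played.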

The sound source unit 160 synthesizes a musical tone signal representing the accompaniment sound according to the musical tone information supplied from the control unit 120. Because the present embodiment does not necessarily need to output accompaniment sound, the sound source unit 160 is not essential. The musical tone signal output from the sound source unit 160 is converted into an analog signal by a D/A converter (not shown) and then converted into sound and output by the speaker 174.

Besides supplying musical tone information to the sound source unit 160, the control unit 120 supplies lyrics data to the voice synthesis unit 140 as the performance progresses.
The voice synthesis unit 140 synthesizes a singing voice according to the lyrics data supplied from the control unit 120, the pitch data supplied from the pitch detection unit 104, and the volume data supplied from the volume detection unit 108, and outputs it as a singing voice signal. The singing voice signal output from the voice synthesis unit 140 is converted into an analog signal by a D/A converter (not shown) and then converted into sound and output by the speaker 172.

FIG. 2 shows an example of lyrics data. In this example, the lyrics data of the song “Sakura” is shown together with its melody (the score displayed above the lyrics). The copyright protection period of “Sakura” has already expired under Articles 51 and 57 of the Copyright Act of Japan.

As shown in the figure, the lyrics data arranges the lyrics to be sung in order from the start of the performance. The lyrics data contains character information indicating the lyrics. The characters to be sung (including character strings; the same applies below) are divided as shown in the figure, and each is associated with a note of the melody, that is, with the singing timing and the pitch at which that lyric should be sung. In this example, one note is assigned to each of the lyrics 51 onward (the figure shows up to lyric 57 and omits the rest). Depending on the song (lyrics), however, several notes may be assigned to one character, or several characters may be assigned to one note.
When the performance reaches the singing timing indicated by a note, the control unit 120 supplies the voice synthesis unit 140 with data indicating the lyric character corresponding to that note and the pitch of that lyric.

Whether the performance has reached a singing timing can be determined as follows. If the accumulated duration obtained while interpreting the accompaniment data is associated in advance with the singing timings of the lyrics data, the control unit 120 can make the determination by checking whether the accumulated value has reached the value associated with a singing timing.
When no accompaniment sound is output (when no accompaniment data is used), the progress of the performance cannot be tracked from the accumulated duration of the accompaniment data. In that case, the singing timings of the lyrics may, for example, be defined in the same way as accompaniment data, by events (lyric singing events) and durations indicating the time interval between those events, and whether a singing timing has been reached may be determined by whether an event to be sung has arrived in the lyrics data.

In FIG. 1, the voice synthesis unit 140 synthesizes the characters of the lyrics data supplied from the control unit 120 using speech unit data registered in a library (not shown). Speech unit data defining the waveforms of the various speech units that serve as raw material for the singing voice, such as single phonemes and transitions from one phoneme to another, is registered in this library in advance.
More specifically, the voice synthesis unit 140 converts the phoneme sequence indicated by the characters of the supplied lyrics data into a sequence of speech units, selects the speech unit data corresponding to these speech units from the library and concatenates them, and converts the pitch of each piece of concatenated speech unit data to the specified pitch, thereby synthesizing a singing voice signal representing the singing voice.
The pitch and volume of the singing voice in the voice synthesis unit 140 are described later.
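To make the conversion from a phoneme sequence to a sequence of speech units more concrete, here is a small sketch. The patent does not specify the unit inventory or naming, so the function `to_units`, the `sil` symbol for silence, and the `a-b` naming of transition units are all hypothetical choices made for illustration.

```python
def to_units(phonemes):
    """Expand a phoneme sequence into single-phoneme and transition units.

    Hypothetical naming: "x-y" is the transition from phoneme x to
    phoneme y, and "sil" stands for silence before and after the syllable.
    """
    units = []
    previous = "sil"
    for phoneme in phonemes:
        units.append(f"{previous}-{phoneme}")  # transition into this phoneme
        units.append(phoneme)                  # the sustained phoneme itself
        previous = phoneme
    units.append(f"{previous}-sil")            # transition back to silence
    return units

# Usage: the syllable "sa" as the phonemes s and a.
units = to_units(["s", "a"])
```

In a real synthesizer each unit name would select waveform data from the library, and the concatenated data would then be pitch-converted to the specified pitch.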

In the present embodiment, the singing voice is output from the speaker 172 and the accompaniment sound from the speaker 174, separately, but the singing voice and the accompaniment sound may instead be mixed and output from the same speaker.

Next, the operation of the singing synthesis device 10 according to the present embodiment is described.
In the singing synthesis device 10, when the singer operates the operation unit 112 to select a desired song, the control unit 120 reads the music data corresponding to that song from the database 130. It interprets the accompaniment data in the music data and supplies the musical tone information of the accompaniment sound to be synthesized to the sound source unit 160, causing the sound source unit 160 to synthesize a musical tone signal. It also supplies the lyrics data in the music data to the voice synthesis unit 140 as the performance progresses, causing the voice synthesis unit 140 to synthesize a singing voice signal.
In other words, once a performance starts, the singing synthesis device 10 executes two processes independently of each other: first, a musical tone synthesis process that synthesizes a musical tone signal as the performance progresses, and second, a singing voice synthesis process driven by supplying lyrics data as the performance progresses.
In the musical tone synthesis process, the control unit 120 supplies musical tone information as the performance progresses and the sound source unit 160 synthesizes a musical tone signal based on that information. This process is itself well known (see, for example, JP-A-7-199975), so its details are omitted, and the singing voice synthesis process is described below.

When a song is selected with the operation unit 112, the control unit 120 automatically starts supplying the accompaniment data and lyrics data of that song, which amounts to an instruction to start performing the song. If another song is already being performed when a song is selected, however, the control unit 120 holds the performance of the selected song until the other song has finished.

FIG. 3 is a flowchart showing the singing voice synthesis process. This process is executed by the control unit 120 and the voice synthesis unit 140.
When the performance starts, the control unit 120 first determines whether the current stage of the performance is a singing timing (step Sa11).

If the control unit 120 determines that the current stage of the performance is not a singing timing (the result of step Sa11 is “No”), it returns the procedure to step Sa11. In other words, the process waits at step Sa11 until the performance reaches a singing timing.
If the control unit 120 determines that the performance has reached a singing timing (the result of step Sa11 is “Yes”), it supplies the lyrics data, that is, the data defining the character and pitch to be sung at that singing timing, to the voice synthesis unit 140 (step Sa12).

When lyrics data is supplied from the control unit 120, the voice synthesis unit 140 synthesizes voice based on that lyrics data, controlling the pitch and volume as follows (step Sa13).
If the volume indicated by the volume data supplied from the volume detection unit 108 is at or below a threshold, the voice synthesis unit 140 synthesizes the character of the lyrics data at the pitch given in the lyrics data and at the volume indicated by the volume data, and outputs the result as a singing voice signal. Because that volume is at or below the threshold, the singing voice signal is at a level that is negligible to the ear even when it is output from the speaker 172.
If, on the other hand, the volume indicated by the volume data exceeds the threshold when lyrics data is supplied from the control unit 120, the voice synthesis unit 140 replaces the pitch in the lyrics data with the pitch indicated by the pitch data supplied from the pitch detection unit 104, synthesizes the character of the lyrics data at the volume indicated by the volume data supplied from the volume detection unit 108, and outputs the result as a singing voice signal.
The singing voice signal heard from the speaker 172 is therefore the character of the lyrics data synthesized at the pitch and volume at which the singer sang.
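The threshold rule of step Sa13 can be summarized in a few lines. The sketch below is only an illustration of that rule; the function name `select_pitch_and_volume`, the use of frequencies in Hz, and the example numbers are assumptions.

```python
def select_pitch_and_volume(lyric_pitch, detected_pitch, detected_volume, threshold):
    """Choose the pitch and volume used to synthesize one lyric character.

    At or below the threshold, the pitch from the lyrics data is kept;
    above it, the singer's detected pitch replaces it. The detected
    volume is used in both cases.
    """
    if detected_volume <= threshold:
        return lyric_pitch, detected_volume
    return detected_pitch, detected_volume

# Usage with illustrative values (pitch in Hz, volume on an arbitrary 0..1 scale).
quiet_case = select_pitch_and_volume(440.0, 466.2, 0.01, threshold=0.05)
loud_case = select_pitch_and_volume(440.0, 466.2, 0.30, threshold=0.05)
```

In the quiet case the output is synthesized at the lyric pitch but at a near-inaudible volume; in the loud case both pitch and volume follow the singer.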

After supplying the lyrics data whose singing timing has arrived to the voice synthesis unit 140, the control unit 120 determines whether there is no further lyrics data to be sung (step Sa14).
If further lyrics data exists (the result of step Sa14 is “No”), the control unit 120 returns the procedure to step Sa11, so that steps Sa12 and Sa13 are executed when the performance reaches the next singing timing.
If there is no further data to be sung (the result of step Sa14 is “Yes”), the control unit 120 ends the singing voice synthesis process.
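The flow of steps Sa11 to Sa14 can be sketched as a simple loop. This is a simplified illustration, not the device's implementation: the function `singing_loop`, the tuple layout of the lyrics, the tick-based clock, and the MIDI-style note numbers are all assumptions, and the `synthesize` callback merely stands in for the voice synthesis unit.

```python
def singing_loop(lyrics, clock, synthesize):
    """Walk through lyrics as the performance position advances.

    lyrics: list of (timing, character, pitch), sorted by timing.
    clock: iterable yielding the current performance position.
    synthesize: callback standing in for the voice synthesis unit.
    Returns the number of lyric characters handed to the synthesizer.
    """
    index = 0
    for now in clock:
        # Step Sa11: has the performance reached the next singing timing?
        while index < len(lyrics) and now >= lyrics[index][0]:
            _, character, pitch = lyrics[index]
            synthesize(character, pitch)  # steps Sa12 and Sa13
            index += 1
        if index == len(lyrics):  # step Sa14 "Yes": nothing left to sing
            break
    return index

# Usage: three lyric events polled every 120 ticks (illustrative values).
sung = []
lyrics = [(0, "sa", 69), (480, "ku", 69), (960, "ra", 71)]
count = singing_loop(lyrics, range(0, 2000, 120), lambda c, p: sung.append((c, p)))
```

Polling a clock is one simple way to express "wait at step Sa11"; an event-driven timer would serve equally well.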

FIG. 4 shows a concrete example of singing voice synthesis, for the case where the singer has selected “Sakura” (see FIG. 2) as the song to sing. When the singer sings at the volume shown in (b) while listening to the accompaniment sound and following the progress of the performance, the present embodiment outputs the singing voice as shown in (c) of the figure.
Suppose the singer raises the volume slightly later than the start of “sa” (lyric 51). The voice synthesis unit 140 adjusts the amplitude of the singing voice signal to the detected volume once the volume indicated by the volume data supplied from the volume detection unit 108 exceeds the threshold, so “sa” in the singing voice of (c) (reference numeral 61) does not follow the timing defined by the lyrics data of (a) (lyric 51).
When the singer lowers the volume between “ku” (lyric 52) and “ra” (lyric 53), or moves the microphone of the voice input unit 102 away from the mouth, a gap opens between “ku” (reference numeral 62) and “ra” (reference numeral 63-1) in the singing voice of (c).
When the singer lowers the volume partway through “ra” (lyric 53), “ra” is, for the same reason, split into reference numerals 63-1 and 63-2 in the singing voice of (c). The later “ra” (reference numeral 63-2) is written as “ra” for convenience of explanation, but it is actually heard as “a”, the vowel of “ra”.

The example of FIG. 4 illustrates how the singing voice is synthesized depending on the volume at which the singer sings. It does not show at what pitch the singing voice is synthesized depending on the pitch at which the singer sings, but that should need no particular explanation.
The singing synthesis device 10 of the first embodiment uses only the singer's pitch and volume when synthesizing the singing voice. Therefore, even if the singer sings, for example, “Ah, ah, ah...” instead of the lyrics “Sakura, sakura...”, the singing voice synthesized by the singing synthesis device 10 is “Sakura, sakura...”.

When formant sequence data is used as described in the background art, data must be collected while the original singer sings. Moreover, because formants based on the formant sequence data are shaped according to the pitch and volume at which the singer sang, the result is inevitably influenced by the original singer's way of singing.
In the present embodiment, by contrast, the singing voice is synthesized using a library of speech units. The result is therefore not influenced by any model person's way of singing, and no model person needs to sing the song in the first place. A further advantage is that the singing voice can be synthesized faithfully to the pitch and volume at which the singer actually sang on the spot.
According to the present embodiment, a singing voice synthesized with a voice quality different from the singer's is output while reflecting the singer's intent (pitch and volume), so the singer can broaden the expression of singing and experience a new kind of singing.

<Second Embodiment>
The first embodiment synthesizes a singing voice that reflects the pitch and volume of the singer's singing. It makes no use of any other information; put simply, it does not use the singer's singing itself at all.
The second embodiment, described next, has the singer's own singing and the synthesized singing voice sing together in chorus. In outline, the singer's singing is treated as, for example, the root, while a note a third above the root and a note a fifth above the root are synthesized, so that the singer is harmonized in a triad despite singing alone.

FIG. 5 is a functional block diagram showing the configuration of the singing voice synthesizing apparatus 10 according to the second embodiment.
The singing voice synthesizing apparatus 10 shown in this figure differs from the first embodiment shown in FIG. 1 in that pitch conversion units 106a and 106b are provided, two voice synthesis units 140a and 140b are provided, and a mixer 150 is provided.
The description of the second embodiment therefore focuses on these differences.

The pitch conversion unit 106a converts the pitch indicated by the pitch data supplied from the pitch detection unit 104 into a pitch having a predetermined relationship to it, for example the pitch a third above, and supplies the result to the voice synthesis unit 140a. The pitch conversion unit 106b converts the pitch indicated by the pitch data supplied from the pitch detection unit 104 into a pitch having a predetermined relationship to it, for example the pitch a fifth above, and supplies the result to the voice synthesis unit 140b. Note that a third above the root may be a minor third or a major third, and a fifth above the root may be a perfect fifth, a diminished fifth, or an augmented fifth. Which one applies is determined by the pitch of the root (and the key signature), so the pitch conversion units 106a and 106b may, for example, hold a table prepared in advance that gives the converted pitch for each root pitch, and convert the pitch indicated by the pitch data supplied from the pitch detection unit 104 by referring to that table.
The voice synthesis units 140a and 140b have the same function as the voice synthesis unit 140 in the first embodiment and receive the same lyric data from the control unit 120; however, the pitch converted by the pitch conversion unit 106a is specified for the voice synthesis unit 140a, and the pitch converted by the pitch conversion unit 106b is specified for the voice synthesis unit 140b.
The mixer 150 mixes the singing voice signal from the voice input unit 102, the singing voice signal from the voice synthesis unit 140a, and the singing voice signal from the voice synthesis unit 140b. The mixed singing voice signal is converted into an analog signal by a D/A converter (not shown) and then converted into sound and output by the speaker 172.
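The table-based conversion described for the pitch conversion units 106a and 106b can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions: pitches are represented as MIDI note numbers, the key is fixed to C major, and the names `build_table`, `THIRD_TABLE`, `FIFTH_TABLE`, and `convert_pitch` are hypothetical and do not appear in the specification.

```python
# Hypothetical sketch of table-based diatonic pitch conversion.
# Assumptions (not from the specification): MIDI note numbers, key of C major.

C_MAJOR_PITCH_CLASSES = [0, 2, 4, 5, 7, 9, 11]  # C D E F G A B


def build_table(scale_steps):
    """Map each in-scale pitch class to the semitone offset of the note
    `scale_steps` scale degrees above it (2 = a third, 4 = a fifth)."""
    table = {}
    n = len(C_MAJOR_PITCH_CLASSES)
    for i, pitch_class in enumerate(C_MAJOR_PITCH_CLASSES):
        target = C_MAJOR_PITCH_CLASSES[(i + scale_steps) % n]
        table[pitch_class] = (target - pitch_class) % 12
    return table


THIRD_TABLE = build_table(2)  # plays the role of pitch conversion unit 106a
FIFTH_TABLE = build_table(4)  # plays the role of pitch conversion unit 106b


def convert_pitch(midi_note, table):
    """Return the converted MIDI note, or None if the detected pitch
    is not a note of the assumed scale."""
    offset = table.get(midi_note % 12)
    return None if offset is None else midi_note + offset
```

Because the table is built from the scale, the choice between a minor and a major third (or between a perfect and a diminished fifth) follows from the root automatically. How out-of-scale pitches should be handled is not stated in the specification; returning `None` here is an arbitrary choice.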

FIG. 6 is a diagram showing a specific example of singing voice synthesis according to the second embodiment. It illustrates the case where the singer selects "Sakura" (see FIG. 2) as the song to sing and, listening to the accompaniment and following the progress of the performance, sings the lyrics denoted by reference numerals 71, 72, 73, ... at the pitches indicated by the keyboard in the left column of the figure, that is, at the pitches and singing timings of the score (lyric data) shown in the upper column of the figure. In this case, the voice synthesis unit 140a synthesizes voice at a pitch a third above the pitch of the singing, as indicated by reference numerals 61a, 62a, 63a, ..., and the voice synthesis unit 140b synthesizes voice at a pitch a fifth above the pitch of the singer's singing, as indicated by reference numerals 61b, 62b, 63b, ....
In the example of FIG. 6, in C major, reference numeral 61a is a minor third above reference numeral 71, and reference numeral 61b is a major third above reference numeral 61a. Reference numerals 71, 61a, and 61b therefore form a minor triad. Reference numerals 72, 62a, and 62b likewise form a minor triad. Reference numeral 63a is a minor third above reference numeral 73, and reference numeral 63b is a minor third above reference numeral 63a. Reference numerals 73, 63a, and 63b therefore form a diminished triad.
Thus, when the singer sings at a volume exceeding the threshold and at the pitches and timings of the score shown in the figure, the speaker 172 outputs a singing voice harmonized in triads whose root is the singer's own singing.
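The way the stacked thirds determine the triad quality can be checked with a small sketch. The function name `triad_quality` and the use of MIDI note numbers are assumptions made for illustration; the concrete pitches in the usage note are likewise assumed and are not read from FIG. 6.

```python
# Hypothetical helper: classify a triad from the sizes of its two stacked thirds.
# Pitches are MIDI note numbers (an assumption, not from the specification).

TRIAD_NAMES = {
    (4, 3): "major",       # major third + minor third
    (3, 4): "minor",       # minor third + major third
    (3, 3): "diminished",  # minor third + minor third
    (4, 4): "augmented",   # major third + major third
}


def triad_quality(root, third, fifth):
    """Return the triad quality implied by the intervals root->third and third->fifth."""
    lower = third - root
    upper = fifth - third
    return TRIAD_NAMES.get((lower, upper), "other")
```

For instance, assuming a sung A4 (69) with synthesized C5 (72) and E5 (76), `triad_quality(69, 72, 76)` gives `"minor"`, and assuming a sung B4 (71) with D5 (74) and F5 (77), `triad_quality(71, 74, 77)` gives `"diminished"`, matching the interval reasoning in the text.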

Thus, according to the second embodiment, the singer can sing in harmony despite singing alone, which further expands the singer's range of singing expression. The pitch conversion described above is merely an example. The pitches may be converted so as to form something other than a chord, or may be shifted by an octave. The number of voice synthesis units is not limited to two; a single unit that converts to a pitch having a predetermined relationship may be used, or three or more units may be used.

In the second embodiment, the singer's singing voice and the singing voices from the voice synthesis units 140a and 140b are mixed and output from the speaker 172, while the accompaniment sound from the sound source unit 160 is output from a separate speaker 174; however, the singing voices and the accompaniment sound may instead be mixed and output from a single speaker. That is, it does not matter whether the output unit that outputs the singing voices and the accompaniment sound uses separate speakers or the same speaker.
The pitch conversion unit 106a converts the pitch indicated by the pitch data supplied from the pitch detection unit 104 into a pitch having a predetermined relationship to it, but this relationship may be made changeable by an instruction from the control unit 120 or the operation unit 112. The same applies to the pitch conversion unit 106b: the pitch relationship used for conversion may be made changeable by an instruction from the control unit 120 or the operation unit 112.

<Third Embodiment>
In the first embodiment, when the performance reaches a singing timing, the portion of the lyric data to be sung at that timing (character and pitch) is supplied to the voice synthesis unit 140, so the singer cannot control the timing of the synthesized lyrics.
A third embodiment is therefore described in which the singer can control, to some extent, the timing of the synthesized lyrics.

FIG. 7 is a functional block diagram showing the configuration of the singing voice synthesizing apparatus 10 according to the third embodiment.
The singing voice synthesizing apparatus 10 shown in this figure differs from the first embodiment shown in FIG. 1 in that the volume data output from the volume detection unit 108 is supplied to the control unit 120 as well as to the voice synthesis unit 140. The description of the third embodiment therefore focuses on this difference.

In the third embodiment, the control unit 120 supplies the lyric data corresponding to the next note to the voice synthesis unit 140, using as a trigger the fact that the volume indicated by the volume data supplied from the volume detection unit 108 has exceeded a threshold, or that the temporal change in that volume has exceeded a predetermined value. That is, when, for example, the volume of the singer's singing exceeds the threshold, the control unit 120 supplies the lyric data corresponding to the next note to the voice synthesis unit 140 even if the performance has not yet reached the singing timing of that lyric data.

A specific example of singing voice synthesis according to the third embodiment is described.
As in the first embodiment, consider the case where the singer selects "Sakura" as the song to sing, as shown in FIG. 4(a), and, listening to the accompaniment and following the progress of the performance, sings at the volume shown in (b) of the figure. In the third embodiment, the singing voice is output as shown in (d) of the figure.
To describe the characteristic part of the third embodiment: when the singer, relative to the progress of the performance, lowers the volume partway through "ra" (lyric 53) and then raises the volume before the next "sa" (lyric 54) (that is, when the temporal change in the volume exceeds the predetermined value), the control unit 120 supplies the lyric data of the next "sa" (reference numeral 54) to the voice synthesis unit 140 in response to the change in the volume data supplied from the volume detection unit 108.
As a result, "sa" (reference numeral 64) is synthesized at a timing earlier than the singing timing defined by the lyric data.
The reading of the lyric data corresponding to the next note may be triggered not only by the volume indicated by the volume data supplied from the volume detection unit 108 exceeding a threshold, or by the temporal change in that volume exceeding a predetermined value, but also by the slope of that temporal change (its acceleration) exceeding a predetermined value.
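The volume-based trigger of the third embodiment could be sketched along the following lines. This is a rough illustration only: volume is assumed to arrive as one value per analysis frame, "exceeding the threshold" is interpreted here as an upward crossing between consecutive frames, and the names `should_advance`, `advance_frames`, `threshold`, and `max_delta` are hypothetical.

```python
# Hypothetical sketch of the trigger used by the control unit 120 in the third embodiment.
# Assumption: one volume value per frame; a threshold "exceed" is an upward crossing.


def should_advance(prev_volume, volume, threshold, max_delta):
    """Return True when the next lyric should be supplied: either the volume
    crosses above `threshold`, or it rises by more than `max_delta` in one frame."""
    crossed = prev_volume <= threshold < volume
    jumped = (volume - prev_volume) > max_delta
    return crossed or jumped


def advance_frames(volumes, threshold, max_delta):
    """Return the frame indices at which the next lyric would be supplied."""
    frames = []
    prev = 0.0  # assume silence before the first frame
    for i, volume in enumerate(volumes):
        if should_advance(prev, volume, threshold, max_delta):
            frames.append(i)
        prev = volume
    return frames
```

With `volumes = [0.0, 0.6, 0.6, 0.1, 0.7, 0.7]`, `threshold = 0.3`, and `max_delta = 0.5`, the lyric would advance at frames 1 and 4: the singer drops the volume at frame 3 and raises it again at frame 4, which pulls the next lyric forward regardless of the timing in the lyric data. A trigger on the slope of the change (acceleration) could be added in the same way by keeping one more frame of history.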

Incidentally, when a singer keeps singing a given lyric at roughly the same pitch and roughly the same volume for longer than the timing defined by the lyric data, the singer can be assumed to be prolonging that lyric intentionally (letting it linger).
To handle such a case, the configuration shown by the broken line in FIG. 7 may be used. That is, the pitch data output from the pitch detection unit 104 is supplied to the control unit 120 as well as to the voice synthesis unit 140, and when the pitch indicated by the pitch data supplied from the pitch detection unit 104 is constant to within a predetermined value and the volume indicated by the volume data supplied from the volume detection unit 108 is constant to within a predetermined value, the control unit 120 does not supply the next lyric data to the voice synthesis unit 140 even if the next singing timing has arrived, and instead waits for a predetermined time (or until the volume drops). With this configuration, the singer can have a desired lyric synthesized for longer than the timing defined by the lyric data.
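The hold behavior for a prolonged lyric might be sketched as follows. The windowed representation of recent pitch and volume values, the tolerance defaults, the frame-count limit, and the names `is_sustaining` and `supply_next_lyric` are all assumptions introduced for illustration; the specification only says "within a predetermined value" and "a predetermined time".

```python
# Hypothetical sketch of the wait logic shown by the broken line in FIG. 7.
# Assumptions: recent pitch/volume values are kept in short lists, and the
# "predetermined time" is expressed as a maximum number of held frames.


def is_sustaining(pitches, volumes, pitch_tol, volume_tol):
    """True when both pitch and volume stay within their tolerances over the window."""
    pitch_steady = max(pitches) - min(pitches) <= pitch_tol
    volume_steady = max(volumes) - min(volumes) <= volume_tol
    return pitch_steady and volume_steady


def supply_next_lyric(timing_arrived, pitches, volumes, held_frames,
                      max_hold_frames=50, pitch_tol=0.5, volume_tol=0.1):
    """Decide whether the control unit should supply the next lyric data now."""
    if not timing_arrived:
        return False  # the next singing timing has not come yet
    if held_frames >= max_hold_frames:
        return True   # waited the maximum time; move on
    # Keep waiting while the singer is steadily holding the current lyric.
    return not is_sustaining(pitches, volumes, pitch_tol, volume_tol)
```

A drop in volume widens the volume range in the window beyond `volume_tol`, so the "until the volume drops" condition is covered by the same check in this sketch.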

Thus, according to the third embodiment, the singer can control the synthesized lyrics to some extent rather than being bound to the timing defined by the lyric data, which makes it possible to vary the timing of the synthesized singing in an improvised (ad-lib) manner.
The third embodiment is not limited to being applied to the first embodiment; it may also be combined with the second embodiment, in which the singer's own singing is mixed with the synthesized singing.

<Applications and Modifications>
The present invention is not limited to the first to third embodiments described above; for example, various applications and modifications such as those described below are possible. One or more arbitrarily selected aspects of the applications and modifications described below may also be combined as appropriate.

In the first (and second) embodiment, the control unit 120 supplies the lyric data (character and pitch) corresponding to a singing timing to the voice synthesis unit 140 when the performance reaches that singing timing; of these, however, the control unit 120 need not supply the pitch to the voice synthesis unit 140. The reason is that the voice synthesis unit 140 outputs substantially no singing voice signal when the volume indicated by the volume data is at or below the threshold, and when the volume exceeds the threshold it synthesizes at the pitch indicated by the pitch data output from the pitch detection unit 104, not at the pitch in the lyric data.
Even in a configuration where the control unit 120 does not supply the pitch of the lyrics, the voice synthesis unit 140 need only synthesize the character of the lyric data supplied from the control unit 120, when the volume indicated by the volume data exceeds the threshold, at the pitch indicated by the pitch data and in accordance with that volume.
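The reason the lyric pitch is not needed can be made concrete with a small sketch of how the synthesis parameters might be chosen. The function name `synthesis_params` and the dictionary layout are hypothetical; the actual synthesis from the speech-unit library is outside the scope of this sketch.

```python
# Hypothetical sketch: selecting synthesis parameters in the voice synthesis unit 140.
# Only the lyric character comes from the lyric data; pitch and volume come from the singer.


def synthesis_params(char, detected_pitch, volume, threshold):
    """Return the parameters to synthesize with, or None when the volume is
    at or below the threshold (substantially no output)."""
    if volume <= threshold:
        return None
    return {"char": char, "pitch": detected_pitch, "volume": volume}
```

Since the pitch written in the lyric data never enters this function, omitting it from what the control unit 120 supplies does not change the result.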

In each embodiment, MIDI data is used as the accompaniment data, but the present invention is not limited to this. For example, the musical tone signal may be obtained by playing back a compact disc. In that configuration, elapsed time information or remaining time information can be used as the information for tracking the progress of the performance. The control unit 120 therefore need only supply the lyric data to the voice synthesis unit 140 (140a, 140b) in step with the progress of the performance as determined from the elapsed time information or remaining time information.
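Supplying lyric data from elapsed or remaining time could look roughly like the following. The representation of the lyric data as parallel lists of onset times (in seconds) and characters, and the names `elapsed_from_remaining` and `lyric_at`, are assumptions for illustration only.

```python
# Hypothetical sketch: looking up the current lyric from playback time.
# Assumption: lyric data is given as sorted onset times (seconds) plus characters.
import bisect


def elapsed_from_remaining(total_duration, remaining):
    """Convert remaining-time information into elapsed time."""
    return total_duration - remaining


def lyric_at(elapsed, timings, chars):
    """Return the lyric whose singing timing was most recently reached,
    or None if the first timing has not been reached yet."""
    index = bisect.bisect_right(timings, elapsed) - 1
    return chars[index] if index >= 0 else None
```

With `timings = [0.0, 1.0, 2.0]` and `chars = ["sa", "ku", "ra"]`, an elapsed time of 1.5 seconds selects `"ku"`.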

In each embodiment, the voice input unit 102 takes in the singer's singing through a microphone and converts it into a singing voice signal, but any configuration that inputs, or receives, a singing voice signal (input voice) in some form will do. For example, the voice input unit 102 may receive a singing voice signal processed by another processing unit or a singing voice signal supplied (or transferred) from another device, or it may simply be an input interface circuit or the like that receives a singing voice signal and passes it to the subsequent stage.

In each embodiment, the pitch detection unit 104, the pitch conversion units 106a and 106b, and the volume detection unit 108 are implemented in software, but they may be implemented in hardware. The voice synthesis unit 140 (140a, 140b) may also be implemented in software.

DESCRIPTION OF SYMBOLS
10 ... singing voice synthesizing apparatus, 104 ... pitch detection unit, 106a, 106b ... pitch conversion unit, 120 ... control unit, 140, 140a, 140b ... voice synthesis unit, 150 ... mixer, 160 ... sound source unit.

Claims

1. A singing voice synthesizing apparatus comprising:
a pitch detection unit that detects the pitch of an input voice;
a volume detection unit that detects the volume of the input voice; and
a voice synthesis unit that, when lyric data defining lyrics and the utterance timing of the lyrics is supplied in accordance with the progress of a performance, synthesizes a singing voice based on the lyric data in accordance with the pitch detected by the pitch detection unit and the volume detected by the volume detection unit.

2. The singing voice synthesizing apparatus according to claim 1, further comprising:
a sound source unit that generates an accompaniment sound in accordance with the progress of the performance; and
an output unit that outputs the accompaniment sound, the input voice, and the singing voice.

3. The singing voice synthesizing apparatus according to claim 1 or 2, wherein the voice synthesis unit synthesizes the singing voice while changing the utterance timing of the lyric data in accordance with the volume detected by the volume detection unit.

4. A singing synthesis program that causes a computer to function as:
a pitch detection unit that detects the pitch of an input voice;
a volume detection unit that detects the volume of the input voice; and
a voice synthesis unit that, when lyric data defining lyrics and the utterance timing of the lyrics is supplied in accordance with the progress of a performance, synthesizes a singing voice based on the lyric data in accordance with the pitch detected by the pitch detection unit and the volume detected by the volume detection unit.