JP5810947B2

JP5810947B2 - Speech segment specifying device, speech parameter generating device, and program

Info

Publication number: JP5810947B2
Application number: JP2012018609A
Authority: JP
Inventors: 典昭阿瀬見
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2012-01-31
Filing date: 2012-01-31
Publication date: 2015-11-11
Anticipated expiration: 2032-01-31
Also published as: JP2013156544A

Description

The present invention relates to an utterance section specifying device and a program for specifying utterance sections in a speech waveform, and to a speech parameter generating device that generates speech parameters from the speech data in the specified utterance sections.

Conventionally, a speech synthesizer has been known that includes a speech parameter generating device for preparing speech parameters from input sound, and that generates speech of specified content by synthesizing the speech parameters prepared by the speech parameter generating device (see, for example, Patent Document 1).

The speech parameter generating device of the speech synthesizer described in Patent Document 1 includes a sound separation unit that separates input sound (hereinafter, speech data) into a harmonic component and a non-harmonic component; a phoneme segmentation unit that, based on the harmonic component separated by the sound separation unit, divides the speech data into sections in which each phoneme is estimated to have been uttered (hereinafter, utterance sections); and a parameter generation unit that generates speech parameters from the speech data in each utterance section divided by the phoneme segmentation unit.

In the method by which the phoneme segmentation unit of Patent Document 1 divides the speech data into utterance sections, the waveform of the speech data is displayed, and the user of the speech parameter generating device specifies each utterance section by designating the utterance start time and utterance end time of each phoneme through switch operations while visually checking the displayed waveform.

Patent Document 1: JP 2004-038071 A

With the method of Patent Document 1 for dividing speech data into utterance sections, the user of the speech parameter generating device must designate the beginning (utterance start time) and end (utterance end time) of each utterance section while checking them visually, so the accuracy with which the speech data is divided into utterance sections is low.

Furthermore, because the method of Patent Document 1 relies on manual work, it is difficult to divide each piece of speech data into utterance sections when there is a large amount of speech data.
Accordingly, an object of the present invention is to make it possible to divide each piece of speech data into utterance sections even for a large amount of speech data.

In the utterance section specifying device of the present invention, made to achieve the above object, the content information acquisition means acquires utterance content information, the timing information acquisition means acquires utterance timing information, the score data acquisition means acquires music score data, and the speech data acquisition means acquires speech data. In the present invention, the utterance content information is information representing the character string of the content to be uttered in a target song, which is a single piece of music, and the utterance timing information is information defining the utterance start timing of each character represented by the utterance content information acquired by the content information acquisition means (hereinafter, specific content information). Furthermore, the music score data is data that represents at least the score of the singing melody of the target song and that defines at least a pitch and a performance start timing for each output sound constituting the singing melody, and the speech data is data representing a speech waveform uttered for the character string represented by the specific content information.

In the utterance section specifying device of the present invention, the power transition deriving means derives, from the speech data acquired by the speech data acquisition means, a speech power transition representing the change over time of the power calculated for each unit time, the unit times being defined so as to be contiguous along the time axis of the speech data. The utterance section specifying means then specifies the times at which the speech power transition changes along the time axis as utterance start times and utterance end times, and specifies each utterance section defined by a pair consisting of an utterance start time and the utterance end time that follows it.
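The derivation of a speech power transition can be illustrated with a minimal Python sketch. It assumes that the "power" of a unit time is the mean-square amplitude of a non-overlapping frame; the text above does not fix the formula, and the function name, frame length, and toy signal are hypothetical.

```python
import math

def power_transition(samples, frame_len):
    """Mean-square power of each consecutive, non-overlapping frame (unit time).

    Assumption: mean-square amplitude stands in for the "power" of a unit time.
    """
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers

# Toy input: 0.1 s of silence, 0.1 s of a 220 Hz tone, 0.1 s of silence.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(800)]
signal = [0.0] * 800 + tone + [0.0] * 800

# One power value per 50 ms unit time.
powers = power_transition(signal, frame_len=400)
```

The resulting list is the kind of per-unit-time sequence in which rises and falls can then be searched for utterance start and end times.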

Further, in the utterance section specifying device of the present invention, the note-lyrics association means, based on the specific content information acquired by the content information acquisition means, the utterance timing information acquired by the timing information acquisition means, and the music score data acquired by the score data acquisition means, specifies, for each syllable of the character string represented by the specific content information, the output sound whose performance start timing has the smallest time difference from the utterance start timing of the character corresponding to that syllable, and generates note-syllable sets each associating that output sound with the content of that syllable. The note-singing-voice integration means then specifies, for each utterance section specified by the utterance section specifying means, the note-syllable set generated by the note-lyrics association means whose performance start timing has the smallest time difference from the utterance start time defining that utterance section, and generates syllable data associating at least that utterance section with that note-syllable set.
The utterance section specifying means in the present invention may specify each utterance section by taking, in the time progression of the speech power transition, each timing at which the power becomes equal to or greater than a predefined threshold as an utterance start time and each timing at which the power becomes equal to or less than the threshold as an utterance end time.
Alternatively, the utterance section specifying means in the present invention may specify each utterance section by taking each timing at which the time derivative of the speech power transition reaches a local maximum as an utterance start time and each timing at which it reaches a local minimum as an utterance end time.
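The two ways of specifying utterance sections described above can be sketched in Python as follows. This is an illustration under assumptions, not the patented implementation: times are frame indices, the time derivative is approximated by a first difference, only rising maxima and falling minima are used, and a section that is still open at the end of the data is dropped. All function names are hypothetical.

```python
def sections_by_threshold(powers, threshold):
    """Start when power reaches the threshold or more, end when it is at or below it."""
    sections, start = [], None
    for i, p in enumerate(powers):
        if start is None and p >= threshold:
            start = i
        elif start is not None and p <= threshold:
            sections.append((start, i))
            start = None
    return sections  # an unfinished trailing section is dropped (assumption)

def sections_by_derivative(powers):
    """Start at local maxima of the first difference, end at its local minima."""
    diff = [b - a for a, b in zip(powers, powers[1:])]
    sections, start = [], None
    for i in range(1, len(diff) - 1):
        is_max = diff[i] > 0 and diff[i] > diff[i - 1] and diff[i] >= diff[i + 1]
        is_min = diff[i] < 0 and diff[i] < diff[i - 1] and diff[i] <= diff[i + 1]
        if is_max and start is None:
            start = i
        elif is_min and start is not None:
            sections.append((start, i))
            start = None
    return sections  # indices refer to the difference between frames i and i + 1

# Toy power transition containing two bursts of speech.
powers = [0, 0, 0.1, 0.5, 0.5, 0.5, 0.1, 0, 0, 0.4, 0.6, 0.1, 0]
by_threshold = sections_by_threshold(powers, threshold=0.3)
by_derivative = sections_by_derivative(powers)
```

The threshold variant depends on a level that must be chosen in advance, while the derivative variant reacts to the steepest rise and fall of the power and so needs no absolute level.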

According to such an utterance section specifying device, the utterance start times and utterance end times, and hence the utterance sections, can be specified automatically based on the timings at which the speech power changes as the uttered speech waveform progresses along the time axis.

As a result, according to the utterance section specifying device of the present invention, unlike the device described in Patent Document 1, the user of the device does not need to designate the utterance start times and utterance end times, and each piece of speech data can be divided into utterance sections even for a large amount of speech data.

Moreover, the utterance section specifying device of the present invention generates syllable data by associating each specified utterance section with the note-syllable set corresponding to that utterance section.
Therefore, according to the utterance section specifying device of the present invention, when speech parameters are generated from the speech data in the utterance section included in the syllable data, highly reliable syllable content can be attached to the speech parameters, and consequently a variety of information can be attached to the speech parameters needed for speech synthesis. As a result, the utterance section specifying device of the present invention makes it easier, at the time of speech synthesis, to realize the speech desired by the person using the speech synthesis.
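The two nearest-timing associations that lead to syllable data (syllable to note, then utterance section to note-syllable set) can be sketched in Python as follows. The class names, field names, and toy timings are hypothetical, times are assumed to be seconds from the start of the performance, and ties are resolved in favor of the earliest candidate.

```python
from dataclasses import dataclass

@dataclass
class Note:
    pitch: int      # MIDI note number
    onset: float    # performance start timing, in seconds

@dataclass
class NoteSyllable:
    note: Note
    syllable: str

def associate(notes, syllables):
    """For each (syllable, utterance start timing), pick the note with the nearest onset."""
    return [
        NoteSyllable(min(notes, key=lambda n: abs(n.onset - t)), text)
        for text, t in syllables
    ]

def integrate(sections, note_syllables):
    """For each (start, end) section, pick the note-syllable set with the nearest onset."""
    return [
        (sec, min(note_syllables, key=lambda ns: abs(ns.note.onset - sec[0])))
        for sec in sections
    ]

notes = [Note(60, 0.5), Note(62, 1.0), Note(64, 1.5)]
syllables = [("sa", 0.48), ("ku", 1.02), ("ra", 1.55)]    # from lyric timing data
sections = [(0.52, 0.9), (1.05, 1.4), (1.53, 2.0)]        # from power analysis

syllable_data = integrate(sections, associate(notes, syllables))
```

Each entry of `syllable_data` pairs a measured utterance section with the score note and syllable it most plausibly belongs to.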

Further, in the utterance section specifying device of the present invention, pitch transition deriving means may derive, based at least on the speech data acquired by the speech data acquisition means, a pitch time transition representing how the pitch of the speech waveform changes along the time axis, and pitch specifying means may specify, in that pitch time transition, the pitch in each utterance section specified by the utterance section specifying means as an utterance pitch.

In this case, the note-singing-voice integration means in the present invention may generate the syllable data by associating each utterance pitch specified by the pitch specifying means, the utterance section corresponding to that utterance pitch, and the note-syllable set corresponding to that utterance section.

According to such an utterance section specifying device, the pitch in each utterance section can be added to each piece of syllable data. As a result, when speech parameters are generated from the speech data in the utterance section included in the syllable data, highly reliable pitch information for that utterance section can be attached to the speech parameters.

Furthermore, the pitch specifying means in the present invention may specify, as the utterance pitch, the pitch corresponding to the frequency that maximizes the product of a weight function and the autocorrelation value of the frequency spectrum calculated for each unit time, the unit times being defined so as to be contiguous along the time axis of the speech data. The weight function is defined along the frequency axis so that the weight is larger at the frequency corresponding to the pitch of the output sound that, among the output sounds represented by the music score data, corresponds to that unit time.
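The idea of weighting pitch candidates toward the score pitch can be sketched in Python as follows. Several choices here are assumptions that go beyond the text: a time-domain normalized autocorrelation is swapped in for the autocorrelation of the frequency spectrum, the candidates are MIDI note numbers 40 to 76, and the weight function is a Gaussian in semitones with a hypothetical width `sigma`.

```python
import math

def midi_to_hz(note):
    """Equal-tempered frequency of a MIDI note number (A4 = 440 Hz assumed)."""
    return 440.0 * 2 ** ((note - 69) / 12)

def normalized_autocorr(frame, lag):
    """Normalized autocorrelation of one frame at a given lag in samples."""
    a, b = frame[:-lag], frame[lag:]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0

def weighted_pitch(frame, sample_rate, score_note, sigma=2.0,
                   candidates=range(40, 77)):
    """Return the candidate note maximizing autocorrelation times score weight."""
    best_note, best_score = None, -math.inf
    for note in candidates:
        lag = round(sample_rate / midi_to_hz(note))
        # Gaussian weight: largest at the pitch the score expects in this unit time.
        weight = math.exp(-0.5 * ((note - score_note) / sigma) ** 2)
        score = normalized_autocorr(frame, lag) * weight
        if score > best_score:
            best_note, best_score = note, score
    return best_note

# A 220 Hz tone (MIDI note 57) sung where the score also expects note 57.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(800)]
utterance_pitch = weighted_pitch(frame, SAMPLE_RATE, score_note=57)
```

For a periodic signal the autocorrelation is also high one octave below the true pitch, so the weight helps to reject that candidate. The cost is a bias toward the score when the singer is off pitch.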

According to such an utterance section specifying device, the utterance pitch can be specified as a pitch close to the pitch that should be uttered in the target song.
The pitch specifying means in the present invention may also specify, as the utterance pitch, the pitch corresponding to the frequency at which the autocorrelation value of the frequency spectrum, calculated for each unit time defined so as to be contiguous along the time axis of the speech data, is maximum.

According to such an utterance section specifying device, the pitch of the speech actually uttered can be specified as the utterance pitch.
The note-singing-voice integration means in the present invention may also generate the syllable data by associating the power at each time in the speech power transition, the utterance section corresponding to that time, and the note-syllable set corresponding to that utterance section.

According to such an utterance section specifying device, the power in each utterance section can be added to each piece of syllable data. As a result, when speech parameters are generated from the speech data in the utterance section included in the syllable data, highly reliable information on the power (strength) of the speech in that utterance section can be attached to the speech parameters.

The present invention may also be applied to a speech parameter generating device. In that case, however, the speech parameter generating device must include the utterance section specifying device according to claim 1 and parameter deriving means. The parameter deriving means referred to here derives speech parameters, which are at least one predefined feature quantity, from the speech data in the utterance sections of the syllable data generated by the note-singing-voice integration means of the utterance section specifying device.

According to such a speech parameter generating device, when speech parameters are generated from the speech data in the utterance section included in the syllable data, highly reliable syllable content can be attached to the speech parameters, and consequently a variety of information can be attached to the speech parameters needed for speech synthesis. As a result, the utterance section specifying device of the present invention makes it easier, at the time of speech synthesis, to realize the speech desired by the person using the speech synthesis.

The feature quantities serving as speech parameters here are those needed to perform speech synthesis by formant synthesis, and include, for example, the fundamental frequency, mel-frequency cepstral coefficients (MFCC), power, and the time differences of each of these.
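The "time differences" of the feature quantities can be illustrated with a minimal Python sketch. It assumes each feature is a per-unit-time series and that the time difference is a simple first difference; the dictionary keys and sample values are hypothetical, and MFCC extraction is omitted because it would need a full filter-bank implementation.

```python
def time_differences(values):
    """First difference between consecutive unit times."""
    return [b - a for a, b in zip(values, values[1:])]

def with_deltas(features):
    """Return a copy of the feature dict with a 'delta_' series added per feature."""
    out = dict(features)
    for name, series in features.items():
        out["delta_" + name] = time_differences(series)
    return out

# Toy feature series for one utterance section.
params = with_deltas({
    "f0_hz": [220.0, 222.0, 221.0],   # fundamental frequency per unit time
    "power": [0.25, 0.5, 0.5],        # power per unit time
})
```

Each delta series is one element shorter than its source series because it describes the change between neighboring unit times.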

Furthermore, the present invention may be implemented as a program executed by a computer. In this case, the program to which the present invention is applied must cause the computer to execute a content information acquisition procedure for acquiring utterance content information, a timing information acquisition procedure for acquiring utterance timing information, a score data acquisition procedure for acquiring music score data, a speech data acquisition procedure for acquiring speech data, a power transition deriving procedure for deriving the speech power transition, an utterance section specifying procedure for specifying each utterance section, a note-lyrics association procedure for generating each note-syllable set, and a note-singing-voice integration procedure for generating syllable data.
The utterance section specifying procedure may specify each utterance section by taking, in the time progression of the speech power transition, each timing at which the power becomes equal to or greater than a predefined threshold as an utterance start time and each timing at which the power becomes equal to or less than the threshold as an utterance end time. Alternatively, it may specify each utterance section by taking each timing at which the time derivative of the speech power transition reaches a local maximum as an utterance start time and each timing at which it reaches a local minimum as an utterance end time.

The present invention may also be a program for causing a computer to function as the utterance section specifying device.
A program of the present invention made in this way can be used, for example, by recording it on a computer-readable recording medium such as a DVD-ROM, CD-ROM, or hard disk and loading it into a computer and starting it as needed, or by having a computer acquire it via a communication line and start it as needed. By causing a computer to execute each procedure, the computer can be made to function as the utterance section specifying device according to claim 1.

Block diagram showing the schematic configuration of the speech synthesis system.
Flowchart showing the procedure of the speech parameter registration processing.
Flowchart showing the procedure of the utterance section estimation processing.
Diagram illustrating speech data.
Diagram illustrating the method of deriving the pitch time change.
Diagram illustrating the pitch time change.
Diagram illustrating the method of deriving the power time change.
Diagram illustrating the method of specifying utterance sections.
Diagram showing a modification of the method of specifying utterance sections.
Diagram showing the procedure of the speech synthesis processing.

Embodiments of the present invention will be described below with reference to the drawings.
<About the speech synthesis system>
FIG. 1 is a diagram showing a schematic configuration of a speech synthesis system to which the present invention is applied.

The speech synthesis system 1 to which the present invention is applied outputs speech synthesized from pre-registered speech parameters (that is, synthesized sound) so that speech having the content designated by a user of the speech synthesis system 1 is output.

To realize this, the speech synthesis system 1 includes a voice input device 10 for inputting speech, and a MIDI storage server 25 that stores the speech input via the voice input device 10 (hereinafter, speech waveform data SV) and various data used for karaoke (hereinafter, music data MD). The speech synthesis system 1 further includes an information processing device 30 that executes processing for generating speech parameters based on the speech waveform data SV and the music data MD stored in the MIDI storage server 25, and a data storage server 50 that stores the speech parameters generated by the information processing device 30. In addition, the speech synthesis system 1 includes a voice output terminal 60 that outputs synthesized sound produced by speech synthesis based on the speech parameters stored in the data storage server 50. The speech synthesis system 1 in this embodiment includes a plurality of voice output terminals 60.

That is, in the speech synthesis system 1 of this embodiment, the information processing device 30 generates at least speech parameters PM based on the speech waveform data SV and the music data MD stored in the MIDI storage server 25, and stores them in the data storage server 50. The voice output terminal 60 then outputs synthesized sound produced by speech synthesis based on the speech parameters PM stored in the data storage server 50, so that speech having the content designated by the user is output via the voice output terminal 60.

The speech parameters PM referred to here, described in detail later, are feature quantities of speech used for so-called formant synthesis, and include, for example, the fundamental frequency, mel-frequency cepstral coefficients (MFCC), and power at each syllable of the uttered speech, together with their time differences.
<About the MIDI storage server>
First, the MIDI storage server 25 is a device built around a storage device whose stored contents can be read and written, and is connected to the voice input device 10 via a communication network.

The MIDI storage server 25 stores at least music data MD prepared in advance for each song. The music data MD includes music MIDI data DM (corresponding to the music score data in the claims) and a lyrics data group DL, and the music MIDI data DM and the lyrics data group DL are associated with each other for each corresponding song.

Of these, the music MIDI data DM is data representing the score of one song according to the well-known MIDI (Musical Instrument Digital Interface) standard, and is prepared in advance for each song. Each piece of music MIDI data DM has at least identification data, which distinguishes the song, and score tracks, which represent the score for each instrument used in the song.

Each score track defines, for each output sound output from the MIDI sound source, at least a pitch (the so-called note number) and the period during which the sound module outputs the output sound (hereinafter, note length). The note length in a score track is defined by a performance start timing (the so-called note-on timing), which represents the time from the start of the performance of the song until output of the output sound begins, and a performance end timing (the so-called note-off timing), which represents the time from the start of the performance of the song until output of the output sound ends.
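The relationship between note number, note-on timing, note-off timing, and note length can be sketched in Python as follows. The class and field names are hypothetical, timings are assumed to be seconds from the start of the performance, and the conversion to frequency assumes equal temperament with A4 (note number 69) at 440 Hz, a common convention that the text does not state.

```python
from dataclasses import dataclass

@dataclass
class ScoreNote:
    note_number: int   # pitch as a MIDI note number
    note_on: float     # performance start timing, seconds from start of song
    note_off: float    # performance end timing, seconds from start of song

    @property
    def length(self):
        """Note length: the period during which the output sound is produced."""
        return self.note_off - self.note_on

    @property
    def frequency_hz(self):
        """Equal-tempered frequency, assuming A4 (note number 69) = 440 Hz."""
        return 440.0 * 2 ** ((self.note_number - 69) / 12)

# Note number 57 (A3) sounding from 1.0 s to 1.5 s.
note = ScoreNote(note_number=57, note_on=1.0, note_off=1.5)
```

Storing the two timings and deriving the length keeps the representation aligned with how the score track defines a note.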

A score track is prepared for each instrument, for example keyboard instruments (such as piano and pipe organ), stringed instruments (such as violin, viola, guitar, bass guitar, and koto), percussion instruments (such as vibraphone, drums, cymbals, timpani, and xylophone), and wind instruments (such as clarinet, trumpet, flute, and shakuhachi). In this embodiment, the vibraphone is defined as the instrument that carries the singing melody (melody line) of the song. In the following, an output sound defined in the score track corresponding to the instrument that carries the singing melody is referred to as a singing output sound.

Meanwhile, the lyrics data group DL is data relating to the lyrics displayed on the display device of a well-known karaoke machine. It comprises lyrics telop data DT (corresponding to the utterance content information in the present invention), which represents the characters constituting the lyrics of the song (hereinafter, lyrics constituent characters), and lyrics output data DO (corresponding to the utterance timing information in the present invention), which defines a timing correspondence that associates the lyrics output timing, that is, the output timing of each lyrics constituent character, with the performance of the music MIDI data DM.

Specifically, in the timing correspondence of this embodiment, the timing at which output of the lyrics telop data DT starts is associated with the timing at which the performance of the music MIDI data DM starts, and the lyrics output timing of each lyrics constituent character along the time axis of the song is defined by the elapsed time from the start of the performance of the music MIDI data DM. The elapsed time referred to here is, for example, a time representing the timing at which the color of a displayed lyrics constituent character is changed, and is defined by the speed of the color change. A lyrics constituent character may be an individual character of the lyrics, or a clause or phrase formed by grouping such characters according to a specific rule along the time axis.
<About the configuration of the voice input device>
Next, the voice input device 10 will be described.

The voice input device 10 includes a communication unit 11, an input reception unit 12, a display unit 13, a voice input unit 14, an audio output unit 15, a sound module 16, a storage unit 17, and a control unit 20. That is, the voice input device 10 in this embodiment is configured as a well-known karaoke machine.

Of these, the communication unit 11 is the unit through which the voice input device 10 communicates with external devices via a communication network (for example, a public wireless communication network or a network line). The input reception unit 12 is an input device (for example, keys, switches, or a remote control receiver) that accepts input of information and commands in accordance with external operations.

The display unit 13 is a display device (for example, a liquid crystal display or CRT) that displays images including at least information represented by character codes. The voice input unit 14 is a device (a so-called microphone) that converts sound into an electrical signal and inputs it to the control unit 20. The audio output unit 15 is a device (a so-called speaker) that converts an electrical signal from the control unit 20 into sound and outputs it. The sound module 16 is a device (for example, a MIDI sound source) that outputs sound simulating the sound of a sound source (that is, output sound) based on data defined by the MIDI (Musical Instrument Digital Interface) standard.

The storage unit 17 is a non-volatile storage device (for example, a hard disk drive or flash memory) whose stored contents can be read and written.
The control unit 20 is built around a well-known computer having at least a ROM 21 that stores processing programs and data whose contents must be retained even when the power is turned off, a RAM 22 that temporarily stores processing programs and data, and a CPU 23 that executes each process (various operations) according to the processing programs stored in the ROM 21 and the RAM 22.

The ROM 21 stores a processing program with which the control unit 20 executes a well-known karaoke performance process, and a processing program with which the control unit 20 executes a voice storage process. In the voice storage process, the voice input via the voice input unit 14 while one piece of music is being played by the karaoke performance process is stored in the MIDI storage server 25 as speech waveform data SV, in association with music identification information that identifies the target music.

That is, in accordance with the karaoke performance process, the voice input device 10 acquires from the MIDI storage server 25 the music data MD corresponding to one piece of music designated via the input reception unit 12 (hereinafter referred to as the target music), plays the target music based on the music MIDI data DM in the music data MD, and displays the lyrics of the target music on the display unit 13 based on the lyrics data group DL in the music data MD.

Furthermore, the voice input device 10 stores the speech waveform data SV in the MIDI storage server 25 in association with the music identification information that identifies the target music (here, the music data MD itself) and with speaker identification information (hereinafter referred to as the speaker ID) that identifies the person who input the voice (hereinafter referred to as the speaker). The speech waveform data SV stored in the MIDI storage server 25 is also associated with speaker feature information representing the characteristics of the speaker, which includes, for example, the speaker's gender and age.
<Configuration of the information processing device>
Next, the information processing device 30 will be described.

The information processing device 30 includes a communication unit 31, an input reception unit 32, a display unit 33, a storage unit 34, and a control unit 40.
Among these, the communication unit 31 communicates with external devices via a communication network (for example, a public wireless communication network or a network line). The input reception unit 32 is an input device (for example, a keyboard or a pointing device) that accepts input of information and commands in accordance with external operations. The display unit 33 is a display device (for example, a liquid crystal display or a CRT) that displays images.

The storage unit 34 is a non-volatile storage device (for example, a hard disk device or a flash memory) whose contents can be read and written. The control unit 40 is built around a known computer having at least a ROM 41, a RAM 42, and a CPU 43.

The ROM 41 of the information processing device 30 stores a processing program with which the control unit 40 executes a speech parameter registration process. This process generates speech parameters PM based on the speech waveform data SV and the music data MD stored in the MIDI storage server 25, and stores them in the data storage server 50.

The data storage server 50 is a device built around a storage device whose contents can be read and written, and is connected to the information processing device 30 via a communication network.
<Speech parameter registration process>
Next, the speech parameter registration process executed by the information processing device 30 will be described.

As shown in FIG. 2, when the speech parameter registration process is started, it first acquires the music MIDI data DM of the music designated via the input reception unit 32 (that is, the target music) (S110). It then acquires the lyrics data group DL of the target music (S120), and acquires one item of speech waveform data SV that corresponds both to the target music and to the speaker ID designated via the input reception unit 32 (S130).

Furthermore, an utterance section estimation process is executed (S140). For the speech waveform data SV acquired in S130, this process identifies the sections in which each syllable contained in the uttered content is estimated to have been uttered (hereinafter referred to as utterance sections), and generates syllable data in which various information is associated with each utterance section.

Then, a speech parameter PM is derived from the speech waveform in the utterance section represented by each item of syllable data (hereinafter referred to as a syllable waveform) (S150). In S150 of this embodiment, the fundamental frequency, the mel-frequency cepstrum (MFCC), the power, and the time differences of each are derived as speech parameters PM. Methods for deriving the fundamental frequency, MFCC, and power are well known, so a detailed description is omitted here. For example, the fundamental frequency may be derived using the autocorrelation of the syllable waveform along the time axis, the autocorrelation of the frequency spectrum of the syllable waveform, or the cepstrum method. The MFCC may be derived by applying a time analysis window to the syllable waveform, performing frequency analysis (for example, an FFT) for each window, taking the logarithm of the magnitude at each frequency, and performing frequency analysis again on the result. The power may be derived by applying a time analysis window to the syllable waveform and integrating the squared amplitude along the time direction.
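
As a rough illustration of the power and time-difference parameters mentioned for S150, the following Python sketch computes the power of each analysis window as the sum of squared amplitudes and then takes a first-order difference between consecutive windows. The function names `frame_power` and `time_difference`, the rectangular window, and the hop size are assumptions made for this example; the description does not specify them.

```python
def frame_power(samples, frame_len, hop):
    """Sum of squared amplitudes inside each analysis window (assumed rectangular)."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        powers.append(sum(s * s for s in frame))
    return powers


def time_difference(values):
    """First-order difference between consecutive windows (a simple 'delta')."""
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical syllable waveform: a quiet half followed by a louder half.
syllable_waveform = [0.0, 1.0, 0.0, -1.0, 0.0, 2.0, 0.0, -2.0]
powers = frame_power(syllable_waveform, frame_len=4, hop=4)
deltas = time_difference(powers)
```

The same differencing could be applied to a fundamental-frequency or MFCC track; only the power case is shown here.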

Next, speech parameter registration is executed to store the speech parameters PM derived in S150 (S160). The speech parameters PM stored in the data storage server 50 in S160 of this embodiment are associated with the content (type) of the uttered syllable, the speaker ID, and the speaker feature information.

Thereafter, the speech parameter registration process ends.
<Utterance section estimation process>
Next, the utterance section estimation process started in S140 of the speech parameter registration process will be described.

As shown in FIG. 3, when the utterance section estimation process is started, it calculates, from the speech waveform data SV acquired in S130, the pitch time change, that is, the transition of the pitch of the speech waveform along the time axis (S210).

Specifically, in S210 of this embodiment, waveform segments xw(n) are cut out of the speech waveform x(n) represented by the speech waveform data SV shown in FIG. 4, while shifting a time window having a fixed width LW (see equation (1) below). Here, x is a discrete signal sampled at the sampling frequency FS, and n is an index representing time. The symbol "si" in equation (1) is an index indicating the start position of the time window, and changes at a fixed interval (for example, 50% of LW).

Each cut-out waveform segment xw(n) is subjected to a DFT (discrete Fourier transform) to derive the frequency spectrum X(k) given by equation (2) below. As shown in FIG. 5(A), the frequency spectrum X(k) plots the frequencies contained in a unit time on the horizontal axis and the level (amplitude) of each frequency on the vertical axis. In equation (2), k takes values from 0 to LW - 1.
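
Equations (1) and (2) are not reproduced in this text, so the following Python sketch shows only one plausible reading of the windowing and DFT steps: a rectangular window of width LW that starts at index si and advances by a fixed hop, followed by a direct DFT with k running from 0 to LW - 1. The function names `cut_segments` and `dft`, and the choice of a rectangular window, are assumptions for illustration.

```python
import cmath
import math


def cut_segments(x, lw, hop):
    """Cut segments xw(n) of width lw; each start index si advances by hop."""
    return [x[si:si + lw] for si in range(0, len(x) - lw + 1, hop)]


def dft(segment):
    """Direct discrete Fourier transform, X(k) for k = 0 .. len(segment) - 1."""
    n_points = len(segment)
    return [
        sum(
            segment[n] * cmath.exp(-2j * math.pi * k * n / n_points)
            for n in range(n_points)
        )
        for k in range(n_points)
    ]


# Hypothetical input: a sinusoid that falls exactly on DFT bin 2 when LW = 8.
lw = 8
x = [math.sin(2 * math.pi * 2 * n / lw) for n in range(16)]
segments = cut_segments(x, lw, hop=lw // 2)      # 50% of LW, as in the text
magnitudes = [abs(v) for v in dft(segments[0])]  # level of each frequency bin
```

A production implementation would normally use an FFT routine instead of this O(LW²) loop; the direct form is used here only to keep the sketch self-contained.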

Further, for the frequency spectrum X(k), the autocorrelation function CORX(p) along the frequency axis is calculated according to equation (3) below. As shown in FIG. 5(B), the autocorrelation function CORX(p) plots the amount of frequency shift on the horizontal axis and the correlation value for each shift on the vertical axis. In equation (3), p is the shift in the frequency index, and Abs is a function that takes the absolute value of a complex number.

Next, the autocorrelation function CORX(p) calculated in this way is multiplied by a predefined weight function wf(p). This weight function wf(p) is given by equation (4) below. As shown in FIG. 5(C), its weights are defined along the frequency axis so that the weight is larger the closer a frequency is to the one corresponding to the pitch of the singing output sound for the time window (hereinafter referred to as the model pitch). In equation (4), σ is the standard deviation representing the distribution of pitch around the model pitch.

The autocorrelation function CORX(p) is multiplied by this weight function wf(p); the product is hereinafter referred to as the final calculation result. As shown in FIG. 5(D), the frequency index shift p at which the final calculation result is largest is identified, and the pitch in the time window is derived from that shift. Specifically, the pitch in the time window is derived as pitch = FS · p / LW.
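
The pitch search described above can be sketched in Python as follows. Because equations (3) and (4) are not reproduced here, the sketch assumes that CORX(p) is the sum over k of Abs(X(k)) · Abs(X(k + p)) and that wf(p) is an unnormalised Gaussian centred on the bin of the model pitch with standard deviation σ measured in bins. All function names and the example spectrum are hypothetical.

```python
import math


def spectral_autocorrelation(mags, max_shift):
    """Assumed form of CORX(p): magnitude spectrum correlated with itself shifted by p."""
    return [
        sum(mags[k] * mags[k + p] for k in range(len(mags) - p))
        for p in range(max_shift + 1)
    ]


def gaussian_weight(p, model_bin, sigma_bins):
    """Assumed form of wf(p): largest at the bin of the model pitch."""
    return math.exp(-((p - model_bin) ** 2) / (2.0 * sigma_bins ** 2))


def estimate_pitch(mags, fs, lw, model_hz, sigma_bins):
    """Return FS * p / LW for the shift p that maximises CORX(p) * wf(p)."""
    model_bin = model_hz * lw / fs
    cor = spectral_autocorrelation(mags, len(mags) // 2)
    # Shift p = 0 is skipped because it only measures total energy.
    scores = [
        cor[p] * gaussian_weight(p, model_bin, sigma_bins)
        for p in range(1, len(cor))
    ]
    best_p = 1 + max(range(len(scores)), key=scores.__getitem__)
    return fs * best_p / lw


# Hypothetical magnitude spectrum with harmonics every 4 bins (100 Hz per bin).
mags = [0.0] * 32
for harmonic_bin in (4, 8, 12, 16):
    mags[harmonic_bin] = 1.0
pitch_near_400 = estimate_pitch(mags, fs=3200, lw=32, model_hz=400.0, sigma_bins=2.0)
pitch_near_800 = estimate_pitch(mags, fs=3200, lw=32, model_hz=800.0, sigma_bins=2.0)
```

With the same spectrum, moving the model pitch from 400 Hz to 800 Hz moves the selected shift from bin 4 to bin 8, which illustrates how the weighting biases the estimate toward the pitch expected from the melody.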

In S210 of this embodiment, this series of steps is executed over the entire duration of the speech waveform data SV while sliding the time window, and the derived pitches are arranged along the time axis to obtain the pitch time change shown in FIG. 6.

In S210 of this embodiment, however, a time window may be judged to contain no singing voice when the final calculation result has no prominent peak. In this case, the condition for judging that no singing voice is contained may be that the ratio "peak level of the final calculation result / average level of the final calculation result" is equal to or less than a predetermined threshold.
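
The peak-to-average test described above can be written compactly. In this Python sketch, `scores` stands for the final calculation result of one time window, and the function name `contains_singing` and the threshold value are hypothetical.

```python
def contains_singing(scores, threshold):
    """True when peak level / average level exceeds the threshold."""
    mean_level = sum(scores) / len(scores)
    if mean_level <= 0.0:
        # An all-zero result has no peak, so treat it as containing no singing.
        return False
    return max(scores) / mean_level > threshold


peaked = contains_singing([0.1, 0.1, 5.0, 0.1], threshold=3.0)
flat = contains_singing([1.0, 1.0, 1.0, 1.0], threshold=3.0)
```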

Next, in the utterance section estimation process, the frequencies of the pitch obtained in S210 and of its harmonic components (hereinafter referred to as the singing voice component) are extracted for each time window, and the speech waveform composed of the extracted frequencies (hereinafter referred to as the singing voice waveform) is extracted (S220). Specifically, in S220 of this embodiment, the frequencies of the singing voice component are extracted by applying a well-known technique that uses a comb filter such as that shown in FIG. 7 to the speech waveform data SV. In S220 of this embodiment, the extraction of the pitch and harmonic frequencies may be skipped for time windows judged in S210 to contain no singing voice.

Alternatively, in S220, the frequencies of the singing voice component may be extracted from the result of an FFT of the speech waveform data SV.
Then, the transition over time of the power of the singing voice waveform extracted in S220 (hereinafter referred to as the speech power transition) is derived (S230). Specifically, in S230 of this embodiment, the power is calculated for each time window i, the windows being defined so as to be contiguous along the time axis of the singing voice waveform, and the calculated powers are arranged along the time axis to derive the speech power transition shown in FIG. 8(A).

When the frequencies of the singing voice component are extracted with the comb filter, the power in time window i is derived by accumulating the squared amplitude of the singing voice waveform in that window along the time axis. When the frequencies of the singing voice component are extracted by FFT, the power in time window i is instead derived by accumulating, along the frequency axis, the squared amplitudes of the frequency components of the singing voice component extracted by the FFT.

Equation (5) below calculates the sum of the squared amplitude values of the frequency components of the singing voice component (that is, the power) pw. In equation (5), m is an index indicating the harmonic number, and p0 is an index representing the pitch obtained in S210.
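
Equation (5) is not reproduced in this text. Based on the definitions of m and p0 given above, one plausible form is the sum over m of the squared magnitude of X(m · p0), which the following Python sketch implements. The function name `harmonic_power` and the number of harmonics are assumptions for illustration.

```python
def harmonic_power(mags, p0, num_harmonics):
    """Sum of squared magnitudes at the bins m * p0 for m = 1 .. num_harmonics."""
    total = 0.0
    for m in range(1, num_harmonics + 1):
        k = m * p0
        if k >= len(mags):
            break  # harmonic lies beyond the available spectrum
        total += mags[k] ** 2
    return total


# Hypothetical spectrum: fundamental at bin 3, second harmonic at bin 6,
# and an unrelated component at bin 7 that must not be counted.
mags = [0.0] * 20
mags[3], mags[6], mags[7] = 2.0, 1.0, 5.0
pw = harmonic_power(mags, p0=3, num_harmonics=4)
```

Only the bins at integer multiples of p0 contribute, so energy that is not harmonically related to the detected pitch, such as accompaniment, is left out of the power.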

Next, based on the speech power transition derived in S230, each utterance section in the speech waveform data SV acquired in S130 is identified (S240). Specifically, in S240 of this embodiment, utterance start times vs and utterance end times ve are identified from the times at which the speech power transition changes along the time axis, and each section defined by a pair of an utterance start time vs and the utterance end time ve that immediately follows it is identified as an utterance section.

In this embodiment, as shown in FIG. 8(B), the utterance start times vs and utterance end times ve may be identified by following the speech power transition over time and taking each timing at which the power pw becomes equal to or greater than a predetermined threshold as an utterance start time vs, and each timing at which the power pw becomes equal to or less than the threshold as an utterance end time ve.
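
The threshold-based method can be sketched in Python as a single pass over the power values of the time windows. One detail here is an assumption rather than part of the description: the end of a section is detected with a strict "less than" comparison, so that a window whose power is exactly equal to the threshold is not treated as both a start and an end. The function name `segments_by_threshold` is hypothetical.

```python
def segments_by_threshold(power, threshold):
    """Return (vs, ve) index pairs; ve is the first window back below the threshold."""
    segments = []
    start = None
    for i, pw in enumerate(power):
        if start is None and pw >= threshold:
            start = i                      # utterance start time vs
        elif start is not None and pw < threshold:
            segments.append((start, i))    # utterance end time ve
            start = None
    if start is not None:
        # The recording ended while the power was still above the threshold.
        segments.append((start, len(power)))
    return segments


sections = segments_by_threshold([0, 0, 3, 4, 1, 0, 5, 5], threshold=2)
```

Multiplying each index by the hop between time windows and dividing by the sampling frequency would convert these indices into times.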

Alternatively, the utterance start times vs and utterance end times ve may be identified by deriving the time derivative dpw(i) of the speech power transition according to equation (6) below and, as shown in FIG. 9, taking each timing at which the time derivative dpw(i) reaches a local maximum as an utterance start time vs and each timing at which it reaches a local minimum as an utterance end time ve.
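
The derivative-based method can be sketched in the same way. Equation (6) is not reproduced in this text, so the Python sketch below assumes a simple backward difference dpw(i) = pw(i) - pw(i - 1). It also assumes that only positive local maxima and negative local minima are of interest, which is a choice made here to ignore flat regions and is not stated in the description. The function names are hypothetical.

```python
def power_derivative(power):
    """Assumed form of dpw(i): backward difference, with dpw(0) set to 0."""
    return [0.0] + [float(power[i] - power[i - 1]) for i in range(1, len(power))]


def onsets_and_offsets(dpw):
    """Indices of positive local maxima (vs) and negative local minima (ve)."""
    starts, ends = [], []
    for i in range(1, len(dpw) - 1):
        if dpw[i] > 0 and dpw[i] > dpw[i - 1] and dpw[i] >= dpw[i + 1]:
            starts.append(i)
        if dpw[i] < 0 and dpw[i] < dpw[i - 1] and dpw[i] <= dpw[i + 1]:
            ends.append(i)
    return starts, ends


power = [0, 0, 1, 5, 6, 6, 5, 1, 0, 0]
dpw = power_derivative(power)
vs_indices, ve_indices = onsets_and_offsets(dpw)
```

Compared with a fixed threshold, this variant reacts to how quickly the power changes and is therefore less sensitive to the overall recording level.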

In the utterance section estimation process, well-known morphological analysis is then performed on the lyrics represented by the lyrics telop data DT in the lyrics data group DL acquired in S120, and the result of the morphological analysis is converted into readings by referring to a dictionary prepared in advance (S250). That is, in S250 of this embodiment, the lyrics of the target music are converted so as to be expressed in units of syllables (phonemes), and the content of each syllable is identified.

Then, for each syllable of the lyrics converted in S250, a note-syllable pair is generated that associates the output sound corresponding to the syllable with the content of the syllable (S260). Specifically, in S260 of this embodiment, among the performance start timings of the output sounds that make up the singing melody in the music MIDI data DM acquired in S110, the one with the smallest time difference from the lyrics output timing of the syllable, as represented by the lyrics output data DO, is identified, and the output sound having that performance start timing is associated with the content of the syllable.
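
The pairing in S260 is a nearest-neighbour search in time. The following Python sketch assumes that each syllable is given as a (text, lyrics output timing) tuple and each melody note as a (note number, performance start timing) tuple, both in seconds. These data layouts and the function name `pair_syllables_with_notes` are hypothetical.

```python
def pair_syllables_with_notes(syllables, notes):
    """For each syllable, pick the note whose onset is closest to its lyric timing."""
    pairs = []
    for text, lyric_time in syllables:
        note_number, onset = min(notes, key=lambda n: abs(n[1] - lyric_time))
        pairs.append({"syllable": text, "note": note_number, "onset": onset})
    return pairs


syllables = [("ka", 1.02), ("ra", 1.48)]
notes = [(60, 1.0), (62, 1.5), (64, 2.0)]
note_syllable_pairs = pair_syllables_with_notes(syllables, notes)
```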

Next, the time deviation between the speech waveform data SV acquired in S130 and the music MIDI data DM acquired in S110 is corrected, and syllable data is generated by associating each utterance section in the speech waveform data SV with the note-syllable pair corresponding to that utterance section (S270).

Specifically, in S270 of this embodiment, the average of two deviations is calculated: the deviation between the first lyrics output timing along the time axis of the lyrics represented by the lyrics output data DO and the first utterance section along the time axis in the speech waveform data SV, and the deviation between the last lyrics output timing and the last utterance section. Then, for each utterance section, the note-syllable pair whose performance start timing has the smallest time difference from the utterance start time vs, adjusted by the calculated average, is identified, and syllable data that associates at least the utterance section with that note-syllable pair is generated.
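
The offset correction in S270 can be sketched as follows. Two points are assumptions made for this Python example: the deviation is measured from the start time of the utterance section, and the average is subtracted from vs before the nearest performance start timing is searched for. The data layouts and function names are hypothetical.

```python
def mean_offset(lyric_times, sections):
    """Average of the first and last deviations between utterances and lyric timings."""
    first = sections[0][0] - lyric_times[0]
    last = sections[-1][0] - lyric_times[-1]
    return (first + last) / 2.0


def build_syllable_data(sections, pairs, offset):
    """Attach to each (vs, ve) section the pair whose onset is nearest to vs - offset."""
    data = []
    for vs, ve in sections:
        corrected = vs - offset
        best = min(pairs, key=lambda p: abs(p["onset"] - corrected))
        data.append({"vs": vs, "ve": ve,
                     "syllable": best["syllable"], "note": best["note"]})
    return data


lyric_times = [1.0, 2.0]
sections = [(1.2, 1.6), (2.4, 2.8)]          # singer is consistently late
pairs = [{"syllable": "ka", "note": 60, "onset": 1.0},
         {"syllable": "ra", "note": 62, "onset": 2.0}]
offset = mean_offset(lyric_times, sections)
syllable_data = build_syllable_data(sections, pairs, offset)
```

Using a single averaged offset keeps the correction simple, at the cost of not tracking a deviation that drifts during the song.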

Furthermore, in the pitch time change calculated in S210, the pitch in the section corresponding to each utterance section is identified as the utterance pitch, and each identified utterance pitch is associated with the syllable data having the corresponding utterance section (S280). The utterance pitch associated in S280 may be expressed as a note number in the MIDI standard or as a note of the musical scale.
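
Where the utterance pitch is stored as a MIDI note number, the usual equal-temperament conversion (note 69 = A4 = 440 Hz) can be used. The description does not say how the pitches within one utterance section are reduced to a single value; the Python sketch below takes the median of the voiced windows, which is an assumption, as are the function names.

```python
import math
import statistics


def hz_to_midi_note(frequency_hz):
    """Nearest MIDI note number for a frequency, with A4 (440 Hz) as note 69."""
    return int(round(69 + 12 * math.log2(frequency_hz / 440.0)))


def section_note(pitch_track, start, end):
    """Median voiced pitch in windows [start, end), as a MIDI note number."""
    voiced = [f for f in pitch_track[start:end] if f > 0]
    if not voiced:
        return None  # no singing voice detected in this section
    return hz_to_midi_note(statistics.median(voiced))


# Hypothetical pitch track in Hz; 0 marks windows judged to contain no singing.
utterance_note = section_note([0, 438.0, 440.0, 445.0, 0], start=0, end=5)
```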

Then, the power at each time in the speech power transition derived in S230 is associated with the syllable data having the utterance section corresponding to that time (S290). The power associated in S290 may be expressed as a velocity in the MIDI standard or as a dynamics marking written on a musical staff (for example, piano or forte).

Thereafter, the utterance section estimation process ends, and processing proceeds to S150 of the speech parameter registration process.
<Configuration of the voice output terminal>
Next, the voice output terminal will be described (see FIG. 1).

The voice output terminal 60 includes an information receiving unit 61, a display unit 62, a sound output unit 63, a communication unit 64, a storage unit 65, and a control unit 67. The voice output terminal 60 of this embodiment may be, for example, a known portable terminal (a mobile phone or a portable information terminal) or a known information processing device (a so-called personal computer).

Among these, the information receiving unit 61 receives information input via an input device (not shown). The display unit 62 displays images based on commands from the control unit 67. The sound output unit 63 is a known device that outputs sound and includes, for example, a PCM sound source and a speaker.

The communication unit 64 allows the voice output terminal 60 to exchange information with external devices via a communication network (for example, a public wireless communication network or a network line). The storage unit 65 is a non-volatile storage device (for example, a hard disk device or a flash memory) whose contents can be read and written, and stores various processing programs and various data.

The control unit 67 is built around a known computer having at least a ROM, a RAM, and a CPU.
<Speech synthesis process>
Next, the speech synthesis process executed by the control unit 67 of the voice output terminal 60 will be described.

This speech synthesis process is started when a start command is input via the information receiving unit 61 of the voice output terminal 60.
As shown in FIG. 10, when the speech synthesis process is started, it first acquires the information input via the information receiving unit 61 (hereinafter referred to as input information) (S510). The input information acquired in S510 includes, for example, output text representing the content (wording) of the speech to be output as synthesized sound, and output property information representing the properties of the sound to be output as synthesized sound. The properties of the sound (that is, the output property information) include characteristics of the speaker's voice, such as the speaker's gender and age.

Next, the speech parameters PM that correspond to each syllable needed to output the output text acquired in S510 as synthesized sound, and that are associated with the information most similar to the output property information acquired in S510, are extracted from the data storage server 50 (S520).
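
The description does not define how similarity to the output property information is measured in S520. Purely as an illustration, the Python sketch below treats a gender mismatch as a large fixed penalty and otherwise ranks candidates by age difference. The record layout, the penalty value, and the function name `select_parameter` are all hypothetical.

```python
def select_parameter(records, syllable, want_gender, want_age):
    """Pick the stored record for a syllable whose speaker features are closest."""
    candidates = [r for r in records if r["syllable"] == syllable]
    if not candidates:
        return None

    def distance(record):
        # Hypothetical similarity measure: gender mismatch outweighs any age gap.
        gender_penalty = 0 if record["gender"] == want_gender else 100
        return gender_penalty + abs(record["age"] - want_age)

    return min(candidates, key=distance)


records = [
    {"syllable": "ka", "gender": "f", "age": 25, "pm": "PM-1"},
    {"syllable": "ka", "gender": "m", "age": 31, "pm": "PM-2"},
    {"syllable": "ka", "gender": "f", "age": 40, "pm": "PM-3"},
]
chosen = select_parameter(records, "ka", want_gender="f", want_age=30)
```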

Then, the speech parameters PM acquired in S520 are set so that synthesized sound is output with the content of the output text acquired in S510 (S530). Speech is then synthesized based on the speech parameters PM set in S530 (S540). For the speech synthesis in S540, a well-known speech synthesis method such as formant synthesis may be used, in addition to the method of Patent Document 1.

Furthermore, the synthesized sound generated by the speech synthesis in S540 is output from the sound output unit 63 (S550).
Thereafter, the speech synthesis process ends.
[Effects of the embodiment]
As described above, the utterance section estimation process of this embodiment can automatically identify the utterance start times and utterance end times, and hence the utterance sections, based on the timings at which the speech power changes as the uttered speech waveform progresses along the time axis.

More specifically, the utterance section estimation process of this embodiment can automatically identify the utterance sections by following the speech power transition over time and taking each timing at which the power becomes equal to or greater than a predetermined threshold as an utterance start time and each timing at which the power becomes equal to or less than the threshold as an utterance end time. The utterance section estimation process of this embodiment can also automatically identify the utterance sections by differentiating the speech power transition with respect to time and taking each timing of a local maximum as an utterance start time and each timing of a local minimum as an utterance end time.

As a result, unlike the device described in Patent Document 1, the information processing device of this embodiment does not require its user to specify the utterance start times and utterance end times, and can divide each item of a large volume of speech waveform data into utterance sections.

Moreover, the utterance section estimation process of this embodiment generates syllable data by associating each identified utterance section with the note-syllable pair corresponding to that utterance section. It further adds to each item of syllable data the pitch and power in the utterance section contained in that syllable data.

As a result, when the speech parameter registration process of this embodiment generates speech parameters from the speech waveform data in the utterance section contained in the syllable data, highly reliable information on that utterance section, such as the uttered content, pitch, and power, can be added to the speech parameters.

In other words, the speech parameter registration process of this embodiment can add a variety of information to the speech parameters needed for speech synthesis, which in turn makes it easier for the voice output terminal, when executing the speech synthesis process, to produce the speech desired by the person using the speech synthesis.

In the utterance section estimation process of this embodiment, the pitch corresponding to the frequency that maximizes the product of the autocorrelation value of the frequency spectrum calculated for each unit time and the weight function is identified as the utterance pitch. The utterance section estimation process of this embodiment can therefore identify the utterance pitch as a pitch close to the pitch that should be sung in the target music.
[Other embodiments]
An embodiment of the present invention has been described above, but the present invention is not limited to that embodiment and can be implemented in various forms without departing from the gist of the present invention.

For example, in the above embodiment the weight function wf is defined by equation (4), but the weight function wf is not limited to this. The weight function wf may be set to take into account the case where the user of the voice input device 10 sings one octave above the singing melody of the target music, or the case where the user sings one octave below it.

In this case, the weight function wf(p) may be defined by equation (7) below for the former case and by equation (8) below for the latter case.
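Equations (4), (7), and (8) are not reproduced in this text, so the following is only a minimal sketch of the idea under stated assumptions: the weight is taken to be a Gaussian over the distance in semitones from the score pitch, and the octave variants simply move its centre by 12 semitones. The names `weight`, `pick_pitch`, `octave_shift`, and `sigma_semitones`, and the use of MIDI note numbers, are hypothetical and chosen for illustration.

```python
import math

def weight(freq_hz, score_note, octave_shift=0, sigma_semitones=2.0):
    """Hypothetical weight function: a Gaussian centred on the score pitch
    (a MIDI note number), optionally shifted by whole octaves."""
    target = score_note + 12 * octave_shift
    # Position of freq_hz on the MIDI semitone scale (A4 = 69 = 440 Hz).
    semitone = 69 + 12 * math.log2(freq_hz / 440.0)
    return math.exp(-((semitone - target) ** 2) / (2 * sigma_semitones ** 2))

def pick_pitch(candidates, score_note, octave_shift=0):
    """candidates: list of (freq_hz, autocorrelation_value) pairs.
    Returns the frequency that maximizes autocorrelation * weight."""
    best = max(candidates,
               key=lambda c: c[1] * weight(c[0], score_note, octave_shift))
    return best[0]
```

With `octave_shift=1` the same candidates favour the frequency one octave above the score pitch, and with `octave_shift=-1` the frequency one octave below it.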

In S230 of the utterance section estimation process of the above embodiment, the sum of squares of the amplitude values of the frequency components of the singing voice component (that is, the power pw) is calculated according to equation (5), but the method of calculating the power pw is not limited to this. For example, the power pw may be calculated taking into account the spread of the singing voice component over the spectrum, by including components in the vicinity of the fundamental and of its harmonics. Specifically, as shown in equation (9) below, a filter FP(k) may be expressed as a Gaussian mixture distribution, and the power pw may be calculated by multiplying the frequency spectrum (level) X(k) by the weight given by the filter FP, squaring the result, and summing over the frequency index k according to equation (10) below.
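Equations (9) and (10) are not reproduced in this text. The sketch below shows one way such a calculation could look, assuming an unnormalized mixture of equal-width Gaussians centred on the fundamental bin and its integer multiples. The names `harmonic_filter`, `filtered_power`, `f0_bin`, `n_harmonics`, and `sigma` are hypothetical, and the actual mixture weights and widths used in the patent may differ.

```python
import math

def harmonic_filter(k, f0_bin, n_harmonics=3, sigma=1.0):
    """Hypothetical FP(k): a sum of Gaussians centred on the bins
    f0_bin, 2 * f0_bin, ..., n_harmonics * f0_bin."""
    return sum(math.exp(-((k - h * f0_bin) ** 2) / (2 * sigma ** 2))
               for h in range(1, n_harmonics + 1))

def filtered_power(spectrum, f0_bin, n_harmonics=3, sigma=1.0):
    """Power pw as the sum over the frequency index k of (FP(k) * X(k)) ** 2,
    where spectrum[k] holds the level X(k)."""
    return sum((harmonic_filter(k, f0_bin, n_harmonics, sigma) * x) ** 2
               for k, x in enumerate(spectrum))
```

Energy that sits on or near a harmonic contributes almost fully to the result, while energy between harmonics is strongly attenuated.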

In the above embodiment, the speech waveform data SV is generated from the voice that is input while the voice input device 10 executes the karaoke performance process and plays the target music, but the speech waveform data SV of the present invention is not limited to this.

That is, in the present invention, the voice input device 10 may generate the speech waveform data SV using the after-recording function that is well known in karaoke devices and the like. A voice input device (karaoke device) with an after-recording function holds, as data on the dialogue to be uttered, dialogue telop data representing the characters that make up the dialogue (hereinafter referred to as dialogue constituent characters), which is similar to the lyrics telop data, and dialogue output data defining the timing at which the dialogue constituent characters are displayed on the display unit 13, which is similar to the lyrics output data. Accordingly, when acquiring the speech waveform data SV using the after-recording function, the voice input device 10 may display the dialogue based on the dialogue telop data on the display unit 13, and may store the speech waveform that is input via the voice input unit 14 while that dialogue is displayed on the display unit 13 in the MIDI storage server 25 as the speech waveform data SV.

In this case, the information processing device 30 may use the speech waveform data SV generated with the after-recording function as the target of the speech parameter registration process.

In the above embodiment, a karaoke device is assumed as the voice input device 10, but the voice input device 10 is not limited to a karaoke device. For example, a well-known portable terminal (a mobile phone or a personal digital assistant) or a well-known information processing device (a so-called personal computer) may be assumed.

The speech synthesis system of the above embodiment includes the MIDI storage server 25, but in the present invention the MIDI storage server 25 need not be provided. In that case, the music data MD and the speech waveform data SV may be stored in the storage unit 17 of the voice input device 10, in the data storage server 50, or in the storage unit 34 of the information processing device 30.

Similarly, the speech synthesis system of the above embodiment includes the data storage server 50, but in the present invention the data storage server 50 need not be provided. In that case, the speech parameter PM and the expression table TD may be stored in the storage unit 34 of the information processing device 30, in the storage unit 17 of the voice input device 10, or in the MIDI storage server 25.

[Correspondence between Embodiment and Claims]
Finally, the relationship between the description of the above embodiment and the recitations of the claims is explained.

In the speech parameter registration process of the above embodiment, S120 corresponds to the content information acquisition means and the timing information acquisition means recited in the claims, S110 corresponds to the musical score data acquisition means, and S130 corresponds to the voice data acquisition means. Further, in the utterance section estimation process of the above embodiment, S230 corresponds to the power transition deriving means recited in the claims, S240 corresponds to the utterance section specifying means, S250 and S260 correspond to the note lyrics association means, and S270 to S290 correspond to the note singing voice integration means.

In addition, in the utterance section estimation process of the above embodiment, S210 corresponds to the pitch transition deriving means recited in the claims, and S280 corresponds to the pitch specifying means. S160 in the speech parameter registration process of the above embodiment corresponds to the parameter deriving means recited in the claims.

DESCRIPTION OF SYMBOLS
1: speech synthesis system; 10: voice input device; 11: communication unit; 12: input reception unit; 13: display unit; 14: voice input unit; 15: voice output unit; 16: sound source module; 17: storage unit; 20: control unit; 21: ROM; 22: RAM; 23: CPU; 25: MIDI storage server; 30: information processing device; 31: communication unit; 32: input reception unit; 33: display unit; 34: storage unit; 40: control unit; 41: ROM; 42: RAM; 43: CPU; 50: data storage server; 60: voice output terminal

Claims

An utterance section specifying device comprising:
content information acquisition means for acquiring utterance content information representing a character string of content to be uttered in a target music, which is one piece of music;
timing information acquisition means for acquiring utterance timing information defining the utterance start timing of the characters represented by specific content information, which is the utterance content information acquired by the content information acquisition means;
musical score data acquisition means for acquiring musical score data that represents at least the musical score of the singing melody in the target music and defines at least a pitch and a performance start timing for each output sound constituting the singing melody;
voice data acquisition means for acquiring voice data representing a voice waveform uttered for the character string represented by the specific content information;
power transition deriving means for deriving, based on the voice data acquired by the voice data acquisition means, a voice power transition representing the time transition of power calculated for each unit time defined so as to be continuous along the time axis in the voice data;
utterance section specifying means for specifying, as utterance start times and utterance end times, the times at which the voice power transition derived by the power transition deriving means changes along the time axis, and for specifying each utterance section defined by a pair of an utterance start time and an utterance end time that are consecutive in that order;
note lyrics association means for identifying, for each syllable of the character string represented by the specific content information, based on the specific content information acquired by the content information acquisition means, the utterance timing information acquired by the timing information acquisition means, and the musical score data acquired by the musical score data acquisition means, the output sound whose performance start timing has the smallest time difference from the utterance start timing of the character corresponding to that syllable, and for generating note syllable sets each associating that output sound with the content of that syllable; and
note singing voice integration means for identifying, for each utterance section specified by the utterance section specifying means, the note syllable set, among those generated by the note lyrics association means, whose performance start timing has the smallest time difference from the utterance start time defining that utterance section, and for generating syllable data associating at least that utterance section with that note syllable set,
wherein the utterance section specifying means specifies each utterance section by taking, in the time progression of the voice power transition, each timing at which the power becomes equal to or higher than a predetermined threshold value as an utterance start time and each timing at which the power becomes equal to or lower than the threshold value as an utterance end time.
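As an illustration only, the threshold-based specification recited in this claim can be sketched as follows. The function name `sections_by_threshold`, the representation of the voice power transition as a list indexed by unit-time frame, and the choice to discard a section that is still open when the data ends are assumptions made for the sketch and are not taken from the claims.

```python
def sections_by_threshold(power, threshold):
    """power: per-frame power values of the voice power transition.
    Returns a list of (start_frame, end_frame) utterance sections."""
    sections = []
    start = None  # frame index of the currently open section, if any
    for i, p in enumerate(power):
        if start is None and p >= threshold:
            start = i                      # power reached the threshold: start
        elif start is not None and p <= threshold:
            sections.append((start, i))    # power fell back to the threshold: end
            start = None
    return sections  # a section still open at the end is dropped (assumption)
```

A call such as `sections_by_threshold([0, 0, 5, 6, 1, 0, 7, 8, 8, 0], 3)` yields two start and end pairs, one for each burst of power.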

An utterance section specifying device comprising:
content information acquisition means for acquiring utterance content information representing a character string of content to be uttered in a target music, which is one piece of music;
timing information acquisition means for acquiring utterance timing information defining the utterance start timing of the characters represented by specific content information, which is the utterance content information acquired by the content information acquisition means;
musical score data acquisition means for acquiring musical score data that represents at least the musical score of the singing melody in the target music and defines at least a pitch and a performance start timing for each output sound constituting the singing melody;
voice data acquisition means for acquiring voice data representing a voice waveform uttered for the character string represented by the specific content information;
power transition deriving means for deriving, based on the voice data acquired by the voice data acquisition means, a voice power transition representing the time transition of power calculated for each unit time defined so as to be continuous along the time axis in the voice data;
utterance section specifying means for specifying, as utterance start times and utterance end times, the times at which the voice power transition derived by the power transition deriving means changes along the time axis, and for specifying each utterance section defined by a pair of an utterance start time and an utterance end time that are consecutive in that order;
note lyrics association means for identifying, for each syllable of the character string represented by the specific content information, based on the specific content information acquired by the content information acquisition means, the utterance timing information acquired by the timing information acquisition means, and the musical score data acquired by the musical score data acquisition means, the output sound whose performance start timing has the smallest time difference from the utterance start timing of the character corresponding to that syllable, and for generating note syllable sets each associating that output sound with the content of that syllable; and
note singing voice integration means for identifying, for each utterance section specified by the utterance section specifying means, the note syllable set, among those generated by the note lyrics association means, whose performance start timing has the smallest time difference from the utterance start time defining that utterance section, and for generating syllable data associating at least that utterance section with that note syllable set,
wherein the utterance section specifying means specifies each utterance section by taking, in the result of differentiating the voice power transition with respect to time, each timing of a local maximum as an utterance start time and each timing of a local minimum as an utterance end time.
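As an illustration only, the differentiation-based specification recited in this claim can be sketched as follows. The discrete frame-to-frame difference used in place of a time derivative, the restriction to positive peaks and negative troughs, the pairing rule, and the function name `sections_by_derivative` are assumptions made for the sketch and are not taken from the claims.

```python
def sections_by_derivative(power):
    """power: per-frame power values of the voice power transition.
    Returns (start, end) index pairs into the frame-to-frame differences."""
    # Discrete time derivative: d[i] is the change from frame i to frame i + 1.
    d = [b - a for a, b in zip(power, power[1:])]
    inner = range(1, len(d) - 1)
    # Assumption: only positive peaks count as starts and negative troughs as ends.
    starts = [i for i in inner if d[i] > 0 and d[i - 1] < d[i] >= d[i + 1]]
    ends = [i for i in inner if d[i] < 0 and d[i - 1] > d[i] <= d[i + 1]]
    sections = []
    for s in starts:
        if sections and s <= sections[-1][1]:
            continue  # this peak lies inside the section found previously
        end = next((t for t in ends if t > s), None)
        if end is None:
            break  # no trough follows this peak: stop pairing
        sections.append((s, end))
    return sections
```

Compared with a fixed threshold, this variant reacts to the steepest rise and the steepest fall of the power, so it does not depend on the absolute input level.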

The utterance section specifying device according to claim 1 or claim 2, further comprising:
pitch transition deriving means for deriving, based at least on the voice data acquired by the voice data acquisition means, a pitch time transition representing how the pitch in the voice waveform changes along the time axis; and
pitch specifying means for specifying, as an utterance pitch, the pitch in each utterance section specified by the utterance section specifying means within the pitch time transition derived by the pitch transition deriving means,
wherein the note singing voice integration means generates the syllable data by associating each utterance pitch specified by the pitch specifying means, the utterance section corresponding to that utterance pitch, and the note syllable set corresponding to that utterance section.

The utterance section specifying device according to claim 3, wherein the pitch specifying means specifies, as the utterance pitch, the pitch corresponding to the frequency that maximizes the result of multiplying the autocorrelation value of the frequency spectrum, calculated for each unit time defined so as to be continuous along the time axis in the voice data, by a weight function whose weights are defined along the frequency axis so that the weight is larger at the frequency corresponding to the pitch of the output sound that, among the output sounds represented by the musical score data, corresponds to that unit time.

The utterance section specifying device according to claim 3, wherein the pitch specifying means specifies, as the utterance pitch, the pitch corresponding to the frequency at which the autocorrelation value of the frequency spectrum, calculated for each unit time defined so as to be continuous along the time axis in the voice data, is maximized.

The utterance section specifying device according to claim 1 or claim 2, wherein the note singing voice integration means generates the syllable data by associating the power at each time in the voice power transition derived by the power transition deriving means, the utterance section corresponding to that time, and the note syllable set corresponding to that utterance section.

A speech parameter generation device comprising:
content information acquisition means for acquiring utterance content information representing a character string of content to be uttered in a target music, which is one piece of music;
timing information acquisition means for acquiring utterance timing information defining the utterance start timing of the characters represented by specific content information, which is the utterance content information acquired by the content information acquisition means;
musical score data acquisition means for acquiring musical score data that represents at least the musical score of the singing melody in the target music and defines at least a pitch and a performance start timing for each output sound constituting the singing melody;
voice data acquisition means for acquiring voice data representing a voice waveform uttered for the character string represented by the specific content information;
power transition deriving means for deriving, based on the voice data acquired by the voice data acquisition means, a voice power transition representing the time transition of power calculated for each unit time defined so as to be continuous along the time axis in the voice data;
utterance section specifying means for specifying, as utterance start times and utterance end times, the times at which the voice power transition derived by the power transition deriving means changes along the time axis, and for specifying each utterance section defined by a pair of an utterance start time and an utterance end time that are consecutive in that order;
note lyrics association means for identifying, for each syllable of the character string represented by the specific content information, based on the specific content information acquired by the content information acquisition means, the utterance timing information acquired by the timing information acquisition means, and the musical score data acquired by the musical score data acquisition means, the output sound whose performance start timing has the smallest time difference from the utterance start timing of the character corresponding to that syllable, and for generating note syllable sets each associating that output sound with the content of that syllable;
note singing voice integration means for identifying, for each utterance section specified by the utterance section specifying means, the note syllable set, among those generated by the note lyrics association means, whose performance start timing has the smallest time difference from the utterance start time defining that utterance section, and for generating syllable data associating at least that utterance section with that note syllable set; and
parameter deriving means for deriving a speech parameter, which is at least one predefined feature quantity, from the voice data in the utterance section in the syllable data generated by the note singing voice integration means,
wherein the utterance section specifying means specifies each utterance section by taking, in the time progression of the voice power transition, each timing at which the power becomes equal to or higher than a predetermined threshold value as an utterance start time and each timing at which the power becomes equal to or lower than the threshold value as an utterance end time.

A speech parameter generation device comprising:
content information acquisition means for acquiring utterance content information representing a character string of content to be uttered in a target music, which is one piece of music;
timing information acquisition means for acquiring utterance timing information defining the utterance start timing of the characters represented by specific content information, which is the utterance content information acquired by the content information acquisition means;
musical score data acquisition means for acquiring musical score data that represents at least the musical score of the singing melody in the target music and defines at least a pitch and a performance start timing for each output sound constituting the singing melody;
voice data acquisition means for acquiring voice data representing a voice waveform uttered for the character string represented by the specific content information;
power transition deriving means for deriving, based on the voice data acquired by the voice data acquisition means, a voice power transition representing the time transition of power calculated for each unit time defined so as to be continuous along the time axis in the voice data;
utterance section specifying means for specifying, as utterance start times and utterance end times, the times at which the voice power transition derived by the power transition deriving means changes along the time axis, and for specifying each utterance section defined by a pair of an utterance start time and an utterance end time that are consecutive in that order;
note lyrics association means for identifying, for each syllable of the character string represented by the specific content information, based on the specific content information acquired by the content information acquisition means, the utterance timing information acquired by the timing information acquisition means, and the musical score data acquired by the musical score data acquisition means, the output sound whose performance start timing has the smallest time difference from the utterance start timing of the character corresponding to that syllable, and for generating note syllable sets each associating that output sound with the content of that syllable;
note singing voice integration means for identifying, for each utterance section specified by the utterance section specifying means, the note syllable set, among those generated by the note lyrics association means, whose performance start timing has the smallest time difference from the utterance start time defining that utterance section, and for generating syllable data associating at least that utterance section with that note syllable set; and
parameter deriving means for deriving a speech parameter, which is at least one predefined feature quantity, from the voice data in the utterance section in the syllable data generated by the note singing voice integration means,
wherein the utterance section specifying means specifies each utterance section by taking, in the result of differentiating the voice power transition with respect to time, each timing of a local maximum as an utterance start time and each timing of a local minimum as an utterance end time.

A program causing a computer to execute:
a content information acquisition procedure of acquiring utterance content information representing a character string of content to be uttered in a target music, which is one piece of music;
a timing information acquisition procedure of acquiring utterance timing information defining the utterance start timing of the characters represented by specific content information, which is the utterance content information acquired in the content information acquisition procedure;
a musical score data acquisition procedure of acquiring musical score data that represents at least the musical score of the singing melody in the target music and defines at least a pitch and a performance start timing for each output sound constituting the singing melody;
a voice data acquisition procedure of acquiring voice data representing a voice waveform uttered for the character string represented by the specific content information;
a power transition deriving procedure of deriving, based on the voice data acquired in the voice data acquisition procedure, a voice power transition representing the time transition of power calculated for each unit time defined so as to be continuous along the time axis in the voice data;
an utterance section specifying procedure of specifying, as utterance start times and utterance end times, the times at which the voice power transition derived in the power transition deriving procedure changes along the time axis, and of specifying each utterance section defined by a pair of an utterance start time and an utterance end time that are consecutive in that order;
a note lyrics association procedure of identifying, for each syllable of the character string represented by the specific content information, based on the specific content information acquired in the content information acquisition procedure, the utterance timing information acquired in the timing information acquisition procedure, and the musical score data acquired in the musical score data acquisition procedure, the output sound whose performance start timing has the smallest time difference from the utterance start timing of the character corresponding to that syllable, and of generating note syllable sets each associating that output sound with the content of that syllable; and
a note singing voice integration procedure of identifying, for each utterance section specified in the utterance section specifying procedure, the note syllable set, among those generated in the note lyrics association procedure, whose performance start timing has the smallest time difference from the utterance start time defining that utterance section, and of generating syllable data associating at least that utterance section with that note syllable set,
wherein, in the utterance section specifying procedure, each utterance section is specified by taking, in the time progression of the voice power transition, each timing at which the power becomes equal to or higher than a predetermined threshold value as an utterance start time and each timing at which the power becomes equal to or lower than the threshold value as an utterance end time.
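As an illustration only, the two nearest-timing associations recited in this claim (the note lyrics association and the note singing voice integration) can be sketched as follows. The tuple and dictionary layouts, the time values in seconds, and the function names `associate_notes` and `integrate` are assumptions made for the sketch and are not taken from the claims.

```python
def associate_notes(syllables, notes):
    """syllables: list of (text, utterance_start_timing).
    notes: list of (pitch, performance_start_timing).
    Pairs each syllable with the note whose start timing is nearest."""
    pairs = []
    for text, t in syllables:
        pitch, note_start = min(notes, key=lambda n: abs(n[1] - t))
        pairs.append({"syllable": text, "pitch": pitch, "note_start": note_start})
    return pairs

def integrate(sections, pairs):
    """sections: list of (utterance_start_time, utterance_end_time).
    Attaches to each section the note syllable set whose note start
    timing is nearest to the start of that section."""
    syllable_data = []
    for start, end in sections:
        pair = min(pairs, key=lambda p: abs(p["note_start"] - start))
        syllable_data.append({"section": (start, end), **pair})
    return syllable_data
```

Each resulting entry ties one utterance section in the recorded voice to one note and one syllable, which is the unit from which a speech parameter would subsequently be derived.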

A program causing a computer to execute:
a content information acquisition procedure of acquiring utterance content information representing a character string of content to be uttered in a target music, which is one piece of music;
a timing information acquisition procedure of acquiring utterance timing information defining the utterance start timing of the characters represented by specific content information, which is the utterance content information acquired in the content information acquisition procedure;
a musical score data acquisition procedure of acquiring musical score data that represents at least the musical score of the singing melody in the target music and defines at least a pitch and a performance start timing for each output sound constituting the singing melody;
a voice data acquisition procedure of acquiring voice data representing a voice waveform uttered for the character string represented by the specific content information;
a power transition deriving procedure of deriving, based on the voice data acquired in the voice data acquisition procedure, a voice power transition representing the time transition of power calculated for each unit time defined so as to be continuous along the time axis in the voice data;
an utterance section specifying procedure of specifying, as utterance start times and utterance end times, the times at which the voice power transition derived in the power transition deriving procedure changes along the time axis, and of specifying each utterance section defined by a pair of an utterance start time and an utterance end time that are consecutive in that order;
a note lyrics association procedure of identifying, for each syllable of the character string represented by the specific content information, based on the specific content information acquired in the content information acquisition procedure, the utterance timing information acquired in the timing information acquisition procedure, and the musical score data acquired in the musical score data acquisition procedure, the output sound whose performance start timing has the smallest time difference from the utterance start timing of the character corresponding to that syllable, and of generating note syllable sets each associating that output sound with the content of that syllable; and
a note singing voice integration procedure of identifying, for each utterance section specified in the utterance section specifying procedure, the note syllable set, among those generated in the note lyrics association procedure, whose performance start timing has the smallest time difference from the utterance start time defining that utterance section, and of generating syllable data associating at least that utterance section with that note syllable set,
wherein, in the utterance section specifying procedure, each utterance section is specified by taking, in the result of differentiating the voice power transition with respect to time, each timing of a local maximum as an utterance start time and each timing of a local minimum as an utterance end time.