JP5678912B2

JP5678912B2 - Voice identification device, program

Info

Publication number: JP5678912B2
Application number: JP2012054615A
Authority: JP
Inventors: 久美太田
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2012-03-12
Filing date: 2012-03-12
Publication date: 2015-03-04
Anticipated expiration: 2032-03-12
Also published as: JP2013190473A

Description

The present invention relates to an utterance specifying device and a program that identify, from a speech waveform, the utterance section in which each syllable was uttered.

Conventionally, utterance specifying techniques are known that automatically identify, in a speech waveform, the utterance section, that is, the section in which each syllable is uttered (see, for example, Patent Document 1).
In the technique described in Patent Document 1, a signal corresponding to the vocal (hereinafter, the vocal signal) is extracted from the music acoustic signal of a song containing a singing voice and accompaniment, and each unit section of the vocal signal is collated against acoustic models prepared in advance for each phoneme. The phoneme corresponding to the acoustic model that yields the highest likelihood is then identified as the phoneme (syllable) uttered in that unit section.

Further, in a typical utterance specifying technique, the utterance content is identified sequentially along the time axis of the speech waveform as described above, and the timing at which the identified syllable changes is taken as the timing at which the utterance section for one syllable switches (see, for example, Patent Document 2).

Patent Document 1: JP 2009-186687 A
Patent Document 2: JP 2009-139862 A

In the technique described in Patent Document 1, however, the accuracy with which the utterance content is identified depends on the acoustic models collated with the speech waveform.
Acoustic models are generally built to fit a large number of people, but how well they fit varies between individuals, so the fit is poor for some speakers.

When such an acoustic model is collated with a speech waveform uttered by a person whom the model fits poorly, the utterance specifying technique recognizes the utterance content less accurately and, consequently, recognizes the utterance section for each syllable less accurately.

An object of the present invention is therefore to provide a technique that improves the recognition accuracy of the utterance content for each syllable and of the utterance section for each syllable.

The present invention, made to achieve the above object, relates to an utterance specifying device.
In the utterance specifying device of the present invention, utterance information acquisition means acquires utterance content information representing the character string to be uttered together with the reference utterance start timing and reference utterance end timing of each character in the string. Based on the acquired utterance content information, model section specifying means identifies, for each syllable constituting the character string, a model utterance section, that is, the section in which that syllable should be uttered.

Weight setting means then sets a weight for each syllable and for each weighting section, which is the section corresponding to that syllable, such that the syllable corresponding to each model utterance section identified by the model section specifying means receives a larger value.

Meanwhile, voice data acquisition means acquires voice data representing the speech waveform uttered for the character string represented by the utterance content information, and section setting means sets a plurality of unit sections on the acquired voice data so that they are consecutive along its time axis.
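As an illustration only, setting consecutive unit sections on voice data can be sketched as follows. The function name `unit_sections`, the sample-index representation, and the fixed, non-overlapping section length are assumptions made for this sketch; the description does not state how long a unit section is or whether sections overlap.

```python
from typing import List, Tuple


def unit_sections(n_samples: int, section_len: int) -> List[Tuple[int, int]]:
    """Split [0, n_samples) into consecutive, non-overlapping unit sections.

    Each section is a half-open sample range (start, end). The last section
    may be shorter when n_samples is not a multiple of section_len.
    """
    if section_len <= 0:
        raise ValueError("section_len must be positive")
    return [
        (start, min(start + section_len, n_samples))
        for start in range(0, n_samples, section_len)
    ]
```

For example, ten samples with a section length of four give three sections, the last one covering only the remaining two samples.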

Phoneme specifying means then collates the feature quantity in one unit section of the voice data, as set by the section setting means, against each of the acoustic models prepared in advance as models of the feature quantity of each phoneme, and derives for each model an utterance likelihood to which the weight set by the weight setting means for the weighting section corresponding to that unit section has been applied. The phoneme specifying means identifies the phoneme corresponding to the acoustic model with the highest utterance likelihood as the phoneme uttered in that unit section (hereinafter, the uttered phoneme), and executes this content specifying process sequentially for each unit section along the time axis of the voice data.
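The content specifying process for a single unit section might be sketched as below. This is a minimal sketch, not the claimed implementation: it assumes the acoustic-model likelihoods have already been computed, that the weight is applied by multiplication, and that a phoneme outside any weighting section gets a neutral weight of 1.0. The names `WeightingSection`, `weight_for`, and `identify_phoneme` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class WeightingSection(NamedTuple):
    phoneme: str    # phoneme favored inside this section
    start_ms: int   # section start (inclusive)
    end_ms: int     # section end (exclusive)
    weight: float   # factor applied to the likelihood (assumed multiplicative)


def weight_for(phoneme: str, t_ms: int, sections: List[WeightingSection]) -> float:
    """Weight for `phoneme` at time t_ms; 1.0 when no weighting section applies."""
    for s in sections:
        if s.phoneme == phoneme and s.start_ms <= t_ms < s.end_ms:
            return s.weight
    return 1.0


def identify_phoneme(likelihoods: Dict[str, float], t_ms: int,
                     sections: List[WeightingSection]) -> str:
    """Return the phoneme whose weighted utterance likelihood is largest."""
    weighted = {
        ph: lk * weight_for(ph, t_ms, sections)
        for ph, lk in likelihoods.items()
    }
    return max(weighted, key=weighted.get)
```

With raw likelihoods of 0.40 for "a" and 0.45 for "o", a weight of 1.5 on "a" inside its weighting section makes "a" the uttered phoneme there, while outside the section the unweighted maximum "o" is chosen.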

Based on the uttered phonemes identified in sequence by the phoneme specifying means, timing specifying means identifies, each time it identifies an uttered syllable (a syllable uttered in the voice data), the utterance switching timing, that is, the timing at which the uttered syllable switches along the time axis.
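Finding the utterance switching timings can be sketched as a scan over the per-unit-section results. The sketch assumes, for simplicity, that each unit section has already been labeled with an uttered syllable and that every unit section has the same length in milliseconds; the name `utterance_switch_timings` is hypothetical.

```python
from typing import List, Tuple


def utterance_switch_timings(labels: List[str],
                             section_ms: int) -> List[Tuple[int, str, str]]:
    """Return (time_ms, previous_syllable, next_syllable) for every switch.

    labels[i] is the syllable identified for the i-th unit section; a switch
    is reported at the start time of the first section carrying the new label.
    """
    switches = []
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            switches.append((i * section_ms, labels[i - 1], labels[i]))
    return switches
```

In each returned tuple, the previous syllable plays the role of the syllable whose estimation is complete and the next syllable is the one whose estimation has just begun.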

Further, in the utterance specifying device of the present invention, the uttered syllable immediately before the identified utterance switching timing is treated as the estimation-completed syllable, and the syllable formed by the uttered phonemes immediately after that timing is treated as the syllable under estimation. When the deviation between the timing at which the model utterance section switches along the time axis (hereinafter, the model switching timing) and the utterance switching timing is equal to or greater than a predetermined tolerance, weight changing means changes the weighting sections set by the weight setting means, from the one corresponding to the syllable under estimation onward. The change is made so that, for the unit sections corresponding to the syllable under estimation and the syllables after it along the time axis, the phonemes constituting the syllable that the character string assigns to each unit section are identified as uttered phonemes by the phoneme specifying means.
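One way to picture the weight changing step is the sketch below. It rests on assumptions the description does not fix: the amount of the change is taken to equal the observed deviation, whole sections are shifted rather than only their start or end, and times are integer milliseconds. The names `SyllableSection` and `shift_weighting_sections` are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class SyllableSection:
    syllable: str
    start_ms: int
    end_ms: int


def shift_weighting_sections(sections: List[SyllableSection],
                             current_index: int,
                             model_switch_ms: int,
                             utterance_switch_ms: int,
                             tolerance_ms: int) -> List[SyllableSection]:
    """Shift the weighting sections from `current_index` onward.

    current_index is the position of the syllable under estimation. Nothing
    changes while the deviation stays below the tolerance; a deviation equal
    to or greater than the tolerance triggers the shift.
    """
    deviation = utterance_switch_ms - model_switch_ms  # negative means early
    if abs(deviation) < tolerance_ms:
        return list(sections)
    return [
        replace(s, start_ms=s.start_ms + deviation, end_ms=s.end_ms + deviation)
        if i >= current_index else s
        for i, s in enumerate(sections)
    ]
```

A speaker who switches 50 ms early with a 30 ms tolerance therefore has the weighting sections for the syllable under estimation and all later syllables moved 50 ms earlier, while sections for already completed syllables stay where they were.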

In such an utterance specifying device, when the utterance content of the voice data is identified, the acoustic models are collated with the feature quantity extracted from the voice data for each unit section, and weights defined from the utterance content information are applied. Moreover, according to the deviation of the actual utterance section from one model utterance section, the device changes the weighting sections from the one corresponding to that syllable onward along the time axis.

That is, the utterance specifying device of the present invention can adjust the weighting sections to match the actual utterance section of each syllable.
As a result, the device can improve the recognition accuracy of the utterance content for each syllable and, consequently, of the utterance section for each syllable.

Consequently, when a speech parameter is generated from the voice data in an utterance section, highly reliable syllable content can be attached to that speech parameter. This in turn makes it easier for speech synthesis to produce the voice that the person using it wants.

The model utterance section referred to here is a concept that includes at least each section in which sound is actually uttered; in addition to those sections, it may also include silent sections in which no voice is produced.

The model switching timing referred to here is the timing at which the model utterance section switches; it may be, for example, the reference utterance start timing or a timing defined on the basis of the reference utterance start timing.

In the utterance specifying device of the present invention, the weight changing means may move the weighting section earlier along the time axis if the utterance switching timing is earlier than the model switching timing (claim 2).

Such a device can improve the recognition accuracy of the utterance content and the utterance section for each syllable even for speech uttered by a person who starts each syllable earlier than the reference utterance start timing.

In this case, if the utterance switching timing is earlier than the model switching timing, the weight changing means may advance, along the time axis, the start timing of the weighting section corresponding to the syllable that follows the syllable under estimation (claim 3).

A person whose utterance switching timing is earlier than the model switching timing is likely to be early at the next syllable's switching timing as well. Advancing the start timing of the weighting section for the next syllable, as in the present invention, therefore makes it more likely that the recognition accuracy of the utterance content and the utterance section for that syllable improves.

In the present invention, the weighting sections whose weights are changed are not limited to the "weighting section corresponding to the syllable that follows the syllable under estimation" recited in claim 3; they may include, for example, the weighting sections for syllables in the same phrase of a song (that is, A-melody, B-melody, chorus, and so on) as the syllable under estimation.

The weight changing means may also advance, along the time axis, the end timing of the weighting section corresponding to the syllable under estimation if the length of the voiced section of the estimation-completed syllable, from its utterance start timing to its utterance end timing, is shorter than the model utterance section for the corresponding syllable (claim 4).

Such a device can improve the recognition accuracy of the utterance content and the utterance section for each syllable even for speech uttered by a person who voices each syllable for a shorter time than the length of the model utterance section.

Conversely, the weight changing means may move the weighting section later along the time axis if the utterance switching timing is later than the model switching timing (claim 5).
Such a device can improve the recognition accuracy of the utterance content and the utterance section for each syllable even for speech uttered by a person who starts each syllable later than the reference utterance start timing.

In this case, if the utterance switching timing is later than the model switching timing, the weight changing means may delay, along the time axis, the start timing of the weighting section corresponding to the syllable that follows the syllable under estimation (claim 6).

A person whose utterance switching timing is later than the model switching timing is likely to be late at the next syllable's switching timing as well. Delaying the start timing of the weighting section for the next syllable, as in the present invention, therefore makes it more likely that the recognition accuracy of the utterance content and the utterance section for that syllable improves.

In the present invention, the weighting sections whose weights are changed are not limited to the "weighting section corresponding to the syllable that follows the syllable under estimation" recited in claim 6; they may include, for example, the weighting sections for syllables in the same phrase of a song (that is, A-melody, B-melody, chorus, and so on) as the syllable under estimation.

Further, the weight changing means may delay, along the time axis, the end timing of the weighting section corresponding to the syllable under estimation if the length of the voiced section of the estimation-completed syllable, from its utterance start timing to its utterance end timing, is longer than the model utterance section for the corresponding syllable (claim 7).

Such a device can improve the recognition accuracy of the utterance content and the utterance section for each syllable even for speech uttered by a person who voices each syllable for a longer time than the length of the model utterance section.
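The four adjustment rules of claims 3, 4, 6, and 7 share one pattern: an early or short utterance moves a boundary earlier, and a late or long utterance moves it later. The sketch below expresses that pattern with signed deviations. The adjustment amounts (taken here to equal the measured deviations) and the name `adjust_boundaries` are assumptions for illustration; the claims only state the direction of each change.

```python
from typing import Tuple


def adjust_boundaries(next_start_ms: int,
                      current_end_ms: int,
                      switch_deviation_ms: int,
                      duration_deviation_ms: int) -> Tuple[int, int]:
    """Return the adjusted (next_start_ms, current_end_ms).

    next_start_ms: start of the weighting section for the syllable after the
        syllable under estimation (claims 3 and 6).
    current_end_ms: end of the weighting section for the syllable under
        estimation (claims 4 and 7).
    switch_deviation_ms: utterance switching timing minus model switching
        timing; negative when the speaker is early.
    duration_deviation_ms: voiced length of the estimation-completed syllable
        minus the length of its model utterance section; negative when shorter.
    """
    return (next_start_ms + switch_deviation_ms,
            current_end_ms + duration_deviation_ms)
```

A speaker who switches 40 ms early and voices the completed syllable 30 ms short thus has the next start advanced by 40 ms and the current end advanced by 30 ms; positive deviations delay the same boundaries.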

When a person utters consecutive syllables along the time axis, the switch between syllables rarely coincides exactly with the model switching timing.
Consequently, if the utterance specifying device applies weights from the unit sections immediately before or after the model switching timing, it may misrecognize a phoneme different from the one actually uttered.

In the present invention, therefore, the weighting section may be shorter than the model utterance section and contained within it (claim 8).
Because such a device applies no weight in the unit sections immediately before or after the model switching timing, it reduces the possibility of misrecognizing content different from what was actually uttered.
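A weighting section that is shorter than, and contained in, its model utterance section could be derived as in the sketch below. The symmetric margin and the name `weighting_section` are assumptions; the description only requires that the weighting section be shorter and lie inside the model utterance section.

```python
from typing import Tuple


def weighting_section(model_start_ms: int, model_end_ms: int,
                      margin_ms: int) -> Tuple[int, int]:
    """Shrink a model utterance section by margin_ms on both sides.

    The margin keeps the unit sections right around the model switching
    timing unweighted. It must be positive and leave a non-empty section.
    """
    if margin_ms <= 0:
        raise ValueError("margin_ms must be positive")
    if 2 * margin_ms >= model_end_ms - model_start_ms:
        raise ValueError("margin leaves no room for a weighting section")
    return model_start_ms + margin_ms, model_end_ms - margin_ms
```

For a model utterance section from 1000 ms to 1400 ms and a 50 ms margin, the weighting section runs from 1050 ms to 1350 ms.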

The present invention may also be a program for causing a computer to function as the utterance specifying device.
In this case, the program must cause the computer to execute an utterance information acquisition procedure, a model section specifying procedure, a weight setting procedure, a voice data acquisition procedure, a section setting procedure, a phoneme specifying procedure, a timing specifying procedure, and a weight changing procedure.

A program configured in this way can be recorded on a computer-readable recording medium such as a DVD-ROM, CD-ROM, or hard disk and loaded into a computer and started as needed, or acquired by a computer over a communication line and started as needed. Having the computer execute each procedure makes it function as the utterance specifying device recited in claim 1.

FIG. 1 is a block diagram of a speech synthesis system built around an information processing device to which the present invention is applied.
FIG. 2 is a flowchart showing the procedure of the speech parameter registration process.
FIG. 3 is a flowchart showing the procedure of the utterance section estimation process.
FIG. 4 is a diagram showing the relationship between weighting sections and model utterance sections.
FIG. 5 is a flowchart showing the procedure of the phoneme specifying process.
FIG. 6 is a flowchart showing the procedure of the weighting section change process.
FIG. 7 is a diagram showing an outline of the weighting section estimation process.
FIG. 8 is a diagram summarizing the outline of the weighting section estimation process.
FIG. 9 is a diagram showing the procedure of the speech synthesis process.

Embodiments of the present invention are described below with reference to the drawings.
<Speech synthesis system>
The speech synthesis system 1 shown in FIG. 1 outputs speech synthesized from speech parameters PM registered in advance (that is, synthesized sound) so that speech with the content specified by a user of the system is output. The speech parameters PM, described in detail later, are speech feature quantities used for so-called formant synthesis.

To this end, the speech synthesis system 1 includes a voice input device 10 for inputting voice, and a music server 25 that stores the voice input via the voice input device 10 (hereinafter, voice data SV) and the various data used for karaoke (hereinafter, music data MD). The system further includes an information processing device 30 that generates the speech parameters PM from the voice data SV and music data MD stored in the music server 25, and a data storage server 50 that stores the speech parameters PM generated by the information processing device 30. It also includes a voice output terminal 60 that outputs synthesized sound produced from the speech parameters PM stored in the data storage server 50.

<Music server>
The music server 25 is built around a storage device whose contents can be read and written, and is connected to the voice input device 10 via a communication network.

The music server 25 stores at least music data MD prepared in advance for each song. The music data MD includes song data DM for playing the song (for example, data in MIDI format or in raw audio format) and a lyrics data group DL.

The lyrics data group DL is data about the lyrics of the song. It comprises lyrics telop data DT, which represents the characters constituting the lyrics (hereinafter, lyric constituent characters), and lyrics output data DO, which defines a timing correspondence that associates the lyrics output timing, that is, the output timing of the lyric constituent characters, with the performance of the song data DM.

In the timing correspondence, the timing at which output of the lyrics telop data DT starts is associated with the timing at which performance of the song data DM starts, and the output timing of each lyric constituent character along the time axis of the song is defined by the elapsed time from the start of performance of the song data DM. The elapsed time here is, for example, a time indicating when the color change of a displayed lyric constituent character is executed, and is defined by the speed of the color change. A lyric constituent character may be an individual character of the lyrics, or a clause or phrase formed by grouping those characters according to a specific rule along the time axis.
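The timing correspondence can be pictured as a list of lyric constituent characters keyed by elapsed time from the start of the performance. The sketch below looks up which character's output timing has most recently been reached. The list-of-pairs layout, the millisecond unit, and the name `lyric_at` are assumptions for illustration and do not reflect the actual format of the lyrics output data DO.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def lyric_at(timings: List[Tuple[int, str]], elapsed_ms: int) -> Optional[str]:
    """Return the lyric character whose output timing was reached last.

    timings holds (elapsed_ms_from_performance_start, character) pairs sorted
    by time. None is returned before the first output timing.
    """
    starts = [t for t, _ in timings]
    i = bisect_right(starts, elapsed_ms) - 1
    return timings[i][1] if i >= 0 else None
```

Because every timing is relative to the start of the performance, the same table works regardless of when playback of the song data DM actually begins.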

<Voice input device>
The voice input device 10 includes a communication unit 11, an input receiving unit 12, a display unit 13, a voice input unit 14, a voice output unit 15, a sound source module 16, a storage unit 17, and a control unit 20. In other words, it is configured as a well-known karaoke device.

Of these, the communication unit 11 lets the voice input device 10 communicate with external devices via a communication network (for example, a public wireless communication network or a network line). The input receiving unit 12 is an input device (for example, keys, switches, or a remote control receiver) that accepts information and commands entered by external operation.

The display unit 13 is a display device (for example, a liquid crystal display or CRT) that displays the title, selection number, and lyrics of the music data MD. The voice input unit 14 is a device (a microphone) that converts sound into an electrical signal and inputs it to the control unit 20. The voice output unit 15 is a device (a speaker) that converts electrical signals from the control unit 20 into sound and outputs it. The sound source module 16 is a device (for example, a MIDI sound source) that outputs sound simulating that of a sound source (that is, output sound) when the song data DM is in MIDI format.

The storage unit 17 is a non-volatile storage device (for example, a hard disk drive or flash memory) whose contents can be read and written.
The control unit 20 is built around a well-known computer having at least a ROM 21, which stores processing programs and data that must be retained even when power is cut; a RAM 22, which temporarily stores processing programs and data; and a CPU 23, which executes each process (various computations) according to the processing programs stored in the ROM 21 and RAM 22.

The ROM 21 stores a processing program with which the control unit 20 executes well-known karaoke performance processing, and a processing program with which the control unit 20 executes voice storage processing. In the voice storage processing, the singing sound (the speaker's voice) input via the voice input unit 14 while one song is being played by the karaoke performance processing is stored in the music server 25 as voice data SV, in association with song identification information identifying the target song.

The voice storage processing is described below. Following the karaoke performance processing, the voice input device 10 acquires from the music server 25 the music data MD corresponding to the one song specified via the input receiving unit 12 (hereinafter, the target song), and plays the target song based on the song data DM in that music data MD. Based on the lyrics data group DL in that music data MD, it also displays the lyrics on the display unit 13 and changes their display color at the timings at which they should be sung (uttered) as the performance progresses.

Further, song identification information identifying the target song (for example, the title or selection number of the music data MD), speaker identification information (hereinafter, the speaker ID) identifying the person who input voice from the voice input unit 14 (hereinafter, the speaker), and the voice data SV are associated with one another when the music data MD is performed, and are stored in the music server 25. The voice data SV stored in the music server 25 is also associated with speaker feature information representing the speaker's characteristics, which includes, for example, the speaker's gender and age.

The song identification information and the speaker ID are associated as follows, for example: the speaker logs in to the voice input device 10 with the speaker ID via the input receiving unit 12, so that the speaker ID is input to the voice input device 10, and the association is then made when, for example, the target song is selected.

Through the voice storage processing described above, the target song selected by the speaker (speaker ID) is played, and the speaker sings (utters) into the voice input unit 14 (microphone) the lyrics whose color changes on the display unit 13 as the performance progresses. The resulting singing voice is stored in the music server 25 as the voice data SV of that speaker ID for the target song.

If gender, age, and the like are entered together with the speaker ID, they are included in the speaker feature information and stored in the music server 25.
Thereafter, when a speaker ID is input via the input receiving unit 32, the control unit 40 of the information processing device 30 queries the music server 25 and downloads the target song for that speaker ID and its voice data SV to the information processing device 30.

〈Information processing device〉
Next, the information processing device 30 includes a communication unit 31, an input receiving unit 32, a display unit 33, a storage unit 34, and a control unit 40.

Of these, the communication unit 31 communicates with external devices via a communication network. The input receiving unit 32 is an input device that accepts information and commands in response to external operations. The display unit 33 is a display device that displays images.

The storage unit 34 is a non-volatile storage device whose contents can be read and written. The control unit 40 is built around a well-known computer having at least a ROM 41, a RAM 42, and a CPU 43.

The ROM 41 of the information processing device 30 stores a processing program with which the control unit 40 executes a voice parameter registration process. This process stores, in the data storage server 50, the voice parameters PM generated from the voice data SV and the music data MD held in the music server 25.

The data storage server 50 is a device built around a storage device whose contents can be read and written, and is connected to the information processing device 30 via a communication network.
〈Voice parameter registration process〉
As shown in FIG. 2, when the voice parameter registration process is started after the voice storage process has finished, it first acquires from the music server 25 the music data MD of the song designated via the input receiving unit 32 (that is, the target song) (S110). It then extracts the lyrics data group DL of the target song (S120) and acquires from the music server 25 one piece of voice data SV that corresponds both to the target song and to the speaker ID designated via the input receiving unit 32 (S130).

Next, an utterance section estimation process is executed (S140). In the voice data SV acquired in S130, this process identifies the sections estimated to have been uttered for each of the syllables making up the lyric characters (hereinafter, utterance sections) and generates utterance section data representing each utterance section.

Then, voice parameters PM are derived from the speech waveform in each utterance section (hereinafter, syllable waveform) given by the utterance section data generated in the utterance section estimation process (S140) (S150). The voice parameters PM derived in S150 include at least the fundamental frequency, the mel-frequency cepstrum (MFCC), the power, and the time difference of each. Methods for deriving these quantities are well known, so a detailed description is omitted here. For example, the fundamental frequency may be derived using autocorrelation of the syllable waveform along the time axis, autocorrelation of the frequency spectrum of the syllable waveform, or the cepstrum method. The MFCC may be derived by applying a time analysis window to the syllable waveform, performing frequency analysis (for example, an FFT) in each window, taking the logarithm of the magnitude at each frequency, and then applying frequency analysis again to the result. The power may be derived by applying a time analysis window to the syllable waveform and integrating the squared amplitude along the time axis.
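The power and the time-domain autocorrelation approach to the fundamental frequency can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of the embodiment. The function names, the rectangular window, and the search range `f_min` to `f_max` are assumptions made for the example.

```python
import math

def frame_power(samples):
    """Power of one analysis window: sum of squared amplitudes."""
    return sum(s * s for s in samples)

def f0_autocorr(samples, sample_rate, f_min=80.0, f_max=1000.0):
    """Estimate the fundamental frequency as the lag that maximizes
    the time-domain autocorrelation (search range is an assumption)."""
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(samples) - 1)
    best_lag, best_val = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        val = sum(samples[i] * samples[i + lag]
                  for i in range(len(samples) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else None

# Hypothetical test signal: a 200 Hz sine sampled at 8 kHz for 50 ms.
tone = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(400)]
f0 = f0_autocorr(tone, 8000)
```

Real singing voices contain harmonics and noise, so a practical implementation would normally add windowing, normalization, and octave-error handling on top of this.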

Next, voice parameter registration is executed to store the voice parameters PM derived in S150 (S160). The voice parameters PM stored in the data storage server 50 in S160 are associated with the content (type) of the syllable uttered in each utterance section, the speaker ID, and the speaker feature information.

The voice parameter registration process then ends.
〈Utterance section estimation process〉
When the utterance section estimation process is started in S140 of the voice parameter registration process, as shown in FIG. 3, it first sets a plurality of frames t_n on the voice data SV acquired in S130 so that they are contiguous in time (S210). A frame t_n here is a unit of time with a predetermined length (for example, 10 ms), and the subscript n indicates that the frame t is the n-th along the time axis.
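The frame setting of S210 can be pictured with the short sketch below. It assumes non-overlapping frames and drops any incomplete frame at the end; the text only states that frames are contiguous and of fixed length, so both choices, like the function name, are assumptions.

```python
def split_into_frames(num_samples, sample_rate, frame_ms=10):
    """Return (start, end) sample indices of contiguous frames t_1, t_2, ...

    Each frame is frame_ms long; the end index is exclusive.
    """
    frame_len = sample_rate * frame_ms // 1000
    return [(start, start + frame_len)
            for start in range(0, num_samples - frame_len + 1, frame_len)]

# Hypothetical input: 350 samples at 10 kHz give three 10 ms frames.
frames = split_into_frames(350, 10000)
```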

Then, voice parameters PM are calculated from each frame t_n of the voice data SV (S220). Next, the lyrics telop data DT in the lyrics data group DL acquired in S120 is converted into lyric phoneme data, in which each lyric character is expressed in phoneme notation (S230). Specifically, in S230, well-known morphological analysis is applied to the string of lyric characters represented by the lyrics telop data DT to split it into morphemes (words). Each morpheme is then converted into phoneme notation using a phoneme dictionary prepared in advance as dictionary data giving the phoneme notation of each word. The lyrics telop data DT is thereby converted into lyric phoneme data.

Based on the converted lyric phoneme data and the lyrics output data DO in the lyrics data group DL acquired in S120, the section ES in which each constituent syllable CS (a syllable making up the lyric characters) should be uttered (hereinafter, model utterance section) is identified (S240).

Specifically, a model utterance section ES is a section defined by a plurality of frames t contiguous along the time axis. The section between the output start timings of two consecutive lyric characters, as specified by the lyrics output data DO, is taken as the model utterance section ES of the corresponding constituent syllable CS. However, in S240, when a lyric character is a kanji, the section between the output start timings is divided equally by the number of characters obtained when the kanji is rewritten in hiragana or the like, and each resulting subsection is identified as the model utterance section ES of the corresponding syllable CS.
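The equal division applied to kanji in S240 can be sketched as follows. The function name and the half-open frame intervals are assumptions for illustration, and the example assumes a kanji read as the two syllables "ha" and "na".

```python
def model_sections(start_frame, end_frame, syllables):
    """Split [start_frame, end_frame) equally among the given syllables.

    Returns (syllable, first_frame, end_frame_exclusive) tuples, one
    model utterance section ES per syllable.
    """
    n = len(syllables)
    span = end_frame - start_frame
    bounds = [start_frame + span * k // n for k in range(n + 1)]
    return [(syl, bounds[k], bounds[k + 1])
            for k, syl in enumerate(syllables)]

# Hypothetical timings: the character is output from frame 100 and the
# next character from frame 160.
sections = model_sections(100, 160, ["ha", "na"])
```

A single kana character is the degenerate case with one syllable, for which the whole interval becomes its model utterance section.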

The constituent syllables CS in this embodiment include the silent state sp, which corresponds to a silent character (that is, a rest specified in the music data MD). In the notation CS_N, the index N indicates that the constituent syllable CS is the N-th along the time axis, counted from the beginning of the lyrics.

Next, for the phonemes making up the constituent syllable CS_N corresponding to each model utterance section ES_N identified in S240, a section GS_N in which a weight β is applied in the phoneme identification process described later (hereinafter, weighting section) is set (S250).

A weighting section GS_N in this embodiment is a section defined by a plurality of frames t contiguous along the time axis, and one weighting section GS is set for each model utterance section ES. As shown in FIG. 4(A), the start timing of each weighting section GS_N is set later than the start of the model utterance section ES_N by a duration equal to a prescribed number α of frames t, and its end timing is set earlier than the end of the model utterance section ES_N by the same duration of α frames. That is, each weighting section GS_N is shorter than the model utterance section ES_N and is contained within it.
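In frame indices, the relation between ES_N and GS_N reduces to trimming α frames from each end. The sketch below assumes half-open intervals and rejects a value of α that would leave no frames; the function name and the error handling are illustrative assumptions.

```python
def weighting_section(es_start, es_end, alpha):
    """Derive the weighting section GS from the model utterance section ES.

    GS starts alpha frames after the start of ES and ends alpha frames
    before the end of ES (end index exclusive).
    """
    gs_start = es_start + alpha
    gs_end = es_end - alpha
    if gs_start >= gs_end:
        raise ValueError("alpha is too large for this model utterance section")
    return gs_start, gs_end

# Hypothetical values: ES covers frames 100..129 and alpha is 5 frames.
gs = weighting_section(100, 130, 5)
```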

Then, a weight β is defined for each weighting section GS_N set in S250 (S260). The weight β defined in S260 is a value specified per phoneme, and is set to a larger value (for example, 1.5) for the phonemes that make up the constituent syllable CS_N corresponding to the model utterance section ES_N. For example, if the syllable corresponding to the model utterance section ES_N is "ka", the weight β for the phonemes "k" and "a" is set to the large value, while the weight β for the other phonemes (for example, "i" and "u") is set to a small value (for example, 1). The weight β is a positive real number.

Next, a phoneme identification process is executed (S270). This process identifies the utterance content in each frame t_n of the voice data SV as an uttered phoneme. Each time it identifies, from those uttered phonemes, a syllable uttered in the voice data SV (hereinafter, uttered syllable VS), it also identifies the timing at which the uttered syllable VS switched along the time axis (hereinafter, utterance switching timing). The uttered syllables VS include the silent state sp.

Next, each section between two utterance switching timings that were identified in S270 and are adjacent along the time axis is identified as an utterance section, and syllable data is generated that associates the uttered syllable (the utterance content of that section) with its syllable waveform (S280). The pitch of the output sound corresponding to the utterance section is then associated with each piece of generated syllable data (S290).
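Pairing adjacent switching timings into utterance sections can be sketched as below. The data layout (a list of boundary frame indices and a parallel list of syllables) and the function name are assumptions chosen for the example, not structures defined in the embodiment.

```python
def utterance_sections(boundaries, syllables):
    """Pair each uttered syllable with the section between two adjacent
    utterance switching timings (frame indices, end exclusive)."""
    if len(boundaries) != len(syllables) + 1:
        raise ValueError("need one more boundary than syllables")
    return [(syl, start, end)
            for syl, start, end in zip(syllables, boundaries, boundaries[1:])]

# Hypothetical result of the phoneme identification: silence, "ka", "sa".
sections = utterance_sections([0, 12, 40, 71], ["sp", "ka", "sa"])
```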

The process then proceeds to S150 of the voice parameter registration process.
〈Phoneme identification process〉
When the phoneme identification process is started in S270 of the utterance section estimation process, as shown in FIG. 5, it first sets the frame t_n to its initial value (n = 1) (S310). It then sets the likelihood for frame t_1 to an initial value (S320). That is, because a person is usually silent at the moment an utterance begins, S320 sets the uttered phoneme for frame t_1 to the silent state sp.

Next, the frame t_n is advanced by one along the time axis of the voice data SV (n = n + 1) (S330). For that frame t_n, an utterance likelihood representing the probability that each phoneme was uttered is derived (S340).

Specifically, in S340, one of the phonemes that frame t_n can take according to the sequence of phonemes along the time axis specified in the lyric phoneme data (hereinafter, candidate phonemes) is selected, and the acoustic model corresponding to it is matched against the voice parameters PM of frame t_n. An acoustic model is a model prepared in advance for each phoneme as voice parameters PM representing that phoneme.

The matching result (that is, the likelihood that the phoneme was uttered; hereinafter, first likelihood) is then given the weight β that was set in S260 for the weighting section GS corresponding to frame t_n and for the phoneme corresponding to that acoustic model. In other words, the utterance likelihood for the phoneme corresponding to one acoustic model is derived by multiplying the first likelihood by the weight β (S340).

Next, it is determined whether the utterance likelihood has been derived for all candidate phonemes (S350). If it has not (S350: NO), the process returns to S340, matches a new acoustic model against the voice parameters PM of frame t_n, and derives the utterance likelihood for that candidate phoneme. This is repeated.

If, on the other hand, the utterance likelihood has been derived for all candidate phonemes (S350: YES), the candidate phoneme whose acoustic model yields the largest utterance likelihood for frame t_n is identified as the uttered phoneme (S360).
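The loop of S340 to S360 for a single frame can be illustrated as follows. The sketch takes the first likelihoods as given numbers instead of computing them from acoustic models, and it treats frames outside the weighting section as having weight 1.0 for every phoneme, which is an assumption the text does not state. It also selects per frame only and does not reproduce a full Viterbi search. All names are hypothetical.

```python
def weight_for(phoneme, frame, gs, syllable_phonemes, high=1.5, low=1.0):
    """Weight beta for one phoneme at one frame.

    gs is the weighting section (start, end exclusive). Phonemes of the
    expected syllable get the high weight inside gs, all others the low one.
    """
    gs_start, gs_end = gs
    inside = gs_start <= frame < gs_end
    return high if inside and phoneme in syllable_phonemes else low

def pick_uttered_phoneme(first_likelihoods, frame, gs, syllable_phonemes):
    """Multiply each first likelihood by beta and return the best phoneme."""
    scored = {p: l * weight_for(p, frame, gs, syllable_phonemes)
              for p, l in first_likelihoods.items()}
    return max(scored, key=scored.get)

# Hypothetical first likelihoods for one frame; expected syllable is "ka".
likelihoods = {"k": 0.30, "a": 0.25, "i": 0.40}
inside_pick = pick_uttered_phoneme(likelihoods, 110, (105, 125), {"k", "a"})
outside_pick = pick_uttered_phoneme(likelihoods, 100, (105, 125), {"k", "a"})
```

With these numbers the weight tips the decision toward "k" inside the weighting section, while the unweighted frame still selects "i".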

That is, in this embodiment, the uttered phoneme is identified after expanding hypotheses for all phonemes that frame t_n can take, in accordance with the well-known Viterbi algorithm and Viterbi alignment.

Next, it is determined whether the frame for which S340 was executed (hereinafter, current frame) t_n has reached the final frame t_nmax (S370). If the current frame t_n has not reached the final frame t_nmax (S370: NO), it is determined whether the syllable switched at the current frame t_n (S380).

Specifically, in S380 of this embodiment, when the uttered phoneme in frame t_{n-1} is a vowel and the uttered phoneme in frame t_n is a different vowel, a consonant, or the silent state sp, the syllable is determined to have switched at frame t_n. In this case, S380 identifies the boundary between frame t_n and frame t_{n-1} (or frame t_n itself) as the utterance switching timing.
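The switching rule of S380 can be written compactly: if the previous phoneme is a vowel, any change of phoneme is a syllable switch. The sketch below assumes the five Japanese vowels and the label "sp" for silence; the function names and the list-based input are illustrative.

```python
VOWELS = {"a", "i", "u", "e", "o"}

def syllable_switched(prev_phoneme, cur_phoneme):
    """True when the previous frame held a vowel and the current frame
    holds a different vowel, a consonant, or silence ("sp")."""
    return prev_phoneme in VOWELS and cur_phoneme != prev_phoneme

def switch_frames(phonemes):
    """Indices n at which the syllable switches, taking frame n itself
    as the utterance switching timing."""
    return [n for n in range(1, len(phonemes))
            if syllable_switched(phonemes[n - 1], phonemes[n])]

# Hypothetical per-frame uttered phonemes for "ka", "sa", then silence.
timings = switch_frames(["sp", "k", "a", "a", "s", "a", "sp"])
```

Note that under this rule a transition from silence or from a consonant is not itself a switch, so the first boundary in the example falls where "a" gives way to "s".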

If the determination in S380 finds that the syllable has not switched at the current frame t_n (S380: NO), the process returns to S330, advances the frame t_n by one along the time axis (n = n + 1), and then executes S350 to S360.

If, on the other hand, the syllable has switched at the current frame t_n (S380: YES), a weighting section change process is executed, which changes the frames t that define the weighting sections GS_N corresponding to the frames t from frame t_{n+1} onward (S390).

In the following, among the uttered syllables VS, the uttered syllable VS spanning from the current frame t_n to the frame t at the next utterance switching timing along the time axis is called the syllable under estimation VS_M, and the uttered syllable VS immediately preceding it along the time axis is called the estimation-completed syllable VS_{M-1}. The uttered syllables VS include the silent state sp.

Here, the index M indicates that the syllable under estimation VS_M is the M-th uttered syllable VS along the time axis, and the index M-1 indicates that the estimation-completed syllable VS_{M-1} is the (M-1)-th uttered syllable VS along the time axis.

The process then returns to S330, advances the frame t_n by one along the time axis (n = n + 1), and executes S350 to S360.
If the determination in S370 finds that the current frame t_n has reached the final frame t_nmax (S370: YES), each section between two utterance switching timings adjacent along the time axis is identified as an utterance section, and utterance section data is generated (S400).

The process then proceeds to S280 of the utterance section estimation process.
〈Weighting section change process〉
When the weighting section change process is started in S390 of the phoneme identification process, as shown in FIG. 6, it first determines whether the current frame t_n is among the frames t that make up the weighting section GS_{M-1} (S610).

If the determination in S610 finds that the current frame t_n is not among the frames t that make up the weighting section GS_{M-1} (S610: NO), the process proceeds to S640, which is described in detail later.

If, on the other hand, the current frame t_n is among the frames t that make up the weighting section GS_{M-1} (S610: YES), the number of frames between the last frame t of the weighting section GS_{M-1} and the current frame t_n is derived as a time difference DP (S620).

Next, the first frame t of the weighting section GS_{M+1} is changed so that it is earlier by the time difference DP along the time axis (S630). That is, in S630, the start of the weighting section GS_{M+1}, which corresponds to the constituent syllable CS_{M+1} that follows the syllable under estimation VS_M along the time axis, is moved earlier.

In S640, it is determined whether the current frame t_n is among the frames t that make up the weighting section GS_M. If it is (S640: YES), the number of frames between the first frame t of the weighting section GS_M and the current frame t_n is derived as a time difference DC (S650).

Next, the first frame t of the weighting section GS_{M+1} is changed so that it is later by the time difference DC along the time axis (S660). That is, in S660, the start of the weighting section GS_{M+1} corresponding to the constituent syllable CS_{M+1} is moved later.

The process then proceeds to S670.
The process also proceeds to S670 when S630 has completed, or when the determination in S640 finds that the current frame t_n is not among the frames t that make up the weighting section GS_M (S640: NO).
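The branch structure of S610 to S660 can be summarized in a few lines. The sketch below represents each weighting section by its first and last frame index (both inclusive) and takes "the number of frames between" two frames to be the difference of their indices; that reading, like the function and argument names, is an assumption.

```python
def shift_next_start(gs_prev, gs_cur, gs_next_start, n):
    """Return the new first frame of GS_{M+1} after a syllable switch at frame n.

    gs_prev and gs_cur are (first_frame, last_frame) of GS_{M-1} and GS_M.
    """
    if gs_prev[0] <= n <= gs_prev[1]:   # S610: YES, switch came early
        dp = gs_prev[1] - n             # S620: time difference DP
        return gs_next_start - dp       # S630: start GS_{M+1} earlier
    if gs_cur[0] <= n <= gs_cur[1]:     # S640: YES, switch came late
        dc = n - gs_cur[0]              # S650: time difference DC
        return gs_next_start + dc       # S660: start GS_{M+1} later
    return gs_next_start                # S640: NO, leave unchanged

# Hypothetical sections: GS_{M-1} = 105..125, GS_M = 135..155,
# and GS_{M+1} originally starts at frame 165.
early = shift_next_start((105, 125), (135, 155), 165, 120)
late = shift_next_start((105, 125), (135, 155), 165, 140)
unchanged = shift_next_start((105, 125), (135, 155), 165, 130)
```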

In S670, the section length EL of the utterance section uttered as the estimation-completed syllable VS_{M-1} is derived. Specifically, the section length EL may be the number of frames t in which the uttered phonemes making up the estimation-completed syllable VS_{M-1} were identified.

Next, the section length PL of the model utterance section ES_{M-1} of the constituent syllable CS_{M-1} corresponding to the estimation-completed syllable VS_{M-1} is derived (S680). Specifically, the section length PL may be the number of frames t that make up the model utterance section ES of the constituent syllable CS_{M-1}.

It is then determined whether the section length EL differs from the section length PL (S690). If it does (S690: YES), the difference D between the section length EL and the section length PL (D = EL - PL) is derived (S700).
To express the difference D in real time, D may be multiplied by the duration of one frame.

Then, the last frame t of the weighting section GS_M is shifted by the difference D along the time axis (S710).
In S710 of this embodiment, if the difference D is negative (that is, EL is less than PL), the last frame t of the weighting section GS_M is moved earlier along the time axis by the absolute value of D. If the difference D is positive (that is, EL is greater than PL), the last frame t of the weighting section GS_M is moved later along the time axis by the absolute value of D. In the latter case, the upper limit of the change may be the frame t at which the model utterance section ES_M of the constituent syllable CS_M (corresponding to the syllable under estimation VS_M) gives way to the model utterance section ES_{M+1} of the constituent syllable CS_{M+1} (corresponding to the next uttered syllable VS_{M+1} along the time axis).

The weighting section change process then ends, and the process proceeds to S330 of the phoneme identification process.
If the determination in S690 finds that the section length EL matches the section length PL (S690: NO), the weighting section change process likewise ends and the process proceeds to S330 of the phoneme identification process.

A "match" in S690 covers not only the case where the section length EL is exactly equal to the section length PL but also the case where EL agrees with PL to within a predetermined error range. In that case, S690 may treat EL as matching PL (S690: NO) when the absolute value of their difference (|EL - PL|) is smaller than a threshold γ.
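Steps S690 to S710, including the tolerance γ and the optional upper limit, can be sketched as follows. The function names, the default of γ = 0 (exact match only), and the 10 ms frame length used for the real-time conversion are assumptions for illustration.

```python
def shift_current_end(gs_end, el, pl, gamma=0, upper_limit=None):
    """Return the new last frame of GS_M given section lengths EL and PL.

    D = EL - PL. A negative D moves the end earlier, a positive D moves
    it later, optionally capped at upper_limit.
    """
    d = el - pl                              # S700
    if d == 0 or abs(d) < gamma:             # S690: NO (treated as a match)
        return gs_end
    new_end = gs_end + d                     # S710
    if d > 0 and upper_limit is not None:
        new_end = min(new_end, upper_limit)  # cap at the ES_M / ES_{M+1} boundary
    return new_end

def frames_to_ms(d, frame_ms=10):
    """Convert a difference in frames to real time in milliseconds."""
    return d * frame_ms

# Hypothetical values: GS_M ends at frame 155 and PL is 30 frames.
shorter = shift_current_end(155, el=20, pl=30)
longer = shift_current_end(155, el=40, pl=30, upper_limit=160)
within_tolerance = shift_current_end(155, el=31, pl=30, gamma=2)
```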

That is, in the weighting section change process, when the utterance end timing of the estimation-completed syllable VS_{M-1} is earlier than the end timing of the corresponding model utterance section ES_{M-1}, the start timing of the weighting section GS_{M+1} is moved earlier along the time axis, as shown in FIG. 4(B). In this case, the start timing of the weighting section GS_M is set to the frame t_n that follows the frame t_{n-1} at which the utterance of the estimation-completed syllable VS_{M-1} ended (that is, the current frame t_n).

When the duration of the utterance section for the estimation-completed syllable VS_{M-1} is shorter than the duration of the model utterance section ES_{M-1}, the end timing of the weighting section GS_M is moved earlier along the time axis, as shown in FIG. 4(C). In FIG. 4, M = N + 1.

Also, when the utterance end timing of the estimation-completed syllable VS_{M-1} is later than the start timing of the model utterance section ES_M corresponding to the syllable under estimation VS_M, the start timing of the weighting section GS_{M+1} is moved later along the time axis, as shown in FIGS. 7(A) and 7(B).

When the duration of the utterance section for the estimation-completed syllable VS_{M-1} is longer than the duration of the model utterance section ES_{M-1}, the end timing of the weighting section GS_M is moved later along the time axis, as shown in FIG. 7(C). In this case, however, the latest possible end timing of the weighting section GS_M is the timing at which the constituent syllable CS_M switches to the constituent syllable CS_{M+1} corresponding to the uttered syllable VS_{M+1}. In FIG. 7, M = N + 1.

Furthermore, as shown in FIG. 8, when the utterance end timing of the estimation-completed syllable VS_{M-1} falls within the weighting-prohibited section PS_{M-1}, or when the duration of the utterance section for the estimation-completed syllable VS_{M-1} matches the duration of the model utterance section ES_{M-1}, the weighting section GS_M is left unchanged and processing moves on to the next frame t along the time axis. The "subsequent syllable" shown in FIG. 8 is the uttered syllable VS_{M+1}.

〈Voice output terminal〉
Next, the voice output terminal is described (see FIG. 1).
The voice output terminal 60 includes an information receiving unit 61, a display unit 62, a sound output unit 63, a communication unit 64, a storage unit 65, and a control unit 67. The voice output terminal 60 may be, for example, a well-known portable terminal (a mobile phone or a personal digital assistant) or a well-known information processing device (a so-called personal computer).

Of these, the information receiving unit 61 accepts information entered via an input device (not shown). The display unit 62 displays images based on signals from the control unit 67. The sound output unit 63 is a well-known device that outputs sound and includes, for example, a PCM sound source and a speaker.

The communication unit 64 allows the voice output terminal 60 to exchange information with external devices via a well-known communication network. The storage unit 65 is a non-volatile storage device whose contents can be read and written, and stores various processing programs and data.

The control unit 67 is built around a known computer having at least a ROM, a RAM, and a CPU.
<Speech synthesis process>
Next, the speech synthesis process executed by the control unit 67 of the audio output terminal 60 will be described.

The speech synthesis process is started when a start command is input via the information receiving unit 61 of the audio output terminal 60.
As shown in FIG. 9, once started, the speech synthesis process first acquires the information input via the information receiving unit 61 (hereinafter referred to as input information) (S910). The input information acquired in S910 includes, for example, an output text representing the content (wording) of the speech to be output as synthesized sound, and output property information representing the properties of the sound to be output as synthesized sound. The sound properties referred to here (that is, the output property information) include characteristics of the speaker's voice, such as the speaker's gender and age.

Subsequently, the voice parameters PM that correspond to each of the syllables needed to output the output text acquired in S910 as synthesized sound, and that are associated with the information most similar to the output property information acquired in S910, are extracted from the data storage server 50 (S920).
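The selection in S920 can be illustrated with a minimal sketch. The embodiment does not state how similarity to the output property information is measured, so the `VoiceParameter` record, the `dissimilarity` score (a fixed penalty for a gender mismatch plus the age gap in years), and the in-memory `store` standing in for the data storage server 50 are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceParameter:
    """Hypothetical record for one voice parameter PM held by the data storage server."""
    syllable: str   # syllable the parameter was extracted for
    gender: str     # speaker gender registered with the parameter
    age: int        # speaker age registered with the parameter
    data: tuple     # placeholder for the actual parameter values


def dissimilarity(param, gender, age):
    # Assumed measure: a gender mismatch costs a fixed penalty, plus the age gap in years.
    return (0 if param.gender == gender else 100) + abs(param.age - age)


def select_parameters(store, syllables, gender, age):
    """For each syllable of the output text, pick the stored PM closest to the request."""
    selected = []
    for syllable in syllables:
        candidates = [p for p in store if p.syllable == syllable]
        if not candidates:
            raise KeyError(f"no voice parameter registered for syllable {syllable!r}")
        selected.append(min(candidates, key=lambda p: dissimilarity(p, gender, age)))
    return selected


store = [
    VoiceParameter("ka", "female", 25, (1,)),
    VoiceParameter("ka", "male", 40, (2,)),
    VoiceParameter("sa", "female", 60, (3,)),
    VoiceParameter("sa", "female", 30, (4,)),
]
chosen = select_parameters(store, ["ka", "sa"], gender="female", age=28)
```

Any other similarity measure could be substituted for `dissimilarity` without changing the overall flow of one lookup per syllable.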

Then, the voice parameters PM acquired in S920 are set so that synthesized sound is output with the content of the output text acquired in S910 (S930). Subsequently, speech is synthesized based on the voice parameters PM set in S930 (S940). For the speech synthesis in S940, the method described in Patent Document 1 or a well-known speech synthesis technique based on formant synthesis may be used.

Further, the synthesized sound generated in S940 is output from the sound output unit 63 (S950).
Thereafter, the speech synthesis process ends.
[Effects of the embodiment]
As described above, in the weighting section change process of the present embodiment, when the utterance end timing of the estimated completion syllable VS_M-1 is earlier than the end timing of the model utterance section ES_M-1, the start timing of the weighting section GS_M+1 is advanced along the time axis. In this way, for a person whose utterance start timing for the utterance syllable VS consistently precedes the switching timing of the model utterance section ES, the weighting sections GS corresponding to the syllables from that utterance syllable VS onward along the time axis can be reset to appropriate time ranges.

In addition, in the weighting section change process, when the time length of the utterance section for the estimated completion syllable VS_M-1 is shorter than the time length of the model utterance section ES_M-1, the end timing of the weighting section GS_M is also advanced along the time axis. In this way, for a person who utters over a section shorter than the length of the model utterance section ES, the weighting section GS corresponding to the utterance syllable VS can be reset to an appropriate time range.

Further, in the weighting section change process, when the utterance end timing of the estimated completion syllable VS_M-1 is later than the start timing of the model utterance section ES_M, the start timing of the weighting section GS_M+1 is delayed along the time axis. In this way, for a person whose utterance start timing for the utterance syllable VS is consistently delayed relative to the switching timing of the model utterance section ES, the weighting sections GS corresponding to the syllables from that utterance syllable VS onward along the time axis can be reset to appropriate time ranges.

In addition, in the weighting section change process, when the time length of the utterance section for the estimated completion syllable VS_M-1 is longer than the time length of the model utterance section ES_M-1, the end timing of the weighting section GS_M is delayed along the time axis. In this way, for a person who utters over a section longer than the length of the model utterance section ES, the weighting section GS corresponding to the utterance syllable VS can be reset to an appropriate time range.
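The four adjustments described above can be summarized in a short sketch. It rests on several assumptions that this section does not state: sections are expressed in integer frame indices, the end of ES_M-1 coincides with the start of ES_M, the start of GS_M+1 is shifted by exactly the observed timing deviation, and the end of GS_M is shifted by exactly the observed difference in length. The `Section` type and the function name are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Section:
    start: int  # first frame of the section
    end: int    # frame at which the section ends

    @property
    def length(self):
        return self.end - self.start


def change_weighting_sections(vs_prev, es_prev, ps_prev, gs_cur, gs_next):
    """Return the adjusted pair (GS_M, GS_M+1).

    vs_prev: observed utterance section of the estimated completion syllable VS_M-1
    es_prev: model utterance section ES_M-1 (its end is taken as the start of ES_M)
    ps_prev: weighting prohibited section PS_M-1 around the model switching timing
    """
    # Timing rule: leave GS_M+1 alone when the observed end lies inside PS_M-1;
    # otherwise move its start earlier (negative shift) or later (positive shift).
    if not (ps_prev.start <= vs_prev.end <= ps_prev.end):
        shift = vs_prev.end - es_prev.end
        gs_next = replace(gs_next, start=gs_next.start + shift)

    # Length rule: leave GS_M alone when the lengths match; otherwise move its
    # end earlier (shorter utterance) or later (longer utterance).
    stretch = vs_prev.length - es_prev.length
    if stretch != 0:
        gs_cur = replace(gs_cur, end=gs_cur.end + stretch)

    return gs_cur, gs_next


# A speaker who finishes syllable M-1 twenty frames early.
new_cur, new_next = change_weighting_sections(
    vs_prev=Section(100, 180),
    es_prev=Section(100, 200),
    ps_prev=Section(195, 205),
    gs_cur=Section(205, 295),
    gs_next=Section(305, 395),
)
```

In this example both the end of GS_M and the start of GS_M+1 move twenty frames earlier; a practical implementation would additionally need to keep each adjusted section non-empty.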

From the above, the utterance section specifying process of the above embodiment can improve the recognition accuracy of the utterance content for each syllable and, in turn, the recognition accuracy of the utterance section for each syllable.

Therefore, according to the voice parameter registration process of the above embodiment, when the voice parameter for each utterance section is generated, highly reliable syllable content can be attached to that voice parameter. As a result, when the speech synthesis process is executed in the above embodiment, the synthesized sound desired by the person using the speech synthesis can be realized more easily.

Moreover, in the utterance section specifying process of the above embodiment, the weighting sections GS are set after the unit sections immediately before and immediately after the switching timing of the model utterance section ES have been designated as weighting prohibited sections PS. This is because, when a person utters syllables that are consecutive along the time axis, the switch from one syllable to the next rarely coincides exactly with the model switching timing. Prohibiting the application of the weight β immediately before and after the switching timing of the model utterance section ES therefore reduces the chance that the phoneme specifying process mistakes the uttered phoneme for a different one.
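The interplay between the weighting prohibited sections PS and the weight β can be sketched as follows. This is only an illustration under stated assumptions: the weighting section GS is obtained by trimming a fixed `margin` of unit sections from both ends of the model utterance section ES, and β is added to the log-likelihood of the expected phoneme. How the weight is actually combined with the likelihood is not stated in this section, and the function names are hypothetical.

```python
def weighting_section(es_start, es_end, margin):
    """Return GS as (start, end): ES with `margin` unit sections trimmed at each end.

    The trimmed parts play the role of the weighting prohibited sections PS.
    """
    if es_end - es_start <= 2 * margin:
        raise ValueError("model utterance section is too short for this margin")
    return es_start + margin, es_end - margin


def pick_phoneme(log_likelihoods, frame, expected, gs, beta):
    """Return the phoneme with the largest weighted score in one unit section.

    Assumption: beta is added to the expected phoneme's log-likelihood,
    and only while `frame` lies inside the weighting section GS.
    """
    gs_start, gs_end = gs
    scores = dict(log_likelihoods)  # copy so the caller's values are untouched
    if gs_start <= frame < gs_end and expected in scores:
        scores[expected] += beta
    return max(scores, key=scores.get)


gs = weighting_section(es_start=100, es_end=200, margin=5)
frame_scores = {"a": -4.0, "o": -3.5}   # unweighted log-likelihoods for one frame

inside = pick_phoneme(frame_scores, frame=150, expected="a", gs=gs, beta=1.0)
near_switch = pick_phoneme(frame_scores, frame=102, expected="a", gs=gs, beta=1.0)
```

Inside GS the weight tips the decision toward the expected phoneme, whereas in the prohibited section near the switching timing the decision rests on the acoustic likelihoods alone.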

[Other Embodiments]
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be implemented in various forms without departing from the gist of the present invention.

In the above embodiment, the information processing device 30 and the data storage server 50 are added to a system consisting of the voice input device (karaoke device) 10 and the music server 25, and the voice data SV is generated from the voice input while the voice input device 10 is executing the karaoke performance process and playing the target music. However, the voice data SV in the present invention is not limited to this.

That is, in the present invention, the voice input device 10 may generate the voice data SV using the after-recording function that is well known in karaoke devices and the like. A voice input device (karaoke device) with an after-recording function holds, as data on the dialogue to be uttered, dialogue telop data representing the characters that make up the dialogue (hereinafter referred to as dialogue constituent characters), which is similar to the lyrics telop data, and dialogue output data defining the timing at which the dialogue constituent characters are displayed on the display unit 13, which is similar to the lyrics output data. Accordingly, when the voice data SV is acquired using the after-recording function, the voice input device 10 may display the dialogue based on the dialogue telop data on the display unit 13, and may store in the music server 25, as the voice data SV, the voice waveform input via the voice input unit 14 while that dialogue is displayed on the display unit 13.

In this case, the information processing device 30 may treat the voice data SV generated using the after-recording function as the target of the voice parameter registration process.
In the above embodiment, a karaoke device is assumed as the voice input device 10. However, the voice input device 10 is not limited to a karaoke device and may be, for example, a known portable terminal (a mobile phone or a portable information terminal) or a known information processing apparatus (a so-called personal computer).

In the speech synthesis system of the above embodiment, the music server 25 and the data storage server 50 are provided. However, these need only function as auxiliary storage means. Their contents may instead be stored in the storage unit 17 of the voice input device 10, and the means of the information processing device 30 may also be incorporated into the voice input device 10, so that the voice input device (karaoke device) itself serves as the utterance specifying device used for creating speech synthesis data.
[Correspondence between the Embodiment and the Claims]
Finally, the relationship between the description of the above embodiment and the description of the claims will be explained.

S120 in the voice parameter registration process of the above embodiment corresponds to the utterance information acquisition means in the claims, and S130 corresponds to the voice data acquisition means in the claims. Further, S240 in the utterance section estimation process corresponds to the exemplary section specifying means in the claims, S250 and S260 correspond to the weight setting means in the claims, and S210 corresponds to the section setting means in the claims.

S310 to S360 in the phoneme specifying process of the above embodiment correspond to the phoneme specifying means in the claims, and S380 corresponds to the timing specifying means in the claims. S390 in the phoneme specifying process of the above embodiment, that is, the weighting section change process, corresponds to the weight changing means in the claims.

DESCRIPTION OF SYMBOLS
1: speech synthesis system
10: voice input device; 11: communication unit; 12: input reception unit; 13: display unit; 14: voice input unit; 15: voice output unit; 16: sound source module; 17: storage unit; 20: control unit; 21: ROM; 22: RAM; 23: CPU
25: music server
30: information processing device; 31: communication unit; 32: input reception unit; 33: display unit; 34: storage unit; 40: control unit; 41: ROM; 42: RAM; 43: CPU
50: data storage server
60: audio output terminal

Claims

An utterance specifying device comprising:
utterance information acquisition means for acquiring utterance content information representing a character string of content to be uttered, together with a reference utterance start timing and a reference utterance end timing for each of the characters constituting the character string;
exemplary section specifying means for specifying, based on the utterance content information acquired by the utterance information acquisition means, an exemplary utterance section that is the section in which each syllable constituting the character string should be uttered;
weight setting means for setting a weight, for each of the syllables and for each weighting section that is a section corresponding to that syllable, such that the weight takes a larger value for the syllable corresponding to each of the exemplary utterance sections specified by the exemplary section specifying means;
voice data acquisition means for acquiring voice data representing a voice waveform uttered for the character string represented by the utterance content information acquired by the utterance information acquisition means;
section setting means for setting, in the voice data acquired by the voice data acquisition means, a plurality of unit sections that are continuous along the time axis of the voice data;
phoneme specifying means for sequentially executing, for each unit section along the time axis of the voice data, a content specifying process that collates each of the acoustic models, prepared in advance as models representing a feature value for each phoneme, with the feature value in one of the unit sections set by the section setting means, derives for each acoustic model an utterance likelihood to which the weight set by the weight setting means for the weighting section corresponding to that unit section has been applied, and specifies the phoneme corresponding to the acoustic model with the largest utterance likelihood as the utterance phoneme, that is, the phoneme uttered in that unit section;
timing specifying means for specifying, each time an utterance syllable, which is a syllable uttered in the voice data, is specified based on the utterance phonemes sequentially specified by the phoneme specifying means, an utterance switching timing that is the timing at which the utterance syllable switches along the time axis; and
weight changing means for, where the utterance syllable immediately before the utterance switching timing specified by the timing specifying means is taken as an estimated completion syllable and the syllable formed by the utterance phoneme immediately after the utterance switching timing is taken as an estimated middle syllable, and when the deviation between a model switching timing, which is the timing at which the exemplary utterance section switches along the time axis, and the utterance switching timing is equal to or greater than a predetermined tolerance, changing the weighting sections that are defined by the weight setting means and that come at or after the weighting section corresponding to the estimated middle syllable, so that, in the unit sections corresponding to the syllables from the estimated middle syllable onward along the time axis, the phonemes constituting the syllables that are represented by the character string and correspond to those unit sections are specified as the utterance phonemes by the phoneme specifying means.

The utterance specifying device according to claim 1, wherein the weight changing means advances the weighting section along the time axis if the utterance switching timing is earlier than the model switching timing.

The utterance specifying device according to claim 2, wherein, if the utterance switching timing is earlier than the model switching timing, the weight changing means advances, along the time axis, the start timing of the weighting section corresponding to the syllable that follows the estimated middle syllable along the time axis.

The utterance specifying device according to claim 2, wherein, if the time length of the utterance section from the utterance start timing to the utterance end timing of the estimated completion syllable is shorter than the exemplary utterance section for the syllable corresponding to the estimated completion syllable, the weight changing means advances, along the time axis, the end timing of the weighting section corresponding to the estimated middle syllable.

The utterance specifying device according to any one of claims 1 to 4, wherein the weight changing means delays the weighting section along the time axis if the utterance switching timing is later than the model switching timing.

The utterance specifying device according to claim 5, wherein, if the utterance switching timing is later than the model switching timing, the weight changing means delays, along the time axis, the start timing of the weighting section corresponding to the syllable that follows the estimated middle syllable along the time axis.

The utterance specifying device according to claim 5, wherein, if the time length of the utterance section from the utterance start timing to the utterance end timing of the estimated completion syllable is longer than the exemplary utterance section for the syllable corresponding to the estimated completion syllable, the weight changing means delays, along the time axis, the end timing of the weighting section corresponding to the estimated middle syllable.

The utterance specifying device according to any one of claims 1 to 7, wherein the weighting section has a shorter time length than the exemplary utterance section and is contained within the exemplary utterance section.

A program for causing a computer to execute:
an utterance information acquisition procedure for acquiring utterance content information representing a character string of content to be uttered, together with a reference utterance start timing and a reference utterance end timing for each of the characters constituting the character string;
an exemplary section specifying procedure for specifying, based on the utterance content information acquired in the utterance information acquisition procedure, an exemplary utterance section that is the section in which each syllable constituting the character string should be uttered;
a weight setting procedure for setting a weight, for each of the syllables and for each weighting section that is a section corresponding to that syllable, such that the weight takes a larger value for the syllable corresponding to each of the exemplary utterance sections specified in the exemplary section specifying procedure;
a voice data acquisition procedure for acquiring voice data representing a voice waveform uttered for the character string represented by the utterance content information acquired in the utterance information acquisition procedure;
a section setting procedure for setting, in the voice data acquired in the voice data acquisition procedure, a plurality of unit sections that are continuous along the time axis of the voice data;
a phoneme specifying procedure for sequentially executing, for each unit section along the time axis of the voice data, a content specifying process that collates each of the acoustic models, prepared in advance as models representing a feature value for each phoneme, with the feature value in one of the unit sections set in the section setting procedure, derives for each acoustic model an utterance likelihood to which the weight set in the weight setting procedure for the weighting section corresponding to that unit section has been applied, and specifies the phoneme corresponding to the acoustic model with the largest utterance likelihood as the utterance phoneme, that is, the phoneme uttered in that unit section;
a timing specifying procedure for specifying, each time an utterance syllable, which is a syllable uttered in the voice data, is specified based on the utterance phonemes sequentially specified in the phoneme specifying procedure, an utterance switching timing that is the timing at which the utterance syllable switches along the time axis; and
a weight changing procedure for, where the utterance syllable immediately before the utterance switching timing specified in the timing specifying procedure is taken as an estimated completion syllable and the syllable formed by the utterance phoneme immediately after the utterance switching timing is taken as an estimated middle syllable, and when the deviation between a model switching timing, which is the timing at which the exemplary utterance section switches along the time axis, and the utterance switching timing is equal to or greater than a predetermined tolerance, changing the weighting sections that are defined in the weight setting procedure and that come at or after the weighting section corresponding to the estimated middle syllable, so that, in the unit sections corresponding to the syllables from the estimated middle syllable onward along the time axis, the phonemes constituting the syllables that are represented by the character string and correspond to those unit sections are specified as the utterance phonemes in the phoneme specifying procedure.