JP6281447B2

JP6281447B2 - Speech synthesis apparatus and speech synthesis system

Info

Publication number: JP6281447B2
Application number: JP2014175830A
Authority: JP
Inventors: 典昭阿瀬見
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2014-08-29
Filing date: 2014-08-29
Publication date: 2018-02-21
Anticipated expiration: 2034-08-29
Also published as: JP2016051035A

Description

The present invention relates to a technique for notifying a user of whether a singing technique has been successfully performed.

Conventionally, a karaoke apparatus is known that plays a song according to song data representing the pitch and duration of each constituent sound of the song (see Patent Document 1).
The karaoke apparatus described in Patent Document 1 identifies a singing pitch from the results of sequential frequency analysis of the voice input while the song is played, and derives a specific pitch difference by subtracting a reference pitch, which is the pitch of the constituent sound, from the singing pitch. The apparatus also outputs an auxiliary sound whose pitch is obtained by shifting the singing pitch by twice the specific pitch difference, so that the auxiliary sound is symmetrical to the singing pitch about the reference pitch. The volume of the auxiliary sound is controlled so that it increases as the specific pitch difference increases and becomes "0" (that is, muted) when the specific pitch difference is "0".
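The auxiliary-sound rule of Patent Document 1 can be illustrated with a minimal Python sketch. It assumes that pitches are expressed in semitones (for example, MIDI note numbers) and that the volume grows linearly with the pitch difference up to a cap. The function name `auxiliary_sound` and the parameters `gain` and `max_volume` are hypothetical and do not come from either patent document.

```python
from typing import Tuple


def auxiliary_sound(singing_pitch: float,
                    reference_pitch: float,
                    gain: float = 0.1,
                    max_volume: float = 1.0) -> Tuple[float, float]:
    """Return (pitch, volume) of the auxiliary sound.

    Assumption: pitches are in semitones and the volume is a capped
    linear function of the pitch difference.
    """
    # Specific pitch difference: singing pitch minus reference pitch.
    diff = singing_pitch - reference_pitch
    # Shift the singing pitch by twice the difference, which mirrors it
    # about the reference pitch.
    aux_pitch = singing_pitch - 2.0 * diff
    # Larger difference -> louder auxiliary sound; zero difference -> muted.
    volume = min(max_volume, gain * abs(diff))
    return aux_pitch, volume
```

With a reference pitch of 60 and a singing pitch of 62, the sketch places the auxiliary sound at 58, two semitones on the opposite side of the reference.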

Patent Document 1: JP 2010-231070 A

When the auxiliary sound is output, a user of the karaoke apparatus described in Patent Document 1 can recognize that the singing pitch deviates from the reference pitch. Because the singing voice is drawn toward the auxiliary sound, the pitch of the singing voice can be brought closer to the reference pitch.

However, the karaoke apparatus described in Patent Document 1 cannot make the user aware of whether the user is performing the singing technique that should be used in the song, and therefore cannot help the user perform that singing technique.

In other words, the conventional technique can neither make the user aware of whether a singing technique that should be used in a song has been successfully performed nor enable the user to perform that singing technique.

Accordingly, an object of the present invention is to provide a technique that makes the user aware of whether a singing technique has been successfully performed and enables the user to perform that singing technique.

The present invention, made to achieve the above object, relates to a speech synthesis apparatus that generates, by speech synthesis, and outputs a model vocal, which is a voice singing a song.
In the present invention, among songs that each have a plurality of notes, each note being a combination of a pitch and a note value, with lyrics assigned to at least some of the notes, one designated song is referred to as a specific song, and a note of the specific song to which lyrics are assigned is referred to as a constituent note.

The speech synthesis apparatus of the present invention includes feature data acquisition means, feature amount calculation means, comparison means, and speech synthesis means.
Of these, the feature data acquisition means acquires singing feature data in which a technique feature amount is associated with each note property. The technique feature amount represents a feature amount of a specific technique, which is a singing technique that a designated singer uses on each constituent note. A note property is a combination of the pitch and note value of a constituent note and the lyrics assigned to that constituent note. The feature amount calculation means analyzes voice data input while the specific song is played, and calculates a singing feature amount representing a feature amount of the singing technique on each constituent note as expressed in the voice data.

Further, the comparison means compares the technique feature amount included in the singing feature data acquired by the feature data acquisition means with the singing feature amount calculated by the feature amount calculation means, for each pair of corresponding constituent notes. If the comparison shows that a technique difference, which is the difference between the technique feature amount and the singing feature amount, does not satisfy a predefined prescribed condition, the speech synthesis means synthesizes and outputs a model vocal for a target note so that the technique difference at the target note satisfies the prescribed condition. The target note is a constituent note that has the same note property as the constituent note in question but is a different constituent note.
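The comparison and target-note selection described above can be sketched in Python as follows. This is only an illustration under stated assumptions: each feature amount is reduced to a single number, the prescribed condition is modeled as "absolute difference does not exceed a threshold", and target notes are taken to be the later notes in the song that share the same note property. The names `NoteProperty` and `select_target_notes` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class NoteProperty:
    """Combination of pitch, note value, and assigned lyric (hashable)."""
    pitch: int
    value: float
    lyric: str


def select_target_notes(notes: List[NoteProperty],
                        technique: Dict[NoteProperty, float],
                        singing: List[float],
                        threshold: float) -> List[int]:
    """Return indices of target notes that need a model vocal.

    notes     : note properties of the constituent notes, in performance order
    technique : technique feature amount per note property (designated singer)
    singing   : singing feature amounts of the notes the user has sung so far
    threshold : assumed form of the prescribed condition (|difference| <= threshold)
    """
    targets = set()
    for i, sung in enumerate(singing):
        prop = notes[i]
        # Technique difference for this constituent note.
        if abs(technique[prop] - sung) > threshold:
            # Assumption: only later notes with the same note property
            # can still be helped by a model vocal.
            for j in range(i + 1, len(notes)):
                if notes[j] == prop:
                    targets.add(j)
    return sorted(targets)
```

Keying the technique feature amounts by note property, rather than by note position, mirrors the way the singing feature data associates one technique feature amount with each note property.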

Such a speech synthesis apparatus can synthesize and output a model vocal for a constituent note (that is, a target note) that has the same note property as a constituent note whose technique difference does not satisfy the prescribed condition and that is different from that constituent note.

As a result, the user of the speech synthesis apparatus can recognize whether the user is performing the singing technique that the designated singer would use when singing the specific song.
Moreover, the speech synthesis apparatus of the present invention synthesizes the model vocal for the target note so that the technique difference satisfies the prescribed condition. Because the voice of a user who sings while listening to this model vocal is drawn toward the model vocal, the user's singing technique can be brought closer to the singing technique that the designated singer would use.

Such a speech synthesis apparatus can bring the user's own singing closer to the singing style of the designated singer.
For these reasons, the speech synthesis apparatus of the present invention makes the user aware of whether a singing technique has been successfully performed and enables the user to perform that singing technique.

The singing technique of the present invention may include "vibrato". In this case, the speech synthesis means of the present invention may determine that the technique difference does not satisfy the prescribed condition if the comparison by the comparison means shows that the difference between the vibrato feature amount in the singing feature amount and the vibrato feature amount in the technique feature amount exceeds a predefined first prescribed threshold.

Such a speech synthesis apparatus can make the user aware of whether "vibrato" has been successfully performed as a singing technique.
Further, if the technique difference exceeds the first prescribed threshold, the speech synthesis means of the present invention may synthesize and output the model vocal for the target note so that the difference between its highest and lowest frequencies becomes larger.
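A minimal Python sketch of the vibrato rule is given below. It assumes that the vibrato feature amount is the vibrato depth, measured as the difference between the highest and lowest fundamental frequency within a note, and that the model vocal is deepened by a fixed factor. The patent text here does not fix either choice, so `vibrato_depth`, `model_vibrato_depth`, and `boost` are hypothetical.

```python
from typing import Sequence


def vibrato_depth(f0_track: Sequence[float]) -> float:
    """Depth in Hz: highest minus lowest frequency within one note."""
    return max(f0_track) - min(f0_track)


def model_vibrato_depth(technique_depth: float,
                        sung_depth: float,
                        first_threshold: float,
                        boost: float = 1.5) -> float:
    """Return the vibrato depth to use for the model vocal at the target note.

    If the technique difference exceeds the first prescribed threshold, the
    depth is enlarged (assumed: multiplied by `boost`); otherwise the
    designated singer's depth is used unchanged.
    """
    if abs(technique_depth - sung_depth) > first_threshold:
        return technique_depth * boost
    return technique_depth
```

Exaggerating the depth only when the threshold is exceeded keeps the model vocal close to the designated singer's style for notes the user already sings well.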

In the speech synthesis apparatus according to claim 2, the technique difference fails to satisfy the prescribed condition when the user could not sing a constituent note with "vibrato" even though the designated singer sang that constituent note with "vibrato". In this case, the speech synthesis apparatus according to claim 3 can deepen the "vibrato" in the model vocal at the target note.

Outputting such a model vocal makes it easier for the user to perform "vibrato" when singing the target note.
The singing technique of the present invention may also include "shakuri" (a pitch scoop). "Shakuri" is a singing technique in which a note group containing two notes that are consecutive along the time axis and have different pitches is sung continuously while the vocal pitch is changed.

In this case, the speech synthesis means may determine that the technique difference does not satisfy the prescribed condition if the comparison by the comparison means shows that the difference between the shakuri feature amount in the singing feature amount and the shakuri feature amount in the technique feature amount exceeds a predefined second prescribed threshold.

Such a speech synthesis apparatus can make the user aware of whether "shakuri" has been successfully performed as a singing technique.
Further, if the technique difference exceeds the second prescribed threshold, the speech synthesis means of the present invention may take the later note along the time axis in the note group as the target note, and synthesize and output the model vocal so that the pitch transition speed, at which the pitch of the model vocal transitions to the pitch of the target note, becomes faster.

In the speech synthesis apparatus according to claim 4, the technique difference fails to satisfy the prescribed condition when the user could not perform "shakuri" even though the designated singer sang the constituent note with "shakuri". In this case, the speech synthesis apparatus can increase the pitch transition speed at which the pitch of the model vocal transitions to the pitch of the target note.

Synthesizing and outputting such a model vocal makes it easier for the user to perform "shakuri" when singing the target note.
Further, the feature data acquisition means of the present invention may acquire singing feature data generated by executing the following steps: a step of extracting vocal data representing a vocal sound from song data containing the vocal sound sung by the designated singer; a step of acquiring score data composed of a plurality of notes constituting the song; a step of identifying, based on the notes of the acquired score data and on the vocal data, note vocal data, which is the section of the vocal data corresponding to each note of the singing melody of the song, and determining the technique feature amount in each piece of note vocal data; and a step of acquiring target score data representing the score of the specific song and generating the singing feature data by associating each constituent note in the acquired target score data with the technique feature amount of a note whose note property matches.

Executing each of the steps described above allows the singing feature data to be generated reliably. The speech synthesis apparatus of the present invention can then acquire the singing feature data.
The present invention may also be embodied as a speech synthesis system that generates, by speech synthesis, and outputs a model vocal, which is a voice singing a song, the system including feature data acquisition means, feature amount calculation means, comparison means, and speech synthesis means.

Such a system can also provide the same effects as the invention according to claim 1.

A block diagram showing the schematic configuration of the karaoke system.
A flowchart showing the procedure of the technique feature generation process.
An explanatory diagram outlining the technique feature data.
A flowchart showing the procedure of the singing feature generation process.
An explanatory diagram outlining the singing feature data.
A flowchart showing the procedure of the karaoke performance process.
A diagram outlining the model vocal output in the karaoke performance process.
A diagram outlining the model vocal output in the karaoke performance process.

Embodiments of the present invention are described below with reference to the drawings.
<System configuration>
The karaoke system 1 shown in FIG. 1 plays a specific song, which is a song designated by the user. It also uses a model vocal generated by speech synthesis to notify the user of the deviation between the singing technique that a designated singer, who is a singer designated by the user, would use when singing the specific song and the singing technique that the user uses when singing the specific song.

The karaoke system 1 includes an information processing apparatus 2, an information processing server 10, and a karaoke apparatus 30.
The information processing apparatus 2 calculates technique feature data SF based on the song data WD and the MIDI song MD prepared for each song. The technique feature data SF is data representing the characteristics of how a singer sings. Singers here include both professional and amateur singers.

The information processing server 10 generates singing feature data MS, which represents the feature amounts of the singing techniques that the designated singer would use when singing the specific song, based on the technique feature data SF calculated by the information processing apparatus 2 and on the MIDI song MD.

The karaoke apparatus 30 plays the specific song according to the MIDI song MD stored in the information processing server 10, and generates and outputs a model vocal based on the singing feature data MS generated by the information processing server 10.
<Song data>
The song data WD is prepared in advance for each song and includes song management information, which describes information about the song, and master waveform data, which represents the performance sound of the song. The song management information includes song identification information that identifies the song (hereinafter, song ID) and singer identification information that identifies the singer who sang the song (hereinafter, singer ID).

The master waveform data of this embodiment is audio data containing the performance sounds of a plurality of musical instruments and the vocal sound of a singer singing the singing melody. This audio data may be an audio file in an uncompressed audio file format or an audio file in a compressed audio format.

Hereinafter, the audio waveform data representing the instrument performance sounds contained in the master waveform data is referred to as accompaniment data, and the audio waveform data representing the vocal sound contained in the master waveform data is referred to as vocal data.

The instrument performance sounds contained in the accompaniment data of this embodiment include those of percussion instruments (for example, drums, taiko, and cymbals), string instruments (for example, guitar and bass), struck string instruments (for example, piano), and wind instruments (for example, trumpet and clarinet).
<MIDI song>
The MIDI song MD is prepared in advance for each song and includes song management information, performance data, and lyrics data.

Of these, the song management information includes a song ID and a singer ID.
The performance data represents the score of one song according to the well-known MIDI (Musical Instrument Digital Interface) standard. The performance data has at least a song ID and a score track representing the score for each instrument used in the song.

For each performance sound output from the MIDI sound source, the score track specifies at least a pitch (the so-called note number) and the period during which the MIDI sound source outputs the performance sound (hereinafter, note length). The note length in the score track is specified by a performance start timing (the so-called note-on timing), which represents the time from the start of the song's performance until output of the performance sound begins, and a performance end timing (the so-called note-off timing), which represents the time from the start of the song's performance until output of the performance sound ends.

That is, in the score track, one note NO is defined by a note number and a note length expressed by the note-on timing and the note-off timing. The score track functions as one score by arranging the notes NO in performance order. A score track is prepared for each instrument, such as keyboard, string, percussion, and wind instruments. In this embodiment, a specific instrument (for example, the vibraphone) is designated as the instrument that carries the singing melody of the song.
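The note definition above maps naturally onto a small data structure. The following Python sketch assumes that the note-on and note-off timings are given in seconds from the start of the performance; the names `Note` and `make_track` are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Note:
    """One note NO: a note number plus note-on and note-off timings."""
    note_number: int   # pitch as a MIDI note number
    note_on: float     # time from performance start until the sound begins
    note_off: float    # time from performance start until the sound ends

    @property
    def length(self) -> float:
        """Note length: the period during which the sound is output."""
        return self.note_off - self.note_on


def make_track(notes: Iterable[Note]) -> List[Note]:
    """Arrange notes in performance order so the track works as one score."""
    return sorted(notes, key=lambda n: n.note_on)
```

For example, `make_track([Note(62, 0.5, 1.5), Note(60, 0.0, 0.5)])` yields the two notes ordered by their note-on timing, and the `length` property of the second note in the result is the difference between its note-off and note-on timings.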

The lyrics data is data about the lyrics of the song and includes lyrics telop data and lyrics output data. The lyrics telop data represents the characters that make up the lyrics of the song (hereinafter, lyric characters). The lyrics output data defines a timing correspondence that associates the lyrics output timing, which is the timing at which each lyric character is output, with the performance of the performance data.
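One way to picture the timing correspondence is to match each lyric output timing against the note whose sounding period contains it. The Python sketch below assumes that both the lyric timings and the note timings are elapsed times from the start of the performance, and that a lyric character belongs to the note that is sounding at its output timing. The function name `assign_lyrics` and the tuple layouts are hypothetical.

```python
from typing import Dict, List, Tuple


def assign_lyrics(notes: List[Tuple[float, float]],
                  lyric_events: List[Tuple[float, str]]) -> Dict[int, str]:
    """Map note index -> lyric characters assigned to that note.

    notes        : (note_on, note_off) pairs in performance order
    lyric_events : (lyrics output timing, lyric character) pairs
    """
    assigned: Dict[int, str] = {}
    for time, char in lyric_events:
        for i, (note_on, note_off) in enumerate(notes):
            # Assumption: a character belongs to the note sounding at its timing.
            if note_on <= time < note_off:
                assigned[i] = assigned.get(i, "") + char
                break
    return assigned
```

Notes that receive at least one lyric character in this mapping correspond to the notes to which lyrics are assigned; lyric timings that fall in a rest are simply left unassigned in this sketch.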

Specifically, in the timing correspondence of this embodiment, the timing at which output of the lyrics telop data starts is associated with the timing at which performance of the performance data starts. In addition, the lyrics output timing of each lyric character along the time axis of the song is specified as the elapsed time from the start of performance of the performance data. This associates the note NO of each performance sound specified in the score track with the individual lyric characters.
<Information processing apparatus>
The information processing apparatus 2 is a well-known information processing apparatus (for example, a personal computer) that includes an input receiving unit 3, an external output unit 4, a storage unit 5, and a control unit 6.

The input receiving unit 3 is an input device that receives information and commands from outside. Examples of the input device include keys and switches, a reading drive that reads data stored on a portable storage medium (for example, a CD, DVD, or flash memory), and a communication port that acquires information over a communication network. The external output unit 4 is an output device that outputs information to the outside. Examples of the output device include a writing drive that writes data to a portable storage medium and a communication port that outputs information to a communication network.

The storage unit 5 is a well-known storage device configured so that its contents can be read and written. The storage unit 5 stores at least one piece of song data WD and at least one MIDI song MD, associated with each other for each common song.

The control unit 6 is a well-known control device built around a well-known microcomputer that includes a ROM 7, a RAM 8, and a CPU 9. The ROM 7 stores processing programs and data whose contents must be retained even when the power is turned off. The RAM 8 temporarily stores processing programs and data. The CPU 9 executes each process according to the processing programs stored in the ROM 7 and the RAM 8.

The ROM 7 of this embodiment stores a processing program with which the control unit 6 executes the technique feature generation process. The technique feature generation process generates the technique feature data SF based on the song data WD and the MIDI song MD stored in the storage unit 5.
<Information processing server>
The information processing server 10 includes a communication unit 12, a storage unit 14, and a control unit 16.

Of these, the communication unit 12 allows the information processing server 10 to communicate with external devices over a communication network. That is, the information processing server 10 is connected to the karaoke apparatus 30 over the communication network. The communication network may be wired or wireless.

The storage unit 14 is a well-known storage device configured so that its contents can be read and written. The storage unit 14 stores at least one MIDI song MD. The symbol "n" shown in FIG. 1 is an identifier that identifies a MIDI song MD stored in the storage unit 14 of the information processing server 10, and is a natural number of 1 or more. The storage unit 14 also stores the technique feature data SF generated by the information processing apparatus 2 executing the data generation process. The symbol "m" shown in FIG. 1 is an identifier that identifies technique feature data SF stored in the storage unit 14 of the information processing server 10, and is a natural number of 1 or more.

The control unit 16 is a well-known control device built around a well-known microcomputer that includes a ROM 18, a RAM 20, and a CPU 22. The ROM 18 stores processing programs and data whose contents must be retained even when the power is turned off. The RAM 20 temporarily stores processing programs and data. The CPU 22 executes each process according to the processing programs stored in the ROM 18 and the RAM 20.

The ROM 18 of the control unit 16 stores a processing program with which the control unit 16 executes the singing feature generation process. The singing feature generation process generates singing feature data MS representing the characteristics of the singing techniques that would appear if the song designated by the user of the karaoke apparatus 30 (that is, the specific song) were sung by the designated singer, whom the user has designated as a model. The designated singer may be the original singer of the specific song or a singer other than the original singer.
<Karaoke apparatus>
The karaoke apparatus 30 includes a communication unit 32, an input receiving unit 34, a song playback unit 36, a storage unit 38, an audio control unit 40, a video control unit 46, and a control unit 50.

The communication unit 32 allows the karaoke apparatus 30 to communicate with external devices over a communication network. The input receiving unit 34 is an input device that receives information and commands according to external operations. Examples of the input device include keys, switches, and a remote-control receiver. The song playback unit 36 plays a song based on the MIDI song MD downloaded from the information processing server 10.

The song playback unit 36 is, for example, a MIDI sound source. The audio control unit 40 is a device that controls audio input and output, and includes an output unit 42 and a microphone input unit 44. A microphone 62 is connected to the microphone input unit 44, which thereby acquires the voice input through the microphone 62. A speaker 60 is connected to the output unit 42. The output unit 42 outputs to the speaker 60 the sound source signal of the song played by the song playback unit 36 and the sound source signal of the singing voice from the microphone input unit 44. The speaker 60 converts the sound source signal output from the output unit 42 into sound and outputs it.

The video control unit 46 outputs video or images based on video data sent from the control unit 50. A display unit 64 that displays the video or images is connected to the video control unit 46.

The control unit 50 is built around a well-known computer having at least a ROM 52, a RAM 54, and a CPU 56. The ROM 52 stores processing programs and data whose contents must be retained even when the power is turned off. The RAM 54 temporarily stores processing programs and data. The CPU 56 executes each process according to the processing programs stored in the ROM 52 and the RAM 54.

The ROM 52 of the control unit 50 stores a processing program with which the control unit 50 executes the karaoke performance process. The karaoke performance process plays a specific song designated by the user and, based on the singing feature data MS, generates by speech synthesis and outputs a model vocal, that is, a singing voice that reproduces the characteristics of the designated singer's way of singing.
<Technique feature generation process>
Next, the technique feature generation process executed by the control unit 6 of the information processing apparatus 2 will be described.

The technique feature generation process is started when a start command for launching the processing program is input via the input receiving unit 3 of the information processing apparatus 2.
When the technique feature generation process shown in FIG. 2 is started, the control unit 6 first acquires a singer ID input via the input receiving unit (not shown) of the information processing apparatus 2 (S105). Next, from all the music data WD stored in the storage unit 5 of the information processing apparatus 2, the control unit 6 acquires one piece of music data WD that contains the singer ID acquired in S105 (S110).

Further, in the technique feature generation process, the control unit 6 acquires, from all the MIDI music MD stored in the storage unit 5 of the information processing apparatus 2, the one MIDI music MD associated with the same music ID as the music data WD acquired in S110 (S120). In other words, in S110 and S120 the control unit 6 acquires the music data WD and the MIDI music MD for the same song.

Next, the control unit 6 adjusts the MIDI music MD acquired in S120 (hereinafter, "acquired MIDI") so that the performance timing of each of its notes coincides with the playback time of the corresponding sound in the music data WD acquired in S110 (hereinafter, "acquired music data") (S130). A known technique (for example, the one described in Japanese Patent No. 5310677) may be used for this adjustment. Specifically, in the technique of Japanese Patent No. 5310677, the control unit 6 renders the acquired MIDI and converts both the rendering result and the master waveform data of the acquired music data into spectral data in units of a prescribed time. The performance start timing and performance end timing of each performance sound are then corrected so that the two sets of spectral data are synchronized in time. DP matching may be used for this time synchronization.
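The DP matching mentioned for S130 can be illustrated with a minimal dynamic time warping sketch. It is not the method of Japanese Patent No. 5310677; it assumes each time frame has already been reduced to a single number (a hypothetical simplification of the spectral data), and the function name `dp_match` is illustrative only.

```python
def dp_match(a, b):
    """Align two frame sequences by dynamic programming (DTW).

    a, b: lists of per-frame scalar features (assumed simplification;
    real spectral frames would need a vector distance).
    Returns (total_cost, path), where path is a list of (i, j) frame pairs.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    # Backtrack from the end to recover the alignment path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path
```

The returned path pairs frames of the rendered MIDI with frames of the master waveform; the note start and end timings could then be moved to the matched frame positions.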

In the technique feature generation process, the control unit 6 acquires the master waveform data contained in the acquired music data WD (S140). Next, the control unit 6 separates and extracts vocal data and accompaniment data from the master waveform data acquired in S140 (S150). One conceivable way to separate the accompaniment data and the vocal data in S150 is to use the pitch and harmonic components estimated with a known technique (for example, "PreFEst" described in JP 2008-134606 A). PreFEst is a technique that regards the most dominant voice waveform in the master waveform data as the vocal data and estimates the vocal pitch (that is, the fundamental frequency) and the magnitudes of the harmonic components.

Further, the control unit 6 identifies the note vocals Vo(a, i) based on the MIDI music MD whose timing was adjusted in S130 (hereinafter, "adjusted MIDI") and the vocal data extracted in S150 (S160). A note vocal Vo(a, i) is the section of the vocal data that corresponds to a note NO(a, i) of the singing melody. In S160, the control unit 6 identifies each note vocal Vo(a, i) by matching the performance start timing nnt(a, i) and the performance end timing nft(a, i) in the adjusted MIDI against the vocal data extracted in S150. Here, the index a identifies the song, and the index i identifies the note NO of the singing melody within the song.

Further, in the technique feature generation process, the control unit 6 determines a technique feature amount S(a, i) representing the feature amount of the singing technique in each note vocal Vo(a, i) (S170). The singing techniques referred to here include at least "vibrato" and "shakuri" (a pitch scoop). "Shakuri" is a technique of singing a note group, which contains two notes that are consecutive along the time axis and have different pitches, continuously while changing the vocal pitch.

For the technique feature amount for "vibrato" (hereinafter, "vibrato feature amount") vib, the control unit 6 first performs frequency analysis (DFT) on each note vocal Vo(a, i). The control unit 6 then calculates the vibrato feature amount vib according to equation (1) below.

vib(a, i) = vib_per(a, i) × vip_dep(a, i)   (1)
Here, vib_per(a, i) in equation (1) is an index of how prominently the spectral peak stands out in each note vocal Vo(a, i). It can be obtained by dividing the peak value of the frequency analysis result (that is, the amplitude spectrum) by the mean value of the frequency analysis result. vip_dep in equation (1) is the standard deviation of each note vocal Vo(a, i).
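Equation (1) can be sketched as follows. The description does not say which signal the DFT and the standard deviation are taken over; this sketch assumes a sampled pitch contour of the note vocal, removes its mean before the DFT, and skips the DC bin. These choices and all function names are assumptions for illustration.

```python
import cmath
import math
import statistics


def amplitude_spectrum(samples):
    """Naive DFT amplitude spectrum of a mean-removed signal (bins 1..n/2)."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [v - mean for v in samples]  # assumption: drop the DC offset
    return [
        abs(sum(c * cmath.exp(-2j * math.pi * k * t / n)
                for t, c in enumerate(centered)))
        for k in range(1, n // 2 + 1)
    ]


def vibrato_feature(pitch_contour):
    """vib = vib_per * vip_dep, following equation (1)."""
    spectrum = amplitude_spectrum(pitch_contour)
    mean_amp = sum(spectrum) / len(spectrum)
    if mean_amp == 0:
        return 0.0  # flat contour: no periodic fluctuation at all
    vib_per = max(spectrum) / mean_amp             # peak / mean of the spectrum
    vip_dep = statistics.pstdev(pitch_contour)     # depth of the fluctuation
    return vib_per * vip_dep
```

A contour with a strong periodic wobble gives both a sharp spectral peak and a large standard deviation, so the product grows; a flat contour gives zero.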

For the technique feature amount for "shakuri" (hereinafter, "shakuri feature amount") rise(a, i), the control unit 6 first calculates the differential change obtained by differentiating the change of the vocal data's pitch over time. Next, the control unit 6 identifies the timings, at or before the performance start timing nnt(a, i) of each note NO(a, i), at which the differential change became positive along the time axis. The control unit 6 then calculates, as the shakuri feature amount rise(a, i), the cross-correlation value between the pitch change of the vocal data over the section from each identified timing to the performance start timing nnt(a, i) and a predefined model curve.
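The shakuri calculation can be sketched under several assumptions: the pitch is a list of per-frame values, only the single rising run that ends at the note start is considered, the model curve is supplied by the caller, and the "cross-correlation value" is taken as a zero-lag normalized correlation. None of these details are fixed by the description, and the function names are hypothetical.

```python
import math


def rise_start(pitch, note_start):
    """Index where the rising run that leads into note_start begins, or None."""
    diff = [pitch[t + 1] - pitch[t] for t in range(len(pitch) - 1)]
    start = None
    for t in range(note_start - 1, -1, -1):
        if diff[t] > 0:
            start = t      # derivative still positive: extend the run backward
        else:
            break
    return start


def normalized_correlation(x, y):
    """Zero-lag normalized cross-correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0


def shakuri_feature(pitch, note_start, model_curve):
    """Correlate the pitch rise before note_start with a model curve."""
    start = rise_start(pitch, note_start)
    if start is None:
        return 0.0
    segment = pitch[start:note_start + 1]
    n = len(segment)
    # Resample the model curve to the segment length (nearest neighbour).
    resampled = [model_curve[round(k * (len(model_curve) - 1) / (n - 1))]
                 for k in range(n)]
    return normalized_correlation(segment, resampled)
```

A value near 1 means the singer's pitch approached the note along a shape close to the model curve.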

Further, in the technique feature generation process, the control unit 6 calculates the voice parameters P(a, i) of each note vocal Vo(a, i) (S180). The voice parameters P derived in S180 of this embodiment include at least the fundamental frequency (f0), the mel-frequency cepstrum (MFCC), the power, and the time difference of each. The methods for deriving the fundamental frequency, MFCC, and power are well known, so a detailed description is omitted here. The fundamental frequency, for example, may be derived using the autocorrelation of the note vocal Vo along the time axis, the autocorrelation of the frequency spectrum of the note vocal Vo, or the cepstrum method. The MFCC may be derived by applying a time analysis window to the note vocal Vo, performing frequency analysis (for example, FFT) for each window, taking the logarithm of the magnitude at each frequency, and then performing frequency analysis on that result. The power may be derived by applying a time analysis window to the note vocal Vo, squaring the amplitude, and integrating the result in the time direction.
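Two of the voice parameters can be sketched briefly: the fundamental frequency via time-domain autocorrelation and the power as a sum of squared amplitudes over one analysis frame. The search range `fmin`/`fmax`, the rectangular window, and the function names are assumptions, not values from the description.

```python
def estimate_f0(frame, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate f0 as the lag that maximizes the time-domain autocorrelation.

    fmin/fmax bound the search range (assumed values for a singing voice).
    Returns 0.0 when no positive autocorrelation peak is found.
    """
    lag_min = max(int(sample_rate / fmax), 1)
    lag_max = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[t] * frame[t + lag] for t in range(len(frame) - lag))
        if r > best_value:
            best_lag, best_value = lag, r
    return sample_rate / best_lag if best_lag else 0.0


def frame_power(frame):
    """Power of one analysis frame: squared amplitude summed over time."""
    return sum(v * v for v in frame)
```

Applying both functions to successive frames of a note vocal gives f0 and power tracks, from which the time differences mentioned above follow by subtracting neighbouring frames.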

Further, in the technique feature generation process, the control unit 6 identifies the note property p(a, i) of the note NO(a, i) corresponding to each note vocal Vo(a, i) (S190). Specifically, in S190 of this embodiment, the control unit 6 extracts from the acquired MIDI the information on each note NO(a, i) defined in that acquired MIDI and identifies it as the note property p(a, i).

The note property p(a, i) referred to here includes a specific note attribute, a previous note attribute, and a next note attribute.
The specific note attribute is information representing the attributes of the note NO(a, i). It includes the pitch (scale), note length, lyric syllable, and lyric vowel of the note NO(a, i). The previous note attribute is information representing the attributes of the note NO(a, i-1) immediately preceding the note NO(a, i) along the time axis (hereinafter, "previous note"). It includes the pitch (scale), note length, and lyric syllable of the previous note NO(a, i-1), and the length of time between the previous note NO(a, i-1) and the note NO(a, i) (that is, the silent period).

The next note attribute is information representing the attributes of the note NO(a, i+1) immediately following the specific note NO(a, i) along the time axis (hereinafter, "next note"). It includes the pitch (scale), note length, and lyric syllable of the next note, and the length of time between the note NO(a, i) and the next note NO(a, i+1) (that is, the silent period).

The note lengths and the inter-note time lengths in the note property p(a, i) may be quantized into predefined classes.
Next, in the technique feature generation process, the control unit 6 associates the note property p(a, i) of each note NO(a, i) identified in S190 with the technique feature amount S(a, i) and the voice parameters P(a, i) of the corresponding note NO(a, i) (S200).

Further, in the technique feature generation process, the control unit 6 determines whether steps S110 to S200 have been completed for all songs that satisfy a preset condition among the songs associated with the singer ID acquired in S105 (S210). The preset condition is that the song is associated with the singer ID acquired in S105 and that both the music data WD and the MIDI music MD are available for it.

If the determination in S210 finds that steps S110 to S200 have not been completed for all songs satisfying the preset condition (S210: NO), the control unit 6 returns the technique feature generation process to S110. The control unit 6 then acquires new music data WD from the music data WD associated with the singer ID designated in S105 (S110), acquires the MIDI music MD corresponding to that music data WD (S120), and executes steps S130 to S210.

If, on the other hand, the determination in S210 finds that steps S110 to S200 have been completed for all songs (S210: YES), the control unit 6 advances the technique feature generation process to S220.

In S220, the control unit 6 calculates representative values of the technique feature amount S and the voice parameters P for each group that shares the same associated note property p.
That is, in S220 of this embodiment, for each set of notes NO whose specific note attribute, previous note attribute, and next note attribute all match, the control unit 6 calculates the arithmetic mean of the technique feature amounts S and of the voice parameters P of those notes as the respective representative values. Thus, in S220, a representative value of the vibrato feature amount vib and a representative value of the shakuri feature amount rise are calculated for each note property p as the representative values of the technique feature amount S. Likewise, a representative value of the fundamental frequency, a representative value of the MFCC, and a representative value of the power are calculated for each note property p as the representative values of the voice parameters P.

The arithmetic mean calculated as the representative value in S220 is taken over all songs for which the technique feature amount S(a, i) and the voice parameters P(a, i) were calculated. The representative value calculated in S220 is not limited to the arithmetic mean and may instead be the median or the mode.
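The grouping in S220 can be sketched as a group-by over note properties followed by an arithmetic mean. The record layout (a hashable note property plus the two technique feature values) and the function name are assumptions chosen for brevity; voice parameters would be averaged the same way.

```python
from collections import defaultdict
from statistics import mean


def representative_values(records):
    """Average vib and rise over all notes that share a note property.

    records: iterable of (note_property, vib, rise) tuples collected from
    every processed song; note_property must be hashable (assumed layout).
    Returns {note_property: (mean_vib, mean_rise)}.
    """
    groups = defaultdict(list)
    for note_property, vib, rise in records:
        groups[note_property].append((vib, rise))
    return {
        prop: (mean(v for v, _ in values), mean(r for _, r in values))
        for prop, values in groups.items()
    }
```

Swapping `mean` for `statistics.median` or `statistics.mode` gives the alternative representative values that the text allows.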

Next, the control unit 6 generates technique feature data SF by associating the representative values of the technique feature amount S and of the voice parameters P calculated in S220 with the corresponding note property p and the singer ID, and stores the data in the storage unit 5 (S230).

The control unit 6 then ends the technique feature generation process.
In short, the technique feature generation process generates, for each singer, technique feature data SF representing the characteristics of that singer's way of singing. As shown in FIG. 3, the technique feature data SF generated by this process associates the singer's singer ID, the representative values of the technique feature amount S and of the voice parameters P for a common note property p, and that common note property p.

The technique feature data SF generated when the control unit 6 of the information processing apparatus 2 executes the technique feature generation process may be stored in the storage unit 14 of the information processing server 10 by way of a portable storage medium. If the information processing apparatus 2 and the information processing server 10 are connected via a communication network, the technique feature data SF stored in the storage unit 5 of the information processing apparatus 2 may instead be transferred over the communication network and stored in the storage unit 14 of the information processing server 10.
<Singing feature generation process>
Next, the singing feature generation process executed by the control unit 16 of the information processing server 10 will be described.

When the singing feature generation process shown in FIG. 4 is started, the control unit 16 acquires one MIDI music MD from all the MIDI music MD stored in the storage unit 14 (S310). The MIDI music MD acquired in S310 may be the one corresponding to the song designated in S510 of the karaoke performance process (described in detail later), or the one corresponding to a song designated via an input device (not shown) connected to the information processing server 10.

Next, the control unit 16 analyzes the MIDI music MD acquired in S310 and identifies the note property p(b, i) of each note NO(b, i) of the singing melody in that MIDI music MD (S320). Here, the index b identifies the song corresponding to the MIDI music MD acquired in S310.

Then, in the singing feature generation process, the control unit 16 acquires one set of technique feature data SF from all the technique feature data SF stored in the storage unit 14 (S330). The technique feature data SF acquired in S330 may be the data corresponding to the singer designated in S520 of the karaoke performance process (described in detail later), that is, the designated singer, or the data corresponding to a singer designated via an input device (not shown) connected to the information processing server 10.

Next, the control unit 16 generates singing feature data MS (S340). The singing feature data MS is generated in S340 by assigning, to each note NO(b, i) identified by the analysis in S320, those representative values of the technique feature amount S and of the voice parameters P contained in the technique feature data SF acquired in S330 that satisfy a specific condition. The specific condition is that the representative values are associated with a note property p that matches the note property p(b, i) of the note NO(b, i).

There may be cases where no technique feature amount S and voice parameters P satisfy the specific condition. In that case, in S340, the control unit 16 may assign to the note NO(b, i) the representative values of the technique feature amount S and the voice parameters P associated with a note property p close to the note property p(b, i) of that note. A close note property p is, for example, one whose pitch differs by one tone or whose note length differs by one class. Alternatively, the control unit 16 may assign to the note NO(b, i) the average, along the two axes of time and frequency, of the representative values of the technique feature amount S and the voice parameters P corresponding to the two neighboring notes NO(b, i-1) and NO(b, i+1).
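The assignment in S340, including the fallback to a close note property, can be sketched as a dictionary lookup. The sketch reduces a note property to a hypothetical `(pitch, length_class)` pair of integers and treats "one tone" as one integer step; the full property in the description also carries lyric and neighbouring-note attributes.

```python
def assign_features(note_property, feature_table):
    """Look up representative values for a note, falling back to neighbours.

    note_property: (pitch, length_class) integers (simplified, assumed form).
    feature_table: {note_property: representative_values} from the SF data.
    Returns the matched values, or None when nothing close exists.
    """
    if note_property in feature_table:
        return feature_table[note_property]      # exact match
    pitch, length_class = note_property
    neighbours = (
        (pitch - 1, length_class),   # pitch one step lower
        (pitch + 1, length_class),   # pitch one step higher
        (pitch, length_class - 1),   # one length class shorter
        (pitch, length_class + 1),   # one length class longer
    )
    for candidate in neighbours:
        if candidate in feature_table:
            return feature_table[candidate]
    return None
```

The order in which neighbours are tried is arbitrary here; a real implementation would need a rule for choosing among several close properties.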

Further, in S340, the control unit 16 attaches to the generated singing feature data MS the music ID contained in the MIDI music MD acquired in S310 and the singer ID contained in the technique feature data SF acquired in S330 (that is, the singer ID of the designated singer), and stores the result in the storage unit 14.

The singing feature generation process then ends.
In short, as shown in FIG. 5, the singing feature generation process assigns to each note NO(b, i) of the singing melody in the MIDI music MD the representative values of the technique feature amount S and of the voice parameters P associated with a note property p that matches the note property p of that note NO(b, i). The control unit 16 of the information processing server 10 thereby generates the singing feature data MS.
<Karaoke performance process>
Next, the karaoke performance process executed by the control unit 50 of the karaoke apparatus 30 will be described.

The karaoke performance process is started when a command to launch the processing program for executing it is input.
When the karaoke performance process shown in FIG. 6 is started, the control unit 50 first acquires, from the storage unit 14 of the information processing server 10, the MIDI music MD corresponding to the song designated via the input receiving unit 34 (that is, the specific song) (S510). Next, from all the singing feature data MS stored in the storage unit 14 of the information processing server 10, the control unit 50 acquires the singing feature data MS representing the singing technique that the designated singer, designated via the input receiving unit 34, would use when singing the specific song (S520).

Further, in the karaoke performance process, the control unit 50 identifies technique notes NO based on the MIDI music MD acquired in S510 and the singing feature data MS acquired in S520 (S530). A technique note NO is a note of the specific song on which a singing technique is used when the designated singer sings the specific song. Specifically, in this embodiment, each note NO(c, i) associated with a technique feature amount S at or above a predefined threshold may be identified as a technique note NO(c, i). That is, for a note NO(c, i) of the MIDI music MD, if the corresponding vibrato feature amount vib is at or above a first threshold, the note is identified as a technique note NO(c, i) on which the designated singer uses "vibrato". Likewise, for a note NO(c, i) of the singing feature data MS, if the corresponding shakuri feature amount rise is at or above a second threshold, the note is identified as a technique note NO(c, i) on which the designated singer uses "shakuri".
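The thresholding in S530 can be sketched as below. The per-note `(vib, rise)` layout, the threshold values passed in, and the function name are assumptions for illustration; the description only states that each feature is compared with its own threshold.

```python
def technique_notes(note_features, vib_threshold, rise_threshold):
    """Return {note_index: [technique names]} for notes that use a technique.

    note_features: list of (vib, rise) pairs, one per note of the melody
    (assumed layout of the singing feature data).
    """
    result = {}
    for index, (vib, rise) in enumerate(note_features):
        techniques = []
        if vib >= vib_threshold:      # first threshold: vibrato
            techniques.append("vibrato")
        if rise >= rise_threshold:    # second threshold: shakuri
            techniques.append("shakuri")
        if techniques:
            result[index] = techniques
    return result
```

Notes that fall below both thresholds are left out of the result, so only technique notes are carried into the later comparison steps.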

Here, the index c identifies the song corresponding to the MIDI music MD acquired in S510.
Further, in S530, the control unit 50 sets to an initial value the technique information representing the singing technique to be used by the model vocal on each technique note NO(c, i). The initial value is the technique feature amount S of each note NO given in the singing feature data MS.

Next, the control unit 50 starts playing the MIDI music MD acquired in S510 and displays various information, including the lyrics, on the display unit 64 (S540). Specifically, to play the MIDI music MD in S540, the control unit 50 sequentially outputs the MIDI music MD to the music playback unit 36 along the time axis. The music playback unit 36, having received the MIDI music MD, plays the song. The sound source signal of the song played by the music playback unit 36 is output to the speaker 60 via the output unit 42, and the speaker 60 converts the sound source signal into sound and emits it.

Also in S540, the control unit 50 outputs image signals representing the various information to the video control unit 46. The video control unit 46, having received the image signals, displays the various information on the display unit 64 in synchronization with the performance of the specific song by the music playback unit 36. In addition to the lyrics for each note NO of the specific song, the various information displayed on the display unit 64 includes success/failure information indicating whether the user of the karaoke apparatus 30 was able to reproduce the singing technique on each technique note NO(c, i).

Further, in the karaoke performance process, the control unit 50 generates the model vocal by speech synthesis (S550). The speech synthesis in S550 is realized by well-known formant synthesis. The model vocal in this embodiment is generated by synthesizing speech with the voice parameters P adjusted so that the voice quality and singing technique of the designated singer are reproduced.

The control unit 50 then outputs the model vocal generated by the speech synthesis in S550 to the output unit 42 (S560). The output unit 42 emits the model vocal from the speaker 60.

Next, in the karaoke performance process, the control unit 50 acquires the voice input via the microphone 62 and the microphone input unit 44 as singing voice data (S570). The control unit 50 then stores the singing voice data acquired in S570 in the storage unit 38 (S580).

Next, in the karaoke performance process, the control unit 50 extracts note singing data Vos(c, i) from the singing voice data along the time axis of the song, based on the singing voice data stored in the storage unit 38 (S590). The note singing data Vos is the singing waveform in which the current note NO(c, i) was sung. It may be identified with the same technique as S160 of the technique feature generation process, reading "vocal data" as "singing voice data".

Next, in the karaoke performance process, the control unit 50 calculates a singing feature amount SS(c, i) representing the feature amount of the singing technique in each note singing data Vos(c, i) (S600). The singing feature amount SS(c, i) includes a singing-voice vibrato feature amount Vvib(c, i) and a singing-voice shakuri feature amount Vrise(c, i).

Of these, the singing-voice vibrato feature amount Vvib(c, i) is the singing technique amount for "vibrato" in the note singing data Vos(c, i), and the singing-voice shakuri feature amount Vrise(c, i) is the singing technique amount for "shakuri" in the note singing data Vos(c, i). They may be calculated with the same technique as S170 of the technique feature generation process, reading "vocal data" as "singing voice data" and "note vocal" as "note singing data".

Next, the control unit 50 compares the technique feature amount S(c, i) included in the singing feature data acquired in S520 with the singing feature amount SS(c, i) calculated in S600, for the current note NO(c, i) (S610). The control unit 50 then determines whether the technique difference, that is, the difference between the technique feature amount S(c, i) and the singing feature amount SS(c, i) found in S610, satisfies a predefined condition (S620). This condition indicates that the user has reproduced the singing technique used by the designated singer. One possible condition is that the technique difference lies within a predefined threshold range.

For example, for a technique note NO(c, i), if the absolute value of the vibrato feature amount vib(c, i) minus the sung vibrato feature amount Vvib(c, i) is less than a predefined first threshold, the technique difference may be judged to satisfy the condition, that is, the user has reproduced the vibrato used by the designated singer. Likewise, if the absolute value of the shakuri feature amount rise(c, i) minus the sung shakuri feature amount Vrise(c, i) is less than a predefined second threshold, the technique difference may be judged to satisfy the condition, that is, the user has reproduced the shakuri used by the designated singer.
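The threshold test in the preceding paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `TechniqueFeatures` container, the function name `meets_condition`, and the numeric thresholds are hypothetical, and the feature amounts are assumed to be plain floats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TechniqueFeatures:
    """Feature amounts for one note (hypothetical container)."""
    vibrato: float  # vib(c, i) for the singer, Vvib(c, i) for the user
    shakuri: float  # rise(c, i) for the singer, Vrise(c, i) for the user


def meets_condition(reference, sung, first_threshold, second_threshold):
    """Return, per technique, whether the technique (skill) difference is within tolerance.

    The condition is met when the absolute difference between the designated
    singer's feature amount and the user's feature amount is below the threshold.
    """
    return {
        "vibrato": abs(reference.vibrato - sung.vibrato) < first_threshold,
        "shakuri": abs(reference.shakuri - sung.shakuri) < second_threshold,
    }


# Illustrative values only: vibrato is close to the reference, shakuri is not.
reference = TechniqueFeatures(vibrato=0.8, shakuri=0.5)
sung = TechniqueFeatures(vibrato=0.7, shakuri=0.0)
result = meets_condition(reference, sung, first_threshold=0.2, second_threshold=0.2)
```

The result is kept per technique because the text judges vibrato and shakuri separately; how the two outcomes would be combined is not specified in the source.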

If the technique difference satisfies the condition (S620: YES), the control unit 50 moves the karaoke performance process to S630.
In S630, the control unit 50 displays success/failure information for the current note NO and sets the synthesis mode of the exemplary vocal at the target notes NO corresponding to the current note NO to the standard mode. A target note NO is a note NO that has the same note property p as the current note NO(c, i) and appears later than the current note NO(c, i) along the time axis of the specific song.
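As a rough illustration of how target notes could be selected, the sketch below filters a score for later notes that share the current note's property. The `Note` type, its field names, and `target_notes` are hypothetical; the note property is modeled as the tuple (pitch, note value, lyric), following the definition of a note property in the claims.

```python
from typing import List, NamedTuple, Tuple


class Note(NamedTuple):
    """One note of the score (hypothetical representation)."""
    index: int    # position along the time axis of the song
    pitch: int    # e.g. a MIDI note number (an assumption)
    value: float  # note value (length) in beats (an assumption)
    lyric: str    # lyric assigned to the note


def note_property(note: Note) -> Tuple[int, float, str]:
    """Note property p: the combination of pitch, note value, and lyric."""
    return (note.pitch, note.value, note.lyric)


def target_notes(score: List[Note], current: Note) -> List[Note]:
    """Notes with the same property as `current` that appear later in the song."""
    p = note_property(current)
    return [n for n in score if n.index > current.index and note_property(n) == p]


score = [
    Note(0, 60, 1.0, "la"),
    Note(1, 62, 0.5, "la"),
    Note(2, 60, 1.0, "la"),
    Note(3, 60, 1.0, "ti"),
    Note(4, 60, 1.0, "la"),
]
later_matches = [n.index for n in target_notes(score, score[0])]
```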

That is, in S630, if the designated singer uses vibrato or shakuri at the current note NO and the user has reproduced it, a success icon meaning that the technique was reproduced is displayed for the current note NO as the success/failure information. The success icon is, for example, a double circle. In the same case, setting the synthesis mode to the standard mode means keeping the singing technique of the exemplary vocal at its initial value for all target notes NO corresponding to the current note NO(c, i).

The control unit 50 then moves the karaoke performance process to S670, described in detail later.
If, on the other hand, the technique difference does not satisfy the condition (S620: NO), the control unit 50 moves the karaoke performance process to S640. In S640, the control unit 50 determines whether the deviation between the technique feature amount S(c, i) and the singing feature amount SS(c, i) represents singer emphasis or singer non-reproduction. Singer emphasis is the state in which the user overdoes a singing technique relative to the technique used by the designated singer. Singer non-reproduction is the state in which the user fails to reproduce the singing technique used by the designated singer.

In this embodiment, singer emphasis is determined when the current note NO is not a technique note and the singing feature amount SS(c, i) indicates that the user performed a singing technique. Specifically, if the result of subtracting the singing feature amount SS from the technique feature amount S is negative and its absolute value exceeds the predefined first threshold, the deviation is judged to be singer emphasis.

Conversely, in this embodiment, the deviation is judged not to be singer emphasis (that is, to be singer non-reproduction) when the current note NO is a technique note and the singing feature amount SS(c, i) does not indicate that the user performed a singing technique. Specifically, if the result of subtracting the singing feature amount SS from the technique feature amount S is positive and its absolute value exceeds the predefined second threshold, the deviation is judged to be singer non-reproduction.
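The S640 decision can be sketched as a sign-and-threshold check on S minus SS. The function name and the string labels below are hypothetical. The third outcome, `"within_tolerance"`, is an assumption added so that the function is total; the text itself treats S640 as a two-way branch that is only reached after the condition has already failed.

```python
def classify_deviation(s: float, ss: float,
                       first_threshold: float, second_threshold: float) -> str:
    """Classify the deviation between S (designated singer) and SS (user).

    Negative S - SS beyond the first threshold: the user overdid the technique.
    Positive S - SS beyond the second threshold: the user failed to reproduce it.
    """
    diff = s - ss
    if diff < 0 and abs(diff) > first_threshold:
        return "singer_emphasis"
    if diff > 0 and abs(diff) > second_threshold:
        return "singer_non_reproduction"
    return "within_tolerance"  # assumption: not covered by the S640 description


# Illustrative values only.
overdone = classify_deviation(s=0.0, ss=0.6, first_threshold=0.2, second_threshold=0.2)
missing = classify_deviation(s=0.7, ss=0.1, first_threshold=0.2, second_threshold=0.2)
```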

If the determination in S640 is singer emphasis (S640: YES), the control unit 50 displays success/failure information for the current note NO and sets the synthesis mode of the exemplary vocal at the target notes NO corresponding to the current note NO to the suppression mode (S650).

That is, in S650, if the designated singer uses no singing technique such as vibrato or shakuri at the current note NO but the user sings with vibrato or shakuri, a suppression icon meaning that the technique is unnecessary is displayed for the current note NO as the success/failure information. In addition, setting the synthesis mode to the suppression mode means keeping the singing technique of the exemplary vocal at its initial value for all target notes NO corresponding to the current note NO.

The control unit 50 then moves the karaoke performance process to S670.
If, on the other hand, the determination in S640 is not singer emphasis, that is, it is singer non-reproduction (S640: NO), the control unit 50 displays success/failure information for the current note NO and sets the technique information of the exemplary vocal at the target notes NO corresponding to the current note NO to the emphasis mode (S660).

That is, in S660, as shown in FIG. 7, if the designated singer uses a singing technique such as vibrato or shakuri at the current note NO but the user does not sing with it, a first emphasis icon meaning that the technique should be used is displayed for the current note NO. The first emphasis icon is, for example, an "×" mark.

Also in S660, setting the synthesis mode of the exemplary vocal to the emphasis mode means adjusting the voice parameter P so that the singing technique of the exemplary vocal is emphasized at all target notes NO corresponding to the current note NO.

Specifically, in S660, if the singing technique the designated singer uses at the current note NO is vibrato, the control unit 50 adjusts the voice parameter P so that, in the vibrato of the exemplary vocal at the target notes NO corresponding to the current note NO, the difference between the maximum and minimum frequency becomes larger, as shown in FIG. 8(A). If the technique is shakuri, the control unit 50 adjusts the voice parameter P so that, in the shakuri of the exemplary vocal at the target notes NO, the pitch transition speed at which the pitch of the exemplary vocal moves to the pitch of the target note NO becomes faster than its initial value, as shown in FIG. 8(B).
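One way to picture the emphasis adjustment is as a scaling of two synthesis parameters. The sketch below is only an assumption about how the voice parameter P might be represented: the `VoiceParams` fields, the units, and the multiplicative `gain` are all hypothetical and do not come from the source, which states only that the vibrato frequency range widens and the shakuri pitch transition becomes faster than its initial value.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceParams:
    """Hypothetical subset of the voice parameter P."""
    vibrato_depth_hz: float          # half of (maximum frequency - minimum frequency)
    transition_speed_st_per_s: float  # pitch transition speed toward the note pitch


def emphasize(params: VoiceParams, technique: str, gain: float = 1.5) -> VoiceParams:
    """Return parameters with the given technique emphasized.

    Vibrato: widen the gap between maximum and minimum frequency.
    Shakuri: make the pitch transition faster than its initial value.
    """
    if gain <= 1.0:
        raise ValueError("gain must be greater than 1 to emphasize")
    if technique == "vibrato":
        return replace(params, vibrato_depth_hz=params.vibrato_depth_hz * gain)
    if technique == "shakuri":
        return replace(
            params,
            transition_speed_st_per_s=params.transition_speed_st_per_s * gain,
        )
    raise ValueError(f"unknown technique: {technique}")


initial = VoiceParams(vibrato_depth_hz=4.0, transition_speed_st_per_s=12.0)
deeper_vibrato = emphasize(initial, "vibrato")
faster_shakuri = emphasize(initial, "shakuri")
```

Returning a new `VoiceParams` rather than mutating the initial one keeps the initial values available, which matches the text's references to maintaining or departing from the "initial value".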

The control unit 50 then moves the karaoke performance process to S670.
In S670, the control unit 50 determines whether the performance of the specific song has ended. If the performance has not ended (S670: NO), the control unit 50 returns the karaoke performance process to S540. In S540, the control unit 50 plays the continuation of the specific song and displays various information. In the following S550, if the current note NO(c, i) is a target note NO, the control unit 50 synthesizes the exemplary vocal according to the synthesis mode set in S630, S650, or S660. The steps from S570 to S670 are then repeated.

If, on the other hand, the performance of the specific song has ended (S670: YES), the control unit 50 ends the karaoke performance process.
In short, the karaoke performance process of this embodiment analyzes the voice data input during performance of the specific song and calculates the singing feature amount SS, which represents the feature amount of the singing technique at each note (constituent note) NO expressed in that voice data. The calculated singing feature amount SS is compared with the technique feature amount S included in the singing feature data MS, note by corresponding note NO. If the technique difference, that is, the difference between the technique feature amount S and the singing feature amount SS, does not satisfy the predefined condition, an exemplary vocal is generated by speech synthesis and output so that the technique difference satisfies the condition at the target notes NO, which are notes NO that have the same note property p as the current note NO and differ from the current note NO.
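The overall flow summarized above can be simulated offline with a short sketch. It is a simplification under several assumptions: all sung feature amounts are supplied up front (in the actual process they become available note by note during the performance), a single threshold stands in for the first and second thresholds, and the function name and mode labels are hypothetical.

```python
from typing import Dict, Hashable, Sequence


def plan_synthesis_modes(properties: Sequence[Hashable],
                         reference: Sequence[float],
                         sung: Sequence[float],
                         threshold: float) -> Dict[int, str]:
    """Map each later note index to the synthesis mode chosen for it.

    properties[i] is the note property p of note i, reference[i] is S and
    sung[i] is SS for that note. A later result for the same property
    overwrites an earlier one.
    """
    modes: Dict[int, str] = {}
    for i, prop in enumerate(properties):
        diff = reference[i] - sung[i]
        if abs(diff) < threshold:
            mode = "standard"      # condition satisfied
        elif diff < 0:
            mode = "suppression"   # user overdid the technique
        else:
            mode = "emphasis"      # user did not reproduce the technique
        # Apply the mode to later notes sharing the same note property.
        for j in range(i + 1, len(properties)):
            if properties[j] == prop:
                modes[j] = mode
    return modes


# Illustrative values only: notes 0 and 2 share property "a", notes 1 and 3 share "b".
modes = plan_synthesis_modes(
    properties=["a", "b", "a", "b"],
    reference=[1.0, 0.0, 1.0, 0.0],
    sung=[0.1, 0.8, 1.0, 0.0],
    threshold=0.3,
)
```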

In that speech synthesis, if the singing technique the designated singer uses at the current note NO is vibrato, the exemplary vocal is generated and output so that the difference between the maximum and minimum frequency becomes larger in the vibrato of the exemplary vocal at the target notes NO corresponding to the current note NO. If the technique is shakuri, the exemplary vocal is generated by speech synthesis and output so that, in its shakuri at the target notes NO, the pitch transition speed at which the pitch of the exemplary vocal moves to the pitch of the target note NO becomes faster than its initial value.

The karaoke system 1 of this embodiment is an example of the speech synthesis system recited in the claims, and the karaoke apparatus 30 is an example of the speech synthesis apparatus recited in the claims.
[Effects of the Embodiment]
As described above, in the karaoke performance process, if the technique difference at the current note NO(c, i) does not satisfy the condition, the exemplary vocal at the target notes NO corresponding to that current note NO(c, i) is synthesized so that the technique difference satisfies the condition.

As a result, the user of the speech synthesis apparatus can recognize whether he or she is achieving the singing technique that the designated singer would use when singing the specific song.
Moreover, if the user has not reproduced the singing technique used by the designated singer, the karaoke performance process generates the exemplary vocal by speech synthesis so that the technique difference at the target notes NO corresponding to the current note NO(c, i) satisfies the condition.

A user of the karaoke apparatus 30 who sings while listening to this exemplary vocal finds his or her singing voice drawn toward the exemplary vocal, and can therefore bring his or her own singing technique closer to the technique the designated singer would use.

The karaoke apparatus 30 can thus bring the user's own singing closer to the designated singer's singing style. Accordingly, the karaoke apparatus 30 lets the user recognize whether a singing technique was achieved and helps the user achieve that technique.

In particular, in the karaoke performance process, if the designated singer uses vibrato at the current note NO but the user does not sing with vibrato, the voice parameter P is adjusted so that the difference between the maximum and minimum frequency becomes larger in the vibrato of the exemplary vocal at the target notes NO corresponding to the current note NO. This deepens the vibrato of the exemplary vocal that the karaoke apparatus 30 outputs at the target notes.

Outputting such an exemplary vocal makes it easier for the user to achieve vibrato when singing the target notes.
Likewise, if the designated singer uses shakuri at the current note NO but the user does not sing with shakuri, the voice parameter P is adjusted so that, in the shakuri of the exemplary vocal at the target notes NO corresponding to the current note NO, the pitch transition speed at which the pitch of the exemplary vocal moves to the pitch of the target note NO becomes faster than its initial value. This speeds up the pitch transition of the shakuri in the exemplary vocal that the karaoke apparatus 30 outputs at the target notes.

Synthesizing and outputting such an exemplary vocal makes it easier for the user to achieve shakuri when singing the target notes NO.
[Other Embodiments]
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be implemented in various forms without departing from the gist of the invention.

In the above embodiment, the information processing server 10 executes the singing feature generation process, but the device that executes this process is not limited to the information processing server 10. It may instead be the information processing device 2 or the karaoke apparatus 30.

Forms in which part of the configuration of the above embodiment is omitted are also embodiments of the present invention, as are forms that appropriately combine the above embodiment with modifications. Any form conceivable without departing from the essence of the invention specified by the wording of the claims is likewise an embodiment of the present invention.

Besides the speech synthesis apparatus described above, the present invention can be realized in various forms, such as a program executed by a computer to perform speech synthesis or a speech synthesis method.
[Correspondence between the Embodiment and the Claims]
Finally, the relationship between the description of the above embodiment and the recitations of the claims is explained.

The function realized by executing S520 of the karaoke performance process is an example of the feature data acquisition means recited in the claims, and the function realized by executing S560 to S600 is an example of the feature amount calculation means recited in the claims. The function obtained by executing S610 of the karaoke performance process is an example of the comparison means recited in the claims, and the function obtained by executing S620 to S660, S540, and S550 is an example of the speech synthesis means.

Description of reference signs: 1: speech synthesis system; 2: information processing device; 3: input reception unit; 4: external output unit; 5, 14, 38: storage unit; 6, 16, 50: control unit; 7, 18, 52: ROM; 8, 20, 54: RAM; 9, 22, 56: CPU; 10: information processing server; 12, 32: communication unit; 30: karaoke apparatus; 34: input reception unit; 36: music playback unit; 40: audio control unit; 42: output unit; 44: microphone input unit; 46: video control unit; 60: speaker; 62: microphone; 64: display unit

Claims

A speech synthesis apparatus that generates, by speech synthesis, an exemplary vocal that is the voice of a song being sung, and outputs the exemplary vocal, the apparatus comprising,
where one designated song, among songs that have a plurality of notes each consisting of a combination of a pitch and a note value and that have lyrics assigned to at least some of the plurality of notes, is taken as a specific song, and the notes of the specific song to which lyrics are assigned are taken as constituent notes:
feature data acquisition means for acquiring singing feature data in which a technique feature amount, representing the feature amount of a specific technique that is a singing technique used at each constituent note by a designated singer, is associated with each note property, the note property being the combination of a pitch, a note value, and the lyrics assigned to the note;
feature amount calculation means for analyzing voice data input during performance of the specific song and calculating a singing feature amount representing the feature amount of the singing technique at each constituent note expressed in the voice data;
comparison means for comparing, for each mutually corresponding constituent note, the technique feature amount included in the singing feature data acquired by the feature data acquisition means with the singing feature amount calculated by the feature amount calculation means; and
speech synthesis means for, when the comparison by the comparison means shows that a technique difference, which is the difference between the technique feature amount and the singing feature amount, does not satisfy a predefined condition, synthesizing and outputting the exemplary vocal for a target note so that the technique difference at the target note satisfies the condition, the target note being a constituent note that has the same note property as the constituent note in question and that differs from that constituent note,
wherein the singing technique includes vibrato, and
the speech synthesis means deems the technique difference not to satisfy the condition when the comparison by the comparison means shows that the difference between the vibrato feature amount in the singing feature amount and the vibrato feature amount in the technique feature amount exceeds a predefined first threshold range.

The speech synthesis apparatus according to claim 1, wherein, when the technique difference exceeds the first threshold range, the speech synthesis means performs speech synthesis so that the difference between the maximum frequency and the minimum frequency becomes larger in the exemplary vocal for the target note.

The speech synthesis apparatus according to claim 1 or 2, wherein the singing technique includes shakuri, in which a note group including two notes with consecutive pitches along the time axis is sung continuously while the vocal pitch is changed, and
the speech synthesis means deems the technique difference not to satisfy the condition when the comparison by the comparison means shows that the difference between the shakuri feature amount in the singing feature amount and the shakuri feature amount in the technique feature amount exceeds a predefined second threshold range.

A speech synthesis apparatus that generates, by speech synthesis, an exemplary vocal that is the voice of a song being sung, and outputs the exemplary vocal, the apparatus comprising,
where one designated song, among songs that have a plurality of notes each consisting of a combination of a pitch and a note value and that have lyrics assigned to at least some of the plurality of notes, is taken as a specific song, and the notes of the specific song to which lyrics are assigned are taken as constituent notes:
feature data acquisition means for acquiring singing feature data in which a technique feature amount, representing the feature amount of a specific technique that is a singing technique used at each constituent note by a designated singer, is associated with each note property, the note property being the combination of a pitch, a note value, and the lyrics assigned to the note;
feature amount calculation means for analyzing voice data input during performance of the specific song and calculating a singing feature amount representing the feature amount of the singing technique at each constituent note expressed in the voice data;
comparison means for comparing, for each mutually corresponding constituent note, the technique feature amount included in the singing feature data acquired by the feature data acquisition means with the singing feature amount calculated by the feature amount calculation means; and
speech synthesis means for, when the comparison by the comparison means shows that a technique difference, which is the difference between the technique feature amount and the singing feature amount, does not satisfy a predefined condition, synthesizing and outputting the exemplary vocal for a target note so that the technique difference at the target note satisfies the condition, the target note being a constituent note that has the same note property as the constituent note in question and that differs from that constituent note,
wherein the singing technique includes shakuri, in which a note group including two notes with consecutive pitches along the time axis is sung continuously while the vocal pitch is changed, and
the speech synthesis means deems the technique difference not to satisfy the condition when the comparison by the comparison means shows that the difference between the shakuri feature amount in the singing feature amount and the shakuri feature amount in the technique feature amount exceeds a predefined second threshold range.

The speech synthesis apparatus according to claim 3 or 4, wherein, when the technique difference exceeds the second threshold range, the speech synthesis means takes the later note on the time axis in the note group as the target note and performs speech synthesis so that the pitch transition speed at which the pitch of the exemplary vocal moves to the pitch of the target note becomes faster.

The speech synthesis apparatus according to any one of claims 1 to 5, wherein the feature data acquisition means acquires singing feature data generated by executing:
a step of extracting vocal data representing a vocal sound from song data that includes the vocal sound sung by the designated singer;
a step of acquiring musical score data composed of a plurality of notes constituting the song;
a step of identifying, based on the notes constituting the acquired musical score data and on the vocal data, note vocal data that is the section of the vocal data corresponding to each note constituting the melody of the song, and determining the technique feature amount in each piece of note vocal data; and
a step of acquiring target musical score data representing the score of the specific song and generating the singing feature data by associating, with each constituent note included in the acquired target musical score data, the technique feature amount of a note having the same note property.

A voice synthesizer that generates and outputs an exemplary vocal that is the voice of singing a song by voice synthesis,
Among the music pieces having a plurality of notes composed of combinations of pitches and note values, and lyrics are assigned to at least some of the plurality of notes, one designated music piece is designated as a specific music piece, and the specific music piece A note to which the lyrics are assigned as a constituent note,
A skill feature amount representing a feature amount of a specific skill, which is a singing skill used by each designated note by a designated singer who is a designated singer, is assigned to a pitch, a note value, and the relevant note. Characteristic data acquisition means for acquiring singing characteristic data associated with each note property that is a combination of lyrics;
Analyzing voice data input during the performance of the specific music, a feature quantity calculation means for calculating a singing feature quantity representing a singing skill quantity in each of the constituent notes expressed in the voice data;
Comparison means for comparing the skill feature amount included in the singing feature data acquired by the feature data acquisition means and the singing feature amount calculated by the feature amount calculation means for each constituent note corresponding to each other;
As a result of the comparison by the comparison means, if the skill difference that is the difference between the skill feature quantity and the singing feature quantity does not satisfy the prescribed condition specified in advance, the note property that is the same as the note property of the constituent note Voice synthesis means for synthesizing and outputting an exemplary vocal for the target note such that the technical difference in the target note that is a constituent note different from the constituent note satisfies the specified condition. Prepared ,
the feature data acquisition means acquires the singing feature data generated by executing:
an extraction step of extracting, from music data including a vocal sound sung by the designated singer, vocal data representing the vocal sound;
an acquisition step of acquiring musical score data composed of a plurality of notes constituting the music piece;
a determination step of specifying, based on the vocal data and each note constituting the acquired musical score data, note vocal data, which is the section of the vocal data corresponding to each note constituting the melody of the music piece, and determining the skill feature amount in each piece of note vocal data; and
a generation step of acquiring target musical score data representing the score of the specific music piece and generating the singing feature data by associating, with each constituent note included in the acquired target musical score data, the skill feature amount of a note having the same note property as that constituent note.
A speech synthesizer characterized by the above.
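
For illustration only, the generation step recited above can be sketched in Python as a lookup keyed by note property. The names NoteProperty and build_singing_feature_data are hypothetical, and the sketch assumes that a skill feature amount is a single number and that features from several analyzed notes sharing one note property are averaged; the claim specifies neither.

```python
from collections import defaultdict
from typing import NamedTuple


class NoteProperty(NamedTuple):
    """Hypothetical note property: pitch, note value, and assigned lyric."""
    pitch: int         # assumed here to be a MIDI note number
    note_value: float  # assumed here to be a length in beats
    lyric: str


def build_singing_feature_data(analyzed_notes, target_score):
    """Associate skill feature amounts with constituent notes of the target score.

    analyzed_notes: iterable of (NoteProperty, skill_feature) pairs taken from
        the designated singer's note vocal data.
    target_score: list of NoteProperty, one per constituent note.
    Returns a dict mapping constituent-note index to a skill feature amount.
    """
    by_property = defaultdict(list)
    for prop, feature in analyzed_notes:
        by_property[prop].append(feature)

    singing_feature_data = {}
    for index, prop in enumerate(target_score):
        features = by_property.get(prop)
        if features:
            # Assumption: average the features of notes with the same property.
            singing_feature_data[index] = sum(features) / len(features)
    return singing_feature_data
```

Constituent notes whose note property never occurs in the analyzed material are simply left without an entry in this sketch.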

A speech synthesis system that generates, by speech synthesis, and outputs an exemplary vocal, which is a voice singing a music piece,
wherein, among music pieces each having a plurality of notes, each note being a combination of a pitch and a note value, with lyrics assigned to at least some of the plurality of notes, one designated music piece is defined as a specific music piece, and each note of the specific music piece to which lyrics are assigned is defined as a constituent note, the system comprising:
feature data acquisition means for acquiring singing feature data in which a skill feature amount, representing a feature amount of a specific skill that is a singing technique used for each note by a designated singer, is associated with each note property, a note property being a combination of a pitch, a note value, and the lyrics assigned to the note;
feature amount calculation means for analyzing voice data input during the performance of the specific music piece and calculating a singing feature amount representing a feature amount of the singing technique expressed in the voice data for each of the constituent notes;
comparison means for comparing, for each pair of corresponding constituent notes, the skill feature amount included in the singing feature data acquired by the feature data acquisition means with the singing feature amount calculated by the feature amount calculation means; and
speech synthesis means for, when the comparison by the comparison means shows that a skill difference, which is the difference between the skill feature amount and the singing feature amount, does not satisfy a prescribed condition specified in advance, synthesizing and outputting an exemplary vocal for a target note, the target note being a constituent note that differs from the compared constituent note but has the same note property, such that the skill difference at the target note satisfies the prescribed condition, wherein
the singing technique includes vibrato, and
the speech synthesis means determines that the skill difference does not satisfy the prescribed condition when, as a result of the comparison by the comparison means, the difference between the vibrato feature amount in the singing feature amount and the vibrato feature amount in the skill feature amount falls outside a predetermined first threshold range.
A speech synthesis system characterized by the above.
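
As a non-authoritative illustration of the vibrato determination and the choice of target notes recited above, the following Python sketch may help. The function names, the representation of the vibrato feature amount as a single number, and the representation of the first threshold range as a (low, high) pair are all assumptions made for this example; the claim does not fix them.

```python
def skill_difference_fails(singing_vibrato, skill_vibrato, first_threshold_range):
    """Return True when the vibrato difference falls outside the threshold range.

    first_threshold_range is assumed to be an inclusive (low, high) pair.
    """
    low, high = first_threshold_range
    difference = singing_vibrato - skill_vibrato
    return not (low <= difference <= high)


def find_target_notes(note_properties, failed_index):
    """List the other constituent notes sharing the failed note's property.

    note_properties: one hashable note property per constituent note,
        for example a (pitch, note_value, lyric) tuple.
    """
    failed_property = note_properties[failed_index]
    return [
        index
        for index, prop in enumerate(note_properties)
        if index != failed_index and prop == failed_property
    ]
```

Under these assumptions, a note sung with a vibrato feature of 2.0 against a skill feature of 5.0 and a range of (-1.0, 1.0) fails the condition, and every other constituent note with the same note property becomes a candidate for the synthesized exemplary vocal. Restricting the candidates to notes that come later in the music piece would be a further design choice not shown here.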