JP4983958B2 - Singing scoring device and singing scoring program

Info

Publication number: JP4983958B2
Application number: JP2010101757A
Authority: JP
Inventors: 克瀬戸口
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2010-04-27
Filing date: 2010-04-27
Publication date: 2012-07-25
Anticipated expiration: 2026-07-13
Also published as: JP2010191463A

Description

The present invention relates to a singing scoring device and a singing scoring program suitable for use in a karaoke apparatus.

Various karaoke apparatuses have been developed that include a singing scoring device. Such a device takes the pitch and sounding timing of each note in the main melody part of the karaoke accompaniment as its scoring reference, and scores the singing by comparing the pitch extracted from the singer's voice against that reference. An apparatus of this type is disclosed in, for example, Patent Document 1.

Patent Document 1: JP-A-11-194782

However, a method that scores by comparing a scoring reference with the sung pitch, as in the technique disclosed in Patent Document 1, cannot score a karaoke song in the style called "rap", which has no clear melody, because no scoring reference exists for such a song.

Moreover, even with a method that compares a scoring reference with the sung pitch, a user who carefully sings only part of a karaoke song and then stops the accompaniment is scored on that part alone, and can therefore obtain a high score. This abuse could be avoided by invalidating the score unless the user keeps singing for at least a fixed length of time, but then a very short song would not be scored even when it is sung in full.

The present invention has been made in view of these circumstances, and its object is to provide a singing scoring device and a singing scoring program that can score singing even for a song without a melody or a song with a very short performance time.

The invention according to claim 1 comprises: framing means for dividing, into frames of a predetermined number of data items, model singing sound data, which is generated in synchronization with the reproduction of a karaoke song and represents a model singing sound sung as a model, and user singing data obtained from the user singing sound sung by a user along with the reproduced karaoke song; first feature extraction means for extracting a voice feature of the model singing sound from the predetermined number of items of model singing sound data framed by the framing means; second feature extraction means for extracting a voice feature of the user singing sound from the predetermined number of items of user singing data framed by the framing means; similarity calculation means for calculating the similarity between the voice feature of the model singing sound and the voice feature of the user singing sound extracted by the first and second feature extraction means, respectively; determination means for determining, frame by frame, whether the user singing sound is acceptable relative to the model singing sound according to the similarity calculated by the similarity calculation means; and scoring means for scoring the user's singing based on the frame-by-frame determination results of the determination means only when the ratio of the number of frames of user singing data to the number of frames of model singing sound data is equal to or greater than a fixed value.

The invention according to claim 2 causes a computer to execute: a framing process of dividing, into frames of a predetermined number of data items, model singing sound data, which is generated in synchronization with the reproduction of a karaoke song and represents a model singing sound sung as a model, and user singing data obtained from the user singing sound sung by a user along with the reproduced karaoke song; a first feature extraction process of extracting a voice feature of the model singing sound from the predetermined number of items of model singing sound data framed by the framing process; a second feature extraction process of extracting a voice feature of the user singing sound from the predetermined number of items of user singing data framed by the framing process; a similarity calculation process of calculating the similarity between the voice feature of the model singing sound and the voice feature of the user singing sound extracted by the first and second feature extraction processes, respectively; a determination process of determining, frame by frame, whether the user singing sound is acceptable relative to the model singing sound according to the similarity calculated by the similarity calculation process; and a scoring process of scoring the user's singing based on the frame-by-frame determination results of the determination process only when the ratio of the number of frames of user singing data to the number of frames of model singing sound data is equal to or greater than a fixed value.

In the present invention, the acceptability of the user singing sound relative to the model singing sound is determined and scored according to the similarity between the voice feature extracted from the model singing sound sung as a model and the voice feature extracted from the user singing sound sung by the user, so the scoring reflects whether the lyrics of the song are sung correctly. Singing can therefore be scored even for a karaoke song in the style called "rap", which has no clear melody.

Furthermore, in the present invention, the user's singing is scored from the frame-by-frame determination results only when the ratio of the number of frames of user singing data to the number of frames of model singing sound data is equal to or greater than a fixed value, so singing can be scored even for a song with a very short performance time.

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention. FIG. 2 is a flowchart showing the operation of the karaoke processing. FIG. 3 is a flowchart showing the operation of the partial scoring processing. FIG. 4 is a flowchart showing the operation of the MFCC calculation processing. FIG. 5 is a diagram showing an example of a filter bank.

Embodiments of the present invention will be described below with reference to the drawings.
A. Configuration
FIG. 1 is a block diagram showing the configuration of a karaoke apparatus equipped with a singing scoring device according to an embodiment of the present invention. In this figure, the CPU 10 executes predetermined programs stored in the program ROM 11 in response to switch events supplied from the switch unit 14, and thereby controls each part of the apparatus. The characteristic processing operations of the CPU 10 that relate to the gist of the present invention (karaoke processing, partial scoring processing, and MFCC calculation processing) will be described later.

The program ROM 11 stores the various programs executed by the CPU 10 together with control data. These programs include the "karaoke processing", "partial scoring processing", and "MFCC calculation processing" described later. The RAM 12 has a work area and a buffer area. The work area temporarily stores the various register and flag data used in the processing of the CPU 10. The buffer area temporarily stores the model singing data and user singing data described later.

The karaoke data memory 13 is an electrically rewritable nonvolatile memory such as a flash memory, and stores karaoke data for a plurality of songs. The switch unit 14 includes, in addition to a power switch, various switches such as a song selection switch for selecting the song to be accompanied and a start/stop switch for instructing the start and stop of karaoke. It generates switch events in response to operation of these switches and supplies them to the CPU 10. When karaoke is started by operating the start/stop switch of the switch unit 14, the CPU 10 reads from the karaoke data memory 13 the karaoke data of the song selected in advance with the song selection switch.

The karaoke data of one song stored in the karaoke data memory 13 consists of lyrics data and audio data. The lyrics data is information for displaying the lyrics of the song as subtitles in synchronization with the karaoke accompaniment. The audio data contains accompaniment data and model singing data that are compression-encoded in MP3 format in dual monaural mode, with a karaoke track and a vocal track.

That is, the karaoke track stores compression-encoded accompaniment data obtained by sampling the karaoke accompaniment sound, and the vocal track stores compression-encoded model singing data obtained by sampling, for example, the singing of a singer who sang as a model in synchronization with the karaoke accompaniment sound.

The microphone 15 converts the user's singing sound into a singing voice signal and outputs it. Under the control of the CPU 10, the codec 16 A/D-converts the singing voice signal supplied from the microphone 15 and stores the resulting user singing data in the buffer area of the RAM 12. Also under the control of the CPU 10, the codec 16 decodes (decompresses) the MP3-format model singing data read from the karaoke data memory 13 and stores it in the buffer area of the RAM 12. While karaoke is running, the user singing data and the model singing data stored in the buffer area of the RAM 12 are each updated every 256 msec, which corresponds to one frame of 1024 sampling points.
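The frame arithmetic in this paragraph can be sketched as follows. A frame of 1024 sampling points lasting 256 msec implies a sampling rate of 4000 Hz; that rate is derived here and is not stated in the description. The function name `split_into_frames` is hypothetical, and dropping a trailing partial frame is an assumption of this sketch.

```python
FRAME_SIZE = 1024        # sampling points per frame (from the description)
FRAME_PERIOD_MS = 256    # buffer update interval in msec (from the description)

# Implied sampling rate: 1024 samples per 0.256 s = 4000 Hz (derived, not stated).
SAMPLE_RATE_HZ = FRAME_SIZE * 1000 // FRAME_PERIOD_MS


def split_into_frames(samples, frame_size=FRAME_SIZE):
    """Split a sample sequence into consecutive, non-overlapping frames.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    n_frames = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(n_frames)]


if __name__ == "__main__":
    two_and_a_bit = list(range(2500))
    frames = split_into_frames(two_and_a_bit)
    print(SAMPLE_RATE_HZ, len(frames), len(frames[0]))
```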

In addition, under the control of the CPU 10, the codec 16 decodes (decompresses) the MP3-format accompaniment data read from the karaoke data memory 13, D/A-converts the decoded accompaniment data into a karaoke accompaniment sound signal, and mixes that signal with the singing voice signal supplied from the microphone 15 to generate an audio output. The audio output is supplied, for example, to the external audio input terminal of a television receiver (not shown) and reproduced as sound. Under the control of the CPU 10, the video encoder 17 converts the lyrics data read from the karaoke data memory 13 into a video output for subtitle display. The video output is supplied, for example, to the video input terminal of a television receiver (not shown) and displayed on the screen as lyric subtitles.

B. Operation
Next, the operation of the karaoke apparatus configured as above will be described with reference to FIGS. 2 to 5. The karaoke processing, partial scoring processing, and MFCC calculation processing executed by the CPU 10 are described in turn.

(1) Operation of the karaoke processing
FIG. 2 is a flowchart showing the operation of the karaoke processing. When the apparatus is powered on, the CPU 10 proceeds to step SA1 shown in FIG. 2 and waits for a karaoke start instruction. When a karaoke start instruction is generated by operating the start/stop switch of the switch unit 14, the determination result of step SA1 becomes "YES" and the process proceeds to step SA2.

In step SA2, the karaoke data (lyrics data and audio data) of the song selected in advance with the song selection switch is read from the karaoke data memory 13, and the lyrics data in it is supplied to the video encoder 17 and converted into a video output for lyric subtitle display. Also in step SA2, the audio data in the read karaoke data, that is, the MP3-encoded accompaniment data of the karaoke track and the model singing data of the vocal track, is supplied to the codec 16 and decoded (decompressed).

Next, in step SA3, the codec 16 is instructed to mix the karaoke accompaniment sound signal, obtained by D/A-converting the accompaniment data decoded in step SA2, with the singing voice signal supplied from the microphone 15 and to generate the audio output. If, for example, the audio output is supplied to the external audio input terminal of a television receiver (not shown) and the video output to its video input terminal, the lyric subtitles are displayed on the screen and the karaoke accompaniment sound is reproduced.

Once the karaoke accompaniment has started in this way, the CPU 10 proceeds to step SA4 and stores the model singing data decoded by the codec 16 in step SA2 in the buffer area of the RAM 12. In the following step SA5, it stores the user singing data generated by the codec 16 in the buffer area of the RAM 12.

Then, in step SA6, the partial scoring process (described later) is executed. This process extracts the voice feature MFCC from the 1024 sampling points of model singing data and of user singing data buffered in the RAM 12, calculates from the two features the similarity of the user singing sound (user singing data) to the model singing sound (model singing data), judges acceptability according to the calculated similarity, and scores the singing based on that judgment. Because the partial scoring process uses the 1024 sampling points of data buffered in the RAM 12, it is executed every 256 msec.

Next, in step SA7, it is determined whether a karaoke stop instruction has been issued. If there is no stop instruction, the determination result is "NO" and the process returns to step SA2 described above. Thereafter, until the karaoke accompaniment reaches the end of the song or a karaoke stop instruction is generated by operating the start/stop switch of the switch unit 14, steps SA2 to SA6 are repeated, so that the user singing sound is scored every 256 msec while the karaoke accompaniment proceeds. When, for example, the accompaniment reaches the end of the song and a karaoke stop instruction is generated, the determination result of step SA7 becomes "YES", and the process proceeds to step SA8 to execute the scoring process.

In the scoring process, it is determined whether the ratio of the frame counter value to the number of frames in the entire song is equal to or greater than a fixed value. As described later, the frame counter counts the frames of model singing data that are not silent. A frame is the unit of data buffered every 1024 sampling points (every 256 msec). The number of frames in the entire song corresponds to the length of the model singing data divided by the frame length.

The scoring process thus determines whether the user has sung at least a fixed proportion of the accompanied song. If not, the partial score obtained in the partial scoring process of step SA6 is invalidated, the singing is scored as zero, and the process proceeds to step SA9.

If, on the other hand, the user has sung at least the fixed proportion of the accompanied song, the partial score obtained in the partial scoring process of step SA6 is divided by the frame counter value, and the result expressed as a percentage is calculated as the score data. The process then proceeds to step SA9, where the calculated score data is converted into a video output by the video encoder 17 so that the user's singing score is displayed on the screen, and the karaoke processing ends.
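The step SA8 decision described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `final_score`, its parameter names, and the example value of `min_ratio` are assumptions, since the description only speaks of "a fixed value".

```python
def final_score(partial_score, frame_counter, total_frames, min_ratio):
    """Return the final score on a 0-100 scale, following step SA8.

    partial_score : frames judged acceptable in the partial scoring process
    frame_counter : non-silent model frames counted while karaoke was running
    total_frames  : number of frames in the entire song
    min_ratio     : required proportion of the song (value is an assumption)
    """
    if total_frames <= 0 or frame_counter <= 0:
        return 0.0
    if frame_counter / total_frames < min_ratio:
        # Too little of the song was sung: the partial score is invalidated.
        return 0.0
    # Partial score divided by the frame counter, expressed as a percentage.
    return 100.0 * partial_score / frame_counter


if __name__ == "__main__":
    print(final_score(30, 40, 100, min_ratio=0.5))   # stopped early
    print(final_score(45, 60, 100, min_ratio=0.5))   # sang enough of the song
```

Because the denominator is the frame counter rather than a fixed duration, a very short song sung in full still passes the ratio check.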

(2) Operation of the partial scoring processing
Next, the operation of the partial scoring process will be described with reference to FIG. 3. When this process is executed via step SA6 of the karaoke processing described above (see FIG. 2), the CPU 10 proceeds to step SB1 shown in FIG. 3 and checks whether the 1024 sampling points of model singing data stored in the buffer area of the RAM 12 are silent.

Then, in step SB2, it is determined from the check result of step SB1 whether the model singing data is silent. If it is, the frame is regarded as not being a sung part, the determination result is "YES", and the process ends for the time being. In this case, the frame containing the silent model singing data is discarded and the process waits for the next frame.
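The description does not say how silence is detected in steps SB1 and SB2. One possible check, shown purely as an assumption, compares the mean absolute amplitude of a frame with a small threshold. The function name `is_silent` and the default threshold are hypothetical.

```python
def is_silent(frame, threshold=0.01):
    """Return True when the frame's mean absolute amplitude is below threshold.

    Both the measure and the threshold value are assumptions of this sketch;
    samples are taken to be floats in the range -1.0 to 1.0.
    """
    if not frame:
        return True
    return sum(abs(x) for x in frame) / len(frame) < threshold


if __name__ == "__main__":
    print(is_silent([0.0] * 1024))          # an all-zero frame
    print(is_silent([0.5, -0.5] * 512))     # a loud alternating frame
```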

If, on the other hand, the model singing data is not silent, the determination result of step SB2 is "NO" and the process proceeds to step SB3. In step SB3, the frame counter is incremented. The frame counter counts the frames of model singing data that are not silent, and its value represents the current position in the song. Next, in step SB4, the MFCC calculation process for the model singing data is executed.

The operation of the MFCC calculation process will now be described with reference to FIG. 4. When the MFCC calculation process is executed via step SB4, the CPU 10 proceeds to step SC1 shown in FIG. 4 and applies low-order high-pass filtering to the 1024 sampling points of non-silent model singing data stored in the buffer area of the RAM 12 (hereinafter, the input signal) to remove the DC component (bias noise). Then, in step SC2, a Hanning window is applied to the bias-free input signal and a fast Fourier transform (FFT) is performed, converting the input signal into the spectral domain.
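Steps SC1 and SC2 can be sketched in plain Python as follows. The description specifies only "low-order high-pass filtering", a Hanning window, and an FFT; the first-order DC-blocking filter, its coefficient `alpha=0.97`, the symmetric window definition, and all function names below are assumptions made for illustration.

```python
import cmath
import math


def remove_dc(frame, alpha=0.97):
    """First-order DC-blocking filter: y[n] = x[n] - x[n-1] + alpha * y[n-1]."""
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in frame:
        y = x - prev_x + alpha * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out


def hann_window(n):
    """Symmetric Hanning window of length n (n must be at least 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectrum(frame):
    """Return the power spectrum (bins 0..N/2) of one frame after SC1 and SC2."""
    filtered = remove_dc(frame)
    window = hann_window(len(filtered))
    spectrum = fft([s * w for s, w in zip(filtered, window)])
    return [abs(c) ** 2 for c in spectrum[:len(frame) // 2 + 1]]


if __name__ == "__main__":
    # 500 Hz tone sampled at an assumed 4000 Hz, one frame of 1024 points.
    tone = [math.sin(2.0 * math.pi * 500.0 * n / 4000.0) for n in range(1024)]
    ps = power_spectrum(tone)
    print(len(ps), max(range(len(ps)), key=ps.__getitem__))
```

A 1024-point frame yields 513 spectral bins, which is the input size assumed for a subsequent filter bank stage.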

Next, in step SC3, filter bank processing is applied to the input signal converted into the spectral domain, generating a 20-dimensional spectral sequence used as the feature. As shown in FIG. 5, this filter bank processing uses a filter bank of 20 triangular windows whose widths are set on a logarithmic scale along the frequency axis. Then, in step SC4, a logarithm process converts the 20-dimensional spectral sequence, which is in the linear domain, into a log spectral sequence. In step SC5, a DCT process applies a discrete cosine transform (DCT) to the log spectral sequence to convert it into the cepstrum domain.
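Steps SC3 to SC5 can be sketched as follows. The description states only that 20 triangular windows are spaced on a logarithmic frequency scale. The mel formula `2595 * log10(1 + f / 700)`, the 4000 Hz sampling rate (derived from 1024 points per 256 msec), the 513-bin input, the unnormalized DCT-II, the log floor, and all function names are assumptions of this sketch.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Build triangular filters evenly spaced on the mel scale (0 to Nyquist)."""
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    # n_filters + 2 edges: each filter spans three consecutive edge frequencies.
    edges = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = nyquist / (n_bins - 1)
    bank = []
    for j in range(n_filters):
        lo, mid, hi = edges[j], edges[j + 1], edges[j + 2]
        weights = []
        for k in range(n_bins):
            f = k * bin_hz
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def log_filterbank_energies(power_spectrum, bank, floor=1e-10):
    """SC3 + SC4: weighted band energies, then natural log (floored)."""
    return [math.log(max(sum(w * p for w, p in zip(weights, power_spectrum)), floor))
            for weights in bank]


def dct_ii(values):
    """SC5: unnormalized DCT-II of the log spectral sequence."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n)]


if __name__ == "__main__":
    bank = mel_filterbank(20, 513, 4000)
    energies = log_filterbank_energies([1.0] * 513, bank)
    print(len(bank), len(energies), len(dct_ii(energies)))
```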

Next, in step SC6, a coefficient extraction process is executed. From the DCT coefficients obtained by the DCT process of step SC5, the lowest-order coefficient C₀, which is the spectral DC component, is excluded, and the next 12 coefficients in ascending order are extracted as the cepstrum-domain voice feature MFCC (Mel Frequency Cepstrum Coefficient). The MFCC calculation process then ends and control returns to the partial scoring process shown in FIG. 3.
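The coefficient extraction of step SC6 amounts to a slice of the DCT output. The function name `extract_mfcc` is hypothetical, and the input is assumed to be the list of DCT coefficients ordered from C₀ upward.

```python
def extract_mfcc(dct_coefficients, n_mfcc=12):
    """Drop C0 and keep the next n_mfcc low-order coefficients (C1..C12)."""
    if len(dct_coefficients) < n_mfcc + 1:
        raise ValueError("need at least n_mfcc + 1 DCT coefficients")
    return list(dct_coefficients[1:n_mfcc + 1])


if __name__ == "__main__":
    # Stand-in for the 20 DCT coefficients of one frame.
    coefficients = [float(i) for i in range(20)]
    print(extract_mfcc(coefficients))
```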

As described above, the MFCC calculation process for the model singing data in step SB4 calculates the cepstrum-domain voice feature MFCC from the 1024 sampling points of non-silent model singing data stored in the buffer area of the RAM 12.

The process then proceeds to step SB5 shown in FIG. 3, where it is checked whether the 1024 sampling points of user singing data stored in the buffer area of the RAM 12 are silent. In step SB6, it is determined from the check result of step SB5 whether the user singing data is silent. If it is, the frame is regarded as not being a sung part, the determination result is "YES", and the process ends for the time being. In this case, the frame containing the silent user singing data is discarded and the process waits for the next frame.

If, on the other hand, the user singing data is not silent, the determination result of step SB6 is "NO" and the process proceeds to step SB7. In step SB7, the MFCC calculation process for the user singing data is executed. As in step SB4 described above, this process calculates the voice feature MFCC from the 1024 sampling points of non-silent user singing data stored in the buffer area of the RAM 12.

Then, in step SB8, as a measure of the similarity between the voice feature MFCC of the model singing data calculated in step SB4 and the voice feature MFCC of the user singing data calculated in step SB7, the Euclidean distance d(a, b) is calculated between the vector a = (a₁, a₂, …, a₁₂) representing the voice feature MFCC of the model singing data and the vector b = (b₁, b₂, …, b₁₂) representing the voice feature MFCC of the user singing data.
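The distance of step SB8 is the standard Euclidean distance, the square root of the sum of squared differences between corresponding elements of a and b. A minimal sketch follows; the function name `euclidean_distance` is hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Return d(a, b) = sqrt(sum((a_i - b_i)^2)) for equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    model_mfcc = [0.0] * 12
    user_mfcc = [0.0] * 10 + [3.0, 4.0]
    print(euclidean_distance(model_mfcc, user_mfcc))
```

A smaller distance means the two frames are more alike, so the similarity test that follows compares the distance against a threshold.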

Next, in step SB9, it is determined whether the Euclidean distance d(a, b) calculated in step SB8 is below a preset threshold, that is, whether the model singing sound and the user's singing sound are similar. When the Euclidean distance d(a, b) calculated in step SB8 is equal to or greater than the threshold, so that the similarity between the model singing sound and the user's singing sound is low, the determination result is "NO" and the process ends.

In contrast, when the Euclidean distance d(a, b) calculated in step SB8 is less than the threshold, so that the similarity between the model singing sound and the user's singing sound is high, the determination result is "YES" and the process proceeds to step SB10. In step SB10, the frame being scored is judged to have passed, the partial score is incremented, and the process ends.
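The per-frame flow of steps SB1 to SB10 can be summarized in a small sketch. To keep it self-contained, each frame is passed in as an already computed MFCC vector, with `None` standing for a silent frame; that interface, the class name `PartialScorer`, and the threshold value are assumptions. A frame passes when the distance is strictly less than the threshold.

```python
import math


class PartialScorer:
    """Accumulates the frame counter and partial score across frames."""

    def __init__(self, threshold):
        self.threshold = threshold   # preset distance threshold (value assumed)
        self.frame_counter = 0       # non-silent model frames (step SB3)
        self.partial_score = 0       # frames judged acceptable (step SB10)

    def process_frame(self, model_mfcc, user_mfcc):
        """model_mfcc / user_mfcc: feature vectors, or None for a silent frame."""
        if model_mfcc is None:       # SB2: silent model frame is discarded
            return
        self.frame_counter += 1      # SB3
        if user_mfcc is None:        # SB6: silent user frame is discarded
            return
        distance = math.sqrt(
            sum((a - b) ** 2 for a, b in zip(model_mfcc, user_mfcc)))  # SB8
        if distance < self.threshold:                                  # SB9
            self.partial_score += 1                                    # SB10


if __name__ == "__main__":
    scorer = PartialScorer(threshold=1.0)
    scorer.process_frame(None, [0.0] * 12)          # model silent
    scorer.process_frame([0.0] * 12, None)          # user silent
    scorer.process_frame([0.0] * 12, [0.1] * 12)    # close match
    scorer.process_frame([0.0] * 12, [1.0] * 12)    # poor match
    print(scorer.frame_counter, scorer.partial_score)
```

Note that the frame counter advances even when the user is silent, so skipping parts of the song lowers the final percentage.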

As described above, in this embodiment, accompaniment data obtained by sampling the karaoke accompaniment sound and model singing data obtained by sampling the singing of a singer who sang as a model are stored in the karaoke data memory 13. When the accompaniment data is read from the karaoke data memory 13 and the karaoke accompaniment sound is reproduced in response to a karaoke start instruction, the user singing data, obtained by sampling the singing sound of the user singing along with the reproduced accompaniment, and the model singing data, read from the karaoke data memory 13 in synchronization with the accompaniment data, are divided into frames of a predetermined number of data items. The voice feature MFCC of the model singing sound is extracted from the model singing data in each frame, and the voice feature MFCC of the user singing sound is extracted from the user singing data.

The similarity of the user singing sound to the model singing sound is then calculated from the extracted voice feature MFCC of the model singing sound and that of the user singing sound, acceptability is judged, and the singing is scored based on the result, so the scoring reflects whether the lyrics of the song are sung correctly. As a result, singing can be scored even for a karaoke song in the style called "rap", which has no clear melody.

Furthermore, in this embodiment, the ratio of the number of frames sung by the user to the number of frames in the karaoke song that contain model singing data is taken, and the singing is scored only when that ratio is equal to or greater than a fixed value, so singing can be scored even for a song with a very short performance time.

In the embodiment described above, the singing is scored by calculating the similarity of the user singing sound to the model singing sound from the voice feature MFCC, but the conventional singing scoring method based on pitch extraction may be used in combination. For example, when the accompaniment data of a karaoke song contains both melody parts and rap parts, an identification flag that distinguishes melody parts from rap parts is provided in the accompaniment data. With reference to this flag, singing is scored by pitch extraction while the accompaniment data of a melody part is reproduced, and by extracting the voice feature MFCC while the accompaniment data of a rap part is reproduced. In this way, both the correctness of the pitch of the user singing sound and the correctness of the sung lyrics can be judged.

In the embodiment described above, MFCC (Mel Frequency Cepstrum Coefficient), a feature in the cepstrum domain, is extracted as the parameter representing the voice feature, but another feature parameter such as the LPC cepstrum may be extracted instead.

In addition, in this embodiment, the similarity between the voice feature MFCC of the model singing data and the voice feature MFCC of the user singing data is measured by calculating the Euclidean distance d(a, b) between the vector a = (a₁, a₂, …, a₁₂) representing the voice feature MFCC of the model singing data and the vector b = (b₁, b₂, …, b₁₂) representing the voice feature MFCC of the user singing data. The invention is not limited to this, however, and the similarity may be calculated with another measure such as the Itakura distance.
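
As a minimal sketch of the distance measure named above, the Euclidean distance between two 12-dimensional MFCC vectors is d(a, b) = sqrt((a₁ − b₁)² + … + (a₁₂ − b₁₂)²). The function names and the threshold value below are assumptions for illustration; the text does not state the threshold used for the suitability determination.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance d(a, b) between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def frame_is_suitable(model_mfcc, user_mfcc, threshold=10.0):
    """Treat a frame as suitable when the distance is within a threshold.

    A smaller distance means a higher similarity. The threshold of 10.0
    is a placeholder, not a value taken from the embodiment.
    """
    return euclidean_distance(model_mfcc, user_mfcc) <= threshold


# Two 12-dimensional vectors that differ by 1.0 in every coefficient.
a = [1.0] * 12
b = [2.0] * 12
d = euclidean_distance(a, b)  # equals sqrt(12)
```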

Further, in the embodiment described above, the model singing data read from the karaoke data memory 13 in synchronization with the reproduction of the karaoke accompaniment sound is divided into frames of a predetermined number of data, and the voice feature MFCC is extracted for each frame. Instead, the voice feature MFCC of each frame may be calculated from the model singing data in advance and stored in the karaoke data memory 13 in place of the model singing data. This makes the model singing data MFCC calculation process of step SB4 described above (see FIG. 3) unnecessary and reduces the processing load on the CPU 10.

10 CPU
11 Program ROM
12 RAM
13 Karaoke data memory
14 Switch unit
15 Microphone
16 Codec
17 Video encoder

Claims

A singing scoring device comprising:
framing means for dividing each of model singing sound data, which is generated in synchronization with the reproduction of a karaoke song and represents a model singing sound sung as a model, and user singing data, which is obtained from a user singing sound sung by a user along with the reproduced karaoke song, into frames of a predetermined number of data;
first feature extraction means for extracting a voice feature of the model singing sound from the model singing sound data of the predetermined number of data framed by the framing means;
second feature extraction means for extracting a voice feature of the user singing sound from the user singing data of the predetermined number of data framed by the framing means;
similarity calculating means for calculating the similarity between the voice feature of the model singing sound and the voice feature of the user singing sound extracted by the first and second feature extraction means, respectively;
determination means for determining, for each frame, the suitability of the user singing sound with respect to the model singing sound according to the similarity calculated by the similarity calculating means; and
scoring means for scoring the user's singing based on the per-frame suitability determinations made by the determination means, only when the ratio of the number of frames of the user singing data to the number of frames of the model singing sound data is equal to or greater than a certain value.

A singing scoring program causing a computer to execute:
a framing process of dividing each of model singing sound data, which is generated in synchronization with the reproduction of a karaoke song and represents a model singing sound sung as a model, and user singing data, which is obtained from a user singing sound sung by a user along with the reproduced karaoke song, into frames of a predetermined number of data;
a first feature extraction process of extracting a voice feature of the model singing sound from the model singing sound data of the predetermined number of data framed by the framing process;
a second feature extraction process of extracting a voice feature of the user singing sound from the user singing data of the predetermined number of data framed by the framing process;
a similarity calculation process of calculating the similarity between the voice feature of the model singing sound and the voice feature of the user singing sound extracted by the first and second feature extraction processes, respectively;
a determination process of determining, for each frame, the suitability of the user singing sound with respect to the model singing sound according to the similarity calculated by the similarity calculation process; and
a scoring process of scoring the user's singing based on the per-frame suitability determinations made by the determination process, only when the ratio of the number of frames of the user singing data to the number of frames of the model singing sound data is equal to or greater than a certain value.