JP4862413B2

JP4862413B2 - Karaoke equipment

Info

Publication number: JP4862413B2
Application number: JP2006022648A
Authority: JP
Inventors: あかね野口
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-01-31
Filing date: 2006-01-31
Publication date: 2012-01-25
Anticipated expiration: 2026-01-31
Also published as: WO2007088820A1; JP2007206183A

Description

The present invention relates to a technique for scoring a singer's singing ability.

Some karaoke apparatuses that perform automatic performance based on music data analyze a singer's voice input to a microphone and score the singer's singing ability. For example, the karaoke apparatus disclosed in Patent Document 1 recognizes the words in the singer's voice input to the microphone and evaluates how closely they match the words of the song's lyrics. This karaoke apparatus makes it possible to evaluate whether or not the singer has correctly memorized the lyrics.
Patent Document 1: JP-A-10-91172

To recognize the words in a voice, as the karaoke apparatus disclosed in Patent Document 1 does, speech recognition must be performed. In speech recognition, the input speech is analyzed and its acoustic features are extracted. Then, from among the words stored in a dictionary, the word whose acoustic features are closest to those of the input speech is found and output as the recognition result. The words stored in the dictionary are therefore critical, and accurate recognition requires storing many words in it. However, when many words are stored in the dictionary, finding the closest word among them takes time, and the evaluation result can no longer be presented immediately. In addition, many of the songs sung in karaoke are in foreign languages rather than Japanese. Performing speech recognition for many languages requires a dictionary for each language, and adding songs in a new language to the karaoke apparatus requires preparing a new dictionary as well. The system thus becomes complicated, and adding songs easily becomes difficult.

The present invention has been made against the background described above, and its object is to make it possible to evaluate whether or not a singer has correctly memorized the lyrics without complicating the system.

To solve the problem described above, the present invention provides a karaoke apparatus comprising: storage means storing model voice data representing a model voice produced when a song is sung according to its lyrics; voice input means to which a singer's singing voice is input; specifying means for dividing the model voice represented by the model voice data into a plurality of voice sections and specifying, in the singing voice input to the voice input means, the voice section corresponding to each of the divided voice sections; evaluation means for evaluating whether the lyrics are correct by comparing the singing voice of a voice section specified by the specifying means with the model voice corresponding to that section; and display means for displaying the evaluation result of the evaluation means.

In this aspect, the evaluation means may obtain the degree of coincidence between the singing voice of a voice section specified by the specifying means and the model voice corresponding to that section, and may evaluate whether the lyrics are correct based on the obtained degree of coincidence.
In addition, the storage means may store lyrics data representing the lyrics of the song, and the apparatus may include lyrics specifying means which, when the degree of coincidence obtained by the evaluation means is less than a predetermined value, specifies, from among the lyrics represented by the lyrics data stored in the storage means, the lyrics corresponding to the voice of the voice section whose degree of coincidence is less than the predetermined value. The display means may then display the lyrics specified by the lyrics specifying means.
Further, the evaluation means may obtain the degree of coincidence between the formant frequencies of the singing voice and the formant frequencies of the model voice.

According to the present invention, it is possible to evaluate whether or not a singer has correctly memorized the lyrics without complicating the system.

[Configuration of the embodiment]
FIG. 1 is an external view of a karaoke apparatus according to an embodiment of the present invention. As shown in the figure, a monitor 2, a speaker 3L, a speaker 3R, and a microphone 4 are connected to the karaoke apparatus 1. The karaoke apparatus 1 is remotely operated by infrared signals transmitted from a remote control device 5.

FIG. 2 is a block diagram showing the hardware configuration of the karaoke apparatus 1. The units connected to a bus 101 communicate with one another via the bus 101. A CPU (Central Processing Unit) 102 uses a RAM (Random Access Memory) 104 as a work area and executes various programs stored in a ROM (Read Only Memory) 103 to control each unit of the karaoke apparatus 1. A music storage area for temporarily storing music data is reserved in the RAM 104. A storage unit 105 includes a hard disk device and stores various data such as the music data described later and digital data of the singing voice input from the microphone 4.

A communication unit 108 receives music data from a host computer (not shown), which is the distribution source of the music data, via a communication network (not shown) such as the Internet, and transfers the received music data to the storage unit 105 under the control of the CPU 102. In the present embodiment, the music data may instead be stored in the storage unit 105 in advance. Alternatively, the karaoke apparatus 1 may be provided with a reading device that reads recording media such as CD-ROMs and DVDs, and music data recorded on such media may be read by this device and transferred to the storage unit 105 for storage.
The structure of the music data used in the present embodiment is as follows. As shown in FIG. 3, the music data in the present embodiment includes a header, musical tone data that is WAVE-format data representing the karaoke performance sound, model voice data that is WAVE-format data representing the waveform of a model voice singing the song's lyrics correctly without mistakes, and a lyrics table storing lyrics data representing the lyrics of the song.

FIG. 4 illustrates the format of the lyrics table. The lyrics table stores lyrics data representing the lyrics of the song to be performed in association with time interval data indicating the time interval in which the lyrics represented by that lyrics data should be pronounced when musical tones are output according to the musical tone data.
For example, in the lyrics table shown in FIG. 4, the lyrics data in the first row represents the lyrics "kamereon ga", and the time interval data "01:00-01:02" associated with this lyrics data indicates that, in the model voice, the lyrics "kamereon ga" are pronounced between the point 1 minute after the performance of the song starts and the point 1 minute 2 seconds after it starts. The lyrics data in the second row represents the lyrics "yattekitaa", and the time interval data "01:03-01:06" associated with this lyrics data indicates that, in the model voice, the lyrics "yattekitaa" are pronounced between the point 1 minute 3 seconds after the performance starts and the point 1 minute 6 seconds after it starts.
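The lyrics table lends itself to a small data-structure sketch. The following is only an illustration of the two rows described for FIG. 4; the names `LyricsRow`, `parse_mmss`, and `parse_row` are hypothetical, and the "MM:SS-MM:SS" string form is assumed from the example values rather than taken from an actual file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LyricsRow:
    """One row of the lyrics table: a lyric phrase and its time interval."""
    lyric: str
    start_s: int  # seconds from the start of the performance
    end_s: int    # seconds from the start of the performance


def parse_mmss(text: str) -> int:
    """Convert an 'MM:SS' string into a number of seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def parse_row(lyric: str, interval: str) -> LyricsRow:
    """Build a row from a lyric and an interval such as '01:00-01:02'."""
    start, end = interval.split("-")
    return LyricsRow(lyric, parse_mmss(start), parse_mmss(end))


# The two rows described for FIG. 4 (romanized lyrics).
LYRICS_TABLE = [
    parse_row("kamereon ga", "01:00-01:02"),
    parse_row("yattekitaa", "01:03-01:06"),
]
```

Storing the interval as whole seconds keeps the later comparison with the CPU's elapsed-time counter a plain integer comparison.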

The microphone 4 converts the singer's input singing voice into a voice signal and outputs it. The voice signal output from the microphone 4 is input to a voice processing DSP (Digital Signal Processor) 111 and an amplifier 112. The voice processing DSP 111 performs A/D conversion on the input voice signal and generates singing voice data representing the singing voice. This singing voice data is stored in the storage unit 105, compared with the model voice data, and used for scoring the singer's singing ability.

An input unit 106 detects signals generated by input operations on the operation panel of the karaoke apparatus 1 or on the remote control device 5, and outputs the detection result to the CPU 102. A display control unit 107 displays video and the scoring result of the singer's singing ability on the monitor 2 under the control of the CPU 102.

A sound source device 109 generates a musical tone signal corresponding to the supplied musical tone data and outputs the generated signal to an effect DSP 110 as the karaoke performance sound. The effect DSP 110 applies effects such as reverb and echo to the musical tone signal generated by the sound source device 109. The musical tone signal with effects applied is D/A converted by the effect DSP 110 and output to the amplifier 112. The amplifier 112 mixes and amplifies the musical tone signal output from the effect DSP 110 and the voice signal output from the microphone 4, and outputs the result to the speakers 3L and 3R. The melody of the song and the singer's voice are thereby output from the speakers 3L and 3R.

[Operation of the embodiment]
Next, the operation of the present embodiment will be described. First, when the user operates the remote control device 5 to designate a song, the CPU 102 transfers the music data of the designated song from the storage unit 105 to the music storage area of the RAM 104. The CPU 102 executes karaoke accompaniment processing by sequentially reading the various data contained in the music data stored in this music storage area.

Specifically, the CPU 102 reads the musical tone data contained in the music data and outputs it to the sound source device 109. The sound source device 109 generates a musical tone signal of a predetermined timbre based on the supplied music data and outputs the generated signal to the effect DSP 110. The effect DSP 110 applies effects such as reverb and echo to the musical tone signal output from the sound source device 109. The musical tone signal with effects applied is D/A converted by the effect DSP 110 and output to the amplifier 112. The amplifier 112 amplifies the musical tone signal output from the effect DSP 110 and outputs it to the speakers 3L and 3R. The melody of the song is thereby output from the speakers 3L and 3R. When the music data has been supplied to the sound source device 109 and output of the musical tones begins, the CPU 102 starts counting the time elapsed since output of the song began.

Meanwhile, when the singer sings along with the playback of the song, the singer's voice is input to the microphone 4, and a voice signal is output from the microphone 4. The voice processing DSP 111 performs A/D conversion on the voice signal output from the microphone 4 and generates singing voice data representing the singing voice. This singing voice data is stored in the storage unit 105.

The CPU 102 continues counting the elapsed time and searches the lyrics table for a time interval whose start time equals the counted time. It then reads out the retrieved time interval and the lyrics data stored in association with it. For example, when the counted elapsed time is 01:00, the time interval "01:00-01:02" and the lyrics data "kamereon ga" in the first row of the lyrics table shown in FIG. 4 are read out.
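The lookup described above can be sketched as a search for the row whose start time equals the current counter value. This is a minimal illustration, not the apparatus's actual routine: the tuple layout `(start_s, end_s, lyric)` and the function name `find_row_starting_at` are assumptions, and the table values follow the FIG. 4 example.

```python
from typing import Optional, Sequence, Tuple

# (start_s, end_s, lyric) rows; values follow the FIG. 4 example.
Row = Tuple[int, int, str]

TABLE: Sequence[Row] = (
    (60, 62, "kamereon ga"),
    (63, 66, "yattekitaa"),
)


def find_row_starting_at(table: Sequence[Row], elapsed_s: int) -> Optional[Row]:
    """Return the row whose interval starts exactly at elapsed_s, or None."""
    for row in table:
        if row[0] == elapsed_s:
            return row
    return None
```

With this table, a counter value of 60 seconds (01:00) selects the first row, while 61 seconds selects nothing because no interval starts at that time.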

Having read out the time interval, the CPU 102 compares the voice input to the microphone 4 during this interval with the model voice during the same interval, and determines whether or not the singer sang the lyrics correctly. Specifically, the CPU 102 analyzes the voice represented by the model voice data and, as shown in FIG. 5, extracts the voice waveform A lying within the read time interval (01:00-01:02) on the time axis of the voice waveform represented by the model voice data. The CPU 102 also analyzes the stored singing voice data and, as shown in FIG. 5, extracts the voice waveform B lying within the read time interval on the time axis of the waveform represented by the singing voice data. The extracted voice waveform A is then divided into a plurality of frames at a predetermined time interval (for example, 10 ms), as shown in FIG. 6(a). Likewise, the extracted voice waveform B is divided into a plurality of frames at a predetermined time interval (for example, 10 ms), as shown in FIG. 6(b).
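The extraction and framing step can be sketched as follows. This is a simplified illustration under stated assumptions: the waveform is taken to be a flat sequence of samples at a known sample rate, frames do not overlap, and a trailing partial frame is dropped. The function names are hypothetical; only the 10 ms frame length comes from the description.

```python
from typing import List, Sequence


def extract_interval(samples: Sequence[float], sample_rate: int,
                     start_s: float, end_s: float) -> List[float]:
    """Return the samples lying between start_s and end_s."""
    first = int(start_s * sample_rate)
    last = int(end_s * sample_rate)
    return list(samples[first:last])


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      frame_ms: float = 10.0) -> List[List[float]]:
    """Cut samples into consecutive, non-overlapping frames of frame_ms."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Drop a trailing partial frame so every frame has the same length.
    usable = len(samples) - len(samples) % frame_len
    return [list(samples[i:i + frame_len]) for i in range(0, usable, frame_len)]
```

The same two functions would be applied to both the model voice (waveform A) and the singing voice (waveform B) for the interval read from the lyrics table.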

Next, the CPU 102 associates the voice waveform of each frame of the model voice with the voice waveform of each frame of the singing voice using the DP (Dynamic Programming) matching method. For example, in the waveforms illustrated in FIG. 6, when the voice waveform of frame A1 of the model voice corresponds to the voice waveform of frame B1 of the singing voice, frame A1 is associated with frame B1. When the voice waveform of frame A2 of the model voice corresponds to the voice waveforms of frames B2 through B3 of the singing voice, frame A2 is associated with frames B2 through B3.
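DP matching of this kind is commonly implemented as dynamic time warping. The sketch below is one possible form, not the method fixed by the description: it assumes each frame has already been reduced to a small feature vector, uses Euclidean distance as the local cost, and allows the usual three step directions. The function name `dp_match` is hypothetical.

```python
from typing import Dict, List, Sequence


def dp_match(model: Sequence[Sequence[float]],
             sung: Sequence[Sequence[float]]) -> Dict[int, List[int]]:
    """Align two feature sequences and map each model frame to sung frames."""
    def dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n, m = len(model), len(sung)
    if n == 0 or m == 0:
        return {}
    inf = float("inf")
    # cost[i][j]: minimal accumulated distance aligning model[:i+1] with sung[:j+1].
    cost = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(model[i], sung[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = d + best
    # Backtrack from the last cell to recover the warping path.
    mapping: Dict[int, List[int]] = {i: [] for i in range(n)}
    i, j = n - 1, m - 1
    while True:
        mapping[i].append(j)
        if i == 0 and j == 0:
            break
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
    return {k: sorted(v) for k, v in mapping.items()}
```

For a model sequence `[[0], [1], [2]]` and a sung sequence `[[0], [1], [1], [2]]`, the second model frame is mapped to the second and third sung frames, which mirrors the A2 to B2 through B3 association in the FIG. 6 example.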

Next, the CPU 102 compares the features of the voice waveforms between corresponding frames. Specifically, the CPU 102 applies a Fourier transform to the voice waveform of each frame of the model voice. The CPU 102 then takes the logarithm of the amplitude spectrum obtained by the Fourier transform and applies an inverse Fourier transform to it to generate a spectral envelope for each frame. From the obtained spectral envelope, the CPU 102 extracts the frequency f11 of the first formant, the frequency f12 of the second formant, and the frequency f13 of the third formant.
The CPU 102 also applies a Fourier transform to the voice waveform of each frame of the singer's voice associated with each frame of the model voice. The CPU 102 then takes the logarithm of the amplitude spectrum obtained by the Fourier transform and applies an inverse Fourier transform to it to generate a spectral envelope for each frame. From the obtained spectral envelope, the CPU 102 extracts the frequency f21 of the first formant, the frequency f22 of the second formant, and the frequency f23 of the third formant.
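The log-spectrum and inverse-transform procedure corresponds to the standard cepstral method for estimating a spectral envelope. The sketch below fills in details the description leaves open and should be read as an assumption: it keeps only the low-quefrency cepstral coefficients (the `lifter` cutoff is a hypothetical parameter), transforms back to obtain a smoothed log spectrum, and treats the first three local maxima of that envelope as the formants. A naive DFT is used only to keep the example free of external libraries.

```python
import cmath
import math
from typing import List, Sequence


def dft(values: Sequence[complex]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform, adequate for short frames."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(values: Sequence[complex]) -> List[complex]:
    """Inverse of dft()."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def spectral_envelope(frame: Sequence[float], lifter: int = 12) -> List[float]:
    """Smoothed log-amplitude spectrum of one frame, computed via the cepstrum."""
    n = len(frame)
    # Small constant avoids log(0) for empty spectral bins.
    log_spectrum = [math.log(abs(c) + 1e-10) for c in dft(frame)]
    cepstrum = inverse_dft(log_spectrum)
    # Keep only low-quefrency terms (and their mirror images) to smooth the spectrum.
    kept = [c if (q < lifter or q > n - lifter) else 0
            for q, c in enumerate(cepstrum)]
    return [c.real for c in dft(kept)][: n // 2 + 1]


def formant_frequencies(envelope: Sequence[float], sample_rate: int,
                        frame_len: int, count: int = 3) -> List[float]:
    """Frequencies (Hz) of the first `count` local maxima of the envelope."""
    peaks = [
        k * sample_rate / frame_len
        for k in range(1, len(envelope) - 1)
        if envelope[k - 1] < envelope[k] >= envelope[k + 1]
    ]
    return peaks[:count]
```

Applying `spectral_envelope` and then `formant_frequencies` to a model frame would give values playing the role of f11 to f13, and applying them to the associated sung frame would give f21 to f23. With very short frames the frequency resolution is coarse, so a practical system would likely window the frames or use longer analysis windows.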

For example, the CPU 102 generates the spectral envelope of frame A1 of the model voice and extracts the formant frequencies f11 to f13 of the first to third formants from this spectral envelope. The CPU 102 then generates the spectral envelope of the voice waveform of frame B1, which is associated with frame A1, and extracts the formant frequencies f21 to f23 of the first to third formants from this spectral envelope.
The CPU 102 also generates the spectral envelope of frame A2 of the model voice and extracts the formant frequencies f11 to f13 of the first to third formants from this spectral envelope. The CPU 102 then generates the spectral envelope of the voice waveforms of frames B2 through B3, which are associated with frame A2, and extracts the formant frequencies f21 to f23 of the first to third formants from this spectral envelope.

Next, the CPU 102 compares the formant frequencies f11 to f13 extracted from each frame of the model voice with the formant frequencies f21 to f23 extracted from the frame associated with that frame of the model voice. When, between corresponding voice waveforms, the difference between formant frequency f11 and formant frequency f21, the difference between formant frequency f12 and formant frequency f22, and the difference between formant frequency f13 and formant frequency f23 are equal to or greater than a predetermined value, the CPU 102 attaches mismatch information D, indicating that the formant frequencies did not match, to the frame of the model voice.
For example, when the formant frequencies f11 to f13 of the voice waveform of frame A1 match the formant frequencies of the voice waveform of frame B1, the CPU 102 determines that the voices of the corresponding frames match and does not attach mismatch information D to frame A1.
On the other hand, when the differences between the formant frequencies f11 to f13 of frame A2 and the formant frequencies f21 to f23 of the voice waveforms of frames B2 through B3 are equal to or greater than the predetermined value, the CPU 102 attaches mismatch information D, indicating that the formant frequencies did not match, to frame A2.
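The per-frame comparison reduces to a threshold test on the three formant differences. The sketch below is an illustration only: the description does not say whether one large difference suffices or all three are required, so the "any difference" rule used here is an assumption, and the 200 Hz default for the "predetermined value" is a hypothetical placeholder.

```python
from typing import Sequence


def formants_mismatch(model_formants: Sequence[float],
                      sung_formants: Sequence[float],
                      threshold_hz: float = 200.0) -> bool:
    """True if any corresponding formant pair differs by threshold_hz or more."""
    return any(
        abs(f_model - f_sung) >= threshold_hz
        for f_model, f_sung in zip(model_formants, sung_formants)
    )
```

A frame for which this function returns `True` corresponds to a model voice frame that receives mismatch information D. Replacing `any` with `all` gives the stricter reading in which every formant must differ.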

Having determined, for the voice waveform of each frame of the model voice, whether its formant frequencies match those of the singer's voice waveform, the CPU 102 counts the number N of frames to which mismatch information D has been attached. Next, the CPU 102 compares the total number M of frames of the divided model voice data with the number N. When N is half of M or more, the CPU 102 determines that, for the lyrics represented by the read lyrics data, the lyrics pronounced by the singer differ from the lyrics of the model voice. When N is less than half of M, the CPU 102 determines that the lyrics pronounced by the singer are the same as the lyrics of the model voice. For example, if, for the voice "kamereon ga" represented by the model voice data, the number N of mismatches is less than half of the total number M of frames, the CPU 102 determines that the lyrics pronounced by the singer are the same as the lyrics of the model voice.
In the present embodiment, the lyrics pronounced by the singer are judged to differ from the lyrics of the model voice when N is half of M or more. However, this judgment may instead be made when the ratio of N to the total number M of frames is equal to or greater than a predetermined ratio other than 50%.
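The decision rule above is a ratio test of N against M. A minimal sketch, with the hypothetical name `lyric_is_wrong` and with the ratio exposed as a parameter to cover the variation in which a proportion other than 50% is used:

```python
def lyric_is_wrong(mismatch_count: int, total_frames: int,
                   ratio: float = 0.5) -> bool:
    """True when the share of mismatched frames N/M is at least `ratio`."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    return mismatch_count / total_frames >= ratio
```

With the default ratio, 50 mismatched frames out of 100 are judged as a wrong lyric, while 49 out of 100 are judged as correct.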

In parallel with the comparison between the model voice and the singing voice, the CPU 102 continues counting the elapsed time. When the counted elapsed time reaches 01:03, it reads out the time interval "01:03-01:06" and the lyrics data "yattekitaa" in the second row of the lyrics table shown in FIG. 4. When the singer sings during this time interval along with the playback of the song, singing voice data is stored in the storage unit 105. Suppose, for example, that the singer gets the lyrics wrong and sings "ittekuru" instead of the lyrics "yattekita" represented by the read lyrics data. In that case, singing voice data representing the voice "ittekuru" is generated and stored in the storage unit 105.

Next, the CPU 102 divides the waveform of the voice input to the microphone 4 during this time interval and the waveform of the model voice during this time interval into a plurality of frames. It then associates the voice waveform of each frame of the model voice with the voice waveform of each frame of the singing voice, and compares the formant frequencies of the voice waveforms between the associated frames. The CPU 102 determines, for the voice waveform of each frame of the model voice, whether its formant frequencies match those of the singer's voice waveform, and attaches mismatch information D where they do not. It then compares the total number M of frames of the divided model voice data with the number N of frames to which mismatch information has been attached, and determines whether or not the singer sang the lyrics correctly.

Here, because the singer sang the different lyrics "ittekuru" in place of the lyrics "yattekita", comparing the formant frequencies of the model voice waveform with those of the singer's voice waveform shows that the formant frequencies do not match, and the number N of mismatches becomes equal to or greater than the total number M of frames. When N is half of M or more, the CPU 102 determines that, for the lyrics represented by the read lyrics data, the lyrics pronounced by the singer differ from the lyrics of the model voice. The CPU 102 then controls the display control unit 107 to display the lyrics "yattekita" represented by the read lyrics data on the monitor 2, thereby notifying the singer that the lyrics were sung incorrectly.

Thereafter, as the song is played back, the CPU 102 repeats, as described above, the reading of the lyrics data and model voice data and the determination of whether the lyrics sung by the singer are correct. When all the performance event data has been read out, the karaoke accompaniment processing ends.

As described above, according to the present embodiment, it is possible to determine whether or not the singer sang according to the lyrics without performing speech recognition using a dictionary. In addition, in the present embodiment, as long as data of a voice singing the lyrics correctly is available, it is possible to evaluate whether the singer sang correctly according to the lyrics. Therefore, unlike an arrangement that performs language recognition using a dictionary, it is possible to evaluate whether or not the singer has correctly memorized the lyrics, for lyrics in various languages, without complicating the system.

[Modifications]
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment. For example, the present invention may be implemented by modifying the above embodiment as follows.

In the embodiment described above, the pitch of the voice represented by the singing voice data may be corrected so that the pitch of the voice waveform represented by the singing voice data becomes the pitch of the voice waveform represented by the model voice data.

In the embodiment described above, periodic fluctuation in the pitch of the voice waveform represented by the model voice data may also be detected to determine whether vibrato is applied to the model voice. When vibrato is judged to be present, the degree of coincidence between the pitch fluctuation of the voice waveform represented by the model voice data and that of the voice waveform represented by the singing voice data may be determined in order to judge whether the singer is applying vibrato correctly.
Similarly, the pitch fluctuation of the voice waveform represented by the model voice data may be detected to determine whether the model voice contains shakuri (scooping up to a note from a lower pitch). When shakuri is judged to be present, the degree of coincidence between the pitch fluctuation of the voice waveform represented by the model voice data and that of the voice waveform represented by the singing voice data may be determined in order to judge whether the singer is performing shakuri correctly.

In the embodiment described above, the voice waveform represented by the model voice data and the voice waveform represented by the singing voice data may also be divided into a plurality of frequency bands by a plurality of band-pass filters, and the correctness of the lyrics may be judged by determining the degree of coincidence of the voice features in each frequency band.

In the embodiment described above, model voice data representing the model voice waveform is stored, and the voice waveform represented by this model voice data is analyzed to obtain the formant frequencies. Alternatively, the formant frequencies of each frame obtained when the voice waveform is divided into a plurality of frames may be stored in the storage unit 105 in advance, and the degree of coincidence may be determined by comparing the stored formant frequencies with the formant frequencies of each frame of the singer's voice waveform.

In the above-described embodiment, the correctness of the lyrics sung by the singer may instead be judged after the singer has finished singing the song. Further, in the above-described embodiment, when it is determined that the lyrics pronounced by the singer differ from the lyrics of the model voice, a message or image notifying the singer that the lyrics were wrong may be displayed on the monitor 2 instead of the lyrics themselves.
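
As a rough illustration of these two variations, the sketch below evaluates all voice sections once the song is over and returns either the mis-sung lyrics or a notification message. The function name, the threshold of 0.7, and the message wording are hypothetical; the per-section degrees of coincidence are assumed to have been collected during the performance.

```python
def review_after_song(section_scores, lyrics, threshold=0.7, show_lyrics=True):
    """Return display strings for sections whose coincidence is below threshold."""
    wrong = [i for i, score in enumerate(section_scores) if score < threshold]
    if not wrong:
        return ["All lyrics were sung correctly."]
    if show_lyrics:
        # Variation 1: show the lyrics of each mis-sung section.
        return [lyrics[i] for i in wrong]
    # Variation 2: show only a message that lyrics were wrong.
    return [f"{len(wrong)} lyric section(s) were sung incorrectly."]

# Hypothetical scores and lyric fragments for three voice sections.
scores = [0.9, 0.4, 0.8]
lyrics = ["first line", "second line", "third line"]
print(review_after_song(scores, lyrics))
print(review_after_song(scores, lyrics, show_lyrics=False))
```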

FIG. 1 is an external view of a karaoke apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the hardware configuration of the karaoke apparatus.
FIG. 3 is a diagram illustrating the format of the music data in the embodiment.
FIG. 4 is a diagram illustrating the format of the lyrics table.
FIG. 5 is a diagram illustrating the waveform of the model voice and the waveform of the singing voice.
FIG. 6 is a diagram showing the waveform of the model voice and the waveform of the singing voice divided into a plurality of frames.

Explanation of symbols

1 ... Karaoke apparatus, 2 ... Monitor, 3L, 3R ... Speakers, 4 ... Microphone, 5 ... Remote control device, 101 ... Bus, 102 ... CPU, 103 ... ROM, 104 ... RAM, 105 ... Storage unit, 106 ... Input unit, 107 ... Display control unit, 108 ... Communication unit, 109 ... Sound source device, 110 ... Effect DSP, 111 ... Voice processing DSP, 112 ... Amplifier

Claims

A karaoke apparatus comprising:
storage means for storing model voice data representing a model voice in which a song is sung according to its lyrics;
voice input means to which a singer's singing voice is input;
specifying means for dividing the model voice represented by the model voice data into a plurality of voice sections and specifying, in the singing voice input to the voice input means, the voice section corresponding to each of the divided voice sections;
evaluation means for evaluating the correctness of the lyrics by comparing the singing voice of the voice section specified by the specifying means with the model voice corresponding to the singing voice of that voice section; and
display means for displaying an evaluation result of the evaluation means.

The karaoke apparatus according to claim 1, wherein the evaluation means obtains a degree of coincidence between the singing voice of the voice section specified by the specifying means and the model voice corresponding to the singing voice of that voice section, and evaluates the correctness of the lyrics based on the obtained degree of coincidence.

The karaoke apparatus according to claim 2, wherein the storage means stores lyric data representing the lyrics of the song,
the apparatus further comprises lyrics specifying means for, when the degree of coincidence obtained by the evaluation means is less than a predetermined value, specifying, from among the lyrics represented by the lyric data stored in the storage means, the lyrics corresponding to the voice of the voice section whose degree of coincidence is less than the predetermined value, and
the display means displays the lyrics specified by the lyrics specifying means.

The karaoke apparatus according to claim 2, wherein the evaluation means obtains a degree of coincidence between a formant frequency of the singing voice and a formant frequency of the model voice.