JP2008039833A

JP2008039833A - Voice evaluation apparatus

Info

Publication number: JP2008039833A
Application number: JP2006209920A
Authority: JP
Inventors: Tatsuya Iriyama; 達也入山
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-08-01
Filing date: 2006-08-01
Publication date: 2008-02-21

Abstract

PROBLEM TO BE SOLVED: To provide a voice evaluation technique that evaluates the volume change within each syllable in a karaoke apparatus that scores singing.

SOLUTION: A comparison evaluation unit 7 reads the silent region reference data and the singer voice silent region data stored in a predetermined area of a storage unit 14, compares the silent region times of the two for each syllable, and determines the amount of deviation, which is reflected in the score of the singer's voice. In this way, the time axis of the singer voice data is expanded or contracted to match the time axis of the guide vocal data, and the volume change is compared for each syllable delimited by the syllable break data, so that even subtle changes within each syllable can be evaluated. As a result, an accurate scoring result can be output, and for points that should be improved, guidance can be given by indicating the correction points syllable by syllable.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a technique for evaluating the volume change within each syllable in a karaoke apparatus that scores singing.

Some karaoke apparatuses have a scoring function that displays the skill of a singer's performance as a score. Among such scoring functions, some compare data such as pitch data and volume data extracted from the singer's singing voice signal with data corresponding to the vocal melody of the karaoke song (the guide melody), so that the scoring result corresponds as closely as possible to the actual skill of the singing (see, for example, Patent Document 1).
Patent Document 1: Japanese Patent Laid-Open No. 10-69216

A karaoke apparatus with such a scoring function can score by comparing the volume change of each note, taking one note as the unit. However, because this scoring function compares the singer's singing voice against a guide melody encoded in MIDI (Musical Instruments Digital Interface: registered trademark) format, the scoring goes no further than the notes written in the score. In actual singing, the volume varies in many ways even within a single note. For example, techniques such as a crescendo, which gradually increases the volume, and staccato, which cuts the sound short, can be applied within one note.
Therefore, when scoring is based on the guide melody, the result sometimes fails to reflect the actual difference in skill between a singer who sings close to the model singing (hereinafter referred to as the guide vocal) and one who does not.

The present invention has been made in view of the above circumstances, and its object is to provide a voice evaluation apparatus that compares and evaluates the guide vocal and the singer's singing voice with respect to the volume change within a note of a karaoke song.

To solve the above problem, the present invention provides a voice evaluation apparatus comprising: storage means for storing first voice data representing the singing voice of a song and voice break data indicating the breaks in that singing voice; reading means for reading the first voice data and the voice break data from the storage means as the song progresses; voice input means for receiving the singer's voice, converting it into second voice data, and outputting it; syllable association means for analyzing the second voice data with reference to the first voice data and the voice break data read by the reading means, and cutting out from the second voice data the syllable portions corresponding to the syllables of the voice represented by the first voice data; and evaluation means for comparing, for the first and second voice data, the manner of volume change in each syllable portion associated by the syllable association means, and performing an evaluation according to the comparison result.

In another preferred aspect, for the first and second voice data corresponding to an associated syllable portion, the syllable association means may expand or contract the time width of the second voice data so that it equals the time width of the first voice data.

In another preferred aspect, the evaluation means may divide each syllable portion associated by the syllable association means into sections using a predetermined volume level as a threshold, and perform the evaluation based on the length of the section judged to be quieter than the threshold or the length of the section judged to be louder than the threshold. The evaluation means may also set the threshold automatically for each associated syllable portion based on the volume of that syllable portion. Furthermore, the period until the volume of each syllable portion first exceeds the threshold may be excluded from the evaluation.

In another preferred aspect, the evaluation means may extract and compare the volume change rate of each syllable portion associated by the syllable association means.

By expanding or contracting the time axis of the singer voice data to match the time axis of the guide vocal data and comparing the volume change of each syllable delimited by the syllable break data, even subtle changes within each syllable can be evaluated.

An embodiment of the present invention is described below.

<Embodiment>
FIG. 1 is a block diagram illustrating the hardware configuration of a karaoke apparatus 1 serving as a voice evaluation apparatus according to an embodiment of the invention. A CPU (Central Processing Unit) 11 reads a computer program stored in a ROM (Read Only Memory) 12 or a storage unit 14, loads it into a RAM (Random Access Memory) 13, and executes it, thereby controlling each part of the karaoke apparatus 1. The storage unit 14 is a large-capacity storage means such as a hard disk, and has a song data storage area 14a and a singer voice data storage area 14b. The display unit 15 is, for example, a liquid crystal display and, under the control of the CPU 11, displays various screens such as a menu screen for operating the karaoke apparatus 1 and a karaoke screen in which the lyrics telop is superimposed on a background image. The operation unit 16 has various keys and outputs a signal corresponding to the pressed key to the CPU 11. The microphone 17 picks up the voice produced by the singer. The voice processing unit 18 A/D-converts the voice picked up by the microphone 17 and supplies it to the CPU 11. The speaker 19 is connected to the voice processing unit 18 and emits sound based on the audio signal output from the voice processing unit 18.

The song data storage area 14a stores song data for a plurality of karaoke songs. Each item of song data has a guide melody track, an accompaniment data track, a lyrics data track, a guide vocal track, and a syllable break data track.

The guide melody track is data representing the melody of the song's vocal part. It contains event data for each note, such as note-on (an instruction to sound), velocity (the strength of the sound), and note-off (an instruction to silence), together with delta time data indicating the time until the next event data is read and executed. The accompaniment data track consists of multiple tracks, one for each accompaniment instrument, and each instrument track has the same data structure as the guide melody track described above. In this embodiment, the data is stored in MIDI (registered trademark) format.

The lyrics data track contains text data representing the lyrics of the song, line feed data indicating line breaks in the lyrics, and wipe start time data indicating the wipe start time of each character of the lyrics. When the song is played by the karaoke apparatus and the lyrics telop is displayed on the screen, color change is controlled so that each character begins changing color from its left side at its wipe start time and finishes changing color across the whole character when the wipe start time of the next character is reached. Wipe start time data is also attached to the line feed data, so for the last character displayed on a line, the interval between the wipe start time of that character and the wipe start time of the line feed data is the color change time of that character. The speed at which each character changes color is determined from the number of dots across the width of the character and its color change time (the difference between the wipe start time of the character and that of the next character).
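The wipe speed rule can be illustrated with a short sketch. The function name `wipe_speed` and the choice of dots per second as the unit are assumptions made for illustration; the source states only that the speed is determined from the dot count and the color change time.

```python
def wipe_speed(char_width_dots: int, wipe_start: float, next_wipe_start: float) -> float:
    """Return the color-change speed of one lyric character in dots per second.

    The color change time is the gap between this character's wipe start
    time and the next one (or the line feed's, for the last character).
    """
    duration = next_wipe_start - wipe_start
    if duration <= 0:
        raise ValueError("next wipe start time must be later than this one")
    return char_width_dots / duration


# A 24-dot-wide character whose wipe runs from 10.0 s to 10.5 s.
speed = wipe_speed(24, 10.0, 10.5)
```

For the last character of a line, the same call would take the wipe start time of the line feed data as its third argument.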

The guide vocal track is voice data recording the singing voice of a model singer (hereinafter referred to as guide vocal data), for example in WAVE format or MP3 (MPEG Audio Layer-3) format. As shown in FIG. 2, the syllable break data track is syllable break data indicating the times at which the guide vocal is divided into syllables, and gives the time at which each syllable is uttered as t_1, t_2, t_3, and so on. In the figure, the vertical axis is the volume of the guide vocal, the horizontal axis is the time axis along which the guide vocal progresses, and the lyrics corresponding to each syllable are shown along the top. If the guide vocal data carries information such as frame numbers or sample numbers, these may be used in place of time information as the syllable break data.

The singer voice data storage area 14b stores, in time series, the voice data picked up by the microphone 17 and A/D-converted by the voice processing unit 18 (hereinafter referred to as singer voice data), for example in WAVE or MP3 format.

Next, the functions realized when the CPU 11 executes a computer program stored in the ROM 12 or the storage unit 14 are described. FIG. 3 is a block diagram showing the functions realized by the CPU 11.

In the figure, the guide vocal volume extraction unit 2 reads the guide vocal data and the syllable break data from the song data storage area 14a, extracts the volume of the guide vocal, and creates guide vocal volume data. It also divides the guide vocal volume data syllable by syllable at the syllable break times in the syllable break data, generating guide vocal syllable unit volume data for every syllable. For example, as shown in FIG. 2, the guide vocal syllable unit volume data for the syllable "a" (あ) is volume data representing the change in volume between t_1 and t_2.
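A minimal sketch of this two-step extraction follows. The source does not say how volume is measured, so the RMS value of fixed-length frames is an assumption, as are the names `frame_volume` and `split_by_syllable` and the representation of the break times as a list of seconds.

```python
import math


def frame_volume(samples, frame_len):
    """RMS volume of consecutive non-overlapping frames (RMS is an assumption)."""
    volumes = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        volumes.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return volumes


def split_by_syllable(volume, frame_sec, breaks):
    """Cut a volume curve at the syllable break times t_1, t_2, ... (seconds).

    Syllable n covers the frames from breaks[n] up to breaks[n + 1].
    """
    units = []
    for t_start, t_end in zip(breaks, breaks[1:]):
        first = round(t_start / frame_sec)
        last = round(t_end / frame_sec)
        units.append(volume[first:last])
    return units


# Eight 0.1 s frames of volume, two syllables breaking at 0.0 s, 0.4 s and 0.8 s.
curve = [0.1, 0.5, 0.6, 0.2, 0.7, 0.8, 0.1, 0.0]
units = split_by_syllable(curve, 0.1, [0.0, 0.4, 0.8])
```

Each element of `units` plays the role of one item of syllable unit volume data.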

The guide vocal silent region extraction unit 3 extracts a silent region time for each syllable from the guide vocal syllable unit volume data, as reference data for comparison with the singer's voice, and creates silent region reference data that associates a silent region time with each syllable. For example, if the guide vocal syllable unit volume data of the n-th syllable shows the volume change in FIG. 4(a), the region from time t_nth, at which the volume falls to or below the threshold volume V_nth, is regarded as silent, and the silent region time up to the next syllable break time t_{n+1}, namely t_noff = t_{n+1} - t_nth, is created as the silent region reference data of the n-th syllable. The threshold volume V_nth is set automatically, for example as follows. If the volume change within the n-th syllable shown in FIG. 4(a) is divided into frames of a predetermined length and the frequency of each volume level is expressed as a number of frames, the result can be represented as the histogram in FIG. 4(b). Let V_nmin be the minimum volume of the n-th syllable and V_nmed its median volume; the threshold volume is then set automatically as V_nth = (V_nmin + V_nmed) / 2. Determining the threshold in this way makes detection less susceptible to ambient noise entering the microphone 17 and therefore more accurate. If the ambient noise level is constant, the threshold volume may instead be fixed at a certain level.
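The threshold rule and the silent region time can be sketched as below. The function names are hypothetical. Skipping the frames before the volume first exceeds the threshold is one reading of the text (it follows the option mentioned earlier of excluding that period), and returning 0.0 when the volume never drops back is likewise an assumption.

```python
from statistics import median


def threshold_volume(unit):
    """V_nth = (V_nmin + V_nmed) / 2 for one syllable's frame volumes."""
    return (min(unit) + median(unit)) / 2


def silent_region_time(unit, frame_sec):
    """t_noff: time from the drop to or below V_nth until the next syllable break."""
    v_th = threshold_volume(unit)
    voiced = False
    for i, v in enumerate(unit):
        if v > v_th:
            voiced = True          # the syllable has started sounding
        elif voiced:
            # First frame at or below the threshold after the onset is t_nth.
            return (len(unit) - i) * frame_sec
    return 0.0                     # never dropped back: no silent region found


# One syllable sampled in 0.1 s frames.
unit = [0.0, 0.6, 0.8, 0.8, 0.6, 0.1, 0.0, 0.0]
t_off = silent_region_time(unit, 0.1)
```

Here the minimum is 0.0 and the median is 0.35, so the threshold is 0.175 and the last three frames count as silent.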

The alignment unit 4 adjusts the temporal offset between the syllables of the guide vocal and those of the singer's voice. As shown in FIG. 5, when the guide vocal (FIG. 5(a)) and the singer's voice (FIG. 5(b)) are offset, the time axis of the singer's voice must be expanded or contracted to match that of the guide vocal by DTW (Dynamic Time Warping: time normalization) so that the two can be compared accurately. This embodiment uses DP (Dynamic Programming) matching as the method of performing DTW. Specifically, the processing is as follows.

The alignment unit 4 forms a coordinate plane such as that shown in FIG. 6 (hereinafter referred to as the DP plane) in the RAM 13. The vertical axis of the DP plane corresponds to parameters obtained from the guide vocal data: the data is divided into frames of a predetermined length, an FFT (Fast Fourier Transform) is applied to each frame, and an inverse Fourier transform is applied to the logarithm of the absolute value of each frame's spectrum. The horizontal axis corresponds to the parameters obtained in the same way from the singer voice data. In FIG. 6, a1, a2, a3, ..., an are the frames of the guide vocal data arranged along the time axis, and b1, b2, b3, ..., bn are the frames of the singer voice data arranged along the time axis. The spacing of a1, a2, a3, ..., an on the vertical axis and of b1, b2, b3, ..., bn on the horizontal axis both correspond to the frame length. Each grid point of the DP plane is assigned a DP matching score, a value indicating the Euclidean distance between the corresponding parameters of a1, a2, a3, ... and b1, b2, b3, .... For example, the grid point located by a1 and b1 is assigned a value indicating the Euclidean distance between the parameters obtained from the first frame of the guide vocal data and those obtained from the first frame of the singer voice data. After forming a DP plane with this structure, the alignment unit 4 searches all paths from the grid point located by a1 and b1 (the start) to the grid point located by an and bn (the end), accumulates for each path the DP matching scores of the grid points traversed from start to end, and finds the minimum accumulated value. The path with the smallest accumulated DP matching score is used as the measure of expansion and contraction when the time axis of each frame of the singer voice data is warped to match the time axis of the guide vocal data.
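The search over the DP plane can be sketched with ordinary dynamic programming, which finds the minimum accumulated score without enumerating every path. This is an illustrative sketch, not the patent's implementation: the per-frame parameters are taken as ready-made vectors, the function name `dp_match` is hypothetical, and the allowed moves (one step right, one step up, or one diagonal step) are an assumption, since the source only shows diagonal and rightward steps in its example.

```python
import math


def dp_match(guide, singer):
    """Return (minimum accumulated score, path) over the DP plane.

    guide[i] and singer[j] are per-frame parameter vectors; the score at
    grid point (i, j) is their Euclidean distance.
    """
    n, m = len(guide), len(singer)
    acc = [[math.inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            score = math.dist(guide[i], singer[j])
            if i == 0 and j == 0:
                acc[i][j] = score
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append((acc[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:
                candidates.append((acc[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((acc[i][j - 1], (i, j - 1)))
            best, prev = min(candidates)
            acc[i][j] = best + score
            back[i][j] = prev
    # Trace back from the end grid point to the start.
    path = [(n - 1, m - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        i, j = path[-1]
        path.append(back[i][j])
    path.reverse()
    return acc[n - 1][m - 1], path


# The singer holds the second sound for two frames instead of one.
guide = [(0.0,), (1.0,), (2.0,)]
singer = [(0.0,), (1.0,), (1.0,), (2.0,)]
cost, path = dp_match(guide, singer)
```

In this toy input the second guide frame pairs with the second and third singer frames, which mirrors the a2, b2, b3 case described for FIG. 6.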

The alignment unit 4 then identifies the path on the DP plane with the smallest accumulated DP matching score and performs alignment processing, which expands or contracts the time axis of the singer voice data according to that path. Specifically, it rewrites the time stamp of each frame of the singer voice data so that the DP matching score of every grid point on the identified path represents the Euclidean distance between parameters obtained from frames at the same position on the time axis, and then associates the frames at the same position on the time axis as sets, one after another. For example, the path drawn on the DP plane in FIG. 6 advances from the start point located by a1 and b1 to the grid point at its upper right located by a2 and b2. In this case the frames a2 and b2 are at the same position on the time axis from the outset, so the time stamp of frame b2 does not need to be rewritten. The path then advances from the grid point located by a2 and b2 to the grid point to its right located by a2 and b3. In this case frame b3 as well as frame b2 must be at the same position on the time axis as frame a2, so the time stamp paired with frame b3 is replaced with one that is one frame earlier. As a result, frame a2 and frames b2 and b3 are associated as a set of frames at the same position on the time axis. This replacement of time stamps and association of frames is carried out over the whole frame range from b1 to bn. Consequently, even if the sounding time of the singer's voice deviates in places from that of the guide vocal, as in FIG. 5(b), the time axis of the singer voice data can be expanded or contracted to match the time axis of the guide vocal data, as in FIG. 5(c). This is how DP matching works.
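The time stamp rewriting can be sketched as follows, assuming the path is available as a list of (guide frame index, singer frame index) pairs. That representation and the name `rewrite_timestamps` are hypothetical. The source only illustrates several singer frames sharing one guide frame; keeping the first guide frame when one singer frame meets several guide frames is an assumption.

```python
def rewrite_timestamps(path, frame_sec):
    """Give each singer frame the time of the guide frame it is paired with."""
    stamps = {}
    for guide_idx, singer_idx in path:
        # setdefault keeps the first pairing seen for a singer frame.
        stamps.setdefault(singer_idx, guide_idx * frame_sec)
    return [stamps[j] for j in sorted(stamps)]


# Path matching the FIG. 6 description: a2 is paired with both b2 and b3.
path = [(0, 0), (1, 1), (1, 2), (2, 3)]
new_stamps = rewrite_timestamps(path, 0.5)
```

The third singer frame ends up with the same time stamp as the second, one frame earlier than its original position.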

The singer voice volume extraction unit 5 takes the time-warped singer voice data produced by the alignment unit 4 and, in the same way as the guide vocal volume extraction unit 2, extracts the volume of the singer's voice to create singer voice volume data and generates singer voice syllable unit volume data for each syllable.

The singer voice silent region extraction unit 6, like the guide vocal silent region extraction unit 3, extracts a silent region time for each syllable (for example, t'_noff for the n-th syllable) from the singer voice syllable unit volume data, and creates singer voice silent region data that associates a silent region time with each syllable.

The comparison evaluation unit 7 obtains the silent region reference data from the guide vocal silent region extraction unit 3 and the singer voice silent region data from the singer voice silent region extraction unit 6, compares the silent region times of the guide vocal and the singer's voice syllable by syllable, and evaluates the length of the sound of each syllable. For example, the silent region time t_3off of the third syllable of the guide vocal in FIG. 5(a) is compared with the silent region time t'_3off of the third syllable of the time-warped singer's voice in FIG. 5(c); if t_3off > t'_3off, the third syllable is evaluated as having a shorter silent region time in the singer's voice, that is, a longer sounding time.
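The per-syllable comparison amounts to a sign test on the two silent region times. In the sketch below, the function name `judge_length`, the three labels, and the `tolerance` band are all assumptions; the source only states the t_noff > t'_noff case.

```python
def judge_length(t_off_guide: float, t_off_singer: float, tolerance: float = 0.05) -> str:
    """Classify how long the singer held a syllable relative to the guide vocal."""
    diff = t_off_guide - t_off_singer
    if diff > tolerance:
        return "long"    # singer's silence is shorter, so the sound lasted longer
    if diff < -tolerance:
        return "short"   # singer's silence is longer, so the sound was cut early
    return "ok"


verdict = judge_length(0.30, 0.10)
```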

Next, the operation of the karaoke apparatus 1 is described. The singer operates the operation unit 16 of the karaoke apparatus 1 to select the song to be sung and instructs playback of the accompaniment. The CPU 11 starts processing in response to this instruction. The CPU 11 first reads the accompaniment data track of the selected song from the song data storage area 14a and supplies it to the voice processing unit 18. The voice processing unit 18 converts the supplied accompaniment data into an analog audio signal and supplies it to the speaker 19, which emits the sound. At the same time, the CPU 11 reads the lyrics data track from the song data storage area 14a and controls the display unit 15 to display the lyrics, changing the color of the lyric characters as the song progresses. The singer sings along with the accompaniment emitted from the speaker 19. The singer's voice is picked up by the microphone 17, converted into an audio signal, and supplied to the voice processing unit 18. The singer voice data A/D-converted by the voice processing unit 18 is stored in time series in the singer voice data storage area 14b of the storage unit 14.

When playback of the accompaniment data ends, the CPU 11 performs the processing of the alignment unit 4. That is, it reads the guide vocal data from the song data storage area 14a and the singer voice data from the singer voice data storage area 14b. Then, by DP matching, it expands or contracts the time axis of the singer voice data to match the time axis of the guide vocal data, rewrites the time stamps of the singer voice data, and stores the result in the singer voice data storage area 14b of the storage unit 14.

Next, the CPU 11 performs the processing of the guide vocal volume extraction unit 2 and the singer voice volume extraction unit 5. The guide vocal volume extraction unit 2 generates, from the guide vocal data and the syllable break data read from the song data storage area 14a, guide vocal syllable unit volume data for every syllable of the guide vocal, associates it with the corresponding syllable, and stores it in a predetermined area of the storage unit 14. In the same way, the singer voice volume extraction unit 5 generates singer voice syllable unit volume data for each syllable of the singer voice data whose time stamps have been rewritten, associates it with the corresponding syllable, and stores it in a predetermined area of the storage unit 14.

Next, the CPU 11 performs the processing of the guide vocal silent region extraction unit 3 and the singer voice silent region extraction unit 6. The guide vocal silent region extraction unit 3 reads the guide vocal syllable unit volume data for all syllables from the predetermined area of the storage unit 14, calculates a threshold volume for each syllable of the guide vocal, calculates the silent region time of every syllable, and stores the result in a predetermined area of the storage unit 14 as silent region reference data. In the same way, the singer voice silent region extraction unit 6 reads the singer voice syllable unit volume data for all syllables from the predetermined area of the storage unit 14, calculates a threshold volume for each syllable of the singer's voice, calculates the silent region time of every syllable, and stores the result in a predetermined area of the storage unit 14 as singer voice silent region data.

Next, the CPU 11 performs the processing of the comparison evaluation unit 7. The comparison evaluation unit 7 reads the silent region reference data and the singer voice silent region data from the predetermined area of the storage unit 14 and compares the two silent region times for each syllable. It compares each syllable of the singer's voice with the silent region time of the corresponding syllable of the guide vocal, determines the amount of deviation, and reflects it in the score of the singer's voice. To guide the singer, the display unit 15 may also show which syllables deviate and how. In that case, the silent region time of each syllable of the lyrics may be displayed as shown in FIG. 7, for example. Here, the horizontal axis represents the syllables of the lyrics. The vertical axis is the silent region time of each syllable normalized by the total duration of that syllable as delimited by the syllable break data; a larger value means a longer silent region time, that is, a shorter sounding time. The value displayed for the n-th syllable is t_noff / (t_{n+1} - t_n). In the figure, the guide vocal is treated as the teacher's voice and the singer's voice as the student's voice.
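The normalized value plotted in FIG. 7 can be computed as in the sketch below. The helper names and the tuple layout of the result are assumptions, and the mapping from deviation to a score is left out because the source does not specify it.

```python
def normalized_silence(t_off, t_start, t_next):
    """t_noff / (t_{n+1} - t_n): silent region time as a fraction of the syllable."""
    return t_off / (t_next - t_start)


def deviations(breaks, guide_off, singer_off):
    """Per syllable: (index, guide ratio, singer ratio, singer minus guide)."""
    rows = []
    for n, (t_start, t_next) in enumerate(zip(breaks, breaks[1:])):
        g = normalized_silence(guide_off[n], t_start, t_next)
        s = normalized_silence(singer_off[n], t_start, t_next)
        rows.append((n, g, s, s - g))
    return rows


# Two syllables: breaks at 0.0 s, 1.0 s and 3.0 s.
rows = deviations([0.0, 1.0, 3.0], guide_off=[0.25, 0.5], singer_off=[0.5, 0.5])
```

A positive last element means the singer's silent region is longer than the guide's for that syllable, so the sound was held for a shorter time.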

In this way, the time axis of the singer voice data is expanded or contracted to match the time axis of the guide vocal data, and the volume change is compared for each syllable delimited by the syllable break data, so that even subtle changes within each syllable can be evaluated. A highly accurate scoring result can therefore be produced, and for points that should be improved, the singer can be coached with the correction points indicated syllable by syllable.

An embodiment of the present invention has been described above. The present invention may also be implemented by modifying the embodiment, for example, as follows.

<Modification 1>
In the embodiment, the lengths of the syllable sounds of the guide vocal and the singer's voice are compared. As changes in loudness, however, a crescendo (gradually increasing volume) or a decrescendo (gradually decreasing volume) may be evaluated instead. In this case, rather than extracting the silent region as in the embodiment, the volume change rate α within each syllable is obtained, as shown in FIG. 8, by taking the volume change of the syllable and applying a first-order (linear) approximation or the like, and the guide vocal and the singer's voice are then compared. The section over which the volume change rate is compared may be set as appropriate. For example, as shown in FIG. 8, when the time occupied by the syllable is taken as 100%, the section may run from time t_ns, which is the time at which the syllable starts sounding plus 30% of that duration, to time t_ne, which is the same starting time plus 70% of that duration. In this way, the intonation within each syllable can also be evaluated, and a more accurate scoring result can be produced.
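The volume change rate α of Modification 1 can be illustrated with a least-squares line fit over the 30% to 70% window. This is a sketch under assumptions: the volume is sampled at a constant frame period, the window boundaries are rounded to whole frames, and an ordinary least-squares slope stands in for the "first-order approximation or the like" mentioned in the text. The names `volume_change_rate`, `frame_sec`, `start`, and `end` are hypothetical.

```python
from typing import Sequence


def volume_change_rate(volume: Sequence[float], frame_sec: float,
                       start: float = 0.3, end: float = 0.7) -> float:
    """Least-squares slope of volume versus time inside one syllable.

    volume    : volume samples of a single syllable, one per frame
    frame_sec : duration of one frame in seconds (assumed constant)
    start/end : window as fractions of the syllable duration (t_ns and t_ne)
    """
    n = len(volume)
    lo, hi = round(n * start), round(n * end)
    ys = list(volume[lo:hi])
    if len(ys) < 2:
        raise ValueError("window too short for a line fit")
    xs = [(lo + i) * frame_sec for i in range(len(ys))]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope = covariance(x, y) / variance(x).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A positive result corresponds to a crescendo and a negative one to a decrescendo. The difference between the value obtained for the guide vocal and the value obtained for the singer's voice on the same syllable is one possible measure of the deviation to be scored.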

<Modification 2>
In the embodiment, the syllable break data is created in advance for each piece of music. Alternatively, the syllable break data may be created automatically from the spectrum obtained from the guide vocal data or from the detected/non-detected state of the pitch. This saves the labor of creating syllable break data for a large number of pieces of music.
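One possible reading of the pitch-based approach in Modification 2 is sketched below. It assumes that a pitch tracker has already produced one value per frame for the guide vocal, with `None` marking frames where no pitch was detected, and it places a syllable break at every transition from "not detected" to "detected". The spectrum-based detection mentioned in the text is not covered, and the function name `breaks_from_pitch` is hypothetical.

```python
from typing import List, Optional, Sequence


def breaks_from_pitch(pitch: Sequence[Optional[float]], frame_sec: float) -> List[float]:
    """Return the times (seconds) at which pitch goes from undetected to detected.

    pitch     : per-frame pitch in Hz, or None where no pitch was detected
    frame_sec : duration of one frame in seconds (assumed constant)
    """
    breaks: List[float] = []
    prev_detected = False
    for i, p in enumerate(pitch):
        detected = p is not None
        if detected and not prev_detected:
            # A new voiced run starts here: treat it as a syllable boundary.
            breaks.append(i * frame_sec)
        prev_detected = detected
    return breaks
```

Syllables sung legato, with no gap in the detected pitch, would not be separated by this rule alone, which is presumably where the spectral information would be needed.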

<Modification 3>
In the embodiment, the silent region times of the guide vocal and the singer's voice are extracted and compared. Alternatively, only the regions whose volume is at or above the threshold volume may be detected, and the sounded region time, during which the syllable is regarded as being sounded, may be extracted and compared. The portion other than the silent region may also be regarded as the sounded region.

<Modification 4>
In the embodiment, DP matching is used to expand or contract the time axis of the singer voice data so that it matches the time axis of the guide vocal data, and the singer's voice is then divided into syllables by the syllable break data. Alternatively, the spectrum or pitch of the guide vocal data and the singer voice data may be compared to detect the syllable of the singer's voice that corresponds to each syllable of the guide vocal, and the singer's voice may be divided into syllables on that basis. In this case, the time axes of the guide vocal syllables and the singer's syllables are not aligned, so the comparison should be made on the ratio of the silent region time to the time occupied by the whole syllable.

FIG. 1 is a block diagram showing the hardware configuration of a karaoke apparatus that is a voice evaluation apparatus according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing the times at which syllables are delimited, as held in the syllable break data.
FIG. 3 is a block diagram showing the software configuration of the karaoke apparatus that is the voice evaluation apparatus according to the embodiment of the present invention.
FIG. 4 is an explanatory diagram showing the method of detecting the silent region time.
FIG. 5 is an explanatory diagram showing the expansion and contraction of the singer voice data in time.
FIG. 6 is an explanatory diagram showing the DP plane used when performing DP matching.
FIG. 7 is an explanatory diagram showing an example of a screen on which the evaluation result of the singer's voice is displayed.
FIG. 8 is an explanatory diagram showing the volume change rate used in the voice evaluation method according to Modification 1.

Explanation of symbols

1: karaoke apparatus, 2: guide vocal volume extraction unit, 3: guide vocal silent region extraction unit, 4: alignment unit, 5: singer voice volume extraction unit, 6: singer voice silent region extraction unit, 7: comparative evaluation unit, 11: CPU, 12: ROM, 13: RAM, 14: storage unit, 14a: music data storage area, 14b: singer voice data storage area, 15: display unit, 16: operation unit, 17: microphone, 18: audio processing unit, 19: speaker

Claims

1. A voice evaluation apparatus comprising:
storage means for storing first voice data indicating a singing voice of a piece of music and voice break data indicating breaks between sounds of the singing voice;
reading means for reading the first voice data and the voice break data from the storage means as the music progresses;
voice input means for receiving a singer's voice, converting the received voice into second voice data, and outputting the second voice data;
syllable association means for analyzing the second voice data with reference to the first voice data and the voice break data read by the reading means, and cutting out from the second voice data the syllable portions corresponding to the syllables of the voice indicated by the first voice data; and
evaluation means for comparing the first and second voice data with respect to the change in volume in each syllable portion associated by the syllable association means, and performing an evaluation according to the comparison result.

2. The voice evaluation apparatus according to claim 1, wherein, for the first and second voice data corresponding to the associated syllable portions, the syllable association means expands or contracts the second voice data so that its time width becomes equal to the time width of the first voice data.

3. The voice evaluation apparatus according to claim 1, wherein the evaluation means divides each syllable portion associated by the syllable association means into sections using a predetermined volume level as a threshold, and performs the evaluation based on the length of the section determined to have a volume below the threshold or the length of the section determined to have a volume above the threshold.

4. The voice evaluation apparatus according to claim 3, wherein the evaluation means automatically sets the threshold for each syllable portion associated by the syllable association means, based on the volume of that syllable portion.

5. The voice evaluation apparatus according to claim 3, wherein the evaluation means excludes from the evaluation the period until the volume of each syllable portion exceeds the threshold.

6. The voice evaluation apparatus according to claim 1 or 2, wherein the evaluation means extracts and compares the volume change rate of each syllable portion associated by the syllable association means.