JP2007233077A - Evaluation device, control method, and program

Info

Publication number: JP2007233077A
Application number: JP2006055328A
Authority: JP
Inventors: Akane Noguchi; あかね野口
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-03-01
Filing date: 2006-03-01
Publication date: 2007-09-13

Abstract

PROBLEM TO BE SOLVED: To point out more accurately, for a trainee's singing or musical instrument performance, deviations in vocal timing from a model singing or performance, as well as vocal errors.

SOLUTION: A corresponding part detecting section 112 matches the time axes of model voice data and trainee voice data by DP matching and associates sounds at the same position on the time axes with each other. A vocal content comparing section 113 and a vocal timing comparing section 114 compare the vocal contents and vocal timings of the model voice data and the trainee voice data, and indicate the parts that disagree. The trainee thus becomes clearly aware of any vocal timing deviation or vocal error in his or her own singing, and can visually grasp where the disagreement lies and what it consists of.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a technique for showing a trainee the deviations in pronunciation timing and the pronunciation errors between a model singing (or performance) and the trainee's own singing (or performance).

A karaoke apparatus has a function of displaying lyric captions on a screen and changing the color of the captions in sequence in time with the accompaniment. With this function, the karaoke apparatus can guide the singer to pronounce the correct lyrics at the correct timing.

Despite such guidance, however, some singers pronounce later than the accompaniment, pronounce too early, or get the lyrics themselves wrong. It is desirable to point out such timing and lyric mistakes to the singer promptly, and a technique useful for this purpose is disclosed in Patent Document 1. In that technique, the note-on timing contained in the MIDI data representing the accompaniment is compared with the timing at which the singer's voice is picked up by the microphone, and the time lag between the two is detected.
Patent Document 1: JP 2005-173256 A

By applying the technique described in Patent Document 1, it should be possible to point out deviations in pronunciation timing to the singer. However, this technique merely compares the note-on timing with the sound pickup timing; it does not collate the words (sounds) that make up the lyrics against the singer's voice. Consequently, it cannot point out mistakes in the sung lyrics, and it cannot capture deviations in pronunciation timing accurately either. The latter problem arises, for example, when the pronunciation timing itself matches the accompaniment but each word of the lyrics is pronounced one beat early or one beat late.

Specifically, as illustrated in FIG. 7(a), consider the lyric "su-gi-sa-ri-shi hi-bi-no yu-me-o ..." (すぎさりしひびのゆめを…). Suppose that, within the period in which the word "ri" should be pronounced along with the accompaniment, the singer pronounces "ri" and then goes straight on to pronounce the next word "shi", and then pronounces the following word "hi" at the timing at which "shi" should have been pronounced. The technique of Patent Document 1 accepts any sound whatsoever: as long as some sound is produced at the correct timing for "shi", it is regarded as a correct pronunciation. This is the limit of applying the technique of Patent Document 1. Moreover, these problems arise not only when practicing singing but also when practicing a musical instrument by imitating a model performance.

The present invention has been made in view of the background described above. Its object is to show more accurately, for a trainee's singing or musical instrument performance, the deviations in pronunciation timing from the model singing or performance and the pronunciation errors relative to it.

To solve the above problem, the present invention provides an evaluation apparatus comprising: first storage means for storing first sound data representing a plurality of phonemes whose pronunciation timings follow one another in time series, in association with the pronunciation timing of each phoneme; second storage means for storing second sound data supplied from sound pickup means that picks up sound; corresponding part detection means for associating the first sound data and the second sound data with each other in units of frames of a predetermined time length and generating corresponding part data representing the associated frames; comparison means for specifying the pronunciation timing of a phoneme represented by the second sound data on the basis of the pronunciation timing of the phoneme represented by the first sound data and the corresponding part data, and determining whether the difference between the pronunciation timing of the phoneme represented by the first sound data and the pronunciation timing of the phoneme represented by the second sound data exceeds a threshold; and output means for outputting information specifying a phoneme for which the comparison means has determined that the difference exceeds the threshold.

According to the present invention, associating the first sound data and the second sound data in units of frames of a predetermined time length makes it possible to associate the phonemes represented by the first sound data accurately with the phonemes represented by the second sound data. Deviations in pronunciation timing can therefore be shown more accurately.

The present invention also provides an evaluation apparatus comprising: first storage means for storing first sound data representing a plurality of phonemes whose pronunciation timings follow one another in time series, in association with the pronunciation timing of each phoneme; second storage means for storing second sound data supplied from sound pickup means that picks up sound; corresponding part detection means for associating the first sound data and the second sound data with each other in units of frames of a predetermined time length and generating corresponding part data representing the associated frames; comparison means for comparing the phoneme represented by the first sound data with the phoneme represented by the second sound data in units of the frames represented by the corresponding part data, and determining whether the difference between the phoneme represented by the first sound data and the phoneme represented by the second sound data exceeds a threshold; and output means for outputting information specifying a phoneme for which the comparison means has determined that the difference exceeds the threshold.

According to the present invention, associating the first sound data and the second sound data in units of frames of a predetermined time length makes it possible to associate the phonemes represented by the first sound data accurately with the phonemes represented by the second sound data. Differences between these sounds can therefore be shown more accurately.

The present invention also provides a control method for an evaluation apparatus that comprises first storage means for storing first sound data representing a plurality of phonemes whose pronunciation timings follow one another in time series, in association with the pronunciation timing of each phoneme, second storage means for storing second sound data supplied from sound pickup means that picks up sound, and control means. The control method comprises: a step in which the control means associates the first sound data and the second sound data with each other in units of frames of a predetermined time length and generates corresponding part data representing the associated frames; a step in which the control means specifies the pronunciation timing of a phoneme represented by the second sound data on the basis of the pronunciation timing of the phoneme represented by the first sound data and the corresponding part data, and determines whether the difference between the pronunciation timing of the phoneme represented by the first sound data and the pronunciation timing of the phoneme represented by the second sound data exceeds a threshold; and a step in which the control means outputs information specifying a sound for which the difference has been determined to exceed the threshold.

The present invention also provides a control method for an evaluation apparatus that comprises first storage means for storing first sound data representing a plurality of phonemes whose pronunciation timings follow one another in time series, in association with the pronunciation timing of each phoneme, second storage means for storing second sound data supplied from sound pickup means that picks up sound, and control means. The control method comprises: a step in which the control means associates the first sound data and the second sound data with each other in units of frames of a predetermined time length and generates corresponding part data representing the associated frames; a step in which the control means compares the phoneme represented by the first sound data with the phoneme represented by the second sound data in units of the frames represented by the corresponding part data, and determines whether the difference between the phoneme represented by the first sound data and the phoneme represented by the second sound data exceeds a threshold; and a step in which the control means reports information specifying a sound for which the difference has been determined to exceed the threshold.

Furthermore, the present invention may also take the form of a program that causes a computer to realize these functions. In the present invention, the term "pronunciation" covers not only the voice produced when a person sings but also the performance sound produced by playing a musical instrument. Likewise, each "phoneme" in the present invention is a sound that is consciously produced as a single unit; any sound for which it is meaningful to point out a timing deviation or a pronunciation error qualifies.

According to the present invention, deviations in pronunciation timing and pronunciation errors in a trainee's singing or musical instrument performance, relative to the model singing or performance, can be shown more accurately.

Next, the best mode for carrying out the present invention will be described.
1. Configuration
FIG. 1 is a block diagram illustrating the hardware configuration of a karaoke apparatus 1 serving as an evaluation apparatus according to an embodiment of the present invention. A CPU (Central Processing Unit) 11 reads a computer program stored in a ROM (Read Only Memory) 12 or a storage unit 14, loads it into a RAM (Random Access Memory) 13, and executes it, thereby controlling each part of the karaoke apparatus 1. The storage unit 14 is a large-capacity storage means such as a hard disk, and has an accompaniment data storage area 14a, a model voice data storage area 14b, a lyrics data storage area 14c, and a trainee voice data storage area 14d. The display unit 15 is, for example, a liquid crystal display and, under the control of the CPU 11, displays various screens such as a menu screen for operating the karaoke apparatus 1 and a karaoke screen in which lyric captions are superimposed on a background image. The operation unit 16 has various keys and outputs a signal corresponding to the pressed key to the CPU 11. The microphone 17 is sound pickup means that picks up the voice produced by the singer. The audio processing unit 18 converts the voice (analog data) picked up by the microphone 17 into digital data and supplies it to the CPU 11. The speaker 19 is connected to the audio processing unit 18 and emits the sound output from the audio processing unit 18.

The accompaniment data storage area 14a of the storage unit 14 stores accompaniment data, for example in MIDI (Musical Instrument Digital Interface) format, in which information indicating the pitch of each instrument accompanying a song is written along the progression of the song. The model voice data storage area 14b stores audio data, for example in WAVE or MP3 (MPEG Audio Layer-3) format, representing the voice produced by a singer following the accompaniment represented by the accompaniment data (hereinafter, the model voice); this audio data is hereinafter called model voice data. The lyrics data storage area 14c stores lyrics data indicating the lyrics corresponding to the model voice data.

FIG. 2 illustrates the correspondence between the model voice data and the lyrics data. As shown, the lyrics data contains each word (phoneme) that makes up the lyrics together with a pronunciation timing representing the time at which that phoneme should be pronounced. The model voice data, the words (phonemes) that make up the lyrics, and the pronunciation timing of each phoneme are associated with one another. In the example of FIG. 2, for the lyric "su-gi-sa-ri-shi hi-bi-no yu-me-o ..." (すぎさりしひびのゆめを…), pronunciation of "su" starts at pronunciation timing T1, "gi" at pronunciation timing T2, "sa" at pronunciation timing T3, and so on. Each pronunciation timing is expressed as the time elapsed since the accompaniment based on the accompaniment data started.
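The lyrics data described above can be pictured as a time-ordered list of (phoneme, pronunciation timing) pairs. The following is a minimal sketch only; the class name `LyricPhoneme`, the helper `phoneme_at`, and the timing values standing in for T1, T2 and T3 are hypothetical and do not come from the publication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LyricPhoneme:
    text: str          # the word (phoneme), romanized here for readability
    onset_sec: float   # pronunciation timing: seconds since the accompaniment started


# Hypothetical timings standing in for T1, T2, T3 of FIG. 2.
LYRICS = [
    LyricPhoneme("su", 1.0),
    LyricPhoneme("gi", 1.5),
    LyricPhoneme("sa", 2.0),
]


def phoneme_at(lyrics, t_sec):
    """Return the phoneme whose pronunciation has most recently started at t_sec.

    Returns None if t_sec is earlier than the first pronunciation timing.
    Assumes `lyrics` is sorted by onset_sec.
    """
    current = None
    for p in lyrics:
        if p.onset_sec <= t_sec:
            current = p
        else:
            break
    return current
```

For instance, `phoneme_at(LYRICS, 1.7)` yields the entry for "gi", because 1.7 s falls between the assumed timings of "gi" and "sa".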

Returning to FIG. 1:
The trainee voice data storage area 14d stores, in time series, the audio data that has passed from the microphone 17 through the audio processing unit 18 and been A/D converted, for example in WAVE or MP3 (MPEG Audio Layer-3) format. Because this audio data represents the voice of the trainee (hereinafter, the trainee voice), it is hereinafter called trainee voice data. The CPU 11 compares the trainee voice data with the model voice data described above to detect where the pronunciation timing or the pronunciation content differs, and reports those points to the trainee, for example by displaying them on the display unit 15. By referring to the report, the trainee can recognize where his or her pronunciation timing is off and where the pronunciation is wrong. In the following description, for convenience, "model voice data" and "trainee voice data" are collectively called "voice data" where they need not be distinguished.

Next, the software configuration of the karaoke apparatus 1 will be described with reference to the block diagram of FIG. 3. The basic analysis unit 111, corresponding part detection unit 112, pronunciation content comparison unit 113, pronunciation timing comparison unit 114, and notification unit 115 shown in FIG. 3 are realized by the CPU 11 executing a computer program stored in the ROM 12 or the storage unit 14. The arrows in the figure schematically indicate the flow of data. In FIG. 3, the basic analysis unit 111 divides the model voice data read from the model voice data storage area 14b and the trainee voice data read from the trainee voice data storage area 14d into frames of a predetermined time length, applies an FFT (Fast Fourier Transform) to each frame, and calculates the spectrum of each set of voice data.
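As a rough illustration of the basic analysis step, the sketch below cuts a sample sequence into fixed-length frames and computes a magnitude spectrum per frame. It makes several assumptions not stated in the publication: frames do not overlap, no window function is applied, and a naive O(N²) discrete Fourier transform stands in for the FFT so that the example needs only the standard library. The function names are hypothetical.

```python
import cmath
import math


def split_frames(samples, frame_len):
    """Split samples into consecutive, non-overlapping frames of frame_len samples.

    A trailing remainder shorter than frame_len is dropped (an assumption).
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def magnitude_spectrum(frame):
    """Magnitude of the DFT of one frame (naive stand-in for an FFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n)]


# Toy signal: a sine with exactly 2 cycles per 8-sample frame.
signal = [math.sin(2 * math.pi * 2 * t / 8) for t in range(16)]
frames = split_frames(signal, 8)
spectra = [magnitude_spectrum(f) for f in frames]
```

With this toy signal, each frame's spectrum peaks at bin 2, which is where a component with two cycles per frame is expected to appear.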

Based on the spectrum of each set of voice data calculated by the basic analysis unit 111, the corresponding part detection unit 112 determines the correspondence (corresponding parts) between the phonemes (words) contained in the model voice data and the phonemes (words) contained in the trainee voice data. These corresponding parts of the trainee voice and the model voice are supplied from the corresponding part detection unit 112 to the pronunciation content comparison unit 113 and the pronunciation timing comparison unit 114. The pronunciation content comparison unit 113 compares the pronunciation content of the model voice with the pronunciation content of the corresponding trainee voice and detects where they differ. The pronunciation timing comparison unit 114 compares the pronunciation timing of the model voice with the pronunciation timing of the corresponding trainee voice and detects where they differ. The notification unit 115 generates information specifying the differences detected by the pronunciation content comparison unit 113 and the pronunciation timing comparison unit 114, together with various messages, and reports them to the trainee, for example by displaying them on the display unit 15.

As shown in FIG. 7(a), the model voice and the trainee voice may be shifted in time relative to each other. The corresponding part detection unit 112 therefore needs to perform time normalization (DTW: Dynamic Time Warping) by stretching and compressing the time axes of the two sets of voice data. In this embodiment, DP (Dynamic Programming) matching is used as the method for performing DTW. The processing is as follows.

The corresponding part detection unit 112 forms a coordinate plane as shown in FIG. 4 (hereinafter, the DP plane) in the RAM 13. The vertical axis of the DP plane corresponds to the parameters (cepstrum) obtained by applying an inverse Fourier transform to the logarithm of the absolute value of the spectrum of each frame of the model voice data, and the horizontal axis corresponds to the parameters (cepstrum) obtained in the same way from each frame of the trainee voice data. In FIG. 4, a1, a2, a3, ..., an are the frames of the model voice data arranged along the time axis, and b1, b2, b3, ..., bn are the frames of the trainee voice data arranged along the time axis. The spacing of a1, a2, a3, ..., an on the vertical axis and the spacing of b1, b2, b3, ..., bn on the horizontal axis both correspond to the time length of a frame. Each lattice point of the DP plane is associated with a DP matching score, a value indicating the Euclidean distance between the parameters of the corresponding frame among a1, a2, a3, ... and the parameters of the corresponding frame among b1, b2, b3, .... For example, the lattice point located by a1 and b1 is associated with the Euclidean distance between the parameters obtained from the first frame of the model voice data and the parameters obtained from the first frame of the trainee voice data. After forming a DP plane of this structure, the corresponding part detection unit 112 searches all paths from the lattice point located by a1 and b1 (the start) to the lattice point located by an and bn (the end), accumulates for each path the DP matching scores of the lattice points traversed from start to end, and finds the minimum accumulated value. The path with the smallest accumulated DP matching score is used as the measure of expansion and contraction when the time axis of the frames of the trainee voice data is stretched or compressed to fit the time axis of the model voice data.
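The search for the minimum-score path over the DP plane can be sketched as follows. This is an illustration under stated assumptions, not the publication's implementation: the publication speaks of searching all paths, whereas the sketch uses the standard dynamic-programming recurrence, which finds the same minimum without enumerating paths; the allowed moves (diagonal, vertical, horizontal) are assumed; and the feature vectors are arbitrary lists of floats standing in for cepstral parameters. The names `euclidean` and `dp_match` are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length parameter vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def dp_match(model, trainee):
    """Return (minimum accumulated score, path).

    model, trainee: lists of per-frame parameter vectors (a1..an, b1..bm).
    path: list of 0-based (i, j) index pairs from (0, 0) to (n-1, m-1).
    """
    n, m = len(model), len(trainee)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]     # accumulated DP matching scores
    back = [[None] * m for _ in range(n)]   # predecessor of each lattice point
    for i in range(n):
        for j in range(m):
            d = euclidean(model[i], trainee[j])  # score of this lattice point
            if i == 0 and j == 0:
                acc[i][j] = d
                continue
            # Assumed step pattern: diagonal, vertical, horizontal.
            candidates = []
            if i > 0 and j > 0:
                candidates.append((acc[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:
                candidates.append((acc[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((acc[i][j - 1], (i, j - 1)))
            best, prev = min(candidates, key=lambda c: c[0])
            acc[i][j] = best + d
            back[i][j] = prev
    # Trace the best path back from the end point to the start point.
    path = []
    node = (n - 1, m - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    path.reverse()
    return acc[n - 1][m - 1], path


# Toy data: the trainee holds the second sound for two frames.
model_feats = [[0.0], [1.0], [2.0]]
trainee_feats = [[0.0], [1.0], [1.0], [2.0]]
score, path = dp_match(model_feats, trainee_feats)
```

In this toy case the path passes through (1, 1) and then (1, 2), that is, model frame a2 is matched to both b2 and b3, which mirrors the kind of horizontal step discussed for FIG. 4.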

The corresponding part detection unit 112 then identifies on the DP plane the path with the smallest accumulated DP matching score, and performs alignment processing, which stretches or compresses the time axis of the trainee voice data according to that path. Specifically, it rewrites the time stamp of each frame of the trainee voice data so that the DP matching score of each lattice point on the identified path represents the Euclidean distance between parameters obtained from frames at the same position on the time axis, and then sequentially associates, as a set, the frames that share the same position on the time axis. For example, the path drawn on the DP plane of FIG. 4 proceeds from the start point located by a1 and b1 to the lattice point located by a2 and b2 to its upper right. In this case the frames a2 and b2 were at the same position on the time axis from the outset, so the time stamp of frame b2 need not be rewritten. The path then proceeds from the lattice point located by a2 and b2 to the lattice point located by a2 and b3 to its right. In this case not only frame b2 but also frame b3 must be placed at the same position on the time axis as frame a2, so the time stamp paired with frame b3 is replaced with one that is earlier by one frame. As a result, frame a2 and frames b2 and b3 are associated as a set of frames at the same position on the time axis. This replacement of time stamps and association of frames is carried out over the entire frame section from b1 to bn. Consequently, even if the pronunciation timing of the trainee voice lags behind that of the model voice as shown in FIG. 5(a), the time axis of one set of sound data can be stretched or compressed to fit the time axis of the other as shown in FIG. 5(b), and the frames (phonemes) that share the same position on the aligned time axis can be associated with each other.
The above is the mechanism of DP matching.
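The time stamp rewriting and frame association described above can be illustrated with a short sketch. It takes a matching path as a list of 0-based (model frame, trainee frame) index pairs and produces, per model frame, the set of trainee frames associated with it, plus a rewritten time stamp for each trainee frame. The function name `align_frames`, the 10 ms frame length, and the rule of keeping the first match when a trainee frame is paired with several model frames are assumptions made for illustration.

```python
def align_frames(path, frame_ms):
    """Associate frames along a DP matching path.

    path: ordered list of (model_index, trainee_index) pairs.
    Returns (groups, new_stamps_ms) where
      groups[i]        = trainee frame indices associated with model frame i
      new_stamps_ms[j] = rewritten time stamp of trainee frame j, i.e. the
                         time stamp of the model frame it is aligned to.
    """
    groups = {}
    new_stamps_ms = {}
    for i, j in path:
        groups.setdefault(i, []).append(j)
        # Assumption: if a trainee frame is matched to several model frames,
        # it keeps the time stamp of the first one.
        new_stamps_ms.setdefault(j, i * frame_ms)
    return groups, new_stamps_ms


# Path corresponding to a1-b1, a2-b2, a2-b3, a3-b4 (0-based indices).
example_path = [(0, 0), (1, 1), (1, 2), (2, 3)]
groups, stamps = align_frames(example_path, frame_ms=10)
```

Here trainee frame b3 (index 2), originally at 20 ms, receives the time stamp 10 ms, one frame earlier, so that a2, b2 and b3 form one set at the same position on the time axis.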

2. Operation
Next, the operation of the karaoke apparatus 1 will be described with reference to the flowchart of FIG. 6.
The trainee operates the operation unit 16 of the karaoke apparatus 1 to select the song to be sung and instructs playback of the accompaniment data. In response, the CPU 11 starts the processing shown in FIG. 6. The CPU 11 first reads the accompaniment data of the specified song from the accompaniment data storage area 14a and supplies it to the audio processing unit 18 (step S1). The audio processing unit 18 converts the supplied accompaniment data into an analog audio signal and supplies it to the speaker 19, which emits the sound. At this time, the CPU 11 may control the display unit 15 to display the lyrics read from the lyrics data storage area 14c together with a message prompting the trainee to sing, such as "Please sing along with the accompaniment", and may further change the color of the lyrics in sequence as the accompaniment progresses. The trainee sings along with the accompaniment emitted from the speaker 19. The trainee's voice is picked up by the microphone 17, converted into an audio signal, and supplied to the audio processing unit 18. The trainee voice data A/D converted by the audio processing unit 18 is stored in time series in the trainee voice data storage area 14d of the storage unit 14, together with information representing the time elapsed from the start of the performance to the pronunciation timing (step S2).

When playback of the accompaniment data ends (step S3; YES), the CPU 11 performs the processing of the basic analysis unit 111 described above: it divides the model voice data read from the model voice data storage area 14b and the trainee voice data read from the trainee voice data storage area 14d into frames of a predetermined time length, applies an FFT to each frame, and calculates the spectrum of each set of voice data (step S4). Next, the CPU 11 performs the processing of the corresponding part detection unit 112 described above: it aligns the time axes of the two sets of voice data by DP matching, associates the frames that share the same position on the aligned time axis, and generates corresponding part data representing the associated frames (step S5).

Subsequently, the CPU 11 performs the processing of the pronunciation timing comparison unit 114 described above, that is, it compares the pronunciation timings of mutually corresponding model voice and practitioner voice and detects locations where they differ (step S6). Specifically, the CPU 11 identifies the pronunciation timing of the practitioner voice based on the practitioner voice data and the corresponding location data obtained in step S5. To do so, the CPU 11 first refers to the correspondence between the model voice and the practitioner voice on a common time axis, as shown in FIG. 5(b), and identifies their corresponding locations. Once the correspondence is identified, the CPU 11 can identify, on the time axis of the practitioner voice shown in FIG. 5(a), the boundary between one phoneme (for example, “su”) and the phoneme pronounced next (for example, “gi”). Because the practitioner voice data is stored together with information indicating the elapsed time from the start of the performance to the pronunciation timing, as described above, the CPU 11 can identify the elapsed time corresponding to each boundary between sounds. This elapsed time is the pronunciation timing of each sound contained in the practitioner voice.
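By way of illustration only, the step of locating phoneme boundaries on the practitioner's time axis might be sketched as follows. The sketch assumes the corresponding location data is available as a list of (model frame, practitioner frame) index pairs, that the frame index at which each model phoneme starts is known, and that every frame has the same duration. The function name `practitioner_onsets` and its parameters are hypothetical.

```python
def practitioner_onsets(path, model_onset_frames, frame_seconds):
    """Estimate the elapsed time at which the practitioner starts each phoneme.

    path: list of (model_frame, practitioner_frame) index pairs.
    model_onset_frames: model frame index where each phoneme begins.
    frame_seconds: duration of one frame in seconds (assumed constant).
    """
    first_match = {}
    for model_idx, practice_idx in path:
        # Keep only the earliest practitioner frame associated with each model frame.
        first_match.setdefault(model_idx, practice_idx)
    return [first_match[f] * frame_seconds for f in model_onset_frames]
```

With frames of 0.5 seconds and the pairs `[(0, 0), (0, 1), (1, 2), (2, 3)]`, model phonemes starting at frames 0, 1, and 2 map to practitioner timings of 0.0, 1.0, and 1.5 seconds.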

Next, the CPU 11 reads from the lyrics data storage area 14c the pronunciation timing of the sound in the model voice data that is associated with the practitioner voice. The CPU 11 then calculates the difference between the pronunciation timing of the practitioner voice and that of the corresponding model voice, and determines whether the difference exceeds a predetermined threshold. This threshold is the minimum time interval at which the pronunciation timing is judged to be off, and it is stored in the storage unit 14 in advance. A sound (phoneme) for which the difference is determined to exceed the threshold is a location where the pronunciation timing of the model voice and that of the practitioner voice differ.
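The threshold comparison described here can be illustrated with a minimal sketch. It assumes the model and practitioner pronunciation timings are given as two lists of elapsed times in seconds, one entry per phoneme and in the same order. The function name `timing_differences` and the sign convention (negative means early) are assumptions for illustration.

```python
def timing_differences(model_times, practice_times, threshold):
    """Return (phoneme_index, difference) for timings that are off by more than threshold.

    A negative difference means the practitioner sang early, a positive one late.
    """
    flagged = []
    for idx, (model_t, practice_t) in enumerate(zip(model_times, practice_times)):
        diff = practice_t - model_t
        if abs(diff) > threshold:
            flagged.append((idx, diff))
    return flagged
```

For example, with a threshold of 0.3 seconds, a phoneme sung 0.5 seconds early is flagged, while one sung 0.25 seconds late is not.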

Next, the CPU 11 performs the processing of the pronunciation content comparison unit 113 described above, that is, it compares mutually corresponding model voice and practitioner voice and detects locations where the sounds themselves differ (step S7). Specifically, the CPU 11 first compares the sound represented by the practitioner voice data with the sound represented by the model voice data associated with it, and calculates the difference between their spectra. When different people pronounce the same word, the resulting speech waveforms are broadly similar despite some individual variation. Whether the model voice and the practitioner voice differ can therefore be judged from whether their spectra are similar. A more accurate method is to perform formant analysis on the spectra obtained in step S4 and detect differences between the model voice and the practitioner voice from the result. A formant is a peak in the spectrum concentrated at a particular frequency; even when voice quality differs, pronouncing the same word produces formants characteristic of that word (sound). The pronunciation content can therefore be identified by analyzing how the formants appear. The CPU 11 thus takes the difference in spectrum or formants between the model voice and the practitioner voice and, when the difference exceeds a threshold, judges that location to be one where the pronunciation content of the model voice and that of the practitioner voice differ. The threshold used here may be set in advance to the upper limit of the difference within which the same word can still be regarded as having been pronounced, and stored in the storage unit 14.
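The spectrum comparison of step S7 might be sketched as below. This is only one possible reading: the description does not say how the spectral difference is computed, so the unit-sum normalization and the Euclidean distance are assumptions, as are the names `normalize`, `spectral_difference`, and `content_mismatches`. Formant analysis, which the description names as the more accurate method, is not shown.

```python
import math

def normalize(spectrum):
    """Scale a magnitude spectrum to unit sum so overall loudness is ignored (assumption)."""
    total = sum(spectrum)
    return [x / total for x in spectrum] if total > 0 else list(spectrum)

def spectral_difference(model_spectrum, practice_spectrum):
    """Euclidean distance between two normalized spectra of equal length."""
    a, b = normalize(model_spectrum), normalize(practice_spectrum)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def content_mismatches(frame_pairs, threshold):
    """Return indices of associated frame pairs whose spectra differ by more than threshold."""
    return [i for i, (model, practice) in enumerate(frame_pairs)
            if spectral_difference(model, practice) > threshold]
```

Two spectra with the same shape but different loudness yield a difference of zero under this normalization, so only a change in spectral shape is flagged.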

Next, the CPU 11 determines whether any differing location was detected by the comparisons in steps S6 and S7 (step S8). If none was detected (step S8; NO), the CPU 11 displays a message such as “You are singing well” on the display unit 15 and ends the process. If a differing location was detected (step S8; YES), the CPU 11 generates information identifying the practitioner voice (phoneme) or model voice (phoneme) at that location and performs notification processing, such as displaying the information on the display unit 15 (step S9). Specifically, the CPU 11 underlines the lyrics corresponding to the detected location, or displays those lyrics (characters) in a color or weight different from that of the other characters. The CPU 11 also displays a message describing how the pronunciation timing is off or how the pronunciation is wrong.

The differing locations detected in steps S6 and S7 take three forms, shown in (a) to (c) of FIG. 7.
In the first form, shown in FIG. 7(a), the run of lyrics “shi hibi no yume” within the lyrics “sugisarishi hibi no yume o ...” (すぎさりしひびのゆめを・・・) is pronounced earlier than the model voice. When the pronunciation timing of the practitioner voice deviates from that of the model voice for a predetermined number of sounds or more in succession (here, two or more words), the CPU 11 judges the practitioner's singing to be a “lyric deviation.” In this case, as shown in FIG. 8(a), the CPU 11 displays the lyrics with “shi hibi no yume” underlined, together with a message informing the practitioner that a lyric deviation has occurred in which the lyrics are pronounced earlier than the model voice or the accompaniment.

In the second form, shown in FIG. 7(b), only “no” in the practitioner voice “sugisarishi hibi no yume o ...” is pronounced earlier than the model voice. When the pronunciation timing of the practitioner voice deviates from that of the model voice for fewer than the predetermined number of sounds (here, fewer than two words), the CPU 11 judges the practitioner's singing to be a “timing deviation.” In this case, as shown in FIG. 8(b), the CPU 11 displays the lyrics with only “no” underlined, together with a message informing the practitioner that the sound is pronounced earlier than the model voice or the accompaniment.
In this way, when the pronunciation timing is off, the CPU 11 determines whether the deviating phonemes (words) continue for the predetermined number or more, and displays a different message on the display unit 15 depending on whether they do.
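The distinction between a “lyric deviation” and a “timing deviation” can be illustrated with a short sketch that groups flagged phonemes into consecutive runs. The sketch assumes the flagged locations are given as integer phoneme indices and counts run length in those units, whereas the description counts words. The function name `classify_timing_errors` and the default `min_run=2` are hypothetical.

```python
def classify_timing_errors(flagged_indices, min_run=2):
    """Group flagged phoneme indices into consecutive runs and label each run.

    Runs of at least min_run consecutive indices are labeled "lyric deviation";
    shorter runs are labeled "timing deviation".
    """
    runs = []
    for idx in sorted(flagged_indices):
        if runs and idx == runs[-1][-1] + 1:
            runs[-1].append(idx)   # extends the current consecutive run
        else:
            runs.append([idx])     # starts a new run
    return [
        ("lyric deviation" if len(run) >= min_run else "timing deviation", run)
        for run in runs
    ]
```

Each labeled run can then drive both the underlining of the affected lyrics and the choice of message.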

In the third form, shown in FIG. 7(c), the lyrics “sugisarishi” in “sugisarishi hibi no yume o ...” are sung incorrectly as “sugisatta.” When the pronunciation content differs in this way, the CPU 11 judges the practitioner's singing to be a “lyric mistake.” In this case, as shown in FIG. 8(c), the CPU 11 displays the correct lyrics “sugisarishi hibi no yume o ...” and the practitioner's incorrect pronunciation “sugisatta hibi no yume o ...” side by side, underlines the incorrectly pronounced “tta,” and displays on the display unit 15 a message informing the practitioner that the lyrics contain a mistake.

As shown in FIGS. 8(a) to 8(c), the CPU 11 also displays the message “Sing it again? Yes/No” on the display unit 15. When the practitioner operates the operation unit 16 and selects “Yes,” the CPU 11 determines that re-practice of the singing has been instructed (step S10; Yes). The CPU 11 then reads, from the lyrics data storage area 14c and the accompaniment data storage area 14a, the lyrics data covering a predetermined range before and after the location where the pronunciation timing or pronunciation content differed (in this case, the lyrics “sugisarishi hibi no yume o”) and the accompaniment data corresponding to that lyrics data, and supplies them to the audio processing unit 18 for reproduction (step S11). At this time, the CPU 11 controls the display unit 15 to display the lyrics read from the lyrics data storage area 14c and changes their color in order as the accompaniment progresses. The practitioner sings the lyrics displayed on the display unit 15 along with the accompaniment.
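Selecting the “predetermined range before and after” a differing location can be sketched as a clamped index window. The sketch assumes the lyrics are indexed by phoneme and that the range is expressed as a number of phonemes on each side; the function name `practice_range` and the default `margin=4` are hypothetical, since the description does not state the size of the range.

```python
def practice_range(flagged_index, phoneme_count, margin=4):
    """Return (start, end) phoneme indices to replay, with end exclusive.

    The window is centered on flagged_index and clipped to the song boundaries.
    """
    start = max(0, flagged_index - margin)
    end = min(phoneme_count, flagged_index + margin + 1)
    return start, end
```

The same window would be applied to the accompaniment data so that lyrics and accompaniment for the passage are reproduced together.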

The processing of the CPU 11 then returns to step S2 described above. That is, the practitioner's voice is picked up by the microphone 17, converted into an audio signal, and supplied to the audio processing unit 18, and the practitioner voice data obtained by A/D conversion in the audio processing unit 18 is stored in time series in the practitioner voice data storage area 14d of the storage unit 14 (step S2). Steps S3 to S11 described above are then repeated for the stored practitioner voice data. The practitioner can thus practice the same passage of lyrics repeatedly until satisfied. When the practitioner selects “No” on the screens shown in FIGS. 8(a) to 8(c) (step S10; No), the processing of the CPU 11 ends.

As described above, in this embodiment the time axes of the model voice data and the practitioner voice data are aligned, sounds occupying the same position on the aligned time axis are associated and compared, and locations where the pronunciation timing or pronunciation content differs are displayed. The practitioner can therefore become clearly aware of any deviation in pronunciation timing or mistake in pronunciation in his or her own singing, and can grasp visually where and how the singing differs.

3. Modifications
The embodiment described above may be modified as follows.
(1) The embodiment above was described using evaluation of a practitioner's singing as an example, but the practitioner's musical instrument performance may be evaluated instead. In this case, the accompaniment data storage area 14a stores performance data for instruments (for example, bass and drums) other than the instrument to be practiced (for example, guitar), the model voice data storage area 14b stores model performance data serving as the example, the lyrics data storage area 14c stores the pitches of the performance sounds in association with their pronunciation timings, and the practitioner voice data storage area 14d stores the practitioner's performance data. Based on these data, the CPU 11 detects differing locations between the model performance and the practice performance through the same processing as above and reports information identifying them. Because the present invention can thus cover both singing and instrumental performance, the term “pronunciation” in the present invention includes not only the voice produced when a person sings but also the sound produced by playing an instrument. Likewise, a “phoneme” in the present invention, whether in singing or in performance, is anything that is consciously produced as a single unit of sound and for which pointing out a timing deviation or a mistake is meaningful.

(2) In the lyrics data shown in FIG. 2, the “pronunciation timing” was taken to be the timing at which pronunciation of each sound should start, because in most cases a deviation in pronunciation timing is dominated by the start timing. However, the timing at which pronunciation of a sound ends may also be included in the concept of “pronunciation timing.” For example, in FIG. 2, the timing T₁ at which pronunciation of the opening sound “su” starts and the timing (not shown) at which pronunciation of that “su” ends (later than timing T₁ and earlier than timing T₂) may each be compared between the model voice and the practitioner voice. This makes it possible to evaluate even subtle deviations across the span from the start to the end of pronunciation.

(3) The CPU 11 may accumulate the number of times the difference in pronunciation timing is determined to exceed the threshold and display a message corresponding to the accumulated result on the display unit 15. A large count means that deviations in pronunciation timing occur frequently, so having the CPU 11 display a message such as “You have a lot of lyric deviations. Let's practice harder.” or “Your lyric deviations have decreased a lot. Keep it up.” encourages the practitioner to practice.
The same applies to pronunciation content as well as pronunciation timing: the CPU 11 may accumulate the number of times the difference in pronunciation content is determined to exceed the threshold and display a message corresponding to the accumulated result on the display unit 15.
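A minimal sketch of choosing a message from the accumulated count follows. The description gives the two example messages but not the rule for selecting between them, so the rule used here (compare with the count from an earlier attempt, otherwise apply a fixed limit) is an assumption, as are the function name `feedback_message` and the default `many=5`.

```python
def feedback_message(current_count, previous_count=None, many=5):
    """Pick an encouragement message from the accumulated number of threshold violations.

    Returns None when neither example message applies.
    """
    if previous_count is not None and current_count < previous_count:
        return "Your lyric deviations have decreased a lot. Keep it up."
    if current_count >= many:
        return "You have a lot of lyric deviations. Let's practice harder."
    return None
```

The same function shape would serve for counts of pronunciation content errors, with the message text changed accordingly.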

(4) In the embodiment, the evaluation result shown in FIG. 8 is displayed after the practitioner has finished singing. Alternatively, locations where the pronunciation timing or pronunciation content differed in the practitioner's past singing may be stored as a history, and locations where the timing or content is likely to differ may be displayed before, or in synchronization with, the practitioner's karaoke singing (reproduction of the accompaniment data). Specifically, the CPU 11 stores each phoneme for which the difference was determined to exceed the threshold in step S6 or S7 in association with the lyrics data stored in the lyrics data storage area 14c. The lyrics data and the accompaniment data are stored in association with each other in the accompaniment data storage area 14a and the lyrics data storage area 14c. Thus, when the practitioner instructs reproduction of accompaniment data (karaoke singing), the CPU 11 reports, before or in synchronization with the reproduction, information identifying the phonemes stored in association with the lyrics data corresponding to that accompaniment data (phonemes for which the difference was determined to exceed the threshold in step S6 or S7 in the past). When reporting before reproduction, the CPU 11 may display a message such as “You tend to be late pronouncing ‘shi hibi no yume’ in the opening ‘sugisarishi hibi no yume o ...’. Be careful.”; when reporting in synchronization with reproduction, the CPU 11 may, for example, highlight the “shi hibi no yume” portion of the lyrics “sugisarishi hibi no yume o ...”. The same applies to pronunciation content as well as pronunciation timing. In this way, the practitioner can see, before singing (or while singing), the parts where the pronunciation timing or content is easily mistaken.
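The history described in this modification could be kept, for example, as a per-song tally of flagged phonemes. The sketch below assumes songs are identified by a string key and phonemes by their index in the lyrics data. The class name `ErrorHistory`, its methods, and the `min_count` filter are hypothetical; with the default `min_count=1`, every phoneme flagged in the past is reported, as in the description.

```python
from collections import Counter

class ErrorHistory:
    """Remembers which phonemes of which song were flagged in past attempts."""

    def __init__(self):
        self._counts = Counter()

    def record(self, song_id, phoneme_indices):
        """Store the phonemes flagged in one attempt at a song."""
        for idx in phoneme_indices:
            self._counts[(song_id, idx)] += 1

    def trouble_spots(self, song_id, min_count=1):
        """Return sorted phoneme indices flagged at least min_count times for this song."""
        return sorted(
            idx for (sid, idx), count in self._counts.items()
            if sid == song_id and count >= min_count
        )
```

Before or during reproduction of a song's accompaniment, the indices returned by `trouble_spots` would identify the lyrics to mention in a message or to highlight.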

(5) Notification by the notification unit 115 is not limited to display; a voice message identifying the phoneme may be output instead. The information identifying the phoneme may also be sent by e-mail to the practitioner's mail terminal, or output to and stored on a storage medium, in which case the practitioner can consult it by having a computer read it from the medium. In short, any form suffices as long as information identifying the phoneme is output so that the message or information reaches the practitioner by some means.

(6) In the embodiment, the practitioner voice data is stored in the storage unit 14, such as a hard disk. If the practitioner voice is to be discarded immediately after the singing has been evaluated, the practitioner voice data may instead be stored in the RAM 13.

(7) In the embodiment, the practitioner voice data is recorded during so-called karaoke singing, in which the lyrics are displayed and the practitioner sings while the accompaniment data is reproduced, but this is not essential. The practitioner may sing without any lyrics display or accompaniment, and the recording may be compared with the model voice. Even for a practitioner with considerable singing ability, it is not easy to sing without lyrics display or accompaniment and still get the pronunciation timing and lyrics right, so the practitioner's singing ability can be evaluated more strictly.

(8) In the embodiment, the CPU 11 performs frequency analysis on the model voice data each time it executes the process shown in FIG. 6 (step S4). Instead, the result of frequency analysis performed on the model voice data in advance may be stored in the storage unit 14, or, once frequency analysis has been performed even once, its result may be stored in the storage unit 14. The model voice data and practitioner voice data are described as data in WAVE or MP3 format, but the format is not limited to these; any data format that represents sound may be used.

(9) In the embodiment, the model voice data is stored in the storage unit 14 and the CPU 11 of the karaoke apparatus 1 reads it from the storage unit 14. Instead, the voice data may be received via a communication network.

(10) In the embodiment, the karaoke apparatus 1 implements all of the functions shown in FIG. 3. Alternatively, two or more devices connected via a communication network may share those functions, so that a system comprising those devices implements the karaoke apparatus 1 of the embodiment. For example, the system may consist of a computer device that includes a microphone, a speaker, a display device, an input device, and the like and implements the notification unit 115, and a server device that implements the basic analysis unit 111, the corresponding location detection unit 112, the pronunciation content comparison unit 113, and the pronunciation timing comparison unit 114, connected to each other via a communication network. In this case, the computer device converts the sound input from the microphone into voice data and transmits it to the server device, and the server device compares the received voice data with the model voice data and the lyrics data and transmits the comparison result to the computer device.

(11) The program executed by the CPU 11 of the karaoke apparatus 1, which serves as the evaluation apparatus in the embodiment described above, can be provided stored on a recording medium such as a magnetic tape, a magnetic disk, a floppy (registered trademark) disk, an optical recording medium, a magneto-optical recording medium, a CD (Compact Disk)-ROM, a DVD (Digital Versatile Disk), or a RAM. It can also be downloaded to the karaoke apparatus 1 via a network such as the Internet.

FIG. 1 is a block diagram showing an example of the hardware configuration of the karaoke apparatus.
FIG. 2 is a diagram showing an example of the contents of the model voice data and the lyrics data.
FIG. 3 is a block diagram showing an example of the software configuration of the karaoke apparatus.
FIG. 4 is a diagram illustrating DP matching.
FIG. 5 is a diagram illustrating expansion and contraction of the time axis in DP matching.
FIG. 6 is a flowchart showing the flow of processing performed by the CPU of the karaoke apparatus.
FIG. 7 is a diagram explaining the forms in which pronunciation timing and pronunciation content can differ.
FIG. 8 is a diagram showing an example of a screen displayed on the karaoke apparatus.

Explanation of symbols

1 ... karaoke apparatus, 11 ... CPU, 12 ... ROM, 13 ... RAM, 14 ... storage unit, 15 ... display unit, 16 ... operation unit, 17 ... microphone, 18 ... audio processing unit, 19 ... speaker, 111 ... basic analysis unit, 112 ... corresponding location detection unit, 113 ... pronunciation content comparison unit, 114 ... pronunciation timing comparison unit, 115 ... notification unit.

Claims

An evaluation apparatus comprising:
first storage means for storing first sound data representing a plurality of phonemes whose pronunciation timings are arranged in time series, in association with the pronunciation timing of each phoneme;
second storage means for storing second sound data supplied from sound collection means for collecting sound;
corresponding location detection means for associating the first sound data and the second sound data in units of frames of a predetermined time length, and generating corresponding location data representing the associated frames;
comparison means for identifying the pronunciation timing of a phoneme represented by the second sound data based on the pronunciation timing of the phoneme represented by the first sound data and the corresponding location data, and determining whether a difference between the pronunciation timing of the phoneme represented by the first sound data and the pronunciation timing of the phoneme represented by the second sound data exceeds a threshold; and
output means for outputting information identifying a phoneme for which the comparison means determines that the difference exceeds the threshold.

The evaluation apparatus according to claim 1, further comprising determining means for determining whether a predetermined number or more of phonemes for which the comparison means determines that the difference exceeds the threshold occur consecutively,
wherein the output means outputs, in addition to the information identifying the phoneme for which the difference is determined to exceed the threshold, different messages depending on whether the determining means determines that the predetermined number or more of such phonemes occur consecutively or that they do not.

The evaluation apparatus according to claim 1, further comprising accumulating means for accumulating the number of times the comparison means determines that the difference exceeds the threshold,
wherein the output means outputs, in addition to the information identifying the phoneme for which the difference is determined to exceed the threshold, a message corresponding to the result accumulated by the accumulating means.

The evaluation apparatus according to claim 1, further comprising:
history storage means for storing each phoneme for which the comparison means determines that the difference exceeds the threshold, in association with the first sound data;
third storage means for storing accompaniment data in association with the first sound data;
reproduction means for reproducing the accompaniment data stored in the third storage means; and
pre-output means for outputting, before or in synchronization with reproduction of the accompaniment data by the reproduction means, information identifying the phonemes that the history storage means stores in association with the first sound data stored by the third storage means in association with that accompaniment data.

First storage means for storing sound data representing a plurality of phonemes whose pronunciation timings are arranged in time series in association with the pronunciation timing of each phoneme;
Second storage means for storing second sound data supplied from the sound collection means for collecting sound;
Corresponding location detection means for associating the first sound data and the second sound data in units of frames of a predetermined time length, and generating corresponding location data representing the associated frames;
The phonemes represented by the first sound data and the phonemes represented by the second sound data are compared in units of frames represented by the corresponding location data, and the phonemes represented by the first sound data and the second sounds are compared. A comparing means for determining whether or not a difference between the phonemes represented by the data exceeds a threshold;
And an output means for outputting information for identifying a phoneme for which the difference is determined to exceed a threshold value by the comparison means.
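The frame-by-frame phoneme comparison recited in this claim might be sketched as follows. The claims do not specify how the difference between phonemes is measured, so this sketch assumes each frame carries a small feature vector, uses the Euclidean distance between matched frames, and averages it per model phoneme. The names `mismatched_phonemes`, `frame_to_phoneme`, and `pairs` (the corresponding location data as a list of frame-index pairs) are hypothetical.

```python
import math


def mismatched_phonemes(model_feats, trainee_feats, pairs,
                        frame_to_phoneme, symbols, threshold):
    """Return (index, symbol) of model phonemes whose mean frame distance
    to the matched trainee frames exceeds the threshold."""
    totals = {}  # phoneme index -> (sum of distances, number of frame pairs)
    for m, t in pairs:
        dist = math.dist(model_feats[m], trainee_feats[t])
        idx = frame_to_phoneme[m]
        s, n = totals.get(idx, (0.0, 0))
        totals[idx] = (s + dist, n + 1)
    return [(idx, symbols[idx])
            for idx, (s, n) in sorted(totals.items())
            if s / n > threshold]


# Hypothetical data: two phonemes, two model frames each.
model_feats = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0)]
trainee_feats = [(0.0, 0.0), (0.1, 0.0), (3.0, 1.0), (3.0, 1.0), (3.0, 1.0)]
pairs = [(0, 0), (1, 1), (2, 2), (3, 3), (3, 4)]  # (model frame, trainee frame)
frame_to_phoneme = [0, 0, 1, 1]
symbols = ["sa", "ku"]

print(mismatched_phonemes(model_feats, trainee_feats, pairs,
                          frame_to_phoneme, symbols, threshold=0.5))
```

In this example only the second phoneme is reported, because every trainee frame matched to it lies at distance 2.0 from the model frame.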

The evaluation device according to claim 1, wherein the output means notifies the practitioner of information specifying a phoneme for which the comparison means has determined that the difference exceeds the threshold.

Accumulating means for accumulating the number of times that the difference is determined by the comparing means to exceed a threshold;
6. The evaluation apparatus according to claim 5, wherein the output means outputs a message corresponding to an accumulation result by the accumulation means in addition to information specifying a phoneme for which the difference is determined to exceed a threshold value.

History storage means for storing the phonemes determined by the comparison means to have the difference exceeding a threshold value in association with the first sound data;
Third storage means for storing accompaniment data in association with the first sound data;
Reproduction means for reproducing the accompaniment data stored in the third storage means;
The evaluation apparatus according to claim 5, further comprising pre-output means for outputting, prior to or in synchronization with the reproduction of the accompaniment data by the reproduction means, information specifying the phonemes that the history storage means stores in association with the first sound data stored by the third storage means in association with that accompaniment data.

A method for controlling an evaluation apparatus that comprises first storage means for storing first sound data representing a plurality of phonemes whose pronunciation timings are arranged in time series, in association with the pronunciation timing of each phoneme, second storage means for storing second sound data supplied from sound collection means for collecting sound, and control means, the method comprising:
A step in which the control means associates the first sound data and the second sound data in units of frames of a predetermined time length, and generates corresponding location data representing the associated frames;
A step in which the control means specifies the pronunciation timing of each phoneme represented by the second sound data based on the pronunciation timing of the phoneme represented by the first sound data and the corresponding location data, and determines whether the difference between the pronunciation timing of the phoneme represented by the first sound data and the pronunciation timing of the phoneme represented by the second sound data exceeds a threshold; and
A step in which the control means outputs information identifying a sound for which the difference is determined to exceed the threshold.
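The timing comparison in this method can be sketched under a few assumptions: the corresponding location data is a time-ordered list of (model frame, trainee frame) index pairs that covers every model frame, the trainee's pronunciation timing for a phoneme is taken as the earliest trainee frame matched to the model's onset frame, and frames have a fixed length `frame_sec`. The function name `timing_errors` and all data values are hypothetical.

```python
def timing_errors(model_onsets, pairs, frame_sec, threshold_sec):
    """Return (symbol, difference in seconds) for each phoneme whose
    trainee onset differs from the model onset by more than the threshold.
    A positive difference means the trainee was late."""
    first_match = {}
    for m, t in pairs:
        # pairs are in time order, so the first hit is the earliest frame
        first_match.setdefault(m, t)

    flagged = []
    for symbol, onset in model_onsets:
        m = round(onset / frame_sec)     # model frame containing the onset
        t = first_match[m]               # assumes every model frame is matched
        diff = t * frame_sec - onset
        if abs(diff) > threshold_sec:
            flagged.append((symbol, diff))
    return flagged


# Hypothetical data: 0.5 s frames; the trainee holds "sa" too long,
# so "ku" and "ra" both start one second late.
model_onsets = [("sa", 0.0), ("ku", 1.0), ("ra", 2.0)]
pairs = [(0, 0), (1, 1), (1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]

print(timing_errors(model_onsets, pairs, frame_sec=0.5, threshold_sec=0.75))
```

The 0.5 s frame length is chosen only to keep the arithmetic exact in the example; a practical frame length would be far shorter.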

A method for controlling an evaluation apparatus that comprises first storage means for storing first sound data representing a plurality of phonemes whose pronunciation timings are arranged in time series, in association with the pronunciation timing of each phoneme, second storage means for storing second sound data supplied from sound collection means for collecting sound, and control means, the method comprising:
A step in which the control means associates the first sound data and the second sound data in units of frames of a predetermined time length, and generates corresponding location data representing the associated frames;
A step in which the control means compares, in units of the frames represented by the corresponding location data, the phoneme represented by the first sound data with the phoneme represented by the second sound data, and determines whether the difference between the phoneme represented by the first sound data and the phoneme represented by the second sound data exceeds a threshold; and
A step in which the control means outputs information identifying a sound for which the difference is determined to exceed the threshold.

A program for causing a computer that comprises first storage means for storing first sound data representing a plurality of phonemes whose pronunciation timings are arranged in time series, in association with the pronunciation timing of each phoneme, and second storage means for storing second sound data supplied from sound collection means for collecting sound, to realize:
A corresponding location detection function for associating the first sound data and the second sound data in units of frames of a predetermined time length, and generating corresponding location data representing the associated frames;
A comparison function for specifying the pronunciation timing of each phoneme represented by the second sound data based on the pronunciation timing of the phoneme represented by the first sound data and the corresponding location data, and determining whether or not the difference between the pronunciation timing of the phoneme represented by the first sound data and the pronunciation timing of the phoneme represented by the second sound data exceeds a threshold; and
An output function for outputting information identifying a sound for which the comparison function determines that the difference exceeds the threshold.
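The corresponding location detection function is described in the abstract as using DP matching. A minimal sketch of such a matching (a basic dynamic time warping) is given below. It assumes, purely for illustration, that each frame is summarized by a single number and that the local cost is the absolute difference; the function name `dp_match` is hypothetical, and a practical implementation would likely use spectral feature vectors and path constraints.

```python
def dp_match(a, b):
    """Align two frame sequences and return the list of (i, j) index pairs
    on the minimum-cost warping path (the corresponding location data)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimum accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both
                                 cost[i - 1][j],      # advance a only
                                 cost[i][j - 1])      # advance b only

    # Trace the path back from the end to the start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


# Hypothetical one-dimensional frame features; the trainee holds the
# second sound for one extra frame.
model = [1.0, 2.0, 3.0]
trainee = [1.0, 2.0, 2.0, 3.0]
print(dp_match(model, trainee))
```

Here model frame 1 is matched to trainee frames 1 and 2, which is how a lengthened sound shows up in the corresponding location data.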

A program for causing a computer that comprises first storage means for storing first sound data representing a plurality of phonemes whose pronunciation timings are arranged in time series, in association with the pronunciation timing of each phoneme, and second storage means for storing second sound data supplied from sound collection means for collecting sound, to realize:
A corresponding location detection function for associating the first sound data and the second sound data in units of frames of a predetermined time length, and generating corresponding location data representing the associated frames;
A comparison function for comparing, in units of the frames represented by the corresponding location data, the phonemes represented by the first sound data with the phonemes represented by the second sound data, and determining whether the difference between them exceeds a threshold; and
An output function for outputting information identifying a sound for which the comparison function determines that the difference exceeds the threshold.