JP2008040260A

JP2008040260A - Musical piece practice assisting device, dynamic time warping module, and program

Info

Publication number: JP2008040260A
Application number: JP2006216059A
Authority: JP
Inventors: Hiroshi Kayama; 啓嘉山
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-08-08
Filing date: 2006-08-08
Publication date: 2008-02-21

Abstract

PROBLEM TO BE SOLVED: To provide a karaoke device capable of evaluating singing or musical performance that makes use of various techniques.
SOLUTION: Acoustic parameters (pitch, loudness, and spectrum) are detected from a model voice and from a practitioner's voice. Sections in which the detected loudness exceeds a threshold are determined to be voiced sections, and DTW (dynamic time warping) is performed on the voiced sections to determine how the two voices correspond to each other. The degree to which the waveforms of corresponding parts of the two voices match is then evaluated.
COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a technique for comparing a user's singing or performance against a model and evaluating it.

Various scoring functions for evaluating a singer's singing have conventionally been proposed for karaoke apparatuses. A karaoke apparatus of this type generates, from the singer's voice input through a microphone, singing data indicating singing characteristics such as the pitch, volume, and tempo of the voice produced by the singer. The karaoke apparatus then compares the singing data with scoring reference data such as a guide melody and assigns a predetermined score based on the comparison result to generate scoring data. When the singing part ends, the scores in the scoring data are totaled to calculate an overall score.
For example, Patent Document 1 discloses a technique that detects the pitch of the sound extracted from the karaoke guide melody and the pitch and volume of the voice produced by the singer, evaluates the singing by comparing the two, and also obtains a singing rate (the portion actually sung divided by the portion that should have been sung) and factors that rate into the evaluation. This addresses a problem of earlier systems, namely that when only a small portion is actually sung, the overall evaluation is determined by that small portion alone. Patent Document 2 discloses a communication system in which a voice recorded by a singer in a karaoke box or the like is sent over a network or the like to a remote singing instructor, who in turn provides singing instruction to the singer over the network, making individual singing instruction possible.
JP 2005-215493 A; JP 2003-15673 A

Skilled singers, however, do not always sing exactly as written in the score. They may express emotion and nuance through a variety of singing techniques, such as intentionally shifting the start or end of a sung phrase, varying voice quality or volume, or using vibrato or kobushi (an ornamental vocal inflection). Such expression differs from singer to singer, and each singer often has characteristic habits, for example always adding vibrato at the end of a phrase or always holding back the start of singing (deliberately delaying the onset).
Meanwhile, users who practice singing with a karaoke apparatus often want to imitate the singing techniques of a favorite singer, and when practicing with the karaoke apparatus they may also want to be evaluated on how well they reproduced those techniques.

However, the techniques disclosed in Patent Document 1 and Patent Document 2 not only fail to meet these needs; a singing technique such as holding back the start of singing may even be penalized as a deviation from the score. This is because the guide melody that serves as the evaluation standard in those techniques faithfully reproduces the pitch changes of the music as written in the score, and the techniques are intended to evaluate whether the song was sung faithfully to the score. The same applies not only to singing but also to the performance of musical instruments.

In view of the above problem, an object of the present invention is to provide a technique that makes it possible to evaluate more efficiently singing or performance in which various singing techniques are used.

In a first configuration, the music practice support device according to the present invention includes a dynamic time alignment module having the following means. Extraction means analyzes a first audio signal representing the waveform of a singing sound or performance sound produced by a user and extracts, for each predetermined time unit, first acoustic parameters representing the pitch, volume, and spectrum in that time unit and the degree of temporal change of the volume and spectrum. Acquisition means acquires second acoustic parameters that are obtained for each time unit by analyzing a second audio signal representing the waveform of a singing sound or performance sound that the user takes as a model, and that represent the pitch, volume, and spectrum of the sound represented by the second audio signal in each time unit and the degree of temporal change of the volume and spectrum. Section selection means selects from the first audio signal, with reference to the first acoustic parameters, a voiced section in which at least one of pitch and volume exceeds a predetermined threshold, and selects from the second audio signal, with reference to the second acoustic parameters, a voiced section in which at least one of pitch and volume exceeds a predetermined threshold. Normalization means applies to the first acoustic parameters of each time unit in the voiced section selected by the section selection means a normalization that converts their mean and standard deviation to predetermined values, and applies the same normalization to the second acoustic parameters of each time unit in the voiced section selected by the section selection means. Calculation means calculates, on a coordinate plane having the time axis of the first audio signal as one coordinate axis and the time axis of the second audio signal as the other, an evaluation value derived from the difference between the normalized first acoustic parameters and the normalized second acoustic parameters, for each grid point whose coordinate values are time units. Grid point selection means selects, on the coordinate plane, only those grid points whose two coordinate values differ by less than a predetermined value. Specifying means specifies, among the paths on the coordinate plane from a start point, which is the grid point at which both coordinate values are smallest, to an end point, which is the grid point at which both coordinate values are largest, the path that passes only through grid points selected by the grid point selection means and that minimizes the sum of the evaluation values at the grid points on the path. Association means associates, along the path specified by the specifying means, the time units of the first audio signal with the time units of the second audio signal. The device compares the signal waveform of the first audio signal with the signal waveform of the second audio signal for each pair of time units associated by the dynamic time alignment module, converts the degree of agreement between the two into a score, and outputs the score.
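The normalization, band-limited grid, and minimum-cost path described in the first configuration can be illustrated with a minimal sketch. It rests on several assumptions that the text does not state: each time unit carries a single scalar parameter, the evaluation value is the absolute difference of the normalized values, the normalization targets are mean 0 and standard deviation 1, and the path may advance by one step along either axis or diagonally. The names `zscore` and `dtw_path` are hypothetical.

```python
import math

def zscore(values):
    """Normalize a sequence to mean 0 and standard deviation 1 (assumed targets)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var) or 1.0  # constant input: avoid division by zero
    return [(v - mean) / std for v in values]

def dtw_path(user, model, band):
    """Return (total_cost, path) for the cheapest path from grid point (0, 0)
    to (len(user) - 1, len(model) - 1), visiting only points with |i - j| < band."""
    u, m = zscore(user), zscore(model)
    n, k = len(u), len(m)
    inf = float("inf")
    cost = [[inf] * k for _ in range(n)]
    back = [[None] * k for _ in range(n)]
    for i in range(n):
        for j in range(k):
            if abs(i - j) >= band:      # grid point selection
                continue
            d = abs(u[i] - m[j])        # evaluation value at this grid point
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            best, prev = inf, None
            # Assumed step pattern: diagonal, or one step along either axis.
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, prev = cost[pi][pj], (pi, pj)
            if prev is not None:
                cost[i][j] = best + d
                back[i][j] = prev
    if cost[n - 1][k - 1] == inf:
        raise ValueError("band too narrow to connect start and end points")
    # Trace the path back from the end point to the start point.
    path = [(n - 1, k - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return cost[n - 1][k - 1], path
```

Each `(i, j)` pair in the returned path associates time unit `i` of the user's signal with time unit `j` of the model signal. Restricting the search to the band keeps the number of evaluated grid points roughly proportional to the signal length times the band width rather than to the product of the two lengths.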

In a second configuration of the music practice support device according to the present invention, based on the first configuration, when a silent section in which at least one of pitch and volume falls below a predetermined threshold continues in the first and second audio signals for longer than a predetermined time, the section selection means selects from the first and second audio signals the sections excluding that silent section.

In a third configuration of the music practice support device according to the present invention, based on the first configuration, the device includes storage means storing one or more pairs of a music identifier that uniquely identifies a piece of music and the second acoustic parameters corresponding to the piece identified by that music identifier, and the acquisition means acquires the second acoustic parameters corresponding to the music identifier selected by the user by reading them from the storage means.

In a first configuration, the dynamic time alignment module according to the present invention has the following means. Extraction means analyzes a first audio signal representing the waveform of a singing sound or performance sound produced by a user and extracts, for each predetermined time unit, first acoustic parameters representing the pitch, volume, and spectrum in that time unit and the degree of temporal change of the volume and spectrum. Acquisition means acquires second acoustic parameters that are obtained for each time unit by analyzing a second audio signal representing the waveform of a singing sound or performance sound that the user takes as a model, and that represent the pitch, volume, and spectrum of the sound represented by the second audio signal in each time unit and the degree of temporal change of the volume and spectrum. Section selection means selects from the first audio signal, with reference to the first acoustic parameters, a voiced section in which at least one of pitch and volume exceeds a predetermined threshold, and selects from the second audio signal, with reference to the second acoustic parameters, a voiced section in which at least one of pitch and volume exceeds a predetermined threshold. Normalization means applies to the first acoustic parameters of each time unit in the voiced section selected by the section selection means a normalization that converts their mean and standard deviation to predetermined values, and applies the same normalization to the second acoustic parameters of each time unit in the voiced section selected by the section selection means. Calculation means calculates, on a coordinate plane having the time axis of the first audio signal as one coordinate axis and the time axis of the second audio signal as the other, an evaluation value derived from the difference between the normalized first acoustic parameters and the normalized second acoustic parameters, for each grid point whose coordinate values are time units. Grid point selection means selects, on the coordinate plane, only those grid points whose two coordinate values differ by less than a predetermined value. Specifying means specifies, among the paths on the coordinate plane from a start point, which is the grid point at which both coordinate values are smallest, to an end point, which is the grid point at which both coordinate values are largest, the path that passes only through grid points selected by the grid point selection means and that minimizes the sum of the evaluation values at the grid points on the path. Association means associates, along the path specified by the specifying means, the time units of the first audio signal with the time units of the second audio signal.

In a first configuration, the program according to the present invention causes a computer device to function as the following means. Extraction means analyzes a first audio signal representing the waveform of a singing sound or performance sound produced by a user and extracts, for each predetermined time unit, first acoustic parameters representing the pitch, volume, and spectrum in that time unit and the degree of temporal change of the volume and spectrum. Acquisition means acquires second acoustic parameters that are obtained for each time unit by analyzing a second audio signal representing the waveform of a singing sound or performance sound that the user takes as a model, and that represent the pitch, volume, and spectrum of the sound represented by the second audio signal in each time unit and the degree of temporal change of the volume and spectrum. Section selection means selects from the first audio signal, with reference to the first acoustic parameters, a voiced section in which at least one of pitch and volume exceeds a predetermined threshold, and selects from the second audio signal, with reference to the second acoustic parameters, a voiced section in which at least one of pitch and volume exceeds a predetermined threshold. Normalization means applies to the first acoustic parameters of each time unit in the voiced section selected by the section selection means a normalization that converts their mean and standard deviation to predetermined values, and applies the same normalization to the second acoustic parameters of each time unit in the voiced section selected by the section selection means. Calculation means calculates, on a coordinate plane having the time axis of the first audio signal as one coordinate axis and the time axis of the second audio signal as the other, an evaluation value derived from the difference between the normalized first acoustic parameters and the normalized second acoustic parameters, for each grid point whose coordinate values are time units. Grid point selection means selects, on the coordinate plane, only those grid points whose two coordinate values differ by less than a predetermined value. Specifying means specifies, among the paths on the coordinate plane from a start point, which is the grid point at which both coordinate values are smallest, to an end point, which is the grid point at which both coordinate values are largest, the path that passes only through grid points selected by the grid point selection means and that minimizes the sum of the evaluation values at the grid points on the path. Association means associates, along the path specified by the specifying means, the time units of the first audio signal with the time units of the second audio signal.

The present invention has the effect of making it possible to evaluate more efficiently singing or performance in which various techniques are used.

An embodiment of the present invention is described below with reference to the drawings.
(A: Configuration)
FIG. 1 is a block diagram illustrating the hardware configuration of a karaoke apparatus 1 serving as a music practice support device according to an embodiment of the present invention. As shown in FIG. 1, the karaoke apparatus 1 has a control unit 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage unit 14, a display unit 15, an operation unit 16, an audio processing unit 18, and a bus 10 that mediates the exchange of data among them.
The control unit 11 is, for example, a CPU (Central Processing Unit). It reads the control program stored in the ROM 12, loads it into the RAM 13, and executes it, thereby controlling each unit of the karaoke apparatus 1.

The storage unit 14 is a large-capacity storage means such as a hard disk, and has an accompaniment/lyrics data storage area 14a, a model voice data storage area 14b, and a practitioner voice data storage area 14c.

The display unit 15 is, for example, a liquid crystal display and its drive circuit. Under the control of the control unit 11, it displays various screens such as a menu screen for operating the karaoke apparatus 1 and a karaoke screen in which lyrics captions are superimposed on a background image.
The operation unit 16 has various keys such as a numeric keypad, and outputs a signal corresponding to the pressed key to the control unit 11.

A microphone 17 and a speaker 19 are connected to the audio processing unit 18. The microphone 17 picks up the singing of a user who practices singing with the karaoke apparatus 1 (hereinafter, the practitioner) and outputs an audio signal (analog data) corresponding to the singing to the audio processing unit 18. The audio processing unit 18 converts the audio signal (analog data) output from the microphone 17 into audio data (digital data) and outputs it to the control unit 11, and also converts audio data passed from the control unit 11 into an audio signal and outputs it to the speaker 19. The speaker 19 emits sound corresponding to the audio signal output from the audio processing unit 18.

The accompaniment/lyrics data storage area 14a of the storage unit 14 stores, for one or more pieces of music, accompaniment data in which the performance sounds of the various instruments accompanying the piece (the so-called guide melody) are recorded in the order in which the piece progresses, and lyrics data indicating the lyrics of the piece, in association with each other. More specifically, the accompaniment data and lyrics data stored in the accompaniment/lyrics data storage area 14a are each associated with an identifier that uniquely identifies a karaoke piece (for example, a music code consisting of letters, symbols, digits, and the like; hereinafter, music identifier), and the accompaniment data and lyrics data are linked to each other through this music identifier. The accompaniment data is, for example, data in MIDI (Musical Instrument Digital Interface) format, and is played back when the practitioner sings karaoke. The lyrics data is displayed on the display unit 15 as lyrics captions during the karaoke singing.

The model voice data storage area 14b stores, in association with the music identifier described above, WAVE-format audio data (hereinafter, model voice data) representing the singing (hereinafter, model voice) of the piece by a singer for whom the piece identified by that music identifier is a signature song. This model voice data is used as the reference when evaluating the practitioner's singing.

The practitioner voice data storage area 14c stores, for example in WAVE format, the audio data (hereinafter, practitioner voice data) generated by A/D conversion in the audio processing unit 18 of the signal from the microphone 17.

Next, the functional configuration of the karaoke apparatus 1 is described with reference to the block diagram shown in FIG. 2. The basic analysis module 21, the dynamic time alignment (Dynamic Time Warping; hereinafter, DTW) module 22, and the evaluation module 23 shown in FIG. 2 are software modules realized by the control unit 11 executing the control program described above. The arrows in the figure schematically indicate the flow of data. In addition to these three software modules, a karaoke performance module, which plays back the accompaniment according to the accompaniment data of the karaoke piece specified by the practitioner and mixes and outputs the accompaniment and the practitioner's singing, is also realized by the control unit 11 executing the control program. Because the function of this karaoke performance module is no different from that of a conventional karaoke apparatus, it is not illustrated or described in detail.

The basic analysis module 21 detects acoustic parameters (in this embodiment, parameters relating to pitch, volume, and spectrum) from the model voice data and from the practitioner voice data, in units of frames of a predetermined time length. In the following description, frames are numbered from 0 in order of time, and the frame with number i is called frame i.
This embodiment describes the case in which the time unit for extracting the acoustic parameters from the model voice data and the practitioner voice data is one frame. However, the acoustic parameters may instead be extracted in units of subframes obtained by further dividing a frame, or in units of a plurality of frames. What matters is that the time unit used to extract the acoustic parameters from the model voice data matches the time unit used to extract them from the practitioner voice data; the length of the time unit itself is not restricted.

The basic analysis module 21 is described in detail below. As shown in FIG. 2, the basic analysis module 21 includes pitch detection means 211, volume detection means 212, spectrum detection means 213, and differentiating means 214a to 214c. The audio data passed to the basic analysis module 21 (that is, the model voice data or the practitioner voice data) is split three ways as shown in FIG. 2 and passed to each of the pitch detection means 211, the volume detection means 212, and the spectrum detection means 213.

The pitch detection means 211 computes the autocorrelation of the audio data for the predetermined time unit, detects the pitch in that time unit, and outputs pitch data indicating the detection result. As shown in FIG. 2, the pitch data output from the pitch detection means 211 is passed to the DTW module 22. Although this embodiment detects the pitch in each time unit by computing the autocorrelation, the pitch may of course be detected in other ways, for example by computing the cepstrum for each time unit.

The volume detection means 212 calculates the arithmetic mean of the absolute values of the amplitudes of the samples contained in the audio data for the predetermined time unit (256 samples in this embodiment; see FIG. 3), and outputs the result as volume data indicating the volume in that frame. The volume data output from the volume detection means 212 is split two ways as shown in FIG. 2; one branch is passed to the DTW module 22 and the other to the differentiating means 214a.
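The volume calculation can be sketched as follows. The 256-sample frame length comes from the embodiment; treating frames as non-overlapping and discarding a trailing partial frame are assumptions of this example, as are the names `frame_volume` and `volume_track`.

```python
def frame_volume(samples):
    """Mean of the absolute sample amplitudes in one frame."""
    return sum(abs(s) for s in samples) / len(samples)

def volume_track(signal, frame_len=256):
    """Volume per frame, using consecutive non-overlapping frames (assumed)."""
    return [
        frame_volume(signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, frame_len)
    ]
```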

The differentiating means 214a calculates the first derivative of the volume (hereinafter, "velocity") from the volume data for a plurality of consecutive time units (five in this operation example), and outputs volume velocity data indicating the result. In this embodiment, as shown in FIG. 3, the differentiating means 214a generates the volume velocity data from the volume data of five consecutive frames. As shown in FIG. 2, the volume velocity data is split two ways; one branch is passed to the DTW module 22 and the other to the differentiating means 214b.

The differentiating means 214b calculates the first derivative of the volume velocity data (that is, the second derivative of the volume; hereinafter, the volume acceleration) over a plurality of consecutive time units (five in this operation example), and outputs volume acceleration data indicating the result. The volume acceleration data output from the differentiating means 214b is passed to the DTW module 22.
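The text does not state how the derivative is computed from five consecutive time units. One common choice, assumed here purely for illustration, is the least-squares slope over a window of five frames centered on the current frame, with frames beyond either end of the track replaced by the nearest edge value. The name `delta` and the edge handling are hypothetical.

```python
def delta(track, half_width=2):
    """Least-squares slope over a (2 * half_width + 1)-frame centered window.

    With half_width=2 the window spans five frames. Indices outside the
    track are clamped to the nearest edge (an assumption of this sketch).
    """
    n = len(track)
    denom = sum(k * k for k in range(-half_width, half_width + 1))
    out = []
    for t in range(n):
        num = 0.0
        for k in range(-half_width, half_width + 1):
            idx = min(max(t + k, 0), n - 1)
            num += k * track[idx]
        out.append(num / denom)
    return out
```

Under this assumption, the volume velocity would be `delta(volume)` and the volume acceleration `delta(delta(volume))`, mirroring the chaining of the differentiating means 214a and 214b.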

As shown in FIG. 3, the spectrum detection means 213 applies a fast Fourier transform (hereinafter, FFT) to the audio data for two consecutive time units, passes the result through bandpass filters having predetermined passbands (because singing voice data is input in this embodiment, 1/2-octave bandpass filters from 0 to 2 kHz and 1/4-octave bandpass filters from 2 to 8 kHz), and outputs the filter output as spectrum data representing the spectrum in that time unit. The spectrum data output from the spectrum detection means 213 is split two ways as shown in FIG. 2; one branch is passed to the DTW module 22 and the other to the differentiating means 214c.

The differentiating means 214c calculates the first derivative for each frequency band of the spectrum from the spectrum data for a plurality of consecutive time units (five in this operation example) and, as shown in FIG. 2, delivers spectral velocity data indicating the result to the DTW module 22.
The above is the configuration of the basic analysis module 21.

Next, the functional configuration of the DTW module 22 will be described. As shown in FIG. 4, the DTW module 22 specifies the correspondence between the time axis of the model voice and the time axis of the trainee voice. As shown in FIG. 2, it includes a DTW execution section limiting means 220, a normalizing means 221, a difference matrix generating means 222, and an optimum route specifying means 223.
The DTW execution section limiting means 220 limits the sections of the model voice and the trainee voice to which the dynamic time warping (DTW) process described below is applied. Its function is described in detail below.

The DTW execution section limiting means 220 designates a section as a DTW execution section when the volume extracted from the model voice data by the volume detecting means 212 exceeds a predetermined threshold and the period during which it does so exceeds a predetermined duration threshold. This is because the corresponding part of the model voice data is a voiced section that was actually sung.

Similarly, when the volume extracted from the trainee voice data by the volume detecting means 212 exceeds a predetermined threshold and the period during which it does so exceeds a predetermined duration threshold, that section is designated as a DTW execution section. In both the model voice and the trainee voice, one time unit is added immediately before and immediately after each voiced section (these added sections are called offset sections), and the voiced section together with its offset sections forms the DTW execution section.
Next, the DTW execution section limiting means 220 generates, from the various acoustic parameters of the model voice and the trainee voice extracted by the basic analysis module 21, data consisting only of the DTW execution sections determined as described above.
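The section-limiting rule above can be sketched in Python as follows. This is a minimal illustration assuming per-frame volume values; the function name, parameter names, and the half-open `(start, end)` convention are hypothetical, and overlapping sections produced by the offsets are not merged here.

```python
def dtw_sections(volume, level_threshold, min_frames, offset=1):
    """Return (start, end) frame ranges (end exclusive) of DTW execution sections.

    A run of frames qualifies when the volume stays above level_threshold
    for more than min_frames frames; `offset` frames are then added on each side.
    """
    sections = []
    start = None
    # A trailing sentinel closes a run that reaches the end of the data.
    for t, v in enumerate(list(volume) + [float("-inf")]):
        if v > level_threshold:
            if start is None:
                start = t
        elif start is not None:
            if t - start > min_frames:
                sections.append((max(start - offset, 0),
                                 min(t + offset, len(volume))))
            start = None
    return sections


volume = [0, 0, 5, 6, 7, 0, 0, 8, 0, 0]
sections = dtw_sections(volume, level_threshold=1, min_frames=2)
# Keep only the frames inside the DTW execution sections.
limited = [v for s, e in sections for v in volume[s:e]]
```

In this toy input the three-frame run is kept with one offset frame on each side, while the single loud frame is discarded as too short.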

The normalizing means 221 receives from the DTW execution section limiting means 220 the acoustic parameters of the model voice and of the trainee voice extracted frame by frame in the DTW execution sections, normalizes each of them, and delivers them to the difference matrix generating means 222. Here, normalizing the data means transforming the series of acoustic parameters delivered frame by frame from the DTW execution section limiting means 220 so that their arithmetic mean and standard deviation take fixed values. In the present embodiment, the normalization is performed according to Equation 1 below.
(Equation 1) AfterDat[i] = (BeforDat[i] - AVR) / STD

In Equation 1, BeforDat[i] is the acoustic parameter for the i-th frame delivered from the DTW execution section limiting means 220, STD is the standard deviation of that acoustic parameter, AVR is its arithmetic mean, and AfterDat[i] is the normalized acoustic parameter for the i-th frame.
By applying the normalization of Equation 1, the acoustic parameters delivered from the DTW execution section limiting means 220 for the model voice and for the trainee voice are each converted into acoustic parameters with an arithmetic mean of "0" and a standard deviation of "1" (that is, standardized data).
This normalization makes it possible to compare the model voice and the trainee voice with factors such as differences in the sound pickup environment removed. In addition, a difference in volume level between the model voice and the trainee voice, or a pitch difference in whole octaves, is in most cases unrelated to singing skill, and the normalization also removes such differences inherent to the individual voices.
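Equation 1 is an ordinary standardization, which can be sketched in Python as below. The use of the population standard deviation and the handling of a constant series (standard deviation of zero) are assumptions made for this sketch; the text does not specify either.

```python
from statistics import fmean, pstdev


def normalize(before_dat):
    """Apply Equation 1: AfterDat[i] = (BeforDat[i] - AVR) / STD."""
    avr = fmean(before_dat)
    std = pstdev(before_dat)  # population standard deviation (assumed)
    if std == 0:
        # Constant series: return zeros instead of dividing by zero (assumed).
        return [0.0 for _ in before_dat]
    return [(x - avr) / std for x in before_dat]


after_dat = normalize([1.0, 2.0, 3.0, 4.0])
```

Each parameter type (volume velocity, each spectral band, and so on) would be normalized separately for the model voice and for the trainee voice.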

The difference matrix generating means 222 calculates the Euclidean distance (hereinafter also called the "difference") between the acoustic parameters of each frame of the model voice and each frame of the trainee voice, generates a matrix whose components are these differences (hereinafter, the "difference matrix"), and stores it in the RAM 13. For example, when the trainee voice starts at frame 0 and ends at frame N while the model voice starts at frame 0 and ends at frame M (N and M are natural numbers), the difference matrix generating means 222 generates a difference matrix of (N+1) rows and (M+1) columns whose (i, j) component (where 0 ≤ i ≤ N and 0 ≤ j ≤ M) is the value given by Equation 2 below.
(Equation 2)
Sqr{ Σ( WeightScalar[k] * (GuideSpectrum[j][k] - SingerSpectrum[i][k])^2 )
   + Σ( WeightVector[k] * (ΔGuideSpectrum[j][k] - ΔSingerSpectrum[i][k])^2 )
   + (ΔGuidePower[j] - ΔSingerPower[i])^2
   + (ΔΔGuidePower[j] - ΔΔSingerPower[i])^2
} / num

In Equation 2,
GuideSpectrum[j][k]: spectral component of the k-th passband of the j-th frame of the model voice
SingerSpectrum[i][k]: spectral component of the k-th passband of the i-th frame of the trainee voice
ΔGuideSpectrum[j][k]: k-th spectral velocity of the j-th frame of the model voice
ΔSingerSpectrum[i][k]: k-th spectral velocity of the i-th frame of the trainee voice
ΔGuidePower[j]: volume velocity of the j-th frame of the model voice
ΔSingerPower[i]: volume velocity of the i-th frame of the trainee voice
ΔΔGuidePower[j]: volume acceleration of the j-th frame of the model voice
ΔΔSingerPower[i]: volume acceleration of the i-th frame of the trainee voice
WeightScalar[k]: weighting coefficient
WeightVector[k]: weighting coefficient
num: the number of parameters used to obtain the Euclidean distance (in the present embodiment, (N+1) × (M+1))
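The computation of Equation 2 can be sketched in Python as follows. The sketch assumes that each weighting coefficient applies per band inside the summation over k, and it represents each frame as a dictionary with hypothetical keys (`spectrum`, `d_spectrum`, `d_power`, `dd_power`); neither the data layout nor the function names come from the text.

```python
import math


def frame_distance(guide, singer, weight_scalar, weight_vector, num):
    """Value of Equation 2 for one (trainee frame, model frame) pair."""
    spec = sum(w * (g - s) ** 2 for w, g, s in
               zip(weight_scalar, guide["spectrum"], singer["spectrum"]))
    d_spec = sum(w * (g - s) ** 2 for w, g, s in
                 zip(weight_vector, guide["d_spectrum"], singer["d_spectrum"]))
    d_pow = (guide["d_power"] - singer["d_power"]) ** 2
    dd_pow = (guide["dd_power"] - singer["dd_power"]) ** 2
    return math.sqrt(spec + d_spec + d_pow + dd_pow) / num


def difference_matrix(singer_frames, guide_frames, weight_scalar, weight_vector):
    """Rows index trainee frames i, columns index model frames j."""
    num = len(singer_frames) * len(guide_frames)  # (N+1) x (M+1) as in the text
    return [[frame_distance(g, s, weight_scalar, weight_vector, num)
             for g in guide_frames] for s in singer_frames]


# One-band toy frames that differ only in the spectral component.
guide = {"spectrum": [1.0], "d_spectrum": [0.0], "d_power": 0.0, "dd_power": 0.0}
singer = {"spectrum": [4.0], "d_spectrum": [0.0], "d_power": 0.0, "dd_power": 0.0}
single = difference_matrix([singer], [guide], [1.0], [1.0])
grid = difference_matrix([singer, singer], [guide, guide, guide], [1.0], [1.0])
```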

Here, WeightScalar[k] is a coefficient that weights the acoustic parameters that do not depend on temporal change, and its value is selected according to whether the trainee singing voice and the model voice are voiced (periodic sound) or unvoiced (aperiodic sound). Specifically, when both are voiced, values are selected so that the low-frequency part of the spectrum is weighted; when both are unvoiced, values are selected so that the high-frequency part of the spectrum is weighted. Whether each voice is voiced or unvoiced is determined from its pitch and volume. Specifically, the difference matrix generating means 222 judges a time unit to be voiced when the pitch is at or above a predetermined threshold and the volume is also at or above a predetermined threshold, and judges it to be unvoiced otherwise.
In contrast, WeightVector[k] is a coefficient that weights the acoustic parameters that depend on temporal change, and it assigns weight to the mid-frequency part of the spectrum.
In Equation 2, the symbol Σ denotes summation over the index k, "^2" denotes squaring, and Sqr{} denotes the square root.
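The voiced/unvoiced judgment and the resulting choice of WeightScalar[k] might look like the Python sketch below. The text gives no concrete weight values and does not say what happens when only one of the two voices is voiced, so the boost factor, the split of the bands into a lower and an upper half, and the uniform weights in the mixed case are all assumptions for illustration.

```python
def is_voiced(pitch, volume, pitch_threshold, volume_threshold):
    """Voiced when both pitch and volume are at or above their thresholds."""
    return pitch >= pitch_threshold and volume >= volume_threshold


def select_weight_scalar(guide_voiced, singer_voiced, n_bands, boost=2.0):
    """Choose per-band weights for the time-independent spectral term."""
    half = n_bands // 2
    if guide_voiced and singer_voiced:
        # Both voiced: emphasize the low-frequency bands.
        return [boost] * half + [1.0] * (n_bands - half)
    if not guide_voiced and not singer_voiced:
        # Both unvoiced: emphasize the high-frequency bands.
        return [1.0] * half + [boost] * (n_bands - half)
    # Mixed case is not specified in the text; uniform weights are assumed.
    return [1.0] * n_bands


voiced = is_voiced(pitch=220.0, volume=0.5, pitch_threshold=80.0, volume_threshold=0.1)
weights = select_weight_scalar(voiced, voiced, n_bands=4)
```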

From the (N+1) × (M+1) components of the difference matrix generated by the difference matrix generating means 222, the optimum route specifying means 223 selects those components for which the difference between the frame number of the corresponding model voice frame and the frame number of the corresponding trainee voice frame does not exceed a specified value. For example, the difference matrix in FIG. 13 illustrates the case where the specified value is "2" and shows only the components that satisfy this condition.
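A minimal Python sketch of this component limiting is shown below. It marks excluded components with infinity so that a later minimum-cost search never passes through them; that representation and the function name `limit_band` are assumptions, since the text only says which components are selected.

```python
import math


def limit_band(diff, limit):
    """Keep components whose frame-number difference |i - j| is within limit."""
    return [[value if abs(i - j) <= limit else math.inf
             for j, value in enumerate(row)]
            for i, row in enumerate(diff)]


# 4 x 5 matrix (N = 3, M = 4) with the specified value 2, as in the FIG. 13 example.
diff = [[1.0] * 5 for _ in range(4)]
limited = limit_band(diff, limit=2)
```

Note that with this kind of band the upper-right corner stays selectable only when N and M themselves differ by no more than the specified value.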

Next, the optimum route specifying means 223 will be described further. The optimum route specifying means 223 specifies the optimum route on the difference matrix that has undergone the component limiting process (such as the difference matrix shown in FIG. 13). The following description, however, uses a difference matrix whose components have not been limited, as shown in FIG. 6. In the difference matrix of FIG. 6, among the routes from the lower-left corner (the (0, 0) component) to the upper-right corner (the (N, M) = (3, 4) component), the route that minimizes the accumulated value of the components lying on it is specified as the route representing the correspondence between the time units of the trainee voice and the model voice, and data representing the time correspondence indicated by that route is delivered to the evaluation module 23. More specifically, the optimum route specifying means 223 specifies the optimum route according to the rules below.

(Rule 1) The route search starts from the lower-left corner of the difference matrix, and the process of selecting the next component so that the accumulated component value is minimized is repeated until the upper-right corner is reached. Each move is restricted to one step to the right, up, or to the upper right; for example, a move from the (i, j) component is restricted to the (i, j+1), (i+1, j), or (i+1, j+1) component. When the accumulated value for moving right equals that for moving up, the move to the right takes priority. Likewise, when moving right and moving to the upper right give equal accumulated values, the move to the right takes priority, and when moving up and moving to the upper right give equal accumulated values, the move up takes priority.
(Rule 2) The route selected according to Rule 1 is traced back from the upper-right corner to the lower-left corner to specify the optimum route.
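Rules 1 and 2 correspond to the usual dynamic-programming form of DTW: accumulate the minimum cost into each component, then trace back from the upper-right corner. The Python sketch below is one interpretation under that assumption. In particular, it applies the tie-breaking order (right, then up, then upper right) when choosing the predecessor during the trace-back; the function name `optimal_path` is hypothetical.

```python
def optimal_path(cost):
    """Return (cumulative matrix, path) for a cost matrix indexed [i][j].

    Allowed moves are (i, j) -> (i, j+1), (i+1, j), (i+1, j+1).
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    cum = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                cum[0][0] = cost[0][0]
                continue
            best = min(
                cum[i][j - 1] if j > 0 else inf,                  # arrived by moving right
                cum[i - 1][j] if i > 0 else inf,                  # arrived by moving up
                cum[i - 1][j - 1] if i > 0 and j > 0 else inf,    # arrived diagonally
            )
            cum[i][j] = cost[i][j] + best
    # Trace back from the upper-right corner; the middle tuple element
    # encodes the tie-breaking priority (right < up < upper right).
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if j > 0:
            candidates.append((cum[i][j - 1], 0, (i, j - 1)))
        if i > 0:
            candidates.append((cum[i - 1][j], 1, (i - 1, j)))
        if i > 0 and j > 0:
            candidates.append((cum[i - 1][j - 1], 2, (i - 1, j - 1)))
        _, _, (i, j) = min(candidates)
        path.append((i, j))
    path.reverse()
    return cum, path


cum, path = optimal_path([[1, 2], [3, 1]])
```

For this 2 × 2 toy matrix the diagonal route is cheapest, with an accumulated value of 2 at the upper-right corner.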

The evaluation module 23 in FIG. 2 compares the signal waveforms of the model voice data and the trainee voice data whose time axes have been associated by the DTW module 22, converts the degree to which the trainee voice matches the model voice into a score, and displays the score on the display unit 15. When comparing the waveform of the trainee voice with that of the model voice, the evaluation module 23 first stretches or compresses the time axis of the trainee voice so that it matches the time axis of the model voice, according to the result of the dynamic time warping performed by the DTW module 22, and then compares the two waveforms.
The above is the configuration of the karaoke apparatus 1. In the present embodiment, the basic analysis module 21 and the DTW module 22, which carry the functions characteristic of the musical piece practice assisting device according to the present invention, have been described as software modules, but each of these modules may of course be implemented as a hardware module.

(B: Operation)
Next, of the scoring process performed by the karaoke apparatus 1, the operations that most clearly show its features (that is, the operations of the basic analysis module 21 and the DTW module 22) will be described with reference to the drawings. In the operation example below, it is assumed that the power supply (illustrated) of the karaoke apparatus 1 has already been turned on and that the control unit 11 is operating according to the control program loaded from the ROM 12 into the RAM 13.

A trainee who wishes to practice singing with the karaoke apparatus 1 can designate the song to be practiced, for example by entering its song identifier, by operating the operation unit 16 while referring to a menu screen or the like shown on the display unit 15. When the song is designated in this way, the control unit 11 loads the accompaniment data and lyrics data corresponding to that song identifier from the storage unit 14 into the RAM 13. When the trainee then performs an operation on the operation unit 16 instructing the start of the performance, the control unit 11 causes the audio processing unit 18 to play back the accompaniment according to the accompaniment data read into the RAM 13, displays on the display unit 15 a karaoke screen in which the lyrics telop represented by the lyrics data is embedded, and wipes the lyrics as the song progresses.

The trainee watches the karaoke screen and sings the song along with the accompaniment emitted from the speaker. The trainee's singing voice is picked up by the microphone 17, and trainee voice data corresponding to the singing voice is written sequentially to the trainee voice data storage area 14c. Once the trainee voice data has been stored in the trainee voice data storage area 14c, the control unit 11 reads out this trainee voice data and the model voice data stored in the model voice data storage area 14b in association with the song identifier, and executes the scoring process shown in FIG. 5.

FIG. 5 is a flowchart showing the flow of the scoring process that the control unit 11 performs according to the control program. As shown in FIG. 5, the control unit 11 analyzes the model voice data and the trainee voice data and extracts acoustic parameters for each predetermined time unit (a frame in the present embodiment) from the beginning to the end of the song (step SA100). The process of step SA100 is executed by the basic analysis module 21 described above.

Next, the control unit 11 specifies, as the song sections to be subjected to dynamic time warping (DTW), the parts of the model voice data in which singing takes place and the parts of the trainee voice data in which singing actually took place (step SA110). It then generates, from the data on the various acoustic parameters extracted in step SA100, data consisting only of the DTW execution sections specified in this way, and delivers that data to the normalizing means 221. The process of step SA110 is executed by the DTW execution section limiting means 220 of the DTW module 22 described above.

In the karaoke apparatus 1 according to the present embodiment, there is naturally no need to evaluate singing in parts where no singing takes place, such as interludes. Generating data that corresponds only to the sung parts, as described above, therefore makes the subsequent processing more efficient.
Next, the control unit 11 normalizes the acoustic parameters for each parameter type (step SA120) and generates a difference matrix from the normalized acoustic parameters (step SA130). The process of step SA120 is executed by the normalizing means 221 of the DTW module 22 described above, and the process of step SA130 is executed by the difference matrix generating means 222 of the same DTW module 22. In this operation example, it is assumed that the difference matrix shown in FIG. 6 has been generated as a result of the processing up to step SA130.

Here, the optimum route specifying means 223 selects, from the components of the difference matrix generated as described above, those for which the difference between the frame number of the corresponding model voice frame and the frame number of the corresponding trainee voice frame is at or below a predetermined specified value. For example, in the difference matrix shown in FIG. 13 the specified value is "2", and the components are limited to those satisfying this condition (step SA140).

Next, the control unit 11 specifies the optimum route from the components of the difference matrix limited in step SA140 (step SA150). The process of step SA150 is executed by the optimum route specifying means 223 described above. Specifically, the optimum route specifying means 223 specifies the optimum route by the procedure described below.

First, the optimum route specifying means 223 accumulates the component values along the route that follows the first column of the difference matrix (see FIG. 7). For example, the (0, 0) component, which is the starting point of this route, has the value "1" and the (1, 0) component has the value "4" (see FIG. 6), so the accumulated value after moving from the (0, 0) component to the (1, 0) component is "5" (see FIG. 7). Because the (2, 0) component has the value "1", the accumulated value for the move (0, 0) → (1, 0) → (2, 0) is "6" (see FIG. 7). The accumulation continues in the same way up to the (3, 0) component, giving the result shown in FIG. 7.
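The first-column accumulation is a running sum, as the short Python check below shows. Only the three component values quoted in the text are used; the value of the (3, 0) component appears only in the figure and is therefore left out.

```python
from itertools import accumulate

# Values of the (0, 0), (1, 0), and (2, 0) components quoted in the text.
first_column = [1, 4, 1]
cumulative = list(accumulate(first_column))
print(cumulative)  # [1, 5, 6]
```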

Next, the optimum route specifying means 223 accumulates the component values for the second column in the same way as for the first column (see FIG. 8). It then repeats the accumulation in the same manner until the upper-right corner of the difference matrix (that is, the (3, 4) component) is reached (see FIG. 9).

When the accumulation of component values has been completed up to the upper-right corner of the difference matrix, as shown in FIG. 9, the optimum route specifying means 223 takes that upper-right corner as the starting point and, among the grid points from which a move to the starting point is possible (that is, the grid points to the left of, to the lower left of, or below the starting point), specifies as the route candidate the grid point whose accumulated component value along the route leading to it is smallest. The optimum route specifying means 223 then repeats the process of taking the specified route candidate as the new starting point and specifying the next route candidate until the route candidate coincides with the grid point at the lower-left corner. As a result, for the difference matrix shown in FIG. 7, the optimum route candidate shown in FIG. 10 (that is, (3, 4) → (2, 3) → (1, 2) → (1, 1) → (0, 0)) is specified.
Next, the optimum route specifying means 223 traces the optimum route candidate specified in this way in reverse, confirms that the accumulated value would increase if the route departed from that candidate, and thereby specifies the optimum route (see FIG. 11).

The optimum route specified in this way represents the correspondence between the time axis of the model voice and the time axis of the trainee voice. Specifically, the optimum route shown in FIG. 11 indicates that the frames of the model voice and the frames of the trainee voice correspond as shown in FIG. 12. The optimum route specifying means 223 generates data indicating the correspondence shown in FIG. 12 and outputs that data to the evaluation module 23 (step SA160).

Thereafter, the evaluation module 23 applies time alignment to the trainee voice data so that it satisfies the correspondence specified by the optimum route specifying means 223, compares the result with the model voice data, converts the comparison result into a score, and displays the score on the display unit 15.
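The text does not describe how the time alignment or the scoring is carried out, so the Python sketch below illustrates only one possible alignment step: it maps per-frame trainee values onto the model time axis using a route of (i, j) pairs, averaging when several trainee frames map to one model frame. The averaging and the function name `align_to_model` are assumptions; the route used in the example is the one quoted above for FIG. 10.

```python
def align_to_model(path, singer_values, n_model_frames):
    """Warp per-frame trainee values onto the model voice time axis.

    path holds (i, j) pairs: trainee frame i corresponds to model frame j.
    """
    buckets = [[] for _ in range(n_model_frames)]
    for i, j in path:
        buckets[j].append(singer_values[i])
    # Average the trainee frames that fall on the same model frame.
    return [sum(b) / len(b) for b in buckets]


path = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4)]
aligned = align_to_model(path, [10.0, 20.0, 30.0, 40.0], n_model_frames=5)
```

Here trainee frame 1 is stretched over model frames 1 and 2, so the four trainee frames cover the five model frames.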

As described above, the karaoke apparatus 1 according to the present embodiment limits the DTW execution sections to the voiced sections described earlier (in other words, excludes silent sections) in the process of evaluating the trainee's singing. Limiting the DTW execution sections, and also limiting the components of the difference matrix, reduces the amount of computation required.

(C: Modifications)
One embodiment of the present invention has been described above, but the embodiment may of course be modified as described below.
(1) In the embodiment described above, the function for performing the dynamic time warping (DTW) process characteristic of the present invention is incorporated into a karaoke apparatus so that, when a song is sung using various techniques, the differences from the techniques used in the model singing can be evaluated. However, the processing target of the dynamic time warping performed by the DTW module 22 is not limited to singing voices. It may be performance sound data of a musical instrument played using various techniques together with the corresponding model performance data, and the process can also be used for learning foreign languages, such as English conversation.

(2) In the embodiment described above, whether a voice is voiced or unvoiced is determined from the pitch and volume of the trainee voice and the model voice, and the weight given to the acoustic parameters that do not depend on temporal change (the spectrum, in the above embodiment) is switched according to the result. The voiced/unvoiced determination may of course be made from the pitch alone or from the volume alone. Furthermore, because this switching of weights is not essential, an aspect that omits the switching obviously has no need to detect the pitch or to deliver the volume data from the basic analysis module 21 to the DTW module 22.

(3) In the embodiment described above, each time dynamic time warping is performed between the trainee singing voice and the model voice, the model voice data stored in the model voice data storage area 14b is analyzed by the basic analysis module 21 to calculate the acoustic parameters of the singing voice represented by that model voice data. However, the acoustic parameters of the model voice data may of course be obtained in advance and stored in the storage unit 14 in association with the song identifier.
Also, in the embodiment described above, the model voice data is stored in the storage unit 14 provided in the karaoke apparatus 1. Alternatively, the model voice data or the acoustic parameters extracted from it may be written to a computer-readable recording medium such as a CD-ROM (Compact Disk-Read Only Memory) or a DVD (Digital Versatile Disk) and distributed, and the model voice data or acoustic parameters may be acquired by reading them from such a recording medium. The acoustic parameters of the model voice may also be acquired via a telecommunication line such as the Internet.

(4) In the embodiment described above, the basic analysis module 21, which extracts acoustic parameters from the trainee voice data and the model voice data, and the DTW module 22, which associates the time axes of the model voice and the trainee voice based on those acoustic parameters, are implemented as separate software modules, but they may of course be configured as a single software module. Specifically, the dynamic time warping module that normalizes the acoustic parameters and performs dynamic time warping using the normalized parameters may also be given the function of extracting acoustic parameters from the trainee voice data and the function of acquiring the acoustic parameters of the model voice, either by extraction from the model voice data or by reading them from a recording medium or the like.

(5) In the above-described embodiment, the control program that causes the control unit 11 to realize the functions characteristic of the music practice support device according to the present invention is written in the ROM 12 in advance. However, the control program may be recorded and distributed on a computer-readable recording medium such as a CD-ROM or a DVD, or may of course be distributed by download via a telecommunication line such as the Internet.

(6) In the above-described embodiment, the acoustic parameters used for dynamic time alignment are the pitch, volume, and spectrum, together with the first and second derivatives of the volume and the first derivative of the spectrum. Of these, the first and second derivatives of the volume represent the degree of temporal change in volume, but the second derivative is not essential. The spectrum may of course also be differentiated up to the second order so that its degree of temporal change is reflected more accurately in the dynamic time alignment.
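As a rough illustration of the derivative parameters in modification (6), the sketch below approximates the first and second derivatives of a per-frame volume track by finite differences. The padding of the leading frames with 0.0, the function names, and the sample values are assumptions; the source does not specify how the differentiation means 214 computes its output.

```python
def first_difference(track):
    """Frame-to-frame change; the first frame has no predecessor, so it is set to 0.0."""
    return [0.0][:len(track)] + [track[i] - track[i - 1] for i in range(1, len(track))]

def second_difference(track):
    """Change of the change; the first two frames are set to 0.0."""
    head = [0.0, 0.0][:len(track)]
    return head + [track[i] - 2 * track[i - 1] + track[i - 2]
                   for i in range(2, len(track))]

# Hypothetical per-frame volume values.
volume = [0.0, 1.0, 3.0, 6.0, 6.0]
d1 = first_difference(volume)
d2 = second_difference(volume)

# One feature tuple per frame: (volume, first derivative, second derivative).
frames = list(zip(volume, d1, d2))
```

Dropping the second derivative, as the text allows, simply means omitting `d2` from each frame tuple; applying `second_difference` to each spectral component would extend the spectrum in the same way.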

(7) In the above-described embodiment, the difference matrix generating means 222 calculates values for all components of the difference matrix (step SA130), after which the optimum path specifying means 223 restricts the components that can be path candidates (step SA140). However, the difference matrix generating means 222 may instead skip, already in step SA130, the components that would be excluded in step SA140.
The reason is as follows. Because the singer sings while watching the lyrics telop displayed on the display unit 15, the singer is unlikely to sing in a way that deviates extremely from the model voice, for example by singing a part of the song whose lyrics telop has not yet been displayed, or by singing late a part whose lyrics telop has already finished. Such extreme singing corresponds to combinations in which the frame numbers of the model voice and the practitioner voice differ greatly. Therefore, there is little need to consider components of the difference matrix, such as the (N, 0) component, that consist of differences calculated for frames whose numbers differ greatly between the model voice and the practitioner voice.
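The idea in modification (7) can be sketched as follows: differences are computed only for components whose frame numbers differ by less than a band width, and the minimum-cost path is then searched over those components alone. The scalar per-frame features, the absolute difference used as the evaluation value, the three-neighbor step pattern, and all names here are assumptions for illustration, not the patent's own definitions.

```python
import math

def banded_difference_matrix(model, practitioner, band):
    """Return {(i, j): |model[i] - practitioner[j]|} only where |i - j| < band."""
    cells = {}
    for i, m in enumerate(model):
        lo = max(0, i - band + 1)
        hi = min(len(practitioner), i + band)
        for j in range(lo, hi):
            cells[(i, j)] = abs(m - practitioner[j])
    return cells

def optimal_path(cells, n_model, n_practitioner):
    """Minimum-cost path from (0, 0) to the last component, using generated cells only."""
    cost, back = {}, {}
    for (i, j) in sorted(cells):  # row-major order: predecessors come first
        if (i, j) == (0, 0):
            cost[(i, j)] = cells[(i, j)]
            continue
        best_prev, best_cost = None, math.inf
        for prev in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
            if prev in cost and cost[prev] < best_cost:
                best_prev, best_cost = prev, cost[prev]
        if best_prev is None:
            continue  # not reachable from the start inside the band
        cost[(i, j)] = best_cost + cells[(i, j)]
        back[(i, j)] = best_prev
    end = (n_model - 1, n_practitioner - 1)
    if end not in cost:
        raise ValueError("band too narrow to connect start and end")
    path = [end]
    while path[-1] in back:
        path.append(back[path[-1]])
    return path[::-1], cost[end]

# Hypothetical normalized features: the practitioner holds the first note one frame longer.
model = [1.0, 2.0, 3.0, 4.0]
practitioner = [1.0, 1.0, 2.0, 3.0, 4.0]

cells = banded_difference_matrix(model, practitioner, band=2)
path, total = optimal_path(cells, len(model), len(practitioner))
```

With a band of 2, only 11 of the 20 components are generated, yet the path that pairs the held note with the model's first frame is still found.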

(8) In the above-described embodiment, the DTW implementation section limiting means 220 sets as the DTW implementation section a section in which the volume of the model voice or the practitioner voice exceeds a predetermined threshold and the period for which it does so exceeds a predefined threshold. However, the DTW implementation section may be determined based on the pitch of the model voice in addition to, or instead of, the volume. Specifically, any of the following may be used as the DTW implementation section: (a) a section in which the pitch of the model voice exceeds a predetermined threshold and the period for which it does so exceeds a predefined threshold; (b) a section in which both the volume and the pitch of the model voice exceed predetermined thresholds and the period for which they do so exceeds a predefined threshold; (c) a section in which the volume of the model voice exceeds a predetermined threshold (the duration of the section is not considered); (d) a section in which the pitch of the model voice exceeds a predetermined threshold (the duration of the section is not considered).
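The section rules in modification (8) can be sketched with a single helper that scans a per-frame track. The threshold values, the frame-count representation of the duration condition (a run counts when it lasts at least `min_frames` frames), and the names below are assumptions for illustration.

```python
def select_sections(track, level_threshold, min_frames=1):
    """Return (start, end) frame ranges, end exclusive, in which the track stays
    above level_threshold for at least min_frames consecutive frames."""
    sections = []
    start = None
    for i, value in enumerate(track):
        if value > level_threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_frames:
                sections.append((start, i))
            start = None
    if start is not None and len(track) - start >= min_frames:
        sections.append((start, len(track)))
    return sections

def both_above(volume, pitch, volume_threshold, pitch_threshold):
    """Indicator track: 1.0 where volume and pitch both exceed their thresholds."""
    return [1.0 if v > volume_threshold and p > pitch_threshold else 0.0
            for v, p in zip(volume, pitch)]

# Hypothetical per-frame values for the model voice.
volume = [0.0, 0.5, 0.6, 0.0, 0.7, 0.0, 0.8, 0.9, 0.9, 0.0]
pitch = [0.0, 220.0, 0.0, 0.0, 230.0, 0.0, 240.0, 240.0, 0.0, 0.0]

by_volume_and_duration = select_sections(volume, 0.3, min_frames=2)  # as in the embodiment
by_volume_only = select_sections(volume, 0.3)                        # variant (c)
by_pitch_only = select_sections(pitch, 100.0)                        # variant (d)
by_both = select_sections(both_above(volume, pitch, 0.3, 100.0), 0.5, min_frames=2)  # variant (b)
```

Variant (a) is the same call as the first one with the pitch track in place of the volume track.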

(9) In the above-described embodiment, model voice data representing a karaoke song as sung by a singer who has that song in his or her repertoire is stored in the storage unit 14 in association with the song identifier that uniquely identifies the song.
However, when several singers each have the same song in their repertoire, the song may be treated as a different song for each singer and given a different song identifier for each. Alternatively, the song identifier that uniquely identifies the song may be associated with singer identifiers that uniquely identify each of those singers, and the model voice data representing the song identified by the song identifier as sung by the singer identified by the singer identifier may be stored in the storage unit 14 in association with each pair of song identifier and singer identifier. As described above, singing technique generally differs from singer to singer, and even for the same song the emotion and flavor put into the singing generally differ with the singer. If singers can be identified as described above, then even when several singers have the same song in their repertoire, the user can select the singing of a preferred singer from among them and practice by imitating it.
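The second arrangement in modification (9), in which model voice data is keyed by a pair of song identifier and singer identifier, might look like the following minimal sketch. The in-memory dictionary standing in for the storage unit 14, the identifier strings, and the function names are all hypothetical.

```python
# Stand-in for the storage unit 14: (song_id, singer_id) -> model voice data.
model_voice_store = {}

def register_model_voice(song_id, singer_id, voice_data):
    """Store model voice data under the pair of song and singer identifiers."""
    model_voice_store[(song_id, singer_id)] = voice_data

def singers_for(song_id):
    """List the singers for whom model voice data of this song is stored."""
    return sorted(singer for (song, singer) in model_voice_store if song == song_id)

def load_model_voice(song_id, singer_id):
    """Fetch the model voice data for the singer chosen by the user."""
    return model_voice_store[(song_id, singer_id)]

# Hypothetical identifiers and placeholder sample data.
register_model_voice("song-001", "singer-A", [0.1, 0.2, 0.3])
register_model_voice("song-001", "singer-B", [0.3, 0.2, 0.1])
register_model_voice("song-002", "singer-A", [0.5, 0.5])

choices = singers_for("song-001")
selected = load_model_voice("song-001", choices[1])
```

The first arrangement in the text, a distinct song identifier per singer, corresponds to folding the singer into the key itself, so no separate singer lookup is needed.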

A block diagram showing an example of the hardware configuration of the karaoke apparatus 1 according to one embodiment of the present invention.
A block diagram showing an example of the functional configuration of the karaoke apparatus 1.
A diagram for explaining the acoustic parameter extraction performed by the basic analysis module 21.
A diagram showing an example of the result of the dynamic time alignment process performed by the DTW module 22.
A flowchart showing the flow of the scoring process performed by the karaoke apparatus 1.
A diagram showing an example of the difference matrix.
A diagram showing an example of the difference matrix during the optimum path specifying process.
A diagram showing an example of the difference matrix during the optimum path specifying process.
A diagram showing an example of the difference matrix during the optimum path specifying process.
A diagram showing an example of the difference matrix during the optimum path specifying process.
A diagram showing an example of the optimum path specified by the optimum path specifying process.
A diagram for explaining the result of the dynamic time alignment process.
A diagram showing an example of a difference matrix whose components have been restricted.

Explanation of symbols

1 ... karaoke apparatus, 10 ... bus, 11 ... control unit, 12 ... ROM, 13 ... RAM, 14 ... storage unit (14a: accompaniment/lyrics data storage area, 14b: model voice data storage area, 14c: practitioner voice data storage area), 15 ... display unit, 16 ... operation unit, 17 ... microphone, 18 ... audio processing unit, 19 ... speaker, 21 ... basic analysis module, 22 ... DTW module, 23 ... evaluation module, 211 ... pitch detection means, 212 ... volume detection means, 213 ... spectrum detection means, 214, 214a, 214b, 214c ... differentiation means, 220 ... DTW implementation section limiting means, 221 ... normalization means, 222 ... difference matrix generating means, 223 ... optimum path specifying means

Claims

A music practice support device comprising a dynamic time alignment module, the dynamic time alignment module having:
extraction means for analyzing a first audio signal representing the waveform of a singing sound or performance sound produced by a user, and extracting, for each predetermined time unit, a first acoustic parameter representing the pitch, volume, and spectrum of the sound and the degree of temporal change of the volume and spectrum;
acquisition means for acquiring a second acoustic parameter obtained for each time unit by analyzing a second audio signal representing the waveform of a singing sound or performance sound that serves as a model for the user, the second acoustic parameter representing the pitch, volume, and spectrum of the sound represented by the second audio signal in each time unit and the degree of temporal change of the volume and spectrum;
section selection means for selecting from the first audio signal, with reference to the first acoustic parameter, a voiced section in which at least one of pitch and volume exceeds a predetermined threshold, and for selecting from the second audio signal, with reference to the second acoustic parameter, a voiced section in which at least one of pitch and volume exceeds a predetermined threshold;
normalization means for applying, to the first acoustic parameter of each time unit in the voiced section selected by the section selection means, normalization that converts the mean and standard deviation to predetermined values, and for applying the same normalization to the second acoustic parameter of each time unit in the voiced section selected by the section selection means;
calculation means for calculating, for each grid point whose coordinate values are time units on a coordinate plane having the time axis of the first audio signal as one coordinate axis and the time axis of the second audio signal as the other coordinate axis, an evaluation value computed from the difference between the normalized first acoustic parameter and the normalized second acoustic parameter;
grid point selection means for selecting, on the coordinate plane, only grid points at which the difference between the coordinate values is smaller than a predetermined value;
specifying means for specifying, among paths on the coordinate plane that run from a start point, which is the grid point having the smallest coordinate values, to an end point, which is the grid point having the largest coordinate values, and that pass only through grid points selected by the grid point selection means, the path that minimizes the sum of the evaluation values at the grid points on the path; and
association means for associating the time units of the first audio signal with the time units of the second audio signal along the path specified by the specifying means;
wherein the signal waveform of the first audio signal and the signal waveform of the second audio signal are compared for each pair of time units associated by the dynamic time alignment module, and the degree of coincidence between the two is scored and output.

The music practice support device according to claim 1, wherein, when a silent section in which at least one of pitch and volume falls below a predetermined threshold continues for a predetermined time in the first and second audio signals, the section selection means selects from the first and second audio signals the sections excluding that silent section.

The music practice support device according to claim 1, further comprising storage means for storing one or more pairs of a song identifier that uniquely identifies a song and the second acoustic parameter corresponding to the song identified by that song identifier,
wherein the acquisition means reads from the storage means and acquires the second acoustic parameter corresponding to the song identifier selected by the user.

A dynamic time alignment module comprising:
extraction means for analyzing a first audio signal representing the waveform of a singing sound or performance sound produced by a user, and extracting, for each predetermined time unit, a first acoustic parameter representing the pitch, volume, and spectrum of the sound and the degree of temporal change of the volume and spectrum;
acquisition means for acquiring a second acoustic parameter obtained for each time unit by analyzing a second audio signal representing the waveform of a singing sound or performance sound that serves as a model for the user, the second acoustic parameter representing the pitch, volume, and spectrum of the sound represented by the second audio signal in each time unit and the degree of temporal change of the volume and spectrum;
section selection means for selecting from the first audio signal, with reference to the first acoustic parameter, a voiced section in which at least one of pitch and volume exceeds a predetermined threshold, and for selecting from the second audio signal, with reference to the second acoustic parameter, a voiced section in which at least one of pitch and volume exceeds a predetermined threshold;
normalization means for applying, to the first acoustic parameter of each time unit in the voiced section selected by the section selection means, normalization that converts the mean and standard deviation to predetermined values, and for applying the same normalization to the second acoustic parameter of each time unit in the voiced section selected by the section selection means;
calculation means for calculating, for each grid point whose coordinate values are time units on a coordinate plane having the time axis of the first audio signal as one coordinate axis and the time axis of the second audio signal as the other coordinate axis, an evaluation value computed from the difference between the normalized first acoustic parameter and the normalized second acoustic parameter;
grid point selection means for selecting, on the coordinate plane, only grid points at which the difference between the coordinate values is smaller than a predetermined value;
specifying means for specifying, among paths on the coordinate plane that run from a start point, which is the grid point having the smallest coordinate values, to an end point, which is the grid point having the largest coordinate values, and that pass only through grid points selected by the grid point selection means, the path that minimizes the sum of the evaluation values at the grid points on the path; and
association means for associating the time units of the first audio signal with the time units of the second audio signal along the path specified by the specifying means.

A program that causes a computer apparatus to function as:
extraction means for analyzing a first audio signal representing the waveform of a singing sound or performance sound produced by a user, and extracting, for each predetermined time unit, a first acoustic parameter representing the pitch, volume, and spectrum of the sound and the degree of temporal change of the volume and spectrum;
acquisition means for acquiring a second acoustic parameter obtained for each time unit by analyzing a second audio signal representing the waveform of a singing sound or performance sound that serves as a model for the user, the second acoustic parameter representing the pitch, volume, and spectrum of the sound represented by the second audio signal in each time unit and the degree of temporal change of the volume and spectrum;
section selection means for selecting from the first audio signal, with reference to the first acoustic parameter, a voiced section in which at least one of pitch and volume exceeds a predetermined threshold, and for selecting from the second audio signal, with reference to the second acoustic parameter, a voiced section in which at least one of pitch and volume exceeds a predetermined threshold;
normalization means for applying, to the first acoustic parameter of each time unit in the voiced section selected by the section selection means, normalization that converts the mean and standard deviation to predetermined values, and for applying the same normalization to the second acoustic parameter of each time unit in the voiced section selected by the section selection means;
calculation means for calculating, for each grid point whose coordinate values are time units on a coordinate plane having the time axis of the first audio signal as one coordinate axis and the time axis of the second audio signal as the other coordinate axis, an evaluation value computed from the difference between the normalized first acoustic parameter and the normalized second acoustic parameter;
grid point selection means for selecting, on the coordinate plane, only grid points at which the difference between the coordinate values is smaller than a predetermined value;
specifying means for specifying, among paths on the coordinate plane that run from a start point, which is the grid point having the smallest coordinate values, to an end point, which is the grid point having the largest coordinate values, and that pass only through grid points selected by the grid point selection means, the path that minimizes the sum of the evaluation values at the grid points on the path; and
association means for associating the time units of the first audio signal with the time units of the second audio signal along the path specified by the specifying means.