JP2008040258A - Musical piece practice assisting device, dynamic time warping module, and program

Info

Publication number: JP2008040258A
Application number: JP2006216057A
Authority: JP
Inventors: Hiroshi Kayama; 啓嘉山
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-08-08
Filing date: 2006-08-08
Publication date: 2008-02-21

Abstract

PROBLEM TO BE SOLVED: To evaluate how well a user reproduces various techniques when practicing singing or musical performance modeled on a singing or musical performance that makes full use of those techniques.

SOLUTION: A karaoke device compares a first audio signal representing a user's singing voice with a second audio signal representing the singing voice of a model singer, and scores and outputs the degree to which they match. The device is provided with a DTW module that normalizes a first acoustic parameter obtained by analyzing the first audio signal and a second acoustic parameter obtained by analyzing the second audio signal, and then performs dynamic time warping based on those acoustic parameters. The karaoke device evaluates the degree to which the first and second audio signals match in each pair of time units that the DTW module has made to correspond to each other.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a technique for comparing a user's singing or musical performance with a model and evaluating it.

Some karaoke devices pick up the user's singing voice with a microphone, compare the change over time in the pitch of that singing voice with the change over time in the pitch of the karaoke accompaniment (hereinafter, "guide melody"), and score and output the degree to which the two match (hereinafter, the scoring function); see, for example, Patent Document 1 and Patent Document 2. Karaoke devices of this kind make it easy to practice singing.
Patent Document 1: JP 2005-215493 A
Patent Document 2: JP 2003-15673 A

Skilled singers, however, do not always sing strictly according to the score. They may express emotion and nuance through a variety of singing techniques, such as deliberately shifting where a phrase starts or ends, varying voice quality or volume, or using vibrato or kobushi (a melismatic vocal ornament). Such expression differs from singer to singer, and each singer often has characteristic habits, for example always applying vibrato at the end of a phrase, or always holding back at the start of a phrase (deliberately delaying the onset of singing).
Meanwhile, users who practice singing with a karaoke device often want to imitate the singing techniques of their favorite singer, and when they practice with a karaoke device they may also want an evaluation of how well they reproduced those techniques.

However, the techniques disclosed in Patent Documents 1 and 2 cannot meet this need. Worse, a singing technique such as always delaying the start of a phrase may be penalized as a deviation from the score. This is because the guide melody used as the evaluation standard in those techniques faithfully reproduces the pitch changes of the piece as written in the score, and the techniques aim to evaluate whether the piece was sung faithfully to the score. The same applies not only to singing but also to playing a musical instrument.
The present invention has been made in view of the above problem. Its object is to provide a technique that makes it possible to evaluate the degree to which techniques are reproduced when a user practices singing or playing using, as a model, a singing or performance that makes full use of various techniques.

To solve the above problem, the present invention provides a music practice support device comprising a dynamic time warping module, the module having: extraction means for analyzing a first audio signal representing the waveform of a user's singing or performance sound and extracting, for each predetermined time unit, a first acoustic parameter representing the degree of change over time in volume, the spectrum, and the degree of change over time in the spectrum in that time unit; acquisition means for acquiring a second acoustic parameter that is obtained for each time unit by analyzing a second audio signal representing the waveform of a singing or performance sound that the user takes as a model, and that represents the degree of change over time in volume, the spectrum, and the degree of change over time in the spectrum of the sound represented by the second audio signal in each time unit; normalization means for applying, to the first acoustic parameter of each time unit, a normalization that converts its mean and standard deviation to predetermined values, and applying the same normalization to the second acoustic parameter of each time unit; calculation means for calculating, on a coordinate plane having the time axis of the first audio signal as one coordinate axis and the time axis of the second audio signal as the other, an evaluation value computed from the difference between the normalized first acoustic parameter and the normalized second acoustic parameter, for each grid point whose coordinate values are time units; specifying means for specifying, among the paths on the coordinate plane from a start point, which is the grid point at which both coordinate values are smallest, to an end point, which is the grid point at which both coordinate values are largest, the path that minimizes the sum of the evaluation values at the grid points on the path; and association means for associating time units of the first audio signal with time units of the second audio signal along the path specified by the specifying means. The device compares the signal waveform of the first audio signal with that of the second audio signal for each pair of time units associated by the dynamic time warping module, and scores and outputs the degree to which the two match.
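By way of illustration only, the path search performed by the specifying means and the association means can be sketched in Python as follows. This is a minimal sketch and not the claimed implementation: the function name `dtw_path` and the set of allowed steps (one grid point along either axis, or one diagonal step) are assumptions that the text above does not specify.

```python
def dtw_path(cost):
    """Return (total, path) for the cheapest monotonic path through `cost`.

    cost[i][j] is the evaluation value at grid point (i, j). The path runs
    from the start point (0, 0) to the end point (rows - 1, cols - 1).
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]  # accumulated cost per grid point
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = cost[i][j] + best

    # Backtrack from the end point to the start point.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path
```

Each pair (i, j) on the returned path associates time unit i of the first audio signal with time unit j of the second audio signal.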

In a more preferred aspect, the extraction means also extracts, for each time unit, the pitch of the sound represented by the first audio signal, and the acquisition means also acquires, for each time unit, the pitch of the sound represented by the second audio signal. The normalization means applies the normalization to the pitch of the first audio signal in each time unit and to the pitch of the second audio signal in each time unit. When calculating an evaluation value, the calculation means takes the difference between the degree of change over time in the spectrum of the first audio signal and that of the second audio signal, in the time unit corresponding to the coordinate values of the grid point with which the evaluation value is associated, and multiplies that difference by a coefficient that depends on the pitch value in that time unit.

In another preferred aspect, the device comprises storage means storing one or more pairs of a music identifier that uniquely identifies a piece of music and the first audio signal representing the waveform of a model singing or model performance of that piece. The acquisition means reads from the storage means the second audio data corresponding to the music identifier specified by the user, analyzes that second audio signal, and acquires the second acoustic parameter for each time unit.

To solve the above problem, the present invention also provides a dynamic time warping module comprising: extraction means for analyzing a first audio signal representing the waveform of a user's singing or performance sound and extracting, for each predetermined time unit, a first acoustic parameter representing the degree of change over time in volume, the spectrum, and the degree of change over time in the spectrum in that time unit; acquisition means for acquiring a second acoustic parameter that is obtained for each time unit by analyzing a second audio signal representing the waveform of a singing or performance sound that the user takes as a model, and that represents the degree of change over time in volume, the spectrum, and the degree of change over time in the spectrum of the sound represented by the second audio signal in each time unit; normalization means for applying, to the first acoustic parameter of each time unit, a normalization that converts its mean and standard deviation to predetermined values, and applying the same normalization to the second acoustic parameter of each time unit; calculation means for calculating, on a coordinate plane having the time axis of the first audio signal as one coordinate axis and the time axis of the second audio signal as the other, an evaluation value computed from the difference between the normalized first acoustic parameter and the normalized second acoustic parameter, for each grid point whose coordinate values are time units; specifying means for specifying, among the paths on the coordinate plane from a start point, which is the grid point at which both coordinate values are smallest, to an end point, which is the grid point at which both coordinate values are largest, the path that minimizes the sum of the evaluation values at the grid points on the path; and association means for associating time units of the first audio signal with time units of the second audio signal along the path specified by the specifying means.

To solve the above problem, the present invention further provides a program that causes a computer device to function as: extraction means for analyzing a first audio signal representing the waveform of a user's singing or performance sound and extracting, for each predetermined time unit, a first acoustic parameter representing the degree of change over time in volume, the spectrum, and the degree of change over time in the spectrum in that time unit; acquisition means for acquiring a second acoustic parameter that is obtained for each time unit by analyzing a second audio signal representing the waveform of a singing or performance sound that the user takes as a model, and that represents the degree of change over time in volume, the spectrum, and the degree of change over time in the spectrum of the sound represented by the second audio signal in each time unit; normalization means for applying, to the first acoustic parameter of each time unit, a normalization that converts its mean and standard deviation to predetermined values, and applying the same normalization to the second acoustic parameter of each time unit; calculation means for calculating, on a coordinate plane having the time axis of the first audio signal as one coordinate axis and the time axis of the second audio signal as the other, an evaluation value computed from the difference between the normalized first acoustic parameter and the normalized second acoustic parameter, for each grid point whose coordinate values are time units; specifying means for specifying, among the paths on the coordinate plane from a start point, which is the grid point at which both coordinate values are smallest, to an end point, which is the grid point at which both coordinate values are largest, the path that minimizes the sum of the evaluation values at the grid points on the path; and association means for associating time units of the first audio signal with time units of the second audio signal along the path specified by the specifying means.

According to the present invention, when a user practices singing or playing using, as a model, a singing or performance that makes full use of various techniques, the degree to which those techniques are reproduced can be evaluated.

An embodiment of the present invention is described below with reference to the drawings.
(A: Configuration)
FIG. 1 is a block diagram showing an example of the hardware configuration of a karaoke device 1, which is an embodiment of the music practice support device according to the present invention. As shown in FIG. 1, the karaoke device 1 has a control unit 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage unit 14, a display unit 15, an operation unit 16, an audio processing unit 18, and a bus 10 that mediates the exchange of data among them.
The control unit 11 is, for example, a CPU (Central Processing Unit). It reads the control program stored in the ROM 12, loads it into the RAM 13, and executes it, thereby controlling each unit of the karaoke device 1.

The storage unit 14 is a large-capacity storage means such as a hard disk, and has an accompaniment/lyrics data storage area 14a, a model voice data storage area 14b, and a practitioner voice data storage area 14c.

The display unit 15 is, for example, a liquid crystal display and its drive circuit. Under the control of the control unit 11, it displays various screens, such as a menu screen that prompts use of the karaoke device 1 and a karaoke screen in which scrolling lyrics are superimposed on a background image. The operation unit 16 has various keys, such as a numeric keypad, and outputs a signal corresponding to the pressed key to the control unit 11.

A microphone 17 and a speaker 19 are connected to the audio processing unit 18. The microphone 17 picks up the singing voice of a user who practices singing with the karaoke device 1 (hereinafter, the practitioner) and outputs an audio signal (analog data) corresponding to that singing voice to the audio processing unit 18. The audio processing unit 18 converts the audio signal (analog data) output from the microphone 17 into audio data (digital data) and outputs it to the control unit 11. It also converts audio data received from the control unit 11 into an audio signal and outputs it to the speaker 19. The speaker 19 emits sound corresponding to the audio signal output from the audio processing unit 18.

For each of one or more pieces of music, the accompaniment/lyrics data storage area 14a of the storage unit 14 stores, in association with each other, accompaniment data in which the performance sounds of the accompanying instruments (the so-called guide melody) are recorded in the order in which the piece progresses, and lyrics data indicating the lyrics of the piece. The accompaniment data is, for example, data in MIDI (Musical Instruments Digital Interface) format, and is played back when the practitioner sings a karaoke song. The lyrics data is displayed on the display unit 15 as scrolling lyrics during that karaoke singing.
In more detail, the accompaniment data and lyrics data stored in the accompaniment/lyrics data storage area 14a are each associated with an identifier that uniquely identifies a karaoke song (for example, a music code made up of letters, symbols, digits, and the like; hereinafter, music identifier), and this music identifier associates the accompaniment data and the lyrics data with each other. The music identifier is used when the practitioner specifies the piece to be practiced.

The model voice data storage area 14b stores, in association with the music identifier described above, WAVE-format audio data (hereinafter, model voice data) representing the waveform of the singing sound (hereinafter, model voice) of the piece identified by that music identifier, as sung by the singer whose song it is. This model voice data is used as the reference when evaluating the practitioner's singing.

The practitioner voice data storage area 14c stores the audio data obtained by A/D conversion of the signal from the microphone 17 in the audio processing unit 18 (hereinafter, practitioner voice data), for example in WAVE format.

Next, the functional configuration of the karaoke device 1 is described with reference to the block diagram in FIG. 2. The basic analysis module 21, the dynamic time warping (hereinafter, DTW) module 22, and the evaluation module 23 shown in FIG. 2 are software modules realized by the control unit 11 executing the control program described above. The arrows in the figure schematically indicate the flow of data. In addition to these three software modules, a karaoke performance module, which plays back the accompaniment of the karaoke song specified by the practitioner and mixes and outputs that accompaniment with the practitioner's singing voice, is also realized by the control unit 11 executing the control program. Because the function of this karaoke performance module is no different from that of a conventional karaoke device, it is neither illustrated nor described in detail.

The basic analysis module 21 extracts acoustic parameters (pitch, volume, spectrum, the degree of change over time in volume, and the degree of change over time in the spectrum) for each predetermined time unit (in this embodiment, one frame) from each of the model voice data read from the model voice data storage area 14b and the practitioner voice data read from the practitioner voice data storage area 14c. Although this embodiment describes the case where the time unit for extracting the acoustic parameters is one frame, the parameters may instead be extracted per subframe, obtained by further dividing a frame, or per group of several frames. What matters is that the time unit used for the model voice data matches the time unit used for the practitioner voice data; the length of the time unit is immaterial.
As shown in FIG. 2, the basic analysis module 21 includes pitch detection means 211, volume detection means 212, spectrum detection means 213, and differentiating means 214a to 214c. Audio data passed to the basic analysis module 21 (that is, model voice data or practitioner voice data) is split three ways, as shown in FIG. 2, and passed to each of the pitch detection means 211, the volume detection means 212, and the spectrum detection means 213.

The pitch detection means 211 computes the autocorrelation of the audio data for one time unit, detects the pitch in that time unit, and outputs pitch data indicating the result. The pitch data output from the pitch detection means 211 is passed to the DTW module 22, as shown in FIG. 2. Although this embodiment detects the pitch in each time unit by autocorrelation, the pitch may of course be detected instead by, for example, computing the cepstrum for each time unit.
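As an illustration of pitch detection by autocorrelation, the following Python sketch picks the lag with the largest autocorrelation inside a search range and converts it to a frequency. It is a minimal sketch under stated assumptions: the function name `detect_pitch`, the 80 Hz to 1000 Hz search range, and the rule of returning 0.0 for frames with no positive correlation are all hypothetical and do not come from the embodiment.

```python
import math


def detect_pitch(frame, sample_rate, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation.

    Returns 0.0 when no lag in the search range has positive correlation.
    """
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(len(frame) - 1, int(sample_rate / fmin))
    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame with itself shifted by `lag`.
        corr = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag if best_lag else 0.0


# Example: one 256-sample frame of a 200 Hz sine sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(256)]
print(detect_pitch(tone, 8000))
```

A practical detector would usually normalize the correlation and guard against octave errors; the sketch omits both for brevity.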

The volume detection means 212 calculates the arithmetic mean of the absolute amplitudes of the samples contained in the audio data for one time unit (256 samples in this embodiment; see FIG. 3), and outputs the result as volume data indicating the volume in that time unit. The volume data output from the volume detection means 212 is split two ways, as shown in FIG. 2: one copy is passed to the DTW module 22 and the other to the differentiating means 214a.
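The volume calculation described above can be sketched in Python as follows. The function name `frame_volumes` and the choice to drop a trailing partial frame are assumptions made for illustration; the 256-sample frame size is the value given for this embodiment.

```python
def frame_volumes(samples, frame_size=256):
    """Mean absolute amplitude of each complete frame of `samples`."""
    volumes = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        volumes.append(sum(abs(s) for s in frame) / frame_size)
    return volumes
```

Calling `frame_volumes` on the model voice data and on the practitioner voice data with the same `frame_size` keeps the time units of the two signals aligned, which is the only requirement the embodiment places on the time unit.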

The differentiating means 214a calculates the first derivative of the volume (hereinafter, "velocity") from the volume data for several consecutive time units, and outputs volume velocity data indicating the result. The volume velocity data indicates whether the volume of the sound represented by the audio data is trending upward or downward over those time units. In this embodiment, as shown in FIG. 3, the differentiating means 214a generates the volume velocity data from the volume data for five consecutive frames. The volume velocity data is split two ways, as shown in FIG. 2: one copy is passed to the DTW module 22 and the other to the differentiating means 214b.

The differentiating means 214b calculates the first derivative of the volume velocity data for several consecutive time units (that is, the second derivative of the volume; hereinafter, volume acceleration), and outputs volume acceleration data indicating the result. The volume acceleration data indicates whether the rate of change of the volume velocity is tending to increase or decrease over those consecutive time units. As shown in FIG. 2, the volume acceleration data output from the differentiating means 214b is passed to the DTW module 22.
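The embodiment states that velocity is computed from five consecutive frames but does not give the formula. One common choice, shown here purely as an assumption, is the least-squares slope over a centered five-frame window; applying the same function twice yields the acceleration. The function name `delta` and the edge handling (repeating the nearest valid frame) are hypothetical.

```python
def delta(values, half_width=2):
    """Least-squares slope over a centered window of 2 * half_width + 1 frames.

    Frames beyond either end are replaced by the nearest valid frame.
    """
    last = len(values) - 1
    denom = 2 * sum(k * k for k in range(1, half_width + 1))
    out = []
    for t in range(len(values)):
        num = 0.0
        for k in range(1, half_width + 1):
            ahead = values[min(t + k, last)]
            behind = values[max(t - k, 0)]
            num += k * (ahead - behind)
        out.append(num / denom)
    return out


# Velocity and acceleration of a per-frame volume series.
volume = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
velocity = delta(volume)
acceleration = delta(velocity)
```

With this choice, a positive velocity means the volume is rising across the window and a negative one means it is falling, which matches the interpretation given for the volume velocity data.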

The spectrum detection means 213 applies an FFT (Fast Fourier Transform) to the audio data for two consecutive time units, as shown in FIG. 3, and then passes the result through band-pass filters with predetermined passbands. Because the input in this embodiment is singing-voice audio data, 1/2-octave band-pass filters are used from 0 to 2 kHz and 1/4-octave band-pass filters from 2 to 8 kHz. The output is emitted as spectrum data representing the spectrum in that time unit (that is, the component in each passband). The spectrum data output from the spectrum detection means 213 is split two ways, as shown in FIG. 2: one copy is passed to the DTW module 22 and the other to the differentiating means 214c.
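The idea of reducing a frame to one value per passband can be illustrated with the Python sketch below. It is a deliberately simplified stand-in and not the filter bank of the embodiment: it uses a naive discrete Fourier transform instead of an FFT, sums squared magnitudes inside caller-supplied band edges instead of applying 1/2-octave and 1/4-octave filters, and the function name `band_energies` and the band edges in the example are hypothetical.

```python
import cmath
import math


def band_energies(frame, sample_rate, band_edges):
    """Sum of squared DFT magnitudes in each band [edges[b], edges[b + 1])."""
    n = len(frame)
    magnitudes = []
    for b in range(n // 2 + 1):  # bins from 0 Hz up to the Nyquist frequency
        acc = sum(frame[t] * cmath.exp(-2j * cmath.pi * b * t / n)
                  for t in range(n))
        magnitudes.append(abs(acc))
    energies = []
    for low, high in zip(band_edges, band_edges[1:]):
        energies.append(sum(m * m for b, m in enumerate(magnitudes)
                            if low <= b * sample_rate / n < high))
    return energies


# Example: a 1 kHz sine sampled at 8 kHz, split into three illustrative bands.
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
print(band_energies(tone, 8000, [0, 500, 2000, 4000]))
```

The list returned for each frame plays the role of the per-passband spectrum data that is passed on to the DTW module 22 and to the differentiating means 214c.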

The differentiating means 214c calculates the spectral velocity from the spectrum data for several consecutive time units (five consecutive frames in this embodiment), and outputs spectral velocity data indicating the result. The spectral velocity data output from the differentiating means 214c is passed to the DTW module 22, as shown in FIG. 2.
This completes the description of the configuration of the basic analysis module 21.

Next, the functional configuration of the DTW module 22 will be described.
As shown in FIG. 4, the DTW module 22 specifies the correspondence between each time unit of the practitioner's voice and each time unit of the model voice. As shown in FIG. 2, it includes a normalizing means 221, a difference matrix generating means 222, and an optimum path specifying means 223.
The normalizing means 221 receives from the basic analysis module 21, for each of the model voice and the practitioner's voice, the acoustic parameters in each time unit from the beginning to the end of the singing, normalizes them separately for each voice, and passes them to the difference matrix generating means 222. Here, normalization means applying, to the series of acoustic parameters delivered from the basic analysis module 21 for each time unit, a transformation that makes their arithmetic mean and standard deviation take fixed values. In the present embodiment, the normalization is performed according to Equation 1 below.
(Equation 1) AfterDat[i] = (BeforDat[i] − AVR) / STD
In Equation 1, BeforDat[i] is the acoustic parameter for the i-th frame delivered from the basic analysis module 21, STD is the standard deviation of that acoustic parameter, AVR is its arithmetic mean, and AfterDat[i] is the normalized acoustic parameter for the i-th frame.
Applying the normalization of Equation 1 converts the acoustic parameters delivered from the basic analysis module 21 into acoustic parameters with an arithmetic mean of "0" and a standard deviation of "1" (that is, standardized data). Such normalization makes it possible to compare the practitioner's voice with the model voice after removing the differences between their sound collection environments. Moreover, a difference in volume level between the model voice and the practitioner's voice, or a pitch difference of whole octaves, in most cases has nothing to do with singing skill; normalization can also remove such differences inherent in the individual voices, and can mitigate the influence of sudden changes in pitch or volume.
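The normalization of Equation 1 can be sketched in Python as follows. This is a minimal illustration rather than the embodiment's actual code: the function name `normalize` is hypothetical, the use of the population standard deviation is an assumption (the text does not say which estimator is used), and returning zeros for a constant series is likewise an assumption made only to avoid division by zero.

```python
import statistics


def normalize(before_dat):
    """Apply Equation 1 to one series of acoustic parameters (one value per frame)."""
    avr = statistics.fmean(before_dat)    # AVR: arithmetic mean
    std = statistics.pstdev(before_dat)   # STD: population standard deviation (assumption)
    if std == 0.0:
        # Constant series: Equation 1 is undefined, so return zeros (assumption).
        return [0.0] * len(before_dat)
    return [(x - avr) / std for x in before_dat]


# A constant offset or gain applied to the input cancels out after normalization.
quiet = [0.1, 0.2, 0.4, 0.3]
loud = [10.0 * x + 3.0 for x in quiet]
normalized_quiet = normalize(quiet)
normalized_loud = normalize(loud)
```

In this sketch `normalized_quiet` and `normalized_loud` agree up to floating-point rounding, which illustrates why a level difference between the two recordings does not affect the later comparison.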

The difference matrix generating means 222 obtains the Euclidean distance (hereinafter also called the "difference") between the acoustic parameters for each time unit of the practitioner's voice and the acoustic parameters for each time unit of the model voice, and generates in the RAM 13 a matrix whose components are those difference values (hereinafter, the difference matrix). For example, when the practitioner's voice begins at the 0th frame and ends at the N-th frame, while the model voice begins at the 0th frame and ends at the M-th frame (N and M being natural numbers of 1 or more), the difference matrix generating means 222 generates a difference matrix of (N+1) rows and (M+1) columns whose (i, j) component (where 0 ≤ i ≤ N, 0 ≤ j ≤ M) is the value given by Equation 2 below.
(Equation 2) Sqr{ Σ( WeightScalar[k] * (GuideSpectrum[j][k] − SingerSpectrum[i][k])^2 )
+ Σ( WeightVector[k] * (ΔGuideSpectrum[j][k] − ΔSingerSpectrum[i][k])^2 )
+ (ΔGuidePower[j] − ΔSingerPower[i])^2
+ (ΔΔGuidePower[j] − ΔΔSingerPower[i])^2
} / num
In Equation 2:
GuideSpectrum[j][k]: spectral component of the k-th passband of the j-th frame of the model voice
SingerSpectrum[i][k]: spectral component of the k-th passband of the i-th frame of the practitioner's voice
ΔGuideSpectrum[j][k]: k-th spectral velocity of the j-th frame of the model voice
ΔSingerSpectrum[i][k]: k-th spectral velocity of the i-th frame of the practitioner's voice
ΔGuidePower[j]: volume velocity of the j-th frame of the model voice
ΔSingerPower[i]: volume velocity of the i-th frame of the practitioner's voice
ΔΔGuidePower[j]: volume acceleration of the j-th frame of the model voice
ΔΔSingerPower[i]: volume acceleration of the i-th frame of the practitioner's voice
WeightScalar[k]: weighting coefficient
WeightVector[k]: weighting coefficient
num: the number of parameters for which the Euclidean distance is obtained (for example, when the practitioner's voice spans the 0th to the N-th frame and the model voice spans the 0th to the M-th frame, num = (N+1) × (M+1)).
WeightScalar[k] is a coefficient that weights the acoustic parameters that do not depend on time change, and its value is selected according to whether the practitioner's singing and the model voice are voiced (periodic sound) or unvoiced (aperiodic sound). Specifically, when both are voiced, the value is selected so that the low-frequency part of the spectrum is weighted; when both are unvoiced, the value is selected so that the high-frequency part of the spectrum is weighted. Whether the practitioner's singing and the model voice are voiced or unvoiced is determined from the pitch and volume of each: the difference matrix generating means 222 judges a time unit to be voiced when its pitch is at or above a predetermined threshold and its volume is also at or above a predetermined threshold, and to be unvoiced otherwise.
WeightVector[k], on the other hand, is a coefficient that weights the acoustic parameters that depend on time change, and gives weight to the mid-frequency part of the spectrum.
In Equation 2, the symbol Σ denotes the sum over the subscript k (that is, over all passbands), "^2" denotes squaring, and Sqr{} denotes the square root.
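One (i, j) component of the difference matrix can be sketched in Python as follows. This is an illustration under stated assumptions, not the embodiment's code: the weights are assumed to apply per passband inside the sum over k, each frame is represented by a hypothetical dictionary with the keys `spectrum`, `d_spectrum`, `d_power`, and `dd_power`, and the function names and threshold parameters are invented for the example.

```python
import math


def is_voiced(pitch, volume, pitch_threshold, volume_threshold):
    """Voiced only when both pitch and volume reach their thresholds."""
    return pitch >= pitch_threshold and volume >= volume_threshold


def difference(guide, singer, weight_scalar, weight_vector, num):
    """Weighted Euclidean distance between one model frame and one practitioner frame."""
    total = sum(
        w * (g - s) ** 2
        for w, g, s in zip(weight_scalar, guide["spectrum"], singer["spectrum"])
    )
    total += sum(
        w * (g - s) ** 2
        for w, g, s in zip(weight_vector, guide["d_spectrum"], singer["d_spectrum"])
    )
    total += (guide["d_power"] - singer["d_power"]) ** 2      # volume velocity term
    total += (guide["dd_power"] - singer["dd_power"]) ** 2    # volume acceleration term
    return math.sqrt(total) / num


guide_frame = {"spectrum": [1.0, 2.0], "d_spectrum": [0.0, 0.0], "d_power": 2.0, "dd_power": 0.0}
singer_frame = {"spectrum": [0.0, 0.0], "d_spectrum": [0.0, 0.0], "d_power": 0.0, "dd_power": 0.0}
value = difference(guide_frame, singer_frame, [1.0, 1.0], [1.0, 1.0], num=1)
```

A caller would pick `weight_scalar` per frame pair from the result of `is_voiced` (low-band emphasis when both frames are voiced, high-band emphasis when both are unvoiced); the concrete weight values are not given in the text and are left to the caller here.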

The optimum path specifying means 223 identifies, among the paths in the difference matrix generated by the difference matrix generating means 222 that lead from its lower left corner (the (0, 0) component) to its upper right corner (the (N, M) component), the path that minimizes the accumulated value of the components lying on it. It treats this path as the optimum path representing the correspondence between the time units of the practitioner's voice and those of the model voice, and passes data representing the time correspondence indicated by the path to the evaluation module 23.
More specifically, the optimum path specifying means 223 specifies the optimum path according to the rules described below.
(Rule 1) Start the path search from the lower left corner of the difference matrix, and repeat, until the upper right corner is reached, the process of selecting the move for which the accumulated value including the destination component is smallest.
However, a single move is limited to right, up, or upper right. For example, a move from the (i, j) component is limited to a move to the (i, j+1) component, the (i+1, j) component, or the (i+1, j+1) component.
When the accumulated value for a move to the right equals that for a move up, the move to the right takes priority. Likewise, when the accumulated values for a move to the right and a move to the upper right are equal, the move to the right takes priority, and when the accumulated values for a move up and a move to the upper right are equal, the move up takes priority.
(Rule 2) Trace the path selected under Rule 1 backward from the upper right corner to the lower left corner to identify the optimum path.
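Rules 1 and 2 can be sketched in Python as follows. This is a minimal illustration, not the embodiment's code: the function name `find_optimum_path` is hypothetical, `diff[i][j]` is assumed to hold the difference between practitioner frame i and model frame j, and back pointers are kept as index pairs rather than as character strings.

```python
def find_optimum_path(diff):
    """Return (path, accumulated value) from (0, 0) to the last component of diff."""
    rows, cols = len(diff), len(diff[0])
    acc = [[0.0] * cols for _ in range(rows)]
    back = [[None] * cols for _ in range(rows)]

    # Rule 1: accumulate, choosing for each component the cheapest incoming move.
    for i in range(rows):
        for j in range(cols):
            # (accumulated value at source, tie-break priority, source indices)
            candidates = []
            if j > 0:
                candidates.append((acc[i][j - 1], 0, (i, j - 1)))          # move right
            if i > 0:
                candidates.append((acc[i - 1][j], 1, (i - 1, j)))          # move up
            if i > 0 and j > 0:
                candidates.append((acc[i - 1][j - 1], 2, (i - 1, j - 1)))  # move upper right
            if not candidates:            # lower left corner (0, 0)
                acc[i][j] = diff[i][j]
                continue
            best, _, source = min(candidates)   # ties: right, then up, then upper right
            acc[i][j] = best + diff[i][j]
            back[i][j] = source

    # Rule 2: follow the back pointers from the upper right corner to (0, 0).
    path = [(rows - 1, cols - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        i, j = path[-1]
        path.append(back[i][j])
    path.reverse()
    return path, acc[-1][-1]
```

Sorting the candidates as tuples lets the second element encode the stated priority order, so equal accumulated values are resolved in favor of the move to the right, then the move up.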

The evaluation module 23 in FIG. 2 takes the model voice and the practitioner's voice whose time units have been associated with each other by the DTW module 22, compares their signal waveforms in each pair of corresponding time units, converts the degree to which the practitioner's voice matches the model voice into a score, and displays the score on the display unit 15.
As described above, the hardware configuration of the karaoke apparatus 1 according to the present embodiment is the same as that of a general computer, and the functions characteristic of the music practice support apparatus according to the present invention are realized by software modules (namely, the basic analysis module 21 and the DTW module 22). Although the present embodiment has been described for the case where the basic analysis module and the DTW module characteristic of the music practice support apparatus according to the present invention are realized as software modules, these modules may of course be realized in hardware.
The above is the configuration of the karaoke apparatus 1.

(B: Operation)
Next, among the operations performed by the karaoke apparatus 1, those that most clearly show its characteristic features (namely, the operations of the basic analysis module 21 and the DTW module 22) will be described with reference to the drawings. In the operation example described below, it is assumed that the power supply (illustrated) of the karaoke apparatus 1 has already been turned on and that the control unit 11 is operating according to the control program loaded from the ROM 12 into the RAM 13.

A practitioner who wishes to practice singing with the karaoke apparatus 1 specifies the song to be practiced by operating the operation unit 16 as appropriate, while referring to the menu screen and the like displayed on the display unit 15, to enter the song identifier of the desired song. When the song to be practiced has been specified in this way, the control unit 11 loads the accompaniment data and lyrics data corresponding to that song identifier from the storage unit 14 into the RAM 13. When the practitioner then performs an operation on the operation unit 16 instructing the start of the performance, the control unit 11 causes the audio processing unit 18 to start reproducing the accompaniment sound according to the accompaniment data read into the RAM 13, displays on the display unit 15 a karaoke screen in which the lyrics telop represented by the lyrics data is embedded, and wipes the lyrics in step with the progress of the song.

The practitioner, watching the karaoke screen and listening to the accompaniment sound emitted from the speaker, begins singing when the song's entry timing arrives. The practitioner's singing is picked up by the microphone 17, and practitioner voice data corresponding to it is sequentially written into the practitioner voice data storage area 14c. When the practitioner voice data has been stored in the practitioner voice data storage area 14c in this way, the control unit 11 reads out this practitioner voice data and the model voice data stored in the model voice data storage area 14b in association with the song identifier specified by the user on the menu screen, and executes the scoring process shown in FIG. 5. In this operation example, the practitioner's voice spans four frames, from the 0th frame to the 3rd frame, while the model voice spans five frames, from the 0th frame to the 4th frame.

FIG. 5 is a flowchart showing the flow of the scoring process that the control unit 11 performs according to the control program. As shown in FIG. 5, the control unit 11 extracts acoustic parameters from the practitioner's voice for each time unit from the beginning to the end of the singing, and likewise extracts acoustic parameters from the model voice for each time unit from the beginning to the end of the singing (step SA0100). The processing of step SA0100 is executed by the basic analysis module 21 described above.

Next, the control unit 11 normalizes the acoustic parameters extracted in step SA0100 (step SA0110) and generates the difference matrix from the normalized acoustic parameters (step SA0120). The processing of step SA0110 is executed by the normalizing means 221 described above, and the processing of step SA0120 is executed by the difference matrix generating means 222. In this operation example, it is assumed that, as a result of the processing up to step SA0120, the difference matrix of 4 rows and 5 columns shown in FIG. 6 is generated and stored in the RAM 13.

Next, the control unit 11 specifies the optimum path from the difference matrix generated in step SA0120 (step SA0130). The processing of step SA0130 is executed by the optimum path specifying means 223 described above; specifically, the optimum path specifying means 223 specifies the optimum path by the procedure described below.
The optimum path specifying means 223 first calculates, for the components belonging to the 0th column (the leftmost column) of the difference matrix, the accumulated values resulting from moves made according to (Rule 1) above (see FIG. 7). For example, since the value of the (0, 0) component is "1" and the value of the (1, 0) component is "4" (see FIG. 6), the accumulated value for the move from the (0, 0) component to the (1, 0) component is "5" (see FIG. 7). Since the value of the (2, 0) component is "1", the accumulated value for the sequence of moves (0, 0) → (1, 0) → (2, 0) is "6" (see FIG. 7). The accumulated values are calculated in the same way up to the (3, 0) component, giving the result shown in FIG. 7. When calculating the accumulated value for a move, the optimum path specifying means 223 stores in the RAM 13 an identifier uniquely indicating the source component of the move (in this embodiment, the two subscripts of that component) in association with an identifier indicating the destination component. For example, for a move from the (0, 0) component to the (1, 1) component, the character string data "(0,0)→(1,1)" is stored in the RAM 13. The data stored in the RAM 13 in this way is used as back pointers in the traceback performed when specifying the optimum path.

As with the 0th column, the optimum path specifying means 223 accumulates the components along the moves for the 1st column as well. Specifically, it first accumulates the components for the move from the (0, 0) component to the (0, 1) component. As shown in FIG. 8, since the value of the (0, 0) component is "1" and the value of the (0, 1) component is "3", the accumulated value for the move from the (0, 0) component to the (0, 1) component is "4".
Next, the optimum path specifying means 223 accumulates the components for the move to the (1, 1) component. The point to note here is that three move patterns into the (1, 1) component are possible: the upward move from the (0, 1) component, the upper-right move from the (0, 0) component, and the rightward move from the (1, 0) component.

Of these three move patterns, the optimum path specifying means 223 selects the one that minimizes the accumulated value, and calculates the accumulated value for the move to the (1, 1) component according to that pattern. As shown in FIG. 7, the accumulated value for the move from the (0, 1) component to the (1, 1) component is "5", that for the move from the (0, 0) component to the (1, 1) component is "2", and that for the move from the (1, 0) component to the (1, 1) component is "6". The optimum path specifying means 223 therefore adopts "2" (the accumulated value for the upper-right move from the (0, 0) component) as the accumulated value for the move to the (1, 1) component.
In the same manner, the optimum path specifying means 223 repeats the accumulation of component values along the moves until the upper right corner of the difference matrix (the (4, 5) component) is reached (see FIG. 9).
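The arithmetic for the (1, 1) component can be checked with a few lines of Python. The values of the (0, 0), (1, 0), and (0, 1) components are those quoted in the text; the value "1" for the (1, 1) component does not appear in the text and is inferred here from the quoted accumulated values (for example, 2 − 1 = 1), so it should be read as an assumption.

```python
# Difference values quoted in the text for FIG. 6.
d00, d10, d01 = 1, 4, 3
d11 = 1  # assumption: inferred from the accumulated values "5", "2" and "6"

acc00 = d00
acc10 = acc00 + d10   # upward move within column 0
acc01 = acc00 + d01   # rightward move within row 0

# The three possible moves into the (1, 1) component.
candidates = {
    "up from (0, 1)": acc01 + d11,
    "upper right from (0, 0)": acc00 + d11,
    "right from (1, 0)": acc10 + d11,
}
best_move = min(candidates, key=candidates.get)
acc11 = candidates[best_move]
```

With these inputs the three candidates evaluate to 5, 2, and 6, so the upper-right move from the (0, 0) component is selected, in agreement with the accumulated value "2" stated in the text.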

As shown in FIG. 9, when the accumulation of the components along the moves has been completed up to the upper right corner of the difference matrix, the optimum path specifying means 223 starts from that upper right corner and repeats the process of following the back pointers described above until the lattice point at the lower left corner is reached, thereby specifying an optimum path candidate. As a result, for the difference matrix shown in FIG. 7, the optimum path candidate shown in FIG. 10 (namely, (4,5) → (3,4) → (2,3) → (2,2) → (1,1)) is specified.
Next, the optimum path specifying means 223 traces the optimum path candidate specified in this way in the reverse direction, confirms that any move departing from the candidate would increase the accumulated value, and specifies the optimum path (see FIG. 11).

The optimum path specified as described above represents the correspondence between the time axis of the model voice and the time axis of the practitioner's voice. Specifically, the optimum path shown in FIG. 11 indicates that the time units of the model voice and those of the practitioner's voice correspond as shown in FIG. 12. The optimum path specifying means 223 generates data representing the correspondence shown in FIG. 12 and outputs the data to the evaluation module 23 (step SA0140).

Thereafter, the evaluation module 23 applies time alignment (expansion and contraction of the time axis) to the practitioner voice data so that it satisfies the correspondence specified by the optimum path specifying means 223, compares it with the model voice data, converts the comparison result into a score, and displays the score on the display unit 15.
The point to note here is that the data serving as the reference for evaluation by the evaluation module 23 is not a guide melody such as MIDI data, but model voice data representing the singing of a singer who has the song being practiced in his or her own repertoire. Such model voice data reflects techniques characteristic of that singer, such as "tame" (deliberately holding back) at the start of a phrase. Precisely because those techniques are used, however, the entry timing and the like deviate from the score, which has conventionally made comparison with the practitioner's singing difficult. In the karaoke apparatus 1 according to the present embodiment, by contrast, the dynamic time warping performed by the DTW module 22 associates the time axis of the practitioner's voice with that of the model voice, making it possible to compare and evaluate the two.
Thus, the karaoke apparatus 1 according to the present embodiment has the effect that, when a user practices singing using as a model a performance that employs various techniques, the degree to which those techniques are reproduced can be evaluated.
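The time alignment step can be sketched in Python as follows. The text does not say how several practitioner frames that map to the same model frame are combined, so averaging them is purely an assumption of this sketch, as are the function name `align_to_model` and the representation of the correspondence as a list of (practitioner frame, model frame) index pairs.

```python
def align_to_model(path, singer_values, model_length):
    """Stretch or compress practitioner values onto the model's time axis.

    path: list of (i, j) pairs, i = practitioner frame, j = model frame.
    Assumes the path visits every model frame, which holds for any path
    from (0, 0) to (N, M) built from right, up, and upper-right moves.
    """
    buckets = [[] for _ in range(model_length)]
    for i, j in path:
        buckets[j].append(singer_values[i])
    # Average practitioner frames that share one model frame (assumption).
    return [sum(bucket) / len(bucket) for bucket in buckets]


# Practitioner frame 1 is held for two model frames (a stretch).
stretched = align_to_model([(0, 0), (1, 1), (1, 2), (2, 3)], [10.0, 20.0, 30.0], 4)
# Practitioner frames 0 and 1 collapse onto model frame 0 (a compression).
compressed = align_to_model([(0, 0), (1, 0), (2, 1)], [10.0, 20.0, 30.0], 2)
```

After this step the aligned practitioner sequence has exactly one value per model frame, so a frame-by-frame comparison against the model becomes possible regardless of differences in timing.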

(C: Modifications)
One embodiment of the present invention has been described above, but the following modifications may of course be made to it.
(1) The above embodiment described the case where incorporating the basic analysis module 21 and the DTW module 22 into a karaoke apparatus makes it possible, when singing is performed using various techniques, to evaluate how closely it matches the techniques used in the model singing. However, the targets of acoustic parameter extraction by the basic analysis module 21 and of dynamic time warping by the DTW module 22 are not limited to singing. They may be performance sound data of an instrument played using various techniques and the model performance data serving as its example, and the invention can also be used for learning foreign languages, such as English conversation.

(2) The above embodiment described the case where, each time dynamic time warping between the practitioner's singing and the model voice is performed, the model voice data stored in the model voice data storage area 14b is analyzed by the basic analysis module 21 to calculate the acoustic parameters of the singing it represents. However, the acoustic parameters of the model voice data may of course be obtained in advance and stored in the storage unit 14 in association with the song identifier.
The above embodiment also described the case where the model voice data is stored in the storage unit 14 provided in the karaoke apparatus 1. Alternatively, the model voice data, or the acoustic parameters extracted from it, may be written to and distributed on a computer-readable recording medium such as a CD-ROM (Compact Disk-Read Only Memory) or DVD (Digital Versatile Disk), and obtained by reading them from such a recording medium. The acoustic parameters of the model voice may also be obtained via a telecommunication line such as the Internet.

(3) The above embodiment described the case where the basic analysis module 21, which extracts acoustic parameters from the practitioner voice data and the model voice data, and the DTW module 22, which associates the time axes of the model voice and the practitioner's voice on the basis of those acoustic parameters, are realized as separate software modules. They may of course be configured as a single software module. Specifically, a dynamic time warping module that normalizes the acoustic parameters and performs dynamic time warping using the normalized parameters may additionally be given the function of extracting acoustic parameters from the practitioner voice data and the function of acquiring the acoustic parameters of the model voice, either by extracting them from the model voice data or by reading them from a recording medium or the like.

(4) The above embodiment described the case where pitch, volume, and spectrum, together with the first and second derivatives of the volume and the first derivative of the spectrum, are used as the acoustic parameters for dynamic time warping. Of these, the first and second derivatives of the volume represent the degree to which the volume changes over time, but the second derivative is not essential. For the spectrum as well, derivatives up to the second order may of course be obtained so that the degree of its change over time is reflected more accurately in the dynamic time warping.

(5) The above embodiment described the case where model voice data representing a karaoke song as sung by a singer who has that song in his or her repertoire is stored in the storage unit 14 in association with the song identifier that uniquely identifies the song.
However, when several singers each have the same song in their repertoires, the song may be treated as a different song for each singer and given different song identifiers. Alternatively, a singer identifier uniquely identifying each of those singers may be associated with the song identifier uniquely identifying the song, and model voice data representing the song identified by the song identifier as sung by the singer identified by the singer identifier may be stored in the storage unit 14 in association with each pair of song identifier and singer identifier. As noted above, singing technique generally differs from singer to singer, and even for the same song the emotion and flavor put into the singing generally differ if the singer differs. If the singer can be identified as described above, then even when several singers have the same song in their repertoires, the user can select the singing of a preferred singer from among them and practice by imitating it.
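The second storage scheme, keyed by the pair of song identifier and singer identifier, can be sketched in Python as follows. An in-memory dictionary stands in for the storage unit 14, and all names and identifier values here are hypothetical and chosen only for illustration.

```python
# Model voice data keyed by (song identifier, singer identifier).
model_voices = {}


def register_model_voice(song_id, singer_id, voice_data):
    """Store model voice data for one song as sung by one singer."""
    model_voices[(song_id, singer_id)] = voice_data


def singers_for(song_id):
    """List the singers for whom a model voice of this song is stored."""
    return sorted(singer for (song, singer) in model_voices if song == song_id)


def lookup_model_voice(song_id, singer_id):
    """Fetch the model voice for the song and the singer the user selected."""
    return model_voices[(song_id, singer_id)]


register_model_voice("song-001", "singer-A", b"model voice by singer A")
register_model_voice("song-001", "singer-B", b"model voice by singer B")
register_model_voice("song-002", "singer-A", b"another song by singer A")
choices = singers_for("song-001")
selected = lookup_model_voice("song-001", "singer-B")
```

Using the identifier pair as the key keeps one song identifier per song while still letting the user choose among the singers returned by `singers_for`.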

(6) The above embodiment described the case where whether the practitioner's voice and the model voice are voiced or unvoiced is determined from their pitch and volume, and the weight given to the acoustic parameters that do not depend on time change (the spectrum, in the above embodiment) is switched according to the result. However, the voiced/unvoiced determination may of course be made from the pitch alone or from the volume alone. Needless to say, when the determination is made from the pitch alone, the volume data need not be passed from the basic analysis module 21 to the DTW module 22, and when it is made from the volume alone, the basic analysis module 21 need not detect the pitch (that is, the pitch detection means 211 need not be provided). Moreover, since the switching of weights described above is not essential, in a mode where no such switching is performed there is, needless to say, no need to detect the pitch or to pass the volume data from the basic analysis module 21 to the DTW module 22.

(7) In the embodiment described above, for each of the (N+1) frames spanning the practitioner's singing from start to finish, the Euclidean distance to each of the (M+1) frames spanning the model singing from start to finish is calculated, a difference matrix of (N+1) rows and (M+1) columns is generated, and the optimal path is searched for using all components of that difference matrix.
However, because a karaoke song is sung in step with the progress of the music as indicated by the guide melody and the wipe display of the lyric captions, the progress of the singer's voice never deviates drastically from that of the model (for example, frame 0 of the practitioner's voice will not correspond to frame M of the model). Accordingly, when searching for the optimal path, the processing of step SA0130 may be performed only for those components of the difference matrix whose two subscripts differ by an amount within a predetermined range (a difference in frame numbers, that is, a time difference). Alternatively, when generating the difference matrix, the Euclidean distance given by Equation 2 may be calculated only for components whose two subscripts differ by an amount within the predetermined range, and the processing of step SA0130 may then be executed only for the components so calculated.
Doing so reduces the number of calculations required to search for the optimal path (and, in the latter case, also the number of calculations required to generate the difference matrix), and thereby reduces the hardware resources required for dynamic time warping (for example, the storage capacity of the RAM 13 and the load on the control unit 11).
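The second variant in (7), in which distances are computed only for components whose frame numbers differ by at most a fixed amount, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the processing of step SA0130 itself: the three-way step pattern (diagonal, vertical, horizontal), the function names `euclidean` and `banded_dtw`, and the parameter `band` are hypothetical, and plain lists of feature vectors stand in for the normalized acoustic parameters.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def banded_dtw(user_frames, model_frames, band):
    """Dynamic time warping restricted to components with |i - j| <= band.

    Returns (total cost, path), where path is a list of (i, j) frame pairs.
    Only components inside the band are computed and stored.
    """
    n, m = len(user_frames), len(model_frames)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the end point")
    cost = {}  # accumulated cost, only for components inside the band
    back = {}  # predecessor of each component on its cheapest path
    for i in range(n):
        for j in range(max(0, i - band), min(m, i + band + 1)):
            d = euclidean(user_frames[i], model_frames[j])
            if i == 0 and j == 0:
                cost[(i, j)] = d
                continue
            best, prev = math.inf, None
            # Assumed step pattern: diagonal, vertical, horizontal.
            for p in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                c = cost.get(p, math.inf)  # outside the band counts as unreachable
                if c < best:
                    best, prev = c, p
            cost[(i, j)] = d + best
            back[(i, j)] = prev
    # Trace the optimal path back from the end point to the start point.
    path = [(n - 1, m - 1)]
    while path[-1] != (0, 0):
        path.append(back[path[-1]])
    path.reverse()
    return cost[(n - 1, m - 1)], path


user = [[0.0], [1.0], [2.0]]
model = [[0.0], [1.0], [1.0], [2.0]]
total, path = banded_dtw(user, model, band=1)
```

For these 3 and 4 frames with `band=1`, only 8 of the 12 components are evaluated; the saving grows with sequence length because the number of evaluated components scales with the band width rather than with the full matrix size.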

(8) In the embodiment described above, the control program that causes the control unit 11 to realize the functions characteristic of the music practice support device according to the present invention is written into the ROM 12 in advance. The control program may instead be recorded on a computer-readable recording medium such as a CD-ROM or DVD and distributed in that form, or it may, of course, be distributed by download over a telecommunication line such as the Internet. Installing a control program distributed in this way on a general-purpose computer device and operating that computer device in accordance with the control program gives the computer device the same functions as the dynamic time warping module according to the present invention.

A block diagram showing an example of the hardware configuration of the karaoke device 1 according to one embodiment of the present invention.
A block diagram showing an example of the functional configuration of the karaoke device 1.
A diagram for explaining the acoustic parameter extraction performed by the basic analysis module 21.
A diagram showing an example of the result of the dynamic time warping process performed by the DTW module 22.
A flowchart showing the flow of the scoring process performed by the karaoke device 1.
A diagram showing an example of the difference matrix.
A diagram showing an example of the difference matrix during the optimal path specifying process.
A diagram showing an example of the difference matrix during the optimal path specifying process.
A diagram showing an example of the difference matrix during the optimal path specifying process.
A diagram showing an example of the difference matrix during the optimal path specifying process.
A diagram showing an example of the optimal path specified by the optimal path specifying process.
A diagram for explaining the result of the dynamic time warping process.

Explanation of symbols

1 ... karaoke device, 11 ... control unit, 12 ... ROM, 13 ... RAM, 14 ... storage unit, 15 ... display unit, 16 ... operation unit, 17 ... microphone, 18 ... audio processing unit, 19 ... speaker, 21 ... basic analysis module, 211 ... pitch detection means, 212 ... volume detection means, 213 ... spectrum detection means, 214a, 214b, 214c ... differentiation means, 22 ... DTW module, 221 ... normalization means, 222 ... difference matrix generation means, 223 ... optimal path specifying means, 23 ... evaluation module.

Claims

A music practice support device comprising a dynamic time warping module, the dynamic time warping module having:
extraction means for analyzing a first audio signal representing the waveform of a singing sound or performance sound produced by a user, and extracting, for each predetermined time unit, a first acoustic parameter representing the degree of temporal change in volume, the spectrum, and the degree of temporal change in the spectrum;
acquisition means for acquiring a second acoustic parameter obtained, for each time unit, by analyzing a second audio signal representing the waveform of a singing sound or performance sound that serves as a model for the user, the second acoustic parameter representing the degree of temporal change in volume, the spectrum, and the degree of temporal change in the spectrum of the sound represented by the second audio signal in each time unit;
normalization means for applying, to the first acoustic parameter for each time unit, a normalization that converts its average value and standard deviation into predetermined values, and applying the same normalization to the second acoustic parameter for each time unit;
calculation means for calculating, for each lattice point whose coordinate values are time units on a coordinate plane having the time axis of the first audio signal as one coordinate axis and the time axis of the second audio signal as the other coordinate axis, an evaluation value computed from the difference between the normalized first acoustic parameter and the normalized second acoustic parameter;
specifying means for specifying, on the coordinate plane, the path from the start point, at which each coordinate value is at its minimum, to the end point, at which each coordinate value is at its maximum, along which the sum of the evaluation values at the lattice points on the path is smallest; and
association means for associating time units in the first audio signal with time units in the second audio signal along the path specified by the specifying means,
wherein the signal waveform of the first audio signal and the signal waveform of the second audio signal are compared for each pair of time units associated by the dynamic time warping module, and the degree of coincidence between the two is scored and output.

The music practice support device according to claim 1, wherein
the extraction means also extracts, for each time unit, the pitch of the sound represented by the first audio signal, and the acquisition means also acquires, for each time unit, the pitch of the sound represented by the second audio signal,
the normalization means also applies the normalization to the pitch of the first audio signal in each time unit and to the pitch of the second audio signal in each time unit, and
the calculation means, when calculating the evaluation value, multiplies the difference between the degree of temporal change in the spectrum of the first audio signal and the degree of temporal change in the spectrum of the second audio signal, in the time unit corresponding to the coordinate values of the lattice point with which the evaluation value is associated, by a coefficient that depends on the pitch value in that time unit.

The music practice support device according to claim 1, further comprising storage means storing one or more pairs, each consisting of a song identifier that uniquely identifies a song and the first audio signal representing the waveform of a model singing or model performance of that song, wherein
the acquisition means reads from the storage means the second audio data corresponding to the song identifier designated by the user, analyzes the second audio signal, and acquires the second acoustic parameter for each time unit.

A dynamic time warping module comprising:
extraction means for analyzing a first audio signal representing the waveform of a singing sound or performance sound produced by a user, and extracting, for each predetermined time unit, a first acoustic parameter representing the degree of temporal change in volume, the spectrum, and the degree of temporal change in the spectrum;
acquisition means for acquiring a second acoustic parameter obtained, for each time unit, by analyzing a second audio signal representing the waveform of a singing sound or performance sound that serves as a model for the user, the second acoustic parameter representing the degree of temporal change in volume, the spectrum, and the degree of temporal change in the spectrum of the sound represented by the second audio signal in each time unit;
normalization means for applying, to the first acoustic parameter for each time unit, a normalization that converts its average value and standard deviation into predetermined values, and applying the same normalization to the second acoustic parameter for each time unit;
calculation means for calculating, for each lattice point whose coordinate values are time units on a coordinate plane having the time axis of the first audio signal as one coordinate axis and the time axis of the second audio signal as the other coordinate axis, an evaluation value computed from the difference between the normalized first acoustic parameter and the normalized second acoustic parameter;
specifying means for specifying, on the coordinate plane, the path from the start point, at which each coordinate value is at its minimum, to the end point, at which each coordinate value is at its maximum, along which the sum of the evaluation values at the lattice points on the path is smallest; and
association means for associating time units in the first audio signal with time units in the second audio signal along the path specified by the specifying means.

A program causing a computer device to function as:
extraction means for analyzing a first audio signal representing the waveform of a singing sound or performance sound produced by a user, and extracting, for each predetermined time unit, a first acoustic parameter representing the degree of temporal change in volume, the spectrum, and the degree of temporal change in the spectrum;
acquisition means for acquiring a second acoustic parameter obtained, for each time unit, by analyzing a second audio signal representing the waveform of a singing sound or performance sound that serves as a model for the user, the second acoustic parameter representing the degree of temporal change in volume, the spectrum, and the degree of temporal change in the spectrum of the sound represented by the second audio signal in each time unit;
normalization means for applying, to the first acoustic parameter for each time unit, a normalization that converts its average value and standard deviation into predetermined values, and applying the same normalization to the second acoustic parameter for each time unit;
calculation means for calculating, for each lattice point whose coordinate values are time units on a coordinate plane having the time axis of the first audio signal as one coordinate axis and the time axis of the second audio signal as the other coordinate axis, an evaluation value computed from the difference between the normalized first acoustic parameter and the normalized second acoustic parameter;
specifying means for specifying, on the coordinate plane, the path from the start point, at which each coordinate value is at its minimum, to the end point, at which each coordinate value is at its maximum, along which the sum of the evaluation values at the lattice points on the path is smallest; and
association means for associating time units in the first audio signal with time units in the second audio signal along the path specified by the specifying means.