JP2010134475A

JP2010134475A - Singing rating system, and program for singing rating processing

Info

Publication number: JP2010134475A
Application number: JP2010007225A
Authority: JP
Inventors: Katsu Setoguchi; 克瀬戸口
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2010-01-15
Filing date: 2010-01-15
Publication date: 2010-06-17
Anticipated expiration: 2025-11-17
Also published as: JP4998565B2

Abstract

PROBLEM TO BE SOLVED: To evaluate singing ability accurately and promptly at low cost, without requiring a memory that stores a large number of evaluation values or arithmetic processing that averages those values after the singing ends.

SOLUTION: A CPU 1 sets a parameter t, which defines the allowable range around a reference value within which the singing ability under evaluation scores 100 points, and a parameter a, which defines the degree of singing ability outside that allowable range over the span from 100 to 0 points, and calculates the evaluation value of an input audio signal on the basis of the parameter t and the parameter a.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a singing scoring apparatus that scores the singing ability of an input audio signal, and to a program for singing scoring processing.

Karaoke scoring devices are known as devices that score the singing ability of an input audio signal.
For example, in the karaoke scoring device of one patent document, in order to evaluate how a singer sings the melody information given by MIDI messages, a first detection means detects pitch data and level data from the singer's voice, and a second detection means detects, from the MIDI messages, the note-on/off data, pitch data, and level data corresponding to the singing melody that the singer should voice. The detected pitch data and level data are compared individually by a pitch comparison means and a level comparison means, and data for evaluating the singing is created from the comparison results and the note-on/off data. (See Patent Document 1.)
The karaoke scoring device of another patent document aims to determine automatically, and more accurately, how closely the singer's voice input from a microphone approximates a model voice. As shown in the configuration of FIG. 1 of that patent document, the audio signal of the model singing reproduced from a laser disc 101 is supplied to a level detection unit A103 and a pitch detection unit A104, and the signal level data detected by the level detection unit A103 and the pitch data detected by the pitch detection unit A104 are stored in a buffer A102. The user's voice is input from a microphone 201, the audio signal is supplied to a level detection unit B203 and a pitch detection unit B204, and the signal level data detected by the level detection unit B203 and the pitch data detected by the pitch detection unit B204 are stored in a buffer B202. A comparison/determination unit 300 then reads the data stored for each system, judges the singing timing from the level signals and the pitch deviation from the pitch data, and outputs the comparison result as scoring data. (See Patent Document 2.)

Meanwhile, when analyzing an audio signal input from a microphone or the like, it is conventional to convert the digitized time-domain audio signal into a frequency-domain spectrum signal by a fast Fourier transform (FFT) or similar, and to analyze the resulting spectrum signal.
For example, the voice conversion apparatus of one patent document, which converts the pitch frequency with high quality by simple phase control processing in order to analyze an audio signal, includes: Fourier transform means that receives a digitized first audio signal and converts it into a spectrum signal by Fourier transform; selection means that receives the spectrum signal from the Fourier transform means and selectively outputs a sound source information signal from it; frequency conversion means that receives the sound source information signal from the selection means, converts its pitch frequency, and outputs the frequency-converted signal; inverse Fourier transform means that, in response to the spectrum envelope signal contained in the spectrum signal from the Fourier transform means and to the signal output from the frequency conversion means, receives the pitch-converted spectrum signal for each analysis frame and converts it into a second audio signal by inverse Fourier transform; and phase control means that receives the second audio signal from the inverse Fourier transform means and, in response to the conversion ratio of the pitch frequency, controls the phase of the second audio signal according to the shift width of the analysis frame. (See Patent Document 3.)

Patent Document 1: Japanese Patent Laid-Open No. 10-49183
Patent Document 2: Japanese Patent Laid-Open No. 11-224094
Patent Document 3: Japanese Patent No. 2753716

However, Patent Documents 1 and 2 evaluate singing ability by simply comparing a reference value (melody information or a model voice) with the singer's input voice, so the evaluation value fluctuates finely between the start and the end of a song. A memory that temporarily stores an enormous number of evaluation values is therefore required, which raises cost, and the arithmetic processing that averages the enormous number of evaluation values during or after the song, division in particular, places a heavy load on control means such as a CPU.
In addition, in Patent Documents 1 and 2, when an advanced singer's pitch fluctuates because of vibrato, the fluctuation is mistaken for pitch deviation and the singing ability is rated low.
Furthermore, Patent Documents 1 and 2 compare the pitch and level of the input audio signal with two systems of comparison means and detect pitch deviation and sounding-timing deviation to evaluate singing ability, so the apparatus becomes complicated.
Patent Document 3, for its part, directly detects and selects the sound source information signal, that is, the pitch of the fundamental tone (the sound at the fundamental frequency), from the Fourier-transformed spectrum signal. Because the level of a harmonic overtone can be higher than that of the fundamental, the pitch of the fundamental of the audio signal input from a microphone cannot be detected reliably.

A further object of the present invention is to detect reliably the pitch of the fundamental tone of the singer's input audio signal, and to evaluate singing ability fairly even when an advanced singer's pitch fluctuates because of vibrato.

The singing scoring apparatus according to claim 1 comprises: parameter setting means for setting a first parameter, which defines the allowable range of a reference value within which the singing ability under evaluation receives the maximum evaluation value, and a second parameter, which defines the degree of singing ability outside the allowable range over the span from the maximum evaluation value to the minimum evaluation value; signal analysis means for analyzing the number of errors in each frame, where one frame is a power-of-two number of samples of the input audio signal; and evaluation calculation means that assigns the minimum evaluation value to a frame when the number of errors analyzed by the signal analysis means exceeds one half of the frame, and that, when the number of errors does not exceed one half of the frame, calculates the evaluation value of the frame from a number of non-error samples corresponding to one half of the frame, on the basis of the first parameter and the second parameter set by the parameter setting means.

In the singing scoring apparatus of claim 1, as set forth in claim 2, the signal analysis means may analyze the pitch of the fundamental tone by detecting the greatest common divisor of at least two pitches in the input audio signal, and the evaluation calculation means may calculate the evaluation value of the pitch of the fundamental tone detected by the signal analysis means.

In the singing scoring apparatus of claim 2, as set forth in claim 3, the signal analysis means may calculate a phase from the frequency components of the input audio signal and use the calculated phase to detect the greatest common divisor of at least two pitches in that audio signal.

The program for singing scoring processing according to claim 4 causes a computer to execute: a step A of setting a first parameter, which defines the allowable range of a reference value within which the singing ability under evaluation receives the maximum evaluation value, and a second parameter, which defines the degree of singing ability outside the allowable range over the span from the maximum evaluation value to the minimum evaluation value; a step B of analyzing the number of errors in each frame, where one frame is a power-of-two number of samples of the input audio signal; and a step C of assigning the minimum evaluation value to a frame when the number of errors analyzed in step B exceeds one half of the frame, and, when the number of errors does not exceed one half of the frame, calculating the evaluation value of the frame from a number of non-error samples corresponding to one half of the frame, on the basis of the first parameter and the second parameter set in step A.

In the program for singing scoring processing of claim 4, as set forth in claim 5, step B may analyze the pitch of the fundamental tone by detecting the greatest common divisor of at least two pitches in the input audio signal, and step C may calculate the evaluation value of the pitch of the fundamental tone detected in step B.

In the program for singing scoring processing of claim 5, as set forth in claim 6, step B may calculate a phase from the frequency components of the input audio signal and use the calculated phase to detect the greatest common divisor of at least two pitches in that audio signal.

According to the present invention, the pitch of the fundamental tone of the singer's input audio signal is detected reliably, and singing ability is evaluated fairly even when an advanced singer's pitch fluctuates because of vibrato.

FIG. 1 is a block diagram showing the configuration of a karaoke apparatus in an embodiment to which the singing scoring apparatus of the present invention is applied.
FIG. 2 is a functional configuration diagram representing the signal processing functions of the CPU of FIG. 1 as hardware.
FIG. 3 is a diagram showing the configuration of the counters inside the CPU of FIG. 1 used for singing scoring processing.
FIG. 4 is a flowchart of the main routine of the CPU.
FIG. 5 is a flowchart of the switch processing in the main routine of FIG. 4.
FIG. 6 is a flowchart of the timer interrupt of the CPU.
FIG. 7 is a flowchart of the karaoke processing in the main routine of FIG. 4.
FIG. 8 is a flowchart of the pitch difference calculation processing in FIG. 7.
FIG. 9 is a flowchart of the pitch ratio calculation processing in FIG. 8.
FIG. 10 is a flowchart of the phase compensation processing in FIG. 9.
FIG. 11 is a flowchart of the difference average calculation processing in FIG. 7.
FIG. 12 is a flowchart of the section score calculation processing in FIG. 7.
FIG. 13 is a diagram showing specific examples of the parameters at the beginner, intermediate, and advanced levels.
FIG. 14 is a diagram showing the characteristics of the section score formula.
FIG. 15 is a diagram showing a screen that displays the section score and the pitch deviation.
FIG. 16 is a diagram showing the relationship between the utterance sections of the input audio signal and the singing sections to be evaluated.

An embodiment of the singing scoring apparatus of the present invention is described in detail below with reference to the drawings, taking a karaoke apparatus as an example.
FIG. 1 is a configuration diagram of the karaoke apparatus in the embodiment. In FIG. 1, a CPU 1 controls the entire apparatus and has a small-capacity ROM and RAM and a DSP (digital signal processor) function. A song memory 2, a switch unit 3, a ROM 4, a RAM 5, a display unit 6, an A/D converter 8, and a musical tone generation unit 9 are connected to one another via the system bus of the CPU 1, and data and commands are exchanged between the CPU 1 and each unit.

The song memory 2 stores a plurality of accompaniment songs and lyrics for karaoke. The switch unit 3 includes a song select switch, a start/stop switch, and various other switches. The ROM 4 stores the program for singing scoring processing executed by the CPU 1, various control data, and the like. The RAM 5 is the work area of the CPU 1 and holds various registers. The display unit 6 includes, for example, a liquid crystal display (LCD) and a plurality of LEDs. A microphone 7 is connected to the A/D converter 8 wirelessly or by wire; the A/D converter 8 performs A/D conversion of the analog audio signal input from the microphone 7 and outputs the resulting audio data. For example, the A/D conversion is performed at a sampling frequency of 8021 Hz with 16 bits. Hereinafter, the audio data obtained by this A/D conversion is called "original voice data" or "original waveform data" for convenience, and the voice input to the microphone 7 is called the "original voice". The musical tone generation unit 9 generates waveform data for sounding musical tones in accordance with instructions from the CPU 1. A D/A converter 10 performs D/A conversion of the waveform data generated by the musical tone generation unit 9 and outputs an analog audio signal. A sound system 11 emits that audio signal as sound.

FIG. 2 is a functional configuration diagram that represents, as hardware, the functions by which the CPU 1 processes the audio signal input from the microphone 7 of FIG. 1 to the A/D converter 8 until it is output from the D/A converter 10. In FIG. 2, an input buffer 21 temporarily stores the original voice data output by the A/D converter 8. A frame extraction unit 22 extracts frames, each a predetermined number of samples of audio data, by cutting them out of the original voice data stored in the input buffer 21. The frame size, that is, the number of audio data samples, is 256 for example. Accurate phase unwrapping requires that frames be extracted with overlap, so frames are cut out overlapping by an overlap factor OVL. OVL is set to 4, in which case the hop size is 64 samples (256 / 64 = 4). The pitch scaling value from the pitch of the original voice data (hereinafter the "original pitch") to the target pitch is assumed to lie in the range 0.5 to 2.0.
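The overlapped frame extraction described for the frame extraction unit 22 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `extract_frames` and the use of a Python list as the input buffer are assumptions, while the frame size of 256 and OVL of 4 come from the text.

```python
def extract_frames(samples, frame_size=256, ovl=4):
    """Cut overlapping frames out of a sample buffer (hypothetical helper).

    Consecutive frames start hop = frame_size // ovl samples apart,
    so with frame_size=256 and ovl=4 the hop size is 64.
    """
    hop = frame_size // ovl
    frames = []
    # Stop as soon as a full frame no longer fits in the buffer.
    for start in range(0, len(samples) - frame_size + 1, hop):
        frames.append(samples[start:start + frame_size])
    return frames


# 512 dummy samples yield frames starting at 0, 64, 128, 192 and 256.
samples = list(range(512))
frames = extract_frames(samples)
```

Each sample in the steady state therefore appears in OVL = 4 different frames, which is what the later phase-difference calculation between consecutive frames relies on.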

Each frame extracted by the frame extraction unit 22 is output to a low-pass filter (LPF) 23. The LPF 23 removes high-frequency components so that the pitch shift does not push frequency components above the Nyquist frequency. An FFT unit 24 executes a fast Fourier transform (FFT) on the frame output by the LPF 23. The FFT is executed with an FFT size (number of points) twice the frame size (256 × 2 = 512).

A phase compensation unit 25 takes the frequency components of the frequency channels obtained by the FFT and expands or contracts the frame size so as to compensate for the expansion or contraction that the pitch shift causes. For example, if the pitch scaling value is 2, the maximum of the assumed range, the pitch shift shrinks the frame to 1/2 of its size, so the frame is stretched to twice its size to compensate for (maintain) the size. This is why the FFT size is twice the frame size. The method of calculating the pitch scaling value is described in detail later.

The FFT unit 24 receives a 256-sample frame from the LPF 23 and places it in the first half of an FFT-size frame, setting the second half entirely to 0. The second half is set to 0 to obtain an interpolation effect in the frequency domain after the FFT is executed; because of that interpolation effect, the frequency resolution improves. The FFT unit 24 executes the FFT on the frame prepared in this way.
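The interpolation effect of zero padding can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a naive O(N²) DFT in place of a real FFT so that it stays self-contained, the helper names `dft` and `zero_pad` are hypothetical, and the demonstration uses an 8-sample frame padded to 16 points rather than the 256 and 512 of the text.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (stand-in for the FFT)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def zero_pad(frame, fft_size):
    """Place the frame in the first part of an FFT-size buffer, zeros after it."""
    return list(frame) + [0.0] * (fft_size - len(frame))


frame = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 0.0, 3.0]
plain = dft(frame)                                # 8 channels
padded = dft(zero_pad(frame, 2 * len(frame)))     # 16 channels

# Every even channel of the padded spectrum equals a channel of the
# unpadded one; the odd channels are interpolated values in between.
max_diff = max(abs(padded[2 * k] - plain[k]) for k in range(len(frame)))
```

Doubling the FFT size in this way halves the channel spacing without adding new signal information, which is the sense in which the text speaks of interpolation in the frequency domain.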

An IFFT unit 26 returns the frequency components of the frequency channels, after the phase compensation unit 25 has expanded or contracted the size, to time-domain data by performing an IFFT (inverse FFT), and generates and outputs one frame of audio data. A pitch shifter 27 interpolates or decimates the frame generated by the IFFT unit 26 according to the pitch scaling value received from the phase compensation unit 25, thereby shifting its pitch. Common choices such as Lagrange functions or the sinc function can be used for the interpolation and decimation, but this embodiment performs the pitch shift (pitch scaling) by Neville interpolation. The interpolation or decimation returns the frame to its original size (256 samples). The audio data of that frame is hereinafter called "synthesized voice data", and the voice sounded from it the "synthesized voice".
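As an illustration of resampling by Neville interpolation, the sketch below evaluates Neville's algorithm over a small window of neighbouring samples and reads the frame at a step of ρ. The text names only the interpolation method; the 4-point window, the helper names `neville` and `resample`, and the way the window is clamped at the frame edges are assumptions made for this example.

```python
def neville(xs, ys, x):
    """Evaluate at x the polynomial through (xs[i], ys[i]) by Neville's algorithm."""
    p = list(ys)
    n = len(xs)
    for level in range(1, n):
        for i in range(n - level):
            j = i + level
            p[i] = ((x - xs[j]) * p[i] + (xs[i] - x) * p[i + 1]) / (xs[i] - xs[j])
    return p[0]


def resample(frame, rho, points=4):
    """Read the frame at a step of rho samples (hypothetical pitch-scaling helper).

    rho > 1 decimates (shorter output), rho < 1 interpolates (longer output).
    The frame must hold at least `points` samples.
    """
    out = []
    for n in range(int(len(frame) / rho)):
        pos = n * rho
        # Window of `points` neighbouring samples, clamped to the frame.
        first = min(max(int(pos) - points // 2 + 1, 0), len(frame) - points)
        xs = list(range(first, first + points))
        ys = [frame[i] for i in xs]
        out.append(neville(xs, ys, pos))
    return out


# A quadratic is reproduced exactly by 4-point (cubic) interpolation.
frame = [float(i * i) for i in range(16)]
stretched = resample(frame, 0.5)   # 32 samples, value (0.5 * n) ** 2
shrunk = resample(frame, 2.0)      # 8 samples, value (2 * n) ** 2
```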

An output buffer 29 stores the synthesized voice data to be emitted as sound from the sound system 11. A frame addition unit 28 adds one frame of synthesized voice data received from the pitch shifter 27 to the synthesized voice data stored in the output buffer 29, overlapping it by the overlap factor OVL. The synthesized voice data stored in the output buffer 29 is output to the D/A converter 10 and D/A converted.
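The overlapped addition performed by the frame addition unit 28 can be sketched as a plain overlap-add. The function name `overlap_add` and the list-based output buffer are assumptions for illustration; windowing and gain normalization, which a real implementation would need, are deliberately left out.

```python
def overlap_add(frames, frame_size=256, ovl=4):
    """Sum frames into one buffer, each shifted by hop = frame_size // ovl."""
    if not frames:
        return []
    hop = frame_size // ovl
    out = [0.0] * (hop * (len(frames) - 1) + frame_size)
    for i, frame in enumerate(frames):
        start = i * hop
        for j, value in enumerate(frame):
            out[start + j] += value
    return out


# Five all-ones frames of 8 samples with OVL = 4 (hop size 2).
frames = [[1.0] * 8 for _ in range(5)]
mixed = overlap_add(frames, frame_size=8, ovl=4)
```

In the steady state each output sample is the sum of OVL frames, so in this toy example the middle of the buffer reaches 4.0 while the edges, covered by fewer frames, stay lower.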

The input buffer 21 and the output buffer 29 are, for example, areas reserved in the RAM 5. The units 22 to 28, that is, everything other than the A/D converter 8, the D/A converter 10, the input buffer 21, and the output buffer 29, are realized, for example, by the CPU 1 executing the program stored in the ROM 4 with the RAM 5 as its work area. Although a detailed description is omitted, the target pitch is specified, for example, by operating the keyboard 2. The target pitch may also be specified by performance data such as a standard MIDI file, or by data received over a communication network.

Next, the method by which the phase compensation unit 25 calculates the pitch scaling value is described in detail. The scaling value is denoted ρ below.
Executing the FFT extracts, for each frequency channel, a frequency component that has a real part and an imaginary part. Writing the real part as real and the imaginary part as img, the frequency amplitude mag and the phase phase of each frequency channel can be calculated as follows.

mag = (real^2 + img^2)^{1/2} ... (1)
phase = arctan(img / real) ... (2)
The phase calculated with arctan is confined to the range −π to π. The phase, however, is the integral of the angular velocity and therefore has to be unwrapped. To make it easy to tell the two apart, a wrapped phase is written with a lowercase θ and an unwrapped phase with an uppercase Θ. The true relation is
Θ_{k,t} = θ_{k,t} + 2nπ, n = 0, 1, 2, ... (3)
so the phase (= θ) has to be unwrapped by finding n. The subscripts k and t attached to Θ in equation (3) denote the index of the frequency channel and the time, respectively.
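Equations (1) and (2) can be sketched in a few lines. The function name `magnitude_and_phase` is hypothetical. Note one assumption: the sketch uses the four-quadrant `math.atan2`, because that is the form whose result spans −π to π as the text states, whereas a plain single-argument arctangent of img / real would span only −π/2 to π/2.

```python
import math


def magnitude_and_phase(real, img):
    """Frequency amplitude and wrapped phase of one frequency channel."""
    mag = math.sqrt(real * real + img * img)   # equation (1)
    phase = math.atan2(img, real)              # equation (2), four-quadrant form
    return mag, phase


mag, phase = magnitude_and_phase(3.0, 4.0)           # mag is 5.0
_, phase_negative_real = magnitude_and_phase(-1.0, 0.0)
```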

The unwrapping can be carried out by the following procedure.
First, the phase difference Δθ between frames is calculated as
Δθ_{i,k} = θ_{i,k} − θ_{i−1,k} ... (4)
where Δθ_{i,k} is the phase difference in frequency channel k of the original voice waveform between the immediately preceding frame and the current frame, and the subscript i denotes the frame: the current frame is i and the immediately preceding frame is i−1.

Δθ_{i,k} in equation (4) is in the wrapped state. Meanwhile, with the sampling frequency written as fs and the number of FFT points (the FFT size) as N, the center angular frequency Ω_{i,k} of frequency channel k is
Ω_{i,k} = (2π·fs)·k/N ... (5)
At that frequency Ω_{i,k}, if the time difference from the immediately preceding frame is Δt, the phase difference ΔZ_{i,k} can be calculated as
ΔZ_{i,k} = Ω_{i,k}·Δt ... (6)
The time difference Δt is
Δt = N/(fs·OVL) ... (7)
Because equation (6) is in the unwrapped state, it can be written as follows.

ΔZ_{i,k} = Δζ_{i,k} + 2nπ ... (8)
Let δ (= Δθ_{i,k} − Δζ_{i,k}) be the difference between the phase difference Δθ_{i,k} calculated by equation (4) and the phase difference Δζ_{i,k} in equation (8). Then
Δθ_{i,k} − Ω_{i,k}·Δt = (Δζ_{i,k} + δ) − (Δζ_{i,k} + 2nπ) = δ − 2nπ ... (9)
can be derived. Accordingly, δ can be calculated by removing the 2nπ on the right-hand side of equation (9) and confining the result to the range −π to π. This δ is the phase difference actually detected in the original voice waveform (hereinafter the "actual phase difference").

Adding the phase difference ΔZ_{i,k} (= Ω_{i,k}·Δt) to the actual phase difference δ calculated in this way gives the unwrapped phase difference ΔΘ_{i,k}:
ΔΘ_{i,k} = δ + Ω_{i,k}·Δt = δ + (Δζ_{i,k} + 2nπ) = Δθ_{i,k} + 2nπ ... (10)
From equations (5) and (7), Ω_{i,k}·Δt in equation (10) can be rewritten as follows.

Ω_{i,k}·Δt = {(2π·fs)/N}·k·{N/(fs·OVL)} = (2π/OVL)·k ... (11)
In a discrete Fourier transform (DFT), which includes the FFT, frequency components leak (spread) into all frequency channels except in the special case where the frequency of a component contained in the audio data (signal) is an integer multiple of the number of DFT points. Consequently, when analyzing the harmonic structure of a signal or the like, the frequency channels in which frequency components actually exist have to be detected from the DFT result.
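The unwrapping procedure of equations (4) through (11) can be sketched as below. This is an illustration under assumptions rather than the patent's code: the helper names `wrap_to_pi` and `unwrapped_phase_difference` are hypothetical, and the wrapped range is taken as the half-open interval from −π up to π.

```python
import math


def wrap_to_pi(x):
    """Fold an angle into the range -pi <= x < pi."""
    return (x + math.pi) % (2.0 * math.pi) - math.pi


def unwrapped_phase_difference(theta_prev, theta_curr, k, ovl=4):
    """Return (delta, unwrapped difference) for frequency channel k.

    theta_prev and theta_curr are the wrapped phases of channel k in the
    previous and current frames.
    """
    d_theta = theta_curr - theta_prev         # equation (4)
    expected = (2.0 * math.pi / ovl) * k      # equation (11): Omega * dt
    delta = wrap_to_pi(d_theta - expected)    # equation (9), 2n*pi removed
    return delta, delta + expected            # equation (10)


# Channel 10 whose true phase advance is 0.3 rad more than the channel centre.
k = 10
true_advance = (2.0 * math.pi / 4) * k + 0.3
theta_prev = 1.0
theta_curr = wrap_to_pi(theta_prev + true_advance)
delta, unwrapped = unwrapped_phase_difference(theta_prev, theta_curr, k)
```

Because equation (11) removes fs and N, the expected advance per frame depends only on the channel index and OVL, so no division by the sampling frequency is needed at run time.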

For that detection, the usual approach is to detect peaks in the frequency amplitude and regard each peak as a channel in which a frequency component exists. The procedurally simplest method regards as a peak any channel whose frequency amplitude is larger than the frequency amplitudes of the two channels immediately before and after it. With that method, however, a peak caused by a side lobe of the window function may be mistaken for a true peak. For this reason, it is also common to extract the channel with the minimum frequency amplitude among the channels between the detected peaks, and to regard a peak as correct if that minimum is no greater than a predetermined fraction of the peak's frequency amplitude (for example, −14 dB relative to the peak's frequency amplitude).
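The conventional two-stage peak search described above might look like the following sketch. It is one possible reading, not a definitive implementation: the function names are hypothetical, the −14 dB figure is treated as an amplitude ratio (10^(−14/20), about 0.2), and the choice to test the valley on both sides of each candidate is an assumption, since the text does not say which valley is compared.

```python
def local_peaks(mag):
    """Stage 1: channels larger than both immediate neighbours."""
    return [k for k in range(1, len(mag) - 1) if mag[k - 1] < mag[k] > mag[k + 1]]


def confirmed_peaks(mag, drop_db=-14.0):
    """Stage 2: keep peaks whose neighbouring valleys are deep enough."""
    ratio = 10.0 ** (drop_db / 20.0)   # amplitude ratio for drop_db
    peaks = local_peaks(mag)
    kept = []
    for idx, k in enumerate(peaks):
        neighbours = peaks[max(idx - 1, 0):idx] + peaks[idx + 1:idx + 2]
        # Minimum amplitude between this peak and each adjacent peak.
        valleys = [min(mag[min(k, j):max(k, j) + 1]) for j in neighbours]
        if all(v <= mag[k] * ratio for v in valleys):
            kept.append(k)
    return kept


# Two strong peaks (channels 1 and 6) with a weak side-lobe-like bump at 3.
mag = [0.0, 1.0, 0.1, 0.3, 0.25, 0.1, 1.0, 0.0]
candidates = local_peaks(mag)      # [1, 3, 6]
accepted = confirmed_peaks(mag)    # [1, 6]
```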

Such peak detection finds peaks more accurately, but it requires a two-stage search and is cumbersome to process. In this embodiment, therefore, to reduce the processing load, peak detection is not performed; instead, the frequency channels in which the frequency components of the harmonics of the original voice exist are detected by taking phase into account, as follows.

The relationship between the unwrapped phase difference and frequency can be drawn as a graph whose vertical axis is the phase difference and whose horizontal axis is the frequency. The phase difference calculated from the center frequency of each channel, that is, ΔZ_{i,k} calculated by equation (6), appears as a straight line. The curve plotted along that straight line represents the phase difference ΔΘ_{i,k} calculated by equation (10) for a voice with a harmonic structure, that is, a voiced sound. This ΔΘ_{i,k} covers the first 128 points of the 512 FFT points.

In a voice with a harmonic structure, the curve becomes stepped (flat) near each frequency channel that contains a harmonic frequency component of the voice, because the frequency component of that channel leaks into neighboring channels. A harmonic frequency component is therefore considered to exist in the frequency channel that contains the point where a stepped portion of the curve intersects the straight line. The frequency channel at that intersection (hereinafter called a "harmonic channel") can be calculated from equations (10) and (6), but the processing is somewhat cumbersome. In this embodiment, therefore, harmonic channels are detected using the actual phase difference δ of equation (9).

As described above, the actual phase difference δ is the difference between Δθ_{i,k} of equation (4) and Δζ_{i,k} of equation (8). δ grows with distance from the channel in which the frequency component actually exists and shrinks as that channel is approached. It crosses zero at that channel, and beyond it in the direction of increasing frequency its absolute value grows on the negative side with distance from the channel.

Harmonic channels can be found by detecting the points where the actual phase difference δ crosses zero. However, a zero crossing from positive to negative also occurs where adjacent harmonics overlap. In this embodiment, therefore, the frequency channel with index k that meets the following condition (hereinafter called the "zero-cross determination condition") is adopted as a harmonic channel in which a harmonic frequency component exists. The frequency channel with index k is the one closest to the zero-cross point.
δ[k−2] > δ[k−1] > δ[k] > δ[k+1] > δ[k+2]
Searching for a frequency channel k that satisfies this condition extracts, with high accuracy, the frequency channel closest to a point where δ makes a large zero crossing from positive to negative. The extraction can be performed reliably even when the number of FFT points is insufficient and harmonic channels are hard to extract from the frequency amplitude. Where still higher accuracy is needed, peak detection may be performed in addition.
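The zero-cross determination condition can be sketched in Python as follows. This is only an illustration: the function name `find_harmonic_channels`, the list input `delta`, and the extra check that channel k is the one closest to a positive-to-negative crossing are assumptions of this sketch, not details stated in the description.

```python
def find_harmonic_channels(delta, max_count=2):
    """Return up to max_count channel indices k, lowest frequency first.

    delta is a sequence holding the actual phase difference of each
    frequency channel (hypothetical input layout).
    """
    found = []
    for k in range(2, len(delta) - 2):
        window = delta[k - 2:k + 3]
        # Zero-cross determination condition:
        # delta[k-2] > delta[k-1] > delta[k] > delta[k+1] > delta[k+2]
        decreasing = all(a > b for a, b in zip(window, window[1:]))
        # Assumption of this sketch: the sign changes from positive to
        # negative around k, and k is the channel nearest the crossing.
        crosses = delta[k - 1] > 0 > delta[k + 1]
        nearest = abs(delta[k]) <= min(abs(delta[k - 1]), abs(delta[k + 1]))
        if decreasing and crosses and nearest:
            found.append(k)
            if len(found) == max_count:
                break
    return found
```

Requiring five strictly decreasing values rejects short sign changes, so only wide positive-to-negative transitions are reported.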

In this embodiment, the two frequency (harmonic) channels k that satisfy this condition are detected starting from the lowest frequency, because the influence of errors tends to grow and accuracy to fall as frequency rises. The indices of the harmonic channels detected in this way are denoted hm1 and hm2 in ascending order of frequency. Hereinafter, hm1 is also called the reference index, and the harmonic channel with reference index hm1 is also called the reference channel. The phase difference ΔΘ_{i,k} (k = hm1, hm2) of each harmonic channel is calculated by equation (10), that is, by adding Ω_{i,k}·Δt calculated by equation (11) to the actual phase difference δ of that channel.

The pitch scaling value ρ is calculated from the harmonic channel detection result as follows.
First, the greatest common divisor of the frequencies corresponding to the indices hm1 and hm2 of the two detected harmonic channels is obtained. It can be calculated with the Euclidean algorithm. The greatest common divisor gcd(x, y) of two non-negative integers x and y is obtained by recursively applying
gcd(x, y) = x (when y = 0), gcd(y, x mod y) (when y ≠ 0) ... (12)
Here "x mod y" in equation (12) denotes the remainder of x divided by y. The greatest common divisor gcd(x, y) may also be calculated by another method.

In this embodiment, the original voice is assumed to be a human voice. The lower limit of the frequency that the original voice can take is therefore set to 80 Hz, and the lower limit of the index value to 6, which corresponds to that frequency. Accordingly, the condition y = 0 in equation (12) is replaced by y < 6. The greatest common divisor calculated in this way is denoted x.
The greatest common divisor x can be obtained regardless of whether the frequency channel corresponding to the pitch (fundamental tone) was extracted as a harmonic channel. It can therefore be obtained reliably even for a musical tone with a so-called missing fundamental, in which the fundamental frequency is absent or very weak compared with the other frequencies.
After the greatest common divisor x is calculated, the multiple hmx, which is the ratio of the frequency corresponding to the reference index hm1 to the greatest common divisor x, is calculated by
hmx = hm1 / x ... (13)
The multiple hmx obtained in this way corresponds to the frequency of the reference channel divided by the fundamental frequency (the frequency of the fundamental tone, or pitch).
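As a rough sketch, the Euclidean algorithm with the lower bound of 6 and the multiple of equation (13) might look like this in Python. The function names are hypothetical, and the sketch applies the remainder steps to the channel indices directly, whereas the description speaks of the frequencies corresponding to those indices. That simplification is an assumption.

```python
def fundamental_index(hm1, hm2, min_index=6):
    """Euclidean algorithm of equation (12), stopping when y < min_index
    instead of y == 0, as the description specifies for human voice."""
    h1, h2 = hm1, hm2
    while h2 >= min_index:
        h1, h2 = h2, h1 % h2  # one step: gcd(x, y) -> gcd(y, x mod y)
    return h1  # the value the description denotes x


def harmonic_multiple(hm1, hm2, min_index=6):
    """Multiple hmx of equation (13): reference index divided by x."""
    return hm1 / fundamental_index(hm1, hm2, min_index)
```

With hm1 = 12 and hm2 = 18 the loop ends with x = 6, so hmx = 2: the reference channel is taken to hold the second harmonic even though no channel at index 6 was detected, which is the missing-fundamental case mentioned above.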

The unwrapped phase difference ΔΘ_d of the target pitch is multiplied by the multiple hmx obtained from equation (13). With fd [Hz] denoting the fundamental frequency of the target pitch, the product is
ΔΘ_d · hmx = 2π·fd·Δt·hmx = (2π·fd·hmx·N) / (fs·OVL) ... (14)
The pitch scaling value ρ for converting the pitch of the original voice to the target pitch is then
ρ = ΔΘ_d · hmx / ΔΘ_{i,hm1} ... (15)
The phase compensation unit 25 in FIG. 2 calculates the scaling value ρ in this way and outputs it to the pitch shifter 27. The pitch shifter 27 then performs pitch scaling with ρ and shifts the pitch.
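Equations (14) and (15) can be illustrated with a short Python sketch. The function name and argument names are hypothetical. The relation Δt = N / (fs·OVL) is read off from the two forms of equation (14), and the numeric values in the usage note are illustrative assumptions, not values given in the description.

```python
import math


def pitch_scaling_value(fd, hmx, dtheta_hm1, n, fs, ovl):
    """Scaling value rho of equation (15).

    fd         -- fundamental frequency of the target pitch [Hz]
    hmx        -- multiple from equation (13)
    dtheta_hm1 -- unwrapped phase difference of the reference channel
    n, fs, ovl -- N, fs and OVL as they appear in equation (14)
    """
    dt = n / (fs * ovl)                      # delta-t implied by equation (14)
    target = 2.0 * math.pi * fd * dt * hmx   # equation (14)
    return target / dtheta_hm1               # equation (15)
```

For example, assuming N = 256, fs = 8000 and OVL = 4, a reference channel holding the second harmonic of a 200 Hz voice and a target pitch of 220 Hz give ρ of about 1.1.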

The phase compensation unit 25 also scales the phase by the following equation.
θ′_{i,k} = ΔΘ_{i,k} · ((θ′_{i−1,hm1} − θ_{i−1,hm1}) / ΔΘ_{i,hm1} + (ρ − 1)) + θ_{i,k} ... (16)
In equation (16), quantities obtained by scaling are marked with a prime (′). Scaling by equation (16) preserves both the consistency of phase along the time axis (HPC: Horizontal Phase Coherence) and the phase relationship between channels, that is, between frequency components (VPC: Vertical Phase Coherence) (see Japanese Patent Application No. 2004-374090).

From the phase phase′ after scaling by equation (16) and the frequency amplitude mag calculated from equation (1), the phase compensation unit 25 calculates the real component real′ and the imaginary component img′ by Euler's formula, as follows, converting them into a complex frequency component.
real′ = mag · cos(phase′) ... (17)
img′ = mag · sin(phase′) ... (18)
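Equations (17) and (18) amount to a polar-to-rectangular conversion. A minimal Python sketch, with a hypothetical function name, is:

```python
import math


def to_complex(mag, phase_scaled):
    """Rebuild a complex frequency component from the amplitude mag and
    the scaled phase, following equations (17) and (18)."""
    real = mag * math.cos(phase_scaled)  # equation (17)
    img = mag * math.sin(phase_scaled)   # equation (18)
    return complex(real, img)
```

The amplitude is left untouched, so only the phase of each channel changes before the inverse transform.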

The IFFT unit 26 receives the frequency components converted in this way from the phase compensation unit 25 for each frequency channel and executes the IFFT to return them to time-domain data. The pitch shifter 27 performs pitch scaling on the frames generated by the IFFT unit 26 by interpolation or decimation, according to the pitch scaling value ρ received from the phase compensation unit 25. This stretches or shrinks the amount of data to 1/ρ, but because the phase compensation unit 25 has applied phase scaling by a factor of ρ (equation (16)), the change is canceled and the amount of data keeps its original size. The frame addition unit 28 overlap-adds the frames obtained in this way, so that synthesized voice with the target pitch is emitted by the sound system 11.

Next, the operation of the karaoke apparatus of FIG. 1 is described in detail with reference to the counters shown in FIG. 3, the flowcharts of the singing scoring program executed by the CPU 1 shown in FIGS. 4 to 12, and FIGS. 13 to 16.
In executing the singing scoring program, one frame consists of 256 (= 2^8) samples, a power of 2, and singing ability is scored while detecting the pitch difference between the note-on pitch of the accompaniment and the pitch of the singer's voice signal from the microphone 7. Specifically, once karaoke starts, the difference between the pitch of the singer's voice signal and the pitch that should be sung (the reference value) is detected every 8 msec, and the detected pitch difference data is accumulated in the input buffer 21 of FIG. 2. From the accumulated data, an average pitch difference, the mean of 32 accumulated pitch differences, is calculated every 256 msec. The accumulated average pitch difference data is then scored for each section of about 4 sec (4096 msec).

For this purpose, as shown in FIG. 3, seven counters CNTA to CNTG, prepared in the RAM inside the CPU 1, are used for the singing scoring process. CNTA counts the number of times a pitch difference has been accumulated. CNTB counts the number of pitch difference errors. CNTC counts the number of pitch difference calculations. CNTD counts the number of times an average pitch difference has been accumulated. CNTE counts the number of average pitch difference errors. CNTF counts the number of average pitch difference calculations. CNTG counts the number of times a section score has been accumulated.

FIG. 4 is a flowchart of the main routine of the CPU 1.
First, when the power is turned on, an initialization process is executed (step SA1). After step SA1, the loop from step SA2 to step SA4 is repeated: a switch process that responds to user operations on the switches of the switch unit 3 is executed (step SA2), a karaoke process is executed (step SA3), and other processes such as sound generation, effect processing, and volume adjustment are executed (step SA4).

FIG. 5 is a flowchart of the switch process of step SA2 in the main routine. It is determined whether the song select switch has been turned on (step SB1). If it has, the song number of the selected karaoke song is stored in the register SONG (step SB2). The accompaniment of that song number is then retrieved from the song memory 2 (step SB3), and the singing section where the lyrics begin is detected and a time is set (step SB4). Specifically, the time from the start of the accompaniment to the start of the singing section, that is, the length of the intro, is stored in a register. The process then returns to the main routine of FIG. 4. Once the accompaniment starts, the set time is decremented at each timer interrupt, described later.

If the song select switch is not on in step SB1, it is determined whether the start/stop switch has been turned on (step SB5). If it has, the flag STF is inverted (step SB6), and it is determined whether STF has changed to 1 (song start) or to 0 (song stop) (step SB7). When STF has changed to 1, the prohibition of the timer interrupt is lifted (step SB8). When STF has changed to 0, the timer interrupt is prohibited (step SB9). After the timer interrupt has been enabled or prohibited, the process returns to the main routine of FIG. 4.

If the start/stop switch is not on in step SB5, it is determined whether another switch has been turned on (step SB10), for example an effect switch that adds echo or reverb, or a volume switch that adjusts the volume. If another switch has been turned on, the process corresponding to that switch is performed (step SB11), and the process returns to the main routine of FIG. 4.

FIG. 6 is a flowchart of the timer interrupt. It is determined whether the current time falls within a singing section in which the singer should be singing (step SD1). If it does, it is determined whether the flag KF is 0 (step SD2), and if KF is 0, KF is set to 1 (step SD3). After the prohibition of the timer interrupt is lifted in step SB8 of FIG. 5, timer interrupts are accepted at fixed intervals. Consequently, once the song has started, the intro time stored in the register is decremented at each timer interrupt, as described above, and KF is set to 1 when the first singing section is reached. If the current time is not within a singing section in step SD1, it is determined whether KF is 1 (step SD4). If KF is 1, the singing section has ended, so KF is reset to 0 (step SD5), and the process returns to the main routine of FIG. 4. Thereafter, KF is set to 1 each time a singing section begins. If KF is 0 in step SD4, the intro time has not yet elapsed, so the process returns to the main routine.

After KF is set to 1 in step SD3, or when KF is already 1 in step SD2, the value of the timer register T (initial value 0) is incremented (step SD6). It is then determined whether the value of T has reached 8 msec (step SD7), and if it has, the flag TF is set to 1 (step SD8). After TF is set to 1, or if the value of T has not reached 8 msec in step SD7, the process returns to the main routine of FIG. 4.
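The flag handling of steps SD1 to SD8 can be sketched as a small state machine. This is a hedged illustration only: the class and function names are hypothetical, the sketch assumes one interrupt per millisecond, and it resets T when the 8 msec mark is reached, which the description does not state.

```python
class TimerState:
    def __init__(self):
        self.kf = 0  # KF: 1 while inside a singing section
        self.t = 0   # T: timer register, initial value 0
        self.tf = 0  # TF: 1 once 8 msec have elapsed


def timer_interrupt(state, in_singing_section, ticks_per_8ms=8):
    """One timer interrupt (tick period of 1 msec is an assumption)."""
    if not in_singing_section:        # SD1: not a singing section
        if state.kf == 1:             # SD4
            state.kf = 0              # SD5: the singing section has ended
        return
    if state.kf == 0:                 # SD2
        state.kf = 1                  # SD3
    state.t += 1                      # SD6
    if state.t >= ticks_per_8ms:      # SD7
        state.tf = 1                  # SD8
        state.t = 0                   # assumption: restart the 8 msec count
```

The main routine would then poll TF and clear it after processing, as step SE10 does.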

FIG. 7 is a flowchart of the karaoke process of step SA3 in the main routine. First, it is determined whether KF is 1 (singing section) (step SC1). If KF is 0, the process returns to the main routine. If KF is 1, the process of calculating the pitch difference is executed (step SC2).
FIG. 8 is a flowchart of the process of calculating the pitch difference. It is determined whether the flag TF, which indicates that 8 msec have elapsed, is 1 (step SE1). If TF is 0, this flowchart ends. If TF is 1, the pitch ratio calculation process, performed every 8 msec, is executed (step SE2).

FIG. 9 is a flowchart of the pitch ratio calculation process. First, it is determined whether it is the sampling timing at which original voice data is output from the A/D converter 8 (step SF1). If so, the determination is YES, the original voice data is written to the input buffer 21 in the RAM 5 (step SF2), and it is determined whether it is the frame extraction timing (step SF3). If the time needed to sample one hop size of original voice data has elapsed since the previous frame extraction timing, the determination is YES. One frame of original voice data is then extracted from the input buffer 21, and LPF (low-pass filter) processing, which removes high-frequency components, and the FFT (fast Fourier transform) are applied to the extracted frame in that order (step SF4).

Next, the phase compensation process is executed on the frequency components of each channel obtained by the FFT (step SF5). After the phase compensation process, the IFFT (inverse fast Fourier transform) is applied to the phase-compensated frequency components of each channel, the pitch is shifted by time-scaling the one frame of voice data obtained from the IFFT, and the synthesized voice data obtained by the pitch shift is overlap-added to the synthesized voice data stored in the output buffer 29 in the RAM 5 (step SF6).
The functions of the frame extraction unit 22, the LPF 23, and the FFT unit 24 shown in FIG. 2 could be realized in hardware, but in this embodiment they are realized by executing the process of step SF4. Similarly, the function of the phase compensation unit 25 is realized by executing the phase compensation process of step SF5, and the functions of the IFFT unit 26, the pitch shifter 27, and the frame addition unit 28 are realized by executing the process of step SF6.

Next, it is determined whether it is time to output one sample of synthesized voice data (step SF7). If so, the determination is YES, and the synthesized voice data to be output is read from the output buffer 29 and sent to the D/A converter 10 via the musical tone generation unit 9 (step SF8). This flowchart then ends. The musical tone generation unit 9 has a function of mixing the waveform data of the musical tones it generates internally with the data it receives.

FIG. 10 is a flowchart of the phase compensation process of step SF5 in the pitch ratio calculation process of FIG. 9. First, the frequency amplitude mag and the phase phase (= θ) are calculated from the frequency component of each frequency channel by equations (1) and (2) (step SG1). Next, calculation of the unwrapped phase difference ΔΘ_{i,k} by equations (4) to (10) is started (step SG2), and just before equation (10), at the point where the actual phase difference δ has been calculated, two harmonic channels are detected from δ (step SG3). It is then determined whether two or more harmonic channels were detected (step SG4). If so, the phase difference ΔΘ_{i,k} of each frequency channel is calculated by equation (10) to complete the phase unwrapping (step SG5). Next, the scaling value calculation process, which calculates the scaling value ρ by equations (12) to (15), is executed for the two detected harmonic channels (step SG6).

In the scaling value calculation process of step SG6, shown in the dotted frame, the frequencies corresponding to the index values hm1 and hm2 of the two detected harmonic channels are assigned to the variables h1 and h2, respectively (step SG10). The variables h1 and h2 correspond to x and y in equation (12). It is then determined whether the index value corresponding to the value of h2 is less than 6 (step SG11). If it is 6 or more, the remainder of h1 divided by h2 is assigned to a variable t, the value of h2 is assigned to h1, and the value of t is assigned to h2 (step SG12), after which the determination of step SG11 is made again. In other words, equation (12) is applied until the index value corresponding to h2 falls below 6, at which point h1 holds the greatest common divisor of the frequencies corresponding to hm1 and hm2. When the index value corresponding to h2 has fallen below 6, the frequency corresponding to hm1 divided by the value of h1 (the greatest common divisor) is assigned to the variable hmx according to equation (13) (step SG13). Next, the phase difference ΔΘ_d is multiplied by the value of hmx according to equation (14), and the scaling value ρ is calculated from the product according to equation (15) (step SG14).

Although two harmonic channels are extracted here, three or more may be extracted. When peak detection is also performed, two or more harmonic channels may be selected, in consideration of the magnitude of the frequency amplitude, from among the harmonic channels extracted on the basis of the actual phase difference.
The formants also move with the pitch shift, so the larger the shift amount (scaling value ρ), the less natural the synthesized voice becomes. To avoid this, formant compensation may be performed as well.

Because the pitch shift to the target pitch can be achieved without extracting the fundamental frequency of the original voice, the fundamental frequency is not extracted. It can, however, be extracted using the multiple hmx. Denoting the fundamental frequency by fi, it is calculated, using equation (7), by
fi = ΔΘ_{i,hm1} / (2π·Δt·hmx) = (ΔΘ_{i,hm1}·fs·OVL) / (2π·N·hmx) ... (19)
When the target pitch is specified as a frequency, the scaling value ρ may be obtained by calculating the fundamental frequency fi and then taking its ratio to the frequency of the target pitch. The calculated fundamental frequency fi may also be reported to the user through the display unit 6 or the like. A different method may be adopted for generating the synthesized voice waveform.
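Equation (19) can be evaluated directly, as in the following Python sketch. The function name is hypothetical, and the numeric values in the usage note are illustrative assumptions rather than values from the description.

```python
import math


def fundamental_frequency(dtheta_hm1, hmx, n, fs, ovl):
    """Fundamental frequency fi [Hz] of the original voice, equation (19)."""
    return (dtheta_hm1 * fs * ovl) / (2.0 * math.pi * n * hmx)
```

For instance, assuming N = 256, fs = 8000 and OVL = 4, a reference channel whose unwrapped phase difference corresponds to 400 Hz with hmx = 2 yields fi of about 200 Hz. Dividing a target frequency by this fi gives the same ρ as equation (15).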

After the scaling value is calculated in step SG6, the phase scaling of equation (16) is performed using the phase difference ΔΘ_{i,k} (step SG7). Next, the real component real′ (equation (17)) and the imaginary component img′ (equation (18)) are calculated from the phase phase′ and the frequency amplitude mag calculated from equation (1), and converted into a complex frequency component (step SG8). If fewer than two harmonic channels were detected in step SG4, an error is determined and the error flag is activated (step SG9). After the complex conversion of step SG8, or after the error determination of step SG9, the phase compensation process ends.

After the pitch ratio calculation process of FIG. 9, the process moves to step SE3 of the pitch difference calculation process of FIG. 8. There, the result of the determination in step SG4 of FIG. 10 as to whether two or more harmonic channels were detected is consulted, and the process branches according to whether two or more were detected and no error was determined, or fewer than two were detected and an error was determined. When two or more channels were detected and no error was determined, the value obtained by subtracting 1.0 from the calculated pitch ratio is stored as the pitch difference (step SE4).

The calculated pitch ratio is the ratio between the pitch of the voice signal input from the microphone 7 (the input voice pitch) and the pitch of the reference value (the reference pitch), so when the two pitches match, the pitch ratio (input voice pitch) / (reference pitch) is 1.0. The value obtained by subtracting 1.0 from the pitch ratio is therefore positive when the input voice pitch is higher than the reference pitch and negative when it is lower, so the pitch difference carries a sign.

After the pitch difference is calculated in step SE4, it is determined whether the count value of CNTA, the number of times the pitch difference has been accumulated, is less than 16 or is 16 or more (step SE5). If it is less than 16, the pitch difference is stored in the buffer and accumulated (step SE6), and the count value of CNTA is incremented (step SE7). Ordinarily, the average of the pitch difference data would be calculated every 256 msec, that is, each time 32 pitch differences have been accumulated. In this embodiment, however, to reduce the load on the CPU 1, the number of frames is set to a power of 2, and the average is calculated each time 16 pitch differences, up to half the number of frames, have been accumulated.

When it is determined in step SE3 that the number of channels in which the fundamental tone exists is one or fewer and an error has occurred, the counter CNTB, which counts pitch difference errors, is incremented (step SE8). After CNTB is incremented, or after CNTA is incremented in step SE7, the counter CNTC, which counts pitch difference calculations, is incremented (step SE9). Once CNTA has reached 16 in step SE5, CNTA is no longer incremented and only CNTC is incremented (step SE9). As a result, up to half of the non-error pitch difference data is discarded, but at the extremely short interval of 8 msec, losing half of the pitch difference data has no significant effect. Next, the flag TF is reset to 0 (step SE10), and the process proceeds to the difference average calculation process of step SC3 in FIG. 7.
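
The per-frame bookkeeping of steps SE3 to SE9 can be sketched as follows. The counter names CNTA, CNTB, and CNTC come from the text; the class name, the method name, and the representation of pitch differences as fixed-point integers are assumptions made for illustration.

```python
from dataclasses import dataclass

MAX_ACCUMULATED = 16  # half of the 32 frames in one 256 msec window


@dataclass
class FrameAccumulator:
    """Hypothetical container for the counters used every 8 msec."""
    total: int = 0  # running sum of accumulated pitch differences
    cnta: int = 0   # pitch differences accumulated (never exceeds 16)
    cntb: int = 0   # frames judged to be errors
    cntc: int = 0   # frames processed in total

    def add_frame(self, pitch_diff: int, is_error: bool) -> None:
        if is_error:
            self.cntb += 1             # step SE8
        elif self.cnta < MAX_ACCUMULATED:
            self.total += pitch_diff   # step SE6
            self.cnta += 1             # step SE7
        # A valid frame arriving after 16 accumulations is simply discarded.
        self.cntc += 1                 # step SE9


acc = FrameAccumulator()
for _ in range(20):
    acc.add_frame(3, is_error=False)   # only the first 16 are accumulated
for _ in range(2):
    acc.add_frame(0, is_error=True)
```

Capping the accumulation at 16 keeps the divisor a fixed power of 2, which is what allows the later average to be taken with a shift.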

FIG. 11 is a flowchart of the difference average calculation process executed every 256 msec.
First, it is determined whether the counter CNTC, which counts the pitch difference calculations performed every 8 msec, has reached 32, the number of frames (step SH1). Once CNTC has reached 32, it is determined whether CNTB is greater than 16 (step SH2), that is, whether pitch difference errors account for more than half the number of frames. If CNTB is 16 or less, the accumulated value is shifted right to calculate the average pitch difference (step SH3). Because the number of frames, 32, is a power of 2, half the number of frames, 16, is also a power of 2. The average of the 16 accumulated pitch differences is therefore calculated by a 4-bit right shift instead of a division. This lightens the averaging computation on the CPU 1 and avoids bottlenecks in sound generation processing such as dropouts in the sound, referred to as a "performance stumble".
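
As a rough illustration of steps SH2 and SH3, the sketch below averages 16 accumulated differences with a 4-bit right shift. The function name, the constants, and the use of `None` to signal an error window are assumptions; the text only states that a shift replaces the division.

```python
from typing import Optional

FRAMES_PER_WINDOW = 32                 # 32 frames of 8 msec = 256 msec
MAX_ERRORS = FRAMES_PER_WINDOW // 2    # more than 16 errors marks the window as an error
SHIFT_BITS = 4                         # 2**4 == 16 accumulated differences


def window_average(accumulated: int, cntb: int) -> Optional[int]:
    """Return the mean of 16 accumulated differences, or None for an error window."""
    if cntb > MAX_ERRORS:              # step SH2
        return None
    return accumulated >> SHIFT_BITS   # step SH3: divide by 16 via shift
```

With at most 16 error frames out of 32, at least 16 valid frames remain, so the accumulator always holds exactly 16 values when the shift is applied. In Python the shift on a negative integer rounds toward negative infinity, matching `// 16`; in C the result of right-shifting a negative signed value is implementation-defined, so a port would need to check that behavior.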

After the shift process in step SH3, it is determined whether the counter CNTD, which counts the accumulated average pitch differences, is less than 8 (step SH4). As described above, the average pitch differences obtained every 256 msec are scored in sections of 4096 msec, so one section contains 16 (= 4096/256) average pitch differences. Here too, however, scoring uses only eight average pitch differences to reduce the load on the CPU 1. When CNTD is less than 8, the average pitch difference is accumulated (step SH5) and CNTD is incremented (step SH6).
In step SH2, if CNTB is greater than 16, that is, if the number of pitch difference errors exceeds half of the 32 frames, the counter CNTE, which counts average pitch difference errors, is incremented (step SH7), and the error value is taken as the average value of the section (step SH8).

After CNTD is incremented in step SH6, after CNTE is incremented in step SH8, or after CNTD has reached 8 in step SH4, the counter CNTF, which counts average pitch difference calculations, is incremented (step SH9). The process then proceeds to step SC4 of the karaoke process of FIG. 7 to display the pitch deviation, and then to step SC5 to execute the section score calculation process.

FIG. 12 is a flowchart of the section score calculation process executed about every 4 sec.
First, it is determined whether CNTF, the number of average pitch difference calculations performed every 256 msec, has reached 16 (step SJ1), that is, whether the count has reached the maximum for one section of about 4 sec (4096 msec). If CNTF is less than 16, this flowchart ends. When CNTF has reached 16, it is determined whether the counter CNTE of average pitch difference errors is greater than 8, half the number of average pitch difference calculations (the allowable number of errors) (step SJ2). If CNTE is 8 or less, the accumulated average pitch differences are shifted right to calculate their average (step SJ3). As shown in step SH4 of FIG. 11, the number of accumulated values, CNTD, is 8, so the average of the eight average pitch differences is calculated by a 3-bit right shift.

Next, parameters are set according to the singer's level: beginner, intermediate, or advanced (step SJ4). The parameters are the parameter t, which defines the allowable range around the reference value within which singing ability receives the maximum evaluation value of 100 points, and the parameter a, which defines how singing ability outside the allowable range is graded between 100 points and 0 points. After the parameters are set, the section score is calculated (step SJ5). If CNTE exceeds 8 in step SJ2, exceeding the allowable number of errors, the section score is set to 0 (step SJ6). After the section score is calculated in step SJ5 or set to 0 in step SJ6, the section score is accumulated (step SJ7). Then the counter CNTG, which counts accumulated section scores, is incremented (step SJ8), the pitch deviation is displayed (step SJ9), and the process returns to the karaoke process of FIG. 7.

In the calculation of the section score in step SJ5, let x be the average pitch difference from the reference value and grade be the section score. The section score is then given by formula (20).

The absolute value of x is taken first because the average pitch difference obtained in step SE5 of FIG. 8 is signed, and the calculation is to be performed only in the region of positive values.
The semitone, the smallest unit of pitch, is 100 cents, and one octave is 1200 cents. If diff_t [cents] denotes the pitch range within which the score is 100 points, the parameter t is given by formula (21).

If diff_a [cents] denotes the pitch difference at which the score becomes 0 points, the parameter a is given by formula (22).
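
Formulas (20) to (22) themselves do not appear in this text, so the following is only a sketch of one scoring rule consistent with the description: a flat 100 points up to t, a fall to 0 points at a, and the absolute value of x taken first. The linear slope, the cents-to-ratio conversion 2^(cents/1200) - 1, and all function names are assumptions, not the patent's stated formulas. The example values of 40 and 240 cents (beginner) and 20 and 120 cents (advanced) are the ones the description gives for FIG. 13.

```python
def cents_to_ratio_offset(cents: float) -> float:
    """Assumed conversion: a deviation in cents expressed as (pitch ratio - 1.0)."""
    return 2.0 ** (cents / 1200.0) - 1.0


def section_score(x: float, diff_t_cents: float, diff_a_cents: float) -> float:
    """Trapezoid-shaped score: 100 inside t, 0 beyond a, linear in between (assumed)."""
    if diff_a_cents <= diff_t_cents:
        raise ValueError("diff_a must be larger than diff_t")
    t = cents_to_ratio_offset(diff_t_cents)
    a = cents_to_ratio_offset(diff_a_cents)
    d = abs(x)                 # work only in the positive region
    if d <= t:
        return 100.0
    if d >= a:
        return 0.0
    return 100.0 * (a - d) / (a - t)


# A 30-cent deviation scored at the beginner and advanced settings.
x = cents_to_ratio_offset(30.0)
beginner = section_score(x, 40.0, 240.0)   # inside the 40-cent allowance
advanced = section_score(x, 20.0, 120.0)   # outside the 20-cent allowance
```

Under these assumptions the same deviation earns full marks at the beginner setting and a reduced score at the advanced setting, which is the behavior the description attributes to the two parameters.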

FIG. 13 shows specific examples of the parameter t and the parameter a for the beginner, intermediate, and advanced levels. As shown in FIG. 13, a beginner singer still scores 100 points with a pitch difference of 40 cents from the reference value, whereas an advanced singer scores 100 points only if the pitch difference is within 20 cents. Likewise, a beginner singer's score falls to 0 points at a pitch difference of 240 cents, whereas an advanced singer's score falls to 0 points at 120 cents. For an intermediate singer, the parameter t for 100 points and the parameter a for 0 points lie roughly midway between the beginner and advanced values.

FIG. 14 shows the characteristics of formulas (20) to (22), the formulas for the section score. In FIG. 14, the horizontal axis represents the pitch difference and the vertical axis the section score. As shown in FIG. 14, the section score plotted against the pitch difference forms a trapezoid. The top side of the trapezoid defines the allowable range of pitch difference for which singing ability scores 100 points, and the sloped sides define how singing ability outside the allowable range is graded between 100 points and 0 points. For a beginner singer, the top side is longer and the slope gentler; for an advanced singer, the top side is shorter and the slope steeper; for an intermediate singer, the trapezoid is intermediate between the two.

In the karaoke process of FIG. 7, after the section score is calculated in step SC5, the section score is displayed (step SC6). FIG. 15 shows a screen that displays the section score and the pitch deviation. As shown in FIG. 15, the pitch deviation is indicated by three LEDs of different emission colors; the LED corresponding to the current state (pitch high, pitch matched, or pitch low) is lit. After the section score is displayed in step SC6, it is determined whether the karaoke has stopped (step SC7). If it has not, the process returns to step SC2 and repeats the processing up to step SC6.

When the karaoke song ends or the start/stop switch is turned on, and it is determined in step SC7 that the karaoke has stopped, the total score is calculated (step SC8). The total score is obtained by dividing the accumulated section scores from step SJ7 of FIG. 12 by the final value of CNTG incremented in step SJ8, that is, by the number of accumulated section scores. When the total score is calculated, the karaoke performance has stopped and the CPU 1 is free of sound generation processing, so the total score is calculated by division rather than by a shift. After the total score is calculated, it is displayed (step SC9), the flag KF is reset to 0 (step SC10), the flag STF is reset to 0 (step SC11), and the process returns to the main routine of FIG. 4.
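
The running total described in steps SJ7, SJ8, and SC8 can be sketched as below. Only the accumulated sum and the counter CNTG are kept while the song plays, which is why no per-section history is needed. The class and method names are hypothetical, and returning 0.0 when no section has been scored is an assumption the text does not address.

```python
class ScoreTally:
    """Hypothetical running tally of section scores."""

    def __init__(self) -> None:
        self.score_sum = 0.0   # accumulated section scores (step SJ7)
        self.cntg = 0          # number of accumulated section scores (step SJ8)

    def add_section(self, section_score: float) -> None:
        self.score_sum += section_score
        self.cntg += 1

    def total(self) -> float:
        # Step SC8: a plain division, performed once after playback has stopped.
        if self.cntg == 0:
            return 0.0         # assumed behavior for an empty tally
        return self.score_sum / self.cntg


tally = ScoreTally()
for score in (100.0, 0.0, 50.0):
    tally.add_section(score)
final_score = tally.total()
```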

FIG. 16 shows the relationship between the utterance sections of the audio signal input from the microphone 7 and the singing sections to be evaluated. As shown in FIG. 16, when an utterance is made before the accompaniment reaches a singing section, that utterance is not judged and does not count toward the score. Conversely, when no utterance is made during a singing section, the score is 0 points.
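
The relationship in FIG. 16 amounts to a small decision rule, sketched here. The enum, the function name, and the two boolean inputs are assumptions made for illustration; how the singing section and the presence of voice are detected is outside this sketch.

```python
from enum import Enum


class Judgement(Enum):
    IGNORED = "ignored"   # outside a singing section: not judged
    ZERO = "zero"         # singing section but no utterance: 0 points
    SCORED = "scored"     # singing section with utterance: scored normally


def judge(in_singing_section: bool, voiced: bool) -> Judgement:
    """Classify one interval according to the rule described for FIG. 16."""
    if not in_singing_section:
        return Judgement.IGNORED
    if not voiced:
        return Judgement.ZERO
    return Judgement.SCORED
```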

As described above, according to the embodiment, the CPU 1 sets the parameter t, which defines the allowable range around the reference value within which the singing ability to be evaluated scores 100 points, and the parameter a, which defines how singing ability outside the allowable range is graded between 100 points and 0 points, and calculates the evaluation value of the input audio signal based on the parameter t and the parameter a. In this case, the CPU 1 is configured to calculate evaluation values for both the pitch and the utterance timing of the audio signal, but it may be configured to calculate only one of them.
Therefore, singing ability can be evaluated quickly, accurately, and at low cost, without requiring a memory for storing an enormous number of evaluation values or processing to average an enormous number of evaluation values after the song ends.

Further, according to the embodiment, the CPU 1 calculates the difference between the pitch of the input audio signal and the reference value, accumulates the signed differences, and calculates the evaluation value of the pitch of the audio signal based on the parameter t and the parameter a.
Therefore, even when an advanced singer's pitch fluctuates because of vibrato, the fluctuation is not regarded as a pitch deviation, and singing ability can be evaluated fairly.

Further, according to the embodiment, the CPU 1 searches for the singing sections of the input accompaniment, calculates the evaluation value of the audio signal input within a singing section, and excludes from evaluation any audio signal input outside a singing section.
Therefore, singing ability can be scored accurately.

Further, according to the embodiment, the CPU 1 treats a power-of-2 number of samples of the input audio signal as one frame and analyzes the number of errors in each frame. When the number of errors exceeds one half of the frame, the frame is given 0 points; when it does not, the evaluation value of the frame is calculated from non-error samples numbering one half of the frame.
Therefore, by calculating the average pitch difference with a shift instead of a division, a heavy load on the CPU 1 is avoided. As a result, bottlenecks in sound generation processing such as dropouts can be avoided.

Further, according to the embodiment, the CPU 1 detects the greatest common divisor of at least two pitches in the input audio signal to analyze the pitch of the fundamental tone, and calculates the evaluation value of the detected fundamental pitch. In this case, the CPU 1 calculates a phase from the frequency components of the input audio signal and uses the calculated phase to detect the greatest common divisor of at least two pitches in the audio signal.
Therefore, the pitch of the fundamental tone of the singer's input audio signal can be reliably detected.

Furthermore, a harmonic has a frequency that is an integer multiple of the frequency of the fundamental tone (pitch). The greatest common divisor of the frequencies corresponding to two or more frequency channels in which harmonic components exist (harmonic channels) can therefore be treated as information representing the frequency of the fundamental tone. For this reason, as shown in the scaling value calculation process of FIG. 10, the greatest common divisor of two or more frequency channels can be used to generate hm2, a second speech waveform in which the fundamental tone of hm1, the first speech waveform, has been converted (shifted) with high accuracy to a target fundamental tone. Because the fundamental tone of the first speech waveform need not be extracted (detected), a second speech waveform having the target fundamental tone can be reliably generated even from a first speech waveform whose fundamental frequency is missing or very weak relative to other frequencies, a condition known as a missing fundamental. Using the greatest common divisor also allows the frequency of the fundamental tone of the first speech waveform to be reliably extracted (detected).
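
The greatest-common-divisor idea can be illustrated with integer frequencies. This is a simplified sketch: it assumes the harmonic frequencies are already known as exact integers in Hz, whereas the embodiment derives its values from phase information in the process of FIG. 10, which is not reproduced here. The function name is hypothetical.

```python
from functools import reduce
from math import gcd
from typing import Sequence


def fundamental_from_harmonics(harmonic_hz: Sequence[int]) -> int:
    """Estimate the fundamental as the GCD of two or more harmonic frequencies."""
    if len(harmonic_hz) < 2:
        raise ValueError("at least two harmonic frequencies are required")
    return reduce(gcd, harmonic_hz)


# Harmonics at 440, 660, and 880 Hz imply a 220 Hz fundamental,
# even though no component at 220 Hz is present (a missing fundamental).
estimate = fundamental_from_harmonics([440, 660, 880])
```

Real measurements are not exact integer multiples, so a practical detector would need some tolerance; that refinement is beyond what this sketch shows.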

Next, modifications of the above embodiment will be described.
In the above embodiment, scoring is performed over one octave of 1200 cents, and notes of the same name in different octaves are not taken into account. However, the octave difference may be detected and the difference folded back into a range within one octave before scoring.
Also, when the total score is calculated, bonus points may be added for beginner or intermediate singers. For example, the highest section score is retained, and when the total score is calculated, the retained highest score is multiplied by a coefficient corresponding to the beginner or intermediate level and added to the total score.
In the above embodiment, the number of data items accumulated for the average pitch difference and for section scoring is half the number of frames. It need not be half, as long as it is a power-of-2 fraction such as one quarter or one eighth, so that the average can be calculated by a shift instead of a division. The larger the denominator, the more data is discarded, so the fraction should be one at which scoring remains reliable. In general, singing ability varies very little over a short section such as 4 sec, regardless of the singer's level, so scoring remains reliable even with one eighth of the data or less.
In the above embodiment, singing ability is scored by calculating the pitch difference, but it may instead be scored by calculating the difference in utterance timing.
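
The octave-folding modification is described only in outline, so the following is one possible reading rather than the method of the embodiment: a difference expressed in cents is wrapped into the range from -600 up to (but not including) 600 cents, so that singing the right note name in another octave yields a small difference. The function name and the choice of range are assumptions.

```python
def fold_into_octave(diff_cents: float) -> float:
    """Wrap a pitch difference in cents into [-600, 600) (assumed convention)."""
    return (diff_cents + 600.0) % 1200.0 - 600.0


# One octave high plus 50 cents sharp folds to 50 cents.
folded = fold_into_octave(1250.0)
```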

In the above embodiment, the invention has been described as an apparatus in which the CPU 1 executes a singing scoring program stored in advance in the ROM 4. However, a singing scoring program stored on an external storage medium such as a flexible disk (FD), CD, or memory card, or downloaded from a network such as the Internet, may be installed in the RAM 5 or in a separately provided non-volatile memory such as a flash ROM and executed by the CPU 1. In that case, the invention can be realized as a program and as a storage medium.

That is, the singing scoring program of the present invention causes a computer to execute:
step A of setting a first parameter that defines an allowable range of a reference value within which the singing ability to be evaluated receives the maximum evaluation value, and a second parameter that defines the degree of singing ability outside the allowable range between the maximum evaluation value and the minimum evaluation value; and step B of calculating the evaluation value of an input audio signal based on the first parameter and the second parameter set in step A.

Step B calculates an evaluation value for at least one of the pitch and the utterance timing of the input audio signal.
In this case, step B calculates the difference between the pitch of the input audio signal and the reference value, accumulates the signed differences, and calculates the evaluation value of the pitch of the audio signal based on the first parameter and the second parameter that have been set.

The program further includes step C of searching for the singing sections of the input accompaniment. Step B calculates the evaluation value of the audio signal input within a singing section found in step C and excludes from evaluation any audio signal input outside a singing section.

The program further includes step D of treating a power-of-2 number of samples of the input audio signal as one frame and analyzing the number of errors in each frame. In step B, when the number of errors analyzed in step D exceeds one half of the frame, the frame is given the minimum evaluation value; when it does not, the evaluation value of the frame is calculated from non-error samples numbering one half of the frame.

Step D detects the greatest common divisor of at least two pitches in the input audio signal to analyze the pitch of the fundamental tone, and step B calculates the evaluation value of the fundamental pitch detected in step D.

Step D calculates a phase from the frequency components of the input audio signal and uses the calculated phase to detect the greatest common divisor of at least two pitches in the audio signal.

1 CPU
2 Song memory
3 Switch unit
4 ROM
5 RAM
6 Display unit
7 Microphone
8 A/D converter
9 Musical sound generation unit
10 D/A converter
11 Sound system

Claims

1. A singing scoring device comprising:
parameter setting means for setting a first parameter that defines an allowable range of a reference value within which the singing ability to be evaluated receives a maximum evaluation value, and a second parameter that defines the degree of singing ability outside the allowable range between the maximum evaluation value and a minimum evaluation value;
signal analysis means for analyzing the number of errors in each frame, one frame being a power-of-2 number of samples of an input audio signal; and
evaluation calculation means for setting a frame to the minimum evaluation value when the number of errors analyzed by the signal analysis means exceeds one half of one frame, and, when the number of errors analyzed does not exceed one half of one frame, calculating the evaluation value of the frame from non-error samples numbering one half of one frame, based on the first parameter and the second parameter set by the parameter setting means.

2. The singing scoring device according to claim 1, wherein the signal analysis means detects the greatest common divisor of at least two pitches in the input audio signal to analyze the pitch of the fundamental tone, and
the evaluation calculation means calculates the evaluation value of the pitch of the fundamental tone detected by the signal analysis means.

3. The singing scoring device according to claim 2, wherein the signal analysis means calculates a phase from a frequency component of the input audio signal and uses the calculated phase to detect the greatest common divisor of at least two pitches in the audio signal.

4. A singing scoring program that causes a computer to execute:
step A of setting a first parameter that defines an allowable range of a reference value within which the singing ability to be evaluated receives a maximum evaluation value, and a second parameter that defines the degree of singing ability outside the allowable range between the maximum evaluation value and a minimum evaluation value;
step B of analyzing the number of errors in each frame, one frame being a power-of-2 number of samples of an input audio signal; and
step C of setting a frame to the minimum evaluation value when the number of errors analyzed in step B exceeds one half of one frame, and, when the number of errors analyzed does not exceed one half of one frame, calculating the evaluation value of the frame from non-error samples numbering one half of one frame, based on the first parameter and the second parameter set in step A.

5. The singing scoring program according to claim 4, wherein step B detects the greatest common divisor of at least two pitches in the input audio signal to analyze the pitch of the fundamental tone, and step C calculates the evaluation value of the pitch of the fundamental tone detected in step B.

6. The singing scoring program according to claim 5, wherein step B calculates a phase from a frequency component of the input audio signal and uses the calculated phase to detect the greatest common divisor of at least two pitches in the audio signal.